Usage
======

.. _usage:

Quick start
~~~~~~~~~~~

There are 6 commands in PLM-interact package
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- `inference_PPI`: PPI prediction.
- `train_mlm`: Training PPI models using mask and binary classification losses.
- `train_binary`: Training PPI models using only binary classification loss.
- `predict_ddp`: Choose the best trained checkpoints by testing on the validation datasets and evaluate the model's performance on the test datasets.
- `mutation_train`:Fine-tuning in the binary mutation effect task.
- `mutation_predict`: Inference in the binary mutation effect task.

Predict a list of PPIs
^^^^^^^^^^^^^^^^^^^^^^

To predict a list of PPIs, you can download pre-trained models from `Hugging Face <https://huggingface.co/danliu1226>`_.
Protein sequence pair should be listed as follows:
Required Input:

   (--test_filepath): A CSV file with the following two columns: 'query': The sequence of protein 1, and 'text': The sequence of protein 2.

   (--resume_from_checkpoint): the trained model that can be downloaded from `Hugging Face <https://huggingface.co/danliu1226>`_.

   (--output_filepath): a path to save the results.

.. code-block:: bash

    torchrun --nproc_per_node=1 -m PLMinteract inference_PPI --seed 2 --batch_size_val 1 --test_filepath [a list of paired protein sequences] --resume_from_checkpoint [traiend model] --output_filepath $output_filepath --offline_model_path $offline_model_path --model_name esm2_t12_35M_UR50D --embedding_size 480 --max_length [length threshold of the paired protein] 


Training PPI models 
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    torchrun --nproc_per_node=1 -m PLMinteract train_mlm --epochs [total of training epochs] --seed 2 --data 'human_V11' --task_name '1vs10' --batch_size_train 1 --train_filepath [paired protein sequences for train] --model_name esm2_t12_35M_UR50D --embedding_size 480 --warmup_steps 10 --gradient_accumulation_steps 32 --max_length [length threshold of the paired protein] --weight_loss_mlm 1 --weight_loss_class 10 --offline_model_path [Path to a locally stored ESM-2 model] --output_filepath $output_filepath

.. code-block:: bash

    torchrun --nproc_per_node=1 -m PLMinteract train_binary --epochs [total of training epochs] --seed 2 --data 'human_V11' --task_name 'binary' --batch_size_train 2 --batch_size_val 32 --train_filepath [paired protein sequences for train] --dev_filepath [paired protein sequences for validation] --test_filepath [paired protein sequences for test] --model_name 'esm2_t12_35M_UR50D' --embedding_size 480 --warmup_steps 2000 --gradient_accumulation_steps 1 --max_length [length threshold of the paired protein] --offline_model_path [Path to a locally stored ESM-2 model] --evaluation_steps [evaluation steps] --sub_samples [subsamples of evaluation] --output_filepath $output_filepath 

Validation and test of trained models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    torchrun --nproc_per_node=1 -m PLMinteract predict_ddp --epochs [total of training epochs] --seed 2 --batch_size_val 1 --dev_filepath [paired protein sequences for validation] --test_filepath [paired protein sequences for test] --resume_from_checkpoint [the path of checkpints] --model_name esm2_t12_35M_UR50D --embedding_size 480 --max_length [length threshold of the paired protein] --offline_model_path [Path to a locally stored ESM-2 model] --output_filepath $output_filepath 

Fine-tuning in the binary mutation effect task.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    torchrun --nproc_per_node=1 -m PLMinteract mutation_train --epochs [total of training epochs] --seed 2 --task_name $task_name --batch_size_train 1 --batch_size_val 1 --train_filepath [paired protein sequences for train] --dev_filepath [paired protein sequences for validation]  --warmup_steps 2000 --resume_from_checkpoint [the path of checkpints] --model_name esm2_t33_650M_UR50D --embedding_size 1280 --max_length [length threshold of the paired protein] --gradient_accumulation_steps 1 --offline_model_path [Path to a locally stored ESM-2 model] --output_path $output 

Inference in the binary mutation effect task.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    torchrun --nproc_per_node=1 -m PLMinteract mutation_predict --seed 2 --batch_size_val 1 --test_filepath $[paired protein sequences for test] --resume_from_checkpoint [traiend model] --model_name esm2_t33_650M_UR50D --embedding_size 1280 --max_length [length threshold of the paired protein] --offline_model_path [Path to a locally stored ESM-2 model] --output_path $output 


PPI prediction
~~~~~~~~~~~~~~~

.. code-block:: bash

   usage: PLMinteract inference_PPI [-h] [--seed SEED] --test_filepath TEST_FILEPATH --output_filepath OUTPUT_FILEPATH
                                 --resume_from_checkpoint RESUME_FROM_CHECKPOINT [--batch_size_val BATCH_SIZE_VAL]
                                 [--max_length MAX_LENGTH] --offline_model_path OFFLINE_MODEL_PATH --model_name MODEL_NAME
                                 --embedding_size EMBEDDING_SIZE

   PPI prediction.

   options:
   -h, --help            show this help message and exit
   --seed SEED           Random seed for reproducibility (default: 2).
   --test_filepath TEST_FILEPATH
                           Path to the test dataset (CSV format).
   --output_filepath OUTPUT_FILEPATH
                           Path to save the prediction results.
   --resume_from_checkpoint RESUME_FROM_CHECKPOINT
                           Path to a trained model (default: None).
   --batch_size_val BATCH_SIZE_VAL
                           The validation batch size on each device (default: 16).
   --max_length MAX_LENGTH
                           Maximum sequence length for tokenizing paired proteins (default: 1603).
   --offline_model_path OFFLINE_MODEL_PATH
                           Path to a locally stored ESM-2 model.
   --model_name MODEL_NAME
                           Choose the ESM-2 model to load (esm2_t12_35M_UR50D / esm2_t33_650M_UR50D).
   --embedding_size EMBEDDING_SIZE
                           Set embedding vector size based on the selected ESM-2 model (480 / 1280).


Training PPI models using mask and binary classification losses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   usage: PLMinteract train_mlm [-h] [--seed SEED] [--data DATA] [--task_name TASK_NAME] --train_filepath TRAIN_FILEPATH
                              --output_filepath OUTPUT_FILEPATH [--epochs EPOCHS] [--resume_from_checkpoint RESUME_FROM_CHECKPOINT]
                              [--warmup_steps WARMUP_STEPS] [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
                              [--weight_loss_mlm WEIGHT_LOSS_MLM] [--weight_loss_class WEIGHT_LOSS_CLASS] [--max_length MAX_LENGTH]
                              [--batch_size_train BATCH_SIZE_TRAIN] --offline_model_path OFFLINE_MODEL_PATH --model_name MODEL_NAME
                              --embedding_size EMBEDDING_SIZE

   Training PPI models using mask and binary classification losses.

   options:
   -h, --help            show this help message and exit
   --seed SEED           Random seed for reproducibility (default: 2).
   --data DATA           Set the dataset name (e.g., cross_species)(default: "").
   --task_name TASK_NAME
                           Set the task name (e.g., 1vs10, 1vs1)(default: "").

   Input data and path of output results:
   --train_filepath TRAIN_FILEPATH
                           Path to the training dataset (CSV format).
   --output_filepath OUTPUT_FILEPATH
                           Path to save trained model checkpoints and training results.

   PLM-interact setting:
   --epochs EPOCHS       Total number of training epochs (default: 10)
   --resume_from_checkpoint RESUME_FROM_CHECKPOINT
                           Path to a checkpoint file for resuming training from a previous run.
   --warmup_steps WARMUP_STEPS
                           Number of warmup steps for the learning rate scheduler (default: 2000).
   --gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
                           Number of steps to accumulate gradients before performing an optimizer step (default: 8).
   --weight_loss_mlm WEIGHT_LOSS_MLM
                           Weight applied to the masked language modelling (MLM) loss (default: 1).
   --weight_loss_class WEIGHT_LOSS_CLASS
                           Weight applied to the classification loss (default: 10).
   --max_length MAX_LENGTH
                           Maximum sequence length for tokenizing paired proteins (default: 1603).
   --batch_size_train BATCH_SIZE_TRAIN
                           The training batch size on each device (default: 16).

   ESM2 model loading:
   --offline_model_path OFFLINE_MODEL_PATH
                           Path to a locally stored ESM-2 model.
   --model_name MODEL_NAME
                           Choose the ESM-2 model to load (esm2_t12_35M_UR50D / esm2_t33_650M_UR50D).
   --embedding_size EMBEDDING_SIZE
                           Set embedding vector size based on the selected ESM-2 model (480 / 1280).


Training PPI models using only binary classification loss.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

      usage: PLMinteract train_binary [-h] [--seed SEED] [--data DATA] [--task_name TASK_NAME] --train_filepath TRAIN_FILEPATH
                                    --dev_filepath DEV_FILEPATH --test_filepath TEST_FILEPATH --output_filepath OUTPUT_FILEPATH
                                    [--epochs EPOCHS] [--resume_from_checkpoint RESUME_FROM_CHECKPOINT] [--warmup_steps WARMUP_STEPS]
                                    [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS] [--evaluation_steps EVALUATION_STEPS]
                                    [--sub_samples SUB_SAMPLES] [--max_length MAX_LENGTH] [--batch_size_train BATCH_SIZE_TRAIN]
                                    [--batch_size_val BATCH_SIZE_VAL] --offline_model_path OFFLINE_MODEL_PATH --model_name MODEL_NAME
                                    --embedding_size EMBEDDING_SIZE

      Fine-tuning in the binary mutation effect task

      options:
      -h, --help            show this help message and exit
      --seed SEED           Random seed for reproducibility (default: 2).
      --data DATA           Set the dataset name (e.g., cross_species)(default: "").
      --task_name TASK_NAME
                              Set the task name (e.g., binary)(default: "").

      Input data and path of output results:
      --train_filepath TRAIN_FILEPATH
                              Path to the training dataset (CSV format).
      --dev_filepath DEV_FILEPATH
                              Path to the validation dataset (CSV format).
      --test_filepath TEST_FILEPATH
                              Path to the test dataset (CSV format).
      --output_filepath OUTPUT_FILEPATH
                              Path to save trained model checkpoints and training results.

      PLM-interact setting:
      --epochs EPOCHS       Total number of training epochs (default: 10).
      --resume_from_checkpoint RESUME_FROM_CHECKPOINT
                              Path to a checkpoint file for resuming training from a previous run.
      --warmup_steps WARMUP_STEPS
                              Number of warmup steps for the learning rate scheduler (default: 2000).
      --gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
                              Number of steps to accumulate gradients before performing an optimizer step (default: 8).
      --evaluation_steps EVALUATION_STEPS
                              Perform evaluation every N steps during training (default: 5000).
      --sub_samples SUB_SAMPLES
                              Number of subsamples to use for evaluation (default: 128).
      --max_length MAX_LENGTH
                              Maximum sequence length for tokenizing paired proteins (default: 1603).
      --batch_size_train BATCH_SIZE_TRAIN
                              The training batch size on each device (default: 16).
      --batch_size_val BATCH_SIZE_VAL
                              The validation batch size on each device (default: 16).

      ESM2 model loading:
      --offline_model_path OFFLINE_MODEL_PATH
                              Path to a locally stored ESM-2 model.
      --model_name MODEL_NAME
                              Choose the ESM-2 model to load (esm2_t12_35M_UR50D / esm2_t33_650M_UR50D).
      --embedding_size EMBEDDING_SIZE
                              Set embedding vector size based on the selected ESM-2 model (480 / 1280).

Evaluation and test with multi-nodes and multi-GPUs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   usage: PLMinteract predict_ddp [-h] [--seed SEED] --dev_filepath DEV_FILEPATH --test_filepath TEST_FILEPATH --output_filepath
                                 OUTPUT_FILEPATH [--epochs EPOCHS] [--resume_from_checkpoint RESUME_FROM_CHECKPOINT]
                                 [--batch_size_val BATCH_SIZE_VAL] [--max_length MAX_LENGTH] --offline_model_path OFFLINE_MODEL_PATH
                                 --model_name MODEL_NAME --embedding_size EMBEDDING_SIZE

   Choose the best trained checkpoints by testing on the validation datasets and evaluate the model's performance on the test
   datasets.

   options:
   -h, --help            show this help message and exit
   --seed SEED           Random seed for reproducibility (default: 2).

   Input data and output results:
   --dev_filepath DEV_FILEPATH
                           Path to the validation dataset (CSV format).
   --test_filepath TEST_FILEPATH
                           Path to the test dataset (CSV format).
   --output_filepath OUTPUT_FILEPATH
                           Path to save validation and test results.

   PLM-interact setting:
   --epochs EPOCHS       Total epochs of trained models (default: 10).
   --resume_from_checkpoint RESUME_FROM_CHECKPOINT
                           Path to trained models(default: None).
   --batch_size_val BATCH_SIZE_VAL
                           The validation batch size on each device (default: 16)
   --max_length MAX_LENGTH
                           Maximum sequence length for tokenizing paired proteins (default: 1603).

   ESM2 model loading:
   --offline_model_path OFFLINE_MODEL_PATH
                           Path to a locally stored ESM-2 model.
   --model_name MODEL_NAME
                           Choose the ESM-2 model to load (esm2_t12_35M_UR50D / esm2_t33_650M_UR50D).
   --embedding_size EMBEDDING_SIZE
                           Set embedding vector size based on the selected ESM-2 model (480 / 1280).


Fine-tuning in the binary mutation effect task.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   usage: PLMinteract mutation_train [-h] [--seed SEED] [--task_name TASK_NAME] --train_filepath TRAIN_FILEPATH --dev_filepath
                                  DEV_FILEPATH --output_path OUTPUT_PATH [--epochs EPOCHS]
                                  [--resume_from_checkpoint RESUME_FROM_CHECKPOINT] [--warmup_steps WARMUP_STEPS]
                                  [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS] [--weight_loss_mlm WEIGHT_LOSS_MLM]
                                  [--weight_loss_class WEIGHT_LOSS_CLASS] [--max_length MAX_LENGTH]
                                  [--batch_size_train BATCH_SIZE_TRAIN] [--batch_size_val BATCH_SIZE_VAL] --offline_model_path
                                  OFFLINE_MODEL_PATH --model_name MODEL_NAME --embedding_size EMBEDDING_SIZE

   Predict mutant effects in human PPIs.

   options:
   -h, --help            show this help message and exit
   --seed SEED           Random seed for reproducibility. (default: 2)
   --task_name TASK_NAME
                           Set the task name (e.g., mutation_effects_training)(default: "")

   Input data and path of output results:
   --train_filepath TRAIN_FILEPATH
                           Path to the training dataset (CSV format)
   --dev_filepath DEV_FILEPATH
                           Path to the validation dataset (CSV format)
   --output_path OUTPUT_PATH
                           Path to save trained model checkpoints and training results

   PLM-interact setting:
   --epochs EPOCHS       Total number of training epochs (default: 50).
   --resume_from_checkpoint RESUME_FROM_CHECKPOINT
                           Path to a checkpoint file for resuming training from a previous run.
   --warmup_steps WARMUP_STEPS
                           Number of warmup steps for the learning rate scheduler (default: 2000)
   --gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
                           Number of steps to accumulate gradients before performing an optimizer step (default: 8)
   --weight_loss_mlm WEIGHT_LOSS_MLM
                           Weight applied to the masked language modelling (MLM) loss (default: 1)
   --weight_loss_class WEIGHT_LOSS_CLASS
                           Weight applied to the classification loss (default: 10)
   --max_length MAX_LENGTH
                           Maximum sequence length for tokenizing paired proteins (default: 1603)
   --batch_size_train BATCH_SIZE_TRAIN
                           The training batch size on each device (default: 16)
   --batch_size_val BATCH_SIZE_VAL
                           The validation batch size on each device (default: 16)

   ESM-2 model loading:
   --offline_model_path OFFLINE_MODEL_PATH
                           Path to a locally stored ESM-2 model
   --model_name MODEL_NAME
                           Choose the ESM-2 model to load (esm2_t12_35M_UR50D / esm2_t33_650M_UR50D)
   --embedding_size EMBEDDING_SIZE
                           Set embedding vector size based on the selected ESM-2 model (480 / 1280)


Inference in the binary mutation effect task.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   usage: PLMinteract mutation_predict [-h] [--seed SEED] [--task_name TASK_NAME] --test_filepath TEST_FILEPATH --output_path
                                    OUTPUT_PATH --resume_from_checkpoint RESUME_FROM_CHECKPOINT
                                    [--weight_loss_mlm WEIGHT_LOSS_MLM] [--weight_loss_class WEIGHT_LOSS_CLASS]
                                    [--max_length MAX_LENGTH] [--batch_size_val BATCH_SIZE_VAL] --offline_model_path
                                    OFFLINE_MODEL_PATH --model_name MODEL_NAME --embedding_size EMBEDDING_SIZE

   Inference in the binary mutation effect task

   options:
   -h, --help            show this help message and exit
   --seed SEED           Random seed for reproducibility (default: 2).
   --task_name TASK_NAME
                           Set the task name (e.g., mutation_effects_pre)(default: "").

   Input data and path of output results:
   --test_filepath TEST_FILEPATH
                           Path to the input CSV file for testing.
   --output_path OUTPUT_PATH
                           Path to save prediction results.

   PLM-interact parameters:
   --resume_from_checkpoint RESUME_FROM_CHECKPOINT
                           Path to a trained model.
   --weight_loss_mlm WEIGHT_LOSS_MLM
                           Weight applied to the masked language modeling (MLM) loss (default: 1).
   --weight_loss_class WEIGHT_LOSS_CLASS
                           Weight applied to the classification loss (default: 10).
   --max_length MAX_LENGTH
                           Maximum sequence length for tokenizing paired proteins (default: 1603).
   --batch_size_val BATCH_SIZE_VAL
                           The validation batch size on each device (default: 16).

   ESM2 model loading:
   --offline_model_path OFFLINE_MODEL_PATH
                           Path to a locally stored ESM-2 model.
   --model_name MODEL_NAME
                           Choose the ESM-2 model to load (esm2_t12_35M_UR50D / esm2_t33_650M_UR50D).
   --embedding_size EMBEDDING_SIZE
                           Set embedding vector size based on the selected ESM-2 model (480 / 1280).