Usage

Quick start

Predict a list of PPIs

To predict a list of PPIs, download a pre-trained model from Hugging Face and list the protein sequence pairs in the format described below.

Required input:

(--test_filepath): A CSV file with the following two columns: ‘query’, the sequence of protein 1, and ‘text’, the sequence of protein 2.

(--resume_from_checkpoint): The trained model, which can be downloaded from Hugging Face <https://huggingface.co/danliu1226>.

(--output_filepath): A path to save the results.
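
As a minimal sketch of preparing the input file, the snippet below writes a CSV with the two required columns, ‘query’ and ‘text’, using only the Python standard library. The helper name write_pairs_csv, the file name, and the toy sequences are assumptions made for illustration; they are not part of PLM-interact.

```python
import csv
import tempfile
from pathlib import Path


def write_pairs_csv(pairs, path):
    """Write (protein 1, protein 2) sequence pairs under the 'query' and 'text' columns."""
    path = Path(path)
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["query", "text"])  # column names required by --test_filepath
        for seq1, seq2 in pairs:
            writer.writerow([seq1, seq2])
    return path


# Toy sequences for illustration only, not real proteins.
example_pairs = [
    ("MKTAYIAKQR", "MSDNGPQNQR"),
    ("MGSSHHHHHH", "MALWMRLLPL"),
]

# Hypothetical output location; any writable path works.
out_dir = Path(tempfile.mkdtemp())
csv_path = write_pairs_csv(example_pairs, out_dir / "example_pairs.csv")
print(csv_path.read_text())
```

The resulting file path can then be passed as the value of --test_filepath.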

There are 6 commands in the PLM-interact package:

inference_PPI: PPI prediction.
train_mlm: Training PPI models using mask and binary classification losses.
train_binary: Training PPI models using only binary classification loss.
predict_ddp: Choose the best trained checkpoints by testing on the validation datasets and evaluate the model’s performance on the test datasets.
mutation_train: Fine-tuning in the binary mutation effect task.
mutation_predict: Inference in the binary mutation effect task.

PPI prediction

Training PPI models using mask and binary classification losses

usage: PLMinteract train_mlm [-h] [--seed SEED] [--data DATA] [--task_name TASK_NAME] --train_filepath TRAIN_FILEPATH
                           --output_filepath OUTPUT_FILEPATH [--epochs EPOCHS] [--resume_from_checkpoint RESUME_FROM_CHECKPOINT]
                           [--warmup_steps WARMUP_STEPS] [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
                           [--weight_loss_mlm WEIGHT_LOSS_MLM] [--weight_loss_class WEIGHT_LOSS_CLASS] [--max_length MAX_LENGTH]
                           [--batch_size_train BATCH_SIZE_TRAIN] --offline_model_path OFFLINE_MODEL_PATH --model_name MODEL_NAME
                           --embedding_size EMBEDDING_SIZE

Training PPI models using mask and binary classification losses.

options:
-h, --help            show this help message and exit
--seed SEED           Random seed for reproducibility (default: 2).
--data DATA           Set the dataset name (e.g., cross_species)(default: "").
--task_name TASK_NAME
                        Set the task name (e.g., 1vs10, 1vs1)(default: "").

Input data and path of output results:
--train_filepath TRAIN_FILEPATH
                        Path to the training dataset (CSV format).
--output_filepath OUTPUT_FILEPATH
                        Path to save trained model checkpoints and training results.

PLM-interact setting:
--epochs EPOCHS       Total number of training epochs (default: 10)
--resume_from_checkpoint RESUME_FROM_CHECKPOINT
                        Path to a checkpoint to resume training from, if continuing a previous run.
--warmup_steps WARMUP_STEPS
                        Number of warmup steps for the learning rate scheduler (default: 2000).
--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
                        Number of steps to accumulate gradients before performing an optimizer step (default: 8).
--weight_loss_mlm WEIGHT_LOSS_MLM
                        Weight applied to the masked language modeling (MLM) loss (default: 1).
--weight_loss_class WEIGHT_LOSS_CLASS
                        Weight applied to the classification loss (default: 10).
--max_length MAX_LENGTH
                        Maximum sequence length for tokenizing paired proteins (default: 1603).
--batch_size_train BATCH_SIZE_TRAIN
                        The training batch size on each device (default: 16).
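
To make the interaction between these settings concrete, here is a rough sketch using the defaults listed above. It rests on two assumptions that the help text does not confirm: that the total loss is a weighted sum of the MLM and classification losses, and that the number of sequence pairs per optimizer step is the per-device batch size times the accumulation steps times the number of devices. The function names are hypothetical.

```python
def effective_batch_size(batch_size_train=16, gradient_accumulation_steps=8, num_devices=1):
    """Sequence pairs contributing to one optimizer step (assumed data-parallel training)."""
    return batch_size_train * gradient_accumulation_steps * num_devices


def combined_loss(mlm_loss, class_loss, weight_loss_mlm=1.0, weight_loss_class=10.0):
    """Assumed weighted sum of the two training losses; the actual formula may differ."""
    return weight_loss_mlm * mlm_loss + weight_loss_class * class_loss


print(effective_batch_size())                # 128 with the defaults on one device
print(effective_batch_size(num_devices=4))   # 512 on four devices
print(combined_loss(2.0, 0.5))               # 1 * 2.0 + 10 * 0.5 = 7.0
```

Under these assumptions, the default weights make the classification term count ten times as much as the MLM term for equal raw loss values.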

ESM2 model loading:
--offline_model_path OFFLINE_MODEL_PATH
                        Path to a locally stored ESM-2 model.
--model_name MODEL_NAME
                        Choose the ESM-2 model to load (esm2_t12_35M_UR50D / esm2_t33_650M_UR50D).
--embedding_size EMBEDDING_SIZE
                        Set embedding vector size based on the selected ESM-2 model (480 / 1280).
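
Because --embedding_size has to match the model chosen with --model_name, a small lookup can help avoid mismatches. The sketch below pairs the two model names with the two sizes in the order they are listed above (480 for the 35M model, 1280 for the 650M model); the dictionary and function names are hypothetical helpers, not part of PLM-interact.

```python
# Embedding sizes for the two ESM-2 models named in the help text.
ESM2_EMBEDDING_SIZES = {
    "esm2_t12_35M_UR50D": 480,
    "esm2_t33_650M_UR50D": 1280,
}


def embedding_size_for(model_name):
    """Return the value to pass as --embedding_size for a given --model_name."""
    try:
        return ESM2_EMBEDDING_SIZES[model_name]
    except KeyError:
        known = ", ".join(sorted(ESM2_EMBEDDING_SIZES))
        raise ValueError(f"Unknown model {model_name!r}; expected one of: {known}") from None


print(embedding_size_for("esm2_t12_35M_UR50D"))   # 480
print(embedding_size_for("esm2_t33_650M_UR50D"))  # 1280
```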

Training PPI models using only binary classification loss.

Evaluation and testing with multiple nodes and multiple GPUs

Fine-tuning in the binary mutation effect task.

Inference in the binary mutation effect task.

usage: PLMinteract mutation_predict [-h] [--seed SEED] [--task_name TASK_NAME] --test_filepath TEST_FILEPATH --output_path
                                 OUTPUT_PATH --resume_from_checkpoint RESUME_FROM_CHECKPOINT
                                 [--weight_loss_mlm WEIGHT_LOSS_MLM] [--weight_loss_class WEIGHT_LOSS_CLASS]
                                 [--max_length MAX_LENGTH] [--batch_size_val BATCH_SIZE_VAL] --offline_model_path
                                 OFFLINE_MODEL_PATH --model_name MODEL_NAME --embedding_size EMBEDDING_SIZE

Inference in the binary mutation effect task

options:
-h, --help            show this help message and exit
--seed SEED           Random seed for reproducibility (default: 2).
--task_name TASK_NAME
                        Set the task name (e.g., mutation_effects_pre)(default: "").

Input data and path of output results:
--test_filepath TEST_FILEPATH
                        Path to the input CSV file for testing.
--output_path OUTPUT_PATH
                        Path to save prediction results.

PLM-interact parameters:
--resume_from_checkpoint RESUME_FROM_CHECKPOINT
                        Path to a trained model.
--weight_loss_mlm WEIGHT_LOSS_MLM
                        Weight applied to the masked language modeling (MLM) loss (default: 1).
--weight_loss_class WEIGHT_LOSS_CLASS
                        Weight applied to the classification loss (default: 10).
--max_length MAX_LENGTH
                        Maximum sequence length for tokenizing paired proteins (default: 1603).
--batch_size_val BATCH_SIZE_VAL
                        The validation batch size on each device (default: 16).

ESM2 model loading:
--offline_model_path OFFLINE_MODEL_PATH
                        Path to a locally stored ESM-2 model.
--model_name MODEL_NAME
                        Choose the ESM-2 model to load (esm2_t12_35M_UR50D / esm2_t33_650M_UR50D).
--embedding_size EMBEDDING_SIZE
                        Set embedding vector size based on the selected ESM-2 model (480 / 1280).
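
As an illustration of how the required arguments fit together, the sketch below assembles a mutation_predict command line from the flags in the usage text and prints it without running anything. All paths are placeholders, and build_mutation_predict_command is a hypothetical helper; only the flag names and defaults come from the help output above.

```python
import shlex


def build_mutation_predict_command(test_filepath, output_path, checkpoint,
                                   offline_model_path, model_name, embedding_size,
                                   batch_size_val=16, max_length=1603):
    """Assemble the argument list for `PLMinteract mutation_predict` (not executed here)."""
    return [
        "PLMinteract", "mutation_predict",
        "--test_filepath", str(test_filepath),
        "--output_path", str(output_path),  # note: --output_path, not --output_filepath
        "--resume_from_checkpoint", str(checkpoint),
        "--offline_model_path", str(offline_model_path),
        "--model_name", model_name,
        "--embedding_size", str(embedding_size),
        "--batch_size_val", str(batch_size_val),
        "--max_length", str(max_length),
    ]


# Placeholder paths for illustration; replace them with real locations.
command = build_mutation_predict_command(
    test_filepath="mutation_test.csv",
    output_path="results/",
    checkpoint="checkpoints/mutation_model",
    offline_model_path="models/esm2",
    model_name="esm2_t33_650M_UR50D",
    embedding_size=1280,
)
print(shlex.join(command))
```

The printed string can be pasted into a shell, or the list can be passed to subprocess.run once the paths point at real files.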