{ "cells": [ { "cell_type": "markdown", "id": "7529b0be-6ee0-40b5-b38f-c873ec1e29d2", "metadata": {}, "source": [ "# RESP pipeline example\n", "\n", "This notebook illustrates how to use resp_protein_toolkit to train a model using data collected against\n", "a single target or antigen of interest, then run an in silico search for novel binders using the resulting\n", "model. It takes as input two gzipped csv files included with the package under the \"example_dataset\" folder.\n", "To run this notebook, obtain the files \"mHER_H3_AgNeg.csv\" and \"mHER_H3_AgPos.csv\" from the repo https://github.com/dahjan/DMS_opt\n", "and move them to the same directory as this notebook.\n", "\n", "**IMPORTANT NOTE:** We have conducted only a minimal set of hyperparameter tuning experiments for the model used in this notebook. More extensive tuning would very likely achieve better performance. Also note that for some hyperparameter settings the model can perform quite poorly. We do not suggest that you use the hyperparameters shown here for your dataset without first testing them to ensure they achieve satisfactory performance. In other words: use this notebook as a guide to how to set up and run the pipeline, not as a guide to which hyperparameters to use for your deep learning model." 
] }, { "cell_type": "code", "execution_count": 1, "id": "c6fd6f0c-be49-491c-bb87-65b823c99e4b", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "from sklearn.metrics import matthews_corrcoef as MCC, accuracy_score, average_precision_score\n", "from sklearn.calibration import calibration_curve, CalibrationDisplay\n", "from sklearn.model_selection import train_test_split\n", "import torch\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import resp_protein_toolkit \n", "from resp_protein_toolkit import OneHotProteinEncoder\n", "from resp_protein_toolkit import ByteNetSingleSeq\n", "from resp_protein_toolkit import InSilicoDirectedEvolution as ISDE\n", "\n", "negative_examples = pd.read_csv(\"mHER_H3_AgNeg.csv\")\n", "positive_examples = pd.read_csv(\"mHER_H3_AgPos.csv\")" ] }, { "cell_type": "code", "execution_count": 2, "id": "286daf4a-b27e-4ca8-9c3e-eccc90190089", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | Unnamed: 0 | Count | Fraction | NucSeq | AASeq | AgClass |
|---|---|---|---|---|---|---|
| 0 | 0 | 7 | 0.000007 | TGTAGCAGGTACACTATCTGCAGTTTCTACAAGCTCCAGTATTGG | YTICSFYKLQ | 0 |
| 1 | 1 | 95 | 0.000041 | TGTAGCAGGTGGTTCCTCTGCGGCTTCTACCAGAACATGTATTGG | WFLCGFYQNM | 0 |
| 2 | 2 | 3 | 0.000001 | TGTAGCAGGTTCGGCAACATCAGCTCCTTCGCGATCGCGTATTGG | FGNISSFAIA | 0 |
| 3 | 3 | 10 | 0.000005 | TGTAGCAGGTTCAAGGTCAACGGTCTGTTCCCGCACCTCTATTGG | FKVNGLFPHL | 0 |
| 4 | 4 | 16 | 0.000016 | TGTAGCAGGTACACTATCTGCAGTATGTACGAGTTCGATTATTGG | YTICSMYEFD | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 27534 | 27534 | 79 | 0.000034 | TGTAGCAGGTGGGACGAGGGCGACCCCTACCCCTACCAGTATTGG | WDEGDPYPYQ | 0 |
| 27535 | 27535 | 3 | 0.000003 | TGTAGCAGGTGGCATGAGGACGGCATGTACCAGAACGAGTATTGG | WHEDGMYQNE | 0 |
| 27536 | 27536 | 115 | 0.000049 | TGTAGCAGGTACCGCGACTCCCACTCCTTCACGTTCGTCTATTGG | YRDSHSFTFV | 0 |
| 27537 | 27537 | 14 | 0.000006 | TGTAGCAGGTGGGACGTCCTCAACTACTTCGTGTTCATCTATTGG | WDVLNYFVFI | 0 |
| 27538 | 27538 | 46 | 0.000021 | TGTAGCAGATGGGTGCTCGTCGGTATGTACATGTTCGGGTATTGG | WVLVGMYMFG | 0 |

27539 rows × 6 columns
\n", "