What is in silico directed evolution ====================================== The computational portion of the RESP pipeline may be divided into three steps: #. Step 1. Prepare the experimental data for analysis (quality filter raw reads, translate to amino acid sequences, number or align if necessary). #. Step 2. Fit an uncertainty-aware model and ensure that performance is sufficient (e.g. using cross-validation). #. Step 3. Run *in silico* directed evolution, using the uncertainty-aware model to identify candidate sequences predicted to be tight binders or exhibit high activity. #. Step 4. Filter the accepted sequences from the *in silico* search for humanness, developability and other desired properties. Step 1 tends to be fairly problem-specific -- it may depend, for example, on how the experiment was set up and the sequencing technology -- and so tools for this step are not provided here. There are a broad range of options available for Step 2. Of these, we've found CNNs with a last-layer GP (an SNGP model), gradient boosted trees and approximate Gaussian processes `to work well `_, although we've used variational Bayesian NNs `before as well. `_ To learn how to fit an approximate GP to sequence data, see the `xGPR library documentation `_. We'll eventually provide support for training a variety of models here. Step 4 requires models with high accuracy for predicting immunogenicity / humanness and other key developability properties (e.g. stability). For humanness, we've released a model in `the AntPack library `_ that `achieves excellent performance `_. This leaves step 3, where we search sequence space for new candidates with predicted high binding. It's not necessary to use generative models for this step -- indeed, we've found this merely substitutes a black-box alternative for methods like genetic algorithms or simulated annealing. Our preferred alternative here is *in silico* directed evolution: a modified variant on simulated annealing in which we introduce and score mutations into a starting wild-type sequence. For details, `see this paper `_. The `resp_protein_toolkit` provides support for conducting an in silico directed evolution search using a trained uncertainty-aware model. Uncertainty is key here: if the model is not uncertainty-aware, we can veer into poorly-mapped regions of sequence space that are not represented in the training data. By restricting our search to high-confidence candidates, we can minimize the number of experimental evaluations needed for success. **How to run in silico directed evolution** First, create a Python class that exposes a function called `predict`. `predict` must take as an argument a sequence (as a string), and must return: * Two 1d numpy arrays of equal size, the first of which is predicted scores against your antigens of interest and the second of which is the uncertainty on each score. If there is only one antigen each array should be shape[0]=1. You will likely need to set up this class so it wraps / calls a suitable function for encoding the protein sequence supplied to `predict` and then feeding the encoded sequence into a trained uncertainty-aware ML model. Next, create an instance of the `InSilicoDirectedEvolution` class, as below:: from resp_protein_toolkit import InSilicoDirectedEvolution isde_tool = InSilicoDirectedEvolution(model_object, uncertainty_threshold, prob_distro = None, seed = 123) The `model_object` should be the object you've created that has a `predict` method. For more details on `uncertainty_threshold`, `prob_distro` and `seed` see the `Details` section below. Once this object has been created, you can run *in silico* directed evolution by calling `run_chain` as shown below:: isde_tool.run_chain(wild_type_sequence) accepted_seqs = isde_tool.get_accepted_seqs() accepted_scores = isde_tool.get_scores() accepted_uncertainty = isde_tool.get_uncertainty() To see how much progress the in silico directed evolution process made, it can be useful to plot the scores. If there is only one target, this is easily done via:: import matplotlib.pyplot as plt plt.plot(accepted_scores) There are a variety of options you can set when calling `run_chain` to change how the process is conducted; for more details on these see `Details` below. Note that early on in the process, mutations that are slightly deleterious or neutral have some probability of being accepted, whereas later on only mutations that improve the score have a non-negligible probability of being accepted. This procedure ensures the algorithm explores sequence space widely initially before focusing on the best mutations found so far later on. For this reason, however, the earlier portion of the `accepted_seqs` are often associated with low scores, so after plotting the score trajectory during *in silico* directed evolution it's typical to discard all of the accepted sequences that score below some threshold. For example, if the best score achieved by the chain is 4 and the wild-type scores 0, it might be reasonable to discard anything below 3 -- depending on the dataset, your uncertainty- aware model, and the number of sequences you can afford to test experimentally. The threshold ultimately depends on what a "good" score is, which is dependent on your model. Finally, it is sometimes desirable to see how many mutations in an accepted sequence can be converted *back* to wild-type with minimal loss in score. To do this, you can use the `polish` method of the isde_tool. For example:: polished_sequence = isde_tool.polish(accepted_mutant_sequence, wild_type_sequence, thresh=0.99) This call will create a polished_sequence by reverting as many positions to wild-type as possible *as long as* the resulting sequence has a score > thresh * the score of the accepted mutant sequence. With a thresh of 0.99, for example, the polished sequence is guaranteed to have a score > 0.99 * the score of the accepted mutant sequence. If there are multiple targets, the score for each target is guaranteed to be > thresh * the score of the accepted mutant sequence for that target. `polish` is slow but can sometimes be very useful if you are trying to minimize the number of mutations in an accepted sequence. Once you've found an appropriate set of accepted sequences, you can filter them based on predicted humanness and other properties. It's usual to run more than one *in silico* directed evolution chain; since an `InSilicoDirectedEvolution` tool only has one random seed and stores the sequences associated with its completed run, you'll need to create a second object with a new random seed to do this. **Details** Here are the details on how to use the `InSilicoDirectedEvolution` class. .. autoclass:: resp_protein_toolkit.InSilicoDirectedEvolution :special-members: __init__ :members: run_chain, polish, get_scores, get_accepted_seqs, get_uncertainty, get_acceptance_rate