Built-in models for protein sequence data / fitness landscapes

The resp_protein_toolkit contains a couple of built-in deep learning models for modeling protein fitness landscapes. The currently available built-in models are based on Microsoft’s ByteNet, adapted so that they can be made uncertainty-aware using the VanillaRFFs layer also available in this package. You do not have to use these models: when using the RESP in silico directed evolution tools in this package, you can substitute another uncertainty-aware model of your choosing.

Here are the details:

class resp_protein_toolkit.ByteNetSingleSeq(input_dim, hidden_dim, n_layers, kernel_size, dil_factor, rep_dim=100, pool_type='max', dropout=0.0, slim=False, llgp=False, objective='regression', num_predicted_categories=1, gp_cov_momentum=0.999, gp_ridge_penalty=0.001, gp_amplitude=1.0, num_rffs=1024)

A model for predicting the fitness of a given antibody using a series of ByteNet blocks. Note that it makes predictions using a single sequence only, not sequence pairs.

Parameters:
  • input_dim (int) – The expected dimensionality of the input, which is (N, L, input_dim).

  • hidden_dim (int) – The dimensions used inside the model.

  • n_layers (int) – The number of ByteNet blocks to use.

  • kernel_size (int) – The kernel width for ByteNet blocks.

  • dil_factor (int) – Used for calculating dilation factor, which increases by this factor on each subsequent layer. For short sequence inputs, use 1. For long sequences, 2 (or even 3) may be more appropriate.

  • rep_dim (int) – At the end of the ByteNet blocks, the model applies either mean pooling or max pooling (see pool_type) across the tokens in each sequence to generate a representation. rep_dim determines the size of that representation.

  • pool_type (str) – One of “max”, “mean”. Determines the type of pooling that is applied in the final layer.

  • dropout (float) – The level of dropout to apply.

  • slim (bool) – If True, use a smaller size within each ByteNet block.

  • llgp (bool) – If True, use a last-layer GP, which enables us to estimate uncertainty.

  • objective (str) – Must be one of “regression”, “binary_classifier”, “multiclass”, “ordinal”.

  • num_predicted_categories (int) – The number of categories (i.e. possible values for y in output). Ignored unless objective is “multiclass” or “ordinal”, and required if the objective is either of those things.

  • gp_cov_momentum (float) – A “discount factor” used to update a moving average for the updates to the covariance matrix when llgp is True. 0.999 is a reasonable default if the number of steps per epoch is large, otherwise you may want to experiment with smaller values. If you set this to < 0 (e.g. to -1), the precision matrix will be generated in a single epoch without any momentum. If llgp is False (model is not uncertainty aware), there is no covariance matrix and this argument is ignored.

  • gp_ridge_penalty (float) – The ridge penalty for the last layer GP. Performance is not usually very sensitive to this although in some cases experimenting with it may improve performance, and it can affect calibration. It should not be set to zero since it is important for numerical stability for it to be > 0. The default is 1e-3.

  • gp_amplitude (float) – The kernel amplitude for the last layer Gaussian process. This is the inverse of the lengthscale. Performance is not generally very sensitive to the selected value for this hyperparameter, although it may affect calibration. Defaults to 1.

  • num_rffs (int) – The number of random Fourier features used to approximate a GP in the final model layer. Only used if llgp is set to True; otherwise it is ignored. A larger number of RFFs means a more accurate kernel approximation. Default is 1024 which is usually fine for most purposes.
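
As a rough guide to choosing dil_factor, the sketch below estimates how many sequence positions a single output token can see. It assumes (this is not confirmed by the text above) that block i uses a dilation of dil_factor ** i and contains one convolution of width kernel_size. The names dilation_schedule and receptive_field are hypothetical helpers, not part of resp_protein_toolkit.

```python
def dilation_schedule(n_layers, dil_factor):
    """Dilation per block, ASSUMING block i uses dil_factor ** i."""
    return [dil_factor ** i for i in range(n_layers)]


def receptive_field(n_layers, kernel_size, dil_factor):
    """Positions seen by one output token for a stack of dilated convolutions,
    ASSUMING one convolution of width kernel_size per block."""
    return 1 + sum((kernel_size - 1) * d
                   for d in dilation_schedule(n_layers, dil_factor))


# Same depth and kernel width, different dilation factors.
short = receptive_field(n_layers=4, kernel_size=5, dil_factor=1)
long = receptive_field(n_layers=4, kernel_size=5, dil_factor=2)
print(short, long)  # prints "17 61"
```

Under these assumptions, dil_factor=1 covers only a short window, while dil_factor=2 reaches much further with the same number of blocks, which is why larger values may suit long sequences.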

__init__(input_dim, hidden_dim, n_layers, kernel_size, dil_factor, rep_dim=100, pool_type='max', dropout=0.0, slim=False, llgp=False, objective='regression', num_predicted_categories=1, gp_cov_momentum=0.999, gp_ridge_penalty=0.001, gp_amplitude=1.0, num_rffs=1024)

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x_antibody, update_precision=False, get_var=False)
Parameters:
  • x_antibody (N, L, in_channels) – The antibody sequence data.

  • update_precision (bool) – If you want to generate the covariance matrix during the last epoch only (i.e. you set gp_cov_momentum to < 0 when creating this model), set this to True during the last epoch only. If you want to generate the covariance matrix over the course of training (i.e. gp_cov_momentum is > 0 and < 1), set this to True throughout training. This should always be False during inference.

  • get_var (bool) – If True, return estimated variance on predictions. Only available if ‘llgp’ in class constructor is True AND objective is regression. Otherwise, this option can still be passed but will be ignored.

Returns:
  • Output (tensor) – Shape depends on objective. If regression or binary_classifier, shape will be (N). If multiclass, shape will be (N, num_predicted_categories), where num_predicted_categories is the value that was passed when the model was constructed.

  • var (tensor) – Only returned if get_var is True, objective is regression and model was initialized with llgp set to True. If returned, it is a tensor of shape (N).
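
The rule for update_precision described above can be written as a small helper that returns the flag to use in each epoch. This is only a sketch of the documented behavior: update_precision_flags is a hypothetical name, and raising an error for values that are neither negative nor strictly between 0 and 1 is an assumption made here, since the text does not cover those cases.

```python
def update_precision_flags(n_epochs, gp_cov_momentum):
    """Return one bool per training epoch: the value to pass as update_precision."""
    if gp_cov_momentum < 0:
        # No momentum: build the matrix in a single pass, in the last epoch only.
        return [epoch == n_epochs - 1 for epoch in range(n_epochs)]
    if 0 < gp_cov_momentum < 1:
        # Moving average: update throughout training.
        return [True] * n_epochs
    # ASSUMPTION: other values are treated as invalid in this sketch.
    raise ValueError("gp_cov_momentum should be < 0 or strictly between 0 and 1")


flags_single_pass = update_precision_flags(n_epochs=3, gp_cov_momentum=-1)
flags_momentum = update_precision_flags(n_epochs=3, gp_cov_momentum=0.999)

# In a training loop (model and loader are placeholders, not defined here):
# for flag in flags_momentum:
#     for x_batch, y_batch in loader:
#         preds = model(x_batch, update_precision=flag)
```

At inference time the flag is simply left at its default of False.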

get_ordinal_score(x, get_var=False)

Returns the latent score (for ordinal regression only; if any other objective has been specified a RuntimeError will be raised).

Parameters:
  • x (N, L, in_channels) – The antibody sequence data.

  • get_var (bool) – If True, return estimated variance on predictions. Only available if llgp in the class constructor is True. Otherwise, this option can still be passed but will be ignored.

Returns:
  • scores (np.ndarray) – An array of shape (N), where this is the latent score for each datapoint.

  • var (np.ndarray) – Only returned if get_var is True and llgp in the class constructor is True. If returned, it is of shape (N).

predict(x, get_var=False)

This function returns the predicted y-value for each datapoint. For convenience, it takes numpy arrays as input and returns numpy arrays as output. If you already have PyTorch tensors it may be slightly faster / more convenient to use forward instead of calling predict.

Parameters:
  • x (np.ndarray) – The input antibody data.

  • get_var (bool) – If True, return estimated variance on predictions. Only available if ‘llgp’ in class constructor is True and the objective in the class constructor is “regression”. Otherwise this argument is ignored.

Returns:
  • scores (np.ndarray) – If class objective is “regression” or “binary_classifier”, this is of shape (N). If “multiclass”, this is of shape (N, num_predicted_categories), using the value passed to the class constructor.

  • var (np.ndarray) – Only returned if get_var is True, llgp in the class constructor is True and the objective is “regression”. If returned, is of shape (N).
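
One way to use the returned variance is to check how often the true values fall inside an interval built from it. The sketch below assumes the predictive distribution is roughly Gaussian, so that scores ± 1.96 * sqrt(var) is an approximate 95% interval. The function interval_coverage is a hypothetical helper and the numbers are toy values, not output from a real model.

```python
import math


def interval_coverage(y_true, preds, var, z=1.96):
    """Fraction of targets inside preds +/- z * sqrt(var)."""
    inside = 0
    for y, mu, v in zip(y_true, preds, var):
        half_width = z * math.sqrt(v)
        if abs(y - mu) <= half_width:
            inside += 1
    return inside / len(y_true)


# Toy numbers standing in for `preds, var = model.predict(x, get_var=True)`.
y_true = [1.0, 2.0, 3.0, 4.0]
preds = [1.1, 2.5, 2.9, 6.0]
var = [0.04, 0.09, 0.01, 0.25]

coverage = interval_coverage(y_true, preds, var)
print(coverage)  # prints 0.75
```

A coverage far below the nominal level on held-out data suggests the variances are too small, in which case hyperparameters such as gp_ridge_penalty or gp_amplitude may be worth revisiting.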

class resp_protein_toolkit.ByteNetPairedSeqs(input_dim, hidden_dim, n_layers, kernel_size, dil_factor, rep_dim=100, dropout=0.0, slim=False, llgp=False, antigen_dim=None, objective='regression', num_predicted_categories=1, gp_cov_momentum=0.999, gp_ridge_penalty=0.001, gp_amplitude=1.0, num_rffs=1024)

A model for predicting the fitness of a given antibody-antigen pair using a series of ByteNet blocks. Note that it accepts two sets of sequences as input: the antigen sequence and the antibody sequence. Each of these is fed through its own series of ByteNet blocks, and the two resulting representations are merged at the end.

Parameters:
  • input_dim (int) – The expected dimensionality of the input, which is (N, L, input_dim).

  • hidden_dim (int) – The dimensions used inside the model.

  • n_layers (int) – The number of ByteNet blocks to use.

  • kernel_size (int) – The kernel width for ByteNet blocks.

  • dil_factor (int) – Used for calculating dilation factor, which increases by this factor on each subsequent layer. For short sequence inputs, use 1. For long sequences, 2 (or even 3) may be more appropriate.

  • rep_dim (int) – At the end of the ByteNet blocks, the mean is taken across the tokens in each sequence to generate a representation. rep_dim determines the size of that representation.

  • dropout (float) – The level of dropout to apply.

  • slim (bool) – If True, use a smaller size within each ByteNet block.

  • llgp (bool) – If True, use a last-layer GP.

  • antigen_dim (int or None) – The dimensionality of the antigen input. If None, the antigen input is assumed to have the same dimensionality as the antibody input.

  • objective (str) – Must be one of “regression”, “binary_classifier”, “multiclass”, “ordinal”.

  • num_predicted_categories (int) – The number of categories (i.e. possible values for y in output). Ignored unless objective is “multiclass” or “ordinal”, and required if the objective is either of those things.

  • gp_cov_momentum (float) – A “discount factor” used to update a moving average for the updates to the covariance matrix when llgp is True. 0.999 is a reasonable default if the number of steps per epoch is large, otherwise you may want to experiment with smaller values. If you set this to < 0 (e.g. to -1), the precision matrix will be generated in a single epoch without any momentum. If llgp is False (model is not uncertainty aware), there is no covariance matrix and this argument is ignored.

  • gp_ridge_penalty (float) – The ridge penalty for the last layer GP. Performance is not usually very sensitive to this although in some cases experimenting with it may improve performance, and it can affect calibration. It should not be set to zero since it is important for numerical stability for it to be > 0. The default is 1e-3.

  • gp_amplitude (float) – The kernel amplitude for the last layer Gaussian process. This is the inverse of the lengthscale. Performance is not generally very sensitive to the selected value for this hyperparameter, although it may affect calibration. Defaults to 1.

  • num_rffs (int) – The number of random Fourier features used to approximate a GP in the final model layer. Only used if llgp is set to True; otherwise it is ignored. A larger number of RFFs means a more accurate kernel approximation. Default is 1024 which is usually fine for most purposes.
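
To illustrate why more random Fourier features give a better kernel approximation, here is a minimal textbook sketch for an RBF kernel. It is not the VanillaRFFs implementation from this package: the choice of an RBF kernel, the lengthscale parameterization and the helper names make_rff_map and rbf_kernel are all assumptions made for illustration.

```python
import math
import random


def make_rff_map(input_dim, num_rffs, lengthscale=1.0, seed=0):
    """Build a feature map z(x) whose dot products approximate an RBF kernel."""
    rng = random.Random(seed)
    # Frequencies are drawn from N(0, 1 / lengthscale**2), offsets from U(0, 2*pi).
    weights = [[rng.gauss(0.0, 1.0 / lengthscale) for _ in range(input_dim)]
               for _ in range(num_rffs)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_rffs)]
    scale = math.sqrt(2.0 / num_rffs)

    def features(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return features


def rbf_kernel(x, y, lengthscale=1.0):
    """Exact RBF kernel exp(-||x - y||^2 / (2 * lengthscale^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * lengthscale ** 2))


x, y = [0.2, -0.4, 1.0], [0.5, 0.1, 0.3]
exact = rbf_kernel(x, y)

errors = {}
for num_rffs in (16, 1024):
    phi = make_rff_map(input_dim=3, num_rffs=num_rffs)
    approx = sum(a * b for a, b in zip(phi(x), phi(y)))
    errors[num_rffs] = abs(approx - exact)

print(exact, errors)
```

The approximation error shrinks on average roughly as 1 / sqrt(num_rffs), so the gain from going beyond about a thousand features is typically small relative to the extra cost.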

__init__(input_dim, hidden_dim, n_layers, kernel_size, dil_factor, rep_dim=100, dropout=0.0, slim=False, llgp=False, antigen_dim=None, objective='regression', num_predicted_categories=1, gp_cov_momentum=0.999, gp_ridge_penalty=0.001, gp_amplitude=1.0, num_rffs=1024)

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x_antibody, x_ant, update_precision=False, get_var=False)
Parameters:
  • x_antibody (N, L, in_channels) – The antibody sequence data.

  • x_ant (N, L2, in_channels) – The antigen sequence data.

  • update_precision (bool) – If you want to generate the covariance matrix during the last epoch only (i.e. you set gp_cov_momentum to < 0 when creating this model), set this to True during the last epoch only. If you want to generate the covariance matrix over the course of training (i.e. gp_cov_momentum is > 0 and < 1), set this to True throughout training. This should always be False during inference.

  • get_var (bool) – If True, return estimated variance on predictions. Only available if ‘llgp’ in class constructor is True AND objective is regression. Otherwise, this option can still be passed but will be ignored.

Returns:
  • Output (tensor) – Shape depends on objective. If regression or binary_classifier, shape will be (N). If multiclass, shape will be (N, num_predicted_categories), where num_predicted_categories is the value that was passed when the model was constructed.

  • var (tensor) – Only returned if get_var is True, objective is regression and model was initialized with llgp set to True. If returned, it is a tensor of shape (N).

get_ordinal_score(x, get_var=False)

Returns the latent score (for ordinal regression only; if any other objective has been specified a RuntimeError will be raised).

Parameters:
  • x (N, L, in_channels) – The antibody sequence data.

  • get_var (bool) – If True, return estimated variance on predictions. Only available if llgp in the class constructor is True. Otherwise, this option can still be passed but will be ignored.

Returns:
  • scores (np.ndarray) – An array of shape (N), where this is the latent score for each datapoint.

  • var (np.ndarray) – Only returned if get_var is True and llgp in the class constructor is True. If returned, it is of shape (N).

predict(x, ant, get_var=False)

This function returns the predicted y-value for each datapoint. For convenience, it takes numpy arrays as input and returns numpy arrays as output. If you already have PyTorch tensors it may be slightly faster / more convenient to use forward instead of calling predict.

Parameters:
  • x (np.ndarray) – The input antibody data.

  • ant (np.ndarray) – The input antigen data.

  • get_var (bool) – If True, return estimated variance on predictions. Only available if ‘llgp’ in class constructor is True and the objective in the class constructor is “regression”. Otherwise this argument is ignored.

Returns:
  • scores (np.ndarray) – If class objective is “regression” or “binary_classifier”, this is of shape (N). If “multiclass”, this is of shape (N, num_predicted_categories), using the value passed to the class constructor.

  • var (np.ndarray) – Only returned if get_var is True, llgp in the class constructor is True and the objective is “regression”. If returned, is of shape (N).
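
The paired model takes two batches that share the batch size N but may differ in sequence length (L for antibodies, L2 for antigens). The sketch below prepares such inputs, assuming a simple one-hot encoding over 20 amino acids plus a gap symbol, so that the feature dimension is 21. The alphabet and the helpers one_hot_encode and shape are illustrative assumptions; any encoding that produces the expected shapes could be used instead.

```python
# ASSUMED alphabet: 20 standard amino acids plus "-" for gaps.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot_encode(sequences):
    """Encode equal-length sequences as a nested list of shape (N, L, 21)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    length = len(sequences[0])
    encoded = []
    for seq in sequences:
        if len(seq) != length:
            raise ValueError("all sequences in a batch must share one length")
        rows = []
        for aa in seq:
            row = [0.0] * len(AMINO_ACIDS)
            row[index[aa]] = 1.0
            rows.append(row)
        encoded.append(rows)
    return encoded


def shape(batch):
    """Shape of a nested (N, L, D) list."""
    return (len(batch), len(batch[0]), len(batch[0][0]))


antibodies = one_hot_encode(["ACDK", "ACEK"])      # N=2, L=4
antigens = one_hot_encode(["MKTAYW", "MKTVYW"])    # N=2, L2=6

print(shape(antibodies), shape(antigens))  # prints "(2, 4, 21) (2, 6, 21)"
```

Since predict expects numpy arrays, nested lists like these would need to be converted (for example with np.asarray) before being passed as x and ant.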

To train these models, it’s typical to pass one of them together with training settings (learning rate, learning rate scheduler, selected optimizer etc.) to a function that will train the model for some set number of epochs (say 1 or 2), then calculate some performance metric on the training and test set.

The details of learning rate, learning rate scheduler, optimizer etc. may need to be changed depending on your problem; it’s usually a good idea to check performance on a validation set and adjust as needed. For an example of how to train this kind of model and use it with RESP to generate new sequences, see the example notebook on the main page of the docs.