Protein encoding tools in resp_protein_toolkit

To use them as input to an ML model, protein sequences have to be encoded (e.g. by one-hot encoding). Writing Python code to do this is easy but frequently redundant. For convenience, this toolkit contains tools for encoding proteins using some very common schemes, using Python-wrapped C++ code to ensure speed. Embeddings are not supported yet, since there are too many protein LLMs available for it to be practical to maintain a shared API in one package, but we may add this at some point in the future.

Sequences are encoded as numpy arrays, which are easily converted to Jax / PyTorch (e.g. in PyTorch, use torch.from_numpy(my_array)). Currently supported schemes include:

  • One-hot encoding with either a 2d or 3d array as output, using either the basic 20 amino acid alphabet, or the basic alphabet plus gaps, or an extended alphabet including unusual symbols (B, J, O, U, X, Z).

  • Integer encoding, using either the basic 20 amino acid alphabet, or the basic alphabet plus gaps, or an extended alphabet including unusual symbols (B, J, O, U, X, Z). Integer encoding is useful for LightGBM (gradient boosted trees) and some clustering schemes.

  • Substitution matrix encoding using a 21 letter alphabet (standard AAs plus gaps) with various percent homologies and two encoding schemes supported.

For details on available options see below.

class resp_protein_toolkit.OneHotProteinEncoder(alphabet='gapped')

Provides basic one-hot encoding.

__init__(alphabet='gapped')

Class constructor.

Parameters:

alphabet (str) – One of ‘standard’, ‘gapped’ or ‘expanded’. ‘standard’ means only the basic 20 aas are allowed. ‘gapped’ means gaps are also allowed. ‘expanded’ means unusual AAs or low-confidence assignments like U, O, X, J, Z etc. are allowed.

encode(sequence_list, flatten_output_array=False, max_length=None)

One-hot encode and return a numpy array. If flattened, it is of shape N x (M * A), where A is the alphabet size (20 for standard amino acids only, 21 if gaps are included, 27 if using an expanded alphabet), M is the sequence length after padding (see max_length below) and N is the number of datapoints. Otherwise it is a 3d array of shape N x M x A.

Parameters:
  • sequence_list (list) – A list of sequences.

  • flatten_output_array (bool) – If True, a 2d flattened array is returned, otherwise a 3d array as discussed above.

  • max_length – Either None or an int. If None, the sequence length is determined based on the longest sequence present in the input list and sequences are zero-padded to that size if necessary. If an int, sequences are zero-padded to be the size of max_length (unless they already are that length). Note that specifying max_length and then passing in sequences longer than that will cause an exception.

Returns:

encoded_seqs (np.ndarray) – A numpy array of shape N x (M * A) or shape N x M x A depending on flatten_output_array as discussed above.

Raises:

RuntimeError – An exception is raised if invalid input is supplied.
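
To make the output shapes concrete, here is a minimal pure-NumPy sketch of the behavior described above. It is an illustration only, not the toolkit's C++ implementation: the function name one_hot_sketch, the alphabet ordering, the gap symbol '-' and the uint8 dtype are all assumptions made for this example.

```python
import numpy as np

# Hypothetical ordering of the 21-letter gapped alphabet (an assumption;
# the toolkit's actual column order may differ).
GAPPED_ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot_sketch(sequence_list, flatten_output_array=False, max_length=None):
    """Illustrative one-hot encoder mirroring the documented shapes."""
    longest = max(len(seq) for seq in sequence_list)
    if max_length is None:
        max_length = longest
    elif longest > max_length:
        raise RuntimeError("A sequence is longer than max_length.")

    index = {aa: i for i, aa in enumerate(GAPPED_ALPHABET)}
    n_seqs, n_symbols = len(sequence_list), len(GAPPED_ALPHABET)
    # Positions past the end of a sequence stay all-zero (zero-padding).
    encoded = np.zeros((n_seqs, max_length, n_symbols), dtype=np.uint8)
    for n, seq in enumerate(sequence_list):
        for m, aa in enumerate(seq):
            if aa not in index:
                raise RuntimeError(f"Unrecognized symbol {aa!r}.")
            encoded[n, m, index[aa]] = 1

    if flatten_output_array:
        # N x M x A  ->  N x (M * A)
        return encoded.reshape(n_seqs, -1)
    return encoded


arr3d = one_hot_sketch(["ACD", "A-"])
arr2d = one_hot_sketch(["ACD", "A-"], flatten_output_array=True, max_length=5)
print(arr3d.shape)  # (2, 3, 21)
print(arr2d.shape)  # (2, 105)
```

The second call pads both sequences to length 5, so each flattened row has 5 * 21 = 105 entries.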

class resp_protein_toolkit.SubstitutionMatrixEncoder(homology='90', rep_type='std')

Encodes input proteins using a substitution matrix.

__init__(homology='90', rep_type='std')

Class constructor.

Parameters:
  • homology (str) – The homology level to use, passed as a string (e.g. '90'). Currently supported values are 95, 90, 85, 75 and 62.

  • rep_type (str) – If ‘dist’, the substitution matrix is used to build a distance matrix, each row of which is used as a representation. If ‘std’, the distance matrix is then Cholesky-factored and scaled so that the squared Euclidean distance between the length-21 representations of any two AAs (or gaps) equals the substitution matrix distance. If ‘raw’, the raw row of the substitution matrix is used as a representation.

Raises:

RuntimeError – An exception is raised if an unsupported option is requested.
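
The idea behind a Cholesky-based representation such as ‘std’ can be illustrated on a toy example. The sketch below is not the toolkit's procedure, whose exact scaling is not described here. It assumes a symmetric positive-definite similarity matrix S with made-up values, defines the distance d(i, j) = S[i, i] + S[j, j] - 2 S[i, j], and checks that the rows of the Cholesky factor of S reproduce that distance as a squared Euclidean distance.

```python
import numpy as np

# Toy symmetric positive-definite similarity matrix for three symbols.
# Hypothetical values, not taken from any real substitution matrix.
similarity = np.array([[4.0, 1.0, 0.0],
                       [1.0, 3.0, 1.0],
                       [0.0, 1.0, 2.0]])

# Distance implied by the similarity: d(i, j) = s(i, i) + s(j, j) - 2 s(i, j).
diag = np.diag(similarity)
distance = diag[:, None] + diag[None, :] - 2.0 * similarity

# Cholesky factor: similarity == factor @ factor.T.
# Row i of the factor serves as the representation of symbol i.
factor = np.linalg.cholesky(similarity)

# Squared Euclidean distance between every pair of rows.
diffs = factor[:, None, :] - factor[None, :, :]
squared_euclidean = (diffs ** 2).sum(axis=-1)

print(np.allclose(squared_euclidean, distance))  # True
```

The identity holds because the dot product of rows i and j of the factor equals S[i, j], so expanding the squared difference gives S[i, i] + S[j, j] - 2 S[i, j].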

encode(sequence_list, flatten_output_array=False, max_length=None)

Encode and return a numpy array. If flattened, it is of shape N x (M * 21), where M is the number of amino acids and N is the number of datapoints. Otherwise it is a 3d array of shape N x M x 21.

Parameters:
  • sequence_list (list) – A list of sequences.

  • flatten_output_array (bool) – If True, a 2d flattened array is returned, otherwise a 3d array as discussed above.

  • max_length – Either None or an int. If None, the sequence length is determined based on the longest sequence present in the input list and sequences are zero-padded to that size if necessary. If an int, sequences are zero-padded to be the size of max_length (unless they already are that length). Note that specifying max_length and then passing in sequences longer than that will cause an exception.

Returns:

encoded_seqs (np.ndarray) – A numpy array of shape N x (M * 21) or shape N x M x 21 depending on flatten_output_array as discussed above.

Raises:

RuntimeError – An exception is raised if invalid input is supplied.

class resp_protein_toolkit.IntegerProteinEncoder(alphabet='gapped')

Provides integer encoding.

__init__(alphabet='gapped')

Class constructor.

Parameters:

alphabet (str) – One of ‘standard’, ‘gapped’ or ‘expanded’. ‘standard’ means only the basic 20 aas are allowed. ‘gapped’ means gaps are also allowed. ‘expanded’ means unusual AAs or low-confidence assignments like U, O, X, J, Z etc. are allowed.

encode(sequence_list, max_length=None)

Integer encode and return a numpy array of shape N x M, where N is the number of datapoints and M is the number of amino acids.

Parameters:
  • sequence_list (list) – A list of sequences.

  • max_length – Either None or an int. If None, the sequence length is determined based on the longest sequence present in the input list and sequences are zero-padded to that size if necessary. If an int, sequences are zero-padded to be the size of max_length (unless they already are that length). Note that specifying max_length and then passing in sequences longer than that will cause an exception.

Returns:

encoded_seqs (np.ndarray) – A numpy array of shape N x M (see above).

Raises:

RuntimeError – An exception is raised if invalid input is supplied.
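
As a rough picture of the N x M output, the following pure-NumPy sketch maps each symbol to an integer and zero-pads shorter sequences. It is an illustration under assumptions, not the toolkit's implementation: the function name integer_encode_sketch, the symbol-to-integer mapping and the uint8 dtype are all hypothetical.

```python
import numpy as np

# Hypothetical symbol-to-integer mapping (an assumption; the codes the
# toolkit actually assigns may differ).
GAPPED_ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def integer_encode_sketch(sequence_list, max_length=None):
    """Illustrative integer encoder returning an N x M array."""
    longest = max(len(seq) for seq in sequence_list)
    if max_length is None:
        max_length = longest
    elif longest > max_length:
        raise RuntimeError("A sequence is longer than max_length.")

    index = {aa: i for i, aa in enumerate(GAPPED_ALPHABET)}
    # Zero-padding: positions past the end of a sequence keep the value 0.
    # In this sketch codes start at 0, so padding coincides with the code
    # for 'A'; whether the toolkit behaves this way is not stated above.
    encoded = np.zeros((len(sequence_list), max_length), dtype=np.uint8)
    for n, seq in enumerate(sequence_list):
        for m, aa in enumerate(seq):
            if aa not in index:
                raise RuntimeError(f"Unrecognized symbol {aa!r}.")
            encoded[n, m] = index[aa]
    return encoded


codes = integer_encode_sketch(["ACD", "C-"])
print(codes.shape)     # (2, 3)
print(codes.tolist())  # [[0, 1, 2], [1, 20, 0]]
```

Each sequence occupies one row with one integer per position, which is the layout tree-based models such as LightGBM typically consume.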

If you are encoding only a single sequence, make sure to pass it as a list, e.g.:

encoder1.encode([my_sequence])
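
If single sequences come up often, a small wrapper can handle the list wrapping in one place. The helper below is hypothetical and not part of the toolkit; it only assumes an object with an encode(sequence_list, ...) method like the encoders documented above.

```python
def encode_single(encoder, sequence, **encode_kwargs):
    """Encode one sequence with any encoder exposing encode(sequence_list, ...).

    Hypothetical convenience helper, not part of resp_protein_toolkit.
    """
    if not isinstance(sequence, str):
        raise TypeError("Expected a single sequence given as a string.")
    # Wrap the sequence in a list, then take the first (only) entry of the
    # result so the caller gets the encoding without the leading N axis.
    return encoder.encode([sequence], **encode_kwargs)[0]


# Usage with a toolkit encoder (requires resp_protein_toolkit to be installed):
#   encoder1 = OneHotProteinEncoder(alphabet='gapped')
#   encoded = encode_single(encoder1, my_sequence)
```

For a one-hot encoder with the default flatten_output_array=False, the result would then be the M x A slice for that sequence rather than a 1 x M x A array.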