Protein encoding tools in resp_protein_toolkit =============================================== To use them as input to an ML model, protein sequences have to be encoded (e.g. one-hot encoding, etc.) Writing Python code to do this is easy but frequently redundant. For convenience, this toolkit contains tools for encoding proteins using some very common schemes, using Python-wrapped C++ code to ensure speed. There aren't any embeddings supported yet since there are too many protein LLMs available for it to be practical to maintain a shared API in one package, but we may add this at some point in the future. Sequences are encoded as numpy arrays which are easily converted to Jax / PyTorch (e.g. in PyTorch, use `torch.from_numpy(my_array)`. Currently supported schemes include: - One-hot encoding with either a 2d or 3d array as output, using either the basic 20 amino acid alphabet, or the basic alphabet plus gaps, or an extended alphabet including unusual symbols (B, J, O, U, X, Z). - Integer encoding, using either the basic 20 amino acid alphabet, or the basic alphabet plus gaps, or an extended alphabet including unusual symbols (B, J, O, U, X, Z). Integer encoding is useful for LightGBM (gradient boosted trees) and some clustering schemes. - Substitution matrix encoding using a 21 letter alphabet (standard AAs plus gaps) with various percent homologies and two encoding schemes supported. For details on available options see below. .. autoclass:: resp_protein_toolkit.OneHotProteinEncoder :special-members: __init__ :members: encode .. autoclass:: resp_protein_toolkit.SubstitutionMatrixEncoder :special-members: __init__ :members: encode .. autoclass:: resp_protein_toolkit.IntegerProteinEncoder :special-members: __init__ :members: encode If you are encoding only a single sequence, make sure to pass it as a list, e.g.:: encoder1.encode([my_sequence])