Splice Site Classification

As an example of the PyTorch integration this example will use the UCI Splice-junction Gene Sequences dataset on UCI.

In short, splice sites are locations where the introns and exons connect on pre-mRNA. It's important to understand where these are because the process of splicing removes introns so the exons from the pre-mRNA form mRNA. And mRNA becomes proteins.

Put another way, understanding where the splice sites are helps determine which parts of DNA become proteins.

This is a great 1m37s video from the DNA Learning Center for more context.

From an ML perspective, this is a classification where the sequence is a feature and there is a label, one of:

Intron to exon (IE) -- Does this sequence contain an intron to exon junction.
Exon to intron (EI) -- Does this sequence contain an exon to intron junction.
Neither (N) -- Does this sequence contain neither.

Loading Data

To help load data GCGC contains a FileMetaDataField class that can help turn files into labels.

In this example, the easiest path is to use the SPLICE_DATA_PATH constant, which will point to the path containing the files EI.fasta, IE.fasta, and N.fasta. The labels for the dataset will become those files names due to the field.

from gcgc.fields.categorical_field import FileMetaDataField
from gcgc.data import SPLICE_DATA_PATH

files = list(SPLICE_DATA_PATH.glob("*.fasta"))
# [PosixPath('/home/tshauck/gcgc/gcgc/data/splice/EI.fasta'),
#  PosixPath('/home/tshauck/gcgc/gcgc/data/splice/IE.fasta'),
#  PosixPath('/home/tshauck/gcgc/gcgc/data/splice/N.fasta')]
# Each class is a file.

file_feature = FileMetaDataField.from_paths("splice_site", files)

file_feature is one possible features someone might want to model -- in this case the splice classification. A SequenceParser object can be holds possible features, and in this case the TorchSequenceParser will go one step further by converting the outputs of the SequenceParser into native torch objects. See the documentation on sequence parsers for more info.

from gcgc.ml.pytorch_utils.parser import TorchSequenceParser

parser = TorchSequenceParser(file_features=[file_feature])

The SequenceParser can also configure how GCGC will deal with sequences of difference length.

There is one more preparatory step needed before creating the PyTorch compatible Dataset, and that is to make a GCGC Alphabet. These Alphabet's are subclasses of BioPython Alphabet's class that support padding, tokenizing and other options relevant for ML applications.

In this case, IUPACAmbiguousDNAEncoding is relevant for DNA with the standard letters plus extra to deal with ambiguous bases. One example of an ambiguous base letter is that "R" is Purine, which can be either Adenine ("A") or Guanine ("G").

from gcgc.alphabet import IUPACAmbiguousDNAEncoding

alphabet = IUPACAmbiguousDNAEncoding()

Those ingredients set the stage to create the GenomicDataset object

from gcgc.ml.pytorch_utils.data import GenomicDataset

dataset = GenomicDataset.from_paths(files, parser, alphabet=alphabet)

Given a set of files, from_paths is the best option for generating a GenomicDataset. GCGC uses BioPython's indexing to work with large files typically found from sequencing samples, and from_paths will generate the indexes for you with the proper alphabet.

Training

Given the dataset, it's now possible to use PyTorch's built-in DataLoader object which helps with shuffling, batching, and other common data utilities.

from torch.utils.data import DataLoader

data_loader = DataLoader(dataset, batch_size=128, shuffle=True)

With the data_loader configured we'll create a simple PyTorch model that applies a single 1d convolution layer then a series of dense layers. The output corresponds to the three classes: EI, IE, N.

import torch.nn as nn
import torch.nn.functional as F


class SplicePrediction(nn.Module):
    def __init__(self, vocab_size, num_labels, embed_size=2, num_filters=2, kernel_size=1):
        super().__init__()

        self.embed = nn.Embedding(vocab_size, embed_size)
        self.conv = nn.Conv1d(embed_size, num_filters, kernel_size)

        self.fc1 = nn.Linear(124, 100)
        self.fc2 = nn.Linear(100, 50)
        self.fc3 = nn.Linear(50, num_labels)

    def forward(self, inputs):
        embed = self.embed(inputs).permute(0, 2, 1)

        x = self.conv(embed)
        x = x.view(x.size()[0], -1)
        x = self.fc1(F.relu(x))
        x = self.fc2(F.relu(x))
        x = self.fc3(x)

        return x

Now, it's a matter of training the model by passing it data from the loader. Importantly, the object iterated through via data_loader has access to a set of keys that are PyTorch tensor object.

seq_tensor is always present, and represents the input sequence represented as a PyTorch LongTensor. Also in the example below notice that the splice_site key is present which was the name of the FileMetaDataField used earlier and represents which class the sequences is a member of.

import torch

vocab_size = len(alphabet)
num_labels = len(files)
model = SplicePrediction(vocab_size, num_labels)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss(size_average=False)

model.train()

for i in range(10):  # 10 epochs
    for batch in data_loader:  # Created from the GenomicDataset
        sequences = batch["seq_tensor"].to("cpu")
        target = batch["splice_site"].to("cpu")

        optimizer.zero_grad()
        output = model(sequences)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

Because this isn't an example of PyTorch generally, this example doesn't include important considerations like cross-validation, metrics, etc, but hopefully it's clearer how to use GCGC in conjunction with PyTorch.