# PyTorch Integration
GCGC implements a `torch.utils.data.Dataset` that loads data from common bioinformatics file formats via BioPython. The easiest way to see how this works is to look at the Splice Site example.
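Under the hood, a PyTorch `Dataset` only needs to implement `__len__` and `__getitem__`. As a rough sketch of that interface (a toy stand-in, not GCGC's implementation), a sequence dataset might look like:

```python
import torch
from torch.utils.data import Dataset


class ToySequenceDataset(Dataset):
    """Toy Dataset returning fixed-length, integer-encoded sequences."""

    def __init__(self, sequences):
        self.sequences = sequences

    def __len__(self):
        # Number of sequences available.
        return len(self.sequences)

    def __getitem__(self, idx):
        # Return one sequence as a LongTensor of token ids.
        return torch.tensor(self.sequences[idx], dtype=torch.long)


ds = ToySequenceDataset([[0, 1, 2], [2, 1, 0]])
print(len(ds))      # 2
print(ds[0].shape)  # torch.Size([3])
```

`GenomicDataset` follows this same contract, which is what lets it plug into the rest of the PyTorch data pipeline.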
## Genomic Dataset
The `GenomicDataset` implements the interface for a PyTorch `Dataset`. Assuming you have FASTA files to read, create a dataset by first creating a `TorchSequenceParser` object and an alphabet, then the actual `Dataset`.
```python
from gcgc.alphabet import IUPACProteinEncoding
from gcgc.ml.pytorch_utils.data import GenomicDataset
from gcgc.ml.pytorch_utils.parser import TorchSequenceParser

# EncodedSeqLengthParser's module path may differ across gcgc versions.
from gcgc.parser import EncodedSeqLengthParser

# Make seq length uniform at 100.
length_parser = EncodedSeqLengthParser(conform_to=100)

parser = TorchSequenceParser(
    encapsulate=False,
    seq_length_parser=length_parser,
    sequence_offset=1,
)

# Use an Amino Acid vocab.
alphabet = IUPACProteinEncoding()

files = ["fasta1.fasta", "fasta2.fasta"]
dataset = GenomicDataset.from_paths(files, parser, alphabet=alphabet)
```
- `GenomicDataset`, again, implements the PyTorch `Dataset` interface.
- `TorchSequenceParser` is an object that helps parse the incoming sequences into PyTorch tensors.
- `IUPACProteinEncoding` is the protein encoding alphabet; think of it as the vocabulary from NLP.
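To build intuition for the alphabet-as-vocabulary idea, here is a minimal sketch of integer encoding (illustrative only; `token_to_id` and `encode` are hypothetical names, not the gcgc API):

```python
# The 20 standard amino acids, one token per residue.
IUPAC_PROTEIN = "ACDEFGHIKLMNPQRSTVWY"
token_to_id = {aa: i for i, aa in enumerate(IUPAC_PROTEIN)}


def encode(seq):
    """Map each residue to its integer id, like an NLP vocabulary lookup."""
    return [token_to_id[aa] for aa in seq]


print(encode("ACD"))  # [0, 1, 2]
```

The alphabet plays exactly this role: it fixes the mapping from residues to the integer ids that end up in the tensors below.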
Now that a `Dataset` exists, it can be passed to a `DataLoader`.
```python
from torch.utils.data import DataLoader

batch_size = 32
data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

for batch in data_loader:
    # seq_tensor is a (B, S) tensor where B is the batch size and S is the
    # sequence length, which in this case is 100.
    seq_tensor = batch["seq_tensor"]
```
## Use with PyTorch LSTM Models
One important gotcha when using PyTorch recurrent models is that they typically expect batches with the sequence length as the first dimension and the batch size as the second. Therefore, you may want to transpose the tensor.
```python
# Transpose from (B, S) to (S, B) so the sequence length comes first.
seq_tensor = seq_tensor.transpose(0, 1)
```
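For example, the transposed tensor can be consumed directly by an `nn.LSTM`, which defaults to `batch_first=False`. This sketch assumes an embedding layer in front of the LSTM and uses made-up sizes:

```python
import torch
from torch import nn

S, B, V, H = 100, 4, 26, 32  # seq length, batch size, vocab size, hidden size

embedding = nn.Embedding(V, 8)
lstm = nn.LSTM(input_size=8, hidden_size=H)  # batch_first=False by default

seq_tensor = torch.randint(0, V, (B, S))  # (B, S), as from the DataLoader
seq_tensor = seq_tensor.transpose(0, 1)   # (S, B), as the LSTM expects
output, (h_n, c_n) = lstm(embedding(seq_tensor))
print(output.shape)  # torch.Size([100, 4, 32])
```

Alternatively, you can construct the LSTM with `batch_first=True` and skip the transpose entirely.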