Third Party Libraries

GCGC doesn't integrate directly with any third-party library, but it tries to make it easy for developers to use their library of choice.

Using with torchtext

GCGC's tokenizer is compatible with torchtext's Field class: it can be passed as the Field's tokenize argument.

Suppose a TSV file holds the sequences of interest in a sequence column. In practice the file would usually carry other metadata as well, such as a label.

sequence
ATCG
ATTT
ACGG
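
To try the example locally, a file like this can be produced with Python's standard csv module. This is only a sketch: the data.tsv file name mirrors the path used in the loading code, and the helper name write_sequences_tsv is made up for illustration.

```python
import csv
from pathlib import Path


def write_sequences_tsv(path, sequences):
    """Write one sequence per row under a single 'sequence' header (hypothetical helper)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["sequence"])  # header row, skipped later via skip_header=True
        writer.writerows([seq] for seq in sequences)


# Recreate the three-row sample file shown above.
write_sequences_tsv("data.tsv", ["ATCG", "ATTT", "ACGG"])
lines = Path("data.tsv").read_text().splitlines()
```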

The file can be loaded with torchtext's TabularDataset. Rather than leaving Field's tokenize argument at its default, which splits on whitespace, we supply GCGC's tokenizer.

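import gcgc
# Field, TabularDataset and Iterator belong to torchtext's legacy data API
# (moved to torchtext.legacy in 0.9 and removed in 0.12).
from torchtext import data

# Tokenize over the ATCG alphabet into 2-mers, advancing 2 characters per step.
settings = gcgc.SequenceTokenizerSettings(alphabet="ATCG", kmer_size=2, kmer_step_size=2)
tokenizer = gcgc.SequenceTokenizer(settings=settings)

# Use GCGC's tokenizer for this field; every example is padded or truncated to 10 tokens.
sequence_field = data.Field(
    tokenize=tokenizer,
    batch_first=True,
    fix_length=10,
)

# Read the TSV, skipping the header row, and map its single column to the field.
train = data.TabularDataset(
    path="./data.tsv",
    format="tsv",
    skip_header=True,
    fields=[("sequence", sequence_field)],
)

# Build the token-to-id vocabulary from the data; min_freq=1 keeps every token seen.
sequence_field.build_vocab(train, min_freq=1)

train_iter = data.Iterator(
    train,
    sort_key=lambda x: len(x.sequence),
    batch_size=32,
    sort_within_batch=True,
    repeat=False,
)

# Each item yielded by the iterator is a batch; print the first one and stop.
for example in train_iter:
    print(example.sequence)
    break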

The sequence attribute of the batch is then a tensor of token ids, one row per sequence, padded to fix_length:

tensor([[2, 4, 1, 1, 1, 1, 1, 1, 1, 1],
        [3, 5, 1, 1, 1, 1, 1, 1, 1, 1],
        [2, 6, 1, 1, 1, 1, 1, 1, 1, 1]])
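
The ids in this tensor can be traced by hand. The sketch below is plain Python and calls neither GCGC nor torchtext. It assumes the tokenizer splits each sequence into non-overlapping 2-mers, and that the vocabulary reserves 0 for <unk> and 1 for <pad> and then orders tokens by descending frequency with ties broken alphabetically, which is how torchtext's legacy Vocab behaves. The helper names kmers, build_vocab and encode are hypothetical.

```python
from collections import Counter


def kmers(sequence, kmer_size=2, step=2):
    """Split a sequence into k-mers, advancing `step` characters each time."""
    return [sequence[i:i + kmer_size]
            for i in range(0, len(sequence) - kmer_size + 1, step)]


def build_vocab(tokenized, specials=("<unk>", "<pad>")):
    """Assign ids: specials first, then tokens by descending frequency (assumed ordering)."""
    counts = Counter(tok for toks in tokenized for tok in toks)
    ordered = sorted(counts.items(), key=lambda kv: kv[0])  # alphabetical
    ordered.sort(key=lambda kv: kv[1], reverse=True)        # stable sort by frequency
    itos = list(specials) + [tok for tok, _ in ordered]
    return {tok: idx for idx, tok in enumerate(itos)}


def encode(tokens, vocab, fix_length=10):
    """Map tokens to ids, then truncate or right-pad with <pad> to fix_length."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens][:fix_length]
    return ids + [vocab["<pad>"]] * (fix_length - len(ids))


sequences = ["ATCG", "ATTT", "ACGG"]
tokenized = [kmers(seq) for seq in sequences]   # [['AT', 'CG'], ['AT', 'TT'], ['AC', 'GG']]
vocab = build_vocab(tokenized)                  # AT=2, AC=3, CG=4, GG=5, TT=6
rows = [encode(toks, vocab) for toks in tokenized]
for row in rows:
    print(row)
```

Under these assumptions ATCG becomes [2, 4], ATTT becomes [2, 6] and ACGG becomes [3, 5], each followed by eight pad ids. The rows come out in file order here, whereas the iterator may reorder examples within a batch.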