GCGC Documentation

Third Party Libraries

GCGC doesn't directly integrate with any libraries, but it tries to make it easy for developers to use their library of choice.

Using with `torchtext`

The tokenizer implemented in GCGC is compatible with torchtext's `Field` class.
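In practice, "compatible" here means that `Field` only needs its `tokenize` argument to be a callable that takes a string and returns a list of tokens. The sketch below illustrates that contract with a hypothetical plain-Python function, `kmer_tokenize`. It is a stand-in for illustration, not GCGC's implementation, and it assumes that a trailing fragment shorter than the k-mer size is dropped.

```python
from typing import List


def kmer_tokenize(sequence: str, kmer_size: int = 2, kmer_step_size: int = 2) -> List[str]:
    """Split a sequence into k-mers (hypothetical stand-in, not GCGC's code)."""
    # Assumption: a trailing fragment shorter than kmer_size is dropped.
    last_start = len(sequence) - kmer_size
    return [
        sequence[start:start + kmer_size]
        for start in range(0, last_start + 1, kmer_step_size)
    ]


# Any callable with this shape (str -> list of tokens) can be passed as `tokenize`.
print(kmer_tokenize("ATCG"))
print(kmer_tokenize("ATCG", kmer_size=2, kmer_step_size=1))
```

With a step size equal to the k-mer size the k-mers do not overlap; a smaller step size produces overlapping k-mers.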

Suppose there's a TSV file with a column holding the sequences of interest. In practice there would usually be other columns as well, such as a label.

sequence
ATCG
ATTT
ACGG
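To follow along, a file with this content can be created with a few lines of standard-library Python. This is only a convenience sketch: the file name `data.tsv` is an assumption chosen for illustration, and the helper `write_sequences` is hypothetical.

```python
from pathlib import Path
from typing import Iterable


def write_sequences(path: Path, sequences: Iterable[str]) -> None:
    """Write a single-column TSV with a `sequence` header row."""
    lines = ["sequence", *sequences]
    path.write_text("\n".join(lines) + "\n")


# Assumed file name; adjust it to wherever the dataset should live.
tsv_path = Path("data.tsv")
write_sequences(tsv_path, ["ATCG", "ATTT", "ACGG"])
```

A file with more columns, such as a label, would put a tab between the values on each row.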

It can be loaded using torchtext's TabularDataset object, but instead of leaving Field's tokenize argument at its default, we pass GCGC's tokenizer.

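import gcgc
# Field, TabularDataset, and Iterator belong to torchtext's legacy data API
# (importable from torchtext.data in torchtext < 0.9).
from torchtext import data

# Split each sequence into non-overlapping 2-mers over the ATCG alphabet.
settings = gcgc.SequenceTokenizerSettings(alphabet="ATCG", kmer_size=2, kmer_step_size=2)
tokenizer = gcgc.SequenceTokenizer(settings=settings)

# Use the GCGC tokenizer in place of Field's default tokenizer.
sequence_field = data.Field(
    tokenize=tokenizer,
    batch_first=True,
    fix_length=10,  # pad or truncate every example to 10 tokens
)

# Load the TSV, skipping the header row; the single column maps to `sequence`.
train = data.TabularDataset(
    path="./data.tsv",
    format="tsv",
    skip_header=True,
    fields=[("sequence", sequence_field)],
)

# Build the token-to-index vocabulary from the training data.
sequence_field.build_vocab(train, min_freq=1)

train_iter = data.Iterator(
    train,
    sort_key=lambda x: len(x.sequence),
    batch_size=32,
    sort_within_batch=True,
    repeat=False,
)

# Each item yielded by the iterator is a batch; print the first one and stop.
for example in train_iter:
    print(example.sequence)
    break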

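Each item the iterator yields (named `example` above) is a batch, and its sequence attribute holds a tensor of token indices, one row per sequence, padded to fix_length: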

tensor([[2, 4, 1, 1, 1, 1, 1, 1, 1, 1],
        [3, 5, 1, 1, 1, 1, 1, 1, 1, 1],
        [2, 6, 1, 1, 1, 1, 1, 1, 1, 1]])
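The indices in this tensor can be reproduced by hand. The sketch below is a plain-Python approximation, not torchtext's or GCGC's code. It assumes that the vocabulary reserves index 0 for `<unk>` and index 1 for `<pad>`, and that the remaining tokens are ordered by descending frequency with ties broken alphabetically. The k-mer splitting is likewise a simplified stand-in for the GCGC tokenizer.

```python
from collections import Counter

sequences = ["ATCG", "ATTT", "ACGG"]
KMER_SIZE = 2
KMER_STEP = 2
FIX_LENGTH = 10


def split_kmers(sequence):
    """Simplified stand-in for the tokenizer: non-overlapping 2-mers."""
    return [
        sequence[i:i + KMER_SIZE]
        for i in range(0, len(sequence) - KMER_SIZE + 1, KMER_STEP)
    ]


tokenized = [split_kmers(s) for s in sequences]
counts = Counter(token for tokens in tokenized for token in tokens)

# Assumed ordering: special tokens first, then most frequent tokens,
# with alphabetical order breaking ties.
specials = ["<unk>", "<pad>"]
ordered = sorted(counts, key=lambda token: (-counts[token], token))
stoi = {token: index for index, token in enumerate(specials + ordered)}


def encode(tokens):
    """Map tokens to indices, then truncate or pad to FIX_LENGTH."""
    ids = [stoi.get(token, stoi["<unk>"]) for token in tokens][:FIX_LENGTH]
    return ids + [stoi["<pad>"]] * (FIX_LENGTH - len(ids))


rows = {s: encode(tokens) for s, tokens in zip(sequences, tokenized)}
for sequence, ids in rows.items():
    print(sequence, ids)
```

Under these assumptions `AT` occurs twice and receives index 2, followed by `AC`, `CG`, `GG`, and `TT` at indices 3 to 6. `ATCG` therefore encodes to `[2, 4, ...]`, `ACGG` to `[3, 5, ...]`, and `ATTT` to `[2, 6, ...]`, with the remaining positions filled by the padding index 1. The order of rows within a batch depends on the iterator, so it need not match the order of the file.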