ThirdParty

Hugging Face Datasets

Sequence dataset implementations for the Hugging Face datasets library.

The Hugging Face datasets library provides objects for building datasets and working with them. This code focuses on building datasets from bioinformatics formats so that the rest of the built-in datasets library can be used seamlessly on biological data.

At present this module provides UniProt-related datasets.

>>> from gcgc.third_party import hf_datasets
>>> ref = hf_datasets.UniprotDataset(name="sprot")
>>> # ref.download_and_prepare()
>>> # ds = ref.as_dataset()
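
The construction above follows the standard datasets builder flow. A fuller sketch of preparing and loading the data (the resulting splits and features depend on which dataset name is chosen) might look like this:

from gcgc.third_party import hf_datasets

# Build, download, and load the Swiss-Prot ("sprot") dataset.
builder = hf_datasets.UniprotDataset(name="sprot")
builder.download_and_prepare()  # downloads and caches the underlying FASTA data
ds = builder.as_dataset()       # returns the prepared dataset splits
print(ds)                       # inspect the available splits and features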

See the UniprotDatasetNames enum for the available names.

>>> from gcgc.third_party import hf_datasets
>>> hf_datasets.UniprotDatasetNames.uniref100
<UniprotDatasetNames.uniref100: 'uniref100'>

FastaBasedBuilder

Builds a dataset backed by FASTA files.

This builder implements the ability to iterate through a split, but it is incumbent on the subclass to implement the _split_generators method.

When subclassing this, use self.features in the dataset info.

features: Features property readonly

Return the set of FASTA-related features.
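
Subclassing follows the guidance above: implement _split_generators and use self.features when building the dataset info. A minimal sketch, assuming the standard datasets builder API (datasets.DatasetInfo, datasets.SplitGenerator); the class name, split layout, and the fasta_path keyword are illustrative assumptions:

import datasets

from gcgc.third_party.hf_datasets import FastaBasedBuilder


class LocalFastaDataset(FastaBasedBuilder):
    """Hypothetical builder backed by a single local FASTA file."""

    def _info(self) -> datasets.DatasetInfo:
        # Use the builder's FASTA features when describing the dataset.
        return datasets.DatasetInfo(features=self.features)

    def _split_generators(self, dl_manager):
        # The subclass declares the splits; here a single TRAIN split points
        # at a local file (the gen_kwargs key name is an assumption).
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"fasta_path": "./sequences.fasta"},
            )
        ]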

UniprotDataset

Represents a UniProt dataset using the underlying FastaBasedBuilder.

BUILDER_CONFIG_CLASS

Builder config for UniRef datasets.

url: str property readonly

Return the URL for the dataset.

__init__(self, name) special

Init the UniProt dataset with the name of the UniProt dataset.

Source code in gcgc/third_party/hf_datasets.py
def __init__(self, name: UniprotDatasetNames):
    """Init the uniprot dataset with the name of the uniprot dataset."""
    self.name: UniprotDatasetNames = name

UniprotDatasetConfig

Builder config for UniRef datasets.

url: str property readonly

Return the URL for the dataset.

__init__(self, name) special

Init the UniProt dataset with the name of the UniProt dataset.

Source code in gcgc/third_party/hf_datasets.py
def __init__(self, name: UniprotDatasetNames):
    """Init the uniprot dataset with the name of the uniprot dataset."""
    self.name: UniprotDatasetNames = name

UniprotDatasetNames

Enum for the UniProt datasets.
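
To see which names are available at runtime (the exact members depend on the installed gcgc version; "sprot" and "uniref100" appear in the examples above), iterate the enum:

from gcgc.third_party import hf_datasets

# Print every member of the enum along with its value.
for member in hf_datasets.UniprotDatasetNames:
    print(member.name, member.value)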

Hugging Face Tokenizers

Modules for Hugging Face's transformers.

KmerPreTokenizer

Pretokenizes sequences based on kmers.

__init__(self, kmer_length, kmer_stride, alphabet, unk_token='?') special

Inits the KmerPreTokenizer.

Parameters:

kmer_length (int): How long of kmers to create. Required.
kmer_stride (int): The stride between two kmers. Should be equal to kmer_length to generate non-overlapping kmers. Required.
alphabet (str): The particular alphabet to use. Required.
unk_token (str): The unknown token to use for the pre-tokenization. Default: '?'.
Source code in gcgc/third_party/hf_tokenizer.py
def __init__(self, kmer_length: int, kmer_stride: int, alphabet: str, unk_token: str = "?"):
    """Inits the KmerTokenizer.

    Args:
        kmer_length: How long of kmers to create.
        kmer_stride: The stride between two kmers. Should be equal to kmer_length to generate
            non-overlapping kmers.
        alphabet: The particular alphabet to use.
        unk_token: The unknown token to use for the pre-tokenization.

    """
    self.kmer_tokenzier = KmerTokenizer(
        kmer_length=kmer_length,
        kmer_stride=kmer_stride,
        alphabet=alphabet,
        pad_token=None,
        bos_token=None,
        eos_token=None,
        mask_token=None,
        unk_token=unk_token,
    )

pre_tokenize(self, pre_tok)

Pretokenize the input string.

Source code in gcgc/third_party/hf_tokenizer.py
def pre_tokenize(self, pre_tok: PreTokenizedString):
    """Pretokenize the input string."""
    pre_tok.split(self._split)
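
On its own, the pre-tokenizer can be exercised by wrapping it with the tokenizers library's PreTokenizer.custom. A small sketch, assuming the alphabet is passed as a plain string of symbols such as "ATCG" (how gcgc expects alphabets to be specified is an assumption here):

from tokenizers.pre_tokenizers import PreTokenizer

from gcgc.third_party.hf_tokenizer import KmerPreTokenizer

# Wrap the custom pre-tokenizer so it behaves like a built-in one.
pre_tokenizer = PreTokenizer.custom(
    KmerPreTokenizer(kmer_length=3, kmer_stride=3, alphabet="ATCG")
)

# Split a sequence into non-overlapping 3-mers with their offsets,
# e.g. [('ATC', (0, 3)), ('GAT', (3, 6)), ...] depending on the settings.
print(pre_tokenizer.pre_tokenize_str("ATCGATCG"))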

build_hf_tokenizer(kmer_length, kmer_stride, alphabet, unk_token='?')

Build a full Hugging Face tokenizer from the inputs.

Note

Takes the same arguments as KmerPreTokenizer.

Source code in gcgc/third_party/hf_tokenizer.py
def build_hf_tokenizer(
    kmer_length: int, kmer_stride: int, alphabet: str, unk_token: str = "?"
) -> Tokenizer:
    """Build a full huggingface tokenizer from the inputs.

    Note:
        Same arguments taken as KmerPreTokenizer.

    """
    kmer_pre = KmerPreTokenizer(
        kmer_length=kmer_length, kmer_stride=kmer_stride, alphabet=alphabet, unk_token=unk_token
    )
    tokenizer = Tokenizer(
        models.WordLevel(vocab=kmer_pre.kmer_tokenzier.vocab.stoi, unk_token=unk_token)
    )
    tokenizer.pre_tokenizer = PreTokenizer.custom(kmer_pre)
    tokenizer.decoder = ByteLevel()

    return tokenizer
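
End-to-end usage might look like the following sketch; the alphabet string and example sequence are assumptions, and the exact tokens and ids depend on the vocabulary gcgc builds:

from gcgc.third_party.hf_tokenizer import build_hf_tokenizer

# Build a word-level tokenizer over non-overlapping 3-mers.
tokenizer = build_hf_tokenizer(kmer_length=3, kmer_stride=3, alphabet="ATCG")

# The custom pre-tokenizer splits the sequence into kmers, and the WordLevel
# model maps each kmer to an id from the kmer vocabulary.
encoding = tokenizer.encode("ATCGATCGATCG")
print(encoding.tokens)
print(encoding.ids)

Note that tokenizers built around a custom Python pre-tokenizer generally cannot be serialized with tokenizer.save, so a tokenizer like this is usually rebuilt in-process rather than saved and reloaded.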