ThirdParty
Hugging Face Datasets
Sequence dataset implementations for the Hugging Face datasets library.
The Hugging Face datasets library provides objects for building datasets and working with them. This code focuses on building datasets from bioinformatics formats so that the rest of the built-in dataset tooling can be used seamlessly on biological data.
At present this module provides UniProt-related datasets.
>>> from gcgc.third_party import hf_datasets
>>> ref = hf_datasets.UniprotDataset(name="sprot")
>>> # ref.download_and_prepare()
>>> # ds = ref.as_dataset()
See the UniprotDatasetNames enum for the available names.
>>> from gcgc.third_party import hf_datasets
>>> hf_datasets.UniprotDatasetNames.uniref100
<UniprotDatasetNames.uniref100: 'uniref100'>
FastaBasedBuilder
Builds a dataset backed by FASTA files.
This builder implements the ability to iterate through a split, but it is incumbent on the subclass to implement the _split_generators method. When subclassing, use self.features in the dataset info; a minimal subclass sketch follows the features property below.
features: Features
property
readonly
Return the set of FASTA-related features.
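As an illustration of this contract, here is a minimal subclass sketch. It assumes FastaBasedBuilder follows the standard datasets.GeneratorBasedBuilder pattern; the class name, FASTA path, and the gen_kwargs key are hypothetical, and the keyword actually expected by the base class's example generation may differ.

import datasets

from gcgc.third_party.hf_datasets import FastaBasedBuilder


class LocalFastaDataset(FastaBasedBuilder):
    """Hypothetical dataset backed by a single local FASTA file."""

    def _info(self) -> datasets.DatasetInfo:
        # Use the FASTA features exposed by the base class, as advised above.
        return datasets.DatasetInfo(features=self.features)

    def _split_generators(self, dl_manager):
        # A single train split; the gen_kwargs key is a guess for illustration.
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"fasta_path": "./sequences.fasta"},
            )
        ]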
UniprotDataset
Represents a UniProt dataset using the underlying FastaBasedBuilder.
BUILDER_CONFIG_CLASS
Builder config for UniRef datasets.
url: str
property
readonly
Return the URL for the dataset.
__init__(self, name)
special
Init the UniProt dataset with the name of the UniProt dataset.
Source code in gcgc/third_party/hf_datasets.py
def __init__(self, name: UniprotDatasetNames):
    """Init the uniprot dataset with the name of the uniprot dataset."""
    self.name: UniprotDatasetNames = name
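Expanding the commented calls from the module-level example above, a typical end-to-end use looks like this (download_and_prepare fetches and processes the underlying release, so it can take a while):

from gcgc.third_party import hf_datasets

builder = hf_datasets.UniprotDataset(name="sprot")
builder.download_and_prepare()  # download and process the UniProt release
ds = builder.as_dataset()       # a standard Hugging Face datasets object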
UniprotDatasetConfig
Builder config for UniRef datasets.
url: str
property
readonly
Return the URL for the dataset.
__init__(self, name)
special
Init the UniProt dataset with the name of the UniProt dataset.
Source code in gcgc/third_party/hf_datasets.py
def __init__(self, name: UniprotDatasetNames):
    """Init the uniprot dataset with the name of the uniprot dataset."""
    self.name: UniprotDatasetNames = name
UniprotDatasetNames
Enum for the UniProt datasets.
Hugging Face Tokenizers
Modules for Hugging Face's tokenizers library.
KmerPreTokenizer
Pretokenizes sequences based on kmers.
__init__(self, kmer_length, kmer_stride, alphabet, unk_token='?')
special
Inits the KmerTokenizer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
kmer_length | int | The length of the kmers to create. | required |
kmer_stride | int | The stride between two kmers. Should be equal to kmer_length to generate non-overlapping kmers. | required |
alphabet | str | The alphabet to use. | required |
unk_token | str | The unknown token to use for the pre-tokenization. | '?' |
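To make the interaction between kmer_length and kmer_stride concrete, here is a small standalone sketch of the windowing behaviour; it is an illustration, not the library's implementation:

def kmer_windows(sequence: str, kmer_length: int, kmer_stride: int):
    """Slide a window of kmer_length over sequence, advancing kmer_stride each step."""
    return [
        sequence[i : i + kmer_length]
        for i in range(0, len(sequence) - kmer_length + 1, kmer_stride)
    ]

kmer_windows("ATCGGC", kmer_length=3, kmer_stride=3)  # ['ATC', 'GGC'] (non-overlapping)
kmer_windows("ATCGGC", kmer_length=3, kmer_stride=1)  # ['ATC', 'TCG', 'CGG', 'GGC'] (overlapping)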
Source code in gcgc/third_party/hf_tokenizer.py
def __init__(self, kmer_length: int, kmer_stride: int, alphabet: str, unk_token: str = "?"):
    """Inits the KmerTokenizer.

    Args:
        kmer_length: How long of kmers to create.
        kmer_stride: The stride between two kmers. Should be equal to kmer_stride to generate
            non-overlapping kmers.
        alphabet: The particular alphabet
        unk_token: The unknown token to use for the pre-tokenization.
    """
    self.kmer_tokenzier = KmerTokenizer(
        kmer_length=kmer_length,
        kmer_stride=kmer_stride,
        alphabet=alphabet,
        pad_token=None,
        bos_token=None,
        eos_token=None,
        mask_token=None,
        unk_token=unk_token,
    )
pre_tokenize(self, pre_tok)
Pretokenize the input string.
Source code in gcgc/third_party/hf_tokenizer.py
def pre_tokenize(self, pre_tok: PreTokenizedString):
    """Pretokenize the input string."""
    pre_tok.split(self._split)
build_hf_tokenizer(kmer_length, kmer_stride, alphabet, unk_token='?')
Build a full huggingface tokenizer from the inputs.
Note
Takes the same arguments as KmerPreTokenizer.
Source code in gcgc/third_party/hf_tokenizer.py
def build_hf_tokenizer(
    kmer_length: int, kmer_stride: int, alphabet: str, unk_token: str = "?"
) -> Tokenizer:
    """Build a full huggingface tokenizer from the inputs.

    Note:
        Same arguments taken as KmerPreTokenizer.
    """
    kmer_pre = KmerPreTokenizer(
        kmer_length=kmer_length, kmer_stride=kmer_stride, alphabet=alphabet, unk_token=unk_token
    )
    tokenizer = Tokenizer(
        models.WordLevel(vocab=kmer_pre.kmer_tokenzier.vocab.stoi, unk_token=unk_token)
    )
    tokenizer.pre_tokenizer = PreTokenizer.custom(kmer_pre)
    tokenizer.decoder = ByteLevel()
    return tokenizer
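A minimal usage sketch. The alphabet argument below is passed as a literal character set, which is an assumption; it may instead need to be one of gcgc's named alphabets.

from gcgc.third_party.hf_tokenizer import build_hf_tokenizer

# Non-overlapping 3-mers; the alphabet value here is a hypothetical literal character set.
tokenizer = build_hf_tokenizer(kmer_length=3, kmer_stride=3, alphabet="ATCG", unk_token="?")

encoding = tokenizer.encode("ATCGATCG")
print(encoding.tokens)  # kmer tokens produced by the custom pre-tokenizer
print(encoding.ids)     # integer ids from the WordLevel vocabulary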