# KmerTokenizer

This module holds the kmer tokenizer and its settings.

The KmerTokenizer tokenizes incoming sequences into kmers of configurable size over a configurable alphabet.

For example, given incoming nucleotide sequences and settings with a kmer length of 3 and a stride of 3, the encoded sequences will be codons (in the loose sense), with a vocabulary of size 64.

```python
>>> from gcgc.tokenizer import KmerTokenizer, KmerTokenizerSettings

>>> settings = KmerTokenizerSettings(alphabet="ATCG", kmer_length=3, kmer_stride=3)
>>> tokenizer = KmerTokenizer(settings=settings)

>>> tokenizer.encode('AAATTTCCC')
[0, 42, 63]

>>> len(tokenizer.vocab)
64
```
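The intermediate kmer strings can be inspected with `encode_as_tokens` (documented below). With the settings above and no begin/end-of-sequence tokens configured, the nine-base sequence is expected to split into three codon-like kmers; a minimal sketch:

```python
>>> tokenizer.encode_as_tokens('AAATTTCCC')
['AAA', 'TTT', 'CCC']
```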

## KmerTokenizer

The Kmer Tokenizer that encodes sequences into chunked kmers.

### `__init__(self, settings=None)`

Source code in `gcgc/tokenizer/kmer_tokenzier.py`:
```python
    def __init__(self, settings: Optional[KmerTokenizerSettings] = None):
        """Init the KmerTokenizer class.

        Args:
            settings: The settings for the tokenizer.

        """
        self.settings = settings or KmerTokenizerSettings()
        super().__init__(settings)

        self.vocab = _create_kmer_vocab_from_token(self.settings)
```

Init the KmerTokenizer class.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `settings` | `Optional[gcgc.tokenizer.kmer_tokenzier.KmerTokenizerSettings]` | The settings for the tokenizer. | `None` |
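Because `settings` is optional, the tokenizer can also be constructed with the library defaults; a minimal sketch (the default field values are whatever `KmerTokenizerSettings()` provides and are not shown here):

```python
# Passing no settings falls back to a default KmerTokenizerSettings() instance.
tokenizer = KmerTokenizer()

# Or pass an explicitly constructed settings object.
tokenizer = KmerTokenizer(settings=KmerTokenizerSettings(alphabet="ATCG"))
```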

### `encode(self, seq, add_unknown=False)`

Source code in `gcgc/tokenizer/kmer_tokenzier.py`:
```python
    def encode(self, seq: str, add_unknown: bool = False) -> List[int]:
        """Encode the underlying sequence into a list of token ids.

        Args:
            seq: The incoming sequence to encode.
            add_unknown: If True, add the unknown token rather than throwing an out of vocabulary
                error.

        Returns:
            A list of the encoded tokens.

        """
        encoded = []

        for letter in self.encode_as_tokens(seq):
            try:
                encoded.append(self.vocab[letter])
            except KeyError:
                if add_unknown:
                    encoded.append(self.settings.unk_token_id)
                else:
                    raise

        return encoded
```

Encode the underlying sequence into a list of token ids.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `seq` | `str` | The incoming sequence to encode. | required |
| `add_unknown` | `bool` | If True, add the unknown token rather than throwing an out of vocabulary error. | `False` |

Returns

| Type | Description |
| --- | --- |
| `List[int]` | A list of the encoded tokens. |
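Whether an out-of-vocabulary kmer raises or is mapped to the unknown token depends on `add_unknown`; a sketch using the tokenizer from the module example (the exact unknown id depends on the settings, so it is shown symbolically):

```python
# 'NNN' is not in the ATCG-based kmer vocabulary.
tokenizer.encode("AAANNN")                    # re-raises the KeyError
tokenizer.encode("AAANNN", add_unknown=True)  # [0, tokenizer.settings.unk_token_id]
```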

### `encode_as_tokens(self, seq)`

Source code in `gcgc/tokenizer/kmer_tokenzier.py`:
```python
    def encode_as_tokens(self, seq: str) -> List[str]:
        """Tokenize the sequence into a list of tokens.

        Args:
            seq: The sequence to encode.

        Returns:
            The list of strs that are the tokens.

        """
        seq_len = len(seq)

        if seq_len < self.settings.kmer_length:
            raise ValueError(
                f"seq length {seq_len} cannot be less than the kmer "
                f"size {self.settings.kmer_length}"
            )

        if self.settings.kmer_length == 1:
            kmer_list = list(seq)
        else:
            kmer_list = self._kmer_n(seq)

        if self.settings.bos_token:
            kmer_list = [self.settings.bos_token] + kmer_list

        if self.settings.eos_token:
            kmer_list = kmer_list + [self.settings.eos_token]

        return super().apply_length_constraints(kmer_list)
```

Tokenize the sequence into a list of tokens.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `seq` | `str` | The sequence to encode. | required |

Returns

| Type | Description |
| --- | --- |
| `List[str]` | The list of strs that are the tokens. |
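Note that sequences shorter than the configured `kmer_length` are rejected, and that `bos_token`/`eos_token`, when set, are added around the kmer list before the length constraints are applied. A sketch with the 3-mer tokenizer from the module example:

```python
# Sequences shorter than kmer_length raise a ValueError.
tokenizer.encode_as_tokens("AT")  # ValueError: seq length 2 cannot be less than ...
```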

## KmerTokenizerSettings

The specification for the tokenizer.

Like its base class, `SequenceTokenizerSettings`, the schema (and thus the available fields) can
be inspected with the `schema_json` classmethod:

```python
>>> print(KmerTokenizerSettings.schema_json(indent=2))
{
  "title": "SequenceTokenizerSettings"
  ...
}
```
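The `bos_token` and `eos_token` fields used by `encode_as_tokens` above are part of these settings. Assuming they are set like any other field, a sketch (the marker strings below are arbitrary placeholders, not library defaults):

```python
settings = KmerTokenizerSettings(
    alphabet="ATCG",
    kmer_length=3,
    kmer_stride=3,
    bos_token=">",  # hypothetical begin-of-sequence marker
    eos_token="<",  # hypothetical end-of-sequence marker
)
tokenizer = KmerTokenizer(settings=settings)
# encode_as_tokens('AAATTTCCC') would now start with '>' and end with '<'.
```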

### `resolve_alphabet(alphabet)` (classmethod)

Source code in `gcgc/tokenizer/kmer_tokenzier.py`:
```python
    @validator("alphabet")
    def resolve_alphabet(cls, alphabet: str) -> str:
        """Resolve the alphabet if it's a named alphabet.

        Args:
            alphabet: The raw alphabet, either the sequence literal or a name of a preset alphabet.

        Returns:
            The new alphabet.

        """
        return alphabets.resolve_alphabet(alphabet)
```

Resolve the alphabet if it's a named alphabet.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `alphabet` | `str` | The raw alphabet, either the sequence literal or a name of a preset alphabet. | required |

Returns

| Type | Description |
| --- | --- |
| `str` | The new alphabet. |
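For a literal alphabet such as the `"ATCG"` string used throughout this page, the value is expected to pass through unchanged; a preset name is instead looked up in `gcgc.alphabets` (the available preset names are not listed here). A minimal sketch of the literal case:

```python
settings = KmerTokenizerSettings(alphabet="ATCG", kmer_length=3, kmer_stride=3)
settings.alphabet  # expected to remain the literal "ATCG"
```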