BioSequencePiece
Module for SentencePiece tokenization.
This tokenizer uses bindings to a fast native implementation of the algorithm; for now this is Google's SentencePiece library: https://github.com/google/sentencepiece.
This tokenizer must be trained to learn its tokenization logic. See the `fit_on_*` methods for the different ways to fit the model prior to use.
BioSequencePiece
¶
A SentencePiece model for biological sequences.
sp_processor: SentencePieceProcessor
(property, readonly)¶
Return the SentencePieceProcessor object.
__init__(self, settings=None)
¶
Source code in gcgc/tokenizer/sentence_piece_tokenizer.py
Init the BioSequencePiece class.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| `settings` | `Optional[gcgc.tokenizer.sentence_piece_tokenizer.BioSequencePieceSettings]` | The settings for the tokenizer. | `None` |
encode(self, seq)
¶
Encode the underlying sequence into a list of tokens.
encode_as_tokens(self, seq)
¶
Tokenize the sequence into a list of string tokens.
Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| `seq` | `str` | The sequence to encode. | required |

Returns

| Type | Description |
|---|---|
| `List[str]` | The list of strings that make up the tokens. |
fit_on_fasta(self, fasta_file)
¶
Run the SentencePiece algorithm on the fasta_file.
fit_on_list(self, sequence_list)
¶
Fit the SentencePiece algorithm on a list of sequences.
fit_on_text(self, text_file)
¶
Run the SentencePiece algorithm on the text_file.
load_vocab(self)
¶
Load the vocabulary from the file.
BioSequencePieceSettings
¶
The settings for the sentence piece model.
As with the base class, `SequenceTokenizerSettings`, the schema (and thus the available fields) can be inspected using the `schema_json` classmethod.
```python
>>> print(BioSequencePieceSettings.schema_json(indent=2))
{
  "title": "SequenceTokenizerSettings"
  ...
}
```
model_path: Path
(property, readonly)¶
Return the model path based on the prefix.
model_vocab: Path
(property, readonly)¶
Return the model vocab based on the prefix.