- Fix bug when a supplied token id overrides the default of an inferred token.
- Add a `pad_at_end` boolean setting that, when True, pads at the end of the sequence, and when False, pads at the beginning.
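The `pad_at_end` semantics can be sketched with a hypothetical `pad` helper (illustrative only, not the gcgc implementation):

```python
from typing import List


def pad(ids: List[int], length: int, pad_id: int = 0, pad_at_end: bool = True) -> List[int]:
    """Pad ids out to `length` with pad_id, at the end or the beginning."""
    padding = [pad_id] * max(0, length - len(ids))
    return ids + padding if pad_at_end else padding + ids


print(pad([5, 6, 7], 5))                    # [5, 6, 7, 0, 0]
print(pad([5, 6, 7], 5, pad_at_end=False))  # [0, 0, 5, 6, 7]
```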
- Add dedicated `Vocab` object which replaces the dictionary of string to integer.
- Update tokenizer integration to support overriding the default token ids.
- Fix bug when trying to save the huggingface tokenizer.
- Make the third-party dependencies "extras" during Python packaging.
- Add better testing and batch encoding operations.
- Add `fit_on_list` to `BioSequencePiece`, which can be trained on a list of strings.
- Accept paths as `Union[str, Path]`, and rework the internal implementation to be consistent.
- Add token ids for all the special tokens (e.g. `bos_token` now has an accompanying `bos_token_id`).
- `special_token_ids` returns the set of integers that are special tokens.
- Add a `get_special_tokens_mask` method on the tokenizer that returns a [0, 1] mask of the underlying tokens.
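A minimal sketch of what such a [0, 1] mask looks like, assuming the special token ids are available as a set of integers (function and argument names here are illustrative, not gcgc's exact API):

```python
from typing import List, Set


def get_special_tokens_mask(token_ids: List[int], special_token_ids: Set[int]) -> List[int]:
    """Return 1 at positions holding a special token, else 0."""
    return [1 if token in special_token_ids else 0 for token in token_ids]


# e.g. with bos id 2 and eos id 3 wrapping two regular tokens:
print(get_special_tokens_mask([2, 10, 11, 3], {2, 3}))  # [1, 0, 0, 1]
```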
- Add autodoc of API for main tokenizers.
- Install the python sentencepiece package by default. Add utilities for package management.
- Improved the docs to reflect the `SequenceTokenizerSpec` that was added in 0.11.0.
- Made max length optional for the tokenizer.
- Added a CLI that parses using the SentencePiece library.
- Began versioning the Docker build, and made pushing easier during the build process.
- Have the tokenizer resolve the named alphabets.
- Use Poetry, along with general updates to the build pipeline.
- Added the `SequenceTokenizerSpec` object for specifying the tokenizer.
- Added the `Vocab` object for storing the int-to-token and token-to-int encodings.
- Added an example of using TensorFlow/Keras together with gcgc.
gcgc has been revamped quite a bit to better support existing NLP processing
pipelines without trying to do too much. See the docs for more information
about how this works.
- Parser now outputs the length of the tensor not including padding. This is useful for packing and length-based iteration.
- Generating masked output from the `parse_record` method is now available.
- Alphabet can include an optional mask token.
- Can now specify the kmer step size to use when supplying a kmer value.
- Renames `EncodedSeq.integer_encoded` to `EncodedSeq.get_integer_encoding`, which takes a `kmer_step_size` to specify how large a step to take when encoding.
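The kmer/step-size behavior can be sketched with a small generator (an illustrative helper under assumed semantics, not gcgc's code):

```python
from typing import Iterator


def rollout_kmers(seq: str, kmer: int, step: int = 1) -> Iterator[str]:
    """Yield kmers of length `kmer`, advancing `step` characters each time."""
    for start in range(0, len(seq) - kmer + 1, step):
        yield seq[start:start + kmer]


print(list(rollout_kmers("ATCGA", kmer=2, step=1)))  # ['AT', 'TC', 'CG', 'GA']
print(list(rollout_kmers("ATCGA", kmer=2, step=2)))  # ['AT', 'CG']
```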
- Add `parsed_seq_len` to the `SequenceParser` object to control how much padding to apply to the end of the integer-encoded sequence. This is useful since a batch of tensors is expected to have the same size.
- Broken test due to platform differences.
- User can optionally specify start or end tokens.
- Removed `one_hot_encoding`. The user can do that pretty easily if needed.
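For instance, a user could recover one-hot encoding from the integer encoding in a few lines of plain Python (a sketch, not the removed gcgc implementation):

```python
from typing import List


def one_hot(integer_encoded: List[int], alphabet_size: int) -> List[List[int]]:
    """One-hot encode a sequence of integer token ids."""
    return [
        [1 if position == token_id else 0 for position in range(alphabet_size)]
        for token_id in integer_encoded
    ]


print(one_hot([0, 2, 1], 3))  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```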
- Properties to access the integer encodings of special tokens. (35cae2a)
- Remove uniprot dataset creation. (e233162)
- Simplify index handling for GenomicDataset. (3213a9e)
- Updated package management so gcgc is easier to use with other versions of torch.
- Ability for kmer size to be passed to an alphabet.
- Add Dockerfile and docker-compose.yml for development.
- Add `EncodedSeq.shift`, which will shift the sequence by an integer offset.
- `EncodedSeq.from_integer_encoded_seq` will take a list of integers and an alphabet, and return an `EncodedSeq` object.
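A sketch of the two operations above as plain functions, under assumptions: ids map to letters by alphabet index, and `shift` is read here as a rotation by the offset (illustrative helpers, not gcgc's actual classes):

```python
from typing import List


def from_integer_encoded_seq(ids: List[int], alphabet: str) -> str:
    """Map integer ids back to letters via their position in the alphabet."""
    return "".join(alphabet[i] for i in ids)


def shift(seq: str, offset: int) -> str:
    """Rotate the sequence left by `offset` positions (one interpretation)."""
    offset %= len(seq)
    return seq[offset:] + seq[:offset]


print(from_integer_encoded_seq([0, 3, 1], "ATCG"))  # 'AGT'
print(shift("ATCG", 1))  # 'TCGA'
```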
- Add the ability to apply a function to the `rollout_kmers` yielded values.
- Alphabet special characters are now located at the start, rather than the end, of the letters and token sequence.
- Add extra CSS to underline links in articles.
- Exit if the download directory doesn't exist in the call to download organism.
- Wording improvements in docs.
- `seq_tensor_one_hot` in the PyTorch Parser.
- Added a `gcgc.random` module to start holding sequence data.
- Added a `gcgc.rollout` module to handle working through chunks of sequences.
- `rollout_kmers` will roll out kmers.
- `rollout_seq_features` will roll out the sequence features.
- `EncodingAlphabet` now can optionally take a `gap_characters` set of characters to add to the alphabet letters. It also takes `add_lower_case_for_inserts`, which will duplicate the alphabet, but convert the letters to lowercase.
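How the resulting letter set might be assembled can be sketched as follows (a hypothetical helper; the name and argument shapes are assumptions, not gcgc's constructor):

```python
from typing import Iterable


def build_letters(
    letters: str,
    gap_characters: Iterable[str] = (),
    add_lower_case_for_inserts: bool = False,
) -> str:
    """Extend base letters with gap characters and optional lowercase copies."""
    out = letters + "".join(gap_characters)
    if add_lower_case_for_inserts:
        out += letters.lower()
    return out


print(build_letters("ATCG", gap_characters=("-",), add_lower_case_for_inserts=True))
# ATCG-atcg
```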
- Fixed bug in `GenomicDataset.from_path` where it still referred to an old name.
- `EncodedSeq` now supports iterating through kmers.
- GCGC is citable.
- GCGC now has a CHANGELOG.md.