Changelog
0.12.1 (unreleased)
- Add `fit_on_list` to `BioSequencePiece`, which can be trained on a list of strings.
- Make `BioSequencePiece.model_prefix` a `Union[str, Path]`, and rework the internal implementation to be consistent.
- Add token ids for all the special tokens (e.g. `bos_token` now has an accompanying `bos_token_id`).
- Add `special_token_ids`, which returns the set of integers that are special tokens.
- Add a `get_special_tokens_mask` method on the tokenizer that returns a [0, 1] mask of the underlying tokens.
- Add autodoc of API for main tokenizers.
- Install the python sentencepiece package by default. Add utilities for package management.
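
To illustrate what a [0, 1] special-tokens mask looks like, here is a minimal sketch in plain Python. It does not call gcgc: the function name `special_tokens_mask`, its parameters, and the example ids are hypothetical, and it assumes that 1 marks a special token. The real `get_special_tokens_mask` method may differ in signature and return type.

```python
from typing import Iterable, List, Set


def special_tokens_mask(token_ids: Iterable[int], special_ids: Set[int]) -> List[int]:
    """Return 1 where a token id is special and 0 elsewhere.

    Hypothetical helper for illustration only; not part of gcgc's API.
    """
    return [1 if token_id in special_ids else 0 for token_id in token_ids]


# Assumed ids for this example only: 0 = padding, 1 = bos, 2 = eos.
example_special_ids = {0, 1, 2}
example_encoded = [1, 5, 6, 7, 2, 0, 0]

mask = special_tokens_mask(example_encoded, example_special_ids)
print(mask)
```

A mask like this is typically used to exclude special tokens from a loss or from masking candidates.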
 
0.12.0 (2020-01-25)
- Improved the docs to reflect the `SequenceTokenizerSpec` that was added in 0.11.0.
- Made max length optional for the tokenizer.
- Added a CLI that parses using the SequencePiece library.
- Began versioning the docker build, and made pushing easier during the build process.
- Have the tokenizer resolve the named alphabets.
- Use poetry, along with general updates to the build pipeline.
 
0.11.0 (2019-11-15)
Added
- Added the `SequenceTokenizerSpec` object for specifying the tokenizer.
- Added `Vocab` object for storing the int to token, and token to int encodings.
- Added example of using tensorflow/keras together with gcgc.
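
The idea of storing both the int to token and the token to int encodings can be sketched as follows. This is an illustration in plain Python, not gcgc's `Vocab`: the class name `SimpleVocab`, its methods, and the `<pad>` token are hypothetical.

```python
from typing import Dict, Iterable, List


class SimpleVocab:
    """Hypothetical two-way mapping between tokens and integer ids."""

    def __init__(self, tokens: Iterable[str]) -> None:
        self.token_to_int: Dict[str, int] = {}
        self.int_to_token: Dict[int, str] = {}
        for token in tokens:
            self.add(token)

    def add(self, token: str) -> int:
        """Add a token if it is new and return its id."""
        if token not in self.token_to_int:
            index = len(self.token_to_int)
            self.token_to_int[token] = index
            self.int_to_token[index] = token
        return self.token_to_int[token]

    def encode(self, tokens: Iterable[str]) -> List[int]:
        """Map tokens to ids; raises KeyError for unknown tokens."""
        return [self.token_to_int[token] for token in tokens]

    def decode(self, ids: Iterable[int]) -> List[str]:
        """Map ids back to tokens."""
        return [self.int_to_token[i] for i in ids]


# Assumed token order for this example: the padding token first, then the letters.
vocab = SimpleVocab(["<pad>", "A", "T", "C", "G"])
ids = vocab.encode("GATTACA")
print(ids)
print("".join(vocab.decode(ids)))
```

Keeping both dictionaries in sync inside a single `add` method avoids the two directions drifting apart.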
 
0.10.0 (2019-11-09)
Changed
gcgc has been revamped quite a bit to better support existing processing
pipelines for NLP without trying to do too much. See the docs for more
information about how this works.

0.9.0 (2019-08-05)
Added
- Parser now outputs the length of the tensor not including padding. This is useful for packing and length based iteration.
 - Generating masked output from the parse_record method is now available.
 - Alphabet can include an optional mask token.
 
Changed
- Can now specify the kmer step size to use when supplying a kmer value.
 - Renames EncodedSeq.integer_encoded to EncodedSeq.get_integer_encoding which takes a kmer_step_size to specify how large of steps to take when encoding.
 - Add parsed_seq_len to the SequenceParser object to control how much padding to apply to the end of the integer encoded sequence. This is useful since a batch of tensors is expected to have the same size.
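
The kmer step size and fixed-length padding described above can be sketched in plain Python. This is not gcgc's implementation: the function names, the default `pad_id` of 0, and the choice to truncate sequences longer than `parsed_seq_len` are assumptions made for illustration.

```python
from typing import List


def kmers_with_step(sequence: str, kmer_size: int, kmer_step_size: int = 1) -> List[str]:
    """Return kmers of length kmer_size, starting every kmer_step_size positions.

    Hypothetical helper; only full-length kmers are returned.
    """
    if kmer_size < 1 or kmer_step_size < 1:
        raise ValueError("kmer_size and kmer_step_size must be positive")
    last_start = len(sequence) - kmer_size
    return [
        sequence[start:start + kmer_size]
        for start in range(0, last_start + 1, kmer_step_size)
    ]


def pad_to_length(encoded: List[int], parsed_seq_len: int, pad_id: int = 0) -> List[int]:
    """Right-pad with pad_id (or truncate, by assumption) to exactly parsed_seq_len."""
    trimmed = encoded[:parsed_seq_len]
    return trimmed + [pad_id] * (parsed_seq_len - len(trimmed))


step_one = kmers_with_step("ATCGATCG", kmer_size=3)
step_two = kmers_with_step("ATCGATCG", kmer_size=3, kmer_step_size=2)
padded = pad_to_length([3, 4, 5], parsed_seq_len=5)
print(step_one, step_two, padded)
```

Padding every sequence to the same length is what lets a batch be stacked into a single tensor, and the unpadded length (3 in the example) can be kept alongside it for packing.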
 
0.8.0 (2019-07-04)
Fixed
- Broken test due to platform differences in `Path.glob` sorting.

Added
- User can specify to use start or end tokens optionally.
 
Removed
- Removed one_hot_encoding. The user can do that pretty easily if needed. E.g. see `scatter` in PyTorch.

0.7.0 (2019-06-22)
Added
- Properties to access the integer encodings of special tokens. (35cae2a)
  - `Alphabet.encoded_start`
  - `Alphabet.encoded_end`
  - `Alphabet.encoded_padding`
- Remove uniprot dataset creation. (e233162)
- Simplify index handling for GenomicDataset. (3213a9e)
 
0.6.1 (2019-06-10)
Added
- Updated package management so gcgc is easier to use with other versions of torch.
 
0.6.0 (2019-04-04)
Added
- Ability for kmer size to be passed to an alphabet.
 
0.5.2 (2019-03-21)
Added
- Add Dockerfile and docker-compose.yml for development.
- `EncodedSeq.shift`, which will shift sequence by an offset integer.
- `EncodedSeq.from_integer_encoded_seq` will take a list of integers and an alphabet and return an EncodedSeq object.
- Add the ability to apply a function to the rollout_kmers yielded values.
 
Changed
- Alphabet special characters are now located at the start, rather than the end, of the letters and token sequence.
 
0.5.1 (2019-01-09)
Added
- Add extra css to underline links in articles.
 - Exit if the download directory doesn't exist in the call to download organism.
 - Wording improvements in docs.
 
0.5.0 (2018-12-31)
Added
- Include `seq_tensor_one_hot` in the PyTorch Parser.
- Added a `GCGCRecord.encoded_seq` property.
- New `gcgc.random` module to start holding sequence data.
- New `gcgc.rollout` module to handle working through chunks of sequences.
  - `rollout_kmers` will roll out kmers.
  - `rollout_seq_features` will roll out the `SeqFeatures` from a `SeqRecord`.
- `EncodingAlphabet` now can optionally take a `gap_characters` set of characters to add to the alphabet letters. It also takes `add_lower_case_for_inserts` which will duplicate the alphabet, but convert the letters to lowercase.

Changed
Fixed
- Fixed bug in `GenomicDataset.from_path` where it still referred to `init_from_path_generator`.

0.4.0
Added
- `EncodedSeq` now supports iterating through kmers, see `EncodedSeq.rollout_kmers` for options.
- GCGC is citable.
- GCGC now has a CHANGELOG.md.