Parser now outputs the length of the tensor excluding padding. This is
useful for packing and length-based iteration.
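As a rough illustration of why an unpadded length is handy, the sketch below recovers the number of real tokens from a padded integer sequence. The helper name `unpadded_length` and the padding index `0` are assumptions made for this example, not part of the gcgc API.

```python
from typing import Sequence


def unpadded_length(encoded: Sequence[int], pad_index: int = 0) -> int:
    """Return the number of tokens before any trailing padding.

    Hypothetical helper: assumes padding is only appended at the end of the
    sequence and is always encoded as pad_index.
    """
    length = len(encoded)
    # Walk back from the end while the last remaining token is padding.
    while length > 0 and encoded[length - 1] == pad_index:
        length -= 1
    return length


padded = [5, 3, 4, 6, 0, 0, 0]
real_length = unpadded_length(padded)
# Length-based iteration: slice off the padding before looping.
real_tokens = padded[:real_length]
```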
Generating masked output from the parse_record method is now available.
Alphabet can include an optional mask token.
Can now specify how large a kmer step size to generate when supplying a kmer
Renamed EncodedSeq.integer_encoded to EncodedSeq.get_integer_encoding, which
takes a kmer_step_size to specify how large a step to take when encoding.
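To make the effect of a step size concrete, here is a minimal sketch of integer-encoding kmers with a configurable step. The function `integer_encode_kmers` and its kmer-to-integer mapping are assumptions for illustration and may differ from what EncodedSeq.get_integer_encoding does.

```python
from itertools import product
from typing import Dict, List


def integer_encode_kmers(
    seq: str, letters: str, kmer_size: int, kmer_step_size: int
) -> List[int]:
    """Encode seq as kmer integers, advancing kmer_step_size letters each time.

    Hypothetical helper: the kmer ordering here is an assumption.
    """
    # Map every possible kmer over the alphabet to an integer.
    kmer_to_int: Dict[str, int] = {
        "".join(kmer): index
        for index, kmer in enumerate(product(letters, repeat=kmer_size))
    }
    encoded = []
    # Stop once a full kmer no longer fits in the remaining sequence.
    for start in range(0, len(seq) - kmer_size + 1, kmer_step_size):
        encoded.append(kmer_to_int[seq[start:start + kmer_size]])
    return encoded


# A step of 1 yields overlapping kmers; a step of 2 yields disjoint 2-mers.
step_one = integer_encode_kmers("ATCGAT", "ATCG", kmer_size=2, kmer_step_size=1)
step_two = integer_encode_kmers("ATCGAT", "ATCG", kmer_size=2, kmer_step_size=2)
```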
Add parsed_seq_len to the SequenceParser object to control how much padding to
apply to the end of the integer encoded sequence. This is useful since a batch
of tensors is expected to have the same size.
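The padding behavior described above can be sketched roughly as follows. `pad_encoded` and the padding index `0` are hypothetical names and values chosen for this example; only `parsed_seq_len` comes from the entry itself.

```python
from typing import List


def pad_encoded(
    encoded: List[int], parsed_seq_len: int, pad_index: int = 0
) -> List[int]:
    """Right-pad an integer-encoded sequence to parsed_seq_len.

    Hypothetical helper: raising on over-long input is an assumption.
    """
    if len(encoded) > parsed_seq_len:
        raise ValueError("sequence is longer than parsed_seq_len")
    return encoded + [pad_index] * (parsed_seq_len - len(encoded))


# Sequences of different lengths end up the same size, so they can be batched.
batch = [pad_encoded(seq, parsed_seq_len=5) for seq in ([3, 4], [5, 6, 7, 3])]
```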
Fixed a broken test caused by platform differences in Path.glob sorting.
Users can optionally specify whether to use start or end tokens.
Removed one_hot_encoding. Users can do this easily themselves if needed; for
example, see scatter in PyTorch.
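For anyone who relied on the removed behavior, one-hot encoding an integer-encoded sequence takes only a few lines. The sketch below uses plain Python lists so it runs without dependencies; in PyTorch the same idea is usually written as a scatter along the class dimension. `one_hot` is a hypothetical helper name.

```python
from typing import List


def one_hot(encoded: List[int], num_classes: int) -> List[List[int]]:
    """Return one row per token with a single 1 at the token's index."""
    rows = []
    for index in encoded:
        if not 0 <= index < num_classes:
            raise ValueError(f"index {index} is outside 0..{num_classes - 1}")
        row = [0] * num_classes
        row[index] = 1  # the "scatter" step: write 1 at the encoded position
        rows.append(row)
    return rows


encoded_rows = one_hot([2, 0, 1], num_classes=4)
```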
Properties to access the integer encodings of special tokens. (35cae2a)
Remove uniprot dataset creation. (e233162)
Simplify index handling for GenomicDataset. (3213a9e)
Updated package management so gcgc is easier to use with other versions of
Ability for kmer size to be passed to an alphabet.
Add Dockerfile and docker-compose.yml for development.
EncodedSeq.shift, which shifts the sequence by an integer offset.
EncodedSeq.from_integer_encoded_seq will take a list of integers and an
alphabet and return an EncodedSeq object.
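Conceptually, going from integers back to a sequence amounts to a lookup in the alphabet. The sketch below shows that idea on plain strings; `decode_integers` is a hypothetical helper, and it assumes each integer is simply an index into the alphabet's letters.

```python
from typing import List


def decode_integers(integers: List[int], letters: str) -> str:
    """Map each integer back to the letter at that position in letters."""
    return "".join(letters[index] for index in integers)


decoded = decode_integers([0, 1, 2, 3, 0], "ATCG")
```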
Add the ability to apply a function to the rollout_kmers yielded values.
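One way to picture applying a function to yielded kmers is a generator that calls the function on each window before yielding it. `rollout_windows` and its parameters are assumptions for this sketch and do not reflect the actual signature of rollout_kmers.

```python
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def rollout_windows(
    seq: str, kmer_length: int, func: Callable[[str], T], step: int = 1
) -> Iterator[T]:
    """Yield func(kmer) for each window of kmer_length letters in seq."""
    for start in range(0, len(seq) - kmer_length + 1, step):
        yield func(seq[start:start + kmer_length])


def gc_count(kmer: str) -> int:
    """Count the G and C letters in a kmer."""
    return kmer.count("G") + kmer.count("C")


# Windows are ATG, TGC, GCG, CGC.
gc_counts = list(rollout_windows("ATGCGC", 3, gc_count))
```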
Alphabet special characters are now located at the start, rather than the end,
of the letters and token sequence.
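A small sketch of what placing special characters first means for the integer encoding. The special characters used here (`|`, `>`, `<`) and the helper `build_token_index` are assumptions for illustration only.

```python
from typing import Dict, Sequence


def build_token_index(letters: str, special_tokens: Sequence[str]) -> Dict[str, int]:
    """Assign integers with special tokens first, then the alphabet letters."""
    tokens = list(special_tokens) + list(letters)
    return {token: index for index, token in enumerate(tokens)}


dna_index = build_token_index("ATCG", ["|", ">", "<"])
protein_index = build_token_index("ACDEFGHIKLMNPQRSTVWY", ["|", ">", "<"])
```

In this sketch the special tokens receive the same integers for both alphabets, because they no longer depend on how many letters precede them.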
Add extra CSS to underline links in articles.
Exit if the download directory doesn't exist in the call to download organism.
Wording improvements in docs.
Include seq_tensor_one_hot in the PyTorch Parser.
Added a GCGCRecord.encoded_seq property.
New gcgc.random module to start holding sequence data.
New gcgc.rollout module to handle working through chunks of sequences.
rollout_seq_features will roll out the SeqFeatures from a SeqRecord.
EncodingAlphabet can now optionally take gap_characters, a set of characters to add to the
alphabet letters. It also takes add_lower_case_for_inserts, which duplicates the alphabet
with the letters converted to lowercase.
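The two options can be pictured with the sketch below. `build_letters` is a hypothetical helper; the order in which lowercase letters and gap characters are appended is an assumption, not something the entry specifies.

```python
from typing import Iterable


def build_letters(
    letters: str,
    gap_characters: Iterable[str] = (),
    add_lower_case_for_inserts: bool = False,
) -> str:
    """Extend letters with lowercase copies and gap characters as requested."""
    result = letters
    if add_lower_case_for_inserts:
        # Duplicate the alphabet in lowercase to represent inserts.
        result += letters.lower()
    # Sets are unordered, so sort the gap characters for a stable result.
    result += "".join(sorted(gap_characters))
    return result


extended = build_letters(
    "ATCG", gap_characters={"-", "."}, add_lower_case_for_inserts=True
)
```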
Fixed bug in GenomicDataset.from_path where it still referred to init_from_path_generator.
EncodedSeq now supports iterating through kmers; see EncodedSeq.rollout_kmers for options.