Skip to content

Encoding Alphabets

The concept of an Alphabet is an analog to the concept of a vocabulary in NLP, i.e. the set of possible tokens present in the corpus, or in the biological case, the sequences.

For example, in the case of DNA where there is no ambiguity in the reads, the alphabet is {A, T, C, G}.

Implementation

Like other areas of GCGC, BioPython is used to supply the basics of the Alphabet with GCGC adding additional functionality of the ML case.

In general, for a BioPython Alphabet, there exists a corresponding GCGC alphabet that handles encoding and decoding a sequence into an integer or one-hot encoding.

For example, in the case described in the into the IUPACUnambiguousDNAEncoding alphabet would be a reasonable choice.

from gcgc.alphabet import IUPACUnambiguousDNAEncoding

alpha = IUPACAmbiguousDNAEncoding()

Once the alphabet is instantiated, it's simple to encode or decode values to go between text and integer encodings.

alpha.encode_token("A")
# 1

alpha.decode_token(1)
# "A"

Possible Letters

In addition to the standard .letters supplied by the BioPython Alphabet, GCGC adds START, END, and PADDING characters for processing applications. There also exists a .letters_and_tokens property which contains the regular letters and the special tokens.

The length of the alphabet is equal to the number of the BioPython alphabet characters plus the special characters, or the length of .letters_and_tokens.

This is relevant for ML applications, for instance, where the size of the vocabulary is needed to create the initial embedding lookup table.

Custom Alphabets

Currently GCGC does not provide direct support for user defined Alphabets, though it is on the roadmap. Hopefully, one of the standard IUPAC's Alphabets will work. See the module gcgc.alphabet.iupac for more info.

Kmer Alphabets

You won't always want to have an alphabet that encodes single residuals. Therefore you can specify the kmer_size of the Alphabet which will create an alphabet of kmers from the original letters.

For example, if you have the IUPACUnambiguousDNAEncoding passing a kmer of size two will use an encoding index under the hood of...

>>> a = IUPACUnambiguousDNAEncoding(kmer_size=2)
>>> a.encoding_index
{'>': 0,
 '<': 1,
 '|': 2,
 'GG': 3,
 'GA': 4,
 'GT': 5,
 'GC': 6,
 'AG': 7,
 'AA': 8,
 'AT': 9,
 'AC': 10,
 'TG': 11,
 'TA': 12,
 'TT': 13,
 'TC': 14,
 'CG': 15,
 'CA': 16,
 'CT': 17,
 'CC': 18}