Fields
Fields are additional pieces of data, including labels (the thing being predicted).
At a most basic level, a field has a name, and depending on the type of field it may take additional options to do more work.
Once a field is declared the it is added to the data dictionary that is returned when processing sequences. See the example in the PyTorch integration section.
Types of Fields
As stated different kinds of fields afford different functionality.
LabelField
A LabelField
ascribes a particular label to a sequence.
FileMetaDataField
A FileMetaDataField
can be used in cases where the filenames contain label information.
For example, imagine there's a list of files:
from pathlib import Path paths = [Path("a/f1.fasta"), Path("a/f2.fasta"), Path("b/f3.fasta")]
In this case there are two files, f1 and f2, that are under label "a" and f3 is under label b. In
order to create a FileMetaDataField
we'll write a function that takes in a Path
and returns a
string (the label).
def preprocess(p: Path) -> str: # Path("a/b") -> Path("a") -> "a" return str(p.parent)
Then create the FileMetaDataField
.
from gcgc.fields import FileMetaDataField label = FileMetaDataField.from_paths(name='label', paths=paths, preprocess=preprocess)
After that, the field has an encoding dictionary and, more importantly, it will add the label name and corresponding integer encoded class to the dataset.
label.encoding_dict {'a': 0, 'b': 1}
AnnotationField
On the SeqRecord
BioPython object exists an .annotations
property, which potentially contains
labels that contain labels or other features. To help access these the AnnotationField
works in a
similar fashion to the FileMetaDataField
: write a processing function and create an exemplar
dataset.
.annotations
is just a dictionary, so this is the simplest example.
annotations = [{"a": "a"}, {"a": "b"}]
Because .annotations
is a Dict, the preprocess
signature is slightly different than FileMetaDataField
.
def preprocess(a: Dict) -> str: return a["a"]
After this, the field behaves similarly to the other fields.
from gcgc.fields import AnnotationField af = AnnotationField.from_annotations("a", annotations, preprocess) af.encoding_dict # {"a": 0, "b": 1}