

GCGC is in active development.


GCGC is a Python package for preprocessing biological sequences. Think of it as a Natural Language Processing preprocessing toolkit whose design choices reflect the differences between sequences found in natural language and those found in biology.
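To make the NLP analogy concrete, here is a minimal sketch in plain Python of the kind of preprocessing involved: splitting a DNA sequence into overlapping k-mers (the rough counterpart of words) and mapping each one to an integer ID. It does not use GCGC's API; `kmer_tokenize` and `build_vocab` are hypothetical helpers written only for this illustration.

```python
from typing import Dict, List


def kmer_tokenize(sequence: str, k: int = 3) -> List[str]:
    """Split a sequence into overlapping k-mers with a stride of 1."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_vocab(tokens: List[str]) -> Dict[str, int]:
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab: Dict[str, int] = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


# Hypothetical example input; any DNA string works.
tokens = kmer_tokenize("ATCGAT", k=3)
vocab = build_vocab(tokens)
encoded = [vocab[token] for token in tokens]

print(tokens)   # ['ATC', 'TCG', 'CGA', 'GAT']
print(encoded)  # [0, 1, 2, 3]
```

Unlike natural language, a biological sequence has no whitespace or punctuation to split on, which is why a fixed-width window such as a k-mer is a common stand-in for a word.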

GCGC has two main points of entry. First, it can be imported as a Python package. Second, it can be used as a command-line tool for specific types of transformations. In the latter case, the idea is to integrate GCGC into a larger data processing pipeline.


Install GCGC via pip:

$ pip install gcgc

If you'd like to use one of the third-party tools, install the related "extras":

$ pip install "gcgc[torch]"
  • Getting Started: For a full example of using GCGC with a classification model, see the splice site example.

  • Bugs or Help: Please file an issue if you run into problems.

  • Development Roadmap: The GCGC development board is hosted on Notion.

  • Source Code: GitHub Repo

Documentation Version

The documentation you're reading was built for version 0.9.1.