Exon Release Notes

May 31, 2024 · 4 min read

Trent Hauck

Developer

We've released a new version of Exon and its Python API, BioBear. This release includes new features, improvements, and bug fixes. In particular, we'd like to highlight two things:

Read and write support for FASTA and FASTQ files
Support for integer encoding of DNA and protein sequences from FASTA files

Read and Write Support for FASTA and FASTQ Files

Previously, it was possible to create tables from FASTA and FASTQ files via CREATE EXTERNAL TABLE and to query them directly via their respective fasta_scan and fastq_scan table functions. However, it wasn't possible to write data back to these formats. With this release, you can write data to FASTA and FASTQ files using SQL's built-in COPY command.

For example, to read the SwissProt FASTA file and write it back to a new file, first create the table:

CREATE EXTERNAL TABLE swissprot
STORED AS FASTA
COMPRESSION TYPE GZIP
LOCATION 'uniprot_sprot.fasta.gz'

Then copy the data to a new file:

COPY swissprot TO 'uniprot_sprot_copy.fasta' STORED AS FASTA

Or in BioBear:

import biobear as bb

session = bb.new_session()

session.execute(
    """
CREATE EXTERNAL TABLE swissprot
STORED AS FASTA
COMPRESSION TYPE GZIP
LOCATION 'uniprot_sprot.fasta.gz'
"""
)

session.execute("COPY swissprot TO 'uniprot_sprot_copy.fasta' STORED AS FASTA")

In effect, this example decompresses the file: the source is gzip-compressed, and the copy is written without compression. You can also copy from query results:

session.execute(
    """
COPY (
    SELECT *
    FROM swissprot
    WHERE NOT starts_with(sequence, 'M')
) TO 'uniprot_sprot_weird_aa.fasta' STORED AS FASTA
"""
)
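
To make the filter concrete, the sketch below expresses the same selection (keep only records whose sequence does not start with 'M') in plain Python over a small in-memory FASTA string. It is only an illustration of what the query selects, not how Exon executes it; the records and the helper names parse_fasta and without_leading_m are made up for this example.

```python
from typing import Iterator, List, Optional, Tuple

# A tiny in-memory FASTA document; the records are invented for illustration.
FASTA_TEXT = """>seq1 starts with methionine
MKTAYIAK
>seq2 no leading methionine
AKQRQISF
>seq3 multi-line record
MST
LLV
"""


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def without_leading_m(text: str) -> str:
    """Return FASTA text with only the records whose sequence does not start with 'M'."""
    kept = [
        f">{header}\n{sequence}\n"
        for header, sequence in parse_fasta(text)
        if not sequence.startswith("M")
    ]
    return "".join(kept)


filtered = without_leading_m(FASTA_TEXT)
print(filtered)
```

Only seq2 survives the filter here, which mirrors the WHERE NOT starts_with(sequence, 'M') clause in the query above.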

Compression

You can specify the compression type in the COPY command's OPTIONS clause. For example, to copy the SwissProt FASTA file to a ZSTD-compressed file:

session.execute(  # our session from above
    """
COPY swissprot TO 'uniprot_sprot_copy.fasta.zst'
STORED AS FASTA OPTIONS ('compression' 'zstd')
"""
)
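
If you write many files, it can be convenient to derive the compression option from the output file name. The helper below is a minimal sketch of that idea: build_copy_statement and quote_literal are hypothetical names, not part of BioBear, and only the 'zstd' value appears in this post. The 'gzip' value is an assumption and should be checked against the Exon documentation before use.

```python
# Map output file suffixes to values for the 'compression' option.
# Only 'zstd' is shown in this post; 'gzip' is an assumed value.
SUFFIX_TO_COMPRESSION = {
    ".zst": "zstd",
    ".gz": "gzip",  # assumption, not confirmed by this post
}


def quote_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_copy_statement(table: str, path: str, file_format: str = "FASTA") -> str:
    """Build a COPY statement, adding a compression option when the suffix implies one.

    The table name is interpolated as-is, so pass only trusted identifiers.
    """
    statement = f"COPY {table} TO {quote_literal(path)} STORED AS {file_format}"
    for suffix, compression in SUFFIX_TO_COMPRESSION.items():
        if path.endswith(suffix):
            statement += f" OPTIONS ('compression' {quote_literal(compression)})"
            break
    return statement


compressed_sql = build_copy_statement("swissprot", "uniprot_sprot_copy.fasta.zst")
plain_sql = build_copy_statement("swissprot", "uniprot_sprot_copy.fasta")
print(compressed_sql)
print(plain_sql)
```

The resulting string can be passed to session.execute in the same way as the hand-written statements above.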

Integer Encoding of DNA and Protein Sequences

Second, we've added support for reading DNA and protein sequences as integer-encoded arrays. This can be useful for ML applications, where sequences are often encoded as integers before training. For example, to read the SwissProt FASTA file and encode the sequences as integers:

CREATE EXTERNAL TABLE swissprot
STORED AS FASTA
COMPRESSION TYPE GZIP
LOCATION 'uniprot_sprot.fasta.gz'
OPTIONS (fasta_sequence_data_type 'integer_encode_protein')

Or in BioBear to get a Polars DataFrame:

import biobear as bb

session = bb.new_session()

session.execute(
    """
CREATE EXTERNAL TABLE swissprot
STORED AS FASTA
COMPRESSION TYPE GZIP
LOCATION 'uniprot_sprot.fasta.gz'
OPTIONS (fasta_sequence_data_type 'integer_encode_protein')
"""
)

df = session.sql("SELECT * FROM swissprot").to_polars()
df.head()

| id                   | description                     | sequence       |
|----------------------|---------------------------------|----------------|
| sp|Q6GZX4|001R_FRG3G | Putative transcription factor … | [12, 1, … 11]  |
| sp|Q6GZX3|002L_FRG3G | Uncharacterized protein 002L O… | [12, 18, … 22] |
| sp|Q197F8|002R_IIV3  | Uncharacterized protein 002R O… | [12, 1, … 3]   |
| sp|Q197F7|003L_IIV3  | Uncharacterized protein 003L O… | [12, 23, … 9]  |
| sp|Q6GZX2|003R_FRG3G | Uncharacterized protein 3R OS=… | [12, 1, … 18]  |
| …                    | …                               | …              |
| sp|Q6UY62|Z_SABVB    | RING finger protein Z OS=Sabia… | [12, 7, … 4]   |
| sp|P08105|Z_SHEEP    | Putative uncharacterized prote… | [12, 18, … 9]  |
| sp|Q88470|Z_TACVF    | RING finger protein Z OS=Tacar… | [12, 7, … 11]  |
| sp|A9JR22|Z_TAMVU    | RING finger protein Z OS=Tamia… | [12, 7, … 15]  |
| sp|B2ZDY1|Z_WWAVU    | RING finger protein Z OS=White… | [12, 7, … 1]   |
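
For readers new to the idea, integer encoding simply replaces each residue letter with a small integer code. The sketch below shows the concept in plain Python. The alphabet and the code assignments are hypothetical and chosen only for illustration; this post does not specify the mapping Exon uses, so the numbers here will not match the table above.

```python
from typing import Dict, List

# Hypothetical alphabet for illustration: the 20 standard amino acids.
# Exon's actual alphabet and code assignments may differ.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

# Codes start at 1 so that 0 stays free, for example as a padding value.
ENCODE: Dict[str, int] = {residue: code for code, residue in enumerate(ALPHABET, start=1)}
DECODE: Dict[int, str] = {code: residue for residue, code in ENCODE.items()}


def encode_protein(sequence: str) -> List[int]:
    """Map each residue letter to its integer code (raises KeyError on unknown letters)."""
    return [ENCODE[residue] for residue in sequence.upper()]


def decode_protein(codes: List[int]) -> str:
    """Map integer codes back to residue letters."""
    return "".join(DECODE[code] for code in codes)


encoded = encode_protein("MKTAY")
print(encoded)                  # [11, 9, 17, 1, 20]
print(decode_protein(encoded))  # MKTAY
```

Because the mapping is one-to-one, encoding is lossless for any sequence made up of letters in the alphabet.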

In addition to integer_encode_protein, you can also use:

utf8 for the UTF-8 encoding (this is the default)
integer_encode_dna for integer encoding of DNA sequences
large_utf8 for larger sequences like chromosomes or genomes
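
Integer-encoded sequences vary in length, so ML code commonly pads them to a rectangular batch before building a tensor. The sketch below illustrates that step for DNA in plain Python. The code assignments and the helper names encode_dna and pad_batch are hypothetical; the mapping Exon uses for integer_encode_dna is not specified in this post.

```python
from typing import List

# Hypothetical DNA codes for illustration; 0 is reserved for padding.
DNA_CODES = {"A": 1, "C": 2, "G": 3, "T": 4, "N": 5}
PAD = 0


def encode_dna(sequence: str) -> List[int]:
    """Map each base to its integer code (raises KeyError on unknown letters)."""
    return [DNA_CODES[base] for base in sequence.upper()]


def pad_batch(sequences: List[List[int]], pad_value: int = PAD) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one."""
    width = max((len(sequence) for sequence in sequences), default=0)
    return [sequence + [pad_value] * (width - len(sequence)) for sequence in sequences]


batch = pad_batch([encode_dna(s) for s in ["ACGT", "GGN", "t"]])
print(batch)  # [[1, 2, 3, 4], [3, 3, 5, 0], [4, 0, 0, 0]]
```

Reserving 0 for padding keeps padded positions distinguishable from real bases, which is why the illustrative codes start at 1.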

Setting the Encoding for all FASTA Files

If you want to set the encoding for the entire session, you can use the SET command:

session.execute("SET fasta_sequence_data_type = 'integer_encode_protein'")

Then all subsequent CREATE EXTERNAL TABLE commands will use this encoding.

Conclusion

We hope you enjoy these new features and improvements. Please let us know if you have any questions or feedback.

To get started with BioBear, run:

pip install -U biobear
