Blog | WHERE TRUE Technologies Documentation

Initial Support for SDF Files

July 22, 2024 · 4 min read

Trent Hauck

Developer

It's now possible to read SDF files with Exon and BioBear.

SDF Files

The "SDF" in SDF files stands for "Structure Data File". It's sorta like ATM machines.

As you might imagine, this file format contains chemical structures and additional data. There may be multiple such records in a single file. For example, from the ChemBL database.

CHEMBL153534
     RDKit          2D

17  0  0  0  0  0  0  0  0999 V2000
6140  -22.2702    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
7047  -23.1991    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
1806  -22.5282    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
9604  -22.7690    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
8790  -23.2163    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
2791  -21.1119    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
5280  -21.4445    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
4225  -22.4364    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
8353  -21.7198    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
2035  -23.8527    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
0534  -23.2163    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
9776  -23.5889    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
6406  -22.4938    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
6406  -23.9215    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
4397  -20.3035    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
6552  -21.6280    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
3  2  0
4  1  0
1  1  0
2  1  0
7  1  0
1  2  0
1  1  0
8  2  0
12  1  0
5  2  3
4  2  0
11  1  0
11  1  0
6  1  0
9  1  0
9  1  0
2  1  0
M  END
> <chembl_id>
CHEMBL153534

$$$$

This record starts with header information, followed by the atom and bond information, then any additional data (chembl_id in this case). The record ends with $$$$.

Reading SDF Files

Depending on which library you're using, you have different options for working with SDF files.

BioBear

If you're working with BioBear, install the latest version from PyPI:

pip install -U biobear

Then you can read an SDF file like this:

from biobear import new_session
from pathlib import Path

session = new_session()

sdf_file = Path("data/chembl_34.sdf.gz")
result = session.read_sdf_file(sdf_file.as_posix()).to_polars()
print(len(result))

# Output:
# 2409270

Exon

If you're working with Exon, you'll use SQL. First register the table:

CREATE EXTERNAL TABLE sdf
STORED AS SDF
LOCATION 'data/chembl_34.sdf.gz'
OPTIONS (compression 'gzip');

Then you can query the table:

SELECT COUNT(*) FROM sdf;
-- Output: 2409270

File Schema

An SDF table has the following schema:

Column Name	Type	Description
`header`	`Utf8`	The header of the record
`atom_count`	`UInt32`	The number of atoms in the record
`bond_count`	`UInt32`	The number of bonds in the record
`data`	`Struct`	Additional data in the record

The data column is inferred from the underlying file format based on the first record. So for example, given the SDF file above, the data column would have the following schema:

Column Name	Type	Description
`chembl_id`	`Utf8`	The `chembl_id` field from the record

Which could be accessed like this:

SELECT data.chembl_id FROM sdf LIMIT 1;

What's Missing?

As mentioned, this is an initial implementation. There are some things missing which which we plan to add in the future:

There is lots of room for optimization
Support for canonicalization of SMILES
Support for generating SMILES from coordinates
Support for writing SDF files
Functions for chemical descriptors

If you have any feedback or suggestions, please let us know.

Exon Release Notes

May 31, 2024 · 4 min read

Trent Hauck

Developer

We've released a new version of Exon, and it's Python API BioBear. This release includes some new features, improvements, and bug fixes. In particular, we'd like to highlight two things:

Read and write support for FASTA and FASTQ files
Support for integer encoding of dna and protein sequences from FASTA files

Read and Write Support for FASTA and FASTQ Files

Previously it'd been possible to create tables from FASTA and FASTQ files via CREATE EXTERNAL TABLE and query them directly via their respective fasta_scan and fastq_scan table functions. However, it wasn't possible to write data back to these formats until now. With this release, you can now write data back to FASTA and FASTQ files using SQL's built-in COPY command.

For example, to read the SwissProt FASTA file and write it back to a new file. First create the table:

CREATE EXTERNAL TABLE swissprot
STORED AS FASTA
COMPRESSION TYPE GZIP
LOCATION 'uniprot_sprot.fasta.gz'

Then copy the data to a new file:

COPY swissprot TO 'uniprot_sprot_copy.fasta' STORED AS FASTA

Or in BioBear:

import biobear as bb

session = bb.new_session()

session.execute(
    """
CREATE EXTERNAL TABLE swissprot
STORED AS FASTA
COMPRESSION TYPE GZIP
LOCATION 'uniprot_sprot.fasta.gz'
"""
)

session.execute("COPY swissprot TO 'uniprot_sprot_copy.fasta' STORED AS FASTA")

In effect, this example decompressed the file. You can also copy from query results:

session.execute(
    """
COPY (
    SELECT *
    FROM swissprot
    WHERE NOT starts_with(sequence, 'M')
) TO 'uniprot_sprot_weird_aa.fasta' STORED AS FASTA
"""
)

Compression

You can specify the compression type in the COPY's OPTIONS clause. For example, to copy the SwissProt FASTA file to ZSTD compressed file:

session.execute(  # our session from above
    """
COPY swissprot TO 'uniprot_sprot_copy.fasta.zst'
STORED AS FASTA OPTIONS ('compression' 'zstd')
"""
)

Integer Encoding of DNA and Protein Sequences

Second, we've added support for reading DNA and protein sequences as integer-encoded arrays. This can be useful for ML applications as sequences are often encoded as integers for training. For example, to read the SwissProt FASTA file and encode the sequences as integers:

CREATE EXTERNAL TABLE swissprot
STORED AS FASTA
COMPRESSION TYPE GZIP
LOCATION 'uniprot_sprot.fasta.gz'
OPTIONS (fasta_sequence_data_type 'integer_encode_protein')

Or in BioBear to get a Polars DataFrame:

import biobear as bb

session = bb.new_session()

session.execute(
    """
CREATE EXTERNAL TABLE swissprot
STORED AS FASTA
COMPRESSION TYPE GZIP
LOCATION 'uniprot_sprot.fasta.gz'
OPTIONS (fasta_sequence_data_type 'integer_encode_protein')
"""
)

df = session.sql("SELECT * FROM swissprot").to_polars()
df.head()

| id                   | description                     | sequence       |
|----------------------|---------------------------------|----------------|
| sp|Q6GZX4|001R_FRG3G | Putative transcription factor … | [12, 1, … 11]  |
| sp|Q6GZX3|002L_FRG3G | Uncharacterized protein 002L O… | [12, 18, … 22] |
| sp|Q197F8|002R_IIV3  | Uncharacterized protein 002R O… | [12, 1, … 3]   |
| sp|Q197F7|003L_IIV3  | Uncharacterized protein 003L O… | [12, 23, … 9]  |
| sp|Q6GZX2|003R_FRG3G | Uncharacterized protein 3R OS=… | [12, 1, … 18]  |
| …                    | …                               | …              |
| sp|Q6UY62|Z_SABVB    | RING finger protein Z OS=Sabia…  | [12, 7, … 4]   |
| sp|P08105|Z_SHEEP    | Putative uncharacterized prote… | [12, 18, … 9]  |
| sp|Q88470|Z_TACVF    | RING finger protein Z OS=Tacar…  | [12, 7, … 11]  |
| sp|A9JR22|Z_TAMVU    | RING finger protein Z OS=Tamia…  | [12, 7, … 15]  |
| sp|B2ZDY1|Z_WWAVU    | RING finger protein Z OS=White…  | [12, 7, … 1]   |

In addition to integer_encode_protein, you can also use:

utf8 for the UTF-8 encoding (this is the default)
integer_encode_dna for integer encoding of DNA sequences
large_utf8 for larger sequences like chromosomes or genomes

Setting the Encoding for all FASTA Files

If you want to set the encoding for the entire session, you can use the SET command:

session.execute("SET fasta_sequence_data_type = 'integer_encode_protein'")

Then all subsequent CREATE EXTERNAL TABLE commands will use this encoding.

Conclusion

We hope you enjoy these new features and improvements. Please let us know if you have any questions or feedback.

To get started with BioBear, run:

pip install -U biobear

Data Prep for Bioinformatic ML

May 21, 2024 · 5 min read

Trent Hauck

Developer

This video is a quick walkthrough of:

What is a join, why it's important for bioinformatics analysis, and how to recognize when you could be using SQL vs Python.
How to use SQL to easily combine contigs, GFFs, and amino acid sequences into a dataset for a back-translation pLM from amino acids to CDSs.
1. Shows how to extract CDSs from contigs in a strand-aware manner.
2. Shows how to use contig metadata for Group K-Fold Cross Validation to avoid leakage.

The second part starts around the 5 minute mark.

If you'd like to try biobear, you can install it with pip:

pip install biobear

Or feel free to reach out if you have any question. Otherwise, the code from the video is below.

Tabular Example

For example, given the table:

gene_id	contig_id
1	1
2	1

Where each row represents a gene and the contig it is on, we can join this table with another table that has information about the contigs:

contig_id	length
1	1000

To get a table that looks like this:

gene_id	contig_id	length
1	1	1000
2	1	1000

So, with a genes table and a contigs table, this would look like:

SELECT
    genes.gene_id,
    genes.contig_id,
    contigs.length
FROM
    genes
JOIN
    contigs
ON
    genes.contig_id = contigs.contig_id

What this Looks like in Python

It's important to recognize this pattern in Python w/ Dictionaries. You may already be using this pattern without realizing it.

# Mapping of gene_id to contig_id
genes = {
    1: 1,
    2: 1
}

# Mapping of contig_id to length
contigs = {
    1: 1000
}

These two dictionaries can then be "joined":

for gene_id, contig_id in genes.items():
    length = contigs[contig_id]
    print(gene_id, contig_id, length)

This is straightforward, but can run into memory issues if the datasets are large. SQL databases are optimized for this type of operation. It also is fairly inflexible, as addint additional data, doing aggregations, etc. can require material code changes.

DataFrames

DataFrames are a common data structure found in Python libraries like Pandas (and base R). Conceptually they are similar to tables in SQL, but can run into the same memory and performance issues as dictionaries.

import pandas as pd

genes = pd.DataFrame({
    'gene_id': [1, 2],
    'contig_id': [1, 1]
})

contigs = pd.DataFrame({
    'contig_id': [1],
    'length': [1000]
})

result = pd.merge(genes, contigs, on='contig_id')
print(result)

Types of Joins

There are several types of joins, but the two most common are:

Inner Join
Left Join

Inner Join

An inner join is the most common type of join. It only returns rows where there is a match in both tables. In the example above, this would mean that only genes that are on a contig would be returned.

Left Join

A left join returns all rows from the left table, and the matched rows from the right table. If there is no match, the result is NULL. In the example above, this would mean that all genes would be returned, even if they are not on a contig.

SELECT
    genes.gene_id,
    genes.contig_id,
    contigs.length
FROM
    genes
LEFT JOIN
    contigs
ON
    genes.contig_id = contigs.contig_id

gene_id	contig_id	length
1	1	1000
2	1	1000
3	2	NULL

pLM Example with BioBear

Joining CDSs with Amino Acids and Metadata for the purposes of training a pLM.

End goals:

A table with the AA sequence and DNA sequence for each CDS
Additional metadata for each CDS, our use case: group k-fold cross-validation by contig

# | echo: false
# | output: false

import polars as pl

Start a New Session and Load Data

Load our GFF.

import biobear as bb

session = bb.new_session()

session.execute(
    """
CREATE EXTERNAL TABLE gff
STORED AS gff
LOCATION './12803.assembled.gff';
"""
)

session.sql(
    """
SELECT seqname, source, start, "end", strand, phase, attributes.locus_tag[1] locus_tag FROM gff WHERE type = 'CDS' LIMIT 5
"""
).to_polars().glimpse()

Load our contigs. This part is somewhat tricky as we have the contig sequence, which contains the CDSs. So we'll need to join and extract the CDS sequences in a minute.

session.execute(
    """
CREATE EXTERNAL TABLE contigs
STORED AS fasta OPTIONS (file_extension 'fna')
LOCATION './12803.assembled.fna';
"""
)

session.sql(
    """
SELECT id, sequence FROM contigs LIMIT 5
"""
).to_polars().glimpse()

Load our protein sequences.

session.execute(
    """
CREATE EXTERNAL TABLE protein
STORED AS fasta OPTIONS (file_extension 'faa')
LOCATION './12803.assembled.faa';
"""
)

session.sql(
    """
SELECT id, sequence FROM protein LIMIT 5
"""
).to_polars().glimpse()

Data is loaded, now we can join and extract the CDS sequences.

df = session.sql(
    """
SELECT seqname as contig,
    CASE
        WHEN strand = '1' THEN substr(contigs.sequence, start, "end" - start + 1)
        ELSE reverse_complement(substr(contigs.sequence, start, "end" - start + 1))
    END AS sequence,
    protein.sequence aa_sequence,
    gff.start,
    gff."end",
    gff.strand,
    "end" - start + 1 AS length
FROM gff
    JOIN contigs
        ON gff.seqname = contigs.id
    JOIN protein
        ON gff.attributes.locus_tag[1] = protein.id
"""
).to_polars()

df.head()

Now we can use sci-kit learn to do a group k-fold cross-validation using the contig as the group.

from sklearn.model_selection import GroupKFold

X = df["aa_sequence"]
Y = df["sequence"]
contigs = df["contig"]

group_kfold = GroupKFold(n_splits=5)

for train_index, test_index in group_kfold.split(X, Y, groups=contigs):
    print("TRAIN:", train_index)
    print("TEST:", test_index)

    X_train, X_test = X[train_index], X[test_index]

    # Do something with the data

    break

Exon CLI + Nextflow Example

January 22, 2024 · 3 min read

Trent Hauck

Developer

We've begun publishing the Exon CLI to the AWS Public ECR. This means you can pull and run the Exon CLI using Docker, and use it interactively or in scripts and workflows, like Nextflow which we'll see in a second.

For example, say I was in a directory with a single FASTA file, test.fasta. I could "drop" into the Exon CLI using the following command:

docker run -it -v $(pwd):/data where-true-tech/exon-cli:latest

Which would start the Exon CLI and drop me into a shell. And from there, I could query the data.

❯ SELECT * FROM fasta_scan('/data/test.fasta');
# +----+--------------+----------+
# | id | description  | sequence |
# +----+--------------+----------+
# | a  | description  | ATCG     |
# | b  | description2 | ATCG     |
# +----+--------------+----------+
# 2 rows in set. Query took 0.026 seconds.

While we plan on releasing native binaries for the Exon CLI, with Docker, you can easily use the CLI in your workflows and scripts.

Nextflow Example

Nextflow is one of the most popular workflow managers for bioinformatics. It's a great tool for building reproducible workflows, and it's easy to use the Exon CLI in Nextflow workflows.

For example, if I wanted to copy a FASTA file generated earlier in the workflow, into Parquet for warehouse storage, I could use the -c (command) flag to execute a COPY query in the Exon CLI. This would look something like the following:

As a minimal example, we can round-trip a FASTA file from S3 to Parquet using the Exon CLI and Nextflow. In reality, whatever relevant domain task would generate the FASTA, and then we could use the Exon CLI to copy it into Parquet.

// main.nf
process copyFastaLocally {
    output:
    path 'test.fasta'

    script:
    """
    aws s3 cp s3://wtt-01-dist-prd/tmp/test.fasta .
    """
}

process exportToParquet {
    container 'public.ecr.aws/where-true-tech/exon-cli'

    input:
    path input_fasta

    script:
    """
    exon-cli -c "COPY (SELECT * FROM fasta_scan('$input_fasta')) TO 's3://wtt-01-dist-prd/tmp/test2.parquet' (FORMAT PARQUET)";
    """
}

workflow {
    main:
    copyFastaLocally()
    exportToParquet(copyFastaLocally.out)
}

In order to use Docker and have it access AWS, you'll need to make sure your nextflow.config is configured for them. For this example, the following configuration would be sufficient:

// nextflow.config
docker {
    enabled = true
    envWhitelist = 'AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY'
}

Conclusion

We hope you find the Exon CLI tool useful to support you taking advantage of modern data warehousing and analytics tools in your bioinformatics workflows. If you'd to learn more about what options are available to you, please check out the CLI documentation. Also, if you need more flexibility than the CLI provides, you can always use the BioBear with Python.

If you have any questions or feedback, please reach out to us on at wheretrue.com/contact or on Twitter.

Exon and BioBear Updates

January 3, 2024 · 7 min read

Trent Hauck

Developer

This post will walk through some of the recent updates to Exon, and show how they can be used with BioBear and/or ExonR.

These updates include:

Addition of several table functions to scan various bioinformatic files and query some indexed files. Using these table functions requires less boilerplate SQL code, so are a nice quality of life improvement.
Exon has gain the ability to do index seeks on GFF files for region queries. Now users can access specific regions across large annotations sets without resorting to costly sequential scans.
Flow Cytometry Standard (FCS) files are now supported, and can help scientists go very quickly from the FCS binary format to a DataFrame for analysis.
Various optimizations led to performance improvements for bgzipped files.
New utility table types where added to align more simply to file extensions: FNA (.fna), FAA (.faa), FQ (.fq).

Table of Contents

Table Functions
Scanning and Querying
- Example Scanning Function
- Example Query Function
Indexed GFF Files
Flow Cytometry Standard (FCS) Files
Faster bgzip Reading
New Utility File Types
Conclusion

Table Functions

The addition of table functions is mostly a quality of life improvement, but it's good to live good, so to speak. A table function returns a table that is incorporated into the larger query plan. For example in

SELECT COUNT(*) FROM gff_scan('gff/test.gff.gz');

gff_scan creates a table which is then incorporated into the overall query to count the contents of that file. This is the same thing as doing:

CREATE EXTERNAL TABLE gff_file
STORED AS GFF
COMPRESSION TYPE GZIP
LOCATION 'gff/test.gff.gz';

SELECT COUNT(*) FROM gff_file;

As you can see, it's considerably less typing. Or, if we were to use BioBear, we can see it's ease:

import biobear as bb

sess = bb.connect()
df = sess.sql("SELECT * FROM gff_scan('gff/test.gff.gz')").to_polars()
print(df.head())

There's the obvious downside in that the table function can't be reused like the table, which is not in the database's catalog. However, given these are backed by an external file, which won't change, it's often moot.

Scanning and Querying

As of this release, there are two types of table functions available: scanning, for bulk scanning of all underlying records, and querying, for specific region access for indexed files.

Please see the table function docs, for what's specific available, but generally all current files have scan support, and indexed files are VCF, SAM, and GFF.

Example Scanning Function

Scanning functions take a couple of options. First, they all require the location of the table. This can be a single file or a folder. Second, if applicable, they can take the compression of the underlying file. The table function will also try to use infer the compression type, but this only works for single files. If you pass a folder/prefix, you must pass a compression type unless the files are actually uncompressed.

So, we already saw one example, but we can also can other file types and potentially include them in larger queries. For example,

SELECT *
FROM fasta_scan('fasta/test.fasta') AS f
    JOIN gff_scan('gff/test.gff.gz') AS g
        ON f.id = g.seqname;

Again, see the table function docs for the list of available functions, and more examples.

Example Query Function

Querying indexed files can save a lot of time because only the bytes that belong to the query region are included. The newly adding bam_indexed_scan, vcf_indexed_scan, and gff_indexed_scan, allow querying those respective files.

With BioBear, we an easily write a script which can search regions and save the results as parquet.

def query(path: str, region: str):
    sess = bb.connect()
    df = sess.sql(f"SELECT * FROM vcf_indexed_scan('{path}', '{region}')").to_polars()
    df.write_parquet(f"{region}.parquet")

if __name__ == "__main__":
    query("vcf/index.vcf.gz", "1")

Indexed GFF Files

As mentioned, Exon can now do index seeks on GFF files to markedly reduce query times when a specific sequence or region is needed.

This access pattern comes up often. For example, in metagenomics to get the neighboring annotations for an enzyme of interest. Or in human health, to query regions near some pathogenic variant.

Similar to SAM and VCF index support in Exon, there are two ways to run queries. First, the gff_indexed_scan, like so:

SELECT COUNT(*)
FROM gff_indexed_scan('gff-index/gencode.v38.polyAs.gff.gz', 'chr1:1000000-2000000')

You must have an index file associated with the gff file at gff-index/gencode.v38.polyAs.gff.gz.tbi.

Similarly, you can make a table, then query it to your heart's content.

CREATE EXTERNAL TABLE indexed_gff
STORED AS INDEXED_GFF
COMPRESSION TYPE GZIP
LOCATION 'gff-index/gencode.v38.polyAs.gff.gz';

And to query:

SELECT COUNT(*)
FROM indexed_gff
WHERE gff_region_filter('chr1:1000000-2000000', seqname) = TRUE;

Flow Cytometry Standard (FCS) Files

FCS files are the output from flow cytometry experiments. They are binary files, and can be quite large. Exon can now can incorporate them in queries, and used with BioBear and ExonR by extension. For example, to scan a file with BioBear and load it into a DataFrame:

import biobear as bb

sess = bb.connect()
df = sess.sql("SELECT * FROM fcs_scan('fcs/Guava Muse.fcs')").to_polars()
print(df.head())

And that will output a table that looks like:

Forward Scatter (FSC-HLin)	Forward Scatter Width (FSC-W)	Yellow Fluorescence (YEL-HLin)	Yellow Fluorescence Width (YEL-W…	…	Time	Cell Size (FSC-HLog)	Viability (YEL-HLog)	Nucleation (RED-HLog)
481.931305	7.5	84.225601	7.5	…	35964.0	2.682985	1.925444	2.597557
1293.619629	18.75	394.14035	18.75	…	52745.0	3.111807	2.595651	3.668868
763.836182	5.75	86.909996	5.75	…	77479.0	2.883	1.93907	2.878045
1.160768	0.75	2.786609	0.75	…	126783.0	0.064745	0.445076	2.346986
338.324463	10.75	38.321301	10.75	…	141965.0	2.529333	1.58344	2.531971

As a next step, you could select specific columns (via the SELECT clause) or rows (via the WHERE clause) to further refine the query, and then pass the results to a plotting library or other analysis tool.

Faster bgzip Reading

Reading some compressed formats as gotten faster due to build improvements. For example, comparing The last version to this one when reading a larger VCF file, we see that the new version is about 1.5x faster than the last.

hyperfine './exon-benchmarks-v0.5.0 vcf-query -p WGS.vcf.gz -r chr1:10000-10000000' './exon-benchmarks-latest vcf-query -p WGS.vcf.gz -r chr1:10000-10000000'
Benchmark 1: ./exon-benchmarks-v0.5.0 vcf-query -p WGS.vcf.gz -r chr1:10000-10000000
  Time (mean ± σ):      1.430 s ±  0.055 s    [User: 5.784 s, System: 0.625 s]
  Range (min … max):    1.361 s …  1.553 s    10 runs

Benchmark 2: ./exon-benchmarks-latest vcf-query -p WGS.vcf.gz -r chr1:10000-10000000
  Time (mean ± σ):     952.6 ms ±  69.7 ms    [User: 3319.4 ms, System: 513.8 ms]
  Range (min … max):   895.1 ms … 1117.4 ms    10 runs

Summary
  './exon-benchmarks-latest vcf-query -p WGS.vcf.gz -r chr1:10000-10000000' ran
    1.50 ± 0.12 times faster than './exon-benchmarks-v0.5.0 vcf-query -p WGS.vcf.gz -r chr1:10000-10000000'

New Utility File Types

Because sequence files often have different extensions, e.g. .faa for a FASTA file consisting of Amino Acids, it can be annoying to manage that difference via the extension vs having a specific table type.

So now that are a few new table types to simplify things:

File Type	Extension	Description
`FAA`	`.faa`	Amino Acid FASTA
`FNA`	`.fna`	Nucleotide FASTA
`FQ`	`.fq`	FASTQ

Conclusion

We hope these updates make Exon, BioBear and ExonR, more useful for your bioinformatics needs. If you have any questions, please feel free to reach out.

Adding Basic Postgres Support to Exome

November 20, 2023 · 3 min read

Trent Hauck

Developer

In an effort to make it easier for our users to connect with the existing tools in their stack, we're excited to announce that Exome now supports communicating over the Postgres wire protocol. What this means is that users can use psql in the console or psycopg2 to connect to Exome and query their data.

To be clear, Exome does not yet support the pg_catalog system catalog, meaning that many BI dashboard and ETL tools will not work with Exome yet. However, we're working on adding support for pg_catalog and hope to have it available soon.

Connecting to Exome with `psql`

Assuming you've made an Exome user, you can connect the example catalog in Exome with psql using the following command:

psql -W -H pg.exome.wheretrue.com -U test@mail.com -d public.example_library.example_catalog

If that all goes well, you'll be prompted for your password and then connected to the Exome catalog. You can then run queries as you would with any other Postgres database.

$ psql -W -H pg.exome.wheretrue.com -U test@mail.com -d public.example_library.example_catalog
psql (14.9 (Homebrew), server 0.1.0)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

public.example_library.example_catalog=>

We can then run queries as we would with any other Postgres database.

public.example_library.example_catalog=> SELECT COUNT(*) FROM example_catalog.example_schema.fasta_table;
 cnt
-----
 500
(1 row)

public.example_library.example_catalog=> SELECT id, description FROM example_catalog.example_schema.fasta_table LIMIT 5;
         id          |                                                description
---------------------+-----------------------------------------------------------------------------------------------------------
 UniRef50_A0A5A9P0L4 | peptidylprolyl isomerase n=1 Tax=Triplophysa tibetana TaxID=1572043 RepID=A0A5A9P0L4_9TELE
 UniRef50_A0A410P257 | Glycogen synthase n=2 Tax=Candidatus Velamenicoccus archaeovorus TaxID=1930593 RepID=A0A410P257_9BACT
 UniRef50_A0A8J3NBY6 | Gln_amidase domain-containing protein n=2 Tax=Actinocatenispora rupis TaxID=519421 RepID=A0A8J3NBY6_9ACTN
 UniRef50_Q8WZ42     | Titin n=3053 Tax=cellular organisms TaxID=131567 RepID=TITIN_HUMAN
 UniRef50_A0A401TRQ8 | Ig-like domain-containing protein (Fragment) n=2 Tax=Chiloscyllium TaxID=34767 RepID=A0A401TRQ8_CHIPU

Connect to Exome with `psycopg2`

psycopg2 is a popular Python library for connecting to Postgres databases. It's used by many Python libraries and frameworks, including Django and SQLAlchemy. To connect to Exome with psycopg2, you'll need to install the psycopg2 package in your python environment.

pip install psycopg2

Once you've installed psycopg2, you can connect to Exome with the following code:

import psycopg2

conn = psycopg2.connect(
    host="pg.exome.wheretrue.com",
    port=5432,
    user="user@example.com",
    password="RainbowsAndUnicorns",
    dbname="public.example_library.example_catalog",
)

conn.autocommit = True

cur = conn.cursor()
cur.execute("SELECT COUNT(*) cnt FROM example_catalog.example_schema.fasta_table;")
print(cur.fetchone())

If that all goes well, you should see the following output:

(500,)

What's Next?

We're working on adding support for pg_catalog and hope to have it available soon. If you have any questions or feedback, please reach out to us at support@wheretrue.com.

Signup for Exome to try it for yourself.

Table Partitions, ExonR, and GitHub Sponsors

November 13, 2023 · 4 min read

Trent Hauck

Developer

In addition to launching the public preview of Exome last week, we've also made some updates to Exon.

First, Exon now supports Hive style partitions. This means you can augment your table definitions with partition information. This is useful because:

It allows you to augment your file schemas with additional metadata. For example, the experiment date, or the sample type.
It allows you to query subsets of your data without having to scan the entire table. For example, you can query only the data from a specific sample.

Second, ExonR has been updated to support a SQL session allowing you to create and query tables directly from R, rather than just scan entire files.

Finally, as more of an administrative note, we've added a GitHub Sponsors page. If you find our open source software useful, please consider sponsoring us. Thanks!

Table Partitions

Table partitions are encoded in the file path when you store the data and additionally in the table definition.

For example, say you had a bunch of S3 objects at:

s3://bucket/fasta_table/sample=sampleA/file1.fasta
s3://bucket/fasta_table/sample=sampleA/file2.fasta
s3://bucket/fasta_table/sample=sampleA/file3.fasta
s3://bucket/fasta_table/sample=sampleB/file4.fasta

You could create a table definition like:

CREATE EXTERNAL TABLE fasta_table
STORED AS FASTA
PARTITIONED BY (sample)
LOCATION 's3://bucket/fasta_table'

Then you could query the data like:

SELECT *
FROM fasta_table
WHERE sample = 'sampleA'

And what happens is super useful, Exon will subset the files it scans based on the predicated and augment the FASTA table with the sample column.

So you end up with something like:

id	description	sequence	sample
seq1	...	...	sampleA
seq2	...	...	sampleA

Moreover, this makes it straight forward to join with other data. For example, if you had a table of sample metadata:

CREATE EXTERNAL TABLE sample_metadata
STORED AS PARQUET
LOCATION 's3://bucket/sample_metadata/'

You could join the two tables like:

SELECT *
FROM fasta_table
JOIN sample_metadata
ON fasta_table.sample = sample_metadata.sample

Thus simplifying your analysis by allowing deeper analysis without complex ETL. Moveover, you can add additional partitions without having to reprocess your data, just add the new files to the appropriate location and Exon will pick them up.

ExonR SQL Session

R is R first programming language love 😉. So we continue to build out the R interface to Exon. The latest addition is the ability to work with Exon sessions directly from R. Previously you were only able to scan entire files, but now you can create and query tables directly from R.

For example, you can create a table like, then query it through DuckDB or just directly through a DataFrame.

First create the table from a local file.

library(exonr)

session <- ExonRSessionContext$new()
session$execute("CREATE EXTERNAL TABLE gene_annotations STORED AS GFF LOCATION 'annotations.gff'")

rdf <- session$sql("SELECT seqname, source, type, start, \"end\", score, strand, phase FROM gene_annotations")

Then for DuckDB, convert the result to an Arrow table and then load it into DuckDB.

arrow_table <- rdf$to_arrow()

con <- dbConnect(duckdb::duckdb())

arrow::to_duckdb(arrow_table, table_name = "gene_annotations", con = con)

result <- dbGetQuery(con, "SELECT * FROM gene_annotations")

df <- as.data.frame(result)

Or just work with the DataFrame directly.

rdf <- session$sql("SELECT seqname, source, type, start, \"end\", score, strand, phase FROM gene_annotations")
arrow_table <- rdf$to_arrow()

df <- data.frame(arrow_table)
#   seqname source type start end score strand phase
# 1     sq0   caat gene     8  13    NA      +  <NA>
# 2     sq0   caat gene     8  13    NA      +  <NA>
# 3     sq0   caat gene     8  13    NA      +  <NA>
# 4     sq0   caat gene     8  13    NA      +  <NA>
# 5     sq0   caat gene     8  13    NA      +  <NA>
# 6     sq0   caat gene     8  13    NA      +  <NA>

GitHub Sponsors

We've added a GitHub Sponsors page. If you find our open source software useful, please consider sponsoring us.

Learn More: WHERE TRUE Technologies GitHub Sponsors

Wrapping Up

In this post we covered two new features of Exon, table partitions and the ExonR SQL session. We hope you find these features useful and we look forward to hearing your feedback.

As always, if you have any questions or feedback, please reach out to us at support@wheretrue.com.

BAM Files & BioBear

October 12, 2023 · 9 min read

Trent Hauck

Developer

This post highlights a couple of updates to the BioBear project to support SQL sessions and improvements to BAM file support in Exon so it can scale to larger full scans (like 14G of BAM files from a 10X Genomics experiment).

BioBear Update

The main update to BioBear is that it now exposes Exon's inner query engine allowing for more streamlined application in ETL and ML use-cases. If you work in Biotech/Pharma, you probably watch your normie friends use tools like polars, DataFusion, and/or DuckDB to simplify their data processing with a mix of jealousy and contempt. They reap the benefits of faster/cheaper workflows, while you schlep GFFs and BAM files between CLI tools, workflows, and scripts.

With this BioBear release it becomes simpler to use Exon's query engine in your ETL jobs, ML pipelines, and other execution methods (like AWS Lambdas).

So what does that look like. First, install the latest BioBear version:

pip install -U biobear

Then, you can connect to a local Exon session:

import biobear as bb

session = bb.connect()

With that session, you can work with bioinformatic data along side CSV, Parquet, and JSON data. As an example, imagine the case where you wanted to combine and summarize the contents of a Metagenomics NGS sample that had been run through primary and secondary analysis.

For example here we have a sample, Ga0451106 from IMG/M. Let's say we want to store the gene annotations from the primary analysis in S3 as a Parquet file. We also want to be conservative with out storage duplication, especially sequences, so we'll join the protein FASTA on a PFAM GFF, so that we only duplicate the protein sequences that are annotated with PFAMs we care about given our end goal.

Write Gene Annotations to Parquet

Taking the session from earlier, create an external table backed by the GFF file, then run a COPY query to write the results.

# Register the store we'll be writing to
session.register_object_store_from_url('s3://wtt-01-dist-prd')

# Create the external table, this could also be on S3
session.sql("""
CREATE EXTERNAL TABLE gene_annotations
STORED AS GFF
LOCATION 's3://wtt-01-dist-prd/TenflaDSM28944/IMG_Data/Ga0451106_prodigal.gff'
""")

df = session.sql("""
COPY gene_annotations
TO 's3://wtt-01-dist-prd/gene_annotations/sample=Ga0455106/gene_annotations.parquet' (FORMAT parquet)
""")

df.to_polars()
# shape: (1, 1)
# ┌───────┐
# │ count │
# │ ---   │
# │ u64   │
# ╞═══════╡
# │ 7998  │
# └───────┘

Say we did have the GFF in S3, we could load it into polars directly:

# Register the store we'll be reading from
session.register_object_store_from_url('s3://wtt-01-dist-prd')

# Create the external table, this could also be on S3
session.sql("""
CREATE EXTERNAL TABLE gene_annotations_s3
STORED AS GFF LOCATION 's3://wtt-01-dist-prd/TenflaDSM28944/IMG_Data/Ga0451106_prodigal.gff'
""")

df = session.sql("SELECT * FROM gene_annotations_s3").to_polars()
df.head()
# shape: (5, 9)
# ┌──────────────┬─────────────────┬──────┬───────┬───┬────────────┬────────┬───────┬───────────────────────────────────┐
# │ seqname      ┆ source          ┆ type ┆ start ┆ … ┆ score      ┆ strand ┆ phase ┆ attributes                        │
# │ ---          ┆ ---             ┆ ---  ┆ ---   ┆   ┆ ---        ┆ ---    ┆ ---   ┆ ---                               │
# │ str          ┆ str             ┆ str  ┆ i64   ┆   ┆ f32        ┆ str    ┆ str   ┆ list[struct[2]]                   │
# ╞══════════════╪═════════════════╪══════╪═══════╪═══╪════════════╪════════╪═══════╪═══════════════════════════════════╡
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 2     ┆ … ┆ 54.5       ┆ -      ┆ 0     ┆ [{"ID",["Ga0451106_01_2_238"]}, … │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 228   ┆ … ┆ 114.0      ┆ -      ┆ 0     ┆ [{"ID",["Ga0451106_01_228_941"]}… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 1097  ┆ … ┆ 224.399994 ┆ +      ┆ 0     ┆ [{"ID",["Ga0451106_01_1097_2257"… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 2261  ┆ … ┆ 237.699997 ┆ +      ┆ 0     ┆ [{"ID",["Ga0451106_01_2261_3787"… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 3784  ┆ … ┆ 114.400002 ┆ +      ┆ 0     ┆ [{"ID",["Ga0451106_01_3784_4548"… │
# └──────────────┴─────────────────┴──────┴───────┴───┴────────────┴────────┴───────┴───────────────────────────────────┘

Subsetting our Sequences

As mentioned, it's important here to be cognizant of storage costs and thus reduce duplication into parquet except for amino acid sequences that likely contain a certain PFAM domain.

Similar to the GFF, we can create an external table for both the protein FASTA and the PFAM GFF:

# Create the external table, this could also be on S3
session.sql("""
CREATE EXTERNAL TABLE protein_fasta
STORED AS FASTA LOCATION 'TenflaDSM28944/IMG_Data/Ga0451106_proteins.faa'
""")

session.sql("""
CREATE EXTERNAL TABLE pfam_gff
STORED AS GFF LOCATION 'TenflaDSM28944/IMG_Data/Ga0451106_pfam.gff'
""")

# Register the store we'll be writing to
session.register_object_store_from_url('s3://wtt-01-dist-prd')

gff = session.sql("""SELECT * FROM pfam_gff""").to_polars()
# shape: (5, 9)
# ┌──────────────────────────────┬─────────────────────────────┬─────────┬───────┬───┬────────────┬────────┬───────┬───────────────────────────────────┐
# │ seqname                      ┆ source                      ┆ type    ┆ start ┆ … ┆ score      ┆ strand ┆ phase ┆ attributes                        │
# │ ---                          ┆ ---                         ┆ ---     ┆ ---   ┆   ┆ ---        ┆ ---    ┆ ---   ┆ ---                               │
# │ str                          ┆ str                         ┆ str     ┆ i64   ┆   ┆ f32        ┆ str    ┆ str   ┆ list[struct[2]]                   │
# ╞══════════════════════════════╪═════════════════════════════╪═════════╪═══════╪═══╪════════════╪════════╪═══════╪═══════════════════════════════════╡
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF14684 ┆ 673   ┆ … ┆ 100.900002 ┆ .      ┆ null  ┆ [{"ID",["Ga0451106_01_1000235_10… │
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF14685 ┆ 753   ┆ … ┆ 81.900002  ┆ .      ┆ null  ┆ [{"ID",["Ga0451106_01_1000235_10… │
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF03572 ┆ 875   ┆ … ┆ 70.400002  ┆ .      ┆ null  ┆ [{"ID",["Ga0451106_01_1000235_10… │
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF07676 ┆ 459   ┆ … ┆ 16.799999  ┆ .      ┆ null  ┆ [{"ID",["Ga0451106_01_1000235_10… │
# │ Ga0451106_01_1000235_1003480 ┆ HMMER 3.1b2 (February 2015) ┆ PF07676 ┆ 408   ┆ … ┆ 15.3       ┆ .      ┆ null  ┆ [{"ID",["Ga0451106_01_1000235_10… │
# └──────────────────────────────┴─────────────────────────────┴─────────┴───────┴───┴────────────┴────────┴───────┴───────────────────────────────────┘

# Join the protein FASTA and PFAM GFF and write the results to S3
df = session.sql("""
COPY (
    SELECT DISTINCT fasta.id, fasta.description, fasta.sequence
        FROM protein_fasta AS fasta
            INNER JOIN pfam_gff AS pfam
                ON fasta.id = pfam.seqname
        WHERE pfam.type = 'PF13561' -- Enoyl-(Acyl carrier protein) reductase
) TO 's3://wtt-01-dist-prd/sequences/pfam=PF13561/sample=Ga0455106/sequences.parquet' (FORMAT parquet)
""").to_polars()

print(df)
# shape: (1, 1)
# ┌───────┐
# │ count │
# │ ---   │
# │ u64   │
# ╞═══════╡
# │ 49    │
# └───────┘

Exon BAM Update

The other update to highlight here is that, Exon, and now by extension BioBear, has undergone optimizations that allow projection and predicate pushdowns to enable more efficient queries on BAM files.

Projection Pushdown

For example, say we have we have a couple of BAM files from this 10X Genomics experiment. For simplicity lets assume one file for one sample from the treatment and one from the control, and we just want to look at the read name, the start position, and the map quality for QC purposes.

We can create an external table for the directory containing the BAM files - a listing table allows us to query the files in a directory as one table. Stop for a moment, and consider what the approach to working with multiple SAM files looks like with samtools. You'd have to write a script to loop over the files, then parse the output, and then combine the results. With Exon, you can just query the directory as a table.

$ ls -lh bam-files
-rw-r--r--@ 1 thauck  staff   6.7G Sep 29 15:33 CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_possorted_genome_bam.bam
-rw-r--r--  1 thauck  staff   2.2M Sep 29 15:33 CytAssist_FFPE_Human_Colon_Post_Xenium_Rep1_possorted_genome_bam.bam.bai
-rw-r--r--  1 thauck  staff   7.0G Sep 29 15:55 CytAssist_FFPE_Human_Colon_Rep1_possorted_genome_bam.bam
-rw-r--r--  1 thauck  staff   2.3M Sep 29 15:55 CytAssist_FFPE_Human_Colon_Rep1_possorted_genome_bam.bam.bai

As you can see, this are a couple of larger files, totaling just under 14G for the main data files, and around 5M for the associated indexes.

import biobear as bb

session = bb.connect()

session.sql("""
    CREATE EXTERNAL TABLE ten_x_experiment STORED AS BAM LOCATION 'bam-files/'
""")

# See which columns are available
columns = session.sql("""
    DESCRIBE ten_x_experiment
""").to_polars().select('column_name')
# print(columns)

df = session.sql("""
    SELECT name, start, mapping_quality
    FROM ten_x_experiment
""").to_polars()
# print(df.shape)
# (287822789, 3)

Reading about 285,000,000 records took around 3 minutes to load on my laptop, an M2 MacBook Air with 24GB of RAM. And while I recognize running on a local laptop isn't the most scientific, it does represent how many practitioners will work with data. In fact, I think that's an exciting thing about Exon and related tools: building off increased compute power for personal computers to enable local analysis of large datasets.

Predicate Pushdown

We just pushed down the projection, i.e. which fields are "physically read", but we can also push down region filters, to determine which records are read in the first place, before hitting the query's predicate in the first place. This enables quickly subsetting files to just the region of interest.

import biobear as bb

session = bb.connect()

# Note that we've changed to an INDEXED_BAM table
session.sql("""
    CREATE EXTERNAL TABLE ten_x_experiment_index STORED AS INDEXED_BAM LOCATION 'bam-files/'
""")

df = session.sql("""
    SELECT name, start, mapping_quality
    FROM ten_x_experiment_index
    WHERE bam_region_filter('chr1:1-1000000', reference, start, "end")
""").to_polars()
# print(df)
# shape: (39_090, 3)
# ┌───────────────────────────────────┬────────┬─────────────────┐
# │ name                              ┆ start  ┆ mapping_quality │
# │ ---                               ┆ ---    ┆ ---             │
# │ str                               ┆ i32    ┆ str             │
# ╞═══════════════════════════════════╪════════╪═════════════════╡
# │ A00519:1665:H3LNHDSX7:3:2631:254… ┆ 69486  ┆ null            │
# │ A00519:1665:H3LNHDSX7:3:2406:262… ┆ 69486  ┆ null            │
# │ A00519:1665:H3LNHDSX7:3:2247:326… ┆ 69486  ┆ null            │
# │ A00519:1665:H3LNHDSX7:3:2340:113… ┆ 69486  ┆ null            │
# │ …                                 ┆ …      ┆ …               │
# │ A00519:1665:H3LNHDSX7:3:2657:215… ┆ 999932 ┆ 0               │
# │ A00519:1665:H3LNHDSX7:3:2607:125… ┆ 999932 ┆ 0               │
# │ A00519:1665:H3LNHDSX7:3:2346:582… ┆ 999932 ┆ 0               │
# │ A00519:1665:H3LNHDSX7:3:2240:126… ┆ 999937 ┆ 0               │
# └───────────────────────────────────┴────────┴─────────────────┘

Conclusion

This post highlight two recent updates to our tools:

BioBear now exposes Exon's query engine allowing for more streamlined application in ETL and ML use-cases.
Exon now supports projection and predicate pushdowns to enable more efficient queries on BAM files.

As always, feel free to reach out with questions, comments, or suggestions.

Exon 0.3.0 Release

September 11, 2023 · 6 min read

Trent Hauck

Developer

Query Rewriting
Performance Improvements
Conclusion

While there's a slew of changes in the latest release of Exon, the most notable are the improvements in the VCF querying. This is the first release where we've been able to query VCF files with the same or better performance than BCFTools. This is a big deal, as BCFTools is the de facto standard for querying VCF files. Importantly, Exon is able to do this while maintaining the same SQL interface as the rest of the system. This means that you can use the same SQL query to query a VCF file as you would a CSV file.

With that, this post will touch on two topics related to the VCF querying improvements in Exon:

Query Rewriting
Performance Improvements

Query Rewriting

Many bioinformatic file formats can be indexed and searched by genomic region. This presents an opportunity for query optimization, but also presents a challenge for the UI. How to expose this functionality to the user in a way that is germane to SQL and affords the user the ability to use the index to speed up their queries?

The direction we took was to use query rewriting to recognize SQL predicates that could be rewritten to a region expression, then during physical planning, reference the underlying index with the region to push the byte range down to the file read to speed up the query. As an example of this, consider the SQL predicate: WHERE chrom = 'chr1' AND pos BETWEEN 1000 AND 2000. This would get translated to a binary expression tree that looks like:

With this latest release, Exon will now traverse the tree and rewrite the predicate to a region. This is done through custom expressions and rules that govern the conjunction of those expressions. For example, after working through the original leaves of the tree, we would have a tree with custom expressions that look like:

Note that the chromosome field is sufficiently constrained to be a region expression by itself. And the less than and greater than expressions are rewritten to interval expressions.

This is then rewritten to combine the interval and region into a further constrained region expression:

And finally we can apply the rule again to get a single region expression:

This region is then used during physical planning to get the byte range. It's interesting to note, we also apply the filters after the byte range is read to guard against the challenges of the BGZF format returning a byte range that could contain records that don't match the region.

Performance Improvements

With this release the VCF querying component of Exon got much faster and easier to use as the result of complementary changes. The end result is a querying engine that:

Uses query rewriting to recognize index regions from SQL queries. For example, WHERE chrom = ‘chr1’ and pos BETWEEN 1 and 100 becomes the query range chr1:1-100.
Use TBI index files with the region to “pushdown” the region in the query to the specific byte ranges for the BGZF block offsets, then further seek to the uncompressed position within block location of the data.
Use the query projection to determine which parts of a VCF record to parse into a typed field. For example SELECT chrom, pos would cause only chrom and pos fields to generate a 'deserialization expense'.
Fan-out processing for each byte range for a given file. For example, if there were multiple chunks for a VCF file “A” and multiple chunks for a file B for a given region, these would be distributed among the available cores for parallel processing.

In visual form, this roughly looks like:

These changes were somewhat involved, but resulted in nice performance gains for the VCF subcomponent of Exon.

For example, working on the file ALL.chr17.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz.tbi, a 3.7G file from the 1000 Genomes project, we can see the following performance when comparing Exon and BCFTools when querying the region 17:100-10000000.

Alt text

Here is another 1000 Genomes Project file, CCDG_14151_B01_GRM_WGS_2020-08-05_chr1.filtered.shapeit2-duohmm-phased.vcf.gz, a little less chonky at 2.6G, but still a good test case. Here we are querying the region chr1:10000-10000000.

Alt text

The exon-vcf-query-two-files row is two files queried in parallel. As you can see, there is little overhead when adding a second file, making Exon match the multi-file / multi-processor world of modern bioinformatics. To be transparent, Exon currently is a bit slower than BCFTools when querying a single file on an object store due to some inefficiencies in how remote indexes are read, but we believe we can close that gap in the next release.

Conclusion

With this release, Exon has reached performance equivalency with BCFTools on single-file reads, and has surpassed BCFTools on multi-file reads. We also believe this approach will allow us to continue to improve the performance of Exon across many indexable file types as we continue to work on the project.

Finally, thank you if you made it this far. We hope you enjoyed this post. If you have any questions or comments, please feel free to reach out to us on Twitter.

Highlighting some releases in July 2023

August 1, 2023 · 5 min read

Trent Hauck

Developer

Parallel Scan
MzML Reader
- A Sneak Peak for Querying
- A Note on Performance
Pandas Support in BioBear
Conclusion

We wanted to highlight three updates that we made in July 2023.

First, Exon now will rewrite queries to scan multiple files in parallel. This enables considerable speedups for queries over large datasets -- both for local files and object stores (e.g. S3, GCS, etc.).

Second, we're releasing an initial version of an MzML reader in our suite of tools. This is our first step towards supporting proteomics and metabolomics data in exon. This format is subject to change as we get feedback from users, so please let us know what you think!

Third, we're adding support for pandas dataframes in BioBear. This is a common request, so we're excited to be able to support it.

Parallel Scan

Not only is going fast fun, but it's also important for enabling the fast iteration cycles necessary for innovation. To that end, Exon and related tools will now automatically parallelize scans over multiple files.

For example, if you have a directory with FASTA files, you can now do the following:

import biobear as bb
from pathlib import Path

data = Path("./data/") # folder with fasta files
df = FastaReader(data).to_polars()  # need to install polars

And Exon will distribute reading the files across available cores. This is a huge win for data lakes, where you can now read multiple files in parallel without having to manually shard your data. E.g. here we see a 4x speed-up when reading 8 files.

MzML Reader

MzML is a very common format in metabolomics and proteomics. At its core are measurements of the spectrum from the mass spec, but there's also a Controlled Vocabulary that describes various metadata about the experiment.

Similar to other readers, you instantiate a reader and then depending on what you want to do with it, you can convert it to arrow or polars with BioBear, or a data.frame with R.

import biobear as bb

# object stores and local files are supported
df = bb.MzMLReader("mzml-data/GNPS00002_A3_p.mzML").to_polars()

df.head()
# shape: (5, 6)
# ┌───────────────────────────────────┬───────────────────────────────────┬───────────────────────────────────┬────────────┬───────────────────────────────────┬────────────────┐
# │ id                                ┆ mz                                ┆ intensity                         ┆ wavelength ┆ cv_params                         ┆ precursor_list │
# │ ---                               ┆ ---                               ┆ ---                               ┆ ---        ┆ ---                               ┆ ---            │
# │ str                               ┆ struct[1]                         ┆ struct[1]                         ┆ f32        ┆ object                            ┆ object         │
# ╞═══════════════════════════════════╪═══════════════════════════════════╪═══════════════════════════════════╪════════════╪═══════════════════════════════════╪════════════════╡
# │ controllerType=0 controllerNumbe… ┆ {[80.948143, 80.993134, … 957.16… ┆ {[5416.88916, 7712.663574, … 480… ┆ null       ┆ [('MS:1000579', {'accession': 'M… ┆ None           │
# │ controllerType=0 controllerNumbe… ┆ {[80.284782, 81.017242, … 1176.3… ┆ {[4701.711426, 5527.709961, … 95… ┆ null       ┆ [('MS:1000579', {'accession': 'M… ┆ None           │
# │ controllerType=0 controllerNumbe… ┆ {[80.948288, 81.017365, … 1193.9… ┆ {[11152.501953, 9900.050781, … 8… ┆ null       ┆ [('MS:1000579', {'accession': 'M… ┆ None           │
# │ controllerType=0 controllerNumbe… ┆ {[80.948143, 81.017372, … 1074.1… ┆ {[6677.85791, 11401.181641, … 12… ┆ null       ┆ [('MS:1000579', {'accession': 'M… ┆ None           │
# │ controllerType=0 controllerNumbe… ┆ {[80.948174, 81.017387, … 968.84… ┆ {[6586.833984, 13542.833008, … 6… ┆ null       ┆ [('MS:1000579', {'accession': 'M… ┆ None           │
# └───────────────────────────────────┴───────────────────────────────────┴───────────────────────────────────┴────────────┴───────────────────────────────────┴────────────────┘

For R users, see the read_mzml_file function. And for exon-duckdb users see

Please see the docs for more information about MzML files, and again, please let us know if you have any feedback!

A Sneak Peak for Querying

While it's not quite ready yet for BioBear and ExonR, in Exon, we've added a set of MzML specific functions for querying. For example, you can filter spectra based on the presence of a specific peaks.

For example, this will filter a set of spectra to only those that have a peak at 100.0 m/z that's within 0.1 m/z of the peak.

SELECT id
FROM mzml -- same table as above
WHERE contains_peak(mz.mz, 100.0, 0.1) = true

This sort of declarative informatics is part of the Exon vision, which we'll be talking more about in the future.

A Note on Performance

Similar to BioBear comparing favorably to other scientific data libraries, we see modest out-performance for MzML file reading.

The following compares reading using a few difference libraries in Python on an MzML file with about 8k spectra.

Pandas Support in BioBear

There's not much to say besides that places where you could use to_polars you can now also use to_pandas. E.g.

import biobear as bb
from pathlib import Path

data = Path("./data/uniref50.fasta")
df = FastaReader(data).to_pandas()  # need to install pandas

df.head()
#                     id                                        description                                           sequence
# 0  UniRef50_A0A5A9P0L4  peptidylprolyl isomerase n=1 Tax=Triplophysa t...  MEEITQIKKRLSQTVRLEGKEDLLSKKDSITNLKTEEHVSVKKMVI...
# 1  UniRef50_A0A410P257  Glycogen synthase n=2 Tax=Candidatus Velamenic...  MKAIAWLIVLTFLPEQVAWAVDYNLRGALHGAVAPLVSAATVATDG...
# 2  UniRef50_A0A8J3NBY6  Gln_amidase domain-containing protein n=2 Tax=...  MEILGRNLPRILGNLVKTIKTAPVRVVARRGARTLTQKEFGKYLGS...
# 3      UniRef50_Q8WZ42  Titin n=3053 Tax=cellular organisms TaxID=1315...  MTTQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVI...
# 4  UniRef50_A0A401TRQ8  Ig-like domain-containing protein (Fragment) n...  PPSFIHKPDPQEVLPGSNVKFTSVVTGTAPLKVSWFKGTTELVAGR...

Conclusion

We hope you enjoy the new features and please let us know if you have any feedback! We'll be back next month with more updates.

SDF Files​

Reading SDF Files​

BioBear​

Exon​

File Schema​

What's Missing?​

Read and Write Support for FASTA and FASTQ Files​

Compression​

Integer Encoding of DNA and Protein Sequences​

Setting the Encoding for all FASTA Files​

Conclusion​

Tabular Example​

What this Looks like in Python​

DataFrames​

Types of Joins​

Inner Join​

Left Join​

pLM Example with BioBear​

Start a New Session and Load Data​

Nextflow Example​

Conclusion​

Table Functions​

Scanning and Querying​

Example Scanning Function​

Example Query Function​

Indexed GFF Files​

Flow Cytometry Standard (FCS) Files​

Faster bgzip Reading​

New Utility File Types​

Conclusion​

Connecting to Exome with psql​

Connect to Exome with psycopg2​

What's Next?​

Table Partitions​

ExonR SQL Session​

GitHub Sponsors​

Wrapping Up​

BioBear Update​

Write Gene Annotations to Parquet​

Subsetting our Sequences​

Exon BAM Update​

Projection Pushdown​

Predicate Pushdown​

Conclusion​

Query Rewriting​

Performance Improvements​

Conclusion​

Parallel Scan​

MzML Reader​

A Sneak Peak for Querying​

A Note on Performance​

Pandas Support in BioBear​

Conclusion​

SDF Files

Reading SDF Files

BioBear

Exon

File Schema

What's Missing?

Read and Write Support for FASTA and FASTQ Files

Compression

Integer Encoding of DNA and Protein Sequences

Setting the Encoding for all FASTA Files

Conclusion

Tabular Example

What this Looks like in Python

DataFrames

Types of Joins

Inner Join

Left Join

pLM Example with BioBear

Start a New Session and Load Data

Nextflow Example

Conclusion

Table Functions

Scanning and Querying

Example Scanning Function

Example Query Function

Indexed GFF Files

Flow Cytometry Standard (FCS) Files

Faster bgzip Reading

New Utility File Types

Conclusion

Connecting to Exome with `psql`

Connect to Exome with `psycopg2`

What's Next?

Table Partitions

ExonR SQL Session

GitHub Sponsors

Wrapping Up

BioBear Update

Write Gene Annotations to Parquet

Subsetting our Sequences

Exon BAM Update

Projection Pushdown

Predicate Pushdown

Conclusion

Query Rewriting

Performance Improvements

Conclusion

Parallel Scan

MzML Reader

A Sneak Peak for Querying

A Note on Performance

Pandas Support in BioBear

Conclusion