This post will walk through some of the recent updates to Exon, and show how they can be used with BioBear and/or ExonR.
These updates include:
- Addition of several table functions to scan various bioinformatic files and query some indexed files. These table functions require less boilerplate SQL, so they are a nice quality of life improvement.
- Exon has gained the ability to do index seeks on GFF files for region queries. Users can now access specific regions across large annotation sets without resorting to costly sequential scans.
- Flow Cytometry Standard (FCS) files are now supported, and can help scientists go very quickly from the FCS binary format to a DataFrame for analysis.
- Various optimizations led to performance improvements for bgzipped files.
- New utility table types were added to align more simply with file extensions: FNA (.fna), FAA (.faa), and FQ (.fq).
Table Functions
The addition of table functions is mostly a quality of life improvement, but it's good to live good, so to speak. A table function returns a table that is incorporated into the larger query plan. For example, in
SELECT COUNT(*) FROM gff_scan('gff/test.gff.gz');
gff_scan creates a table which is then incorporated into the overall query to count the contents of that file. This is the same thing as doing:
CREATE EXTERNAL TABLE gff_file
STORED AS GFF
COMPRESSION TYPE GZIP
LOCATION 'gff/test.gff.gz';
SELECT COUNT(*) FROM gff_file;
As you can see, it's considerably less typing. Or, if we were to use BioBear, we can see its ease:
import biobear as bb
sess = bb.connect()
df = sess.sql("SELECT * FROM gff_scan('gff/test.gff.gz')").to_polars()
print(df.head())
The obvious downside is that a table function is not registered in the database's catalog, so it can't be reused by name the way a table can. However, given these are backed by an external file, which won't change, the point is often moot.
Scanning and Querying
As of this release, there are two types of table functions available: scanning, for bulk scanning of all underlying records, and querying, for specific region access for indexed files.
Please see the table function docs for what's specifically available, but generally all current file types have scan support, and the indexed file types are VCF, SAM, and GFF.
Example Scanning Function
Scanning functions take a couple of options. First, they all require the location of the table. This can be a single file or a folder. Second, if applicable, they can take the compression of the underlying file. The table function will also try to infer the compression type, but this only works for single files. If you pass a folder/prefix, you must pass a compression type unless the files are actually uncompressed.
We already saw one example, but we can also scan other file types and potentially include them in larger queries. For example,
SELECT *
FROM fasta_scan('fasta/test.fasta') AS f
JOIN gff_scan('gff/test.gff.gz') AS g
ON f.id = g.seqname;
Again, see the table function docs for the list of available functions, and more examples.
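The location and compression rules described above can be sketched as a small helper that builds the SQL text for a scan function. This is a minimal sketch, not part of Exon or BioBear: `scan_sql` is a hypothetical name, and passing the compression as a second string argument (here `'gzip'`) is an assumption based on the description above, so check the table function docs for the exact form.

```python
from typing import Optional


def scan_sql(function: str, location: str, compression: Optional[str] = None) -> str:
    """Build a SELECT over a scan table function such as gff_scan.

    `location` may be a single file or a folder/prefix. The compression
    argument and its spelling are assumptions made for this sketch.
    """
    args = [f"'{location}'"]
    if compression is not None:
        args.append(f"'{compression}'")
    return f"SELECT * FROM {function}({', '.join(args)})"


# Single file: rely on the function inferring the compression.
print(scan_sql("gff_scan", "gff/test.gff.gz"))
# Folder of compressed files: pass the compression explicitly.
print(scan_sql("gff_scan", "gff/", compression="gzip"))
```

The resulting string can be passed to `sess.sql(...)` in BioBear in the same way as the earlier examples.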
Example Query Function
Querying indexed files can save a lot of time because only the bytes that belong to the query region are read. The newly added bam_indexed_scan, vcf_indexed_scan, and gff_indexed_scan functions allow querying those respective files.
With BioBear, we can easily write a script that searches a region and saves the results as Parquet.
import biobear as bb

def query(path: str, region: str):
    # Read only the records in `region`, then save them as Parquet.
    sess = bb.connect()
    df = sess.sql(f"SELECT * FROM vcf_indexed_scan('{path}', '{region}')").to_polars()
    df.write_parquet(f"{region}.parquet")

if __name__ == "__main__":
    query("vcf/index.vcf.gz", "1")
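If you need several regions, the same idea extends to a loop. The sketch below is one possible way to do it; `region_query`, `parquet_name`, and `query_regions` are hypothetical helper names, and it uses only the BioBear calls already shown in this post. Region strings such as `chr1:10000-10000000` contain a colon, which is awkward in file names on some systems, so the sketch replaces such characters with underscores.

```python
import re
from typing import Iterable


def region_query(path: str, region: str) -> str:
    """Build the indexed-scan query for one region of an indexed VCF."""
    return f"SELECT * FROM vcf_indexed_scan('{path}', '{region}')"


def parquet_name(region: str) -> str:
    """Turn a region string into a filesystem-friendly Parquet file name."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", region) + ".parquet"


def query_regions(path: str, regions: Iterable[str]) -> None:
    """Run one indexed scan per region and write each result to Parquet."""
    import biobear as bb  # imported here so the helpers above work without it

    sess = bb.connect()  # one session reused for every region
    for region in regions:
        df = sess.sql(region_query(path, region)).to_polars()
        df.write_parquet(parquet_name(region))


if __name__ == "__main__":
    print(parquet_name("chr1:10000-10000000"))
```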
Indexed GFF Files
As mentioned, Exon can now do index seeks on GFF files to markedly reduce query times when a specific sequence or region is needed.
This access pattern comes up often: in metagenomics, to get the neighboring annotations for an enzyme of interest, or in human health, to query regions near some pathogenic variant.
Similar to SAM and VCF index support in Exon, there are two ways to run queries. First, use the gff_indexed_scan table function, like so:
SELECT COUNT(*)
FROM gff_indexed_scan('gff-index/gencode.v38.polyAs.gff.gz', 'chr1:1000000-2000000')
You must have an index file associated with the GFF file; for this example, it lives at gff-index/gencode.v38.polyAs.gff.gz.tbi.
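Because a missing index is an easy mistake to make, it can be worth checking for it before running a query. The following is a minimal sketch with hypothetical helper names (`tabix_index_path`, `has_index`). It assumes the index sits next to the GFF file with `.tbi` appended, as in the example above.

```python
from pathlib import Path


def tabix_index_path(gff_path: str) -> Path:
    """Return the expected index location: the GFF path with '.tbi' appended."""
    return Path(gff_path + ".tbi")


def has_index(gff_path: str) -> bool:
    """True when the expected .tbi file exists on the local filesystem."""
    return tabix_index_path(gff_path).is_file()


if __name__ == "__main__":
    path = "gff-index/gencode.v38.polyAs.gff.gz"
    if not has_index(path):
        print(f"missing index: {tabix_index_path(path)}")
```

This check only covers local files; a file in object storage would need a different existence test.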
Second, you can make a table, then query it to your heart's content.
CREATE EXTERNAL TABLE indexed_gff
STORED AS INDEXED_GFF
COMPRESSION TYPE GZIP
LOCATION 'gff-index/gencode.v38.polyAs.gff.gz';
And to query:
SELECT COUNT(*)
FROM indexed_gff
WHERE gff_region_filter('chr1:1000000-2000000', seqname) = TRUE;
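To make the region strings used above concrete, here is a small sketch of how such a string can be parsed and compared against a feature. It illustrates the idea only and is not how Exon implements `gff_region_filter`. The `Region`, `parse_region`, and `overlaps` names are hypothetical, and the 1-based inclusive interpretation of the coordinates is an assumption.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    name: str
    start: Optional[int] = None  # assumed 1-based, inclusive
    end: Optional[int] = None    # assumed 1-based, inclusive


def parse_region(text: str) -> Region:
    """Parse 'chr1:1000000-2000000' or a bare sequence name such as '1'."""
    if ":" not in text:
        return Region(text)
    name, _, span = text.rpartition(":")
    start, _, end = span.partition("-")
    return Region(name, int(start), int(end))


def overlaps(region: Region, seqname: str, start: int, end: int) -> bool:
    """True when a feature on seqname spanning [start, end] overlaps the region."""
    if seqname != region.name:
        return False
    if region.start is None or region.end is None:
        return True  # a bare name matches the whole sequence
    return start <= region.end and end >= region.start


region = parse_region("chr1:1000000-2000000")
print(region)
print(overlaps(region, "chr1", 1999990, 2000100))
```

Only the two forms that appear in this post (a bare name, and `name:start-end`) are handled; other region spellings would need extra cases.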
Flow Cytometry Standard (FCS) Files
FCS files are the output of flow cytometry experiments. They are binary files and can be quite large. Exon can now incorporate them in queries, and by extension they can be used with BioBear and ExonR. For example, to scan a file with BioBear and load it into a DataFrame:
import biobear as bb
sess = bb.connect()
df = sess.sql("SELECT * FROM fcs_scan('fcs/Guava Muse.fcs')").to_polars()
print(df.head())
And that will output a table that looks like:
Forward Scatter (FSC-HLin) | Forward Scatter Width (FSC-W) | Yellow Fluorescence (YEL-HLin) | Yellow Fluorescence Width (YEL-W… | … | Time | Cell Size (FSC-HLog) | Viability (YEL-HLog) | Nucleation (RED-HLog) |
---|---|---|---|---|---|---|---|---|
481.931305 | 7.5 | 84.225601 | 7.5 | … | 35964.0 | 2.682985 | 1.925444 | 2.597557 |
1293.619629 | 18.75 | 394.14035 | 18.75 | … | 52745.0 | 3.111807 | 2.595651 | 3.668868 |
763.836182 | 5.75 | 86.909996 | 5.75 | … | 77479.0 | 2.883 | 1.93907 | 2.878045 |
1.160768 | 0.75 | 2.786609 | 0.75 | … | 126783.0 | 0.064745 | 0.445076 | 2.346986 |
338.324463 | 10.75 | 38.321301 | 10.75 | … | 141965.0 | 2.529333 | 1.58344 | 2.531971 |
As a next step, you could select specific columns (via the SELECT clause) or rows (via the WHERE clause) to further refine the query, and then pass the results to a plotting library or other analysis tool.
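As a sketch of that refinement, the helper below builds a query that selects two of the columns shown above and filters on a third. FCS column names contain spaces and parentheses, so they are wrapped in double quotes, the standard SQL way to quote identifiers. The helper names (`quote_ident`, `fcs_query`) and the `> 2.0` threshold are illustrative assumptions, not recommendations.

```python
from typing import Optional, Sequence


def quote_ident(name: str) -> str:
    """Double-quote a column name, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def fcs_query(path: str, columns: Sequence[str], where: Optional[str] = None) -> str:
    """Build a SELECT over fcs_scan for the given columns and optional filter."""
    cols = ", ".join(quote_ident(c) for c in columns)
    sql = f"SELECT {cols} FROM fcs_scan('{path}')"
    if where:
        sql += f" WHERE {where}"
    return sql


sql = fcs_query(
    "fcs/Guava Muse.fcs",
    ["Time", "Cell Size (FSC-HLog)"],
    where=quote_ident("Viability (YEL-HLog)") + " > 2.0",  # arbitrary cutoff
)
print(sql)
```

The resulting string can be passed to `sess.sql(sql).to_polars()` as in the example above.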
Faster bgzip Reading
Reading some compressed formats has gotten faster due to build improvements. For example, comparing the last version to this one when reading a larger VCF file, we see that the new version is about 1.5x faster.
hyperfine './exon-benchmarks-v0.5.0 vcf-query -p WGS.vcf.gz -r chr1:10000-10000000' './exon-benchmarks-latest vcf-query -p WGS.vcf.gz -r chr1:10000-10000000'
Benchmark 1: ./exon-benchmarks-v0.5.0 vcf-query -p WGS.vcf.gz -r chr1:10000-10000000
Time (mean ± σ): 1.430 s ± 0.055 s [User: 5.784 s, System: 0.625 s]
Range (min … max): 1.361 s … 1.553 s 10 runs
Benchmark 2: ./exon-benchmarks-latest vcf-query -p WGS.vcf.gz -r chr1:10000-10000000
Time (mean ± σ): 952.6 ms ± 69.7 ms [User: 3319.4 ms, System: 513.8 ms]
Range (min … max): 895.1 ms … 1117.4 ms 10 runs
Summary
'./exon-benchmarks-latest vcf-query -p WGS.vcf.gz -r chr1:10000-10000000' ran
1.50 ± 0.12 times faster than './exon-benchmarks-v0.5.0 vcf-query -p WGS.vcf.gz -r chr1:10000-10000000'
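The summary figure can be checked by hand from the two means and standard deviations reported above. The sketch below uses first-order error propagation for a ratio of two independent measurements; the assumption that the two benchmarks are independent is ours.

```python
import math

old_mean, old_sd = 1.430, 0.055    # seconds, v0.5.0
new_mean, new_sd = 0.9526, 0.0697  # seconds, latest

speedup = old_mean / new_mean
# Relative errors add in quadrature for a ratio of independent quantities.
rel_err = math.sqrt((old_sd / old_mean) ** 2 + (new_sd / new_mean) ** 2)
speedup_sd = speedup * rel_err

print(f"{speedup:.2f} ± {speedup_sd:.2f}")  # prints "1.50 ± 0.12"
```

The result agrees with the `1.50 ± 0.12` line in the benchmark summary.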
New Utility File Types
Because sequence files often have different extensions, e.g. .faa for a FASTA file consisting of amino acids, it can be annoying to manage that difference via the extension rather than having a specific table type.
So now there are a few new table types to simplify things:
File Type | Extension | Description |
---|---|---|
FAA | .faa | Amino Acid FASTA |
FNA | .fna | Nucleotide FASTA |
FQ | .fq | FASTQ |
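As an illustration of how these types line up with extensions, the sketch below picks a table type from a file's extension and builds a CREATE EXTERNAL TABLE statement in the same shape as the earlier GFF example, with the compression clause left out for an uncompressed file. The helper name, table name, and file path are hypothetical, and the exact DDL accepted for these types should be confirmed against the docs.

```python
from pathlib import Path

# Extension -> table type, taken from the table above.
TABLE_TYPES = {".faa": "FAA", ".fna": "FNA", ".fq": "FQ"}


def create_table_sql(table_name: str, location: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement for an uncompressed sequence file."""
    suffix = Path(location).suffix.lower()
    if suffix not in TABLE_TYPES:
        raise ValueError(f"no utility table type for {suffix!r}")
    return (
        f"CREATE EXTERNAL TABLE {table_name} "
        f"STORED AS {TABLE_TYPES[suffix]} "
        f"LOCATION '{location}';"
    )


print(create_table_sql("proteins", "seqs/proteins.faa"))
```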
Conclusion
We hope these updates make Exon, BioBear, and ExonR more useful for your bioinformatics needs. If you have any questions, please feel free to reach out.