Skip to main content

Exon SQL Reference

At its core, Exon supports a fairly common SQL dialect, however there's a number of additional features that are worth knowing to get the most out of Exon.

Table Functions

Exon exposes a number of table functions that can be used to read data directly from files or groups of files. It's a faster alternative to creating an external table and then querying it.

Scan Functions

These functions are used to scan a dataset and return a table.

Function NameRequired ArgumentsOptional ArgumentsDescription
fasta_scanlocationcompression_typeScans a FASTA dataset.
fastq_scanlocationcompression_typeScans a FASTQ dataset.
gff_scanlocationcompression_typeScans a GFF dataset.
gtf_scanlocationcompression_typeScans a GTF dataset.
bed_scanlocationcompression_typeScans a BED dataset.
hmm_dom_tab_scanlocationcompression_typeScans a HMMER DOMTAB dataset.
genbank_scanlocationcompression_typeScans a GenBank dataset.
sam_scanlocationcompression_typeScans a SAM dataset.
bam_scanlocation-Scans a BAM dataset.
vcf_scanlocationcompression_typeScans a VCF dataset.
bcf_scanlocation-Scans a BCF dataset.
fcs_scanlocation-Scans an FCS dataset.

Example(s)

Select all data from a FASTA file:

SELECT *
FROM fasta_scan('s3://bucket/file.fasta');

Indexed Scan Functions

These functions are used to scan indexed file types. They take the location (like file path) and a region literal and/or a region file path to filter by.

Function NameRequired ArgumentsOptional ArgumentsDescription
bam_indexed_scanlocation, region-Filters a BAM dataset by region.
vcf_indexed_scanlocation, region-Filters a VCF dataset by region.
gff_indexed_scanlocation, region-Filters a GFF dataset by region.
fasta_indexed_scanlocation, region-Filters a FASTA dataset by region.
fasta_indexed_scanlocation, region_file-Filters a FASTA dataset by regions, includes all regions in the file.

Example(s)

Select all data from a group of BAM files in S3 that overlap the region chr1:1000-2000:

SELECT *
FROM bam_indexed_scan('s3://bucket/experiment123/', 'chr1:1000-2000');

Indexed FASTA Files

You can work with indexed FASTA files. This is useful for quickly accessing specific sequences or regions in large sequence sets.

⚠️ You must have an index file associated with the FASTA file. The index file must have the same name as the FASTA file, but with a .fai extension. For example, if your FASTA file is test.fasta, the index file must be test.fasta.fai.

When querying an indexed FASTA file, you have two options:

Option One: You can pass one region directly to the fasta_indexed_scan function, like so:

SELECT *
FROM fasta_indexed_scan('fasta/test.fasta', 'chr1:1000000-2000000');

In this case, the function will return all sequences that overlap the region chr1:1000000-2000000.

Option Two: You can pass a regions file to the fasta_indexed_scan function, like so:

SELECT *
FROM fasta_indexed_scan('fasta/test.fasta', 'regions.txt');

Where regions.txt is a file with one region per line, e.g.:

chr1:1000000-2000000
chr2:1000000-2000000

This also works on Object Stores, like S3. You'll just need to pass the full path to the file, e.g. s3://my-bucket/fasta/test.fasta and s3://my-bucket/regions.txt.

SELECT *
FROM fasta_indexed_scan('s3://my-bucket/fasta/test.fasta', 's3://my-bucket/regions.txt');

The id of the returned sequence will be the region name, and the sequence will be the sequence of the region.

Scalar Functions

These function can be applied in the SELECT clause to transform data.

gc_content

For an input sequence, returns the GC content as a float.

SELECT gc_content('ATGC');

quality_scores_to_string

For an input list of quality scores on the Phred scale, returns a string of the quality scores.

SELECT quality_scores_to_string([2, 3, 4])
-- Returns #$%

quality_scores_to_list

For an input string of quality scores, returns a list of quality scores on the Phred scale.

SELECT quality_scores_to_list('#$%')
-- Returns [2, 3, 4]

trim_polya

For an input sequence, returns the sequence with any poly-A tail removed.

SELECT trim_polya('ATCGAAAA')
-- Returns ATCG

alignment_score

For an input sequence and a reference sequence, returns the (local) alignment score.

SELECT alignment_score('ATCG', 'ATCGG')
-- Returns 4

Creating A Table

Tables backed by file(s) can be created using the CREATE EXTERNAL TABLE statement. The syntax is as follows:

CREATE EXTERNAL TABLE
[ IF NOT EXISTS ]
<TABLE_NAME>
STORED AS <FILE_TYPE>
[ COMPRESSION TYPE <GZIP | ZSTD> ]
[ PARTITIONED BY (<column list>) ]
[ OPTIONS (<key_value_list>) ]
LOCATION <literal>

Breaking this down into its components:

  • IF NOT EXISTS - If the table already exists, don't throw an error. Optional.
  • <TABLE_NAME> - The name of the table to create. This needs to be the fully qualified name, including the database name and schema name.
  • <FILE_TYPE> - The type of file(s) that the table is backed by. See File Formats for more information.
  • COMPRESSION TYPE - The compression type of the file(s). Optional.
  • PARTITIONED BY (<column list>) - A comma separated list of columns to partition the table by. Optional.
  • OPTIONS (<key_value_list>) - A comma separated list of key value pairs. Optional.
  • LOCATION <literal> - The location of the file(s) backing the table. This can be a local file path, or a URI to a file in S3 or GCS.

Different file types have different functionality, so see the File Formats section for more information.

Compression Type

For the most part, Exon supports GZIP, ZSTD, and UNCOMPRESSED (omitted). In some cases, Exon will interpret GZIP and block gzip, especially when it is germane to the file type.

Examples

Create a table backed by a BAM file:

CREATE EXTERNAL TABLE my_bam
STORED AS BAM
LOCATION 's3://bucket/file.bam';

Create a table backed by a VCF file that is stored in GCS and compressed with GZIP:

CREATE EXTERNAL TABLE my_vcf
STORED AS VCF
COMPRESSION TYPE GZIP
LOCATION 'gs://bucket/file.vcf.gz';

Create a table backed by a object store directory containing a number of indexed VCF files:

CREATE EXTERNAL TABLE my_vcf
STORED AS INDEXED_VCF
COMPRESSION TYPE GZIP
LOCATION 's3://bucket/vcf_files/';

Create a table backed by an object store directory containing a number of FASTA files, where the object store layout follows hive partitioning. E.g. s3://bucket/fasta_files/reference_name=chr1/file.fa

CREATE EXTERNAL TABLE my_fasta
STORED AS FASTA
PARTITIONED BY (reference_name)
LOCATION 's3://bucket/fasta_files/';

File Formats

Depending on the underlying file type a table represents, Exon will expose different functionality. The following sections describe the functionality exposed by each file type.

Currently, Exon supports the following file types:

File TypeDescription
BAMA BAM file.
BCFA BCF file.
BEDA BED file.
FASTAA FASTA file (.fasta).
FNAA FASTA file (.fna).
FAAA FASTA file (.fna).
FASTQA FASTQ file (.fastq).
FQA FASTQ file (.fq).
FCSAn FCS file.
GENBANKA GenBank file.
GFFA GFF file.
GTFA GTF file.
HMMDOMTABA HMMER DOMTAB file.
INDEXED_BAMA BAM file with an associated index file.
INDEXED_VCFA VCF file with an associated index file.
INDEXED_VGFFA GFF file with an associated index file.
MZMLAn mzML file.
SAMA SAM file.
VCFA VCF file.

Most file types are associated with a single file type, but some are associated with multiple. For example, FASTA is associated with .fasta, .fna, and .faa files. Also note, that the extension is independent of the compression type. For example, a .fasta.gz file is still a FASTA file, though the compression type also needs to be specified.

FASTA Files

FASTA files can be queried individually or as a collection of files.

Controlling Sequence Capacity

While you generally don't need to worry about this, Exon can be updated to control how big of buffer to create when reading the sequences. This can improve speed, but will use more memory. The default is 1000 characters, but can be changed by setting the exon.fasta_sequence_buffer_capacity variable.

SET exon.fasta_sequence_buffer_capacity = 1048;

Controlling Sequence Data Type

By default, Exon will use a UTF8 datatype with i32 offsets to store the sequence field. This is normally ok, but for large sequences this can cause out of memory errors. If you'd like to use a Large UTF8 with i64 offsets, you can set the exon.fasta_large_utf8 variable to true.

SET exon.fasta_large_utf8 = true;

Doing this will increase the memory usage of Exon, but will allow you to store larger sequences. I.e. if you're working with sequences under a hundred million characters or so, you probably don't need to do this.

Controlling the FASTA File Extension

By default, Exon with look for likes that end with .fasta (or .fasta with a compression extension like .fasta.gz). If you're using something else, you can pass the file_extension option to the CREATE EXTERNAL TABLE statement.

CREATE EXTERNAL TABLE my_fasta
STORED AS FASTA
OPTIONS (file_extension 'faa')
LOCATION 's3://bucket/file.fa';

This will cause Exon to look for files that end with .faa (or .faa with a compression extension like .faa.gz).

FASTQ Files

FASTQ files are used for reads from the sequencer. Similar to FASTA, you can use pass the file_extension option to override the default file extension.

CREATE EXTERNAL TABLE my_fastq
STORED AS FASTQ
OPTIONS (file_extension 'fq')
LOCATION 's3://bucket/file.fq';

SAM Files

SAM files are associated with the SAM file type. They are similar to BAM files except they can only be scanned, and not indexed.

Similar to BAM file you can control tag parsing:

SET exon.sam_parse_tags = true;

will cause the tags to be parsed and stored as a struct rather than a list of key value pairs, which is the default.

BAM Files

BAM Files can be associated with a BAM or INDEXED_BAM file types. BAM files may or may not be indexed, but INDEXED_BAM files must have an associated index file. INDEXED_BAM is used a a failsafe if you know your file is indexed, but Exon is unable to detect it which could lead to very slow queries on very large files.

Querying an Indexed BAM File

When you have a BAM or INDEXED_BAM table that has an associated index, you can make use of that index by filtering for the region in the WHERE clause.

For example, say I had a table called my_bam that was backed by a file in S3 called s3://bucket/file.bam and an index called s3://bucket/file.bam.bai. I could query for all reads in the region chr1:1000-2000 like so:

SELECT *
FROM my_bam
WHERE bam_region_filter('chr1:1000-2000', reference, start, end) = true

If I only cared about the reference sequence name, I could omit the start and end columns:

SELECT *
FROM my_bam
WHERE bam_region_filter('chr1', reference) = true

Controlling Tag Parsing

By default, Exon will use a list of key value pairs to store the tags from a BAM file. You can however pass:

SET exon.bam_parse_tags = true;

And Exon will infer the schema of the tags and store them as a struct. This is useful because you get "typed" tags, but it can be slower and use more memory.

VCF Files

VCF Files can be associated with a VCF or INDEXED_VCF file types. VCF files may or may not be indexed, but INDEXED_VCF files must have an associated index file. INDEXED_VCF is used a a failsafe if you know your file is indexed, but Exon is unable to detect it which could lead to very slow queries on very large files.

Query an Indexed VCF File

Similar to querying an indexed BAM file, you can query an indexed VCF file by using the vcf_region_filter function. It takes similar arguments to bam_region_filter, except rather than the start and end columns, it takes the pos column.

SELECT *
FROM my_vcf
WHERE vcf_region_filter('chr1:1000-2000', reference, pos) = true

Controlling INFO and FORMAT (Genotypes) Parsing

VCF files can contain a large number of INFO and FORMAT fields, and parsing them all can be expensive. By default, Exon will parse all INFO and FORMAT and struct and list of structs, respectively. However, you can control if they're parsed or just returned as strings by using the following options:

SET exon.vcf_parse_info = true;
SET exon.vcf_parse_format = true;

MzML Files

MzML files can be associated with a MZML file type. And they have the following schema:

Column NameTypeNullableDescription
idStringNOThe unique identifier for the spectrum
mzStruct{mz: FLOAT64[]}YESThe m/z value of the spectrum
intensityStruct{intensity: FLOAT64[]}YESThe intensity value of the spectrum
wavelengthStruct{wavelength: FLOAT64[]}YESThe wavelength value of the spectrum
cv_paramsMap<STRING, STRING>YESThe CV parameters of the spectrum
precursor_listStructYESThe precursor list of the spectrum

Example

To select all data:

SELECT *
FROM mzml_table;

To select five example M/Z arrays:

SELECT mz.mz AS mz
FROM mzml_table
LIMIT 5;