Skip to main content

Exon SQL Reference

At its core, Exon supports a fairly common SQL dialect, however there's a number of additional features that are worth knowing to get the most out of Exon.

Creating A Table

Tables backed by file(s) can be created using the CREATE EXTERNAL TABLE statement. The syntax is as follows:

CREATE EXTERNAL TABLE
[ IF NOT EXISTS ]
<TABLE_NAME>
STORED AS <FILE_TYPE>
[ COMPRESSION TYPE <GZIP | ZSTD> ]
[ PARTITIONED BY (<column list>) ]
[ OPTIONS (<key_value_list>) ]
LOCATION <literal>

Breaking this down into its components:

  • IF NOT EXISTS - If the table already exists, don't throw an error. Optional.
  • <TABLE_NAME> - The name of the table to create. This needs to be the fully qualified name, including the database name and schema name.
  • <FILE_TYPE> - The type of file(s) that the table is backed by. See File Formats for more information.
  • COMPRESSION TYPE - The compression type of the file(s). Optional.
  • <column list> - A comma separated list of columns to partition the table by. Optional.
  • <key_value_list> - A comma separated list of key value pairs. Optional.
  • LOCATION <literal> - The location of the file(s) backing the table. This can be a local file path, or a URI to a file in S3 or GCS.

Compression Type

For the most part, Exon supports GZIP, ZSTD, and UNCOMPRESSED (omitted). In some cases, Exon will interpret GZIP and block gzip, especially when it is germane to the file type.

Examples

Create a table backed by a BAM file:

CREATE EXTERNAL TABLE my_bam
STORED AS BAM
LOCATION 's3://bucket/file.bam';

Create a table backed by a VCF file that is stored in GCS and compressed with GZIP:

CREATE EXTERNAL TABLE my_vcf
STORED AS VCF
COMPRESSION TYPE GZIP
LOCATION 'gs://bucket/file.vcf.gz';

Create a table backed by a object store directory containing a number of indexed VCF files:

CREATE EXTERNAL TABLE my_vcf
STORED AS INDEXED_VCF
COMPRESSION TYPE GZIP
LOCATION 's3://bucket/vcf_files/';

Create a table backed by an object store directory containing a number of FASTA files, where the object store layout follows hive partitioning. E.g. s3://bucket/fasta_files/reference_name=chr1/file.fa

CREATE EXTERNAL TABLE my_fasta
STORED AS FASTA
PARTITIONED BY (reference_name)
LOCATION 's3://bucket/fasta_files/';

File Formats

Depending on the underlying file type a table represents, Exon will expose different functionality. The following sections describe the functionality exposed by each file type.

Currently, Exon supports the following file types:

File TypeDescription
BAMA BAM file.
BCFA BCF file.
BEDA BED file.
FASTAA FASTA file.
FASTQA FASTQ file.
FCSAn FCS file.
GENBANKA GenBank file.
GFFA GFF file.
GTFA GTF file.
HMMDOMTABA HMMER DOMTAB file.
INDEXED_BAMA BAM file with an associated index file.
INDEXED_VCFA VCF file with an associated index file.
MZMLAn mzML file.
SAMA SAM file.
VCFA VCF file.

FASTA Files

FASTA files can be queried individually or as a collection of files.

Controlling Sequence Capacity

While you generally don't need to worry about this, Exon can be updated to control how big of buffer to create when reading the sequences. This can improve speed, but will use more memory. The default is 1000 characters, but can be changed by setting the exon.fasta_sequence_buffer_capacity variable.

SET exon.fasta_sequence_buffer_capacity = 1048;

BAM Files

BAM Files can be associated with a BAM or INDEXED_BAM file types. BAM files may or may not be indexed, but INDEXED_BAM files must have an associated index file. INDEXED_BAM is used a a failsafe if you know your file is indexed, but Exon is unable to detect it which could lead to very slow queries on very large files.

Querying an Indexed BAM File

When you have a BAM or INDEXED_BAM table that has an associated index, you can make use of that index by filtering for the region in the WHERE clause.

For example, say I had a table called my_bam that was backed by a file in S3 called s3://bucket/file.bam and an index called s3://bucket/file.bam.bai. I could query for all reads in the region chr1:1000-2000 like so:

SELECT *
FROM my_bam
WHERE bam_region_filter('chr1:1000-2000', reference, start, end) = true

If I only cared about the reference sequence name, I could omit the start and end columns:

SELECT *
FROM my_bam
WHERE bam_region_filter('chr1', reference) = true

VCF Files

VCF Files can be associated with a VCF or INDEXED_VCF file types. VCF files may or may not be indexed, but INDEXED_VCF files must have an associated index file. INDEXED_VCF is used a a failsafe if you know your file is indexed, but Exon is unable to detect it which could lead to very slow queries on very large files.

Query an Indexed VCF File

Similar to querying an indexed BAM file, you can query an indexed VCF file by using the vcf_region_filter function. It takes similar arguments to bam_region_filter, except rather than the start and end columns, it takes the pos column.

SELECT *
FROM my_vcf
WHERE vcf_region_filter('chr1:1000-2000', reference, pos) = true

Controlling INFO and FORMAT (Genotypes) Parsing

VCF files can contain a large number of INFO and FORMAT fields, and parsing them all can be expensive. By default, Exon will parse all INFO and FORMAT and struct and list of structs, respectively. However, you can control if they're parsed or just returned as strings by using the following options:

SET exon.vcf_parse_info = true;
SET exon.vcf_parse_format = true;

MzML Files

MzML files can be associated with a MZML file type. And they have the following schema:

Column NameTypeNullableDescription
idStringNOThe unique identifier for the spectrum
mzStruct{mz: FLOAT64[]}YESThe m/z value of the spectrum
intensityStruct{intensity: FLOAT64[]}YESThe intensity value of the spectrum
wavelengthStruct{wavelength: FLOAT64[]}YESThe wavelength value of the spectrum
cv_paramsMap<STRING, STRING>YESThe CV parameters of the spectrum
precursor_listStructYESThe precursor list of the spectrum

Example

To select all data:

SELECT *
FROM mzml_table;

To select five example M/Z arrays:

SELECT mz.mz AS mz
FROM mzml_table
LIMIT 5;