Exon SQL Reference
At its core, Exon supports a fairly common SQL dialect, however there's a number of additional features that are worth knowing to get the most out of Exon.
Creating A Table
Tables backed by file(s) can be created using the CREATE EXTERNAL TABLE
statement. The syntax is as follows:
CREATE EXTERNAL TABLE
[ IF NOT EXISTS ]
<TABLE_NAME>
STORED AS <FILE_TYPE>
[ COMPRESSION TYPE <GZIP | ZSTD> ]
[ PARTITIONED BY (<column list>) ]
[ OPTIONS (<key_value_list>) ]
LOCATION <literal>
Breaking this down into its components:
IF NOT EXISTS
- If the table already exists, don't throw an error. Optional.<TABLE_NAME>
- The name of the table to create. This needs to be the fully qualified name, including the database name and schema name.<FILE_TYPE>
- The type of file(s) that the table is backed by. See File Formats for more information.COMPRESSION TYPE
- The compression type of the file(s). Optional.<column list>
- A comma separated list of columns to partition the table by. Optional.<key_value_list>
- A comma separated list of key value pairs. Optional.LOCATION <literal>
- The location of the file(s) backing the table. This can be a local file path, or a URI to a file in S3 or GCS.
Compression Type
For the most part, Exon supports GZIP
, ZSTD
, and UNCOMPRESSED
(omitted). In some cases, Exon will interpret GZIP
and block gzip, especially when it is germane to the file type.
Examples
Create a table backed by a BAM file:
CREATE EXTERNAL TABLE my_bam
STORED AS BAM
LOCATION 's3://bucket/file.bam';
Create a table backed by a VCF file that is stored in GCS and compressed with GZIP:
CREATE EXTERNAL TABLE my_vcf
STORED AS VCF
COMPRESSION TYPE GZIP
LOCATION 'gs://bucket/file.vcf.gz';
Create a table backed by a object store directory containing a number of indexed VCF files:
CREATE EXTERNAL TABLE my_vcf
STORED AS INDEXED_VCF
COMPRESSION TYPE GZIP
LOCATION 's3://bucket/vcf_files/';
Create a table backed by an object store directory containing a number of FASTA files, where the object store layout follows hive partitioning. E.g. s3://bucket/fasta_files/reference_name=chr1/file.fa
CREATE EXTERNAL TABLE my_fasta
STORED AS FASTA
PARTITIONED BY (reference_name)
LOCATION 's3://bucket/fasta_files/';
File Formats
Depending on the underlying file type a table represents, Exon will expose different functionality. The following sections describe the functionality exposed by each file type.
Currently, Exon supports the following file types:
File Type | Description |
---|---|
BAM | A BAM file. |
BCF | A BCF file. |
BED | A BED file. |
FASTA | A FASTA file. |
FASTQ | A FASTQ file. |
FCS | An FCS file. |
GENBANK | A GenBank file. |
GFF | A GFF file. |
GTF | A GTF file. |
HMMDOMTAB | A HMMER DOMTAB file. |
INDEXED_BAM | A BAM file with an associated index file. |
INDEXED_VCF | A VCF file with an associated index file. |
MZML | An mzML file. |
SAM | A SAM file. |
VCF | A VCF file. |
FASTA Files
FASTA files can be queried individually or as a collection of files.
Controlling Sequence Capacity
While you generally don't need to worry about this, Exon can be updated to control how big of buffer to create when reading the sequences. This can improve speed, but will use more memory. The default is 1000 characters, but can be changed by setting the exon.fasta_sequence_buffer_capacity
variable.
SET exon.fasta_sequence_buffer_capacity = 1048;
BAM Files
BAM Files can be associated with a BAM
or INDEXED_BAM
file types. BAM
files may or may not be indexed, but INDEXED_BAM
files must have an associated index file. INDEXED_BAM
is used a a failsafe if you know your file is indexed, but Exon is unable to detect it which could lead to very slow queries on very large files.
Querying an Indexed BAM File
When you have a BAM
or INDEXED_BAM
table that has an associated index, you can make use of that index by filtering for the region in the WHERE
clause.
For example, say I had a table called my_bam
that was backed by a file in S3 called s3://bucket/file.bam
and an index called s3://bucket/file.bam.bai
. I could query for all reads in the region chr1:1000-2000
like so:
SELECT *
FROM my_bam
WHERE bam_region_filter('chr1:1000-2000', reference, start, end) = true
If I only cared about the reference sequence name, I could omit the start and end columns:
SELECT *
FROM my_bam
WHERE bam_region_filter('chr1', reference) = true
VCF Files
VCF Files can be associated with a VCF
or INDEXED_VCF
file types. VCF
files may or may not be indexed, but INDEXED_VCF
files must have an associated index file. INDEXED_VCF
is used a a failsafe if you know your file is indexed, but Exon is unable to detect it which could lead to very slow queries on very large files.
Query an Indexed VCF File
Similar to querying an indexed BAM file, you can query an indexed VCF file by using the vcf_region_filter
function. It takes similar arguments to bam_region_filter
, except rather than the start
and end
columns, it takes the pos
column.
SELECT *
FROM my_vcf
WHERE vcf_region_filter('chr1:1000-2000', reference, pos) = true
Controlling INFO
and FORMAT
(Genotypes) Parsing
VCF files can contain a large number of INFO
and FORMAT
fields, and parsing them all can be expensive. By default, Exon will parse all INFO
and FORMAT
and struct and list of structs, respectively. However, you can control if they're parsed or just returned as strings by using the following options:
SET exon.vcf_parse_info = true;
SET exon.vcf_parse_format = true;
MzML Files
MzML files can be associated with a MZML
file type. And they have the following schema:
Column Name | Type | Nullable | Description |
---|---|---|---|
id | String | NO | The unique identifier for the spectrum |
mz | Struct{mz: FLOAT64[]} | YES | The m/z value of the spectrum |
intensity | Struct{intensity: FLOAT64[]} | YES | The intensity value of the spectrum |
wavelength | Struct{wavelength: FLOAT64[]} | YES | The wavelength value of the spectrum |
cv_params | Map<STRING, STRING> | YES | The CV parameters of the spectrum |
precursor_list | Struct | YES | The precursor list of the spectrum |
Example
To select all data:
SELECT *
FROM mzml_table;
To select five example M/Z arrays:
SELECT mz.mz AS mz
FROM mzml_table
LIMIT 5;