Exon SQL Reference
At its core, Exon supports a fairly common SQL dialect, however there's a number of additional features that are worth knowing to get the most out of Exon.
Table Functions
Exon exposes a number of table functions that can be used to read data directly from files or groups of files. It's a faster alternative to creating an external table and then querying it.
Scan Functions
These functions are used to scan a dataset and return a table.
Function Name | Required Arguments | Optional Arguments | Description |
---|---|---|---|
fasta_scan | location | compression_type | Scans a FASTA dataset. |
fastq_scan | location | compression_type | Scans a FASTQ dataset. |
gff_scan | location | compression_type | Scans a GFF dataset. |
gtf_scan | location | compression_type | Scans a GTF dataset. |
bed_scan | location | compression_type | Scans a BED dataset. |
hmm_dom_tab_scan | location | compression_type | Scans a HMMER DOMTAB dataset. |
genbank_scan | location | compression_type | Scans a GenBank dataset. |
sam_scan | location | compression_type | Scans a SAM dataset. |
bam_scan | location | - | Scans a BAM dataset. |
vcf_scan | location | compression_type | Scans a VCF dataset. |
bcf_scan | location | - | Scans a BCF dataset. |
fcs_scan | location | - | Scans an FCS dataset. |
Example(s)
Select all data from a FASTA file:
SELECT *
FROM fasta_scan('s3://bucket/file.fasta');
Indexed Scan Functions
These functions are used to scan indexed file types. They take the location (like file path) and a region literal and/or a region file path to filter by.
Function Name | Required Arguments | Optional Arguments | Description |
---|---|---|---|
bam_indexed_scan | location , region | - | Filters a BAM dataset by region. |
vcf_indexed_scan | location , region | - | Filters a VCF dataset by region. |
gff_indexed_scan | location , region | - | Filters a GFF dataset by region. |
fasta_indexed_scan | location , region | - | Filters a FASTA dataset by region. |
fasta_indexed_scan | location , region_file | - | Filters a FASTA dataset by regions, includes all regions in the file. |
Example(s)
Select all data from a group of BAM files in S3 that overlap the region chr1:1000-2000
:
SELECT *
FROM bam_indexed_scan('s3://bucket/experiment123/', 'chr1:1000-2000');
Indexed FASTA Files
You can work with indexed FASTA files. This is useful for quickly accessing specific sequences or regions in large sequence sets.
⚠️ You must have an index file associated with the FASTA file. The index file must have the same name as the FASTA file, but with a
.fai
extension. For example, if your FASTA file istest.fasta
, the index file must betest.fasta.fai
.
When querying an indexed FASTA file, you have two options:
Option One: You can pass one region directly to the fasta_indexed_scan
function, like so:
SELECT *
FROM fasta_indexed_scan('fasta/test.fasta', 'chr1:1000000-2000000');
In this case, the function will return all sequences that overlap the region chr1:1000000-2000000
.
Option Two: You can pass a regions file to the fasta_indexed_scan
function, like so:
SELECT *
FROM fasta_indexed_scan('fasta/test.fasta', 'regions.txt');
Where regions.txt
is a file with one region per line, e.g.:
chr1:1000000-2000000
chr2:1000000-2000000
This also works on Object Stores, like S3. You'll just need to pass the full path to the file, e.g. s3://my-bucket/fasta/test.fasta
and s3://my-bucket/regions.txt
.
SELECT *
FROM fasta_indexed_scan('s3://my-bucket/fasta/test.fasta', 's3://my-bucket/regions.txt');
The id of the returned sequence will be the region name, and the sequence will be the sequence of the region.
Scalar Functions
These function can be applied in the SELECT
clause to transform data.
gc_content
For an input sequence, returns the GC content as a float.
SELECT gc_content('ATGC');
quality_scores_to_string
For an input list of quality scores on the Phred scale, returns a string of the quality scores.
SELECT quality_scores_to_string([2, 3, 4])
-- Returns #$%
quality_scores_to_list
For an input string of quality scores, returns a list of quality scores on the Phred scale.
SELECT quality_scores_to_list('#$%')
-- Returns [2, 3, 4]
trim_polya
For an input sequence, returns the sequence with any poly-A tail removed.
SELECT trim_polya('ATCGAAAA')
-- Returns ATCG
alignment_score
For an input sequence and a reference sequence, returns the (local) alignment score.
SELECT alignment_score('ATCG', 'ATCGG')
-- Returns 4
Creating A Table
Tables backed by file(s) can be created using the CREATE EXTERNAL TABLE
statement. The syntax is as follows:
CREATE EXTERNAL TABLE
[ IF NOT EXISTS ]
<TABLE_NAME>
STORED AS <FILE_TYPE>
[ COMPRESSION TYPE <GZIP | ZSTD> ]
[ PARTITIONED BY (<column list>) ]
[ OPTIONS (<key_value_list>) ]
LOCATION <literal>
Breaking this down into its components:
IF NOT EXISTS
- If the table already exists, don't throw an error. Optional.<TABLE_NAME>
- The name of the table to create. This needs to be the fully qualified name, including the database name and schema name.<FILE_TYPE>
- The type of file(s) that the table is backed by. See File Formats for more information.COMPRESSION TYPE
- The compression type of the file(s). Optional.PARTITIONED BY (<column list>)
- A comma separated list of columns to partition the table by. Optional.OPTIONS (<key_value_list>)
- A comma separated list of key value pairs. Optional.LOCATION <literal>
- The location of the file(s) backing the table. This can be a local file path, or a URI to a file in S3 or GCS.
Different file types have different functionality, so see the File Formats section for more information.