Table Functions
Table functions are functions that return a table. They are used in the FROM
clause of a query.
read_gff
read_gff_raw
read_gtf
read_genbank
read_vcf_file_records
read_fastq
read_bed_file
read_fasta
read_hmm_dom_tbl_out
read_sam_file_records
\&read_bam_file_records
query_bcf
\&query_vcf
read_gff
Read a GFF file and return a table with the results.
Examples
Read a GFF file and return a table with the results using the default compression type.
SELECT * FROM 'test.gff';
reference_sequence_name | source | annotation_type | start | end | score | strand | phase | attributes |
---|---|---|---|---|---|---|---|---|
sq0 | caat | gene | 8 | 13 | + | gene_id=caat1;gene_name=gene0; | ||
sq1 | caat | gene | 8 | 14 | 0.1 | + | 0 | gene_id=caat2;gene_name=gene0; |
Read a GFF file and return a table with the results using the gzip compression type. This is inferred from the file name. Also convert the attributes
column to a map.
SELECT reference_sequence_name, element_at(attrs, 'gene_id')[1] AS gene_id
FROM (
SELECT reference_sequence_name, gff_parse_attributes(attributes) AS attrs
FROM 'test.gff'
);
reference_sequence_name | gene_id |
---|---|
sq0 | caat1 |
sq1 | caat2 |
Read a GFF file and return a table with the results using the gzip compression type. This is explicitly provided.
SELECT reference_sequence_name, start, "end" FROM read_gff('test.gff.gz', compression='gzip');
reference_sequence_name | start | end |
---|---|---|
sq0 | 8 | 13 |
sq1 | 8 | 14 |
Read a GFF file and return a table with the results using the zstd compression type. This is explicitly provided.
SELECT attributes FROM read_gff('test.gff.zst', compression='zstd');
attributes |
---|
gene_id=caat1;gene_name=gene0; |
gene_id=caat2;gene_name=gene0; |
Result Schema
Column Name | Type | Description |
---|---|---|
reference_sequence_name | varchar | The name of the reference sequence. |
source | varchar | The source of the feature. |
annotation_type | varchar | The type of the annotation. |
start | bigint | The start of the feature. |
end | bigint | The end of the feature. |
score | float | The score of the feature. |
strand | varchar | The strand of the feature. |
phase | bigint | The phase of the feature. |
attributes | varchar | The attributes of the feature. |
Arguments
Argument | Description | Default |
---|---|---|
filepath | The path or glob to the GFF file. | |
compression | The compression type of the file. Can be one of none , gzip , or zstd . | none |
Replacement Scans
The following patterns are recognized for a replacement scan:
*.gff
*.gff.gz
*.gff.zstd
read_gff_raw
Read a GFF file and return a table with the results. This is the same as read_gff
column name-wise, but no attempt is made to parse fields that often don't follow the spec, like strand or phase. As such the result schema is slightly different.
Result Schema
Column Name | Type | Description |
---|---|---|
reference_sequence_name | varchar | The name of the reference sequence. |
source | varchar | The source of the feature. |
annotation_type | varchar | The type of the annotation. |
start | bigint | The start of the feature. |
end | bigint | The end of the feature. |
score | float | The score of the feature. |
strand | varchar | The strand of the feature. |
phase | varchar | The phase of the feature. |
attributes | varchar | The attributes of the feature. |
read_gtf
read_gtf
is the GTF analog of read_gff
.
Examples
Similar to other functions, read_gtf
can be used to read a GTF file and return a table with the results.
SELECT * FROM read_gtf('test.gtf');
Result Schema
Column Name | Type | Description |
---|---|---|
seqname | varchar | The name of the reference sequence. |
source | varchar | The source of the feature. |
type | varchar | The type of the annotation. |
start | bigint | The start of the feature. |
end | bigint | The end of the feature. |
score | float | The score of the feature. |
strand | varchar | The strand of the feature. |
frame | varchar | The phase of the feature. |
attributes | map | The attributes of the feature. |
read_genbank
Read a GenBank file and return a table with the results.
Examples
Read a GenBank file and return a table with the results.
SELECT accession, length(features) n_features, features[1] first_feature FROM 'gb/BGC0002747.gbk';
accession | n_features | first_feature |
---|---|---|
BGC0002747 | 35 | {'kind': subregion, 'location': 1..28354, 'qualifiers': {aStool=mibig, contig_edge=False, label=RiPP, subregion_number=1, tool=antismash}} |
Unnest the features column so rather than a list, there's one feature per row.
SELECT accession, UNNEST(features) features FROM 'gb/BGC0002747.gbk' LIMIT 2;
accession | features |
---|---|
BGC0002747 | {'kind': subregion, 'location': 1..28354, 'qualifiers': {aStool=mibig, contig_edge=False, label=RiPP, subregion_number=1, tool=antismash}} |
BGC0002747 | {'kind': region, 'location': 1..28354, 'qualifiers': {contig_edge=False, product=unknown, region_number=1, subregion_numbers=1, tool=antismash}} |
Result Schema
Column Name | Type | Description |
---|---|---|
sequence | varchar | The sequence in for the record. |
accession | varchar | The accession of the record. |
comments | varchar | The comments of the record. |
contig | varchar | The contig of the record. |
date | varchar | The date of the record. |
dblink | varchar | The dblink of the record. |
definition | varchar | The definition of the record. |
division | varchar | The division of the record. |
keywords | varchar | The keywords of the record. |
molecule_type | varchar | The molecule type of the record. |
name | varchar | The name of the record. |
titles | varchar | The titles of the record. |
sources | varchar | The sources of the record. |
version | varchar | The version of the record. |
topology | varchar | The topology of the record. |
features | array<struct(kind varchar, location varchar, qualifiers map(varchar, varchar))> | The features of the record. |
Arguments
Argument | Description | Default |
---|---|---|
filepath | The path or glob to the GenBank file. | |
compression | The compression type of the file. Can be one of none or gzip | none |
Replacement Scans
The following patterns are recognized for a replacement scan:
*.gb
*.gbk
*.genbank
read_vcf_file_records
Return a table with the records from a VCF file. This function also works for BCF files.
Examples
Read from the ClinVar VCF file and return the first 5 records without the info and format columns.
SELECT chromosome, ids, position, reference_bases, quality_score, filter
FROM 'clinvar.vcf.gz'
LIMIT 5;
chromosome | ids | position | reference_bases | quality_score | filter |
---|---|---|---|---|---|
1 | [1019397] | 925952 | G | [] | |
1 | [1543320] | 925956 | C | [] | |
1 | [1648427] | 925969 | C | [] | |
1 | [1362713] | 925976 | T | [] | |
1 | [1568423] | 925986 | C | [] |
Unpack the info struct into a table.
SELECT info.*
FROM 'clinvar.vcf.gz'
LIMIT 1;
AF_ESP | AF_EXAC | AF_TGP | ALLELEID | CLNDN | CLNDNINCL | CLNDISDB | CLNDISDBINCL | CLNHGVS | CLNREVSTAT | CLNSIG | CLNSIGCONF | CLNSIGINCL | CLNVC | CLNVCSO | CLNVI | DBVARID | GENEINFO | MC | ORIGIN | RS | SSR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1003021 | [110, 111, 116, 95, 112, 114, 111, 118, 105, 100, 101, 100, 0] | [77, 101, 100, 71, 101, 110, 58, 67, 78, 53, 49, 55, 50, 48, 50, 0] | [78, 67, 95, 48, 48, 48, 48, 48, 49, 46, 49, 49, 58, 103, 46, 57, 50, 53, 57, 53, 50, 71, 62, 65, 0] | [99, 114, 105, 116, 101, 114, 105, 97, 95, 112, 114, 111, 118, 105, 100, 101, 100, 44, 95, 115, 105, 110, 103, 108, 101, 95, 115, 117, 98, 109, 105, 116, 116, 101, 114, 0] | [85, 110, 99, 101, 114, 116, 97, 105, 110, 95, 115, 105, 103, 110, 105, 102, 105, 99, 97, 110, 99, 101, 0] | single_nucleotide_variant | SO:0001483 | SAMD11:148398 | [83, 79, 58, 48, 48, 48, 49, 53, 56, 51, 124, 109, 105, 115, 115, 101, 110, 115, 101, 95, 118, 97, 114, 105, 97, 110, 116, 0] | [49, 0] | [49, 54, 52, 48, 56, 54, 51, 50, 53, 56, 0] |
Result Schema
Column Name | Type | Description |
---|---|---|
chromosome | varchar | The name of the chromosome. |
ids | varchar[] | The list of IDs. |
position | bigint | The position of the variant. |
reference_bases | varchar | The reference bases. |
alternate_bases | varchar[] | The alternate bases. |
quality_score | float | The quality score. |
filter | varchar[] | Any filters applied. |
info | struct | The info. |
genotypes | struct[] | The genotypes. |
Arguments
Argument | Description | Default |
---|---|---|
filepath | The path or glob to the VCF file. |
Replacement Scans
The following patterns are recognized for a replacement scan:
*.vcf
*.vcf.gz
*.bcf
read_fastq
Read a FASTQ file and return a table with the results.
Examples
Read a FASTQ file and return a table with the results using the default compression type.
SELECT name, sequence, quality_score_string_to_list(quality_scores) AS quality_scores
FROM 'test.fastq'
LIMIT 5;
name | sequence | quality_scores |
---|---|---|
SEQ_ID | GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT | [0, 6, 6, 9, 7, 7, 7, 7, 9, 9, 9, 10, 8, 8, 4, 4, 4, 10, 10, 8, 7, 4, 4, 4, 4, 8, 13, 16, 9, 9, 9, 12, 10, 9, 6, 6, 8, 8, 9, 9, 20, 20, 34, 34, 37, 29, 29, 29, 29, 29, 29, 34, 34, 34, 34, 34, 34, 34, 21, 20] |
SEQ_ID2 | GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT | [0, 6, 6, 9, 7, 7, 7, 7, 9, 9, 9, 10, 8, 8, 4, 4, 4, 10, 10, 8, 7, 4, 4, 4, 4, 8, 13, 16, 9, 9, 9, 12, 10, 9, 6, 6, 8, 8, 9, 9, 20, 20, 34, 34, 37, 29, 29, 29, 29, 29, 29, 34, 34, 34, 34, 34, 34, 34, 21, 20] |
Read a FASTQ file and return a table with the results using the gzip compression type. This is inferred from the file name.
SELECT * FROM 'test.fastq.gz' LIMIT 5;
name | description | sequence | quality_scores |
---|---|---|---|
SEQ_ID | GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT | !''*((((+))%%%++)(%%%%).1-+*''))**55CCF>>>>>>CCCCCCC65 | |
SEQ_ID2 | GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT | !''*((((+))%%%++)(%%%%).1-+*''))**55CCF>>>>>>CCCCCCC65 |
Read a FASTQ file and return a table with the results using the gzip compression type. Uses the function vs relying on the replacement scan.
SELECT * FROM read_fastq('path/to/file.fastq.gz', 'gzip');
Same output as above.
Result Schema
Column Name | Type | Description |
---|---|---|
id | varchar | The ID of the sequence |
description | varchar | The description of the sequence |
sequence | varchar | The sequence |
quality | varchar | The quality scores for the sequence |
Arguments
Argument | Description | Default |
---|---|---|
filepath | The path or glob to the FASTQ file. | |
compression | The compression type of the file. Can be one of none , gzip , or zstd . | none |
Replacement Scans
The following patterns are recognized for a replacement scan:
*.fastq
*.fq
*.fastq.gz
*.fq.gz
*.fastq.zstd
*.fq.zstd
read_bed_file
Read a BED file and return a table with the results.
Examples
Read a BED file and return a table with the results using the default compression type and 12 columns.
SELECT * FROM 'bed/test3.bed';
reference_sequence_name | start | end |
---|---|---|
sq0 | 8 | 13 |
Read a gzipped BED file and return a table with the results using the gzip compression type. This is inferred from the file name.
SELECT * FROM 'path/to/file.bed.gz';
Sam output as above.
Read a BED file and return a table with the first 8 columns of the BED file, i.e. include the thick start and end columns.
SELECT * FROM read_bed_file('bed/test3.bed', n_columns=8);
reference_sequence_name | start | end | name | score | strand | thick_start | thick_end |
---|---|---|---|---|---|---|---|
sq0 | 8 | 13 | 8 | 13 |
Result Schema
Column Name | Type | Description |
---|---|---|
chromosome | varchar | The chromosome of the feature. Not nullable. |
start | bigint | The start of the feature. Not nullable. |
end | bigint | The end of the feature. Not nullable. |
name | varchar | The name of the feature. Nullable. |
score | float | The score of the feature. Nullable. |
strand | varchar | The strand of the feature. Nullable. |
thick_start | bigint | The thick start of the feature. Nullable. |
thick_end | bigint | The thick end of the feature. Nullable. |
color | varchar | The color of the feature. Nullable. |
block_count | bigint | The block count of the feature. Nullable. |
block_sizes | varchar | The block sizes of the feature. Nullable. |
block_starts | varchar | The block starts of the feature. Nullable. |
Arguments
Argument | Description | Default |
---|---|---|
filepath | The path or glob to the BED file. | |
compression | The compression type of the file. Can be one of none , gzip , or zstd . | none |
n_columns | The number of columns to read. Can be from 3 to 9 , or 12 . | 12 |
Replacement Scans
The following patterns are recognized for a replacement scan:
*.bed
*.bed.gz
*.bed.zstd
read_fasta
Read a FASTA file and return a table with the results.
Examples
Read a FASTA file and return a table with the results using the default compression type.
SELECT *
FROM 'test.fasta';
id | description | sequence |
---|---|---|
a | description | ATCG |
b | description2 | ATCG |
Read all the FASTAs in the path/to
path.
SELECT *
FROM read_fasta('*.fasta');
id | description | sequence |
---|---|---|
a | description | ATCG |
b | ATCG | |
a | description | ATCG |
b | description2 | ATCG |
Read a FASTA file and return a table with the results using the gzip compression type. This is inferred from the file name.
SELECT * FROM 'test.fasta.gz';
id | description | sequence |
---|---|---|
a | description | ATCG |
b | description2 | ATCG |
Read a FASTA file and return a table with the results using the gzip compression type. Uses the function vs relying on the replacement scan.
SELECT * FROM read_fasta('test.gz', compression='gzip');
Same output as above.
Read a FASTA file and return a table with the results using the zstd compression type. Uses the function vs relying on the replacement scan.
SELECT *
FROM read_fasta('test.fasta.zst', compression='zstd');
Same as above.
Result Schema
Column Name | Type | Description |
---|---|---|
id | varchar | The ID of the sequence (before the space in the header). |
description | varchar | The description of the sequence (after the space in the header). |
sequence | varchar | The biological sequence itself. |
Arguments
Argument | Description | Default |
---|---|---|
filepath | The path or glob to the FASTA file. | |
compression | The compression type of the file. Can be one of none , gzip , or zstd . | none |
Replacement Scans
The following patterns are recognized for a replacement scan:
*.fasta
*.fa
*.fasta.gz
*.fa.gz
*.fasta.zstd
*.fa.zstd
read_hmm_dom_tbl_out
Read a HMMER3 domtblout file and return a table with the results.
Examples
Count the number of matches in a HMMER3 domtblout file.
SELECT *
FROM read_hmm_dom_tbl_out('test.pfam.hmmout')
LIMIT 5;
target_name | target_accession | tlen | query_name | accession | qlen | evalue | sequence_score | bias | domain_number | ndom | conditional_evalue | independent_evalue | domain_score | domain_bias | hmm_from | hmm_to | ali_from | ali_to | env_from | env_to | accuracy | description |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
KanNP_rdsDRAFT_30000008001 | 40 | BPD_transp_2 | PF02653.11 | 267 | 1.1e-09 | 33.7 | 1.2 | 1 | 1 | 1.1e-11 | 1.2e-09 | 33.7 | 0.8 | 135 | 173 | 2 | 40 | 1 | 40 | 0.97 | ||
KanNP_rdsDRAFT_30000008401 | 30 | SAC3 | PF12209.3 | 79 | 0.0042 | 13.4 | 0.0 | 1 | 1 | 4.5e-06 | 0.0045 | 13.3 | 0.0 | 29 | 49 | 5 | 25 | 1 | 26 | 0.85 | ||
KanNP_rdsDRAFT_30000014401 | 39 | Pyr_redox_3 | PF13738.1 | 203 | 0.0011 | 15.1 | 0.0 | 1 | 1 | 6.8e-06 | 0.0011 | 15.1 | 0.0 | 78 | 109 | 5 | 36 | 1 | 39 | 0.89 | ||
KanNP_rdsDRAFT_30000015201 | 31 | Peptidase_M4_C | PF02868.10 | 160 | 0.0002 | 17.3 | 0.1 | 1 | 1 | 4e-07 | 0.0002 | 17.3 | 0.1 | 83 | 100 | 1 | 25 | 1 | 30 | 0.78 | ||
KanNP_rdsDRAFT_30000016001 | 40 | DUF885 | PF05960.6 | 549 | 0.0061 | 11.9 | 0.0 | 1 | 1 | 3e-05 | 0.0061 | 11.9 | 0.0 | 448 | 484 | 2 | 38 | 1 | 40 | 0.93 |
Result Schema
Column Name | Type | Description |
---|---|---|
target_name | varchar | The name of the target sequence. |
target_accession | varchar | The accession of the target sequence. |
tlen | bigint | The length of the target sequence. |
query_name | varchar | The name of the query sequence. |
accession | varchar | The accession of the query sequence. |
qlen | bigint | The length of the query sequence. |
evalue | float | The E-value of the match. |
sequence_score | float | The bit score of the match. |
bias | float | The bias score of the match. |
domain_number | bigint | The domain number of the match. |
ndom | bigint | The total number of domains in the query sequence. |
conditonal_evalue | float | The conditional E-value of the match. |
independent_evalue | float | The E-value of the match reported by inclusion thresholding. |
domain_score | float | The bit score of the match reported by inclusion thresholding. |
domain_bias | float | The bias score of the match reported by inclusion thresholding. |
hmm_from | bigint | The start of the match in the HMM. |
hmm_to | bigint | The end of the match in the HMM. |
ali_from | bigint | The start of the match in the target sequence. |
ali_to | bigint | The end of the match in the target sequence. |
env_from | bigint | The start of the match in the target sequence. |
env_to | bigint | The end of the match in the target sequence. |
accuracy | float | The accuracy of the match. |
description | varchar | The description of the match. |
Arguments
Argument | Description | Default |
---|---|---|
filepath | The path or glob to the file. |
read_sam_file_records
& read_bam_file_records
Read a SAM/BAM file, respectively, and return a table with the results.
Examples
Read a SAM file:
SELECT *
FROM read_sam_file_records('sam/example1.sam');
sequence | read_name | flags | alignment_start | alignment_end | cigar_string | quality_scores | template_length | mapping_quality | mate_alignment_start |
---|---|---|---|---|---|---|---|---|---|
CGAGCTCGGT | ref1_grp1_p001 | 99 | 1 | 10 | 10M | !!!!!!!!!! | 34 | 0 | 25 |
Read a BAM file:
SELECT *
FROM read_bam_file_records('bam/example1.bam');
sequence | read_name | flags | alignment_start | alignment_end | cigar_string | quality_scores | template_length | mapping_quality | mate_alignment_start |
---|---|---|---|---|---|---|---|---|---|
CGAGCTCGGT | ref1_grp1_p001 | 99 | 1 | 10 | 10M | !!!!!!!!!! | 34 | 0 | 25 |
Result Schema
Column Name | Type | Description |
---|---|---|
sequence | varchar | The sequence. |
read_name | varchar | The name of the read. |
flags | int | The flags. |
alignment_start | bigint | The alignment start. |
alignment_end | bigint | The alignment end. |
cigar_string | varchar | The CIGAR string. |
quality_scores | varchar | The quality scores. |
template_length | int | The template length. |
mapping_quality | varchar | The mapping quality. |
mate_alignment_start | bigint | The mate alignment start. |
Arguments
Argument | Description | Default |
---|---|---|
filepath | The path or glob to the SAM file. |
query_bcf
& query_vcf
Query a BCF/VCF file, respectively, and return a table with the results. This function only does scanning of the underlying results. It does not do any filtering or processing of the results, but that can be mitigated with a subquery or CTE.
The function takes a path to a BCF/VCF file and a region to query. Note the index file must be present for the BCF/VCF file.
Examples
Query a BCF file for the first chromosome.
SELECT *
FROM query_bcf('bcf/example1.bcf', 'chr1');
Query a VCF file for the first chromosome, then filter the results for only chases where the quality is greater than 10.
WITH results AS (
SELECT *
FROM query_vcf('vcf/example1.vcf', 'chr1')
)
SELECT *
FROM results
WHERE quality > 10;
Result Schema
Column Name | Type | Description |
---|---|---|
chrom | varchar | The name of the chromosome. |
pos | bigint | The position of the variant. |
id | varchar[] | The list of IDs. |
ref | varchar | The reference bases. |
alt | varchar[] | The alternate bases. |
qual | float | The quality score. |
filter | varchar[] | Any filters applied. |
info | struct | The info. |
formats | struct[] | The genotypes. |
Arguments
Argument | Description | Default |
---|---|---|
filepath | The path of the BCF/VCF file. | |
region | The region to query. |