Skip to main content

Table Functions

Table functions are functions that return a table. They are used in the FROM clause of a query.

read_gff

Read a GFF file and return a table with the results.

Examples

Read a GFF file and return a table with the results using the default compression type.

SELECT * FROM 'test.gff';
reference_sequence_namesourceannotation_typestartendscorestrandphaseattributes
sq0caatgene813+gene_id=caat1;gene_name=gene0;
sq1caatgene8140.1+0gene_id=caat2;gene_name=gene0;

Read a GFF file and return a table with the results using the gzip compression type. This is inferred from the file name. Also convert the attributes column to a map.

SELECT reference_sequence_name, element_at(attrs, 'gene_id')[1] AS gene_id
FROM (
SELECT reference_sequence_name, gff_parse_attributes(attributes) AS attrs
FROM 'test.gff'
);
reference_sequence_namegene_id
sq0caat1
sq1caat2

Read a GFF file and return a table with the results using the gzip compression type. This is explicitly provided.

SELECT reference_sequence_name, start, "end" FROM read_gff('test.gff.gz', compression='gzip');
reference_sequence_namestartend
sq0813
sq1814

Read a GFF file and return a table with the results using the zstd compression type. This is explicitly provided.

SELECT attributes FROM read_gff('test.gff.zst', compression='zstd');
attributes
gene_id=caat1;gene_name=gene0;
gene_id=caat2;gene_name=gene0;

Result Schema

Column NameTypeDescription
reference_sequence_namevarcharThe name of the reference sequence.
sourcevarcharThe source of the feature.
annotation_typevarcharThe type of the annotation.
startbigintThe start of the feature.
endbigintThe end of the feature.
scorefloatThe score of the feature.
strandvarcharThe strand of the feature.
phasebigintThe phase of the feature.
attributesvarcharThe attributes of the feature.

Arguments

ArgumentDescriptionDefault
filepathThe path or glob to the GFF file.
compressionThe compression type of the file. Can be one of none, gzip, or zstd.none

Replacement Scans

The following patterns are recognized for a replacement scan:

  • *.gff
  • *.gff.gz
  • *.gff.zstd

read_gff_raw

Read a GFF file and return a table with the results. This is the same as read_gff column name-wise, but no attempt is made to parse fields that often don't follow the spec, like strand or phase. As such the result schema is slightly different.

Result Schema

Column NameTypeDescription
reference_sequence_namevarcharThe name of the reference sequence.
sourcevarcharThe source of the feature.
annotation_typevarcharThe type of the annotation.
startbigintThe start of the feature.
endbigintThe end of the feature.
scorefloatThe score of the feature.
strandvarcharThe strand of the feature.
phasevarcharThe phase of the feature.
attributesvarcharThe attributes of the feature.

read_gtf

read_gtf is the GTF analog of read_gff.

Examples

Similar to other functions, read_gtf can be used to read a GTF file and return a table with the results.

SELECT * FROM read_gtf('test.gtf');

Result Schema

Column NameTypeDescription
seqnamevarcharThe name of the reference sequence.
sourcevarcharThe source of the feature.
typevarcharThe type of the annotation.
startbigintThe start of the feature.
endbigintThe end of the feature.
scorefloatThe score of the feature.
strandvarcharThe strand of the feature.
framevarcharThe phase of the feature.
attributesmapThe attributes of the feature.

read_genbank

Read a GenBank file and return a table with the results.

Examples

Read a GenBank file and return a table with the results.

SELECT accession, length(features) n_features, features[1] first_feature FROM 'gb/BGC0002747.gbk';
accessionn_featuresfirst_feature
BGC000274735{'kind': subregion, 'location': 1..28354, 'qualifiers': {aStool=mibig, contig_edge=False, label=RiPP, subregion_number=1, tool=antismash}}

Unnest the features column so rather than a list, there's one feature per row.

SELECT accession, UNNEST(features) features FROM 'gb/BGC0002747.gbk' LIMIT 2;
accessionfeatures
BGC0002747{'kind': subregion, 'location': 1..28354, 'qualifiers': {aStool=mibig, contig_edge=False, label=RiPP, subregion_number=1, tool=antismash}}
BGC0002747{'kind': region, 'location': 1..28354, 'qualifiers': {contig_edge=False, product=unknown, region_number=1, subregion_numbers=1, tool=antismash}}

Result Schema

Column NameTypeDescription
sequencevarcharThe sequence in for the record.
accessionvarcharThe accession of the record.
commentsvarcharThe comments of the record.
contigvarcharThe contig of the record.
datevarcharThe date of the record.
dblinkvarcharThe dblink of the record.
definitionvarcharThe definition of the record.
divisionvarcharThe division of the record.
keywordsvarcharThe keywords of the record.
molecule_typevarcharThe molecule type of the record.
namevarcharThe name of the record.
titlesvarcharThe titles of the record.
sourcesvarcharThe sources of the record.
versionvarcharThe version of the record.
topologyvarcharThe topology of the record.
featuresarray<struct(kind varchar, location varchar, qualifiers map(varchar, varchar))>The features of the record.

Arguments

ArgumentDescriptionDefault
filepathThe path or glob to the GenBank file.
compressionThe compression type of the file. Can be one of none or gzipnone

Replacement Scans

The following patterns are recognized for a replacement scan:

  • *.gb
  • *.gbk
  • *.genbank

read_vcf_file_records

Return a table with the records from a VCF file. This function also works for BCF files.

Examples

Read from the ClinVar VCF file and return the first 5 records without the info and format columns.

SELECT chromosome, ids, position, reference_bases, quality_score, filter
FROM 'clinvar.vcf.gz'
LIMIT 5;
chromosomeidspositionreference_basesquality_scorefilter
1[1019397]925952G[]
1[1543320]925956C[]
1[1648427]925969C[]
1[1362713]925976T[]
1[1568423]925986C[]

Unpack the info struct into a table.

SELECT info.*
FROM 'clinvar.vcf.gz'
LIMIT 1;
AF_ESPAF_EXACAF_TGPALLELEIDCLNDNCLNDNINCLCLNDISDBCLNDISDBINCLCLNHGVSCLNREVSTATCLNSIGCLNSIGCONFCLNSIGINCLCLNVCCLNVCSOCLNVIDBVARIDGENEINFOMCORIGINRSSSR
1003021[110, 111, 116, 95, 112, 114, 111, 118, 105, 100, 101, 100, 0][77, 101, 100, 71, 101, 110, 58, 67, 78, 53, 49, 55, 50, 48, 50, 0][78, 67, 95, 48, 48, 48, 48, 48, 49, 46, 49, 49, 58, 103, 46, 57, 50, 53, 57, 53, 50, 71, 62, 65, 0][99, 114, 105, 116, 101, 114, 105, 97, 95, 112, 114, 111, 118, 105, 100, 101, 100, 44, 95, 115, 105, 110, 103, 108, 101, 95, 115, 117, 98, 109, 105, 116, 116, 101, 114, 0][85, 110, 99, 101, 114, 116, 97, 105, 110, 95, 115, 105, 103, 110, 105, 102, 105, 99, 97, 110, 99, 101, 0]single_nucleotide_variantSO:0001483SAMD11:148398[83, 79, 58, 48, 48, 48, 49, 53, 56, 51, 124, 109, 105, 115, 115, 101, 110, 115, 101, 95, 118, 97, 114, 105, 97, 110, 116, 0][49, 0][49, 54, 52, 48, 56, 54, 51, 50, 53, 56, 0]

Result Schema

Column NameTypeDescription
chromosomevarcharThe name of the chromosome.
idsvarchar[]The list of IDs.
positionbigintThe position of the variant.
reference_basesvarcharThe reference bases.
alternate_basesvarchar[]The alternate bases.
quality_scorefloatThe quality score.
filtervarchar[]Any filters applied.
infostructThe info.
genotypesstruct[]The genotypes.

Arguments

ArgumentDescriptionDefault
filepathThe path or glob to the VCF file.

Replacement Scans

The following patterns are recognized for a replacement scan:

  • *.vcf
  • *.vcf.gz
  • *.bcf

read_fastq

Read a FASTQ file and return a table with the results.

Examples

Read a FASTQ file and return a table with the results using the default compression type.

SELECT name, sequence, quality_score_string_to_list(quality_scores) AS quality_scores
FROM 'test.fastq'
LIMIT 5;
namesequencequality_scores
SEQ_IDGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT[0, 6, 6, 9, 7, 7, 7, 7, 9, 9, 9, 10, 8, 8, 4, 4, 4, 10, 10, 8, 7, 4, 4, 4, 4, 8, 13, 16, 9, 9, 9, 12, 10, 9, 6, 6, 8, 8, 9, 9, 20, 20, 34, 34, 37, 29, 29, 29, 29, 29, 29, 34, 34, 34, 34, 34, 34, 34, 21, 20]
SEQ_ID2GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT[0, 6, 6, 9, 7, 7, 7, 7, 9, 9, 9, 10, 8, 8, 4, 4, 4, 10, 10, 8, 7, 4, 4, 4, 4, 8, 13, 16, 9, 9, 9, 12, 10, 9, 6, 6, 8, 8, 9, 9, 20, 20, 34, 34, 37, 29, 29, 29, 29, 29, 29, 34, 34, 34, 34, 34, 34, 34, 21, 20]

Read a FASTQ file and return a table with the results using the gzip compression type. This is inferred from the file name.

SELECT * FROM 'test.fastq.gz' LIMIT 5;
namedescriptionsequencequality_scores
SEQ_IDGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT!''*((((+))%%%++)(%%%%).1-+*''))**55CCF>>>>>>CCCCCCC65
SEQ_ID2GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT!''*((((+))%%%++)(%%%%).1-+*''))**55CCF>>>>>>CCCCCCC65

Read a FASTQ file and return a table with the results using the gzip compression type. Uses the function vs relying on the replacement scan.

SELECT * FROM read_fastq('path/to/file.fastq.gz', 'gzip');

Same output as above.

Result Schema

Column NameTypeDescription
idvarcharThe ID of the sequence
descriptionvarcharThe description of the sequence
sequencevarcharThe sequence
qualityvarcharThe quality scores for the sequence

Arguments

ArgumentDescriptionDefault
filepathThe path or glob to the FASTQ file.
compressionThe compression type of the file. Can be one of none, gzip, or zstd.none

Replacement Scans

The following patterns are recognized for a replacement scan:

  • *.fastq
  • *.fq
  • *.fastq.gz
  • *.fq.gz
  • *.fastq.zstd
  • *.fq.zstd

read_bed_file

Read a BED file and return a table with the results.

Examples

Read a BED file and return a table with the results using the default compression type and 12 columns.

SELECT * FROM 'bed/test3.bed';
reference_sequence_namestartend
sq0813

Read a gzipped BED file and return a table with the results using the gzip compression type. This is inferred from the file name.

SELECT * FROM 'path/to/file.bed.gz';

Sam output as above.

Read a BED file and return a table with the first 8 columns of the BED file, i.e. include the thick start and end columns.

SELECT * FROM read_bed_file('bed/test3.bed', n_columns=8);
reference_sequence_namestartendnamescorestrandthick_startthick_end
sq0813813

Result Schema

Column NameTypeDescription
chromosomevarcharThe chromosome of the feature. Not nullable.
startbigintThe start of the feature. Not nullable.
endbigintThe end of the feature. Not nullable.
namevarcharThe name of the feature. Nullable.
scorefloatThe score of the feature. Nullable.
strandvarcharThe strand of the feature. Nullable.
thick_startbigintThe thick start of the feature. Nullable.
thick_endbigintThe thick end of the feature. Nullable.
colorvarcharThe color of the feature. Nullable.
block_countbigintThe block count of the feature. Nullable.
block_sizesvarcharThe block sizes of the feature. Nullable.
block_startsvarcharThe block starts of the feature. Nullable.

Arguments

ArgumentDescriptionDefault
filepathThe path or glob to the BED file.
compressionThe compression type of the file. Can be one of none, gzip, or zstd.none
n_columnsThe number of columns to read. Can be from 3 to 9, or 12.12

Replacement Scans

The following patterns are recognized for a replacement scan:

  • *.bed
  • *.bed.gz
  • *.bed.zstd

read_fasta

Read a FASTA file and return a table with the results.

Examples

Read a FASTA file and return a table with the results using the default compression type.

SELECT *
FROM 'test.fasta';
iddescriptionsequence
adescriptionATCG
bdescription2ATCG

Read all the FASTAs in the path/to path.

SELECT *
FROM read_fasta('*.fasta');
iddescriptionsequence
adescriptionATCG
bATCG
adescriptionATCG
bdescription2ATCG

Read a FASTA file and return a table with the results using the gzip compression type. This is inferred from the file name.

SELECT * FROM 'test.fasta.gz';
iddescriptionsequence
adescriptionATCG
bdescription2ATCG

Read a FASTA file and return a table with the results using the gzip compression type. Uses the function vs relying on the replacement scan.

SELECT * FROM read_fasta('test.gz', compression='gzip');

Same output as above.

Read a FASTA file and return a table with the results using the zstd compression type. Uses the function vs relying on the replacement scan.

SELECT *
FROM read_fasta('test.fasta.zst', compression='zstd');

Same as above.

Result Schema

Column NameTypeDescription
idvarcharThe ID of the sequence (before the space in the header).
descriptionvarcharThe description of the sequence (after the space in the header).
sequencevarcharThe biological sequence itself.

Arguments

ArgumentDescriptionDefault
filepathThe path or glob to the FASTA file.
compressionThe compression type of the file. Can be one of none, gzip, or zstd.none

Replacement Scans

The following patterns are recognized for a replacement scan:

  • *.fasta
  • *.fa
  • *.fasta.gz
  • *.fa.gz
  • *.fasta.zstd
  • *.fa.zstd

read_hmm_dom_tbl_out

Read a HMMER3 domtblout file and return a table with the results.

Examples

Count the number of matches in a HMMER3 domtblout file.

SELECT *
FROM read_hmm_dom_tbl_out('test.pfam.hmmout')
LIMIT 5;
target_nametarget_accessiontlenquery_nameaccessionqlenevaluesequence_scorebiasdomain_numberndomconditional_evalueindependent_evaluedomain_scoredomain_biashmm_fromhmm_toali_fromali_toenv_fromenv_toaccuracydescription
KanNP_rdsDRAFT_3000000800140BPD_transp_2PF02653.112671.1e-0933.71.2111.1e-111.2e-0933.70.81351732401400.97
KanNP_rdsDRAFT_3000000840130SAC3PF12209.3790.004213.40.0114.5e-060.004513.30.029495251260.85
KanNP_rdsDRAFT_3000001440139Pyr_redox_3PF13738.12030.001115.10.0116.8e-060.001115.10.0781095361390.89
KanNP_rdsDRAFT_3000001520131Peptidase_M4_CPF02868.101600.000217.30.1114e-070.000217.30.1831001251300.78
KanNP_rdsDRAFT_3000001600140DUF885PF05960.65490.006111.90.0113e-050.006111.90.04484842381400.93

Result Schema

Column NameTypeDescription
target_namevarcharThe name of the target sequence.
target_accessionvarcharThe accession of the target sequence.
tlenbigintThe length of the target sequence.
query_namevarcharThe name of the query sequence.
accessionvarcharThe accession of the query sequence.
qlenbigintThe length of the query sequence.
evaluefloatThe E-value of the match.
sequence_scorefloatThe bit score of the match.
biasfloatThe bias score of the match.
domain_numberbigintThe domain number of the match.
ndombigintThe total number of domains in the query sequence.
conditonal_evaluefloatThe conditional E-value of the match.
independent_evaluefloatThe E-value of the match reported by inclusion thresholding.
domain_scorefloatThe bit score of the match reported by inclusion thresholding.
domain_biasfloatThe bias score of the match reported by inclusion thresholding.
hmm_frombigintThe start of the match in the HMM.
hmm_tobigintThe end of the match in the HMM.
ali_frombigintThe start of the match in the target sequence.
ali_tobigintThe end of the match in the target sequence.
env_frombigintThe start of the match in the target sequence.
env_tobigintThe end of the match in the target sequence.
accuracyfloatThe accuracy of the match.
descriptionvarcharThe description of the match.

Arguments

ArgumentDescriptionDefault
filepathThe path or glob to the file.

read_sam_file_records & read_bam_file_records

Read a SAM/BAM file, respectively, and return a table with the results.

Examples

Read a SAM file:

SELECT *
FROM read_sam_file_records('sam/example1.sam');
sequenceread_nameflagsalignment_startalignment_endcigar_stringquality_scorestemplate_lengthmapping_qualitymate_alignment_start
CGAGCTCGGTref1_grp1_p0019911010M!!!!!!!!!!34025

Read a BAM file:

SELECT *
FROM read_bam_file_records('bam/example1.bam');
sequenceread_nameflagsalignment_startalignment_endcigar_stringquality_scorestemplate_lengthmapping_qualitymate_alignment_start
CGAGCTCGGTref1_grp1_p0019911010M!!!!!!!!!!34025

Result Schema

Column NameTypeDescription
sequencevarcharThe sequence.
read_namevarcharThe name of the read.
flagsintThe flags.
alignment_startbigintThe alignment start.
alignment_endbigintThe alignment end.
cigar_stringvarcharThe CIGAR string.
quality_scoresvarcharThe quality scores.
template_lengthintThe template length.
mapping_qualityvarcharThe mapping quality.
mate_alignment_startbigintThe mate alignment start.

Arguments

ArgumentDescriptionDefault
filepathThe path or glob to the SAM file.

query_bcf & query_vcf

Query a BCF/VCF file, respectively, and return a table with the results. This function only does scanning of the underlying results. It does not do any filtering or processing of the results, but that can be mitigated with a subquery or CTE.

The function takes a path to a BCF/VCF file and a region to query. Note the index file must be present for the BCF/VCF file.

Examples

Query a BCF file for the first chromosome.

SELECT *
FROM query_bcf('bcf/example1.bcf', 'chr1');

Query a VCF file for the first chromosome, then filter the results for only chases where the quality is greater than 10.

WITH results AS (
SELECT *
FROM query_vcf('vcf/example1.vcf', 'chr1')
)

SELECT *
FROM results
WHERE quality > 10;

Result Schema

Column NameTypeDescription
chromvarcharThe name of the chromosome.
posbigintThe position of the variant.
idvarchar[]The list of IDs.
refvarcharThe reference bases.
altvarchar[]The alternate bases.
qualfloatThe quality score.
filtervarchar[]Any filters applied.
infostructThe info.
formatsstruct[]The genotypes.

Arguments

ArgumentDescriptionDefault
filepathThe path of the BCF/VCF file.
regionThe region to query.