
· One min read
Trent Hauck

Over the last week, we've made a couple of improvements to how Exon-DuckDB is distributed. We've added Google login, and we've made it easier to install Exon-DuckDB.

Google Login

It's pretty simple. If you go to the login page, there are two new buttons for login and sign-up, respectively. Give it a shot if you're so inclined: https://wheretrue.com/login.

Exon-DuckDB on PyPI

Again, pretty simple: $ pip install exondb. For more information on how to use it, see the documentation and/or installation instructions.

· 2 min read
Trent Hauck

This is a quick snippet showing how you can use Exon-DuckDB to find contigs that have multiple PFAMs of interest. This has applications in the metagenomic discovery of things like CRISPR systems and antibiotic resistance genes.

Create a Table of Annotations

CREATE TABLE gff AS
SELECT
reference_sequence_name,
gff_parse_attributes(attributes) attributes
FROM read_gff_raw('IowNatreferecore.gff');
  • read_gff_raw reads the GFF. If it were a well-formed GFF, we could use read_gff, but this file is malformed at the source, so we use read_gff_raw.
  • gff_parse_attributes converts a GFF attributes string into key-value pairs (a quick sanity check is sketched after this list).
  • This dataset has 12.7 million rows and is easily processed on a laptop.
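To sanity-check the parsing, it can help to peek at one attribute key. This is a minimal sketch; it assumes gff_parse_attributes returns a DuckDB MAP whose values are lists, which is consistent with how the join below indexes attributes['locus_tag'].

SELECT
reference_sequence_name,
attributes['locus_tag'][1] AS locus_tag
FROM gff
LIMIT 5;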

Create a Table of PFAMs

CREATE TABLE pfams AS
SELECT
target_name,
accession[:7] AS pfam
FROM read_hmm_dom_tbl_out('hmmout')
WHERE evalue < 1e-10
  • read_hmm_dom_tbl_out reads the HMMER domain table output.
  • accession[:7] trims the raw accession from PF00000.12 to PF00000 (a quick look at the result is sketched after this list).
  • The evalue filter keeps only significant hits.
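As a quick check that the trimming and filtering behaved, a plain aggregation over the new table shows the most common domains. Nothing Exon-DuckDB-specific here; it's standard DuckDB SQL over the pfams table defined above.

SELECT
pfam,
COUNT(*) AS n_hits
FROM pfams
GROUP BY pfam
ORDER BY n_hits DESC
LIMIT 10;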

Combine Tables

CREATE TABLE contig_pfams AS
SELECT
reference_sequence_name AS contig_id,
LIST(pfam) AS contig_pfams
FROM gff
JOIN pfams
ON gff.attributes['locus_tag'][1] = pfams.target_name
GROUP BY reference_sequence_name
HAVING LEN(LIST(pfam)) > 5
  • Aggregating into a LIST means one row per contig with a LIST of PFAMs in the other column.
  • The join key lives within attributes -- the 'locus_tag' key.
  • HAVING applies a heuristic: more than five PFAM hits, a rough proxy for contigs with a bit of length (a quick look is sketched after this list).
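To get a feel for what the aggregation produced, here's a small sketch that ranks contigs by how many PFAM hits they carry (standard DuckDB SQL over the contig_pfams table above).

SELECT
contig_id,
LEN(contig_pfams) AS n_pfams
FROM contig_pfams
ORDER BY n_pfams DESC
LIMIT 10;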

Discover Cool Things

SELECT *
FROM contig_pfams
WHERE list_contains(contig_pfams, 'PF17893')
AND list_contains(contig_pfams, 'PF01867')
AND list_contains(contig_pfams, 'PF09827')
  • list_contains returns true if the list in the first argument contains the string in the second argument.
  • Easy to generalize to other CRISPR types, or to clusters of genes more generally (see the sketch after this list).
  • All done locally with just SQL.
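If you'd rather not chain list_contains calls, DuckDB also ships list_has_all, which checks a whole set of PFAMs at once. A sketch, assuming your DuckDB build includes that function:

SELECT *
FROM contig_pfams
WHERE list_has_all(contig_pfams, ['PF17893', 'PF01867', 'PF09827']);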

· 6 min read
Trent Hauck

FASTQs and Barcodes

The process of multiplexing in NGS is to combine many samples into one pool and sequence that pool. This is a useful strategy to amortize sequencing run costs over those samples. The rub, of course, is that almost anytime something is multiplexed, it needs to be demultiplexed. Moreover, for organizations engaging in many sequencing experiments, the underlying data needs to be cataloged in accordance with data engineering best practices to allow for easy retrieval.

This can be an onerous process, but with Exon-DuckDB's help, you can quickly combine the reads, barcodes, and other metadata; do pre-ingestion QC analytics; group the reads into barcode-based partitions; and finally upload that data into a larger cloud warehouse.

If you'd like to follow along, I'm using data from QIIME2.

wget -O "sample-metadata.tsv" \
"https://data.qiime2.org/2018.4/tutorials/moving-pictures/sample_metadata.tsv"
wget -O "barcodes.fastq.gz" \
"https://data.qiime2.org/2018.4/tutorials/moving-pictures/emp-single-end-sequences/barcodes.fastq.gz"
wget -O "sequences.fastq.gz" \
"https://data.qiime2.org/2018.4/tutorials/moving-pictures/emp-single-end-sequences/sequences.fastq.gz"

Ingestion without Indigestion

Looking at this data, there are three files: a metadata TSV, a gzipped barcodes FASTQ, and a gzipped sequences FASTQ. This data is easy enough to load. Let's pop into the DuckDB CLI, load Exon-DuckDB, and have a look.

Because the two FASTQ files end with "fastq.gz", Exon-DuckDB can detect the format and compression.

CREATE TABLE barcodes AS
SELECT * FROM 'barcodes.fastq.gz';

CREATE TABLE reads AS
SELECT * FROM 'sequences.fastq.gz';

And taking a quick peek, we can see what we're working with.

-- description is empty in these examples
SELECT * EXCLUDE(description) FROM barcodes LIMIT 5;
┌─────────────────────────────────────┬──────────────┬────────────────┐
│ name │ sequence │ quality_scores │
│ varchar │ varchar │ varchar │
├─────────────────────────────────────┼──────────────┼────────────────┤
│ HWI-EAS440_0386:1:23:17547:1423#0/1 │ ATGCAGCTCAGT │ IIIIIIIIIIIH │
│ HWI-EAS440_0386:1:23:14818:1533#0/1 │ CCCCTCAGCGGC │ DDD@D?@B<<+/ │
│ HWI-EAS440_0386:1:23:14401:1629#0/1 │ GACGAGTCAGTC │ GGEGDGGGGGDG │
│ HWI-EAS440_0386:1:23:15259:1649#0/1 │ AGCAGTCGCGAT │ IIIIIIIIIIII │
│ HWI-EAS440_0386:1:23:13748:2482#0/1 │ AGCACACCTACA │ GGGGBGGEEGGD │
└─────────────────────────────────────┴──────────────┴────────────────┘
SELECT MIN(LENGTH(sequence)), MAX(LENGTH(sequence)) FROM barcodes;
┌─────────────────────────┬─────────────────────────┐
min(length("sequence"))max(length("sequence"))
│ int64 │ int64 │
├─────────────────────────┼─────────────────────────┤
1212
└─────────────────────────┴─────────────────────────┘

The barcodes are in the sequence column, and they're all the same length. And for the actual reads?

-- isn't EXCLUDE cool
SELECT * EXCLUDE(description) FROM reads LIMIT 5;
┌──────────────────────┬──────────────────────┬──────────────────────────────────────────────────────┐
│ name │ sequence │ quality_scores │
│ varchar │ varchar │ varchar │
├──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────┤
│ HWI-EAS440_0386:1:… │ TACGNAGGATCCGAGCGT… │ IIIE)EEEEEEEEGFIIGIIIHIHHGIIIGIIHHHGIIHGHEGDGIFIGE… │
│ HWI-EAS440_0386:1:… │ CCCCNCAGCGGCAAAAAT… │ 64<2$24;1)/:*B<?BBDDBBD<>BDD######################… │
│ HWI-EAS440_0386:1:… │ TACGNAGGATCCGAGCGT… │ GGGC'ACC8;;>;HHHHGHDHHHHHEEHHEHHHHHECHEEEHCHFHHHAG… │
│ HWI-EAS440_0386:1:… │ TACGNAGGATCCGAGCGT… │ IIIE)DEE?CCCBIIIIIIGIIIIIHIIIIIIIIIHIIIHFIDIGIIIHH… │
│ HWI-EAS440_0386:1:… │ TACGNAGGATCCGAGCGT… │ GGGC'?CC4<5<6HHHHHHH@@HGGEEGHHFHHHHBBHGFFGCGBBGG@D… │
└──────────────────────┴──────────────────────┴──────────────────────────────────────────────────────┘

The TSV is a nasty little bugger, with a header line prefixed with # and a second line, also prefixed with #, that specifies an arbitrary type system. Anyway, thankfully DuckDB has a robust set of functions for ingesting CSV/TSV data.

CREATE TABLE metadata AS
SELECT *
FROM read_csv('sample-metadata.tsv', skip=2, sep='\t', columns={
'sample_id': 'VARCHAR',
'barcode': 'VARCHAR',
'linker_primer_sequence': 'VARCHAR',
'body_site': 'VARCHAR',
'year': 'INTEGER',
'month': 'INTEGER',
'day': 'INTEGER',
'subject': 'VARCHAR',
'reported_antibiotic_usage': 'VARCHAR',
'days_since_experiment_start': 'INTEGER',
'description': 'VARCHAR'});

That wasn't so bad. And now that the data is "in", we can use SQL to manipulate it before exporting it.

A Big Table

To facilitate analysis, let's join this data together and augment the quality scores by transforming them from an ASCII-encoded string into a list of integers.

CREATE TABLE analysis_table AS
SELECT r.*,
m.sample_id,
m.body_site,
m.subject,
b.sequence barcode,

-- e.g. [14, 23, 27, 25, 25, 35, 31, 27, 33, 35, 35, 33, ...]
quality_score_string_to_list(r.quality_scores) qc_scores

FROM reads r
INNER JOIN barcodes b
ON r.name = b.name
LEFT JOIN metadata m
ON b.sequence = m.barcode;

You could stop here and ship this off to your cloud provider for storage as Parquet, but we can also do some analysis locally and enjoy the quick feedback loop.
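For the "ship it off" path, a single COPY statement writes the table out as Parquet. This is a minimal sketch using standard DuckDB COPY syntax; the output filename is just an example.

COPY analysis_table TO 'analysis_table.parquet' (FORMAT parquet);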

For example, we can quickly find out that there are about 40k undetermined reads, and maybe they came from subject-1.

SELECT subject, COUNT(*) FROM analysis_table GROUP BY subject;
┌───────────┬──────────────┐
│  subject  │ count_star() │
│  varchar  │    int64     │
├───────────┼──────────────┤
│ subject-1 │       106684 │
│ subject-2 │       157194 │
│           │        38703 │
└───────────┴──────────────┘

Or maybe we'd like to filter our analysis table to reads where the last 5 quality scores average above 35 (i.e. we expect some drop-off in quality across the read, but if the drop is drastic, we omit the read).

SELECT *
FROM analysis_table
WHERE list_avg(qc_scores[-5:]) > 35;
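If you just want to know how much data that filter would keep, a filtered aggregate gives a quick answer. This is a sketch using DuckDB's FILTER clause over the same predicate.

SELECT
COUNT(*) FILTER (WHERE list_avg(qc_scores[-5:]) > 35) AS passing,
COUNT(*) AS total
FROM analysis_table;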

Ok, so entering the home stretch, we've got our data in a "big table". We've done some analysis. Now let's move to the final task of storage and how partitioning can simplify retrieval.

Partitioning with DuckDB and Reading with Exon-DuckDB

The idea of partitioning is to split the data into, well, partitions based on some value that is well aligned with access patterns. For example, for web log data you might partition by day: if there was a site issue last week, you'd want to grab that day's data quickly without going through everything else. Here, we'll partition by barcode, but it's on you to make the decision for your data.

The syntax from DuckDB is quite simple for generating partitions.

COPY (
SELECT * FROM analysis_table
) TO 'analysis_table' (FORMAT parquet, PARTITION_BY barcode);

Now we have a directory structure that looks like this:

$ tree analysis_table | head
analysis_table
├── barcode=AAAAAAAAAAAA
│ ├── data_0.parquet
│ ├── data_1.parquet
│ └── data_2.parquet
├── barcode=AAAAAATTTAGG
│ ├── data_1.parquet
│ └── data_2.parquet
├── barcode=AAAAAGAGCTTA
│ ├── data_0.parquet
...

And that's all well and good; if you need a specific barcode, you can just grab the directory. Or again you can copy this to your cloud provider.
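And if you later want the partition value back as a column when reading the whole directory, DuckDB's hive partitioning support can recover it from the path. A sketch, assuming a DuckDB version with read_parquet and the hive_partitioning option:

SELECT
barcode,
COUNT(*) AS n_reads
FROM read_parquet('analysis_table/*/*.parquet', hive_partitioning = true)
GROUP BY barcode
ORDER BY n_reads DESC
LIMIT 5;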

Though there's one step missing: say you wanted to run the FASTQ for one barcode through a tool, and that tool, like seemingly every bioinformatics tool, can't work with Parquet. Exon-DuckDB can help there too.

COPY (
SELECT * FROM parquet_scan('analysis_table/barcode=AAAAAAAAAAAA/*')
) TO 'AAAAAAAAAAAA.fastq.gz' (FORMAT fastq)

And then having a quick look...

$ gzcat AAAAAAAAAAAA.fastq.gz | head -n 4
@HWI-EAS440_0386:4:11:17168:10699#0/1
NAAAAAAAAAAAAAAAAAAAAAAAACAAACAAAAAAAATCTAAAAAAACTACAAACAAAAATCAAACAAACAAACAAAAATACAACATCTAAATATAAAATTCAAACACACTACATTACACTAGAAACCCTCTCTTCACTACTCACTAACAC
+
########################################################################################################################################################

That's it. Thanks for playing along. If you'd like to learn more about WTT, the company, or Exon-DuckDB, the product, please get in touch via the email in the footer.

· 4 min read
Trent Hauck

I wanted to briefly explain what WHERE TRUE Technologies is, what we're building, and why I'm excited about it.

WHERE TRUE Technologies is a company I've formed to commercialize scientific software that I think can have an impact for both teams and individuals. There are two software tools currently in development, Exon-DuckDB and WTT-02, targeting bioinformatics and cheminformatics, respectively.

The main idea behind the products is that engineers and scientists working with complex data need to be able both to quickly explore datasets locally (micro) and to integrate those datasets into modern data infrastructure (macro), augmenting a company's competitive advantage with data.

In my opinion, SQL is the best path there. With recent advancements in database tooling, it's possible to have top-tier speed locally and data-warehouse interop without sacrificing the expressiveness and evolvability that scientists need to ask questions and engineers need to build systems.

I'll share more information about Exon-DuckDB and WTT-02 as they get closer to release, but in the meantime, here's a kernel of what excites me.

Three Approaches to Counting Sequences

Here's a simple problem that comes up a lot: count the number of sequences in a FASTA. Here, we'll use the canonical Swiss-Prot FASTA, uniprot_sprot.fasta.

The first approach might be grep, the grey-beard's favorite.

$ grep -c '^>' uniprot_sprot.fasta # counts lines that start with '>'

Or if you're a pythonista, you might do the following.

from Bio import SeqIO

# stream through the FASTA, counting one record at a time
count = 0
for record in SeqIO.parse('uniprot_sprot.fasta', 'fasta'):
    count += 1
print(count)

So, pray tell, what does that look like with Exon-DuckDB?

import exondb
import os

os.environ["WTT_01_LICENSE"] = "XXX"

con = exondb.get_connection()
count = con.execute("SELECT COUNT(*) FROM 'uniprot_sprot.fasta'").fetchone()[0]

print(count)

Comparing runtimes on a local Intel Mac, grep is the fastest, Exon-DuckDB is a bit slower, and Python finishes at about 3x the duration. The rub, of course, is that grep isn't doing anything interesting with the sequences. It has no notion of the sequence or the header; it just asks whether the line starts with '>'.
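And because the FASTA is parsed into real columns, the same one-liner generalizes beyond counting. Here's a sketch of a length summary over the same file; it assumes Exon-DuckDB exposes a sequence column for FASTA files, as it does for FASTQ above.

SELECT
MIN(LENGTH(sequence)) AS min_len,
AVG(LENGTH(sequence)) AS avg_len,
MAX(LENGTH(sequence)) AS max_len
FROM 'uniprot_sprot.fasta';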

[Chart: 10 runs for each tool on Swiss-Prot; runtime in seconds for grep, BioPython, and Exon-DuckDB.]