Skip to main content

Querying Postgres

Postgres is one of the most common locations where organizations store there data, and rightly so, it's a powerful tool. Moreover, it's a very common technology used by SAAS companies that serve scientific organizations. For example, perhaps the most popular ELN/LIMS system, Benchling, serves data back to its customers via a Postgres database.

While Postgres is amazing, it lacks domain specific functions that scientist often need in their day to day. I'd arguably be further in life if not for ad-hoc requests to help find some sequence in some database 😄. The process is kludgey, and involves some combination of python scripts and CLI tools which are both not very accessible and leads to tons of slight differences in the way the data is processed.

While our tools won't be as powerful as purpose-built tools for alignment or similarity search, they are much easier and when you think about the extra time that goes into data munging they often are more than sufficient.

WTT's tools use DuckDB under the hood, and thus have access to its Postgres Scanner, so we can use that to query the data and use WTT's tools to do the analysis.

As an example, let's pretend we have a table in postgres called dna, with a sequence column.

$ psql -h 127.0.0.1 -U postgres -c 'SELECT * FROM dna LIMIT 5;'
idsequence
1GCAGGTAATGTTATATCGTCTGCGTCCTGATCCGTCAGTCCACCGCAGGCAATAACTGCAAAGGGCTTCAAGCTAAGATCAGATCCACAGCGACTTACGTCCATAACTTTTGTACGACGTCGTGGTAACACGCGTACACCATTCGTATCGTTACTAACGGTCTACCCCAAGAACTAGAGACGTCTGCGATCTAAATGTCA
2AAAACGTGGAGCAGGGAAATGCAGATACAGCTTTCGTTATCGTGACGTAGGCCACAGTAACCCCTGTCTACATGTATACCTGTCAATACTCGCGCGGCTGCTCAACCGACAGGAACGCTAAATTTTGCTATGCACACGTCCAGTGAGAGAGTCCCAGACAATGTCCAATACGTAATTTTCGTAAGTAGGCGTTCAAACCA
3TTCGGGTTAATCATGGACGGGCCACTATGGAATGGATACGTGCACGTTTTCGACAGCAAAACCAGCTTTGGTTAACTTCCTCGTAGATCCGTAGATTGCGTCCAGAACCGATCTCTAGAGACGCCGGACCCTAAACTGTTTACGGATGGCCGCTGTGTGTCAATAGGCCAGGGACCGATACCTGAACAATCACCTCGCCT
4CCATGTCAGCACGGCACCGAATCGTCCAACCCCCCCGCGGAAGCGATTCGGAGGACTCCGGGCATCCCCTGACAAAGAAATAGTCACTGACATTCTCTAATAGTCCCCGGGTTCTCACGGGTAGTGTGATTTCCGAGGGGCTACCCTTAATGCGACGGCATGACGAGTGAAAGCCATTGTGATACTGTTTTGCGGGGAGA
5AGGCCTATATAGCGGACTATCTCCGTTGTTTGAATTTATAAAGTGTGCTGCTACGCATCATAGCGGACTGGCCTTTCTACTCACTGTGTTTGCAGAGAGGATGCTCCTTAACATGGTCTCTGCGAAATGTGCCGATTGATCATGAGTGGTGCCGACCGATGGAGTGTGGGCGAGATCTCGCCCATCACATCTGTTACTTA

Now, how can we use Exon-DuckDB to find the sequences that are similar to a query sequence? From the DuckDB CLI, with exondb loaded, first connect to postgres...

INSTALL 'postgres_scanner';
LOAD 'postgres_scanner';

-- export PGPASSWORD=postgres
-- This could also be a connection to your ELN's database
CALL POSTGRES_ATTACH('postgres://postgres:postgres@localhost:5432/postgres');

Then, we can run a dummy query against the postgres tables...

SELECT * FROM dna LIMIT 5;
idsequence
1GCAGGTAATGTTATATCGTCTGCGTCCTGATCCGTCAGTCCACCGCAGGCAATAACTGCAAAGGGCTTCAAGCTAAGATCAGATCCACAGCGACTTACGTCCATAACTTTTGTACGACGTCGTGGTAACACGCGTACACCATTCGTATCGTTACTAACGGTCTACCCCAAGAACTAGAGACGTCTGCGATCTAAATGTCA
2AAAACGTGGAGCAGGGAAATGCAGATACAGCTTTCGTTATCGTGACGTAGGCCACAGTAACCCCTGTCTACATGTATACCTGTCAATACTCGCGCGGCTGCTCAACCGACAGGAACGCTAAATTTTGCTATGCACACGTCCAGTGAGAGAGTCCCAGACAATGTCCAATACGTAATTTTCGTAAGTAGGCGTTCAAACCA
3TTCGGGTTAATCATGGACGGGCCACTATGGAATGGATACGTGCACGTTTTCGACAGCAAAACCAGCTTTGGTTAACTTCCTCGTAGATCCGTAGATTGCGTCCAGAACCGATCTCTAGAGACGCCGGACCCTAAACTGTTTACGGATGGCCGCTGTGTGTCAATAGGCCAGGGACCGATACCTGAACAATCACCTCGCCT
4CCATGTCAGCACGGCACCGAATCGTCCAACCCCCCCGCGGAAGCGATTCGGAGGACTCCGGGCATCCCCTGACAAAGAAATAGTCACTGACATTCTCTAATAGTCCCCGGGTTCTCACGGGTAGTGTGATTTCCGAGGGGCTACCCTTAATGCGACGGCATGACGAGTGAAAGCCATTGTGATACTGTTTTGCGGGGAGA
5AGGCCTATATAGCGGACTATCTCCGTTGTTTGAATTTATAAAGTGTGCTGCTACGCATCATAGCGGACTGGCCTTTCTACTCACTGTGTTTGCAGAGAGGATGCTCCTTAACATGGTCTCTGCGAAATGTGCCGATTGATCATGAGTGGTGCCGACCGATGGAGTGTGGGCGAGATCTCGCCCATCACATCTGTTACTTA

Same as above, so far so good. And to see this isn't "nothing", let's count the number of sequences in the table:

SELECT COUNT(*) AS cnt FROM dna;
cnt
50000

Now, let's get back to the sequence alignment case. If our query sequence is AGGCCTATATAGCGGACTATCTC and we want to find the top 5 similar sequences, we can do the following:

SELECT id, alignment_score_wfa_gap_affine(sequence, 'AGGCCTATATAGCGGACTATCTC') score, sequence
FROM dna
ORDER BY alignment_score_wfa_gap_affine(sequence, 'AGGCCTATATAGCGGACTATCTC') DESC
LIMIT 5;

This take about 3 seconds on a M2 MBA, and returns the following:

idscoresequence
5-360.0AGGCCTATATAGCGGACTATCTCCGTTGTTTGAATTTATAAAGTGTGCTGCTACGCATCATAGCGGACTGGCCTTTCTACTCACTGTGTTTGCAGAGAGGATGCTCCTTAACATGGTCTCTGCGAAATGTGCCGATTGATCATGAGTGGTGCCGACCGATGGAGTGTGGGCGAGATCTCGCCCATCACATCTGTTACTTA
43368-380.0AGGGAGTTGACGTAAATGCCGAGCAGAATAAGGGACCCTAAACCCCTGAAGGGACATGTTGTTGTCCCGCATCATCGTAGATTAGCGTACTTGCGTTAAGATTTTCACTTGGAAGTATCACATATCTAGGTATTCCTAGACCACTACTCATGCTATTGTACAAGTCGAACGTTCATAATGGTCGGCAATCCGCTCATCTC
2400-382.0TGTCCTATAAAGACCTCTTTACTGACTATCTCTTAGTTAGCTGCGACCACGAGGCCTCAGCGGGGAGCAAACGGAGGCCCCAGATTCGTGTCCGCGCTTGCCAAATCGGCACATTAACAATTAGGAGCCGACCCCCCTAACACTCGTATAGACATGGCTACGTGGCCACGCGAACTATTTTTACTAAGTATGGAATACAA
4366-382.0AGGCCAGTCGGTCTGGGATTACATGTCTGTAGAGGGGTGTGTCTCCTTTTCCTTCAAGCTCCCCACCCTAACACAGCGGTTCCTGGACGCGTCATGCTCCTGGTCTTGATGGCGGATCACTGCATAGATTGCTAAAGTGCGACTGATAAGCACCACCGTACATGGGGGTTGTAACTCTCCTGAAGGTAATCCTTAATCTC
4815-382.0AGCGTCGTATCTCACCTTACATTTTCGCAACGAAAAATAACCTTCGTGATGGCCCCGAGGTAGATCAAGACGCCTATAAACCGGAGTGCTCTTATCCCGTTCCAACCCGCCGGAAGCGCCGGCCGCCGATCCACTAGGTAGCGGAGAGAAACTGTCGTCTTACCAACGTTTCCAAGTCACTATTCAGAGCTCTCTTTACC

And this is great, but doesn't tell us a lot about the alignment itself. To do that we can use the alignment_score_wfa_gap_affine function which returns a CIGAR string describing the alignment.

SELECT id, alignment_string_wfa_gap_affine(sequence, 'AGGCCTATATAGCGGACTATCTC') cigar, sequence
FROM dna
ORDER BY alignment_score_wfa_gap_affine(sequence, 'AGGCCTATATAGCGGACTATCTC') DESC
LIMIT 5;
idcigarsequence
523M177IAGGCCTATATAGCGGACTATCTCCGTTGTTTGAATTTATAAAGTGTGCTGCTACGCATCATAGCGGACTGGCCTTTCTACTCACTGTGTTTGCAGAGAGGATGCTCCTTAACATGGTCTCTGCGAAATGTGCCGATTGATCATGAGTGGTGCCGACCGATGGAGTGTGGGCGAGATCTCGCCCATCACATCTGTTACTTA
433683M33I4M1X1M40I5M1X3M104I5MAGGGAGTTGACGTAAATGCCGAGCAGAATAAGGGACCCTAAACCCCTGAAGGGACATGTTGTTGTCCCGCATCATCGTAGATTAGCGTACTTGCGTTAAGATTTTCACTTGGAAGTATCACATATCTAGGTATTCCTAGACCACTACTCATGCTATTGTACAAGTCGAACGTTCATAATGGTCGGCAATCCGCTCATCTC
24001X1M1X6M1X2M9I1M1X9M168ITGTCCTATAAAGACCTCTTTACTGACTATCTCTTAGTTAGCTGCGACCACGAGGCCTCAGCGGGGAGCAAACGGAGGCCCCAGATTCGTGTCCGCGCTTGCCAAATCGGCACATTAACAATTAGGAGCCGACCCCCCTAACACTCGTATAGACATGGCTACGTGGCCACGCGAACTATTTTTACTAAGTATGGAATACAA
43665M14I2M1X2M50I5M39I3M74I5MAGGCCAGTCGGTCTGGGATTACATGTCTGTAGAGGGGTGTGTCTCCTTTTCCTTCAAGCTCCCCACCCTAACACAGCGGTTCCTGGACGCGTCATGCTCCTGGTCTTGATGGCGGATCACTGCATAGATTGCTAAAGTGCGACTGATAAGCACCACCGTACATGGGGGTTGTAACTCTCCTGAAGGTAATCCTTAATCTC
48152M69I7M60I7M44I2M1X3M4I1MAGCGTCGTATCTCACCTTACATTTTCGCAACGAAAAATAACCTTCGTGATGGCCCCGAGGTAGATCAAGACGCCTATAAACCGGAGTGCTCTTATCCCGTTCCAACCCGCCGGAAGCGCCGGCCGCCGATCCACTAGGTAGCGGAGAGAAACTGTCGTCTTACCAACGTTTCCAAGTCACTATTCAGAGCTCTCTTTACC

Wrapping Up

Hopefully, with this you can see how easy it is to connect Exon-DuckDB (and WTT-02, for that matter) to a postgres database, and run queries against it using WTT's domain specific functions.