Querying Postgres
Postgres is one of the most common locations where organizations store there data, and rightly so, it's a powerful tool. Moreover, it's a very common technology used by SAAS companies that serve scientific organizations. For example, perhaps the most popular ELN/LIMS system, Benchling, serves data back to its customers via a Postgres database.
While Postgres is amazing, it lacks domain specific functions that scientist often need in their day to day. I'd arguably be further in life if not for ad-hoc requests to help find some sequence in some database 😄. The process is kludgey, and involves some combination of python scripts and CLI tools which are both not very accessible and leads to tons of slight differences in the way the data is processed.
While our tools won't be as powerful as purpose-built tools for alignment or similarity search, they are much easier and when you think about the extra time that goes into data munging they often are more than sufficient.
Executing a Search​
WTT's tools use DuckDB under the hood, and thus have access to its Postgres Scanner, so we can use that to query the data and use WTT's tools to do the analysis.
As an example, let's pretend we have a table in postgres called dna
, with a sequence
column.
$ psql -h 127.0.0.1 -U postgres -c 'SELECT * FROM dna LIMIT 5;'
id | sequence |
---|---|
1 | GCAGGTAATGTTATATCGTCTGCGTCCTGATCCGTCAGTCCACCGCAGGCAATAACTGCAAAGGGCTTCAAGCTAAGATCAGATCCACAGCGACTTACGTCCATAACTTTTGTACGACGTCGTGGTAACACGCGTACACCATTCGTATCGTTACTAACGGTCTACCCCAAGAACTAGAGACGTCTGCGATCTAAATGTCA |
2 | AAAACGTGGAGCAGGGAAATGCAGATACAGCTTTCGTTATCGTGACGTAGGCCACAGTAACCCCTGTCTACATGTATACCTGTCAATACTCGCGCGGCTGCTCAACCGACAGGAACGCTAAATTTTGCTATGCACACGTCCAGTGAGAGAGTCCCAGACAATGTCCAATACGTAATTTTCGTAAGTAGGCGTTCAAACCA |
3 | TTCGGGTTAATCATGGACGGGCCACTATGGAATGGATACGTGCACGTTTTCGACAGCAAAACCAGCTTTGGTTAACTTCCTCGTAGATCCGTAGATTGCGTCCAGAACCGATCTCTAGAGACGCCGGACCCTAAACTGTTTACGGATGGCCGCTGTGTGTCAATAGGCCAGGGACCGATACCTGAACAATCACCTCGCCT |
4 | CCATGTCAGCACGGCACCGAATCGTCCAACCCCCCCGCGGAAGCGATTCGGAGGACTCCGGGCATCCCCTGACAAAGAAATAGTCACTGACATTCTCTAATAGTCCCCGGGTTCTCACGGGTAGTGTGATTTCCGAGGGGCTACCCTTAATGCGACGGCATGACGAGTGAAAGCCATTGTGATACTGTTTTGCGGGGAGA |
5 | AGGCCTATATAGCGGACTATCTCCGTTGTTTGAATTTATAAAGTGTGCTGCTACGCATCATAGCGGACTGGCCTTTCTACTCACTGTGTTTGCAGAGAGGATGCTCCTTAACATGGTCTCTGCGAAATGTGCCGATTGATCATGAGTGGTGCCGACCGATGGAGTGTGGGCGAGATCTCGCCCATCACATCTGTTACTTA |
Now, how can we use Exon-DuckDB to find the sequences that are similar to a query sequence? From the DuckDB CLI, with exondb
loaded, first connect to postgres...
INSTALL 'postgres_scanner';
LOAD 'postgres_scanner';
-- export PGPASSWORD=postgres
-- This could also be a connection to your ELN's database
CALL POSTGRES_ATTACH('postgres://postgres:postgres@localhost:5432/postgres');
Then, we can run a dummy query against the postgres tables...
SELECT * FROM dna LIMIT 5;
id | sequence |
---|---|
1 | GCAGGTAATGTTATATCGTCTGCGTCCTGATCCGTCAGTCCACCGCAGGCAATAACTGCAAAGGGCTTCAAGCTAAGATCAGATCCACAGCGACTTACGTCCATAACTTTTGTACGACGTCGTGGTAACACGCGTACACCATTCGTATCGTTACTAACGGTCTACCCCAAGAACTAGAGACGTCTGCGATCTAAATGTCA |
2 | AAAACGTGGAGCAGGGAAATGCAGATACAGCTTTCGTTATCGTGACGTAGGCCACAGTAACCCCTGTCTACATGTATACCTGTCAATACTCGCGCGGCTGCTCAACCGACAGGAACGCTAAATTTTGCTATGCACACGTCCAGTGAGAGAGTCCCAGACAATGTCCAATACGTAATTTTCGTAAGTAGGCGTTCAAACCA |
3 | TTCGGGTTAATCATGGACGGGCCACTATGGAATGGATACGTGCACGTTTTCGACAGCAAAACCAGCTTTGGTTAACTTCCTCGTAGATCCGTAGATTGCGTCCAGAACCGATCTCTAGAGACGCCGGACCCTAAACTGTTTACGGATGGCCGCTGTGTGTCAATAGGCCAGGGACCGATACCTGAACAATCACCTCGCCT |
4 | CCATGTCAGCACGGCACCGAATCGTCCAACCCCCCCGCGGAAGCGATTCGGAGGACTCCGGGCATCCCCTGACAAAGAAATAGTCACTGACATTCTCTAATAGTCCCCGGGTTCTCACGGGTAGTGTGATTTCCGAGGGGCTACCCTTAATGCGACGGCATGACGAGTGAAAGCCATTGTGATACTGTTTTGCGGGGAGA |
5 | AGGCCTATATAGCGGACTATCTCCGTTGTTTGAATTTATAAAGTGTGCTGCTACGCATCATAGCGGACTGGCCTTTCTACTCACTGTGTTTGCAGAGAGGATGCTCCTTAACATGGTCTCTGCGAAATGTGCCGATTGATCATGAGTGGTGCCGACCGATGGAGTGTGGGCGAGATCTCGCCCATCACATCTGTTACTTA |
Same as above, so far so good. And to see this isn't "nothing", let's count the number of sequences in the table:
SELECT COUNT(*) AS cnt FROM dna;
cnt |
---|
50000 |
Now, let's get back to the sequence alignment case. If our query sequence is AGGCCTATATAGCGGACTATCTC
and we want to find the top 5 similar sequences, we can do the following:
SELECT id, alignment_score_wfa_gap_affine(sequence, 'AGGCCTATATAGCGGACTATCTC') score, sequence
FROM dna
ORDER BY alignment_score_wfa_gap_affine(sequence, 'AGGCCTATATAGCGGACTATCTC') DESC
LIMIT 5;
This take about 3 seconds on a M2 MBA, and returns the following:
id | score | sequence |
---|---|---|
5 | -360.0 | AGGCCTATATAGCGGACTATCTCCGTTGTTTGAATTTATAAAGTGTGCTGCTACGCATCATAGCGGACTGGCCTTTCTACTCACTGTGTTTGCAGAGAGGATGCTCCTTAACATGGTCTCTGCGAAATGTGCCGATTGATCATGAGTGGTGCCGACCGATGGAGTGTGGGCGAGATCTCGCCCATCACATCTGTTACTTA |
43368 | -380.0 | AGGGAGTTGACGTAAATGCCGAGCAGAATAAGGGACCCTAAACCCCTGAAGGGACATGTTGTTGTCCCGCATCATCGTAGATTAGCGTACTTGCGTTAAGATTTTCACTTGGAAGTATCACATATCTAGGTATTCCTAGACCACTACTCATGCTATTGTACAAGTCGAACGTTCATAATGGTCGGCAATCCGCTCATCTC |
2400 | -382.0 | TGTCCTATAAAGACCTCTTTACTGACTATCTCTTAGTTAGCTGCGACCACGAGGCCTCAGCGGGGAGCAAACGGAGGCCCCAGATTCGTGTCCGCGCTTGCCAAATCGGCACATTAACAATTAGGAGCCGACCCCCCTAACACTCGTATAGACATGGCTACGTGGCCACGCGAACTATTTTTACTAAGTATGGAATACAA |
4366 | -382.0 | AGGCCAGTCGGTCTGGGATTACATGTCTGTAGAGGGGTGTGTCTCCTTTTCCTTCAAGCTCCCCACCCTAACACAGCGGTTCCTGGACGCGTCATGCTCCTGGTCTTGATGGCGGATCACTGCATAGATTGCTAAAGTGCGACTGATAAGCACCACCGTACATGGGGGTTGTAACTCTCCTGAAGGTAATCCTTAATCTC |
4815 | -382.0 | AGCGTCGTATCTCACCTTACATTTTCGCAACGAAAAATAACCTTCGTGATGGCCCCGAGGTAGATCAAGACGCCTATAAACCGGAGTGCTCTTATCCCGTTCCAACCCGCCGGAAGCGCCGGCCGCCGATCCACTAGGTAGCGGAGAGAAACTGTCGTCTTACCAACGTTTCCAAGTCACTATTCAGAGCTCTCTTTACC |
And this is great, but doesn't tell us a lot about the alignment itself. To do that we can use the alignment_score_wfa_gap_affine
function which returns a CIGAR string describing the alignment.
SELECT id, alignment_string_wfa_gap_affine(sequence, 'AGGCCTATATAGCGGACTATCTC') cigar, sequence
FROM dna
ORDER BY alignment_score_wfa_gap_affine(sequence, 'AGGCCTATATAGCGGACTATCTC') DESC
LIMIT 5;
id | cigar | sequence |
---|---|---|
5 | 23M177I | AGGCCTATATAGCGGACTATCTCCGTTGTTTGAATTTATAAAGTGTGCTGCTACGCATCATAGCGGACTGGCCTTTCTACTCACTGTGTTTGCAGAGAGGATGCTCCTTAACATGGTCTCTGCGAAATGTGCCGATTGATCATGAGTGGTGCCGACCGATGGAGTGTGGGCGAGATCTCGCCCATCACATCTGTTACTTA |
43368 | 3M33I4M1X1M40I5M1X3M104I5M | AGGGAGTTGACGTAAATGCCGAGCAGAATAAGGGACCCTAAACCCCTGAAGGGACATGTTGTTGTCCCGCATCATCGTAGATTAGCGTACTTGCGTTAAGATTTTCACTTGGAAGTATCACATATCTAGGTATTCCTAGACCACTACTCATGCTATTGTACAAGTCGAACGTTCATAATGGTCGGCAATCCGCTCATCTC |
2400 | 1X1M1X6M1X2M9I1M1X9M168I | TGTCCTATAAAGACCTCTTTACTGACTATCTCTTAGTTAGCTGCGACCACGAGGCCTCAGCGGGGAGCAAACGGAGGCCCCAGATTCGTGTCCGCGCTTGCCAAATCGGCACATTAACAATTAGGAGCCGACCCCCCTAACACTCGTATAGACATGGCTACGTGGCCACGCGAACTATTTTTACTAAGTATGGAATACAA |
4366 | 5M14I2M1X2M50I5M39I3M74I5M | AGGCCAGTCGGTCTGGGATTACATGTCTGTAGAGGGGTGTGTCTCCTTTTCCTTCAAGCTCCCCACCCTAACACAGCGGTTCCTGGACGCGTCATGCTCCTGGTCTTGATGGCGGATCACTGCATAGATTGCTAAAGTGCGACTGATAAGCACCACCGTACATGGGGGTTGTAACTCTCCTGAAGGTAATCCTTAATCTC |
4815 | 2M69I7M60I7M44I2M1X3M4I1M | AGCGTCGTATCTCACCTTACATTTTCGCAACGAAAAATAACCTTCGTGATGGCCCCGAGGTAGATCAAGACGCCTATAAACCGGAGTGCTCTTATCCCGTTCCAACCCGCCGGAAGCGCCGGCCGCCGATCCACTAGGTAGCGGAGAGAAACTGTCGTCTTACCAACGTTTCCAAGTCACTATTCAGAGCTCTCTTTACC |
Wrapping Up​
Hopefully, with this you can see how easy it is to connect Exon-DuckDB (and WTT-02, for that matter) to a postgres database, and run queries against it using WTT's domain specific functions.