Getting Started
This extension is no longer maintained, though it's still straightforward to use Exon with DuckDB. See the biobear integration for more information.
Exon-DuckDB is a DuckDB extension that provides a set of functions for working with scientific data.
Installation
Exon-DuckDB is installed through DuckDB commands, so you first need DuckDB itself, in whichever form you use it, and can then install the extension from there.
Here we'll show how to install it via the command line, Python, and R. For other languages such as Julia or C++, see the corresponding DuckDB client libraries; the DuckDB website has up-to-date instructions for all of them.
Command Line
On the command line, first start the DuckDB shell:
duckdb -unsigned
Once there, add the repository, and install the extension:
D SET custom_extension_repository='dbe.wheretrue.com/exon/latest';
D INSTALL exon;
D LOAD exon;
You should only need to install the extension once, but you'll need to load it each time you start DuckDB.
Assuming that all went well, you should be able to run the following command:
SELECT gc_content('ATCG');
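To sanity-check the result, it can help to compute the same value by hand. The sketch below is a plain-Python reference, not Exon's implementation. It assumes gc_content reports the fraction of bases that are G or C; the name gc_fraction is made up for illustration. If the extension reports a percentage instead, scale accordingly.

```python
def gc_fraction(sequence: str) -> float:
    """Return the fraction of bases in `sequence` that are G or C.

    Hypothetical reference helper; assumes a fraction in [0, 1].
    """
    if not sequence:
        # Assumption: treat an empty sequence as 0.0 rather than raising.
        return 0.0
    upper = sequence.upper()
    gc_count = sum(1 for base in upper if base in "GC")
    return gc_count / len(upper)


print(gc_fraction("ATCG"))
```

Under that assumption, 'ATCG' has two of its four bases in G or C, so the reference value is 0.5.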
Python
In Python, you'll follow roughly the same steps, issued through the duckdb package.
import duckdb
con = duckdb.connect(
    config={
        "allow_unsigned_extensions": True,
    }
)
con.execute("SET custom_extension_repository='dbe.wheretrue.com/exon/latest'")
con.install_extension("exon")
con.load_extension("exon")
As on the command line, you should only need to install the extension once, but you'll need to load it each time you start DuckDB. If loading went well, you should be able to run the following query:
# Requires pandas to be installed
df = con.execute("SELECT gc_content('ATCG')").df()
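Because installing happens once and loading happens every session, it can be convenient to keep the two apart in a small helper. The following is only a sketch: setup_exon and EXON_REPOSITORY are hypothetical names, and the helper assumes nothing about the connection beyond an execute method that accepts a SQL string, as the duckdb connection above does.

```python
from typing import Protocol


class SupportsExecute(Protocol):
    """Anything with an execute(query) method, e.g. a DuckDB connection."""

    def execute(self, query: str): ...


# Repository string copied from the installation steps above.
EXON_REPOSITORY = "dbe.wheretrue.com/exon/latest"


def setup_exon(con: SupportsExecute, install: bool = False) -> list[str]:
    """Run the Exon setup statements on `con` and return them.

    Pass install=True the first time; later sessions only need LOAD.
    """
    statements = []
    if install:
        statements.append(f"SET custom_extension_repository='{EXON_REPOSITORY}'")
        statements.append("INSTALL exon")
    statements.append("LOAD exon")
    for statement in statements:
        con.execute(statement)
    return statements
```

With the connection created as shown earlier (unsigned extensions allowed), the first run would call setup_exon(con, install=True) and later sessions would call setup_exon(con).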
R
Finally, in R, you'll follow roughly the same steps, using the DBI and duckdb packages.
library(DBI)
library(duckdb)
con <- dbConnect(
  duckdb::duckdb(config = list("allow_unsigned_extensions" = "true")),
  dbdir = ":memory:"
)
query <- "SET custom_extension_repository='dbe.wheretrue.com/exon/latest';"
dbExecute(con, query)
query <- "INSTALL exon;"
dbExecute(con, query)
query <- "LOAD 'exon';"
dbExecute(con, query)
res <- dbGetQuery(con, "SELECT gc_content('ATCG')")
print(res)
Usage
Once installed, you can use the provided table and/or scalar functions in your queries. For example:
SELECT *
FROM read_fasta('path/to/file.fasta')
WHERE sequence LIKE 'M%'
LIMIT 5
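To make the query's semantics concrete, here is a rough pure-Python analogue: parse FASTA records, keep those whose sequence starts with M, and stop after five. It is only a sketch and not how Exon reads files. The id, description, and sequence field names are assumed to mirror the table function's columns, and parse_fasta and first_matching are hypothetical helpers.

```python
from itertools import islice
from typing import Iterable, Iterator, NamedTuple


class FastaRecord(NamedTuple):
    id: str
    description: str
    sequence: str


def _make_record(header: str, chunks: list[str]) -> FastaRecord:
    # The first whitespace-delimited token is the id; the rest is the description.
    ident, _, description = header.partition(" ")
    return FastaRecord(ident, description, "".join(chunks))


def parse_fasta(lines: Iterable[str]) -> Iterator[FastaRecord]:
    """Yield one record per '>' header, joining wrapped sequence lines."""
    header = None
    chunks: list[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


def first_matching(lines: Iterable[str], prefix: str = "M", limit: int = 5) -> list[FastaRecord]:
    """Rough equivalent of WHERE sequence LIKE 'M%' LIMIT 5."""
    matches = (r for r in parse_fasta(lines) if r.sequence.startswith(prefix))
    return list(islice(matches, limit))


# Made-up records, for illustration only.
example = """>sp1 first protein
MKT
AYI
>sp2 second protein
GAVL
>sp3
MSE
""".splitlines()

for record in first_matching(example):
    print(record.id, record.sequence)
```

On the made-up input, only sp1 and sp3 pass the filter, since sp2's sequence starts with G.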
You can see more information below.
🗃️ API Reference
📄️ Querying Postgres
Postgres is one of the most common places where organizations store their data, and rightly so: it's a powerful tool. Moreover, it's a very common technology used by SaaS companies that serve scientific organizations. For example, perhaps the most popular ELN/LIMS system, Benchling, serves data back to its customers via a Postgres database.
📄️ Sequence Alignments
Exon-DuckDB includes tools for aligning sequences and working with the alignment outputs. This page serves as a guide, but please see the API documentation for more details on specific functions.
📄️ Tertiary Analysis with SQL
Bioinformatics analysis is often discussed in terms of primary, secondary, and tertiary analysis. Which, quickly, can be described as: