Skip to main content

Tertiary Analysis with SQL

Bioinformatics analysis is often discussed in terms of primary, secondary, and tertiary analysis. Which, quickly, can be described as:

  • Primary Analysis: Initial processing of raw sequencing data, including base calling, quality control, and data demultiplexing.
  • Secondary Analysis: Computational methods applied to processed data, such as read alignment, variant calling, and gene expression quantification.
  • Tertiary Analysis: Advanced and integrative analyses aiming to derive biological insights and interpret results, often combining NGS data with other data types.

What we're focused on there is the last one, tertiary analysis. And moreover, how to do some of the "bioinformatics-y" things with SQL. It's already really good at the 'other data types' part, but we can do some of the other stuff too.

The Data

JGI maintains a "Genome Portal" where individuals deposit various samples, here metagenomics samples, and then JGI employs a bevy of tools to analyze the data. This results into a repository of files that are useful in-and-of themselves, but we can extract more value by combining them. If you'd like to play along, you can download the files for this sample here: It's a sample from a microbial community in geothermal field west of Mexico City.

Geothermal fields are considered "hot-beds", if you will, of useful diversity as selection pressure of the environment can lead to industrially useful enzymes, e.g. heat stability.

The Analysis

This document covers a few different tasks related to combining different bioinformatics output into useful integrative analyses, and in particular how to work with annotations. For example, if we were a CRISPR company, how we might find genes that are "close" to a CRISPR array in that metagenomics sample.

To carry out the analysis, we'll use a few different tools:

  • DuckDB: A SQL database that can be run in-process, and is particularly good at analytical queries.
  • JupySQL: A simple juptyer extension that allows us to run SQL queries in a notebook.
  • Exon-DuckDB: A duckdb extension that allows us to work with scientific data. Written by the author of this notebook.
  • DNA Features Viewer: A python library for visualizing DNA features.


To get started, setup the notebook and import the necessary modules.

%load_ext sql
The sql extension is already loaded. To reload it, use:
%reload_ext sql
from sqlalchemy import create_engine

connect_with_extensions = create_engine(
connect_args={"config": {"allow_unsigned_extensions": True}},
%sql connect_with_extensions
SET custom_extension_repository='';
LOAD exon

Running query in 'duckdb:///:memory:'

Absolute Positioning

Now that the notebook is ready, we can get into the analysis.

Absolute position in contrast to relative positioning is a good place to start. The idea is to filter based on fixed positions. For example, let's see if there are any CRISPR arrays that start within 50 bases of the start of the contig.


SELECT seqname, source, type, start, "end"
FROM read_gff('./IMGData/Ga0604745_crt.gff')
WHERE type = 'CRISPR' AND start <= 50

Running query in 'duckdb:///:memory:'
Ga0604745_072777CRT 1.8.4CRISPR4698
Ga0604745_073037CRT 1.8.4CRISPR2696
Ga0604745_073489CRT 1.8.4CRISPR9693
Ga0604745_073497CRT 1.8.4CRISPR23658
Ga0604745_073514CRT 1.8.4CRISPR1521

SQL also makes it easy to summarize the data to get a sense of the data.


FROM read_gff('./IMGData/Ga0604745_crt.gff')
WHERE type = 'CRISPR' AND start <= 50

Running query in 'duckdb:///:memory:'

Relative Positions

Relative positions are a bit more complicated, but can also be very useful. As often comes up in bioinformatics, a scientist may want to know if one feature is near another feature -- is this enzyme near this other enzyme in a larger pathway? Is this SNP near a gene that is important in some phenotype? Is this CRISPR array near a Cas gene that could help us classify it?

Co-occurrence of CRISPR Arrays

For example, let's see if there are any CRISPR arrays that start within 1000 bases of the start of another CRISPR array.

To that, we'll self-join the CRISPR array table on itself. Join on the contig, and require that the end of the first array is within 1000 bases of the start of the second array. I.e. if we add 1000 to the end of the first array, it should be greater than the start of the second array. Also, to ensure you don't get an upstream array, we'll require that the start of the first array is less than the start of the second array.


WITH arrays AS (
SELECT *, attributes['ID'][1] AS crispr_id

-- The exon extension provides this table function.
FROM read_gff('./IMGData/Ga0604745_crt.gff')

SELECT arrays.seqname, arrays.crispr_id, arrays.start, arrays.end, other.crispr_id, other.start, other.end
FROM arrays
JOIN arrays AS other
ON arrays.seqname = other.seqname
AND arrays.crispr_id != other.crispr_id
AND (arrays.end + 1000) >= other.start
AND arrays.start <= other.start

Running query in 'duckdb:///:memory:'

It's hard to tell exactly what's happening here, but we'll note that these arrays are a bit on the small side, and tend to start closer to the start of the contig. This is potentially a sign of a fragmented assembly, causing the CRISPR arrays to be fragmented as well. That said, out of ~1650 contigs, having a handful is to be expected.


-- Quick stats on CRISPR arrays for context
SELECT AVG("end" - start) avg_len, AVG(start) avg_start, AVG("end") avg_end, COUNT(*) cnt
FROM read_gff('./IMGData/Ga0604745_crt.gff')
Running query in 'duckdb:///:memory:'


This dataset also has annotated genes, which may lead to some interesting questions about the co-occurrence of genes and CRISPR arrays. For brevity, we'll just do the easy thing, and join the CRISPR arrays with genes on the same config and start or end w/i 1000 bases. This'll give us a good starting point to use for visualization and further analysis.


DROP TABLE IF EXISTS array_cds_overlaps;

CREATE TABLE array_cds_overlaps AS

-- We could also create tables for these at the start of the analysis. A good idea for larger datasets.
WITH arrays AS (
SELECT *, attributes['ID'][1] AS crispr_id
FROM read_gff('./IMGData/Ga0604745_crt.gff')
), cds AS (
SELECT *, attributes['ID'][1] AS cds_id
FROM read_gff('./IMGData/Ga0604745_prodigal.gff')

SELECT arrays.seqname, arrays.crispr_id, arrays.start, arrays.end, cds.cds_id, cds.start cds_start, cds.end cds_end
FROM arrays
JOIN cds
ON arrays.seqname = cds.seqname
(ABS(arrays.start - cds.start) <= 5000 OR ABS(arrays.end - cds.end) <= 5000)

AND arrays.crispr_id = 'Ga0604745_000853_20840_21021'

ORDER BY arrays.seqname, arrays.crispr_id, cds.start ASC, cds.end ASC

Running query in 'duckdb:///:memory:'

Visualizing the Output

One of the advantages of SQL and this type of analysis, is it makes it easy to get the data into a format that can be visualized. For example, let's take the output of the previous query and visualize it -- note I've added a WHERE clause above to filter to a single CRISPR array.

results = %sql SELECT * FROM array_cds_overlaps
Running query in 'duckdb:///:memory:'
df = results.DataFrame()

import matplotlib
from dna_features_viewer import GraphicFeature, GraphicRecord
features = [
GraphicFeature(start=df.iloc[0]['start'], end=df.iloc[0]['end'], color="#ffd700",

for _, row in df.iterrows():
GraphicFeature(start=row['cds_start'], end=row['cds_end'], color="#ffcccc",

f, ax = matplotlib.pyplot.subplots(1, figsize=(15, 4))

record = GraphicRecord(sequence_length=df.cds_end.max() + 1000, features=features).crop((df.cds_start.min() - 1000, df.cds_end.max() + 1000))

ax.set_title('CRISPR array and CDS overlaps')
Text(0.5, 1.0, 'CRISPR array and CDS overlaps')



This notebook has covered a few different ways to work with annotations in SQL, and how to combine them with other data types. This is just a starting point, and there are many other ways to combine annotations with other data types, and many other data types to combine with annotations.