Skip to main content

Tertiary Analysis with SQL

Bioinformatics analysis is often discussed in terms of primary, secondary, and tertiary analysis. Which, quickly, can be described as:

  • Primary Analysis: Initial processing of raw sequencing data, including base calling, quality control, and data demultiplexing.
  • Secondary Analysis: Computational methods applied to processed data, such as read alignment, variant calling, and gene expression quantification.
  • Tertiary Analysis: Advanced and integrative analyses aiming to derive biological insights and interpret results, often combining NGS data with other data types.

What we're focused on there is the last one, tertiary analysis. And moreover, how to do some of the "bioinformatics-y" things with SQL. It's already really good at the 'other data types' part, but we can do some of the other stuff too.

The Data

JGI maintains a "Genome Portal" where individuals deposit various samples, here metagenomics samples, and then JGI employs a bevy of tools to analyze the data. This results into a repository of files that are useful in-and-of themselves, but we can extract more value by combining them. If you'd like to play along, you can download the files for this sample here: https://genome.jgi.doe.gov/portal/IMG_3300061725/IMG_3300061725.info.html. It's a sample from a microbial community in geothermal field west of Mexico City.

Geothermal fields are considered "hot-beds", if you will, of useful diversity as selection pressure of the environment can lead to industrially useful enzymes, e.g. heat stability.

The Analysis

This document covers a few different tasks related to combining different bioinformatics output into useful integrative analyses, and in particular how to work with annotations. For example, if we were a CRISPR company, how we might find genes that are "close" to a CRISPR array in that metagenomics sample.

To carry out the analysis, we'll use a few different tools:

  • DuckDB: A SQL database that can be run in-process, and is particularly good at analytical queries.
  • JupySQL: A simple juptyer extension that allows us to run SQL queries in a notebook.
  • Exon-DuckDB: A duckdb extension that allows us to work with scientific data. Written by the author of this notebook.
  • DNA Features Viewer: A python library for visualizing DNA features.

Preamble

To get started, setup the notebook and import the necessary modules.

%load_ext sql
The sql extension is already loaded. To reload it, use:
%reload_ext sql
from sqlalchemy import create_engine

connect_with_extensions = create_engine(
"duckdb:///:memory:",
connect_args={"config": {"allow_unsigned_extensions": True}},
)
%sql connect_with_extensions
%%sql
SET custom_extension_repository='dbe.wheretrue.com/exon/latest';
INSTALL exon;
LOAD exon

Running query in 'duckdb:///:memory:'
Success

Absolute Positioning

Now that the notebook is ready, we can get into the analysis.

Absolute position in contrast to relative positioning is a good place to start. The idea is to filter based on fixed positions. For example, let's see if there are any CRISPR arrays that start within 50 bases of the start of the contig.

%%sql

SELECT seqname, source, type, start, "end"
FROM read_gff('./IMGData/Ga0604745_crt.gff')
WHERE type = 'CRISPR' AND start <= 50
LIMIT 5;

Running query in 'duckdb:///:memory:'
seqnamesourcetypestartend
Ga0604745_072777CRT 1.8.4CRISPR4698
Ga0604745_073037CRT 1.8.4CRISPR2696
Ga0604745_073489CRT 1.8.4CRISPR9693
Ga0604745_073497CRT 1.8.4CRISPR23658
Ga0604745_073514CRT 1.8.4CRISPR1521

SQL also makes it easy to summarize the data to get a sense of the data.

%%sql

SELECT COUNT(*)
FROM read_gff('./IMGData/Ga0604745_crt.gff')
WHERE type = 'CRISPR' AND start <= 50

Running query in 'duckdb:///:memory:'
count_star()
1400

Relative Positions

Relative positions are a bit more complicated, but can also be very useful. As often comes up in bioinformatics, a scientist may want to know if one feature is near another feature -- is this enzyme near this other enzyme in a larger pathway? Is this SNP near a gene that is important in some phenotype? Is this CRISPR array near a Cas gene that could help us classify it?

Co-occurrence of CRISPR Arrays

For example, let's see if there are any CRISPR arrays that start within 1000 bases of the start of another CRISPR array.

To that, we'll self-join the CRISPR array table on itself. Join on the contig, and require that the end of the first array is within 1000 bases of the start of the second array. I.e. if we add 1000 to the end of the first array, it should be greater than the start of the second array. Also, to ensure you don't get an upstream array, we'll require that the start of the first array is less than the start of the second array.

%%sql

WITH arrays AS (
SELECT *, attributes['ID'][1] AS crispr_id

-- The exon extension provides this table function.
FROM read_gff('./IMGData/Ga0604745_crt.gff')
WHERE type = 'CRISPR'
)

SELECT arrays.seqname, arrays.crispr_id, arrays.start, arrays.end, other.crispr_id, other.start, other.end
FROM arrays
JOIN arrays AS other
ON arrays.seqname = other.seqname
AND arrays.crispr_id != other.crispr_id
AND (arrays.end + 1000) >= other.start
AND arrays.start <= other.start

Running query in 'duckdb:///:memory:'
seqnamecrispr_idstartendcrispr_id_1start_1end_1
Ga0604745_085435Ga0604745_085435_27_24227242Ga0604745_085435_460_609460609
Ga0604745_038187Ga0604745_038187_1_6961696Ga0604745_038187_739_11137391113
Ga0604745_036347Ga0604745_036347_41_30841308Ga0604745_036347_757_11137571113
Ga0604745_101232Ga0604745_101232_1_3261326Ga0604745_101232_365_517365517
Ga0604745_104520Ga0604745_104520_4_2114211Ga0604745_104520_248_529248529
Ga0604745_126005Ga0604745_126005_23_16723167Ga0604745_126005_211_471211471
Ga0604745_164331Ga0604745_164331_32_17832178Ga0604745_164331_215_364215364
Ga0604745_003766Ga0604745_003766_98_26198261Ga0604745_003766_406_581406581
Ga0604745_003677Ga0604745_003677_1876_215418762154Ga0604745_003677_2865_319928653199

It's hard to tell exactly what's happening here, but we'll note that these arrays are a bit on the small side, and tend to start closer to the start of the contig. This is potentially a sign of a fragmented assembly, causing the CRISPR arrays to be fragmented as well. That said, out of ~1650 contigs, having a handful is to be expected.

%%sql

-- Quick stats on CRISPR arrays for context
SELECT AVG("end" - start) avg_len, AVG(start) avg_start, AVG("end") avg_end, COUNT(*) cnt
FROM read_gff('./IMGData/Ga0604745_crt.gff')
WHERE type = 'CRISPR'
Running query in 'duckdb:///:memory:'
avg_lenavg_startavg_endcnt
563.8600357355568394.23525908278737958.09529481834421679

Genes

This dataset also has annotated genes, which may lead to some interesting questions about the co-occurrence of genes and CRISPR arrays. For brevity, we'll just do the easy thing, and join the CRISPR arrays with genes on the same config and start or end w/i 1000 bases. This'll give us a good starting point to use for visualization and further analysis.

%%sql

DROP TABLE IF EXISTS array_cds_overlaps;

CREATE TABLE array_cds_overlaps AS

-- We could also create tables for these at the start of the analysis. A good idea for larger datasets.
WITH arrays AS (
SELECT *, attributes['ID'][1] AS crispr_id
FROM read_gff('./IMGData/Ga0604745_crt.gff')
WHERE type = 'CRISPR'
), cds AS (
SELECT *, attributes['ID'][1] AS cds_id
FROM read_gff('./IMGData/Ga0604745_prodigal.gff')
)

SELECT arrays.seqname, arrays.crispr_id, arrays.start, arrays.end, cds.cds_id, cds.start cds_start, cds.end cds_end
FROM arrays
JOIN cds
ON arrays.seqname = cds.seqname
WHERE
(ABS(arrays.start - cds.start) <= 5000 OR ABS(arrays.end - cds.end) <= 5000)

AND arrays.crispr_id = 'Ga0604745_000853_20840_21021'

ORDER BY arrays.seqname, arrays.crispr_id, cds.start ASC, cds.end ASC

Running query in 'duckdb:///:memory:'
Count
12

Visualizing the Output

One of the advantages of SQL and this type of analysis, is it makes it easy to get the data into a format that can be visualized. For example, let's take the output of the previous query and visualize it -- note I've added a WHERE clause above to filter to a single CRISPR array.

results = %sql SELECT * FROM array_cds_overlaps
Running query in 'duckdb:///:memory:'
df = results.DataFrame()

import matplotlib
from dna_features_viewer import GraphicFeature, GraphicRecord
features = [
GraphicFeature(start=df.iloc[0]['start'], end=df.iloc[0]['end'], color="#ffd700",
label=df.iloc[0]['crispr_id'])
]

for _, row in df.iterrows():
features.append(
GraphicFeature(start=row['cds_start'], end=row['cds_end'], color="#ffcccc",
label=row.cds_id)
)

f, ax = matplotlib.pyplot.subplots(1, figsize=(15, 4))
matplotlib.pyplot.subplots_adjust(bottom=0.3)

record = GraphicRecord(sequence_length=df.cds_end.max() + 1000, features=features).crop((df.cds_start.min() - 1000, df.cds_end.max() + 1000))
record.plot(ax=ax);

ax.set_title('CRISPR array and CDS overlaps')
Text(0.5, 1.0, 'CRISPR array and CDS overlaps')

png

Conclusion

This notebook has covered a few different ways to work with annotations in SQL, and how to combine them with other data types. This is just a starting point, and there are many other ways to combine annotations with other data types, and many other data types to combine with annotations.