Tertiary Analysis with SQL
Bioinformatics analysis is often discussed in terms of primary, secondary, and tertiary analysis. Which, quickly, can be described as:
- Primary Analysis: Initial processing of raw sequencing data, including base calling, quality control, and data demultiplexing.
- Secondary Analysis: Computational methods applied to processed data, such as read alignment, variant calling, and gene expression quantification.
- Tertiary Analysis: Advanced and integrative analyses aiming to derive biological insights and interpret results, often combining NGS data with other data types.
What we're focused on there is the last one, tertiary analysis. And moreover, how to do some of the "bioinformatics-y" things with SQL. It's already really good at the 'other data types' part, but we can do some of the other stuff too.
The Data
JGI maintains a "Genome Portal" where individuals deposit various samples, here metagenomics samples, and then JGI employs a bevy of tools to analyze the data. This results into a repository of files that are useful in-and-of themselves, but we can extract more value by combining them. If you'd like to play along, you can download the files for this sample here: https://genome.jgi.doe.gov/portal/IMG_3300061725/IMG_3300061725.info.html. It's a sample from a microbial community in geothermal field west of Mexico City.
Geothermal fields are considered "hot-beds", if you will, of useful diversity as selection pressure of the environment can lead to industrially useful enzymes, e.g. heat stability.
The Analysis
This document covers a few different tasks related to combining different bioinformatics output into useful integrative analyses, and in particular how to work with annotations. For example, if we were a CRISPR company, how we might find genes that are "close" to a CRISPR array in that metagenomics sample.
To carry out the analysis, we'll use a few different tools:
- DuckDB: A SQL database that can be run in-process, and is particularly good at analytical queries.
- JupySQL: A simple juptyer extension that allows us to run SQL queries in a notebook.
- Exon-DuckDB: A duckdb extension that allows us to work with scientific data. Written by the author of this notebook.
- DNA Features Viewer: A python library for visualizing DNA features.
Preamble
To get started, setup the notebook and import the necessary modules.
%load_ext sql
The sql extension is already loaded. To reload it, use:
%reload_ext sql
from sqlalchemy import create_engine
connect_with_extensions = create_engine(
"duckdb:///:memory:",
connect_args={"config": {"allow_unsigned_extensions": True}},
)
%sql connect_with_extensions
%%sql
SET custom_extension_repository='dbe.wheretrue.com/exon/latest';
INSTALL exon;
LOAD exon
Success |
---|
Absolute Positioning
Now that the notebook is ready, we can get into the analysis.
Absolute position in contrast to relative positioning is a good place to start. The idea is to filter based on fixed positions. For example, let's see if there are any CRISPR arrays that start within 50 bases of the start of the contig.
%%sql
SELECT seqname, source, type, start, "end"
FROM read_gff('./IMGData/Ga0604745_crt.gff')
WHERE type = 'CRISPR' AND start <= 50
LIMIT 5;
seqname | source | type | start | end |
---|---|---|---|---|
Ga0604745_072777 | CRT 1.8.4 | CRISPR | 4 | 698 |
Ga0604745_073037 | CRT 1.8.4 | CRISPR | 2 | 696 |
Ga0604745_073489 | CRT 1.8.4 | CRISPR | 9 | 693 |
Ga0604745_073497 | CRT 1.8.4 | CRISPR | 23 | 658 |
Ga0604745_073514 | CRT 1.8.4 | CRISPR | 1 | 521 |
SQL also makes it easy to summarize the data to get a sense of the data.
%%sql
SELECT COUNT(*)
FROM read_gff('./IMGData/Ga0604745_crt.gff')
WHERE type = 'CRISPR' AND start <= 50
count_star() |
---|
1400 |
Relative Positions
Relative positions are a bit more complicated, but can also be very useful. As often comes up in bioinformatics, a scientist may want to know if one feature is near another feature -- is this enzyme near this other enzyme in a larger pathway? Is this SNP near a gene that is important in some phenotype? Is this CRISPR array near a Cas gene that could help us classify it?
Co-occurrence of CRISPR Arrays
For example, let's see if there are any CRISPR arrays that start within 1000 bases of the start of another CRISPR array.
To that, we'll self-join the CRISPR array table on itself. Join on the contig, and require that the end of the first array is within 1000 bases of the start of the second array. I.e. if we add 1000 to the end of the first array, it should be greater than the start of the second array. Also, to ensure you don't get an upstream array, we'll require that the start of the first array is less than the start of the second array.
%%sql
WITH arrays AS (
SELECT *, attributes['ID'][1] AS crispr_id
-- The exon extension provides this table function.
FROM read_gff('./IMGData/Ga0604745_crt.gff')
WHERE type = 'CRISPR'
)
SELECT arrays.seqname, arrays.crispr_id, arrays.start, arrays.end, other.crispr_id, other.start, other.end
FROM arrays
JOIN arrays AS other
ON arrays.seqname = other.seqname
AND arrays.crispr_id != other.crispr_id
AND (arrays.end + 1000) >= other.start
AND arrays.start <= other.start
seqname | crispr_id | start | end | crispr_id_1 | start_1 | end_1 |
---|---|---|---|---|---|---|
Ga0604745_085435 | Ga0604745_085435_27_242 | 27 | 242 | Ga0604745_085435_460_609 | 460 | 609 |
Ga0604745_038187 | Ga0604745_038187_1_696 | 1 | 696 | Ga0604745_038187_739_1113 | 739 | 1113 |
Ga0604745_036347 | Ga0604745_036347_41_308 | 41 | 308 | Ga0604745_036347_757_1113 | 757 | 1113 |
Ga0604745_101232 | Ga0604745_101232_1_326 | 1 | 326 | Ga0604745_101232_365_517 | 365 | 517 |
Ga0604745_104520 | Ga0604745_104520_4_211 | 4 | 211 | Ga0604745_104520_248_529 | 248 | 529 |
Ga0604745_126005 | Ga0604745_126005_23_167 | 23 | 167 | Ga0604745_126005_211_471 | 211 | 471 |
Ga0604745_164331 | Ga0604745_164331_32_178 | 32 | 178 | Ga0604745_164331_215_364 | 215 | 364 |
Ga0604745_003766 | Ga0604745_003766_98_261 | 98 | 261 | Ga0604745_003766_406_581 | 406 | 581 |
Ga0604745_003677 | Ga0604745_003677_1876_2154 | 1876 | 2154 | Ga0604745_003677_2865_3199 | 2865 | 3199 |
It's hard to tell exactly what's happening here, but we'll note that these arrays are a bit on the small side, and tend to start closer to the start of the contig. This is potentially a sign of a fragmented assembly, causing the CRISPR arrays to be fragmented as well. That said, out of ~1650 contigs, having a handful is to be expected.
%%sql
-- Quick stats on CRISPR arrays for context
SELECT AVG("end" - start) avg_len, AVG(start) avg_start, AVG("end") avg_end, COUNT(*) cnt
FROM read_gff('./IMGData/Ga0604745_crt.gff')
WHERE type = 'CRISPR'
avg_len | avg_start | avg_end | cnt |
---|---|---|---|
563.8600357355568 | 394.23525908278737 | 958.0952948183442 | 1679 |
Genes
This dataset also has annotated genes, which may lead to some interesting questions about the co-occurrence of genes and CRISPR arrays. For brevity, we'll just do the easy thing, and join the CRISPR arrays with genes on the same config and start or end w/i 1000 bases. This'll give us a good starting point to use for visualization and further analysis.
%%sql
DROP TABLE IF EXISTS array_cds_overlaps;
CREATE TABLE array_cds_overlaps AS
-- We could also create tables for these at the start of the analysis. A good idea for larger datasets.
WITH arrays AS (
SELECT *, attributes['ID'][1] AS crispr_id
FROM read_gff('./IMGData/Ga0604745_crt.gff')
WHERE type = 'CRISPR'
), cds AS (
SELECT *, attributes['ID'][1] AS cds_id
FROM read_gff('./IMGData/Ga0604745_prodigal.gff')
)
SELECT arrays.seqname, arrays.crispr_id, arrays.start, arrays.end, cds.cds_id, cds.start cds_start, cds.end cds_end
FROM arrays
JOIN cds
ON arrays.seqname = cds.seqname
WHERE
(ABS(arrays.start - cds.start) <= 5000 OR ABS(arrays.end - cds.end) <= 5000)
AND arrays.crispr_id = 'Ga0604745_000853_20840_21021'
ORDER BY arrays.seqname, arrays.crispr_id, cds.start ASC, cds.end ASC
Count |
---|
12 |
Visualizing the Output
One of the advantages of SQL and this type of analysis, is it makes it easy to get the data into a format that can be visualized. For example, let's take the output of the previous query and visualize it -- note I've added a WHERE clause above to filter to a single CRISPR array.
results = %sql SELECT * FROM array_cds_overlaps
df = results.DataFrame()
import matplotlib
from dna_features_viewer import GraphicFeature, GraphicRecord
features = [
GraphicFeature(start=df.iloc[0]['start'], end=df.iloc[0]['end'], color="#ffd700",
label=df.iloc[0]['crispr_id'])
]
for _, row in df.iterrows():
features.append(
GraphicFeature(start=row['cds_start'], end=row['cds_end'], color="#ffcccc",
label=row.cds_id)
)
f, ax = matplotlib.pyplot.subplots(1, figsize=(15, 4))
matplotlib.pyplot.subplots_adjust(bottom=0.3)
record = GraphicRecord(sequence_length=df.cds_end.max() + 1000, features=features).crop((df.cds_start.min() - 1000, df.cds_end.max() + 1000))
record.plot(ax=ax);
ax.set_title('CRISPR array and CDS overlaps')
Text(0.5, 1.0, 'CRISPR array and CDS overlaps')
Conclusion
This notebook has covered a few different ways to work with annotations in SQL, and how to combine them with other data types. This is just a starting point, and there are many other ways to combine annotations with other data types, and many other data types to combine with annotations.