2 posts tagged with "cli" | WHERE TRUE Technologies Documentation

Data Prep for Bioinformatic ML

May 21, 2024 · 5 min read

Trent Hauck

Developer

This video is a quick walkthrough of:

What is a join, why it's important for bioinformatics analysis, and how to recognize when you could be using SQL vs Python.
How to use SQL to easily combine contigs, GFFs, and amino acid sequences into a dataset for a back-translation pLM from amino acids to CDSs.
1. Shows how to extract CDSs from contigs in a strand-aware manner.
2. Shows how to use contig metadata for Group K-Fold Cross Validation to avoid leakage.

The second part starts around the 5 minute mark.

If you'd like to try biobear, you can install it with pip:

pip install biobear

Or feel free to reach out if you have any question. Otherwise, the code from the video is below.

Tabular Example

For example, given the table:

gene_id	contig_id
1	1
2	1

Where each row represents a gene and the contig it is on, we can join this table with another table that has information about the contigs:

contig_id	length
1	1000

To get a table that looks like this:

gene_id	contig_id	length
1	1	1000
2	1	1000

So, with a genes table and a contigs table, this would look like:

SELECT
    genes.gene_id,
    genes.contig_id,
    contigs.length
FROM
    genes
JOIN
    contigs
ON
    genes.contig_id = contigs.contig_id

What this Looks like in Python

It's important to recognize this pattern in Python w/ Dictionaries. You may already be using this pattern without realizing it.

# Mapping of gene_id to contig_id
genes = {
    1: 1,
    2: 1
}

# Mapping of contig_id to length
contigs = {
    1: 1000
}

These two dictionaries can then be "joined":

for gene_id, contig_id in genes.items():
    length = contigs[contig_id]
    print(gene_id, contig_id, length)

This is straightforward, but can run into memory issues if the datasets are large. SQL databases are optimized for this type of operation. It also is fairly inflexible, as addint additional data, doing aggregations, etc. can require material code changes.

DataFrames

DataFrames are a common data structure found in Python libraries like Pandas (and base R). Conceptually they are similar to tables in SQL, but can run into the same memory and performance issues as dictionaries.

import pandas as pd

genes = pd.DataFrame({
    'gene_id': [1, 2],
    'contig_id': [1, 1]
})

contigs = pd.DataFrame({
    'contig_id': [1],
    'length': [1000]
})

result = pd.merge(genes, contigs, on='contig_id')
print(result)

Types of Joins

There are several types of joins, but the two most common are:

Inner Join
Left Join

Inner Join

An inner join is the most common type of join. It only returns rows where there is a match in both tables. In the example above, this would mean that only genes that are on a contig would be returned.

Left Join

A left join returns all rows from the left table, and the matched rows from the right table. If there is no match, the result is NULL. In the example above, this would mean that all genes would be returned, even if they are not on a contig.

SELECT
    genes.gene_id,
    genes.contig_id,
    contigs.length
FROM
    genes
LEFT JOIN
    contigs
ON
    genes.contig_id = contigs.contig_id

gene_id	contig_id	length
1	1	1000
2	1	1000
3	2	NULL

pLM Example with BioBear

Joining CDSs with Amino Acids and Metadata for the purposes of training a pLM.

End goals:

A table with the AA sequence and DNA sequence for each CDS
Additional metadata for each CDS, our use case: group k-fold cross-validation by contig

# | echo: false
# | output: false

import polars as pl

Start a New Session and Load Data

Load our GFF.

import biobear as bb

session = bb.new_session()

session.execute(
    """
CREATE EXTERNAL TABLE gff
STORED AS gff
LOCATION './12803.assembled.gff';
"""
)

session.sql(
    """
SELECT seqname, source, start, "end", strand, phase, attributes.locus_tag[1] locus_tag FROM gff WHERE type = 'CDS' LIMIT 5
"""
).to_polars().glimpse()

Load our contigs. This part is somewhat tricky as we have the contig sequence, which contains the CDSs. So we'll need to join and extract the CDS sequences in a minute.

session.execute(
    """
CREATE EXTERNAL TABLE contigs
STORED AS fasta OPTIONS (file_extension 'fna')
LOCATION './12803.assembled.fna';
"""
)

session.sql(
    """
SELECT id, sequence FROM contigs LIMIT 5
"""
).to_polars().glimpse()

Load our protein sequences.

session.execute(
    """
CREATE EXTERNAL TABLE protein
STORED AS fasta OPTIONS (file_extension 'faa')
LOCATION './12803.assembled.faa';
"""
)

session.sql(
    """
SELECT id, sequence FROM protein LIMIT 5
"""
).to_polars().glimpse()

Data is loaded, now we can join and extract the CDS sequences.

df = session.sql(
    """
SELECT seqname as contig,
    CASE
        WHEN strand = '1' THEN substr(contigs.sequence, start, "end" - start + 1)
        ELSE reverse_complement(substr(contigs.sequence, start, "end" - start + 1))
    END AS sequence,
    protein.sequence aa_sequence,
    gff.start,
    gff."end",
    gff.strand,
    "end" - start + 1 AS length
FROM gff
    JOIN contigs
        ON gff.seqname = contigs.id
    JOIN protein
        ON gff.attributes.locus_tag[1] = protein.id
"""
).to_polars()

df.head()

Now we can use sci-kit learn to do a group k-fold cross-validation using the contig as the group.

from sklearn.model_selection import GroupKFold

X = df["aa_sequence"]
Y = df["sequence"]
contigs = df["contig"]

group_kfold = GroupKFold(n_splits=5)

for train_index, test_index in group_kfold.split(X, Y, groups=contigs):
    print("TRAIN:", train_index)
    print("TEST:", test_index)

    X_train, X_test = X[train_index], X[test_index]

    # Do something with the data

    break

Exon CLI + Nextflow Example

January 22, 2024 · 3 min read

Trent Hauck

Developer

We've begun publishing the Exon CLI to the AWS Public ECR. This means you can pull and run the Exon CLI using Docker, and use it interactively or in scripts and workflows, like Nextflow which we'll see in a second.

For example, say I was in a directory with a single FASTA file, test.fasta. I could "drop" into the Exon CLI using the following command:

docker run -it -v $(pwd):/data where-true-tech/exon-cli:latest

Which would start the Exon CLI and drop me into a shell. And from there, I could query the data.

❯ SELECT * FROM fasta_scan('/data/test.fasta');
# +----+--------------+----------+
# | id | description  | sequence |
# +----+--------------+----------+
# | a  | description  | ATCG     |
# | b  | description2 | ATCG     |
# +----+--------------+----------+
# 2 rows in set. Query took 0.026 seconds.

While we plan on releasing native binaries for the Exon CLI, with Docker, you can easily use the CLI in your workflows and scripts.

Nextflow Example

Nextflow is one of the most popular workflow managers for bioinformatics. It's a great tool for building reproducible workflows, and it's easy to use the Exon CLI in Nextflow workflows.

For example, if I wanted to copy a FASTA file generated earlier in the workflow, into Parquet for warehouse storage, I could use the -c (command) flag to execute a COPY query in the Exon CLI. This would look something like the following:

As a minimal example, we can round-trip a FASTA file from S3 to Parquet using the Exon CLI and Nextflow. In reality, whatever relevant domain task would generate the FASTA, and then we could use the Exon CLI to copy it into Parquet.

// main.nf
process copyFastaLocally {
    output:
    path 'test.fasta'

    script:
    """
    aws s3 cp s3://wtt-01-dist-prd/tmp/test.fasta .
    """
}

process exportToParquet {
    container 'public.ecr.aws/where-true-tech/exon-cli'

    input:
    path input_fasta

    script:
    """
    exon-cli -c "COPY (SELECT * FROM fasta_scan('$input_fasta')) TO 's3://wtt-01-dist-prd/tmp/test2.parquet' (FORMAT PARQUET)";
    """
}

workflow {
    main:
    copyFastaLocally()
    exportToParquet(copyFastaLocally.out)
}

In order to use Docker and have it access AWS, you'll need to make sure your nextflow.config is configured for them. For this example, the following configuration would be sufficient:

// nextflow.config
docker {
    enabled = true
    envWhitelist = 'AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY'
}

Conclusion

We hope you find the Exon CLI tool useful to support you taking advantage of modern data warehousing and analytics tools in your bioinformatics workflows. If you'd to learn more about what options are available to you, please check out the CLI documentation. Also, if you need more flexibility than the CLI provides, you can always use the BioBear with Python.

If you have any questions or feedback, please reach out to us on at wheretrue.com/contact or on Twitter.

Tabular Example​

What this Looks like in Python​

DataFrames​

Types of Joins​

Inner Join​

Left Join​

pLM Example with BioBear​

Start a New Session and Load Data​

Nextflow Example​

Conclusion​