Exon CLI + Nextflow Example

Trent Hauck
Developer

We've begun publishing the Exon CLI to the AWS Public ECR. This means you can pull and run the Exon CLI using Docker, and use it interactively or in scripts and workflows such as Nextflow, which we'll look at below.

For example, say I was in a directory with a single FASTA file, test.fasta. I could "drop" into the Exon CLI using the following command:

docker run -it -v "$(pwd)":/data public.ecr.aws/where-true-tech/exon-cli:latest

This starts the Exon CLI and drops me into its shell, where I can query the data:

❯ SELECT * FROM fasta_scan('/data/test.fasta');
# +----+--------------+----------+
# | id | description  | sequence |
# +----+--------------+----------+
# | a  | description  | ATCG     |
# | b  | description2 | ATCG     |
# +----+--------------+----------+
# 2 rows in set. Query took 0.026 seconds.

We plan on releasing native binaries for the Exon CLI, but in the meantime Docker makes it easy to use the CLI in your workflows and scripts.
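For scripts, the same image can run a single query and exit instead of opening the interactive shell. The following is a minimal sketch, not an official recipe: it uses the CLI's -c (command) flag, and it assumes the exon-cli binary is on the image's PATH (the Nextflow process later in this post calls it by that name), so Docker's --entrypoint flag can select it explicitly. The exon_query helper and the DRY_RUN variable are hypothetical names introduced only for illustration.

```shell
#!/usr/bin/env sh
# Sketch: run one query with the Exon CLI image, then exit.
# Assumption: exon-cli is on the image's PATH.

EXON_IMAGE="public.ecr.aws/where-true-tech/exon-cli:latest"

# exon_query is a hypothetical helper, not part of the Exon CLI.
# Usage: exon_query "<SQL>"
# Set DRY_RUN=1 to print the docker command instead of running it.
exon_query() {
  sql="$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command that would be executed.
    printf 'docker run --rm -v %s:/data --entrypoint exon-cli %s -c "%s"\n' \
      "$(pwd)" "$EXON_IMAGE" "$sql"
  else
    # Mount the current directory at /data and pass the query via -c.
    docker run --rm -v "$(pwd)":/data --entrypoint exon-cli "$EXON_IMAGE" -c "$sql"
  fi
}
```

With this helper defined, running exon_query "SELECT * FROM fasta_scan('/data/test.fasta')" from the directory containing test.fasta should issue the same query as the interactive session above.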

Nextflow Example

Nextflow is one of the most popular workflow managers for bioinformatics. It's a great tool for building reproducible workflows, and it's easy to use the Exon CLI in Nextflow workflows.

For example, to copy a FASTA file generated earlier in the workflow into Parquet for warehouse storage, I could use the -c (command) flag to execute a COPY query in the Exon CLI.

As a minimal example, we can round-trip a FASTA file from S3 to Parquet using the Exon CLI and Nextflow. In a real workflow, some domain-specific task would generate the FASTA file, and the Exon CLI would then copy it into Parquet. The workflow looks something like the following:

// main.nf
process copyFastaLocally {
    output:
    path 'test.fasta'

    script:
    """
    aws s3 cp s3://wtt-01-dist-prd/tmp/test.fasta .
    """
}

process exportToParquet {
    container 'public.ecr.aws/where-true-tech/exon-cli'

    input:
    path input_fasta

    script:
    """
    exon-cli -c "COPY (SELECT * FROM fasta_scan('$input_fasta')) TO 's3://wtt-01-dist-prd/tmp/test2.parquet' (FORMAT PARQUET)";
    """
}

workflow {
    main:
    copyFastaLocally()
    exportToParquet(copyFastaLocally.out)
}

To run the process in Docker and give the container access to AWS, your nextflow.config needs to enable Docker and pass your AWS credentials through. For this example, the following configuration would be sufficient:

// nextflow.config
docker {
    enabled = true
    envWhitelist = 'AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY'
}
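Because envWhitelist forwards variables from the environment Nextflow runs in, it can help to confirm they are set before launching the workflow. The following is a small sketch of such a pre-flight check. The require_env helper is a hypothetical name, not part of Nextflow or the AWS CLI, and the check only verifies that each variable is set and non-empty in the current shell; the variables still need to be exported for Nextflow to see them.

```shell
#!/usr/bin/env sh
# Sketch: check that the AWS credential variables are set before running Nextflow.

# require_env is a hypothetical helper, not part of Nextflow or the AWS CLI.
# It returns 0 if every named variable is set and non-empty, 1 otherwise.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (left commented out so the snippet can be sourced safely):
# require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY && nextflow run main.nf
```

Failing early this way is meant to avoid discovering missing credentials only when the S3 copy inside the container fails.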

Conclusion

We hope you find the Exon CLI useful for bringing modern data warehousing and analytics tools into your bioinformatics workflows. If you'd like to learn more about what options are available to you, please check out the CLI documentation. Also, if you need more flexibility than the CLI provides, you can always use BioBear with Python.

If you have any questions or feedback, please reach out to us at wheretrue.com/contact or on Twitter.