Nextflow

Nextflow is a widely used workflow management tool in the bioinformatics community. You can use Exon as a step in a larger workflow to quickly catalog or filter your data.

The general idea, as the following examples show, is to run a process that generates some data and then use Exon to export that data into your data lake.

Basic Usage

This simple example "round-trips" a FASTA file: one process copies it from S3 to your local machine, and a second process uses Exon to write it back to S3 as a Parquet file.

// Copy the FASTA file from S3 into the task's working directory.
process copyFastaLocally {
    output:
    path 'test.fasta'

    script:
    """
    aws s3 cp s3://wtt-01-dist-prd/tmp/test.fasta .
    """
}

// Scan the local FASTA file with Exon and write the rows to S3 as Parquet.
process exportToParquet {
    container 'public.ecr.aws/where-true-tech/exon-cli'

    input:
    path input_fasta

    script:
    """
    exon-cli -c "COPY (SELECT * FROM fasta_scan('$input_fasta')) TO 's3://wtt-01-dist-prd/tmp/test2.parquet' (FORMAT PARQUET)";
    """
}

workflow {
    main:
    copyFastaLocally()
    exportToParquet(copyFastaLocally.out)
}

Config

To run the exon-cli container with Docker and give it access to AWS, your nextflow.config needs to enable Docker and pass your AWS credentials through to the container. For this example, the following configuration is sufficient:

// nextflow.config
docker {
    enabled = true
    envWhitelist = 'AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY'
}
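
The whitelist above covers long-lived access keys. As a hedged sketch: if your credentials are temporary (for example, issued by AWS STS or SSO), the standard AWS environment variables also include AWS_SESSION_TOKEN, and the region is commonly set through AWS_REGION or AWS_DEFAULT_REGION. Whether exon-cli honors these variables inside the container is an assumption here, not something this page confirms.

```groovy
// nextflow.config
// Sketch only: assumes the tool in the container reads the standard AWS
// environment variables. Trim the list to what your setup actually needs.
docker {
    enabled = true
    envWhitelist = 'AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY,AWS_SESSION_TOKEN,AWS_REGION,AWS_DEFAULT_REGION'
}
```

envWhitelist only forwards variables that are already set in the environment that launches the task, so export them before running Nextflow.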

Prodigal Example

Here's a more complex example that uses Nextflow to run Prodigal on a genome and then uses Exon to export the results. Prodigal produces three files: amino acids (a FASTA file of protein translations), CDSs (a FASTA file of nucleotide sequences), and genes (a GFF file). Each file is exported to S3 as a Parquet file under a genome=<name> prefix so it can be used later for analysis.

// Run Prodigal and emit its three output files as named channels.
process prodigal {
    input:
    path genome_fasta
    val genome

    output:
    path "${genome}_amino_acids.fasta", emit: amino_acids
    path "${genome}_cds.fasta", emit: cds
    path "${genome}_genes.gff", emit: genes

    script:
    """
    prodigal -i ${genome_fasta} -a ${genome}_amino_acids.fasta -d ${genome}_cds.fasta -o ${genome}_genes.gff -f gff
    """
}

process exportAminoAcidsToParquet {
    container 'public.ecr.aws/where-true-tech/exon-cli:latest'

    input:
    path input_fasta
    val genome

    script:
    """
    exon-cli -c "COPY (SELECT * FROM fasta_scan('$input_fasta')) TO 's3://wtt-01-dist-prd/tmp/amino_acids/genome=${genome}/${genome}.parquet' (FORMAT PARQUET)";
    """
}

process exportCDSToParquet {
    container 'public.ecr.aws/where-true-tech/exon-cli:latest'

    input:
    path input_fasta
    val genome

    script:
    """
    exon-cli -c "COPY (SELECT * FROM fasta_scan('$input_fasta')) TO 's3://wtt-01-dist-prd/tmp/cds/genome=${genome}/${genome}.parquet' (FORMAT PARQUET)";
    """
}

process exportGenesToParquet {
    container 'public.ecr.aws/where-true-tech/exon-cli:latest'

    input:
    path input_gff
    val genome

    script:
    """
    exon-cli -c "COPY (SELECT * FROM gff_scan('$input_gff')) TO 's3://wtt-01-dist-prd/tmp/genes/genome=${genome}/${genome}.parquet' (FORMAT PARQUET)";
    """
}

workflow {
    main:
    prodigal(params.genome_fasta, params.genome)
    exportAminoAcidsToParquet(prodigal.out.amino_acids, params.genome)
    exportCDSToParquet(prodigal.out.cds, params.genome)
    exportGenesToParquet(prodigal.out.genes, params.genome)
}
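
The three export processes differ only in the scan function and the destination prefix. One possible way to collapse them, sketched here on the assumption that the prodigal process above is defined in the same script, is a single process that receives the table name and scan function as values. The names exportTableToParquet, table, scan_fn, and exports are hypothetical and are not part of Exon or the original workflow.

```groovy
// Sketch: one generic export process in place of the three above.
// Assumes the `prodigal` process from this example is defined in the same script.
process exportTableToParquet {
    container 'public.ecr.aws/where-true-tech/exon-cli:latest'

    input:
    tuple val(table), val(scan_fn), path(input_file)
    val genome

    script:
    """
    exon-cli -c "COPY (SELECT * FROM ${scan_fn}('${input_file}')) TO 's3://wtt-01-dist-prd/tmp/${table}/genome=${genome}/${genome}.parquet' (FORMAT PARQUET)"
    """
}

workflow {
    main:
    prodigal(params.genome_fasta, params.genome)

    // Tag each Prodigal output with its destination table and scan function,
    // then merge the three channels into one.
    exports = prodigal.out.amino_acids.map { f -> ['amino_acids', 'fasta_scan', f] }
        .mix(
            prodigal.out.cds.map { f -> ['cds', 'fasta_scan', f] },
            prodigal.out.genes.map { f -> ['genes', 'gff_scan', f] }
        )

    exportTableToParquet(exports, params.genome)
}
```

This trades three explicit processes for one process run once per output file. The explicit version may still be easier to read if the exports later diverge.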

Prodigal Workflow Config

As in the previous example, your nextflow.config needs to enable Docker and pass your AWS credentials through to the container. For this example, the following configuration is sufficient:

// nextflow.config
docker {
    enabled = true
    envWhitelist = 'AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY'
}
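
The Prodigal workflow reads params.genome_fasta and params.genome, which the configuration above does not set. As a minimal sketch, they could be given defaults in a params block in the same nextflow.config. The path and genome name below are hypothetical placeholders. Note that Nextflow expects a string passed to a path input to be an absolute path or a supported URI.

```groovy
// nextflow.config (sketch; values are placeholders, not from the original example)
params {
    // Hypothetical absolute path to the input genome FASTA file.
    genome_fasta = '/data/genomes/example_genome.fasta'
    // Hypothetical label used in the output file names and the genome=<name> S3 prefix.
    genome = 'example_genome'
}
```

Either value can be overridden at launch with Nextflow's double-dash parameter syntax, for example --genome another_genome.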