Exon Session

While BioBear exposes functions that read various data formats directly, it also provides an Exon engine for working with data in a more SQL-native way.

Creating a Session

Getting started is easy: import BioBear and start a session.

import biobear as bb

session = bb.new_session()

From there, you can start working with your data. For example, you can register an S3 bucket, then create external tables backed by objects in that bucket, and finally copy data from those tables to other locations.

# Create an external table backed by a GFF file (here, an object on S3)
session.execute("""
CREATE EXTERNAL TABLE gene_annotations STORED AS GFF LOCATION 's3://wtt-01-dist-prd/TenflaDSM28944/IMG_Data/Ga0451106_prodigal.gff'
""")

# Copy selected columns from the table to a Parquet file
session.execute("""
COPY (SELECT seqname, start, "end", score FROM gene_annotations)
TO 's3://wtt-01-dist-prd/gene_annotations/sample=Ga0455106/gene_annotations.parquet'
STORED AS PARQUET
""")
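
When several queries need exporting, the COPY pattern above can be wrapped in a small helper. The sketch below is not part of BioBear: `build_copy_statement` and `export_query` are hypothetical names, and the only thing assumed about the session is the `execute` method already used above.

```python
def build_copy_statement(query, destination, file_format="PARQUET"):
    """Wrap a SELECT query in a COPY ... TO ... STORED AS statement."""
    # A single quote in the destination would end the SQL string literal early,
    # so double it (standard SQL escaping).
    safe_destination = destination.replace("'", "''")
    return f"COPY ({query.strip()})\nTO '{safe_destination}'\nSTORED AS {file_format}"


def export_query(session, query, destination, file_format="PARQUET"):
    """Run the COPY statement on any object that has an execute(sql) method."""
    statement = build_copy_statement(query, destination, file_format)
    session.execute(statement)
    return statement


# Usage with the session created earlier (the destination is yours to choose):
# export_query(session, 'SELECT seqname, start, "end", score FROM gene_annotations', destination)
```

Returning the statement makes it easy to log exactly what was sent to the engine.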

You can also query that table directly:

result = session.sql("""
SELECT * FROM gene_annotations WHERE score > 50
""")

df = result.to_polars()
df.head()
# shape: (5, 9)
# ┌──────────────┬─────────────────┬──────┬───────┬───┬────────────┬────────┬───────┬───────────────────────────────────┐
# │ seqname      ┆ source          ┆ type ┆ start ┆ … ┆ score      ┆ strand ┆ phase ┆ attributes                        │
# │ ---          ┆ ---             ┆ ---  ┆ ---   ┆   ┆ ---        ┆ ---    ┆ ---   ┆ ---                               │
# │ str          ┆ str             ┆ str  ┆ i64   ┆   ┆ f32        ┆ str    ┆ str   ┆ list[struct[2]]                   │
# ╞══════════════╪═════════════════╪══════╪═══════╪═══╪════════════╪════════╪═══════╪═══════════════════════════════════╡
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 2     ┆ … ┆ 54.5       ┆ -      ┆ 0     ┆ [{"ID",["Ga0451106_01_2_238"]}, … │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 228   ┆ … ┆ 114.0      ┆ -      ┆ 0     ┆ [{"ID",["Ga0451106_01_228_941"]}… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 1097  ┆ … ┆ 224.399994 ┆ +      ┆ 0     ┆ [{"ID",["Ga0451106_01_1097_2257"… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 2261  ┆ … ┆ 237.699997 ┆ +      ┆ 0     ┆ [{"ID",["Ga0451106_01_2261_3787"… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 3784  ┆ … ┆ 114.400002 ┆ +      ┆ 0     ┆ [{"ID",["Ga0451106_01_3784_4548"… │
# └──────────────┴─────────────────┴──────┴───────┴───┴────────────┴────────┴───────┴───────────────────────────────────┘

Reading Files

The session can also read some file types directly into an ExecutionResult. This is useful for reading files with options through a more Pythonic API.

For example with a FASTQ file:

# Read a single file and convert the result to a Polars DataFrame
df = session.read_fastq_file('./path/to/file.fastq').to_polars()

# Read a gzip-compressed file that uses the .fq extension instead of .fastq
from biobear import FASTQReadOptions, FileCompressionType
options = FASTQReadOptions(file_extension='fq', file_compression_type=FileCompressionType.GZIP)
df = session.read_fastq_file('./path/to/file.fq.gz', options=options).to_polars()
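
In the example above, `file_extension` is the format extension without the leading dot and without the `.gz` suffix. As a rough sketch, those two pieces can be derived from a path with the standard library; `fastq_option_hints` is a hypothetical helper, not a BioBear function, and it only recognises gzip.

```python
from pathlib import Path


def fastq_option_hints(path):
    """Return the format extension and whether the file name ends in .gz."""
    # Path.suffixes gives every dotted suffix, e.g. ['.fq', '.gz']
    suffixes = [s.lstrip(".").lower() for s in Path(path).suffixes]
    gzipped = bool(suffixes) and suffixes[-1] == "gz"
    if gzipped:
        suffixes = suffixes[:-1]
    if not suffixes:
        raise ValueError(f"cannot infer a file extension from {path!r}")
    return {"file_extension": suffixes[-1], "gzipped": gzipped}


# Usage with the options class shown above:
# hints = fastq_option_hints('./path/to/file.fq.gz')
# if hints["gzipped"]:
#     options = FASTQReadOptions(file_extension=hints["file_extension"],
#                                file_compression_type=FileCompressionType.GZIP)
```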

A similar API is available for reading many other file types:

  • read_fasta_file
  • read_vcf_file
  • read_bcf_file
  • read_sam_file
  • read_bam_file
  • read_bed_file
  • read_bigwig_file
  • read_gff_file
  • read_gtf_file
  • read_mzml_file
  • read_genbank_file
  • read_cram_file
  • read_fcs_file

These generally behave as full file scans, but some file types support more advanced features such as region filtering.
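
Because the readers share a naming pattern, a script that handles mixed inputs can choose one by file extension. The following is a minimal sketch rather than a BioBear feature: the `READERS` mapping and both helpers are hypothetical, the method names are taken from the list above, and only uncompressed files with the listed extensions are handled.

```python
from pathlib import Path

# Hypothetical extension-to-method mapping; extend it as needed.
READERS = {
    ".fasta": "read_fasta_file",
    ".fastq": "read_fastq_file",
    ".vcf": "read_vcf_file",
    ".bcf": "read_bcf_file",
    ".sam": "read_sam_file",
    ".bam": "read_bam_file",
    ".bed": "read_bed_file",
    ".gff": "read_gff_file",
    ".gtf": "read_gtf_file",
    ".mzml": "read_mzml_file",
    ".cram": "read_cram_file",
    ".fcs": "read_fcs_file",
}


def reader_name_for(path):
    """Return the name of the session method for this file's extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"no reader registered for extension {suffix!r}")
    return READERS[suffix]


def read_any(session, path, **kwargs):
    """Look up the reader on the session and call it with the path."""
    return getattr(session, reader_name_for(path))(path, **kwargs)


# Usage with the session created earlier:
# df = read_any(session, './path/to/file.fastq').to_polars()
```

Compressed inputs such as `.vcf.gz` are rejected here on purpose, since they need read options like the ones in the FASTQ example.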

Indexed VCF Example

For example, indexed VCF files can be read with a region filter:

import biobear as bb

session = bb.new_session()

options = bb.VCFReadOptions(region='chr1:1-1000', file_compression_type=bb.FileCompressionType.GZIP)

df = session.read_vcf_file('./path/to/file.vcf.gz', options=options).to_polars()
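
The region is passed as a `name:start-end` string, so a small builder helps avoid malformed values when regions come from variables. This is a sketch under stated assumptions: `format_region` is a hypothetical helper, and the checks assume 1-based positions with start no greater than end, which is the usual convention for such strings but is not stated on this page.

```python
def format_region(name, start, end):
    """Build a 'name:start-end' region string such as 'chr1:1-1000'."""
    if not name or ":" in name:
        raise ValueError(f"invalid sequence name: {name!r}")
    # Assumption: positions are 1-based and the interval is not reversed.
    if start < 1 or end < start:
        raise ValueError(f"invalid interval: {start}-{end}")
    return f"{name}:{start}-{end}"


# Usage with the options class shown above:
# options = bb.VCFReadOptions(region=format_region('chr1', 1, 1000),
#                             file_compression_type=bb.FileCompressionType.GZIP)
```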