Exon Session

BioBear exposes functions that read various data formats directly. It also makes an Exon engine available so you can work with data in a more SQL-native way.

Creating a Session

Getting started is easy: import BioBear and start a session.

import biobear as bb

session = bb.new_session()

From there, you can start working with your data. For example, you can register an S3 bucket, then create external tables backed by objects in that bucket, and finally copy data from those tables to other locations.

# Create an external table backed by a GFF file on S3
session.execute("""
CREATE EXTERNAL TABLE gene_annotations STORED AS GFF LOCATION 's3://wtt-01-dist-prd/TenflaDSM28944/IMG_Data/Ga0451106_prodigal.gff'
""")

session.execute("""
COPY (SELECT seqname, start, "end", score FROM gene_annotations)
TO 's3://wtt-01-dist-prd/gene_annotations/sample=Ga0455106/gene_annotations.parquet'
STORED AS PARQUET
""")

You can also query that table directly:

result = session.sql("""
    SELECT * FROM gene_annotations WHERE score > 50
""")

df = result.to_polars()
df.head()
# shape: (5, 9)
# ┌──────────────┬─────────────────┬──────┬───────┬───┬────────────┬────────┬───────┬───────────────────────────────────┐
# │ seqname      ┆ source          ┆ type ┆ start ┆ … ┆ score      ┆ strand ┆ phase ┆ attributes                        │
# │ ---          ┆ ---             ┆ ---  ┆ ---   ┆   ┆ ---        ┆ ---    ┆ ---   ┆ ---                               │
# │ str          ┆ str             ┆ str  ┆ i64   ┆   ┆ f32        ┆ str    ┆ str   ┆ list[struct[2]]                   │
# ╞══════════════╪═════════════════╪══════╪═══════╪═══╪════════════╪════════╪═══════╪═══════════════════════════════════╡
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 2     ┆ … ┆ 54.5       ┆ -      ┆ 0     ┆ [{"ID",["Ga0451106_01_2_238"]}, … │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 228   ┆ … ┆ 114.0      ┆ -      ┆ 0     ┆ [{"ID",["Ga0451106_01_228_941"]}… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 1097  ┆ … ┆ 224.399994 ┆ +      ┆ 0     ┆ [{"ID",["Ga0451106_01_1097_2257"… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 2261  ┆ … ┆ 237.699997 ┆ +      ┆ 0     ┆ [{"ID",["Ga0451106_01_2261_3787"… │
# │ Ga0451106_01 ┆ Prodigal v2.6.3 ┆ CDS  ┆ 3784  ┆ … ┆ 114.400002 ┆ +      ┆ 0     ┆ [{"ID",["Ga0451106_01_3784_4548"… │
# └──────────────┴─────────────────┴──────┴───────┴───┴────────────┴────────┴───────┴───────────────────────────────────┘
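
When the same CREATE EXTERNAL TABLE and COPY statements have to be issued for several files, it can help to build the SQL strings in one place. The following is a minimal sketch rather than part of BioBear: the helper names `create_external_table_sql` and `copy_to_parquet_sql` are hypothetical, and the only BioBear call assumed is `session.execute`, as used above.

```python
def create_external_table_sql(table: str, file_format: str, location: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement (hypothetical helper, not a BioBear API)."""
    if "'" in location:
        raise ValueError("location must not contain single quotes")
    return (
        f"CREATE EXTERNAL TABLE {table} "
        f"STORED AS {file_format} LOCATION '{location}'"
    )


def copy_to_parquet_sql(query: str, destination: str) -> str:
    """Build a COPY ... TO ... STORED AS PARQUET statement (hypothetical helper)."""
    if "'" in destination:
        raise ValueError("destination must not contain single quotes")
    return f"COPY ({query}) TO '{destination}' STORED AS PARQUET"


create_stmt = create_external_table_sql(
    "gene_annotations",
    "GFF",
    "s3://wtt-01-dist-prd/TenflaDSM28944/IMG_Data/Ga0451106_prodigal.gff",
)
copy_stmt = copy_to_parquet_sql(
    'SELECT seqname, start, "end", score FROM gene_annotations',
    "s3://wtt-01-dist-prd/gene_annotations/sample=Ga0455106/gene_annotations.parquet",
)

print(create_stmt)
print(copy_stmt)

# With a live session, the statements would be passed to session.execute:
# session.execute(create_stmt)
# session.execute(copy_stmt)
```

The helpers reject single quotes in paths instead of escaping them, which keeps the sketch short; table names and queries are inserted as-is and should come from trusted input.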

Reading Files

The session can also read some file types directly into an ExecutionResult. This is useful when you want to pass read options and work with a more Pythonic API.

For example with a FASTQ file:

# Read a single file and convert the result to a Polars DataFrame
df = session.read_fastq_file('./path/to/file.fastq').to_polars()

# Read a gzip-compressed file that uses the .fq extension instead of .fastq
from biobear import FASTQReadOptions, FileCompressionType
options = FASTQReadOptions(file_extension='fq', file_compression_type=FileCompressionType.GZIP)
df = session.read_fastq_file('./path/to/file.fq.gz', options=options).to_polars()
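
The extension and compression settings above can often be read off the file name. The sketch below shows one way to do that with the standard library only. The function `infer_fastq_options` is hypothetical and not part of BioBear; it returns plain strings, and only gzip is recognized because that is the only compression type shown in this section.

```python
from pathlib import Path


def infer_fastq_options(path: str) -> dict:
    """Guess the extension and compression from a file name (hypothetical helper)."""
    # Path.suffixes yields e.g. ['.fq', '.gz'] for 'file.fq.gz'
    suffixes = [s.lower().lstrip(".") for s in Path(path).suffixes]

    compression = None
    if suffixes and suffixes[-1] == "gz":
        compression = "GZIP"
        suffixes = suffixes[:-1]

    if not suffixes:
        raise ValueError(f"cannot infer a file extension from {path!r}")

    return {"file_extension": suffixes[-1], "file_compression_type": compression}


guess = infer_fastq_options("./path/to/file.fq.gz")
print(guess)

# The guess could then be mapped onto the options object used above, for example:
# options = FASTQReadOptions(
#     file_extension=guess["file_extension"],
#     file_compression_type=FileCompressionType.GZIP,
# )
```

Inferring settings from the name is a convenience, not a guarantee; a file named `.gz` that is not actually gzip-compressed would still need to be handled by hand.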

A similar API is available for reading many other file types:

read_fasta_file
read_vcf_file
read_bcf_file
read_sam_file
read_bam_file
read_bed_file
read_bigwig_file
read_gff_file
read_gtf_file
read_mzml_file
read_genbank_file
read_cram_file
read_fcs_file

These generally behave as a full file scan, but some file types support more advanced features such as region filtering.

Indexed VCF Example

For example, indexed VCF files can be read with a region filter:

import biobear as bb

options = bb.VCFReadOptions(region='chr1:1-1000', file_compression_type=bb.FileCompressionType.GZIP)

df = session.read_vcf_file('./path/to/file.vcf.gz', options=options).to_polars()
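
Because the region is passed as a plain string, it can be worth validating it before building the options. The following sketch assumes the `name:start-end` form used in the example above; whether BioBear accepts other forms is not covered here, and `parse_region` is a hypothetical helper, not a BioBear function.

```python
import re

# Matches the 'name:start-end' form, e.g. 'chr1:1-1000'.
# Sequence names containing ':' or whitespace are not handled by this sketch.
_REGION_PATTERN = re.compile(r"(?P<name>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)")


def parse_region(region: str) -> tuple:
    """Split a region string into (name, start, end) or raise ValueError."""
    match = _REGION_PATTERN.fullmatch(region)
    if match is None:
        raise ValueError(f"region {region!r} is not of the form name:start-end")

    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"region {region!r} has start greater than end")

    return match["name"], start, end


name, start, end = parse_region("chr1:1-1000")
print(name, start, end)

# After validation, the original string can be passed on unchanged:
# options = bb.VCFReadOptions(region="chr1:1-1000", ...)
```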

BED File with N Fields

By default, BED files are assumed to have 12 fields. If your file has a different number and you read it without supplying any options, you'll get a DataFrame with 12 columns, and the missing columns will be filled with nulls. To avoid this, specify the number of fields in the file with the n_fields option:

import biobear as bb

options = bb.BEDReadOptions(n_fields=3)

df = session.read_bed_file('./path/to/file.bed', options=options).to_polars()
┌─────────────────────────┬───────┬───────┐
│ reference_sequence_name ┆ start ┆ end   │
│ ---                     ┆ ---   ┆ ---   │
│ str                     ┆ i64   ┆ i64   │
╞═════════════════════════╪═══════╪═══════╡
│ chr1                    ┆ 11874 ┆ 12227 │
│ chr1                    ┆ 12613 ┆ 12721 │
│ chr1                    ┆ 13221 ┆ 14409 │
│ chr1                    ┆ 14362 ┆ 14829 │
│ chr1                    ┆ 14970 ┆ 15038 │
│ chr1                    ┆ 15796 ┆ 15947 │
│ chr1                    ┆ 16607 ┆ 16765 │
│ chr1                    ┆ 16858 ┆ 17055 │
│ chr1                    ┆ 17233 ┆ 17368 │
│ chr1                    ┆ 17606 ┆ 17742 │
└─────────────────────────┴───────┴───────┘
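
If the number of fields is not known in advance, it can be counted from the first data line before the options are created. This is a minimal sketch that assumes a tab-delimited file and skips blank lines, `#` comments, and `track`/`browser` header lines; `count_bed_fields` is a hypothetical helper, not part of BioBear.

```python
def count_bed_fields(lines) -> int:
    """Return the number of tab-separated fields in the first data line."""
    for line in lines:
        stripped = line.strip()
        # Skip blank lines, comments, and UCSC-style header lines
        if not stripped or stripped.startswith(("#", "track", "browser")):
            continue
        return len(stripped.split("\t"))
    raise ValueError("no data lines found")


sample_lines = [
    "track name=example\n",
    "chr1\t11874\t12227\n",
    "chr1\t12613\t12721\n",
]
n_fields = count_bed_fields(sample_lines)
print(n_fields)

# With a real file, the count could feed the option shown above:
# with open("./path/to/file.bed") as handle:
#     options = bb.BEDReadOptions(n_fields=count_bed_fields(handle))
```

Only the first data line is inspected, so a file with an inconsistent number of fields per line would not be detected by this check.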
