Delta Lake Integration

A goal of BioBear, and WTT writ large, is to bring bioinformatics data handling into the Data Lake era. "Data silos" is a bit trite, but it's true that bioinformatics data is often stored in a variety of formats and locations, which makes it difficult to combine metadata from different sources with the underlying experimental data for analysis.

Modern Data Lake technologies can benefit bioinformatics by making it easier to store, query, and analyze large amounts of data. Moreover, FAIR data principles are easier to achieve with Data Lakes: data versioning, for example, is a core feature, which makes it easier to track data provenance.

We'll look at how BioBear can be used to integrate with Delta Lake, a popular open-source Data Lake technology. We'll start with a simple case of loading a FASTA file into a Delta Lake table and reading it back with Pandas, then show how to use Delta Lake's time travel feature to compare two versions of the table. As an aside, see our SQL Reference for more information on how to load other file formats with BioBear.

Delta Lake Example

Using BioBear with Delta Lake is straightforward. First, use BioBear to generate an Arrow RecordBatch Reader from your data. Then use the Delta Lake Python package to write those batches to a Delta Lake table.

import biobear as bb  # pip install biobear
from deltalake import write_deltalake

# Create an Arrow RecordBatch Reader from a FASTA file with BioBear
session = bb.connect()

# Will need credentials to access S3 for your bucket, e.g. AWS_PROFILE env var
session.register_object_store_from_url("s3://wtt-01-dist-prd")

# Create an external table from the FASTA file
session.execute("""
CREATE EXTERNAL TABLE fasta
STORED AS FASTA
LOCATION 's3://wtt-01-dist-prd/tmp/500.fasta'
""")

# Convert the table to a stream of Arrow RecordBatches
arrow_batches = session.sql("SELECT * FROM fasta").to_arrow_record_batch_reader()

# Write the RecordBatches to a Delta Lake table
table_prefix = "s3://wtt-01-dist-prd/tmp/delta"
write_deltalake(
    table_prefix,
    arrow_batches,
    storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": "true"},
    # Could include other relevant options here, like partitions
)

If that went well, there should now be a Delta Lake table at s3://wtt-01-dist-prd/tmp/delta. You can read it back in with the Delta Lake Python package.

$ aws s3 ls s3://wtt-01-dist-prd/tmp/delta/
PRE _delta_log/
2023-12-06 15:35:31 8509564 0-a81714e7-6852-45eb-a6f2-7f0ca5749a07-0.parquet

And then back in Python:

from deltalake import DeltaTable

# Read the Delta Lake table
table = DeltaTable("s3://wtt-01-dist-prd/tmp/delta")
print(table.to_pandas())
# id description sequence
# 0 UniRef50_A0A5A9P0L4 peptidylprolyl isomerase n=1 Tax=Triplophysa t... MEEITQIKKRLSQTVRLEGKEDLLSKKDSITNLKTEEHVSVKKMVI...
# 1 UniRef50_A0A410P257 Glycogen synthase n=2 Tax=Candidatus Velamenic... MKAIAWLIVLTFLPEQVAWAVDYNLRGALHGAVAPLVSAATVATDG...
# 2 UniRef50_A0A8J3NBY6 Gln_amidase domain-containing protein n=2 Tax=... MEILGRNLPRILGNLVKTIKTAPVRVVARRGARTLTQKEFGKYLGS...
# 3 UniRef50_Q8WZ42 Titin n=3053 Tax=cellular organisms TaxID=1315... MTTQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVI...
# 4 UniRef50_A0A401TRQ8 Ig-like domain-containing protein (Fragment) n... PPSFIHKPDPQEVLPGSNVKFTSVVTGTAPLKVSWFKGTTELVAGR...
# .. ... ... ...
# 495 UniRef50_A0A263CXF5 PE-PGRS family protein n=1 Tax=Amycolatopsis a... MDIEFDPSFDWLWILAAGSKAPRVSVKGARAGTETWNEISSTLDQM...
# 496 UniRef50_S8GMU2 HECT-domain (Ubiquitin-transferase) domain-con... MGGRLAFLREGGRHCGREREVVLRSFFCAAQTSIFARCRLRGETHA...
# 497 UniRef50_A0A401PG78 Protocadherin beta-16-like n=1 Tax=Scyliorhinu... MRYKLIWLLKWEPLFFFLLSWELISGQIRYSIPEELQLGAFVGNIA...
# 498 UniRef50_A0A158NVR6 Titin n=7 Tax=Attini TaxID=143999 RepID=A0A158... MKRQQFQESQTQQQQQQAIVQEQQQQQQYIQQRTERQVTRQQITSQ...
# 499 UniRef50_UPI001AE3638B LuxR C-terminal-related transcriptional regula... MGYWDIPPELSWLGWVAGSEVSHVDPDDLVKISAILREKADYIRDV...

Checking Version History

Now imagine the case where you have a Delta Lake table that is updated periodically. You might want to compare the current version of the table to a previous version. Delta Lake makes this easy with its time travel feature. For example, say our data pipeline accidentally duplicated the FASTA file. We can use Delta Lake to compare the current version of the table to the previous version.

# Simulate a duplicated FASTA file by unioning the table with itself
arrow_batches = session.sql("""
SELECT * FROM fasta
UNION ALL
SELECT * FROM fasta
""").to_arrow_record_batch_reader()

# Write the RecordBatches to a Delta Lake table
table_prefix = "s3://wtt-01-dist-prd/tmp/delta"
write_deltalake(
    table_prefix,
    arrow_batches,
    storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": "true"},
    mode="overwrite",
)

Now we have two versions of the table. We can use Delta Lake to compare them.

from deltalake import DeltaTable

# Read the Delta Lake table
current_table = DeltaTable("s3://wtt-01-dist-prd/tmp/delta")

# Get the current version of the table
current = current_table.to_pandas()

# Load the table at its previous version (version 0)
previous_table = DeltaTable("s3://wtt-01-dist-prd/tmp/delta", version=0)

# Read the previous version into a DataFrame
previous = previous_table.to_pandas()

# Compare the two versions
print(current.equals(previous))
# False
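
If you want to see which versions exist and when they were written, the deltalake package can also show the table's commit history. Here's a minimal sketch; the exact fields returned for each commit can vary between deltalake versions:

from deltalake import DeltaTable

table = DeltaTable("s3://wtt-01-dist-prd/tmp/delta")

# The current version number of the table (1 after the overwrite above)
print(table.version())

# Commit history, newest first; each entry is a dict with fields such as
# "timestamp" and "operation" (exact fields vary by deltalake version)
for commit in table.history():
    print(commit)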

Loading Metadata

The prior section shows how to load a FASTA file, but domain-specific data often needs to be combined with metadata.

For example, you can load a CSV file of metadata into its own Delta Lake table and then join it with the FASTA data; a sketch of the join follows the example below.

# Create an external table from a CSV
session.execute("""
CREATE EXTERNAL TABLE metadata
STORED AS CSV
LOCATION 's3://wtt-01-dist-prd/tmp/metadata.csv'
""")

# Load the metadata table into a Delta Lake table
arrow_batches = session.sql("SELECT * FROM metadata").to_arrow_record_batch_reader()

# Write the RecordBatches to a Delta Lake table
table_prefix = "s3://wtt-01-dist-prd/tmp/metadata"
write_deltalake(
    table_prefix,
    arrow_batches,
    storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": "true"},
    mode="overwrite",
)
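
With both tables registered in the session, the join itself can be written directly in SQL before writing out a combined Delta Lake table. This is a minimal sketch: the id join key and the sample_name column are hypothetical, so adjust them to your metadata schema, and the output path is just an example.

# Join the FASTA records with the metadata (assumes a hypothetical `id` key
# and `sample_name` column in metadata.csv; adjust to your schema)
joined_batches = session.sql("""
SELECT f.id, f.description, f.sequence, m.sample_name
FROM fasta f
JOIN metadata m ON f.id = m.id
""").to_arrow_record_batch_reader()

# Write the joined result to its own Delta Lake table (example path)
write_deltalake(
    "s3://wtt-01-dist-prd/tmp/joined",
    joined_batches,
    storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": "true"},
    mode="overwrite",
)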

Parquet and JSON files are also supported.
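 The pattern is the same for those formats; only the STORED AS clause and the location change. For example, a Parquet file could be registered like this (the table name and path here are hypothetical):

# Register a Parquet file as an external table (hypothetical name and path)
session.execute("""
CREATE EXTERNAL TABLE samples
STORED AS PARQUET
LOCATION 's3://wtt-01-dist-prd/tmp/samples.parquet'
""")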

Learning More

You can read more about Delta Lake and its Python package, deltalake, in their respective documentation.

If you'd like to learn more about how WHERE TRUE Technologies can help you with your bioinformatics data, please contact us.