We are excited to announce the preview release of our latest tool, WTT-02, designed specifically for Cheminformatics users. WTT-02 is the second major tool in the WHERE TRUE Tools suite and comes packed with a range of powerful features to simplify your work.
What is WTT-02?
WTT-02 is a Cheminformatics tool that provides a range of features to help users streamline their workflows. With WTT-02, you can:
- Input and output SDF files with glob and compression support.
- Featurize molecules for machine learning workflows using chemical descriptors, Morgan fingerprints, and other related features.
- Subset datasets by substructure or fingerprint similarity.
- Run ETL on PubChem datasets from within SQL.
For more information, see the documentation.
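To make "glob and compression support" concrete, here is a minimal Python sketch of the idea: expand a glob pattern, open each match as gzip or plain text, and split it into SDF records on the `$$$$` delimiter. This is not how WTT-02 is implemented. The function name `iter_sdf_records` is hypothetical, and the sketch relies only on the standard SDF convention that the first line of each record holds the molecule name.

```python
import glob
import gzip
from typing import Iterator


def iter_sdf_records(pattern: str) -> Iterator[tuple[str, str]]:
    """Yield (name, record_text) for every molecule in files matching pattern.

    Hypothetical helper for illustration; not part of WTT-02.
    """
    for path in sorted(glob.glob(pattern)):
        # Choose the opener from the extension: gzip for .gz, plain text otherwise.
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", encoding="utf-8") as handle:
            lines: list[str] = []
            for raw in handle:
                line = raw.rstrip("\r\n")
                if line.strip() == "$$$$":
                    # "$$$$" closes a record; its first line is the molecule name.
                    if lines:
                        yield lines[0].strip(), "\n".join(lines)
                    lines = []
                else:
                    lines.append(line)
```

Calling `iter_sdf_records("data/*.sdf*")` would then walk both `.sdf` and `.sdf.gz` files in one pass. A trailing record that lacks its closing `$$$$` line is skipped in this sketch.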
A Minimal Example
Imagine you have a set of SDF files that you would like to filter by substructure and fingerprint similarity, featurize using Morgan fingerprints and molecular descriptors, and finally write to Parquet for use in a machine learning workflow. With WTT-02, you can perform all of these tasks in a single query.
COPY (
    SELECT _Name AS name, smiles, features.*
    FROM (
        SELECT featurize(smiles) AS features, _Name, smiles
        FROM read_sdf('path/to/*.sdf.gz')  -- assumed reader name and path; check the documentation
        WHERE substructure(smiles, 'c1ccccc1') AND tanimoto_similarity(smiles, 'c1ccccc1') > 0.7
    )
) TO 's3://my-bucket/my-file.parquet' (FORMAT PARQUET);
And with that you have a featurized dataset ready for machine learning or your data warehouse.
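For readers new to the similarity filter, here is a small Python sketch of what a Tanimoto threshold such as `> 0.7` means. It treats each fingerprint as the set of its on-bit indices and computes shared bits divided by total distinct bits. The fingerprints below are made-up toy values, not real Morgan fingerprints, and the function is an illustration of the coefficient rather than WTT-02's `tanimoto_similarity` implementation.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto (Jaccard) coefficient between two sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


# Hypothetical on-bit indices for illustration only.
query = {1, 5, 9, 12}
candidates = {
    "mol_a": {1, 5, 9, 12, 20},  # 4 shared bits, 5 distinct bits -> 0.8
    "mol_b": {1, 5, 30, 31},     # 2 shared bits, 6 distinct bits -> about 0.33
}

# Keep only candidates above the same 0.7 threshold used in the query.
hits = [name for name, fp in candidates.items() if tanimoto(query, fp) > 0.7]
print(hits)  # ['mol_a']
```

A higher threshold keeps only close analogues of the query molecule, while a lower one widens the subset at the cost of more dissimilar hits.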
If your data already lives in a Postgres database, see our guide to querying Postgres with Exon-DuckDB. The same idea applies, and you can quickly export data based on substructure or similarity.