Raphael Thys
Digital Transformation and Digital Products at the age of AI
Helping Teams Shape the Future of Experiences
5 million synthetic drug models could revolutionize pharma pipelines

NVIDIA-backed AI firm drops 5M drug maps to fast-track breakthrough therapies

Researcher studies molecular models on a digital screen in a lab (representational image). Credit: SandboxAQ official website

SandboxAQ, an AI startup spun out of Google and backed by NVIDIA, has released a massive new dataset it hopes will revolutionize early-stage drug discovery.

On Wednesday, the company unveiled the Structurally Augmented IC50 Repository (SAIR), a trove of over 5.2 million computationally generated protein-drug molecule co-structures, each tagged with real-world potency data.
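For readers unfamiliar with the potency label in the dataset's name: IC50 is the concentration of a compound needed to inhibit a target's activity by half, so a lower value means a more potent compound. Potency is often reported on a log scale as pIC50, the negative base-10 logarithm of the IC50 in molar units. The short sketch below shows that conversion. The function name and the choice of nanomolar input are illustrative assumptions, not details taken from the SAIR release.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar (assumed unit) to pIC50.

    pIC50 = -log10(IC50 in molar). Since 1 nM = 1e-9 M,
    this is the same as 9 - log10(IC50 in nM).
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return 9.0 - math.log10(ic50_nm)


# Hypothetical values for illustration only: a lower IC50 gives a higher pIC50.
strong_binder = ic50_nm_to_pic50(10.0)     # 10 nM
weak_binder = ic50_nm_to_pic50(10_000.0)   # 10 micromolar
print(strong_binder, weak_binder)
```

A higher pIC50 therefore indicates a more potent compound, which is why the log scale is convenient as a training target.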

The aim is to make it easier and faster for researchers to determine whether a potential drug will bind effectively to its target protein.

That’s a crucial question scientists must answer before advancing a drug candidate into further testing.

Targeting the bind between drugs and proteins

SandboxAQ’s dataset is designed to support models that predict whether a small molecule will stick to a specific protein. That interaction determines if a drug will inhibit or modify a biological process, such as halting the spread of disease.

Traditionally, researchers use experimental methods to study these structures. The process is costly and time-consuming.

It starts with obtaining a 3D structure of a target protein and then testing thousands of molecules for how they bind. Predicting both the pose and potency of the molecule requires repeated computation and refinement.

Synthetic molecules, real-world accuracy

“This is a long-standing problem in biology that we’ve all, as an industry, been trying to solve for,” Nadia Harhen, general manager of AI simulation at SandboxAQ, told Reuters.

“All of these computationally generated structures are tagged to a ground-truth experimental data, and so when you pick this data set and you train models, you can actually use the synthetic data in a way that’s never been done before.”

To bypass the data bottleneck, SandboxAQ used NVIDIA chips to generate synthetic structures. These are not observed in labs but calculated from real experimental data using the Boltz-1x co-folding model.

For each protein-drug pair from public datasets like ChEMBL and BindingDB, the team created five different 3D poses. They then cross-referenced these predictions with experimental potency values to retain only the most accurate ones. The final SAIR dataset includes those high-confidence entries.
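The article does not spell out how the filtering was done, so the following is only a rough sketch of the general idea: several candidate poses per protein-drug pair, of which only the higher-confidence ones are retained. The `Pose` record, its field names, the `confidence` score, and the `0.7` threshold are all hypothetical and are not SandboxAQ's actual schema or criteria.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    pair_id: str       # hypothetical ID for one protein-drug pair
    pose_index: int    # which of the generated poses this is
    confidence: float  # hypothetical model confidence in [0, 1]


def keep_confident_poses(poses, min_confidence=0.7):
    """Group poses by pair and keep those at or above the threshold.

    Returns a dict mapping pair_id to its retained poses, sorted from
    most to least confident. Pairs with no retained pose are omitted.
    """
    kept = defaultdict(list)
    for pose in poses:
        if pose.confidence >= min_confidence:
            kept[pose.pair_id].append(pose)
    return {
        pair_id: sorted(group, key=lambda p: p.confidence, reverse=True)
        for pair_id, group in kept.items()
    }


# Made-up example data: two pairs, with a few poses shown for brevity.
example_poses = [
    Pose("pairA", 0, 0.91),
    Pose("pairA", 1, 0.55),
    Pose("pairA", 2, 0.78),
    Pose("pairB", 0, 0.40),
    Pose("pairB", 1, 0.62),
]
filtered = keep_confident_poses(example_poses)
for pair_id, group in filtered.items():
    print(pair_id, [p.pose_index for p in group])
```

In this toy run, the second pair drops out entirely because none of its poses clears the assumed threshold. A real pipeline would also weigh the experimental potency labels rather than a single confidence score.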

Examples of 3D co-folded protein-drug complexes found in the SAIR release. Credit – SandboxAQ

Boosting AI model training with open data

AI models like AlphaFold2 and newer systems such as AlphaFold3 and Boltz-2 have made major progress in predicting 3D structures and binding poses. But they still struggle when dealing with unfamiliar proteins or molecules outside their training data.

One remedy is more training data. However, generating new structural data experimentally is expensive, which is the very problem AI is meant to address.

And while pharma companies hold private datasets, they rarely share them publicly.

By generating synthetic structural data from widely available potency records, SAIR offers a workaround.

Researchers can now use this resource to train models that not only predict structure but also potency, without access to proprietary databases.

From data to drug candidates, virtually

SandboxAQ will make the SAIR dataset freely available to researchers. At the same time, it plans to charge for access to its proprietary AI models trained on this data.

These tools aim to rival lab-based experiments, predicting protein binding quickly, virtually, and with real-world accuracy.

Addition date
Jun 25, 2025 12:40 PM
mTags
Artificial intelligence, Health, Materials, Genetics, Synthetic biology
Added by
Untitled
Link
https://interestingengineering.com/innovation/5-million-ai-drug-structures-sandboxaq
Horizon and history date
2025
LAN
EN