Basecamp Research Launches the Trillion Gene Atlas

The landmark initiative, which aims to redefine the data foundations of AI-driven drug discovery, partners with Anthropic, Ultima Genomics, PacBio and NVIDIA to generate and model biological data at an unprecedented scale.

Basecamp Research has announced the launch of the Trillion Gene Atlas, an ambitious global initiative designed to address one of the central bottlenecks in AI-enabled drug discovery: the scarcity and narrow scope of biological training data. Introduced this week at SXSW in Austin and NVIDIA GTC in San Jose, the Atlas aims to expand known evolutionary genetic diversity 100-fold by collecting genomic data from more than 100 million species across thousands of global sites.

The scale of the project places it among the most significant biological data-generation efforts since the Human Genome Project. By combining large-scale sequencing partnerships with next-generation compute infrastructure, Basecamp plans to condense what would traditionally require decades of data collection and analysis into less than two years.

Addressing a Structural Data Bottleneck in AI Drug Development

Although recent advances in model architectures and compute have pushed the boundaries of biological AI, progress is increasingly constrained by the limited diversity of publicly available datasets. Nearly all current sequence-based foundation models are trained on variations of the same underlying repositories, many of which contain fewer than 250 million sequences drawn from a narrow fraction of Earth’s biodiversity.

“The Trillion Gene Atlas expands the genetic landscape available for model training by orders of magnitude,” said Glen Gowers, Co-founder and CEO of Basecamp Research. “This level of diversity unlocks modelling behaviours that simply aren’t possible with today’s datasets.”

The Atlas builds on the company’s EDEN foundation models, released earlier this year, which were trained exclusively on BaseData™, Basecamp’s proprietary genomic database. Already more than ten times larger than all public resources combined, BaseData has given EDEN access to richer evolutionary context than any other existing model has had. Early validation work has demonstrated:

Zero-shot functional activity in primary human T-cells
Programmable gene insertion (aiPGI) enabling targeted gene restoration
AI-designed antimicrobial peptides with a 97% hit rate against priority pathogens
Therapeutic candidate generation directly from a disease prompt

These results revealed new scaling behaviours: as biologically diverse training data increases, model capability accelerates sharply, outpacing improvements achievable through model size alone.

“EDEN showed that biological AI scales differently from language or images,” said Phil Lorenz, CTO of Basecamp Research. “High-quality, context-rich data is the key enabler. The Trillion Gene Atlas pushes this principle to an entirely new level.”

A Global Biodiversity Network Built for Industrial-Scale Discovery

The Atlas is supported by a network of scientific collaborators across 31 countries, developed by Basecamp over the past six years. This infrastructure enables high-quality genomic data collection from remote and biodiverse ecosystems that remain largely absent from public datasets.

The company employs a combination of:

Off-grid sequencing technologies
Specialised regulatory frameworks to ensure compliance with emerging Digital Sequence Information (DSI) requirements
Access and Benefit-Sharing agreements that support local research and capacity building

As part of the launch, Basecamp announced new partnerships in Chile and Argentina and an expanded programme in Antarctica, making its network one of the most geographically comprehensive biodiversity sampling networks in existence.

This model ensures that technological, scientific, and economic benefits flow back to partner regions, while enabling responsible, large-scale genomic discovery.

Industrial Sequencing Meets Accelerated Compute

The feasibility of the Trillion Gene Atlas rests on parallel advances in sequencing throughput and AI-accelerated compute.

Ultima Genomics: Ultra-High-Throughput Short-Read Sequencing

Ultima’s UG200 Series platform provides the scale and economics required for trillion-gene-level initiatives. “Biology has long been constrained by a lack of data at scale,” said Gilad Almogy, Founder and CEO of Ultima Genomics. “Our platform was designed specifically to enable projects of this magnitude.”

PacBio: High-Accuracy Long Reads for Full Genomic Context

PacBio’s HiFi sequencing will contribute high-accuracy long-read data capable of resolving strain-level variation in complex samples. “HiFi data provides a robust foundation for biological AI models to learn from nature with the necessary fidelity,” said Christian Henry, PacBio’s President and CEO.

NVIDIA: Accelerating Petabase-Scale Genetic Processing

On the compute side, NVIDIA’s accelerated infrastructure, including Parabricks and CUDA-X libraries, will sharply reduce genome assembly and annotation timelines: tasks that previously required 20 years of compute are expected to be completed in under two years, enabling continuous model training on the expanding dataset.

Toward an End-to-End AI Therapeutic Design Engine

Anthropic joins the initiative as part of its broader effort to integrate Claude into scientific workflows. By connecting Claude’s reasoning capabilities with EDEN’s generative biology engine and NVIDIA’s accelerated processing, the partners aim to create an integrated system capable of:

Interpreting complex biological and clinical datasets
Identifying therapeutic hypotheses
Designing candidate therapeutics across multiple modalities

The ultimate aim is a seamless, agentic workflow that runs from interpreting disease data to proposing validated molecular interventions.

A New Foundation for AI-Enabled Drug Discovery

The Trillion Gene Atlas sits at the intersection of three large-scale capabilities: global data collection, advanced sequencing, and accelerated model training. Together, these elements form a foundation for biological AI systems that learn directly from the full breadth of life on Earth, rather than the limited genetic subsets available today.

If successful, the initiative has the potential to reshape early drug discovery, making therapeutic design faster, more systematic and more predictable across gene therapy, immunology, infectious disease and beyond.

By expanding the evolutionary information available to AI a further 100-fold, Basecamp Research is positioning the Trillion Gene Atlas as a new reference point for the next decade of programmable biology.

Author

BioFocus Newsroom