Prometheus | AI Generates AI

Prometheus is an autonomous pretraining system in which an AI designs, curates, and generates its own training data -- then trains on it in a fully automated loop. The goal is simple but ambitious: remove the human bottleneck from the entire pretraining pipeline. The target model is GENESIS-100M, a 100-million-parameter language model trained from scratch on AI-curated data.

The guiding strategy is a streaming open corpus with AI acting as Data Architect -- selecting, filtering, and synthesizing the highest-quality training material from a vast pool of raw text, then augmenting it with synthetic data designed to fill identified knowledge gaps.
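As a rough illustration of the streaming selection step, the sketch below filters documents one at a time without loading the corpus into memory. Everything in it is an assumption made for illustration: `score_quality` is a hypothetical heuristic standing in for the AI scorer, and `curate_stream` and the 0.5 threshold are not taken from Prometheus itself.

```python
from typing import Callable, Iterable, Iterator, Tuple


def score_quality(text: str) -> float:
    """Hypothetical stand-in for the AI quality scorer; returns a value in [0, 1]."""
    words = text.split()
    if not words:
        return 0.0
    unique_ratio = len(set(words)) / len(words)  # low for repetitive spam
    length_factor = min(len(words) / 20, 1.0)    # low for very short fragments
    return unique_ratio * length_factor


def curate_stream(
    docs: Iterable[str],
    scorer: Callable[[str], float] = score_quality,
    threshold: float = 0.5,  # assumed cut-off, not a Prometheus setting
) -> Iterator[Tuple[str, float]]:
    """Lazily yield (document, score) pairs that meet the quality threshold."""
    for text in docs:
        score = scorer(text)
        if score >= threshold:
            yield text, score
```

Because `curate_stream` is a generator, it can wrap any iterable source (a file reader, a network stream) and a real model-based scorer could be swapped in through the `scorer` argument.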

Four-Phase Pipeline

PHASE 1	Raw Corpus Ingestion	Mass-scale text ingestion from open sources. Deduplication using MinHash LSH. Language detection and quality pre-filtering.	Status: Complete
PHASE 2	AI-Powered Data Architecture	AI classifies, tags, and scores every document. Quality filtering removes noise, propaganda, and machine-generated spam. Curriculum ordering for optimal learning trajectory.	Status: Active
PHASE 3	Synthetic Data Generation	AI identifies knowledge gaps in the corpus and generates targeted synthetic data to fill them. Quality scoring via multi-model consensus. Adversarial filtering to eliminate hallucinated content.	Status: Pending
PHASE 4	Pretraining Orchestration	Fully automated training loop with checkpointing, evaluation, and convergence monitoring. Curriculum learning with dynamic difficulty adjustment.	Status: Pending
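Phase 1 names MinHash LSH as its deduplication method. The following is a minimal standard-library sketch of that technique, not the Prometheus implementation: the signature length, band count, shingle size, similarity threshold, and all function names are assumptions chosen for illustration.

```python
import hashlib
import re
from collections import defaultdict

NUM_PERM = 64               # signature length (assumed)
BANDS = 16                  # number of LSH bands (assumed)
ROWS = NUM_PERM // BANDS    # signature rows per band


def shingles(text, k=3):
    """Return the set of k-word shingles of the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Seeded 64-bit hash; each seed simulates one random permutation."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text):
    """For each seed, keep the minimum hash over all shingles."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM))


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM


def deduplicate(docs, threshold=0.7):
    """Return indices of documents kept after near-duplicate removal."""
    buckets = defaultdict(list)  # (band index, band values) -> kept indices
    signatures = {}
    kept = []
    for idx, text in enumerate(docs):
        sig = minhash_signature(text)
        keys = [(b, sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]
        # LSH step: only documents sharing at least one band are compared.
        candidates = {c for key in keys for c in buckets.get(key, ())}
        if any(estimated_jaccard(sig, signatures[c]) >= threshold
               for c in candidates):
            continue  # near-duplicate of a document already kept
        signatures[idx] = sig
        kept.append(idx)
        for key in keys:
            buckets[key].append(idx)
    return kept
```

The banding step is what keeps this tractable at scale: each new document is compared only against earlier documents that collide with it in at least one band, rather than against the whole corpus. More bands with fewer rows each catch lower-similarity pairs at the cost of more candidate comparisons.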

Design Philosophy

Traditional pretraining pipelines are human-driven: engineers manually curate datasets, write filtering rules, and babysit training runs. Prometheus inverts this -- the AI itself becomes the data expert. It reads raw text, understands what's valuable, and makes curation decisions autonomously. The human role shifts from operator to auditor: reviewing decisions, not making them.

This approach scales. AI can process billions of documents with consistent quality standards -- something no human team can match. And because every curation decision is auditable (via the Sovereign governance layer), quality remains verifiable even at massive scale.
