Project Overview | PHASE 2
Prometheus is an autonomous pretraining system where AI designs, curates, and generates its own training data -- then trains on it in a fully automated loop. The goal is simple but ambitious: remove the human bottleneck from the entire pretraining pipeline. Target model: GENESIS-100M, a 100-million-parameter language model trained from scratch on AI-curated data.
The guiding strategy is a streaming open corpus with AI acting as Data Architect -- selecting, filtering, and synthesizing the highest-quality training material from a vast pool of raw text, then augmenting it with synthetic data designed to fill identified knowledge gaps.
Four-Phase Pipeline
| Phase | Name | Description | Status |
|---|---|---|---|
| PHASE 1 | Raw Corpus Ingestion | Mass-scale text ingestion from open sources. Deduplication using MinHash LSH. Language detection and quality pre-filtering. | ✓ Complete |
| PHASE 2 | AI-Powered Data Architecture | AI classifies, tags, and scores every document. Quality filtering removes noise, propaganda, and machine-generated spam. Curriculum ordering for optimal learning trajectory. | ● Active |
| PHASE 3 | Synthetic Data Generation | AI identifies knowledge gaps in the corpus and generates targeted synthetic data to fill them. Quality scoring via multi-model consensus. Adversarial filtering to eliminate hallucinated content. | Pending |
| PHASE 4 | Pretraining Orchestration | Fully automated training loop with checkpointing, evaluation, and convergence monitoring. Curriculum learning with dynamic difficulty adjustment. | Pending |
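
Phase 1 names MinHash LSH as its deduplication technique. As a rough illustration of the idea, and not the project's actual implementation, the sketch below builds MinHash signatures over word shingles, uses banding to find candidate pairs, and then confirms them with the estimated Jaccard similarity. The shingle size, signature length, band count, 0.8 threshold, and every function name here are assumptions chosen for the example.

```python
import hashlib
from itertools import combinations

NUM_PERM = 64   # signature length (assumed value)
BANDS = 16      # number of LSH bands; rows per band = NUM_PERM // BANDS
SHINGLE = 3     # words per shingle (assumed value)


def shingles(text, k=SHINGLE):
    """Return the set of k-word shingles in a document."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash; each seed stands in for one permutation."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_perm=NUM_PERM):
    """Signature = minimum hash of the shingle set under each seeded hash."""
    tokens = shingles(text)
    if not tokens:
        return None
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_perm))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands=BANDS):
    """LSH banding: documents sharing any identical band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, []).append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def near_duplicates(docs, threshold=0.8):
    """Return sorted (id_a, id_b) pairs whose estimated similarity >= threshold."""
    signatures = {}
    for doc_id, text in docs.items():
        sig = minhash_signature(text)
        if sig is not None:
            signatures[doc_id] = sig
    if not signatures:
        return []
    return sorted(
        (a, b)
        for a, b in candidate_pairs(signatures)
        if estimated_jaccard(signatures[a], signatures[b]) >= threshold
    )
```

Calling `near_duplicates({"a": text_1, "b": text_2})` returns the ID pairs judged to be near-duplicates. A production pipeline at the scale described would more likely rely on a dedicated library and sharded buckets than on this in-memory version.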
Design Philosophy
Traditional pretraining pipelines are human-driven: engineers manually curate datasets, write filtering rules, and babysit training runs. Prometheus inverts this -- the AI itself becomes the data expert. It reads raw text, understands what's valuable, and makes curation decisions autonomously. The human role shifts from operator to auditor: reviewing decisions, not making them.
This approach scales. An AI curator can apply the same quality criteria to billions of documents -- a level of consistency no human team can sustain at that volume. And because every curation decision is auditable (via the Sovereign governance layer), quality remains verifiable even at massive scale.
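To make "auditable" concrete, here is a minimal sketch of one way curation decisions could be recorded so that an auditor can detect after-the-fact changes. The Sovereign governance layer's real interface is not described in this overview, so the `CurationDecision` fields, the `AuditLog` class, and the hash-chaining scheme are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


@dataclass(frozen=True)
class CurationDecision:
    """Hypothetical record of a single keep/drop decision."""
    doc_id: str
    action: str   # e.g. "keep" or "drop"
    score: float  # quality score assigned by the curator model
    reason: str   # short human-readable justification


def _entry_hash(prev_hash, decision_dict):
    """Hash the previous entry's hash together with this decision."""
    payload = json.dumps(decision_dict, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log in which each entry commits to the one before it."""

    def __init__(self):
        self.entries = []

    def append(self, decision):
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS_HASH
        record = asdict(decision)
        digest = _entry_hash(prev_hash, record)
        self.entries.append({"decision": record, "prev": prev_hash, "hash": digest})
        return digest

    def verify(self):
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = GENESIS_HASH
        for entry in self.entries:
            expected = _entry_hash(prev_hash, entry["decision"])
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True
```

An auditor would call `verify()` before reviewing a sample of entries. Chaining the hashes means a silently altered decision invalidates every later entry, which keeps spot checks meaningful even when the log is too large to read in full.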