Akshara‑1B is Initializ Labs’ truly open-source, bilingual (English + Indic) 1B foundation model — in active development. Not open-weights. Open everything — data, code, recipes, every checkpoint, and a machine-auditable provenance ledger.
Llama, Mistral, Gemma, Qwen, DeepSeek, and India’s own Sarvam all ship the same thing: a trained file you can run and fine-tune, but cannot audit or reproduce, built on data you have no way to verify. We are building the other thing.
We hold ourselves to the eight-artifact bar from our base methodology, extended with two more artifacts that the Indic setting forces into first-class status. If we withhold any one of them, we label the release open-weights, and we say so honestly.
Final + every intermediate checkpoint.
Pretrain, midtrain, long-context, post-train.
The corpus, or a fully reproducible recipe.
Filtering, dedup, decontamination, tokenization.
Configs, hyperparameters, mixes, seeds.
Loss curves, metrics, run metadata.
Benchmarks, harness & decontamination report.
Pinned commit, data version, config, hardware.
Multilingual BPE — mandatory for Indic, hashed & released.
Per-shard provenance grade — the truly-open claim, made binary.
The ten-artifact bar is binary. There is no “open enough.”
A grade filter sits between data and the model, so the openness guarantee can’t silently erode. A third party can re-run the filter and verify the claim from the manifest alone.
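A minimal sketch of how such a grade filter and third-party check might look. The manifest schema (`shard_id`, `grade`, `sha256`), the set of allowed grade labels, and the function names are all assumptions made for illustration; the real ledger format has not been published.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Assumed grade labels: "A" plus the A1/A2 sub-grades. Hypothetical.
ALLOWED_GRADES = frozenset({"A", "A1", "A2"})


@dataclass(frozen=True)
class Shard:
    """One manifest entry (hypothetical schema)."""
    shard_id: str
    grade: str
    sha256: str


def filter_shards(manifest: Iterable[Shard]) -> List[Shard]:
    """Keep only shards whose provenance grade is allowed for training."""
    return [s for s in manifest if s.grade in ALLOWED_GRADES]


def find_violations(manifest: Iterable[Shard], training_ids: Iterable[str]) -> List[str]:
    """Return training shard IDs that are absent from the manifest
    or carry a grade outside ALLOWED_GRADES. Empty list means the claim holds."""
    allowed_ids = {s.shard_id for s in filter_shards(manifest)}
    return sorted(set(training_ids) - allowed_ids)


# Toy manifest with placeholder hashes, for illustration only.
manifest = [
    Shard("en-web-0001", "A1", "placeholder-hash-1"),
    Shard("hi-ocr-0001", "A2", "placeholder-hash-2"),
    Shard("xx-synth-0001", "C", "placeholder-hash-3"),
]

kept = filter_shards(manifest)
violations = find_violations(manifest, ["en-web-0001", "xx-synth-0001"])
```

Because the check needs only the manifest and the list of shard IDs a run consumed, an auditor could in principle re-run it without access to the training cluster.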
English supplies token volume and general capability. Verified Indic supplies coverage — upsampled with documented, ablated epoching, never padded with synthetic text.
Common Crawl, S2 PDFs, code, arXiv, Wikipedia, FineMath: the vetted, ODC-BY-licensed open stack. English-dominant, it supplies the bulk of the multi-trillion-token budget.
Human-verified web, OCR’d PDFs, transcribed media: original text, provenance-traceable. Plus a mandatory custom multilingual tokenizer, so Indic scripts don’t lose context window to high token fertility (too many tokens per word).
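Token fertility is commonly measured as the average number of tokens a tokenizer emits per word. The sketch below shows one way to compute it, assuming whitespace-delimited words and a pluggable `tokenize` callable. The `char_tokenize` helper is a deliberately bad stand-in tokenizer for illustration, not the Akshara tokenizer.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average tokens per whitespace-delimited word across all texts."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


def char_tokenize(text: str) -> List[str]:
    """Worst-case stand-in: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def word_tokenize(text: str) -> List[str]:
    """Best-case stand-in: one token per word."""
    return text.split()


sample = ["ab cd"]
high = fertility(sample, char_tokenize)   # 4 tokens / 2 words
low = fertility(sample, word_tokenize)    # 2 tokens / 2 words
```

A tokenizer trained mostly on English tends to sit closer to the character-level end for Indic scripts, which is why the same context length holds fewer Indic words.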
अक्षर · akshara — “the imperishable letter,” the irreducible unit of language. Dense, decoder-only, built on a forked OLMo-core trainer. Every choice is a config value — so the later scale-up to 7B is a diff, not a rewrite.
~4T effective tokens · bilingual Grade-A
Late-stage curriculum · capability injection
YaRN scaling to target length
SFT · DPO · RLVR — Grade-A data only
A 1B / 4T-token run is on the order of 20,000–38,000 H100-hours. On subsidized national infrastructure, that should land far below a commercial cloud run. Empanelment and the IndiaAI application are themselves Phase-0 work; the figures below are planning estimates, not costs incurred.
IndiaAI empaneled providers, with up to a 40% subsidy for eligible projects.
≈ $21–40k. Versus ₹38 lakh–₹1.1 crore (~$40–115k) on commercial cloud at $2–3/H100-hr.
3–6 days wall-clock on a 256× H100 cluster for stage 1.
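The commercial-cloud and wall-clock figures above follow from simple arithmetic on the stated H100-hour range. The sketch below reproduces them under two simplifying assumptions: all 256 GPUs are busy for the whole run with no downtime, and the full H100-hour budget is spent in stage 1.

```python
from typing import Tuple

H100_HOURS = (20_000, 38_000)          # low/high estimate from the text
COMMERCIAL_USD_PER_HOUR = (2.0, 3.0)   # commercial cloud rate from the text
CLUSTER_GPUS = 256                     # cluster size from the text


def cost_range_usd(hours: Tuple[int, int], rates: Tuple[float, float]) -> Tuple[float, float]:
    """Cheapest case pairs low hours with the low rate; priciest pairs high with high."""
    return (hours[0] * rates[0], hours[1] * rates[1])


def wall_clock_days(hours: Tuple[int, int], gpus: int) -> Tuple[float, float]:
    """GPU-hours spread evenly over `gpus` devices, converted to days."""
    return (hours[0] / gpus / 24, hours[1] / gpus / 24)


commercial = cost_range_usd(H100_HOURS, COMMERCIAL_USD_PER_HOUR)
days = wall_clock_days(H100_HOURS, CLUSTER_GPUS)

print(f"commercial cloud: ${commercial[0]:,.0f} to ${commercial[1]:,.0f}")
print(f"wall-clock: {days[0]:.1f} to {days[1]:.1f} days")
```

This gives roughly $40k to $114k and 3.3 to 6.2 days, matching the rounded figures quoted above. Real runs would add overhead for restarts, evaluation, and queueing.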
We publish status honestly. Today Akshara is at Phase 0: standing up the repos and applying for sovereign compute. Nothing here is trained or released yet — and we’ll say so until it is.
Fork the open stack; smoke-test the trainer; begin the IndiaAI compute application.
Grade-A-only manifests; A1/A2 sub-grading; measure the real Indic token count.
Train + freeze the multilingual BPE; benchmark Indic fertility.
Ladder validation → pretrain → midtrain → long-context → post-train (Instruct / Think).
Decontamination report, grade ledger, tokenizer, repro manifest — all ten artifacts.
When openness and capability conflict, openness wins. That’s the whole point — and the moat.