An independent open-source AI lab · India

An open foundation language model for Bharat (India), one you can rebuild from scratch.

Akshara‑1B is Initializ Labs’ truly open-source, bilingual (English + Indic) 1B foundation model — in active development. Not open-weights. Open everything — data, code, recipes, every checkpoint, and a machine-auditable provenance ledger.

Status — Phase 0 · project setup & IndiaAI compute application. Not yet trained.
1B · Dense decoder-only
10 · Artifacts, all open at release
Grade A · Data only, by filter
EN + Indic · Bilingual by design
01 The thesis

Almost every “open” model is open-weights. You get the artifact — never the recipe.

Llama, Mistral, Gemma, Qwen, DeepSeek, and India’s own Sarvam each ship as a trained file you can run and fine-tune, but cannot audit or reproduce, and whose training data you cannot inspect. We are building the other thing.

Open-weights

A black box with a license

  • “We can’t tell you what’s in the training data”
  • No assembly code, no data recipe, no seeds
  • Intermediate checkpoints withheld
  • Cannot be independently re-derived
Truly open-source

A model flow anyone can re-run

  • The full corpus, or a reproducible assembly recipe
  • Training, midtraining, long-context & post-training code
  • Every checkpoint and log, released
  • A reproduction manifest that pins the entire run
02 The openness contract

A release is open only if all ten ship together.

The eight-artifact bar from our base methodology, extended with the two that the Indic setting forces into first-class status. Withhold any one, and we label it open-weights — honestly.

01

Model weights

Final + every intermediate checkpoint.

02

Training code

Pretrain, midtrain, long-context, post-train.

03

Data

The corpus, or a fully reproducible recipe.

04

Assembly code

Filtering, dedup, decontamination, tokenization.

05

Recipes

Configs, hyperparameters, mixes, seeds.

06

Training logs

Loss curves, metrics, run metadata.

07

Eval suite

Benchmarks, harness & decontamination report.

08

Repro manifest

Pinned commit, data version, config, hardware.

09 +

Custom tokenizer

Multilingual BPE — mandatory for Indic, hashed & released.

10 +

Grade ledger

Per-shard provenance grade — the truly-open claim, made binary.

The ten-artifact bar is binary. There is no “open enough.”
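The all-or-nothing rule can be sketched as a membership check over the ten artifacts. This is a minimal illustration, not a published schema: the artifact keys and the `release_label` helper are hypothetical names chosen for this example.

```python
from collections.abc import Iterable

# Hypothetical keys for the ten artifacts listed above (illustrative only).
REQUIRED_ARTIFACTS = (
    "model_weights",
    "training_code",
    "data",
    "assembly_code",
    "recipes",
    "training_logs",
    "eval_suite",
    "repro_manifest",
    "custom_tokenizer",
    "grade_ledger",
)


def missing_artifacts(shipped: Iterable[str]) -> list[str]:
    """Return the required artifacts that are absent from a release."""
    shipped_set = set(shipped)
    return [name for name in REQUIRED_ARTIFACTS if name not in shipped_set]


def release_label(shipped: Iterable[str]) -> str:
    """Label a release 'open-source' only if all ten artifacts ship together."""
    return "open-source" if not missing_artifacts(shipped) else "open-weights"
```

Withholding even one item, for example the grade ledger, flips the label to "open-weights", which is the behavior the contract describes.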

03 Provenance grading

Every shard is graded. Only Grade A ever crosses into training.

A grade filter sits between data and the model, so the openness guarantee can’t silently erode. A third party can re-run the filter and verify the claim from the manifest alone.

source · shard | license | sha256 | grade | filter
dolma3 / common-crawl | ODC-BY | a17f…c903 | A (open) | → pass
sangraha / verified · web (A1) | CC-BY-4.0 | 4d2e…8b71 | A1 (verified) | → pass
sangraha / verified · OCR-pdf (A2) | CC-BY-4.0 | 90c4…1f55 | A2 (review) | → gate
sangraha / unverified | mixed | e3aa…77d0 | B (uncertain) | ✕ drop
sangraha / synthetic (MT) | derived | bb18…4e2c | C (teacher) | ✕ drop
grade_filter(): verified-only Indic. We exclude perplexity-mined (B) and machine-translated or transliterated (C) data outright. It costs us token volume; it buys an unbroken provenance chain.
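A grade filter of this kind might look like the sketch below. The `Shard` record, the three-way split, and `verify_shard` are assumptions made for illustration; the project's actual ledger schema and gate logic for A2 shards are not specified here, so A2 is simply held back rather than passed.

```python
import hashlib
from dataclasses import dataclass

# Assumption: A and A1 pass outright, A2 waits behind a review gate,
# and every other grade (B, C, unknown) is dropped.
PASS_GRADES = {"A", "A1"}
GATED_GRADES = {"A2"}


@dataclass(frozen=True)
class Shard:
    source: str
    license: str
    sha256: str
    grade: str


def grade_filter(shards):
    """Split ledger rows into (passed, gated, dropped) lists by grade."""
    passed, gated, dropped = [], [], []
    for shard in shards:
        if shard.grade in PASS_GRADES:
            passed.append(shard)
        elif shard.grade in GATED_GRADES:
            gated.append(shard)
        else:
            dropped.append(shard)
    return passed, gated, dropped


def verify_shard(data: bytes, expected_sha256: str) -> bool:
    """Check that shard bytes match the digest recorded in the ledger."""
    return hashlib.sha256(data).hexdigest() == expected_sha256


# Usage with stand-in bytes; real digests come from the released ledger.
payload = b"example shard bytes"
ledger = [
    Shard("dolma3/common-crawl", "ODC-BY", hashlib.sha256(payload).hexdigest(), "A"),
    Shard("sangraha/verified-ocr", "CC-BY-4.0", "0" * 64, "A2"),
    Shard("sangraha/unverified", "mixed", "0" * 64, "B"),
]
passed, gated, dropped = grade_filter(ledger)
```

Because the filter depends only on the ledger rows, a third party holding the manifest can re-run it and compare the resulting shard lists.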
04 Bilingual by necessity

There is no Dolma-scale open Indic corpus. So the data layer is the project.

English supplies token volume and general capability. Verified Indic supplies coverage — upsampled with documented, ablated epoching, never padded with synthetic text.
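The upsampling arithmetic is simple enough to sketch. Given a corpus's raw token count and the share of the training budget it should fill, the number of epochs follows directly. The function name is hypothetical, and the Indic corpus size and target share in the usage lines are placeholders, not project figures; only the ~4T budget comes from this page.

```python
def epochs_needed(raw_tokens: float, target_fraction: float, total_budget: float) -> float:
    """Passes over a corpus so that it fills `target_fraction` of the token budget."""
    if raw_tokens <= 0:
        raise ValueError("raw_tokens must be positive")
    if not 0.0 <= target_fraction <= 1.0:
        raise ValueError("target_fraction must lie in [0, 1]")
    return target_fraction * total_budget / raw_tokens


# Placeholder inputs: a 50B-token verified corpus filling 5% of a 4T budget.
indic_epochs = epochs_needed(raw_tokens=50e9, target_fraction=0.05, total_budget=4e12)
print(f"epochs over the verified corpus: {indic_epochs:.1f}")
```

A high epoch count is the signal that repetition needs an ablation before it is committed to the recipe.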

English · the volume

Dolma 3 family

Common Crawl, S2 PDFs, code, arXiv, Wikipedia, FineMath: the vetted, ODC-BY-licensed open stack. English dominates the multi-trillion-token budget.

license: ODC-BY · grade: A
Indic · the coverage

Sangraha Verified

Human-verified web, OCR’d PDFs, transcribed media: original text, provenance-traceable. Plus a mandatory custom multilingual tokenizer, so that high token fertility in Indic scripts doesn’t eat into the context window.

license: CC-BY-4.0 · grade: A · verified-only
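Token fertility, the average number of tokens spent per word, is the quantity a custom tokenizer is meant to bring down for Indic scripts. The sketch below measures it for any tokenizer passed in as a callable. The `fertility` helper is a hypothetical name, and the byte-level stand-in is only there to show the effect; it is not the project's tokenizer.

```python
def fertility(texts, tokenize):
    """Average tokens per whitespace-separated word across `texts`."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    if n_words == 0:
        raise ValueError("no words in sample")
    return n_tokens / n_words


def byte_tokenize(text):
    """Stand-in tokenizer: one token per UTF-8 byte (worst case for Indic scripts)."""
    return list(text.encode("utf-8"))


# Devanagari code points take 3 bytes each in UTF-8, so the same word
# costs more tokens than its ASCII counterpart under a byte-level scheme.
hindi = fertility(["भारत"], byte_tokenize)    # 4 code points * 3 bytes = 12.0
english = fertility(["India"], byte_tokenize)  # 5 ASCII bytes = 5.0
```

Swapping `byte_tokenize` for a trained multilingual BPE and running the same function on a held-out Indic sample gives a comparable fertility figure.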
05 The model

A reproducible 1B shape, on AI2’s OLMo 3 model-flow methodology.

अक्षर · akshara — “the imperishable letter,” the irreducible unit of language. Dense, decoder-only, built on a forked OLMo-core trainer. Every choice is a config value — so the later scale-up to 7B is a diff, not a rewrite.

Parameters: ~1.0–1.2B
Hidden size: 2048
Layers: 16
Attention: 16 heads · MHA + QK-norm
MLP: SwiGLU · 8192
Vocab: ~100k–128k custom multilingual
Embeddings: tied (Indic vocab budget)
Norm: RMSNorm · reordered
Context: 4K → 16K/32K via YaRN
Precision: bf16 · fp32 master
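As a sanity check on the shape above, the weight count can be estimated from the listed dimensions. This is a rough sketch that assumes no bias terms and ignores the small norm and QK-norm parameters; the function name is hypothetical.

```python
def dense_decoder_params(d_model, n_layers, d_ff, vocab, tied=True):
    """Return (block_params, embedding_params) for a dense decoder-only model."""
    attn = 4 * d_model * d_model   # Q, K, V and output projections (MHA, no biases)
    mlp = 3 * d_model * d_ff       # SwiGLU: gate, up and down projections
    blocks = n_layers * (attn + mlp)
    # Tied embeddings share one matrix between the input and output layers.
    embed = vocab * d_model * (1 if tied else 2)
    return blocks, embed


for vocab in (100_000, 128_000):
    blocks, embed = dense_decoder_params(2048, 16, 8192, vocab)
    print(f"vocab={vocab}: blocks={blocks / 1e9:.2f}B, total={(blocks + embed) / 1e9:.2f}B")
```

Under these assumptions the transformer blocks alone come to about 1.07B parameters, and tied embeddings add roughly 0.20–0.26B on top, so the ~1.0–1.2B figure reads closest to the non-embedding count.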

The model flow

1
Pretrain

~4T effective tokens · bilingual Grade-A

2
Midtrain

Late-stage curriculum · capability injection

3
Long-context

YaRN scaling to target length

Post-train → Instruct / Think

SFT · DPO · RLVR — Grade-A data only
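For the long-context stage, the core of YaRN is a per-dimension blend between interpolated and unchanged RoPE frequencies, plus a small attention scaling factor. The sketch below follows the formulation in the YaRN paper with its suggested ramp bounds for LLaMA-style models (alpha = 1, beta = 32). The RoPE base of 10000 in the usage lines is an assumption, since the page does not state one; the head dimension of 128 follows from a hidden size of 2048 split over 16 heads.

```python
import math


def yarn_inv_freq(head_dim, base, orig_ctx, scale, alpha=1.0, beta=32.0):
    """Return YaRN-adjusted RoPE inverse frequencies for one attention head."""
    freqs = []
    for i in range(0, head_dim, 2):
        theta = base ** (-i / head_dim)                # original RoPE frequency
        rotations = orig_ctx * theta / (2 * math.pi)   # full turns within orig_ctx
        # Ramp: 0 means fully interpolate (theta / scale), 1 means leave unchanged.
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        freqs.append((1 - gamma) * theta / scale + gamma * theta)
    return freqs


def yarn_attention_factor(scale):
    """Multiplier applied to queries and keys: 0.1 * ln(scale) + 1."""
    return 0.1 * math.log(scale) + 1.0


# Extending a 4K context to 16K is a scale factor of 4.
freqs = yarn_inv_freq(head_dim=128, base=10000.0, orig_ctx=4096, scale=4.0)
factor = yarn_attention_factor(4.0)
```

High-frequency dimensions, which complete many rotations inside the original window, are left alone, while the lowest-frequency dimensions are divided by the scale factor.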

06 Sovereign compute · planned

Built to train on IndiaAI Compute — economically tractable for a small team.

A 1B / 4T-token run is on the order of 20,000–38,000 H100-hours. On subsidized national infrastructure, that should land far below a commercial cloud run. Empanelment and the IndiaAI application are themselves Phase-0 work, so the figures below are planning estimates, not costs incurred.

~₹100/gpu-hr
Target subsidized rate

IndiaAI empaneled providers, with up to a 40% subsidy for eligible projects.

₹20–38 lakh
Base model · est. end-to-end

≈ $21–40k. Versus ₹38 lakh–₹1.1 crore (~$40–115k) on commercial cloud at $2–3/H100-hr.

~20–38k hrs
H100 · total flow

3–6 days wall-clock on a 256× H100 cluster for stage 1.
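The planning figures can be reproduced with the common 6 × parameters × tokens approximation for training FLOPs. This is a back-of-envelope sketch: the 989 TFLOP/s dense bf16 peak assumes the H100 SXM part, the 20–40% utilization range is an assumption, and the function names are hypothetical.

```python
H100_BF16_PEAK_FLOPS = 989e12  # assumption: H100 SXM dense bf16 peak, in FLOP/s


def gpu_hours(params, tokens, mfu, peak_flops=H100_BF16_PEAK_FLOPS):
    """Estimate GPU-hours from the ~6 * N * D training-FLOPs rule of thumb."""
    total_flops = 6 * params * tokens
    return total_flops / (peak_flops * mfu) / 3600


def cost_inr(hours, rate_per_gpu_hour=100):
    """Cost in rupees at a flat per-GPU-hour rate (default: the ~₹100 target)."""
    return hours * rate_per_gpu_hour


def wall_clock_days(hours, n_gpus=256):
    """Days of wall-clock time if the GPU-hours are spread over n_gpus."""
    return hours / n_gpus / 24


for mfu in (0.4, 0.2):  # assumed model-FLOPs-utilization range
    hours = gpu_hours(params=1.1e9, tokens=4e12, mfu=mfu)
    print(f"MFU {mfu:.0%}: {hours:,.0f} H100-hours, "
          f"₹{cost_inr(hours) / 1e5:.0f} lakh, "
          f"{wall_clock_days(hours):.1f} days on 256 GPUs")
```

With a 1.1B-parameter model this gives roughly 18.5k H100-hours at 40% utilization and 37k at 20%, in line with the 20–38k range quoted above.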

07 Where we are · the open plan

Phase-gated, in public. Each gate is an acceptance test: no phase starts until the previous one passes.

We publish status honestly. Today Akshara is at Phase 0: standing up the repos and applying for sovereign compute. Nothing here is trained or released yet — and we’ll say so until it is.

PHASE 0

Foundation

Fork the open stack; smoke-test the trainer; begin the IndiaAI compute application.

In progress
PHASE 2

Tokenizer

Train + freeze the multilingual BPE; benchmark Indic fertility.

Planned
PHASE 3–6

Train the flow

Ladder validation → pretrain → midtrain → long-context → post-train (Instruct / Think).

Planned
PHASE 7

Open release

Decontamination report, grade ledger, tokenizer, repro manifest — all ten artifacts.

Planned
Provenance over performance

Auditable by construction. Reproducible by design. Built in India.

When openness and capability conflict, openness wins. That’s the whole point — and the moat.