Initializ Labs · India

01 The thesis

Almost every “open” model is open-weights. You get the artifact — never the recipe.

Llama, Mistral, Gemma, Qwen, DeepSeek, and India’s own Sarvam: each is a trained file you can run and fine-tune, but cannot audit or reproduce, and whose training data you cannot verify. We are building the other thing.

Open-weights

A black box with a license

“We can’t tell you what’s in the training data”
No assembly code, no data recipe, no seeds
Intermediate checkpoints withheld
Cannot be independently re-derived

Truly open-source

A model flow anyone can re-run

The full corpus, or a reproducible assembly recipe
Training, midtraining, long-context & post-training code
Every checkpoint and log, released
A reproduction manifest that pins the entire run

02 The openness contract

A release is open only if all ten ship together.

The eight-artifact bar from our base methodology, extended with the two that the Indic setting forces into first-class status. Withhold any one, and we label it open-weights — honestly.

Model weights

Final + every intermediate checkpoint.

Training code

Pretrain, midtrain, long-context, post-train.

Data

The corpus, or a fully reproducible recipe.

Assembly code

Filtering, dedup, decontamination, tokenization.

Recipes

Configs, hyperparameters, mixes, seeds.

Training logs

Loss curves, metrics, run metadata.

Eval suite

Benchmarks, harness & decontamination report.

Repro manifest

Pinned commit, data version, config, hardware.

09 +

Custom tokenizer

Multilingual BPE — mandatory for Indic, hashed & released.

10 +

Grade ledger

Per-shard provenance grade — the truly-open claim, made binary.

The ten-artifact bar is binary. There is no “open enough.”
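The binary rule above can be sketched as a small check. This is a minimal illustration, not project code: the artifact identifiers and the `release_label` function are hypothetical names chosen to mirror the ten items listed in this section.

```python
# Hypothetical identifiers for the ten artifacts listed in this section.
REQUIRED_ARTIFACTS = (
    "model_weights", "training_code", "data", "assembly_code", "recipes",
    "training_logs", "eval_suite", "repro_manifest",
    "custom_tokenizer", "grade_ledger",
)


def missing_artifacts(shipped):
    """Return the required artifacts that are absent from a release."""
    shipped = set(shipped)
    return [name for name in REQUIRED_ARTIFACTS if name not in shipped]


def release_label(shipped):
    """Label a release 'open-source' only if all ten artifacts ship together."""
    return "open-source" if not missing_artifacts(shipped) else "open-weights"


if __name__ == "__main__":
    partial = set(REQUIRED_ARTIFACTS) - {"grade_ledger"}
    print(release_label(REQUIRED_ARTIFACTS), release_label(partial))
```

Withholding a single artifact flips the label, which is the point of making the bar binary rather than scored.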

03 Provenance grading

Every shard is graded. Only Grade A ever crosses into training.

A grade filter sits between data and the model, so the openness guarantee can’t silently erode. A third party can re-run the filter and verify the claim from the manifest alone.

| source · shard | license | sha256 | grade | filter |
|---|---|---|---|---|
| dolma3 / common-crawl | ODC-BY | a17f…c903 | A (open) | → pass |
| sangraha / verified · web (A1) | CC-BY-4.0 | 4d2e…8b71 | A1 (verified) | → pass |
| sangraha / verified · OCR-pdf (A2) | CC-BY-4.0 | 90c4…1f55 | A2 (review) | → gate |
| sangraha / unverified | mixed | e3aa…77d0 | B (uncertain) | ✕ drop |
| sangraha / synthetic (MT) | derived | bb18…4e2c | C (teacher) | ✕ drop |

grade_filter() — verified-only Indic. We exclude perplexity-mined (B) and machine-translated/transliterated (C) data outright. It costs us token volume; it buys an unbroken provenance chain.
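A minimal sketch of what such a filter could look like, assuming a ledger of per-shard records with the same columns as the table above. The `Shard` class, the `DECISIONS` mapping, and the demo shard names are hypothetical; only the grade-to-decision mapping is taken from the table. As an extra assumption, a shard whose bytes do not match its recorded hash is dropped regardless of grade.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    source: str
    license: str
    sha256: str
    grade: str  # "A", "A1", "A2", "B" or "C"


# Decision per grade, mirroring the table above; unknown grades are dropped.
DECISIONS = {"A": "pass", "A1": "pass", "A2": "gate", "B": "drop", "C": "drop"}


def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def grade_filter(ledger, payloads):
    """Sort shard sources into pass / gate / drop buckets.

    A shard is dropped if its payload is missing or its hash does not
    match the ledger, so the grade cannot be attached to different bytes.
    """
    buckets = {"pass": [], "gate": [], "drop": []}
    for shard in ledger:
        data = payloads.get(shard.source)
        intact = data is not None and sha256_of(data) == shard.sha256
        decision = DECISIONS.get(shard.grade, "drop") if intact else "drop"
        buckets[decision].append(shard.source)
    return buckets


# Hypothetical demo data, not real shards.
payloads = {"demo/open": b"open text", "demo/ocr": b"scanned text", "demo/mt": b"translated text"}
ledger = [
    Shard("demo/open", "ODC-BY", sha256_of(payloads["demo/open"]), "A"),
    Shard("demo/ocr", "CC-BY-4.0", sha256_of(payloads["demo/ocr"]), "A2"),
    Shard("demo/mt", "derived", sha256_of(payloads["demo/mt"]), "C"),
]
result = grade_filter(ledger, payloads)
```

Because the decision depends only on the ledger row and the shard bytes, a third party holding both can re-run the same function and compare buckets.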

04 Bilingual by necessity

There is no Dolma-scale open Indic corpus. So the data layer is the project.

English supplies token volume and general capability. Verified Indic supplies coverage — upsampled with documented, ablated epoching, never padded with synthetic text.
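To make the upsampling concrete, the number of passes over the verified Indic data follows from three quantities: the total token budget, the share of that budget given to Indic text, and the count of unique Indic tokens. The sketch below is illustrative only. The function name is hypothetical, and the 10% share and 100B unique-token figures in the usage line are placeholders, since the real Indic token count is still to be measured.

```python
def indic_epochs(total_budget: float, indic_share: float, indic_unique: float) -> float:
    """Epochs over the unique Indic corpus needed to fill its budget share."""
    if not 0.0 < indic_share < 1.0:
        raise ValueError("indic_share must be strictly between 0 and 1")
    if total_budget <= 0 or indic_unique <= 0:
        raise ValueError("token counts must be positive")
    return total_budget * indic_share / indic_unique


if __name__ == "__main__":
    # Placeholder inputs: 4T-token budget, 10% Indic share, 100B unique Indic tokens.
    print(indic_epochs(4e12, 0.10, 100e9))
```

Documenting this ratio per language, and ablating it, is what separates deliberate epoching from padding.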

English · the volume

Dolma 3 family

Common Crawl, S2 PDFs, code, arXiv, Wikipedia, FineMath: the vetted, ODC-BY-licensed open stack. English dominates the multi-trillion-token budget.

license: ODC-BY · grade: A

Indic · the coverage

Sangraha Verified

Human-verified web, OCR’d PDFs, transcribed media: original text with traceable provenance. Plus a mandatory custom multilingual tokenizer, so Indic scripts don’t lose effective context to high token fertility (many tokens per word).

license: CC-BY-4.0 · grade: A · verified-only
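Token fertility is commonly measured as average tokens per word. The sketch below assumes whitespace-delimited words and accepts any tokenizer as a plain callable. The `byte_tokenize` stand-in is not the project tokenizer; it only illustrates why a tokenizer that falls back to bytes is costly for Devanagari, where each code point takes three UTF-8 bytes.

```python
from typing import Callable, Iterable, Sequence


def fertility(texts: Iterable[str], tokenize: Callable[[str], Sequence]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def byte_tokenize(text: str) -> list:
    """Stand-in tokenizer: one token per UTF-8 byte (a pessimistic baseline)."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    print(fertility(["open model"], byte_tokenize))
    print(fertility(["\u092d\u093e\u0930\u0924"], byte_tokenize))
```

Higher fertility means fewer words fit in the same context window, so comparing this number across languages on the same tokenizer gives a direct view of the imbalance.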

05 The model

A reproducible 1B shape, on AI2’s OLMo 3 model-flow methodology.

अक्षर · akshara — “the imperishable letter,” the irreducible unit of language. Dense, decoder-only, built on a forked OLMo-core trainer. Every choice is a config value — so the later scale-up to 7B is a diff, not a rewrite.

Parameters: ~1.0–1.2B

Hidden size: 2048

Layers: 16

Attention: 16 heads · MHA + QK-norm

MLP: SwiGLU · 8192

Vocab: ~100k–128k custom multilingual

Embeddings: tied (Indic vocab budget)

Norm: RMSNorm · reordered

Context: 4K → 16K/32K via YaRN

Precision: bf16 · fp32 master
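As a rough cross-check of the shape above, the parameter count of a dense decoder can be estimated from the listed values. The sketch assumes bias-free linear layers, full multi-head attention with four square projections, a three-matrix SwiGLU MLP, and ignores the small norm weights. The exact vocabulary sizes (100,000 and 128,000) are assumptions standing in for the "~100k–128k" range.

```python
def dense_param_count(d_model: int, n_layers: int, d_ff: int, vocab: int, tied: bool = True) -> dict:
    """Approximate parameter count for a dense decoder-only transformer."""
    attn = 4 * d_model * d_model   # Q, K, V and output projections
    mlp = 3 * d_model * d_ff       # SwiGLU: gate, up and down projections
    blocks = n_layers * (attn + mlp)
    embed = vocab * d_model * (1 if tied else 2)  # tied: input and output share one matrix
    return {"non_embedding": blocks, "embedding": embed, "total": blocks + embed}


if __name__ == "__main__":
    for vocab in (100_000, 128_000):  # assumed endpoints of the stated range
        counts = dense_param_count(d_model=2048, n_layers=16, d_ff=8192, vocab=vocab)
        print(vocab, {k: round(v / 1e9, 3) for k, v in counts.items()})
```

Under these assumptions the transformer blocks come to about 1.07B parameters, and the tied embedding matrix adds roughly 0.20B to 0.26B depending on vocabulary size, so the headline figure depends on whether embeddings are counted.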

The model flow

Pretrain

~4T effective tokens · bilingual Grade-A

Midtrain

Late-stage curriculum · capability injection

Long-context

YaRN scaling to target length

Post-train → Instruct / Think

SFT · DPO · RLVR — Grade-A data only

06 Sovereign compute · planned

Built to train on IndiaAI Compute — economically tractable for a small team.

A 1B / 4T-token run is on the order of 20,000–38,000 H100-hours. On subsidized national infrastructure, that should land far below a commercial cloud run. Empanelment and the IndiaAI application are themselves Phase-0 work; the figures below are planning estimates, not costs incurred.

~₹100/gpu-hr

Target subsidized rate

IndiaAI empaneled providers, with up to a 40% subsidy for eligible projects.

₹20–38 lakh

Base model · est. end-to-end

≈ $21–40k. Versus ₹38 lakh–₹1.1 crore (~$40–115k) on commercial cloud at $2–3/H100-hr.

~20–38k hrs

H100 · total flow

3–6 days wall-clock on a 256× H100 cluster for stage 1.
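The planning figures above can be reproduced with a few lines of arithmetic. The sketch uses the common 6·N·D approximation for training FLOPs and covers pretraining only. The peak throughput constant (dense BF16 on an H100 SXM) and the utilization values are assumptions, not measurements from this project; the rate, hour range, and GPU count are the ones quoted in this section.

```python
H100_BF16_PEAK_FLOPS = 989e12  # assumed dense BF16 peak per GPU, FLOP/s


def gpu_hours(n_params: float, n_tokens: float, mfu: float,
              peak_flops: float = H100_BF16_PEAK_FLOPS) -> float:
    """GPU-hours for a training run, using the 6*N*D FLOPs rule of thumb."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be in (0, 1]")
    total_flops = 6.0 * n_params * n_tokens
    return total_flops / (peak_flops * mfu * 3600.0)


def cost_lakh(hours: float, rate_inr_per_gpu_hour: float) -> float:
    """Cost in lakh rupees (1 lakh = 100,000 INR)."""
    return hours * rate_inr_per_gpu_hour / 1e5


def wall_clock_days(hours: float, n_gpus: int) -> float:
    """Wall-clock days, assuming perfect scaling across n_gpus."""
    return hours / n_gpus / 24.0


if __name__ == "__main__":
    for mfu in (0.4, 0.2):  # assumed utilization range
        print(mfu, round(gpu_hours(1.0e9, 4e12, mfu)))
    for hours in (20_000, 38_000):
        print(hours, cost_lakh(hours, 100), round(wall_clock_days(hours, 256), 1))
```

At ₹100 per GPU-hour the quoted 20,000–38,000 hours map directly to ₹20–38 lakh, and dividing by 256 GPUs gives a little over 3 to a little over 6 days.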

07 Where we are · the open plan

Phase-gated, in public. Each gate is an acceptance test — no phase starts until the last one passes.

We publish status honestly. Today Akshara is at Phase 0: standing up the repos and applying for sovereign compute. Nothing here is trained or released yet — and we’ll say so until it is.

PHASE 0

Foundation

Fork the open stack; smoke-test the trainer; begin the IndiaAI compute application.

In progress

PHASE 1

Data + provenance grading

Grade-A-only manifests; A1/A2 sub-grading; measure the real Indic token count.

Planned

PHASE 2

Tokenizer

Train + freeze the multilingual BPE; benchmark Indic fertility.

Planned

PHASE 3–6

Train the flow

Ladder validation → pretrain → midtrain → long-context → post-train (Instruct / Think).

Planned

PHASE 7

Open release

Decontamination report, grade ledger, tokenizer, repro manifest — all ten artifacts.

Planned

An open foundation language model for भारत — one you can rebuild from scratch.

Auditable by construction. Reproducible by design. Built in India.