Decoding India’s Historic Scripts with AI

From Modi Script to Modern Marathi:
Building new datasets for IndicScript AI models.

Modi–Marathi Word-Level Dataset

Summary:

We are building an initial dataset of over 50,000 paired words using Modi-script manuscripts, transliterated Marathi texts, and modern Marathi equivalents. Unlike sentence-level datasets, ours pairs text word by word and attaches meanings backed by Modi experts, so each entry can be reused outside the sentence it came from.

Key Activities:

Scanning authentic Modi documents and historic Marathi papers.
Using custom JS tools to split Modi full pages into words.
Pairing Modi script words with their transliterated and modern Marathi forms.
Validating pairs with Modi experts and historical dictionaries.
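
To make the pairing and validation steps concrete, here is a minimal sketch of what one word-level entry could look like. The field names, the file paths, the sample words, and the rule that an entry counts as validated once an expert has approved it or a dictionary source is cited are all assumptions made for illustration; they do not describe the project's actual schema or tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WordPair:
    """One word-level dataset entry (hypothetical schema, for illustration only)."""
    image_path: str                 # cropped Modi word image cut from a scanned page
    transliteration: str            # the Modi word transliterated into Devanagari
    modern_marathi: str             # the modern Marathi equivalent
    meaning: str = ""               # expert-backed gloss, when available
    expert_approved: bool = False   # set once a Modi expert has reviewed the pair
    dictionary_source: Optional[str] = None  # historical dictionary backing the pair

    def is_validated(self) -> bool:
        # Assumed rule: an expert sign-off or a cited dictionary entry is enough.
        return self.expert_approved or self.dictionary_source is not None


def validated_pairs(pairs: List[WordPair]) -> List[WordPair]:
    """Keep only the entries that passed the assumed validation rule."""
    return [p for p in pairs if p.is_validated()]


# Placeholder entries; the words and paths are examples, not dataset contents.
pairs = [
    WordPair("scans/page1_word1.png", "राजश्री", "राजश्री", expert_approved=True),
    WordPair("scans/page1_word2.png", "यांसी", "यांना"),
]
print(len(validated_pairs(pairs)))
```

Keeping the image reference, both text forms, and the validation evidence in a single record is what would let an entry be reused outside the sentence it was found in.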

Historic Dictionary Digitization & Vector Database

Summary:

We have independently evaluated publicly available models and datasets for Modi script, including those based on the MoDeTrans dataset. Our tests demonstrate that small, sentence-level datasets and unvalidated transliterations yield outputs that are largely inaccurate and unusable for serious research.

Key Findings:

Existing datasets often contain only ~2,000 lines of sentence-level data without clear word pairings.
Output transliterations are frequently illegible or unrelated to the original text, which makes them especially hard for non-technical users to rely on.
There is a strong need for larger, carefully curated, word‑level datasets with modern Marathi pairings and dictionary support.
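
One common way to quantify how far a model's transliteration is from an expert reference is the character error rate (CER): the edit distance between the two strings divided by the reference length. The sketch below is illustrative only. The text above does not say which metric our evaluation used, and the function names here are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length; 0.0 means an exact match."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Strings are compared per Unicode code point, so a Devanagari vowel sign
# counts as its own character in this simplified version.
print(character_error_rate("राजा", "राज"))
```

A value near 0 means the output closely matches the reference, and values approaching or exceeding 1 correspond to output that is essentially unrelated to the original text.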

Maratha History LLM (Planned)

Summary:

Using our curated dictionaries, paired word datasets, and historical corpora, we plan to fine‑tune a domain‑specific language model specializing in Maratha history, drawing on sources written in any language. Because the model will be grounded in verified data, we expect fewer hallucinations and better historical reliability.

Our Vision

Our long-term vision is to create reliable AI tools that can read, understand, and explain historic Indian documents, not just at the surface level of text, but with awareness of period-specific language, idioms, and historical context. We want historians, archivists, students, and citizens to be able to access centuries of material that is today locked away in scripts very few people can read.

Join us in building the foundation for AI that understands India’s past.

Together, we can preserve heritage and empower the future.

