Modi–Marathi Word-Level Dataset
Summary:
We are building an initial dataset of more than 50,000 word pairs drawn from Modi-script manuscripts, transliterated Marathi texts, and modern Marathi equivalents. Unlike sentence-level datasets, ours pairs text word by word and attaches meanings verified by Modi experts, so each entry can be reused outside the sentence it came from.
- Scanning authentic Modi documents and historic Marathi papers.
- Using custom JS tools to segment full Modi pages into individual words.
- Pairing Modi script words with their transliterated and modern Marathi forms.
- Validating pairs with Modi experts and historical dictionaries.
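The pairing and validation steps above could be represented roughly as follows. This is a minimal sketch, not the project's actual schema: the `WordPair` fields, the `check_pair` helper, and the sample values are all hypothetical. The only external facts it relies on are the Unicode block ranges for Modi (U+11600 to U+1165F) and Devanagari (U+0900 to U+097F).

```python
from dataclasses import dataclass, field

MODI_RANGE = (0x11600, 0x1165F)      # Unicode Modi block
DEVANAGARI_RANGE = (0x0900, 0x097F)  # Unicode Devanagari block
JOINERS = {"\u200c", "\u200d"}       # ZWNJ / ZWJ, which occur in Marathi text


def in_script(text: str, block: tuple[int, int]) -> bool:
    """True if text is non-empty and every character is in the block (or a joiner)."""
    lo, hi = block
    return bool(text) and all(lo <= ord(ch) <= hi or ch in JOINERS for ch in text)


@dataclass
class WordPair:
    """One word-level record (hypothetical schema)."""
    modi: str                # word in Modi script
    transliteration: str     # Devanagari transliteration
    modern_marathi: str      # modern Marathi equivalent
    source_image: str        # ID or path of the cropped word image
    validated_by: list[str] = field(default_factory=list)  # expert or dictionary IDs


def check_pair(pair: WordPair) -> list[str]:
    """Return a list of problems; an empty list means the record looks consistent."""
    problems = []
    if not in_script(pair.modi, MODI_RANGE):
        problems.append("modi field is not pure Modi script")
    if not in_script(pair.transliteration, DEVANAGARI_RANGE):
        problems.append("transliteration is not pure Devanagari")
    if not in_script(pair.modern_marathi, DEVANAGARI_RANGE):
        problems.append("modern_marathi is not pure Devanagari")
    if not pair.validated_by:
        problems.append("no expert or dictionary validation recorded")
    return problems


# Illustrative values only: the Modi string is a pair of placeholder code points,
# not a verified spelling, and the image path and expert ID are made up.
sample = WordPair(
    modi="\U0001160E\U00011600",
    transliteration="गड",
    modern_marathi="गड",
    source_image="page_001/word_0007.png",
    validated_by=["expert_01"],
)
print(check_pair(sample))
```

A check like this only catches mechanical problems such as a field filled in the wrong script or a missing sign-off. Whether a pairing is actually correct remains a judgment for the experts and dictionaries.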
Historic Dictionary Digitization & Vector Database
Summary:
We are converting historical Modi and Marathi dictionaries into a structured, machine-readable format and storing them in a vector database. This enables semantic search and helps our models retrieve accurate meanings, usages, and related entities.
- Digitizing dictionary entries and references.
- Converting entries into numerical vector representations.
- Linking words to forts, places, historical figures, and administrative terms (farman, khalita, jahagir, vatan, pargana, etc.).
- Using this as the backbone for proprietary RAG-based translation.
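To make the retrieval idea concrete, here is a small self-contained sketch of storing dictionary entries as vectors and looking them up by cosine similarity. It makes several assumptions: the `embed` function is a toy stand-in (hashed character bigrams) that only captures spelling overlap, whereas a real system would presumably use a learned embedding model and a dedicated vector database. The `Entry` and `DictionaryIndex` names, the tags, and the short glosses are illustrative and not taken from the project's dictionaries.

```python
import math
import zlib
from dataclasses import dataclass

DIM = 256  # vector size for the toy embedding (arbitrary choice)


def embed(text: str, n: int = 2) -> list[float]:
    """Toy embedding: hashed character n-gram counts, L2-normalized."""
    padded = f" {text.lower()} "
    vec = [0.0] * DIM
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        # crc32 is stable across runs, unlike Python's built-in hash() for str
        vec[zlib.crc32(gram.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


@dataclass
class Entry:
    headword: str
    gloss: str
    tags: tuple[str, ...]   # linked entity types, e.g. "administrative"
    vector: list[float]


class DictionaryIndex:
    """In-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self.entries: list[Entry] = []

    def add(self, headword: str, gloss: str, tags: tuple[str, ...] = ()) -> None:
        self.entries.append(Entry(headword, gloss, tags, embed(headword)))

    def search(self, query: str, k: int = 3) -> list[Entry]:
        q = embed(query)

        def score(e: Entry) -> float:
            # Vectors are unit length, so the dot product is the cosine similarity
            return sum(a * b for a, b in zip(q, e.vector))

        return sorted(self.entries, key=score, reverse=True)[:k]


index = DictionaryIndex()
index.add("pargana", "administrative district made up of a group of villages", ("administrative",))
index.add("farman", "royal decree or order", ("administrative",))
index.add("jahagir", "assignment of land revenue", ("administrative",))
index.add("vatan", "hereditary grant or right", ("administrative",))

for hit in index.search("pargana", k=2):
    print(hit.headword, "|", hit.gloss)
```

In a RAG-style pipeline, the glosses and tags of the top hits would be passed to the translation model as context alongside the word being translated.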
Evaluation of Existing Modi AI Models
Summary:
We have independently evaluated publicly available models and datasets for Modi script, including those based on the MoDeTrans dataset. Our tests indicate that models built on small, sentence-level datasets and unvalidated transliterations produce outputs that are largely inaccurate and unusable for serious research.
- Existing datasets often contain only ~2,000 lines of sentence-level data without clear word pairings.
- Output transliterations are frequently unintelligible or unrelated to the original text, which makes them especially hard for non-technical users to work with.
- There is a strong need for larger, carefully curated, word-level datasets with modern Marathi pairings and dictionary support.
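One common way to put a number on transliteration quality is the character error rate (CER): the edit distance between a model's output and an expert reference, divided by the reference length. The sketch below illustrates that metric only. It is an assumption that CER is suitable here, and nothing above states which metric our evaluation actually used. The function names are hypothetical, and the distance is counted over Unicode code points after NFC normalization rather than over grapheme clusters.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over code points (insert, delete, substitute = 1 each)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of one hypothesis against a non-empty reference."""
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def corpus_cer(pairs: list[tuple[str, str]]) -> float:
    """Total edits divided by total reference length over (reference, hypothesis) pairs."""
    norm = [(unicodedata.normalize("NFC", r), unicodedata.normalize("NFC", h))
            for r, h in pairs]
    total_chars = sum(len(r) for r, _ in norm)
    if total_chars == 0:
        raise ValueError("references must not all be empty")
    return sum(edit_distance(r, h) for r, h in norm) / total_chars


# One wrong consonant in a two-character word gives a CER of 0.5
print(cer("गड", "गढ"))
```

Because word-level pairs are short, a single wrong character moves the score a lot, so a corpus-level figure such as `corpus_cer` is usually more informative than per-word averages.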
Maratha History LLM (Planned)
Summary:
Using our curated dictionaries, paired word datasets, and historical corpora, we plan to fine-tune a domain-specific language model specializing in Maratha history as recorded in sources across languages. Because the model will be grounded in verified data, we expect it to hallucinate less and to be more historically reliable.