Modi–Marathi Word-Level Dataset
Summary:
We are building an initial dataset of over 50,000 paired words from Modi-script manuscripts,
transliterated Marathi texts, and modern Marathi equivalents. Unlike sentence-level datasets,
ours pairs words one by one, each backed by a meaning supplied by Modi experts, so every
entry can be reused outside the sentence it came from.
- Scanning authentic Modi documents and historic Marathi papers.
- Using custom JS tools to segment full Modi pages into individual words.
- Pairing Modi-script words with transliterated and modern Marathi forms.
- Validating pairs with Modi experts and historical dictionaries.
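As a rough illustration of what one word-level entry could look like, here is a minimal Python sketch. The `WordPair` class, its field names, and the `to_jsonl` helper are assumptions made for this example, not the project's actual schema; the sample values are placeholders.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class WordPair:
    """One word-level entry. All field names are illustrative assumptions."""
    modi_image: str              # path to the word image cropped from a scanned page
    transliteration: str         # transliterated form of the Modi word
    modern_marathi: str          # modern Marathi equivalent
    meaning: str = ""            # gloss supplied by a Modi expert
    sources: list[str] = field(default_factory=list)  # dictionaries consulted
    expert_validated: bool = False  # set once an expert has confirmed the pair


def to_jsonl(pairs: list[WordPair]) -> str:
    """Serialize only expert-validated pairs, one JSON object per line."""
    return "\n".join(
        json.dumps(asdict(p), ensure_ascii=False)
        for p in pairs
        if p.expert_validated
    )


# Placeholder entries: one validated, one still awaiting review.
pairs = [
    WordPair("page_001/word_004.png", "परगणा", "परगणा",
             meaning="administrative district", expert_validated=True),
    WordPair("page_001/word_005.png", "वतन", "वतन"),
]
print(to_jsonl(pairs))  # only the validated entry is written out
```

Keeping the validation flag on each record means unreviewed pairs can stay in the working set without leaking into the released dataset.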
Historic Dictionary Digitization & Vector Database
Summary:
We are converting historical Modi and Marathi dictionaries into a structured,
machine-readable format and storing them in a vector database. This enables semantic
search and helps our models retrieve accurate meanings, usages, and related entities.
- Digitizing dictionary entries and references.
- Converting entries into numerical vector representations.
- Linking words to forts, places, historical figures, and administrative terms (farman, khalita, jahagir, vatan, pargana, etc.).
- Using this database as the backbone for our proprietary retrieval-augmented generation (RAG) translation systems.
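The retrieval step behind the points above can be sketched as a nearest-neighbour lookup by cosine similarity. In this sketch the `embed` function is a toy stand-in (character-bigram counts) for a real embedding model, and `DictionaryIndex` is a hypothetical in-memory substitute for a vector database; neither reflects the project's actual stack. The glosses are brief illustrative definitions.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: character-bigram counts."""
    padded = f" {text} "
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class DictionaryIndex:
    """Hypothetical in-memory index of dictionary entries."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, str, Counter]] = []

    def add(self, headword: str, definition: str) -> None:
        # Store the vector alongside the entry so search never re-embeds it.
        vector = embed(f"{headword} {definition}")
        self._entries.append((headword, definition, vector))

    def search(self, query: str, k: int = 3) -> list[tuple[str, str]]:
        """Return the k entries whose vectors are closest to the query."""
        q = embed(query)
        ranked = sorted(self._entries, key=lambda e: cosine(q, e[2]), reverse=True)
        return [(head, definition) for head, definition, _ in ranked[:k]]


index = DictionaryIndex()
index.add("pargana", "administrative district")
index.add("vatan", "hereditary land right")
index.add("farman", "royal decree")
print(index.search("pargana district", k=1))
```

Bigram counts only capture surface overlap, so this shows the mechanics of vector lookup rather than true semantic search; a learned embedding model would replace `embed` while the ranking logic stays the same.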
Evaluation of Existing Modi AI Models
Summary:
We have independently evaluated publicly available models and datasets for Modi script,
including those based on the MoDeTrans dataset. Our tests demonstrate that small,
sentence-level datasets and unvalidated transliterations yield outputs that are largely
inaccurate and unusable for serious research.
- Existing datasets often contain only ~2,000 lines of sentence-level data without clear word pairings.
- Output transliterations are frequently unreadable or unrelated to the original text, a problem that is most severe for non-technical users.
- There is a strong need for larger, carefully curated, word-level datasets with modern Marathi pairings and dictionary support.
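The text above does not say which metric was used in the evaluation. One common way to quantify transliteration accuracy is the character error rate (CER): the edit distance between a model's output and an expert reference, divided by the reference length. The sketch below assumes that metric; the function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of a hypothesis against an expert reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because this counts Unicode code points, a Devanagari consonant and its vowel sign are scored as separate characters; grouping them into grapheme clusters first would be a reasonable refinement.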
Maratha History LLM (Planned)
Summary:
Using our curated dictionaries, paired word datasets, and historical corpora,
we plan to fine-tune a domain-specific language model specializing in Maratha history.
Grounding the model in verified data should reduce hallucinations and improve
historical reliability.