Our Projects

Modi–Marathi Word-Level Dataset

Summary:
We are building an initial dataset of more than 50,000 paired words drawn from Modi-script manuscripts, transliterated Marathi texts, and modern Marathi equivalents. Unlike sentence-level datasets, ours pairs text word by word, with meanings confirmed by Modi experts, so each pair can be reused outside the sentence it came from.

  • Scanning authentic Modi documents and historic Marathi papers.
  • Using custom JS tools to segment full Modi pages into individual words.
  • Pairing Modi script words with transliterated and modern Marathi forms.
  • Validating pairs with Modi experts and historical dictionaries.
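As a rough illustration of what one paired record could look like, here is a minimal Python sketch. The field names, the `WordPair` class, and the reviewer-count rule are all hypothetical and are not taken from our actual tooling; the only external fact used is that the Unicode Modi block spans U+11600 to U+1165F.

```python
from dataclasses import dataclass, field
from typing import List

# Unicode range of the Modi block (U+11600 to U+1165F).
MODI_BLOCK = range(0x11600, 0x11660)


def is_modi(text: str) -> bool:
    """Return True if text is non-empty and every character is in the Modi block."""
    return bool(text) and all(ord(ch) in MODI_BLOCK for ch in text)


@dataclass
class WordPair:
    """One word-level record (hypothetical schema, for illustration only)."""
    image_path: str       # cropped word image cut from a scanned page
    transliteration: str  # transliterated Marathi form
    modern_marathi: str   # modern Marathi equivalent
    modi_text: str = ""   # optional Unicode Modi encoding, if one exists
    reviewers: List[str] = field(default_factory=list)  # experts who approved the pair

    def is_validated(self, min_reviewers: int = 1) -> bool:
        """A pair counts as validated once enough experts have approved it."""
        return len(self.reviewers) >= min_reviewers


def validated_pairs(pairs: List[WordPair], min_reviewers: int = 1) -> List[WordPair]:
    """Keep only the pairs that have passed expert review."""
    return [p for p in pairs if p.is_validated(min_reviewers)]
```

Keeping the reviewer list on the record itself means unvalidated pairs can be stored alongside validated ones and filtered out only when a dataset release is assembled.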

Historic Dictionary Digitization & Vector Database

Summary:
We are converting historical Modi and Marathi dictionaries into a structured, machine-readable format and storing them in a vector database. This enables semantic search and helps our models retrieve accurate meanings, usages, and related entities.

  • Digitizing dictionary entries and references.
  • Converting entries into numerical vector representations.
  • Linking words to forts, places, historical figures, and administrative terms (farman, khalita, jahagir, vatan, pargana, etc.).
  • Using this database as the backbone for our proprietary retrieval-augmented generation (RAG) translation systems.
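The retrieval idea behind the steps above can be sketched in a few lines of Python. This is a toy under stated assumptions: `embed` uses character n-gram counts as a stand-in for a learned embedding model (so it matches on spelling, not true meaning), `DictionaryIndex` is a hypothetical in-memory substitute for a real vector database, and the English glosses are placeholders rather than entries from our dictionaries.

```python
import math
from collections import Counter


def embed(text: str, n: int = 3) -> Counter:
    """Toy embedding: counts of character n-grams (stand-in for a learned model)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class DictionaryIndex:
    """Hypothetical in-memory index; a real system would use a vector database."""

    def __init__(self):
        self.entries = []  # list of (headword, gloss, vector)

    def add(self, headword: str, gloss: str) -> None:
        self.entries.append((headword, gloss, embed(f"{headword} {gloss}")))

    def search(self, query: str, k: int = 3):
        """Return the k entries most similar to the query, best first."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[2]), reverse=True)
        return [(headword, gloss) for headword, gloss, _ in ranked[:k]]


index = DictionaryIndex()
index.add("farman", "a royal decree or order")            # placeholder gloss
index.add("pargana", "an administrative group of villages")  # placeholder gloss
best = index.search("royal order", k=1)
```

In a retrieval-augmented setup, the entries returned by `search` would be passed to the translation model as context, so that its output is tied to dictionary evidence instead of relying on the model's memory alone.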

Evaluation of Existing Modi AI Models

Summary:
We have independently evaluated publicly available models and datasets for Modi script, including those based on the MoDeTrans dataset. In our tests, models built on small, sentence-level datasets and unvalidated transliterations produced outputs that were largely inaccurate and unusable for serious research.

  • Existing datasets often contain only ~2,000 lines of sentence-level data without clear word pairings.
  • Output transliterations are frequently illegible or unrelated to the original text, which makes them especially hard for non-technical users to work with.
  • There is a strong need for larger, carefully curated, word-level datasets with modern Marathi pairings and dictionary support.
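One common way to put a number on transliteration accuracy is the character error rate (CER): the edit distance between a model's output and an expert reference, divided by the reference length. The sketch below is a generic illustration of that metric and is not a description of the exact procedure used in our evaluation; the function names are our own.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        curr = [i]
        for j, h in enumerate(hypothesis, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length (0.0 means a perfect match).

    Note: Python strings are compared code point by code point, so a Devanagari
    consonant and its vowel sign count as separate characters here.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

A CER near 0 means the output closely matches the reference, while values approaching or exceeding 1 correspond to the "unrelated to the original text" case described above.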

Maratha History LLM (Planned)

Summary:
Using our curated dictionaries, paired word datasets, and historical corpora, we plan to fine-tune a domain-specific language model specializing in Maratha history. Grounding the model in verified data is intended to reduce hallucinations and improve historical reliability.