Datasets & Methodology

Data Sources

  • Historic Modi-script manuscripts and letters
  • Modi → Devanagari transliterated works from “Dafters”
  • Specialized history dictionaries and encyclopedias:
    • Aitihasik Shabdakosh (Y. N. Kelkar, 1962)
    • Marathi Vishwakosh
    • Dictionary of Old Marathi
    • Other Marathi–Marathi, English–Marathi, and terminology dictionaries
  • 4,000+ Marathi book titles related to Maratha history (planned corpus)

Why Word-Level Pairing

In our experience, sentence-level transliterations of old Devanagari Marathi are often hard to understand and cannot be reliably reused outside the sentence they came from. Word-level pairs, by contrast:

  • Enable flexible recombination across multiple texts
  • Make dictionary verification and correction easier
  • Provide stronger supervision for AI models, improving accuracy and generalization
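
As a rough illustration of the recombination point, the sketch below indexes a few word-level pairs once and reuses them across two different token sequences. The `WordPair` fields, the `MODI_00x` placeholder IDs (standing in for real Modi-script strings), and the sample glosses are all assumptions made for the example, not the project's actual schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordPair:
    """One word-level entry (hypothetical schema, for illustration only)."""
    modi: str          # Modi-script form; placeholder IDs are used here
    devanagari: str    # transliterated Devanagari form
    modern: str        # modern Marathi equivalent
    meaning: str = ""  # dictionary-based gloss
    tags: tuple = ()   # historical tags, e.g. ("place",)


def build_index(pairs):
    """Map each Modi form to its entry so any text can reuse it."""
    return {p.modi: p for p in pairs}


def transliterate(tokens, index, unknown="?"):
    """Look up each token independently; unseen words are marked, not guessed."""
    return [index[t].devanagari if t in index else unknown for t in tokens]


# Illustrative entries only.
pairs = [
    WordPair("MODI_001", "किल्ला", "किल्ला", "fort", ("place",)),
    WordPair("MODI_002", "स्वारी", "मोहीम", "military expedition", ("event",)),
]
index = build_index(pairs)

# Two different "texts" share the same entries.
text_a = ["MODI_001", "MODI_002"]
text_b = ["MODI_002", "MODI_003"]  # MODI_003 is not in the index

print(transliterate(text_a, index))
print(transliterate(text_b, index))
```

Because each entry stands alone, correcting one `WordPair` against a dictionary fixes every text that uses it, which is harder to guarantee with sentence-level pairs.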

High-Level Technical Approach

  • Digitization – Scan Modi manuscripts and printed transliterated Marathi works.
  • Segmentation – Use custom JS tools to split Modi and Devanagari text into word-level units.
  • Pairing & Annotation – Pair Modi-script words with their transliterated and modern Marathi equivalents, then add dictionary-based meanings and historical tags (people, places, events).
  • Vectorization – Convert dictionary and word entries into vectors and store them in a vector database.
  • Model Training & RAG – Use the datasets for:
    • Transliteration and translation models
    • Retrieval-augmented generation for historical Q&A
    • Future fine-tuning of a Maratha history LLM
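
The vectorization and retrieval steps above could be prototyped along the following lines. This is a minimal sketch under stated assumptions: the source does not name an embedding model or a vector database, so character-bigram counts stand in for learned embeddings and an in-memory list stands in for the database. `ToyVectorStore`, `char_ngrams`, and the sample entries are hypothetical names and data chosen for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Toy 'embedding': counts of character n-grams (stand-in for a real model)."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database (assumption, not a real product)."""

    def __init__(self):
        self._rows = []  # list of (vector, key, payload)

    def add(self, key, payload):
        self._rows.append((char_ngrams(key), key, payload))

    def search(self, query, k=1):
        """Return the k entries most similar to the query, best first."""
        q = char_ngrams(query)
        ranked = sorted(self._rows, key=lambda r: cosine(q, r[0]), reverse=True)
        return [(key, payload) for _, key, payload in ranked[:k]]


def build_context(hits):
    """Format retrieved entries as context lines for a downstream Q&A prompt."""
    return "\n".join(f"{key}: {payload['meaning']}" for key, payload in hits)


store = ToyVectorStore()
store.add("किल्ला", {"meaning": "fort"})                  # illustrative entry
store.add("स्वारी", {"meaning": "military expedition"})   # illustrative entry

# An inflected query form still retrieves the closest headword.
hits = store.search("किल्ले", k=1)
print(build_context(hits))
```

In a retrieval-augmented setup, the text returned by `build_context` would be prepended to the user's question before it is sent to the language model; swapping the toy pieces for a real embedding model and database should not change that overall shape.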