Data Sources
- Historic Modi-script manuscripts and letters
- Modi → Devanagari transliterated works from “Dafters”
- Specialized history dictionaries and encyclopedias:
  - Aitihasik Shabdakosh (Y. N. Kelkar, 1962)
  - Marathi Vishwakosh
  - Dictionary of Old Marathi
  - Other Marathi–Marathi, English–Marathi, and terminology dictionaries
- More than 4,000 Marathi book titles related to Maratha history (approximate count; planned corpus)
Why Word-Level Pairing
In our experience, sentence-level transliterations of old Devanagari Marathi
are often hard to understand and cannot be reliably reused outside the sentence they came from.
Word-level pairs, by contrast:
- Enable flexible recombination across multiple texts
- Make dictionary verification and correction easier
- Provide stronger supervision for AI models, improving accuracy and generalization
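To illustrate why word-level units recombine more easily than sentences, a word pair can be stored as a small record and looked up independently of the sentence it came from. The Python sketch below is an illustration only: the `WordPair` fields, the `build_index` helper, and the placeholder values are assumptions, not the project's actual schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WordPair:
    modi: str         # word as written in Modi script
    devanagari: str   # transliterated Devanagari form
    modern: str       # modern Marathi equivalent
    source: str       # hypothetical identifier of the source document


def build_index(pairs):
    """Group pairs by Modi form so one word can be reused across texts."""
    index = defaultdict(list)
    for pair in pairs:
        index[pair.modi].append(pair)
    return index


# Placeholder values only: real entries would hold Modi and Devanagari text.
pairs = [
    WordPair("modi-word-1", "deva-word-1", "modern-word-1", "letter-A"),
    WordPair("modi-word-1", "deva-word-1", "modern-word-1", "letter-B"),
    WordPair("modi-word-2", "deva-word-2", "modern-word-2", "letter-A"),
]
index = build_index(pairs)

# Every document in which the first word was attested.
sources = sorted(p.source for p in index["modi-word-1"])
```

Because each record carries its own source identifier, the same word can be verified against a dictionary once and then reused wherever it appears.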
High-Level Technical Approach
- Digitization – Scan Modi manuscripts and printed transliterated Marathi works.
- Segmentation – Use custom JS tools to split Modi and Devanagari text into word-level units.
- Pairing & Annotation – Pair Modi-script words with their transliterated and modern Marathi equivalents,
  then add dictionary-based meanings and historical tags (people, places, events).
- Vectorization – Convert dictionary and word entries into vectors and store in a vector database.
- Model Training & RAG – Use the datasets for:
  - Transliteration and translation models
  - Retrieval-augmented generation for historical Q&A
  - Future fine-tuning of a Maratha history LLM
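The segmentation and pairing steps could be sketched roughly as follows. This is a Python illustration, not the project's custom JS tools. The names `segment` and `pair_by_position` are hypothetical, and the sketch assumes that words are separated by whitespace or punctuation and that a source line and its transliteration contain the same number of words. Neither assumption is guaranteed for manuscript text, so mismatched lines are flagged rather than forced into pairs.

```python
import re

# Separators: whitespace, the Devanagari danda (U+0964) and double danda
# (U+0965), and common ASCII punctuation. Assumption: words are delimited
# by these characters, which may not hold for handwritten manuscripts.
_SEPARATORS = re.compile(r"[\s\u0964\u0965.,;:!?]+")


def segment(text):
    """Split a line of text into word-level units, dropping empty pieces."""
    return [word for word in _SEPARATORS.split(text) if word]


def pair_by_position(source_words, target_words):
    """Naively pair the i-th source word with the i-th target word.

    Returns None when the word counts differ, so that the line can be
    sent to manual review instead of being mis-aligned.
    """
    if len(source_words) != len(target_words):
        return None
    return list(zip(source_words, target_words))


words = segment("राजा गडावर गेला।")
# words == ['राजा', 'गडावर', 'गेला']
```

Positional pairing is the simplest possible alignment. A production tool would likely need manual correction or a more robust aligner for lines where the transliteration merges or splits words.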