Ground-up efforts to build large datasets for effective and accurate translation of Modi-Script documents into modern Marathi

By Arun Joshi, Indic-Scripts Research Forum

Why build a more relevant dataset for Modi machine translation

At the Indic-Scripts Research Forum, we have been building a large, reliable dataset for training AI models. There have been some efforts to produce Devanagari transliterations of Marathi written in Modi script, but the scope and success of using them to get good AI model output have generally been limited. In our experience, almost all of these transliterated Devanagari Marathi texts are incomprehensible. Unless we have a large dataset covering a high percentage of the Modi words used from 1400 to 1900 AD, the period during which Modi script was in use, we will not achieve a meaningful modern Marathi translation.

Incomplete attempt by IIT Roorkee to build the dataset

The IIT Roorkee approach to building the MoDeTrans dataset, which contains 2,043 images of Modi-script sentences along with their Devanagari transliterations, is questionable. First, datasets built from complete sentences, with transliterations in old Devanagari Marathi, are challenging to match. Second, 2,043 Modi-script sentences with their corresponding Devanagari transliterations do not make a large enough dataset for training AI models. A sample of this dataset, as published on Hugging Face, is copied below.

inline example

Furthermore, IIT Roorkee’s AI model, MoScNet, failed to produce an accurate translation in 100% of cases when our AI team tested it independently on Modi-script sentences and Modi experts from Bharat Itihas Sanshodhak Mandal reviewed the output. We ran each of several Modi-script sentences through the MoScNet model five times, and every run produced an unrelated, illogical, and inaccurate translation. We are writing a separate paper that outlines the testing process and its outcomes.

Dataset design concept used by Indic-Scripts Research Forum

We are constructing a large dataset of Modi-Marathi and modern Marathi from various sources. We separate individual words in Modi script, transliterated Devanagari, and modern Marathi out of sentences, and these words form the datasets. We have identified numerous authentic resources in each of these three formats to create an initial dataset of 50,000 words. Another goal is to pair transliterated Marathi words with modern Marathi words. Our efforts will draw on specialized history-related dictionaries (Aitihasik Shabdakosh), old Marathi dictionaries, historical papers with transliteration, unique books featuring page-to-page translations from Modi to Marathi, and some general Modi documents with transliteration. We are also taking help from Modi-script experts to convert Modi-script text into transliterated Marathi.
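One possible shape for a word-level record in such a dataset is sketched below in Python. The `WordEntry` class, its field names, and the sample values are assumptions made for illustration, not the Forum's actual schema or data. The only facts relied on are the Unicode ranges of the Modi block (U+11600 to U+1165F) and the Devanagari block (U+0900 to U+097F).

```python
from dataclasses import dataclass

MODI_RANGE = (0x11600, 0x1165F)       # Unicode Modi block
DEVANAGARI_RANGE = (0x0900, 0x097F)   # Unicode Devanagari block


def in_block(text: str, block: tuple[int, int]) -> bool:
    """Return True if text is non-empty and every character lies in the block."""
    lo, hi = block
    return bool(text) and all(lo <= ord(ch) <= hi for ch in text)


@dataclass(frozen=True)
class WordEntry:
    """Hypothetical record linking the three forms of one word."""
    modi: str             # word as written in Modi script
    transliterated: str   # Devanagari transliteration of the Modi form
    modern: str           # modern Marathi equivalent
    source: str           # where the pairing came from (dictionary, document)

    def __post_init__(self) -> None:
        if not in_block(self.modi, MODI_RANGE):
            raise ValueError("modi must contain only Modi-block characters")
        for name in ("transliterated", "modern"):
            if not in_block(getattr(self, name), DEVANAGARI_RANGE):
                raise ValueError(f"{name} must contain only Devanagari characters")


# Placeholder values: code points picked arbitrarily from each block,
# not real words from the dataset.
sample = WordEntry(
    modi=chr(0x1160E) + chr(0x11630),
    transliterated="\u0915\u093E",
    modern="\u0915\u093E",
    source="placeholder",
)
print(sample.source, len(sample.modi))
```

Validating the script of each field at construction time catches rows where a Modi form and a Devanagari form were swapped during data entry. A production check would also need to allow joiner characters such as U+200D, which this sketch rejects.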

The dictionaries and reference works that we plan to use to build the dataset include Aitihasik Shabdakosh by Y. N. Kelkar, Marathi Vishwakosh, Dictionary of Old Marathi, Bruhadkosh, Digital Dictionaries of South Asia, Illustrative Modi Documents, and Modi Documents from the Danish Collection in Tanjore. Some of these have word-for-word translations from Modi to Marathi. A fuller list of dictionaries under consideration is given in Annexure A.

How we plan on using dictionaries:

Using old and modern dictionaries to build datasets will provide us with generalized training data, which will improve accuracy and reduce ‘hallucinations’. This will be the most critical step in creating and curating high-quality word sets, specifically for Maratha history. These will serve as a structured knowledge base for our eventual Maratha history LLM.

Dictionaries will provide us with historical terms, places, events, and names of people relevant to Maratha history. This can include names of jagirs, vatans, parganas, or ruling families, forts, and names of the towns surrounding them.

We are converting dictionary entries into numerical representations and storing them in a vector database. With Modi-transliterated words paired to modern Marathi words, we expect that a search of the vector database will give the user the ability to query all alternate meanings of a word. Eventually, this process will help in building the Modi-Marathi LLM with historical references.
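The lookup described above can be illustrated with a small, self-contained Python sketch. It is a toy stand-in, not the Forum's pipeline: character bigram counts replace a trained embedding model, an in-memory list replaces a vector database, and the `TinyVectorIndex` class, its methods, and the Latin-letter sample entries are all hypothetical placeholders.

```python
import math
from collections import Counter


def embed(word: str) -> Counter:
    """Toy embedding: counts of character bigrams, with boundary markers."""
    padded = f"^{word}$"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorIndex:
    """In-memory stand-in for a vector database (hypothetical)."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, Counter, list[str]]] = []

    def add(self, transliterated: str, meanings: list[str]) -> None:
        """Store a transliterated word with its modern-Marathi meanings."""
        self._rows.append((transliterated, embed(transliterated), meanings))

    def query(self, word: str, top_k: int = 3) -> list[tuple[str, float, list[str]]]:
        """Return the top_k stored words most similar to `word`, with meanings."""
        q = embed(word)
        scored = [(w, cosine(q, vec), m) for w, vec, m in self._rows]
        scored.sort(key=lambda row: row[1], reverse=True)
        return scored[:top_k]


# Placeholder entries, not real dictionary data.
index = TinyVectorIndex()
index.add("alpha", ["meaning A1", "meaning A2"])
index.add("alpine", ["meaning B1"])
index.add("beta", ["meaning C1"])

# A misspelled or variant form still lands nearest to its closest stored word.
for stored, score, meanings in index.query("alpa", top_k=2):
    print(stored, round(score, 3), meanings)
```

Because spelling variants of a word share most of their bigrams, they rank close to each other, and every meaning stored with the nearest entries is returned together. A real system would likely swap `embed` for a trained embedding model and the list scan for an approximate nearest-neighbour index.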

Because we plan to build a custom-designed Maratha history LLM, the datasets built from dictionaries will be a critically important resource for training our AI model.

Historic Marathi dictionaries contain valuable word definitions, usages, and idiomatic phrases from different time periods, which would help train LLMs to understand historical context, old vocabulary, and language evolution.

The dataset, built using dictionaries, Modi papers with transliteration, and ancient Marathi books, will serve as our structured word-level data. We intend to use it to verify and streamline all the unstructured historical papers we input into the database. Such unstructured data will arrive in all three varieties, namely Modi-script words, Modi Marathi transliterated words, and modern Marathi words.
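As a rough illustration of how a structured word list could be used to check incoming unstructured text, the Python sketch below sorts tokens by script and flags those missing from the word list. The function names, the whitespace tokenization, and the sample tokens are assumptions for illustration only. Script detection can separate Modi-script tokens from Devanagari ones, but transliterated and modern Marathi both use Devanagari, so telling those two apart needs the word list itself.

```python
import unicodedata


def script_of(token: str) -> str:
    """Classify a token by the Unicode block of its characters."""
    if token and all(0x11600 <= ord(ch) <= 0x1165F for ch in token):
        return "modi"
    if token and all(0x0900 <= ord(ch) <= 0x097F for ch in token):
        return "devanagari"
    return "other"


def check_tokens(text: str, known_words: set[str]) -> dict[str, list[str]]:
    """Split text on whitespace and group tokens into known, unknown, and other."""
    report: dict[str, list[str]] = {"known": [], "unknown": [], "other": []}
    for raw in text.split():
        # Normalize so that visually identical strings compare equal.
        token = unicodedata.normalize("NFC", raw)
        if script_of(token) == "other":
            report["other"].append(token)
        elif token in known_words:
            report["known"].append(token)
        else:
            report["unknown"].append(token)
    return report


# Placeholder word list and input; not taken from the actual dataset.
known = {"गड", chr(0x1160E)}
report = check_tokens("गड कोट " + chr(0x1160E) + " fort", known)
print(report)
```

Tokens that land in the `unknown` group are candidates for expert review and, once verified, for addition to the structured dataset.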

Summarizing

As explained above, our strategy is two-fold. We are building a structured dataset of Modi-Marathi words paired with modern Marathi words, and we are also pairing Modi-script words with transliterated Marathi words. We expect to grow this structured dataset for training AI models. Eventually, this should help us machine-translate Modi-script words together with their meaning, historical references, and relationships to people and events from the historical eras.

Annexure A

List of Dictionaries in consideration:

  • Maharashtra Language Dictionary – 1821, created by order of the government; prepared by six scholars: Jagannathshastri Kramwant, Balashastri Ghagave, Gangadharshastri Phadke, Sakharamstri Joshi, Dajishastri Shukla, and Parshurambhat Godbole.
  • Candy-Molesworth’s English-Marathi Dictionary, 1831
  • Raghunath Narayan Adhvari alias Panditraj, 1860, State Administration Dictionary
  • Dictionary of the Marathi Language - Raghunath Bhaskar Godbole - Published in 1870
  • Shuddha Marathi Kosh, V. R. Bapat and B. V. Pandit, 1891, Jagdhitecchu, Pune
  • Marathi ShabdaRatnakar, Vasudev Govind Apte, 1922; Extended ShabdaRatnakar - Expansion: H. A. Bhave, June 1995 (The original dictionary had 36,716 words. In the extended version, 60,559 Marathi words are explained in Marathi.)
  • Date-Karve’s Maharashtra Dictionary, 1932-1938, volumes 1 to 7, supplementary volume, Pune, Maharashtra Koshmandal
  • Marathi Root Dictionary, Vi. Ka. Rajwade, 1938, Rajwade Research Board, Dhule
  • Marathi Etymological Dictionary, Kri. Pan. Kulkarni, 1942
  • Madhav Trimbak Patwardhan - Persian Marathi Dictionary
  • Historic Shabdakosh – Y. N. Kelkar, published in 1962
  • Vakya Kosh (3 Volumes) - Vaman Keshav Lele - Rajhans Publications
  • Sanskrit-English Dictionary - Monier-Williams
  • The Practical Sanskrit-English Dictionary (Sanskrit-English Dictionary 3 volumes) - Chief Editor Vaman Shivram Apte
  • Girvan Laghukosh (Sanskrit-Marathi Dictionary) - Janardan Vinayak Oak
  • Extended ShabdaRatnakar (Chief Editor - Vaman Gopal Apte)
  • Sanjay Bhagat’s www.marathibhasha.org online (computational) dictionary, consisting of 267,000 terms from 35 terminology dictionaries prepared by the Directorate of Languages of the Government of Maharashtra
  • Jnaneshwari – Dictionary of Words from Jnaneshwari – Welingkar
  • Dasbodh – Dictionary of Words from Dasbodh – Mahadevshastri Joshi
  • Gatha – Dictionary of Words from Tukaram’s Gatha – Mahadevshastri Joshi
  • Shri Samarth Ramdas Vangmaya Word Reference Dictionary (Mu. Shri. Kanade)
  • Modern Biography Dictionary (Siddheshwarshastri Chitrav)
  • Ancient Biography Dictionary (Siddheshwarshastri Chitrav)
  • Indian Culture Dictionary (Mahadevshastri Joshi)
  • Ancient Indian Historical Dictionary – Author: Raghunath Bhaskar Godbole, Edition – 1876. Features – While compiling, utmost caution was taken to avoid inclusion of Arabic and Persian words. The dictionary was prepared after seven years of hard work.
  • Ancient Indian Place Dictionary, First Volume – Author – Siddheshwarshastri Chitrav, Publisher – Indian Biography Dictionary Board, Edition – 1969. Features – Manuscript was lost in the Panshet floods but the author rewrote it. (Awarded by the Government of India, by Home Minister Yashwantrao Chavan)
  • Medieval Biography Dictionary (Siddheshwarshastri Chitrav)
  • Marathi Literary Dictionary, Parts 1 to 5
  • Dictionary of Proverbs (Vishwanath Dinkar Naravane)
  • Marathi Encyclopedia (Founding Editor – Laxmanrao Joshi)