Why build a more relevant dataset for Modi machine translation
At Indic-scripts Research Forum, we have been working on building a reliable, large dataset for successful AI Model training. There have been some efforts to create Marathi transliteration from Modi-scripted Marathi. Still, the scope and success of using it to find a good AI model output have generally been limited. Our experience is that almost all of these transliterated Devanagari Marathi texts are incomprehensible. Unless we have a large dataset of a high percentage of Modi words from 1400 to 1900 AD, during which Modi-script was in use, we won’t achieve that meaningful modern Marathi translation.
Incomplete attempt by IIT Roorkee to build the dataset
The IIT Roorkee approach to building the MoDeTrans dataset, containing 2,043 images of Modi-script sentences along with their Devanagari transliterations, is questionable. First, datasets using complete sentences, with transliterations in old Devanagari Marathi, are challenging to match. Second, the size of 2043 Modi-scripted sentences with their corresponding Devanagari transliteration is not a large enough dataset for training AI models. The sample of this dataset, as published on Hugging Face, is copied below.
Furthermore, IIT Roorkee’s AI model, MoScNet, when tested independently by our AI team to translate Modi-scripted sentences and reviewed by Modi experts from Bharat Itihas Sanshodhak Mandal, showed 100% failure to create accurate translation. We ran multiple Modi-script sentences five times for each sentence using the MoScNet AI Model, and every transliteration produced an unrelated, illogical, and inaccurate translation. We are writing a separate paper that outlines the testing process and its outcomes.
Dataset design concept used by Indic-scripts Research Forum
We are constructing a large dataset of Modi-Marathi and Modern Marathi using various sources. We are separating words from Modi-script, transliterated Devnagari, and Modern Marathi from sentences, which will form the datasets. We have identified numerous authentic resources in each of the above formats to create an initial dataset of 50,000 words. Another goal is to pair transliterated Marathi words with modern Marathi words. Our efforts will utilize specialized history-related dictionaries (Aitihasik Shabdakosh), old Marathi dictionaries, historical papers with transliteration, unique books featuring page-to-page translations from Modi to Marathi, and some general Modi documents with transliteration. We are also taking help from Modi-script experts to translate from the Modi script to transliterated Marathi.
We have listed below the dictionaries that we plan to use to build the dataset, including Aitihasik Shabdakosh by Y. N. Kelkar, Marathi Vishwakosh, Dictionary of Old Marathi, Bruhadkosh, Digital Dictionaries of South Asia, Illustrative Modi Documents, and Modi Documents from the Danish Collection in Tanjore. Some of these have word-for-word translations from Modi to Marathi.
How we plan on using dictionaries:
Using old and modern dictionaries to build datasets will provide us with generalized training data, which will improve accuracy and reduce ‘hallucinations. This would be the most critical step to create and curate high-quality word sets, specifically for Maratha history. These will serve as a structured knowledge base for our eventual Maratha history LLM.
Dictionaries will provide us with historical terms, places, events, and names of people relevant to Maratha history. This can include names of jagirs, vatans, parganas, or ruling families, forts, and names of the towns surrounding them.
We are converting dictionary entries into numerical representations and storing them in a vector database. It is expected that pairing of Modi-transliterated words and modern Marathi words would search the vector database giving user an ability query all alternate meanings. Eventually, this process will be helpful in building the Modi-Marathi LLM with historical references.
As we plan on building a custom-designed Maratha history LLM, we will use the data sets built using dictionaries to train our AI model as a critically important training resource.
Historic Marathi dictionaries contain valuable word definitions, usages, and idiomatic phrases from different time periods, which would help train LLMs to understand historical context, old vocabulary, and language evolution.
The dataset, built using dictionaries, Modi papers with transliteration, and ancient Marathi books, will serve as our structured word-level data. We intend to use it to verify and streamline all the unstructured historical papers we input into the database. Such unstructured data would be coming in from all three varieties, namely Modi-script words, Modi Marathi transliterated words, and modern Marathi words.
Summarizing
As explained before, our strategy has been two-fold. We are making a structured dataset of Modi-Marathi words paired to modern Marathi, and we are also pairing Modi-script words to transliterated Marathi words. We expect to grow this structured dataset for training AI Models. This would help us eventually to machine translate Modi script words with meaning, historical reference, relationships with people and events from the historical eras./p>