Diacritic Restoration for Yoruba Text with Under Dot and Diacritic Mark Based on LSTM

  • Kingsley Ogheneruemu, Kwara State University
  • Jumoke F Ajao
  • Abdulrafiu Isiaka
  • Franklin O. Asahiah
  • Olumide K. Orimogunje


Abstract Yoruba is a tonal language spoken by over 40 million people, primarily in Nigeria, in some other West African countries, and in other parts of the world. Many Yoruba texts written online lack tone marks, which makes them ambiguous and confusing for readers and difficult for Natural Language Processing (NLP) systems to handle. This paper presents a method that combines a syllable-based approach with long short-term memory (LSTM) for diacritic restoration of standard Yoruba text. Because LSTM mitigates the vanishing-gradient problem inherent in recurrent neural networks (RNNs), the aim is to recover the diacritics lost from Yoruba text, both for characters that carry tone marks and for characters that carry an underdot, and to return the text with the proper diacritics. Data were acquired from Yoglobalvoice, BBC Yoruba news, and Yoruba words collected from literate indigenous writers: 27,050 Yoglobalvoice entries, 2,000 Yoruba words extracted from BBC Yoruba news, and 1,470 Yoruba words collected from a Yoruba language teacher. In addition, a syllabification module was developed to group each tokenized word into syllables. The output of the syllabification algorithm was fed into the LSTM module for training; the LSTM model was trained on 70% of the dataset and validated on the remaining 30%. The model achieved 96% accuracy. The results show that using LSTM for diacritic restoration gave improved restoration of both characters with an underdot and characters with tone marks.
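To make the pipeline in the abstract concrete, the following is a minimal sketch of how syllable-level training pairs could be prepared. It is not the authors' implementation: the abstract does not give the syllabification rules, so the sketch assumes a simplified model of Yoruba syllables (V, CV, a nasal vowel written as vowel plus "n", and a syllabic nasal "n" or "m"). The function names `strip_diacritics`, `syllabify`, and `make_pairs` are hypothetical.

```python
import unicodedata

VOWELS = set("aeiou")  # base vowels; underdots and tones are combining marks


def strip_diacritics(text):
    """Drop every combining mark (tone marks and underdots alike)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def _units(word):
    """Split a word into base letters, each keeping its combining marks."""
    units = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and units:
            units[-1] += ch
        else:
            units.append(ch)
    return units


def syllabify(word):
    """Assumed, simplified rules; expects a single alphabetic word."""
    units = _units(word)
    syllables, onset, i = [], "", 0
    while i < len(units):
        unit = units[i]
        base = unit[0].lower()
        nxt = units[i + 1] if i + 1 < len(units) else ""
        after = units[i + 2] if i + 2 < len(units) else ""
        next_is_vowel = bool(nxt) and nxt[0].lower() in VOWELS
        if base in VOWELS:
            syllable = onset + unit
            onset = ""
            # Nasal vowel: an unmarked "n" that does not open the next syllable.
            if nxt.lower() == "n" and not (after and after[0].lower() in VOWELS):
                syllable += nxt
                i += 1
            syllables.append(syllable)
        elif base in "nm" and (len(unit) > 1 or not next_is_vowel):
            # Syllabic nasal: carries a mark, or is not followed by a vowel.
            syllables.append(onset + unit)
            onset = ""
        else:
            onset += unit  # consonant onset, including the digraph "gb"
        i += 1
    if onset:  # trailing consonants: attach to the last syllable if any
        if syllables:
            syllables[-1] += onset
        else:
            syllables.append(onset)
    return [unicodedata.normalize("NFC", s) for s in syllables]


def make_pairs(words):
    """(undiacritized syllable, diacritized syllable) pairs for training."""
    return [(strip_diacritics(s), s) for w in words for s in syllabify(w)]


if __name__ == "__main__":
    for word in ["ọmọdé", "gbogbo", "ìbàdàn"]:
        print(word, syllabify(word), make_pairs([word]))
```

Under these assumed rules, "ọmọdé" is split into ọ, mọ, and dé, and each syllable is paired with its stripped form (o, mo, de). Pairs of this kind could then be split 70/30 and used as input and target sequences for an LSTM. Working in NFD form keeps the underdot and the tone mark as separate combining characters, so both kinds of diacritic are removed and restored by the same mechanism.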