Diacritic Restoration for Yoruba Text with under dot and Diacritic Mark Based on LSTM
Abstract
Yoruba is a tonal language spoken by over 40 million people, primarily in Nigeria, in some other West African countries, and elsewhere in the world. Many Yoruba texts written online lack tone marks, which makes them ambiguous and confusing for readers and difficult for Natural Language Processing systems to handle. This paper presents a method that combines a syllable-based approach with long short-term memory (LSTM) for diacritic restoration of standard Yoruba text. LSTM mitigates the vanishing-gradient problem inherent in recurrent neural networks (RNNs); the aim is to recover the lost diacritics in Yoruba text, for both characters that carry tone marks and characters that carry an underdot, and to return the text with the proper diacritics. Data were acquired from Yoglobalvoice, BBC Yoruba news, and Yoruba words collected from literate indigenous writers: 27,050 Yoglobalvoice entries, 2,000 Yoruba words extracted from BBC Yoruba news, and 1,470 Yoruba words collected from a Yoruba language teacher. In addition, a syllabification module was developed to group each tokenized word into syllables. The output of the syllabification algorithm was fed into the LSTM module; the model was trained on 70% of the dataset and validated on the remaining 30%. The model achieved 96% accuracy. The results indicate that using LSTM improves the restoration of both characters with an underdot and characters that carry tone marks.
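The data preparation implied by the abstract can be illustrated with a minimal Python sketch. It rests on an assumption not stated in the paper: that training pairs are built by stripping tone marks and underdots from correctly marked words, so the model learns to map the bare form back to the marked form. The function names (`strip_diacritics`, `make_pairs`, `split_pairs`) and the sample words are hypothetical, and the syllabification and LSTM stages are not shown.

```python
import random
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop tone marks and underdots by removing Unicode combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [ch for ch in decomposed if unicodedata.category(ch) != "Mn"]
    return unicodedata.normalize("NFC", "".join(kept))


def make_pairs(words):
    """Pair each undiacritized input with its fully marked target (assumed setup)."""
    return [(strip_diacritics(w), unicodedata.normalize("NFC", w)) for w in words]


def split_pairs(pairs, train_fraction=0.7, seed=0):
    """Shuffle, then split into training and validation sets (70/30 by default)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Illustrative sample only; these words are not taken from the paper's dataset.
words = ["ọjọ́", "ẹ̀kọ́", "ilé", "ọmọ", "òwe", "àgbà", "ẹsẹ̀", "ìwé", "ọ̀rẹ́", "ayọ̀"]
pairs = make_pairs(words)
train, valid = split_pairs(pairs)
```

Decomposing to NFD before filtering matters because Yoruba letters such as ọ́ may be stored either as precomposed code points or as a base letter followed by combining marks; normalizing first makes both encodings strip to the same bare form.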
Copyright (c) 2023 The Author(s)
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
The authors hereby represent and warrant that the paper is original and that they are the authors of the paper, except for material that is clearly identified as to its original source, with permission notices from the copyright owners where required. If any copyright violation comes to light in the future, the authors, and not FUOYEJET, will be responsible.
The authors declare that:
- This paper has not been published in the same form elsewhere.
- It will not be submitted anywhere else for publication prior to acceptance/rejection by this Journal.
- A copyright permission is obtained for materials published elsewhere and which require this permission for reproduction.
Furthermore, the copyright after publication belongs to the Author(s) (for articles published in 2020 and beyond), and the article is licensed under the Creative Commons license CC-BY-NC (http://creativecommons.org/licenses/by-nc/4.0). The copyright covers the right to reproduce and distribute the article, including reprints, translations, photographic reproductions, microform, electronic form (offline, online) or any other reproductions of similar nature.