ArseniiBuhaiev/ua-g2p
Library: A robust rule-based G2P (grapheme-to-phoneme) tool designed specifically for the Ukrainian language.
It renders Ukrainian orthographic text in three types of transcription and handles abbreviations, pausation, clitics, and stress ambiguity.
- Python 100.0%
1 Review
ua-g2p is a promising, tightly scoped Python library for Ukrainian grapheme-to-phoneme conversion. I like that it does not try to be a generic multilingual G2P wrapper; the rules, preprocessing, stress handling, clitic logic, and IPA/phonetic/phonematic outputs are clearly aimed at Ukrainian TTS and linguistic workflows. The public API is also pleasantly small: from ua_g2p import ProcessorG2P is easy to understand, and the README gives a useful first example with accentor="hybrid" and mode="ipa". The acknowledgements are well placed too, since the project builds on existing Ukrainian tokenization and stress tools instead of hiding those dependencies.
The main thing I would improve before expecting outside adoption is packaging and verification. setup.py installs num2words, ukrainian_word_stress, ua_text_stressifier, and tokenize_uk, but the preprocessor imports english_g2p, and requirements.txt includes extra dependencies such as spacy that are not reflected in the install metadata. A user installing with the README command may hit missing imports or model setup friction. The approbation folder is a good sign because it shows the author has started validating preprocessing behavior, and the included clitics report with 92.09% word accuracy is exactly the kind of evidence this project should surface. I would turn that into a normal test suite, add a CI workflow, and document how to reproduce the report.
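To make the packaging gap concrete, a short check like the one below could be run in a fresh virtual environment right after the README install command. It is only a sketch: the module names are copied from this review, and I am assuming each import name matches what the code actually imports. `find_missing` and `RUNTIME_MODULES` are hypothetical names, not part of ua-g2p.

```python
from importlib.util import find_spec

# Module names taken from the review text above (assumption: the import
# name may differ from the PyPI distribution name for some of these).
RUNTIME_MODULES = [
    "num2words",
    "ukrainian_word_stress",
    "ua_text_stressifier",
    "tokenize_uk",
    "english_g2p",
    "spacy",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in modules:
        try:
            spec = find_spec(name)
        except (ImportError, ValueError):
            # Some broken or partially installed packages raise instead of
            # returning None; treat them as missing as well.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(RUNTIME_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

Running this as a first CI step would catch any drift between setup.py, requirements.txt, and the actual imports before a user hits it.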
The rule configuration is substantial and domain-specific, which is the repository’s strongest asset. To make it easier to maintain, I would add comments or references for the trickier assimilation and stress rules, plus a few before/after examples for edge cases such as abbreviations, English input, punctuation pauses, enclitics, and proclitics. The Creative Commons BY-NC license may also discourage code reuse in commercial TTS/NLP projects, so it would be worth clarifying whether that restriction is intentional for both code and data. Overall, this is a useful early-stage Ukrainian NLP tool with a clear niche; tightening install reliability, tests, and documentation would make it much easier for researchers and speech developers to trust.
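The before/after examples suggested above could double as regression tests. The sketch below shows one possible shape, assuming the library can be wrapped in a plain function that maps a string to its transcription. I have not verified the method names on `ProcessorG2P`, so the harness takes any callable; `run_cases` is a hypothetical helper, and the demo strings are placeholders, not real Ukrainian transcriptions.

```python
from typing import Callable, Iterable, List, Tuple

# Each case is (label, input text, expected transcription).
Case = Tuple[str, str, str]


def run_cases(transcribe: Callable[[str], str], cases: Iterable[Case]) -> List[str]:
    """Return one readable failure message per case whose output differs."""
    failures = []
    for label, text, expected in cases:
        actual = transcribe(text)
        if actual != expected:
            failures.append(
                f"{label}: {text!r} -> {actual!r}, expected {expected!r}"
            )
    return failures


if __name__ == "__main__":
    # Stand-in transcriber so the sketch runs without ua-g2p installed.
    # In the real project this would wrap the library's own entry point.
    demo_transcribe = str.upper

    # Placeholder cases only; real ones would pair Ukrainian input with
    # the transcription the maintainers consider correct.
    demo_cases = [
        ("abbreviation", "abc", "ABC"),
        ("punctuation pause", "a, b", "A, B"),
    ]
    for message in run_cases(demo_transcribe, demo_cases):
        print(message)
```

Keeping the cases in one table, grouped by phenomenon (abbreviations, English input, pauses, enclitics, proclitics), would also serve as the documentation the rules currently lack.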
