Finite-State Tokenization for a Deep Wolof LFG Grammar
DOI: https://doi.org/10.15845/bells.v8i1.1340

Abstract
This paper presents a finite-state transducer (FST) for tokenizing and normalizing natural texts that are input to a large-scale LFG grammar for Wolof. In the early stage of grammar development, a language-independent tokenizer was used to split the input stream into a unique sequence of tokens. This simple transducer took into account general character classes, without using any language-specific information. However, at a later stage of grammar development, previously uncovered and non-trivial tokenization issues arose, including issues related to multi-word expressions (MWEs), clitics and text normalization. As a consequence, the tokenizer was extended by integrating FST components. This extension was crucial for scaling the hand-written grammar to free text and for enhancing the performance of the parser.
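The two-stage design described in the abstract (a language-independent split on character classes, later extended with language-specific components for MWEs) can be sketched roughly as follows. This is a minimal illustration, not the paper's actual FST: the regex split stands in for the character-class transducer, and the MWE list (here the Wolof locative phrase "ci biir") is an invented example of the kind of language-specific resource the extension adds.

```python
import re

# Illustrative MWE inventory; a real grammar would draw on a curated lexicon.
MWES = {("ci", "biir")}  # example Wolof MWE ("inside"); assumption for this sketch

def split_tokens(text):
    # Stage 1: language-independent split on general character classes
    # (word characters vs. punctuation), as in the early tokenizer.
    return re.findall(r"\w+|[^\w\s]", text)

def merge_mwes(tokens):
    # Stage 2: language-specific pass that re-joins adjacent tokens
    # forming a listed multi-word expression into a single token.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i].lower(), tokens[i + 1].lower()) in MWES:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_mwes(split_tokens("Mu ngi ci biir kër gi.")))
# → ['Mu', 'ngi', 'ci_biir', 'kër', 'gi', '.']
```

In practice such stages are written and composed as finite-state transducers rather than Python passes; the sketch only mirrors the pipeline's logic.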
License
Copyright (c) 2017 Bamba Dione
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.