Finite-State Tokenization for a Deep Wolof LFG Grammar
DOI: https://doi.org/10.15845/bells.v8i1.1340

Abstract
This paper presents a finite-state transducer (FST) for tokenizing and normalizing natural texts that are input to a large-scale LFG grammar for Wolof. In the early stage of grammar development, a language-independent tokenizer was used to split the input stream into a unique sequence of tokens. This simple transducer took into account general character classes, without using any language-specific information. However, at a later stage of grammar development, non-trivial tokenization issues arose that this approach did not cover, including issues related to multi-word expressions (MWEs), clitics and text normalization. As a consequence, the tokenizer was extended by integrating FST components. This extension was crucial for scaling the hand-written grammar to free text and for enhancing the performance of the parser.
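The paper's tokenizer is built with finite-state tooling; the contrast it describes between a language-independent character-class splitter and a language-aware extension for MWEs can be illustrated with a minimal sketch. The sketch below is an assumption-laden approximation, not the paper's implementation: the regular expression, the `MWE_LEXICON` contents, and the example Wolof phrase are all hypothetical, and greedy longest-match lookup stands in for the FST composition the paper actually uses.

```python
import re

def naive_tokenize(text):
    """Language-independent stage: split purely on general character
    classes (word characters vs. punctuation), as in the early grammar."""
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

# Hypothetical MWE lexicon; entries are illustrative, not from the paper.
MWE_LEXICON = {("ci", "kaw")}  # e.g. a two-word expression kept as one token

def mwe_tokenize(text, lexicon=MWE_LEXICON):
    """Language-aware stage: re-join lexicon-listed multi-word
    expressions into single tokens, longest match first."""
    tokens = naive_tokenize(text)
    out, i = [], 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, down to two tokens.
        for end in range(len(tokens), i + 1, -1):
            if tuple(tokens[i:end]) in lexicon:
                out.append(" ".join(tokens[i:end]))
                i = end
                matched = True
                break
        if not matched:
            out.append(tokens[i])
            i += 1
    return out
```

For example, `mwe_tokenize("mu ngi ci kaw")` would yield `["mu", "ngi", "ci kaw"]` under the hypothetical lexicon above, whereas the naive stage alone splits every word. A real FST-based solution composes such lexical knowledge directly into the transducer rather than post-processing a token list.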
License
Copyright (c) 2017 Bamba Dione
![Creative Commons License](http://i.creativecommons.org/l/by-nc/4.0/88x31.png)
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.