Finite-State Tokenization for a Deep Wolof LFG Grammar

Authors

  • Cheikh M. Bamba Dione, University of Bergen

DOI:

https://doi.org/10.15845/bells.v8i1.1340

Abstract

This paper presents a finite-state transducer (FST) for tokenizing and normalizing natural texts that are input to a large-scale LFG grammar for Wolof. In the early stage of grammar development, a language-independent tokenizer was used to split the input stream into a unique sequence of tokens. This simple transducer took into account general character classes, without using any language-specific information. However, at a later stage of grammar development, previously uncovered and non-trivial tokenization issues arose, including issues related to multi-word expressions (MWEs), clitics and text normalization. As a consequence, the tokenizer was extended by integrating FST components. This extension was crucial for scaling the hand-written grammar to free text and for enhancing the performance of the parser.
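The contrast between the two stages of tokenization described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual FST: the baseline splits purely on general character classes, while the extension applies a small longest-match lexicon pass before splitting, mimicking what an FST component would do for MWEs. The example MWE ("New York") is a stand-in; the real system would use a Wolof-specific lexicon.

```python
import re

def baseline_tokenize(text):
    # Language-independent split on general character classes:
    # runs of word characters, or single non-space punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

# Hypothetical MWE lexicon (stand-in entries, not from the paper).
# Each MWE is rewritten to a single token before the class-based split,
# approximating a longest-match FST component.
MWE_LEXICON = {"New York": "New_York"}

def extended_tokenize(text):
    for mwe, joined in MWE_LEXICON.items():
        text = text.replace(mwe, joined)
    return baseline_tokenize(text)
```

For example, `baseline_tokenize("New York, hello!")` splits the MWE into two tokens, whereas `extended_tokenize` keeps it as one unit, which is the behavior a grammar needs when a multi-word expression fills a single lexical slot.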

Published

2017-11-23

How to Cite

Dione, Cheikh M. Bamba. 2017. "Finite-State Tokenization for a Deep Wolof LFG Grammar". Bergen Language and Linguistics Studies 8 (1). https://doi.org/10.15845/bells.v8i1.1340.