How to annotate morphologically rich learner language. Principles, problems and solutions

Authors

  • Sisko Brunni, University of Bergen
  • Liisa-Maria Lehto
  • Jarmo H. Jantunen
  • Valtteri Airaksinen

DOI:

https://doi.org/10.15845/bells.v6i0.812

Keywords:

learner corpus, corpus annotation, error tagging

Abstract

This article illustrates the grammatical and error annotation of a morphologically rich learner language, using the International Corpus of Learner Finnish (ICLFI) as an example. It focuses on problems and solutions in morphological and error annotation, both of which are challenging because of the rich morphological structure of the target language. The article also introduces existing Finno-Ugric learner data and their annotation schemes, and compares them with those used in the ICLFI. Finally, it discusses learner data variables, the taxonomy, and the principles of grammatical and error annotation in the ICLFI.

Published

2015-05-30

How to Cite

Brunni, Sisko, Liisa-Maria Lehto, Jarmo H. Jantunen, and Valtteri Airaksinen. 2015. “How to Annotate Morphologically Rich Learner Language. Principles, Problems and Solutions”. Bergen Language and Linguistics Studies 6 (May). https://doi.org/10.15845/bells.v6i0.812.