5 Creating an Annotation Tool-Kit

Learning Objectives:

Identify previous sources on lexeme- and morpheme-level glossing
Create an annotation tool kit for the language
Learn to build an Excel spreadsheet for Sheet Swiper
Learn about hierarchical annotation

5.1 Previous research

Interlinear Glossed Texts (IGT) are used to turn transcribed and translated recordings into data that are usable by communities, linguists, and computer scientists. The effort required to create IGT is inversely related to the amount of resources available for the language under analysis. This means that the more grammatical descriptions, dictionary materials, and linguistic analyses you consult in your preparation, the faster and less labor-intensive the process of preparing IGT will be.

An IGT annotation is typically a bundle of annotation lines:

Line 1. tx/ Transcription (sometimes called the baseline)

Line 2. mr/ Morphological representation (the word broken down into its morphological parts, i.e., affixes and roots)

Line 3. mg/ Morpheme gloss (the meaning of each morpheme)

Line 4. wg/ Word gloss

Line 5. ft/ Free translation

Line 6. Note/

Such a bundle is illustrated in the following Manipuri example (Chelliah 1997, p. 225):

tx/  m�n?          ap?l   ?�l?g�

mr/  m� -n?        ap?l   ?� -l?k -�

mg/  he =IS:FOC    apple  eat -AM:VEN -SMOD:DECL

wg/  he            apple  eating

ft/ ‘When he came here he was eating an apple.’
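For readers who keep their annotations in scripts, the bundle of tiers listed above can be sketched as a small data structure. This is only an illustration: the class name `IGTBundle`, the helper `misaligned_words`, and the English sample data (including the label `PST`) are assumptions made for this sketch, not part of FLEx or of the Manipuri example.

```python
from dataclasses import dataclass


@dataclass
class IGTBundle:
    """One annotated unit, with one field per tier (hypothetical layout)."""
    tx: str          # transcription (baseline)
    mr: list[str]    # morphological representation, one string per word
    mg: list[str]    # morpheme glosses, one string per word
    wg: list[str]    # word glosses, one per word
    ft: str          # free translation
    note: str = ""   # optional note

    def misaligned_words(self) -> list[int]:
        """Return indexes of words whose morpheme count differs between mr and mg."""
        bad = []
        for i, (form, gloss) in enumerate(zip(self.mr, self.mg)):
            # Morphemes within a word are separated by spaces, e.g. "walk -ed".
            if len(form.split()) != len(gloss.split()):
                bad.append(i)
        return bad


# Illustrative English data, not taken from the chapter's sources.
bundle = IGTBundle(
    tx="she walked",
    mr=["she", "walk -ed"],
    mg=["she", "walk -PST"],
    wg=["she", "walked"],
    ft="She walked.",
)
problems = bundle.misaligned_words()
```

A check like `misaligned_words` is one way to catch a word whose morpheme breaks and morpheme glosses have drifted out of step.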

We will be reviewing the use of a common annotation program called FLEx, which uses a Lexicon built by manual data entry. One way of growing the Lexicon is to add words as you gloss a text: as you translate, you provide glosses for the words and morphemes in each clause, and these are added to the Lexicon. The next time that word or morpheme appears, the program will find the stored gloss in the Lexicon and provide that annotation automatically. The richer the Lexicon, the faster the annotation.
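The idea that a richer Lexicon yields faster annotation can be sketched with a plain dictionary. This is not how FLEx is implemented and it does not use any FLEx API; the function names, the `???` placeholder, and the English data are assumptions chosen only to show the lookup behavior.

```python
def gloss_text(morphemes, lexicon):
    """Look up each morpheme; forms missing from the lexicon get '???'."""
    return [lexicon.get(m, "???") for m in morphemes]


def add_entry(lexicon, form, gloss):
    """Store a gloss so that later occurrences are filled in automatically."""
    lexicon[form] = gloss


lexicon = {"walk": "walk"}

# First pass: '-ed' has not been glossed yet, so it must be handled by hand.
first_pass = gloss_text(["walk", "-ed"], lexicon)

# The annotator supplies the gloss once...
add_entry(lexicon, "-ed", "PST")

# ...and every later occurrence is glossed from the stored entry.
second_pass = gloss_text(["walk", "-ed"], lexicon)
```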

Discussion: Deconstruct the small IGT sample from Manipuri provided above. For example, discuss what is meant by a dash, what is meant by the equals sign, and why some glosses are in capital letters and others in lower case. Why is there a period in 3.POS? What information about the language can be gleaned from this example? What other questions might you investigate based on this example?

Based on our experience in creating IGT, we suggest that, before you begin your annotation project, you collect as much information from previous resources as you can. When no previous research has been done on the language you’re working on, look to surrounding and related languages. You may also want to contact researchers who have worked on these languages to ask for their unpublished work. Previous research you find will help you:

Develop familiarity with the language and/or language families so you can more readily recognize constituent boundaries and structure
Identify relevant areas of typological interest, which helps you locate the areas that are more complicated to translate
Preload your lexicon with content words, function words, and, if possible, bound morphemes to support faster automatic glossing
Better understand the phonology and orthography so that you can provide a consistent transcription or baseline for annotation
Think through abbreviation conventions for morphemes and grammatical categories that will be needed
Determine the closed categories of function words and morphemes, and identify some of their members. For example, semantic role marking would be a closed category, and a member of that category would be the agentive role.

Discussion:

Is there a linguistics researcher who has done work with your community? Who has these materials, and can you get access to them? Is the language described the same variety as yours? Do you find previous materials to be an accurate representation of your language?

5.2 Types of Documents

Here are some types of documents you may be able to find and their potential usefulness to your project:

Word lists and dictionaries

The data from these sources can be formatted for import into a FLEx Lexicon to automatically populate your glosses. The more words you load into your FLEx Lexicon at the outset, the fewer you will have to add manually.

The most accessible word list files are spreadsheets, which can be modified for import into the FLEx Lexicon. See below for a demonstration of this process. If the dictionary or word list you have access to is not in an appropriate format, try the following:

Contact the researchers who collected the data to see if you can get access to the list in a more usable format
Reach out to linguists or computer scientists for help in scraping word list data from digital files
Manually enter lexical data into a spreadsheet (which may or may not be reasonable depending on the amount of data)
Use the dictionary as a reference during glossing, inputting only relevant words

Transcribed, translated, and/or glossed texts

Existing texts offer opportunities to gain familiarity with the language. The value of these texts increases dramatically when they are glossed. From previously glossed texts, you can pull morphemes with their glosses to include in your FLEx Lexicon. You can also look for multiple instances of the same morpheme to notice similarities and differences in usage between contexts, or inconsistencies in glossing. The inconsistencies may reveal the contested spelling of a word or morpheme, polysemy (multiple meanings), different preferences of different authors, and so on. Annotation will go faster if spelling is consistent. Check the CoRSAL website for an example of a spelling standardization program to help with the annotation process.
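Looking for inconsistencies across glossed texts can be partly automated once morphemes and their glosses are available as pairs. The sketch below is a minimal illustration under that assumption; the function name and the English sample pairs are hypothetical and do not come from any of the sources discussed here.

```python
from collections import defaultdict


def find_inconsistent_glosses(pairs):
    """Return morphemes that were given more than one gloss."""
    seen = defaultdict(set)
    for form, gloss in pairs:
        seen[form].add(gloss)
    # Keep only forms with competing glosses, sorted for stable review lists.
    return {
        form: sorted(glosses)
        for form, glosses in seen.items()
        if len(glosses) > 1
    }


# Illustrative (form, gloss) pairs pulled from imagined glossed texts.
pairs = [
    ("-ed", "PST"),
    ("walk", "walk"),
    ("-ed", "PAST"),
    ("walk", "walk"),
]
conflicts = find_inconsistent_glosses(pairs)
```

Each flagged form still needs a human decision: two glosses may reflect different authors' abbreviations, or they may point to genuine polysemy.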

Linguistic analyses

Any existing linguistic analysis of a language represents a step forward for your IGT project. The analyst benefits by developing a deeper understanding of the language they are working on, and existing analyses such as morpheme glosses can be incorporated into the FLEx Lexicon, offering some shortcuts for developing your IGT conventions.

Using previous research: An example from Hakha Lai

As part of a collaborative US National Science Foundation Project between the University of North Texas and Indiana University, Bloomington (#2031052), our team members collected interviews from Hakha Lai (HL) speakers to explore how COVID-19 health-related information is received and understood by this population. This resulted in hours of HL spoken recordings. These were transcribed, and a rough clause-level translation was created by HL-speaking students using SayMore. In order to further translate the material, noting word and morpheme glosses, we studied all existing HL materials we could acquire. These included:

Language examples in published articles; appendices were especially useful, e.g., the appendix to Roengpitya, Rungpat (1997), Glottal stop and glottalization in Lai (connected speech), in Linguistics of the Tibeto-Burman Area 20.2
Unpublished partially glossed texts (through direct correspondences with several researchers)
An existing Lai-English dictionary (www.chin-dictionary.com)
An existing revised, incomplete Lai-English dictionary (through direct correspondence with a researcher)

In order to include this previous research in our FLEx Lexicon, we created and then merged the following two spreadsheets:

1) Translation data from the word lists

2) Grammatical glosses from existing texts and linguistic research. Since we had access to several HL texts, a word list, and previous research, we were able to put together a list of 210 morphemes with preliminary grammatical glosses based on these materials. Here is the information we pulled for each:

Morpheme spellings: What spellings are used in the text for the morpheme? Are there discrepancies?
HL gloss: What glosses are used for each morpheme? Are there multiple glosses used for the same morpheme? What do these glosses mean?
Placement: Where does the morpheme show up? Is it part of a verb phrase or noun phrase or something else?

From here, we worked to develop a preliminary gloss for each morpheme to be imported into the FLEx Lexicon. At that point, our team members began glossing the many interviews using the enriched Lexicon. The increased speed at which we were able to annotate was notable, with many morphemes and lexemes automatically glossed by FLEx.
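Merging a word list with a list of grammatical morphemes can be sketched with Python's standard `csv` module. The column names `form` and `gloss`, the sample rows, and the function names are assumptions for illustration only; the layout that FLEx actually expects on import is not specified here and should be taken from the import instructions you follow.

```python
import csv
import io

# Two small in-memory "spreadsheets" (hypothetical columns and data).
WORD_LIST_CSV = "form,gloss\nwalk,walk\ngo,go\n"
MORPHEME_CSV = "form,gloss\n-ed,PST\ngo,go\n"


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_entries(*tables):
    """Combine tables in order, dropping exact duplicate (form, gloss) pairs."""
    merged, seen = [], set()
    for table in tables:
        for row in table:
            key = (row["form"].strip(), row["gloss"].strip())
            if key not in seen:
                seen.add(key)
                merged.append({"form": key[0], "gloss": key[1]})
    return merged


merged = merge_entries(read_rows(WORD_LIST_CSV), read_rows(MORPHEME_CSV))

# Write the merged entries back out as a single CSV document.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["form", "gloss"], lineterminator="\n")
writer.writeheader()
writer.writerows(merged)
merged_csv = out.getvalue()
```

Only exact duplicates are dropped, so a form that carries two different glosses survives the merge and can be reviewed by hand.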

To learn more about the process of preloading your Lexicon with existing word and morpheme lists, you might watch this video by our team member Marty Heaton and read the step-by-step instructions by team member Ben Hull.

Discussion:

What are some existing word lists in your language? Are these in digital format? Is there agreement on what to call the bound morphemes? Do the abbreviations make sense to you? (This would be a question for both non-specialists and linguists.)

5.4 What to gloss?

Native speakers will often have an easier time coming up with translations of content words. The meaning of units smaller than words, like the form -ed in the English word walked, can be hidden even from speakers. They certainly know the meaning, but that knowledge lies deep in their unconscious ability to speak the language. It may be hard for speakers to identify and gloss these units without additional help. In the remainder of this chapter we will talk about this idea of additional help in annotation.

Creating a gloss for a morpheme first requires deciding whether to gloss by grammatical category, such as tense (TNS), or by a specific value of that category, such as FUT ‘future.’ Again, based on our personal experience in glossing texts from a variety of Tibeto-Burman languages, we know that, as we gloss, we often understand the function or grammatical category of a morpheme well before its specific semantics. Here are some additional scenarios where glossing is a challenge.

Morphemes that require description

“This word means that the action was performed earlier today.”

Morphemes that don’t change the meaning of the lexical word

“That word attached to eat means the process of eating.”

Morphemes that don’t mean the same thing when split from the lexical word

“This word means that the tiger is a woman, but it doesn’t mean woman by itself.”

Morphemes with no clear meaning

“I don’t know what that word means.”

You may know what the morpheme conveys but not exactly what to call it or which morphological category it fits into (as in the first example, where we know it means something about having performed an activity earlier in the day but may not be sure whether it falls in the tense category, the aspect category, or something else). Conversely, you may strongly suspect the category of the morpheme but not know exactly what it means. For example, a morpheme may clearly express aspect without it being clear whether the aspect is ongoing or sporadic. We propose that your annotation can have two parts: the part you know and the part you need to research further. We call this a hierarchical way of annotating. On the left (usually the part you know) is the super category, the general functional or semantic category of the morpheme. On the right is the part you don’t yet know, the sub category or specific semantics of the morpheme.
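A two-part annotation of this kind can be sketched as a pair of fields, with the sub category left empty until it has been researched. The colon separator, the `?` placeholder, the label `ASP`, and the names `HierarchicalGloss` and `parse_gloss` are all assumptions made for this illustration, not conventions prescribed by the chapter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HierarchicalGloss:
    category: str                 # super category: the part you know
    value: Optional[str] = None   # sub category: None means "still to research"

    def label(self) -> str:
        """Render as CATEGORY:VALUE, using '?' for an unknown sub category."""
        shown = self.value if self.value is not None else "?"
        return f"{self.category}:{shown}"


def parse_gloss(label: str) -> HierarchicalGloss:
    """Split a label at the first colon; an empty or '?' value stays unknown."""
    category, _, value = label.partition(":")
    return HierarchicalGloss(category, None if value in ("", "?") else value)


known = HierarchicalGloss("TNS", "FUT").label()   # category and value both known
partial = HierarchicalGloss("ASP").label()        # only the category is known
```

Keeping the unknown part explicit means that a later pass can list every gloss whose sub category is still open.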

Scenario: French

We were offered the free translation ‘Jean could go’ and we asked a native speaker a few questions, allowing us to associate English translations with each word in the French sentence.

Jean   pourrait   aller.

Jean   could      go

‘Jean could go.’

While the name Jean and the verb aller ‘go’ have straightforward translations, pourrait is more problematic, since it is glossed with the English modal verb ‘could’, which has several possible interpretations that may or may not line up with the French word, including:

Ability: Jean could go to the store (before the accident).
Possibility: Jean could go to the store (on his way home, if he thinks of it).

Furthermore, a native speaker might tell us that ‘pourrait’:

has other forms including ‘pouvoir’ and ‘peux’ (indicating that the word has several possible forms)
may also be translated as might, can, or may (indicating that its translation into English is contextual).

So, how do you gloss it? Here are some logical options:

1) Mark it with <?> and deal with it later. This is the easiest option and lets the glosser cover more ground faster, but it discards any information collected about the morpheme in the process of glossing the text.

2) Translate it with a similar word in the glossing language (like ‘might’ or ‘could’). The advantage of this option over the former is that it keeps track of some information discovered in the process of glossing. This option is good for lexical morphemes, but bad for grammatical ones, as:

It is imprecise (‘might’ can express possibility or permission)
It assumes more analysis exists than has been done
It assumes translatability of grammatical morphemes between languages that may not have accurate correspondences
It may also lose information about the syntactic position of the morpheme, since modal meanings may be expressed in a number of ways in a language and are not limited to a certain syntactic position (e.g., adverbials or verbs with complement clauses)
It makes the text less searchable, since someone looking to do analysis on modals would have to use a set of English modals (could, would, should, must, can, etc.) to find the data to begin analysis. In other words, it is not tagged in a way that facilitates research.

3) Gloss it with the closest analysis you can offer (such as ‘possibility’). This option has some of the advantages of (2) in that it keeps information gained from the process of glossing. It is more precise (‘this morpheme expresses possibility’) than giving it a loose translation, but it suffers from the same searchability problem. If morphemes are marked with whatever the researcher feels is the closest analysis, how do you search to find all things with like distribution to further analyze the data?

4) Mark it as a ‘modal’ or ‘verbal auxiliary’ element. This option offers a broad category label and is more searchable, but loses more nuanced information about semantics gleaned from the process of glossing.
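The searchability trade-off between glossing by translation and glossing by broad category can be made concrete with a toy search. The gloss lists, the label `MOD`, and the function names below are hypothetical and serve only to show how each kind of query behaves.

```python
# The same five morphemes glossed two ways (illustrative data).
glosses_by_translation = ["could", "go", "might", "eat", "can"]
glosses_by_category = ["MOD", "go", "MOD", "eat", "MOD"]


def find_by_translation(glosses, words):
    """Find positions whose gloss is one of the listed English words."""
    return [i for i, g in enumerate(glosses) if g in words]


def find_by_category(glosses, category="MOD"):
    """Find positions tagged with a single category label."""
    return [i for i, g in enumerate(glosses) if g == category]


# A translation-based search only finds what the searcher thought to list.
partial_hits = find_by_translation(glosses_by_translation, {"could", "can"})

# A category-based search needs one label, but says nothing about which
# modal meaning each hit expresses.
category_hits = find_by_category(glosses_by_category)
```

Here the translation search misses the morpheme glossed ‘might’ because that word was not in the query, while the category search finds every modal but no longer distinguishes among them.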

The following chapter considers how to deal with this situation using IGT Workflow.

License

From Source to Analysis: A language documenter's guide to annotating text Copyright © 2024 by University of North Texas is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.