6 Annotation: IGT Workflow

Learning Objectives

Learn and apply methods for lexicon creation
Learn and apply methods for hierarchical glossing

6.1 Glossing Standards and Conventions

Transcription provides a representation of the sounds of an utterance. Translation offers access to the information in the utterance in another language. Interlinear glossing provides a representation of the utterance at the word and subword level.

By dividing an utterance into morphemes and noting the meaning or function of each morpheme individually, interlinear glossing makes your data more usable for researchers in building word and sentence grammar. It also helps with creating pedagogical materials for language teaching.

The most common conventions for interlinear glossing are the Leipzig glossing rules, available here:

https://www.eva.mpg.de/lingua/pdf/Glossing-Rules.pdf

Leipzig Glossing Rules

In the example below, there are four lines of text: (1) the transcription, (2) the morpheme-by-morpheme gloss, (3) the word gloss, and (4) the free translation. Morphemes in line (1) are aligned to their glosses in lines (2) and (3).

1) Bawi=niʔ Chin tsoosaa a=eii-təɹɹ

2) Bawi=ERG Chin beef 3S=eat-CAUS

3) Bawi Chin beef made.to.eat

4) ‘Bawi made Chin eat the beef.’

Morphemes such as tsoosaa ‘beef’ and eii ‘eat’ provide content information to the clause – these will be nouns and verbs, or they may be adjectives and adverbs. Here, we see them translated into English in (3). The constituent ‘made to eat’ is more complicated. It has a morpheme that relays content information (i.e., ‘eat’) and also morphemes that relay grammatical information (i.e., causative and third singular, abbreviated as CAUS and 3S, respectively). Content morphemes are glossed in lower case. Grammatical morphemes, such as the ergative (ERG) marker niʔ and the causative (CAUS) marker təɹɹ, are glossed in caps in (2). Also, affixes such as təɹɹ (CAUS) are divided from other morphemes in the same word with a hyphen (-). Clitics, such as the third person singular subject agreement marker (3S), are divided from other morphemes in the same word with an equals sign (=).
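The alignment between lines (1) and (2) can be checked mechanically. The following is a minimal Python sketch, not part of the Leipzig Glossing Rules themselves: the function names split_morphemes and align are hypothetical, and the upper-case test for grammatical glosses is only a heuristic based on the capitalization convention described above.

```python
import re

def split_morphemes(word):
    """Split a word on '-' (affix boundary) and '=' (clitic boundary).

    The separators are kept in the result, at the odd positions.
    """
    return re.split(r"([-=])", word)

def align(morpheme_line, gloss_line):
    """Pair each morpheme with its gloss, word by word.

    Raises ValueError when the two lines differ in their number of words,
    or when a word and its gloss do not use the same boundary symbols.
    """
    words = morpheme_line.split()
    glosses = gloss_line.split()
    if len(words) != len(glosses):
        raise ValueError("word count mismatch")
    pairs = []
    for word, gloss in zip(words, glosses):
        w_parts = split_morphemes(word)
        g_parts = split_morphemes(gloss)
        # Odd positions hold the separators; they must match exactly.
        if w_parts[1::2] != g_parts[1::2]:
            raise ValueError(f"boundary mismatch in {word!r} / {gloss!r}")
        pairs.extend(zip(w_parts[0::2], g_parts[0::2]))
    return pairs

pairs = align("Bawi=niʔ Chin tsoosaa a=eii-təɹɹ",
              "Bawi=ERG Chin beef 3S=eat-CAUS")
for form, gloss in pairs:
    # Heuristic (assumption): all-caps glosses are grammatical labels.
    kind = "grammatical" if gloss.isupper() else "content"
    print(f"{form}\t{gloss}\t{kind}")
```

A check like this catches a common slip in hand-typed IGT, where a hyphen or equals sign appears in the morpheme line but not in the gloss line.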

Discussion: What are the main principles and goals of the LGR? To help you with the discussion, read: Chelliah, Shobhana, Mary Burke, and Marty Heaton. (2021). Using interlinear gloss texts to facilitate cross-language comparison and improve language description. Indian Linguistics 82(1-2): 1-24.

Forming an analysis and applying it to data is doing the science of linguistics, and the process is recursive. Data informs analysis and analysis refines data. The hardest part of approaching glossing in a new language is starting. When you have no analysis to work from, every word is a puzzle, or several related puzzles. What we aim for is glossing that is:

Accessible: It should allow even a novice glosser to move through the text, even when some morphemes cannot be completely glossed because more analysis is required.
Not relayed only by English translations: Translations, especially of grammatical morphemes, are likely to be imprecise and may result in unclear glosses or mistranslations in other sentences.
Complete: It should reflect the hypotheses developed and knowledge gained by the glosser during the course of the work. It should include both syntactic and semantic information relevant to future studies. It should reflect known information as well as unknown information transparently.
Searchable: Terms should be appropriate for use with more than one language and represented consistently so that all instances can be easily found.
Adaptable: Both commonly used category labels and language specific/researcher specific glosses should be possible.

The set of decisions that need to be made to reach such glossing conventions can be overwhelming, leaving researchers feeling like they need more training before they can even begin. Luckily, however, we know more than we think we know and, as we discuss in the rest of this chapter, the novice glosser can use this knowledge for glossing:

We know or we can learn the expected grammatical categories for a language based on related languages – Typology can inform glossing!
We can often guess from context the meaning of a grammatical morpheme even if we don’t have a name for it – Placeholder labels are our friends!
We can differentiate between different senses of a morpheme – Glosses can reflect the historical development of a morpheme!

Standards, such as Leipzig Glossing Rules (LGR), are beneficial for data consistency. Such consistency can help with comparison of forms within a corpus and with cross-language comparison. However:

LGR will not provide the vocabulary needed for semantic glossing of all morphemes in all languages. As you learned from the reading above, this was not the goal of LGR. So, even with LGR printed out and at your side when glossing, you will have to come up with glosses for morphemes.
While a language expert, your consultant, can tell you the meanings of many content morphemes, they will likely not be able to gloss grammatical morphemes.

6.2 Clause and Phrase Segmentation

After speech has been transcribed and translated, the next step is to segment speech into clauses, phrases, words, and morphemes. Create a guide for yourself on what you expect to find at the edge of a clause. For Tibeto-Burman languages, this will often be a clause chaining mechanism such as a nominalizer or other clausal subordinator. Making a list of the members of the closed category of subordinators will help you with identifying clause boundaries.

When using the automatic segmentation tool of SayMore or ELAN, the pauses at the edges of intonational chunks will automatically break connected speech into units that roughly correspond to phrasal or clausal constituents.

Native speakers will usually be more comfortable matching larger phrases between the transcription and translation than individual words and morphemes. Here a Hakha Lai sentence is segmented roughly into phrases:

[tii vaa pooŋ pa khat aʔ] [ruul lee hŋerʔ tee] [an rak um] [an tii]

[Near a river] [snake and ant] [there was] [they say]

“There were a snake and an ant near a river, they say.”
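The bracketed phrases above can be paired up programmatically. This is a small Python sketch under the assumption that phrases are marked with square brackets exactly as shown; the helper name bracketed_phrases is hypothetical.

```python
import re

def bracketed_phrases(line):
    """Return the text inside each [...] group, in order."""
    return [p.strip() for p in re.findall(r"\[([^\]]*)\]", line)]

source = "[tii vaa pooŋ pa khat aʔ] [ruul lee hŋerʔ tee] [an rak um] [an tii]"
translation = "[Near a river] [snake and ant] [there was] [they say]"

src_phrases = bracketed_phrases(source)
trg_phrases = bracketed_phrases(translation)

# A mismatch in phrase counts signals a segmentation problem to fix by hand.
if len(src_phrases) != len(trg_phrases):
    raise ValueError("transcription and translation have different phrase counts")

aligned = list(zip(src_phrases, trg_phrases))
for src, trg in aligned:
    print(f"{src}\t{trg}")
```

Pairing phrase by phrase assumes the translation keeps the source phrase order, as it does in this example; a freer translation would need manual matching.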

6.3 Free morphemes, Bound Morphemes, and Clitics

Native speaker intuitions are key to the segmentation of utterances into morphemes. Be aware, however, that an additional challenge comes in how morphemes are represented in the practical orthography. Consider the following example from a story in Hakha Lai called “The Snake and the Ant” (R. Roengpitya, LTBA 20.2, p. 44).

tii vaa pooŋ pa khat aʔ ruul lee hŋerʔ tee an rak um an tii

“There were a snake and an ant near a river, they say.”

Here both lexical and grammatical morphemes are written separately. The orthography treats these both the same even though in many cases the grammatical morphemes are bound (must occur with another form) and the lexical morphemes are free (can occur without other morphology). On the morpheme representation and morpheme glossing lines, we can make clear if we are dealing with bound or free morphology. For example, a morpheme with a dash before or after might signal a bound affix.
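As an illustration of how boundary symbols can carry this information, here is a short Python sketch. It assumes one possible notation for citation forms (a hyphen for affixes and an equals sign for clitics, following the conventions in 6.1); the function name boundary_type is hypothetical, and the forms are taken from examples in this chapter.

```python
def boundary_type(form):
    """Classify a morpheme citation form by its boundary symbol.

    Assumed notation: 'pa-' prefix, '-təɹɹ' suffix, 'a=' proclitic,
    '=niʔ' enclitic; a form with no boundary symbol is treated as free.
    """
    if form.startswith("-"):
        return "suffix"
    if form.endswith("-"):
        return "prefix"
    if form.startswith("="):
        return "enclitic"
    if form.endswith("="):
        return "proclitic"
    return "free"

for form in ["tsoosaa", "-təɹɹ", "=niʔ", "a=", "pa-"]:
    print(form, boundary_type(form))
```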

6.4 Morpheme Glossing

Discussion: Once a clause is divided into phrases and you have the free translation, what next steps will you take to further divide the phrase into content and function morphemes? How will you:

Discover where the head of phrase is?
Develop a hypothesis about the order of morphemes?
Identify the edge of a clause or phrase?
Determine if a morpheme is free, bound, or a clitic?
Gloss a grammatical morpheme?

Identifying grammatical morphemes

Whether you are a speaker of the language doing the translation yourself or you are working with a speaker to arrive at the translation, you might pay attention to these types of comments:

Morphemes that correspond to functional categories in English (pronouns, adpositions, auxiliary verbs, etc.)

“That word corresponds to the to part of the phrase”

Morphemes that require description

“This word means that the action was performed earlier today”

Morphemes that don’t change the meaning of the lexical word.

“That word attached to eat means the process of eating”

Morphemes that don’t mean the same thing when split from the lexical word.

“This word means that the tiger is a woman, but it doesn’t mean woman by itself.”

Morphemes with no clear meaning

“I don’t know what that word means.”

These and other native speaker intuitions are helpful clues as to whether a morpheme is lexical or grammatical and bound or free.

Naming grammatical morphemes

Recall that in the last chapter we were offered the free translation “Jean could go” for the French sentence below. We asked a few questions to a native speaker, allowing us to associate English translations with each word in the French sentence.

Jean pourrais aller.

Jean could go

‘Jean could go.’

While the name Jean and the verb aller ‘go’ offer more straightforward translations, pourrais is more problematic, since it is glossed with the English modal verb ‘could’ which has several possible interpretations that may or may not line up with the French word including:

Ability: Jean could go to the store (before the accident).
Possibility: Jean could go to the store (on his way home, if he thinks of it).

Furthermore, a native speaker might tell us that pourrais:

Has other forms including ‘pouvoir’ and ‘peux’ (indicating that the word has several possible forms)
May also be translated as might, can, or may (indicating that its translation into English is contextual).

We have now determined that glossing for pourrais is not a simple matter! Here are the possible strategies we can follow:

So how do you gloss it?

(1) Mark it with <?> and deal with it later. This is the easiest option, but it ignores any information collected about the morpheme in the process of glossing the text. Since it ignores collected information and moves on, it allows the glosser to cover more ground faster.

(2) Translate it with a similar word in the glossing language (like ‘might’ or ‘could’). The advantage of this option over (1) is that it keeps track of some information discovered in the process of glossing. This option is a workable option for lexical morphemes but ultimately confusing for grammatical ones as:

It is imprecise (might can express possibility or permission)
It assumes more analysis exists than has been done
It assumes the translatability of grammatical morphemes between languages that may not have accurate correspondences.
It also may lose information about the syntactic position of the morpheme since words with modal meanings may be expressed in a number of ways in a language not limited to a certain syntactic position (as adverbials or verbs with complement clauses).
It makes the text less searchable, since someone looking to do analysis on modals would have to use a set of English modals (could, would, should, must, can, etc.) to find the data to begin analysis. In other words, it is not tagged in a way that facilitates research.

(3) Give it a name like ‘possibility’ noting whatever your preliminary analysis is, based on context and translation. This option has some of the advantages of (2) in that it keeps information gained from the process of glossing. It is more precise (“this morpheme expresses possibility”) than giving it a loose translation, but it suffers from the same searchability problem. If morphemes are marked with whatever the researcher feels is the closest analysis, how do you search to find all things with like distribution to further analyze the data?

(4) Mark it as a ‘modal’ or ‘verbal auxiliary’ element. This option offers a broad category label and is more searchable, but loses more nuanced information about semantics gleaned from the process of glossing.

General grammatical category label

The method we offer here to meet the challenge of how to gloss a grammatical morpheme involves using a two part name:

Part 1: the more clearly understood information like the general category to which the morpheme belongs
Part 2: semantic information which is the meaning relayed by the morpheme as indicated by the free translation

For example, aspect is a general grammatical category and progressive is a specific type of aspect. So, the two part name would be aspect:progressive. It may be that you know the left side of the equation or the right side or both. You can more rapidly and confidently gloss by stating what you do know and leaving what you don’t know for later. The system proposed here allows researchers to flexibly represent their understandings of word and morpheme glosses, revising them with increasingly deeper levels of analysis. The first step in making this method of hierarchical interlinear glossing (HIGT) work is to create a list of recurring grammatical categories. And this is where our typological understanding of Tibeto-Burman can help!
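One way to picture the two part name is as a small record with a category field and a semantics field, either of which may still be unknown. The Python sketch below is only an illustration: the class Gloss and the function parse_gloss are hypothetical names, and the marker uk for unknown information follows the convention introduced later in this chapter.

```python
from dataclasses import dataclass
from typing import Optional

UNKNOWN = "uk"  # placeholder for information not yet known

@dataclass(frozen=True)
class Gloss:
    category: Optional[str]   # left side, e.g. 'aspect'
    semantics: Optional[str]  # right side, e.g. 'progressive'

    def __str__(self):
        return f"{self.category or UNKNOWN}:{self.semantics or UNKNOWN}"

def parse_gloss(text):
    """Parse a two part gloss such as 'aspect:progressive'.

    Either side may be empty or 'uk'. Glosses without a colon (for
    example lexical glosses like 'watch') are not handled here.
    """
    if ":" not in text:
        raise ValueError(f"not a two part gloss: {text!r}")
    left, _, right = text.partition(":")

    def known(part):
        part = part.strip()
        return part if part and part != UNKNOWN else None

    return Gloss(known(left), known(right))

print(parse_gloss("aspect:progressive"))  # both sides known
print(parse_gloss("aspect:"))             # only the category known
print(parse_gloss(":progressive"))        # only the semantics known
```

Keeping the two sides as separate fields means a gloss can be stored as soon as one side is known and completed later without retyping the other side.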

Examples of grammatical categories

AGR: inflectional agreement morphology for agent or patient, which may vary according to person and number

DIR: derivational morphology primarily indicating the direction of the action of a motion verb

SUB: subordinators affixed to finite or nominalized clauses to create subordinate clauses in clause chaining or as complements to verbs.

Specific semantic category label

This is where the specific instantiation of the grammatical category (e.g., subordinator, directional or agreement marker) is identified with a semantic label. For example:

(1) subordinator: simultaneous

(2) subordinator: sequential

(3) subordinator: after acting

The general grammatical category is subordinator. The specific instantiations are simultaneous or sequential.

Compare (2) and (3). Imagine that annotator A writes (2) and annotator B writes (3). Annotator B might simply not know the term sequential, but both annotators are referring to the same morpheme. We can recapture at least some of that information by looking at the left side of the colon and the morpheme itself.

It is difficult, to put it mildly, to define the semantics of a morpheme and incorporate all aspects of information about that morpheme in a gloss. Simply standardizing across all annotators may actually hide the nuances of meaning we gain from what is on the right side of the colon. So, both left and right side glossing as a pair will provide the best tracking. It will also allow the annotator to write down with certainty the left side while figuring out the right side.
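The comparison of annotators A and B can be sketched in Python as follows. The morpheme identifiers m1 and m2 are placeholders standing in for real forms, and the function name labels_by_category is hypothetical; only the gloss labels come from the examples above.

```python
from collections import defaultdict

# (annotator, morpheme id, gloss); 'm1' and 'm2' are placeholder ids.
annotations = [
    ("A", "m1", "subordinator: sequential"),
    ("B", "m1", "subordinator: after acting"),
    ("A", "m2", "subordinator: simultaneous"),
]

def labels_by_category(rows):
    """Group right-side labels by morpheme and left-side category."""
    grouped = defaultdict(set)
    for _annotator, morpheme, gloss in rows:
        category, _, label = gloss.partition(":")
        grouped[(morpheme, category.strip())].add(label.strip())
    return {key: sorted(labels) for key, labels in grouped.items()}

grouped = labels_by_category(annotations)
for (morpheme, category), labels in grouped.items():
    print(morpheme, category, labels)
```

Because both annotators agree on the left side, their different right-side labels for the same morpheme end up side by side, ready for comparison rather than lost.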

Here are some clues that can go into your decision on the semantic label for a morpheme. These sources of hypotheses are valuable and merit preserving, especially in forming early analyses.

Information from existing sources on the language

A previous researcher calls this a past tense marker.

Contextual information from the text about position and meaning

Past tense information in the translation doesn’t appear elsewhere in the sentence, so it may be tied to this morpheme.

Translations, meta-information, and other intuitions offered by native speakers

The language assistant said, ‘I think that part means that it happened already.’

Typological and theoretical research on related phenomena

Usually, morphemes with this kind of behavior are called past-tense markers in other languages.

Analyses of phenomena in related languages

This morpheme looks like a past tense marker in a related language.

Your own intuition

Based on my past experiences, this seems to be marking past tense.

You can read more about the process of semantic glossing here: Bochnak, M. Ryan and Lisa Matthewson 2020. Techniques in complex semantic fieldwork. Annual Review of Linguistics 6:261-283.

Updating analyses

Glosses are analyses, and analyses require updates. Consider the following Zophei example:

a-pa-va-ming

‘He checked up on me.’

In this example, we have identified:

That the first two morphemes in this complex word are agreement markers associated with a 3rd person singular subject (AGR:3SBJ) and a 1st person object (AGR:1OBJ).
That vaming means ‘to check up on’, but ming means ‘to watch’.
The native speaker doesn’t have a clear intuition about what va means by itself saying it may be related to the sense that you go somewhere to check up on someone, or to the event happening in the past.

We decide that va is a morpheme that we should gloss, but we are unsure what to call it without more data.

a-pa-va-ming

AGR:3SBJ-AGR:1OBJ-_______-watch

‘He checked up on me.’

First pass analysis- placeholder for position

One useful strategy is to label this morpheme according to its position, e.g., PRVP for preverbal particle. This way, it can be easily found when gathering up other morphemes in this position (other ‘pre-verbal particles’) when we are ready for the next iteration of morphological analysis. Since the speaker noted that it could have to do with movement or with tense, our hypotheses are that it is a directional (DIR) or past tense (PST) marker, and we can make a note of this. FLEx provides multiple note fields for such reminders, and the note fields can be searched.

a-pa-va-ming

AGR:3SBJ-AGR:1OBJ-PRVP-watch

‘He checked up on me.’

In our next pass of analysis, to make progress on this morpheme, we can use search features to:

find all items transcribed as va
find all items marked as PRVP to compare va with other markers in the same or a similar pre-verbal position
find all items marked with DIR to compare with other suspected directional markers
find all items marked with PST to compare with other suspected past tense markers
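The searches listed above can be illustrated with a toy corpus in Python. This sketch does not use FLEx or its search interface; the Token class and the three find functions are hypothetical, and the corpus contains only the single Zophei word discussed here.

```python
from dataclasses import dataclass

@dataclass
class Token:
    form: str
    gloss: str
    note: str = ""

corpus = [
    Token("a", "AGR:3SBJ"),
    Token("pa", "AGR:1OBJ"),
    Token("va", "PRVP", note="hypotheses: DIR or PST"),
    Token("ming", "watch"),
]

def find_form(tokens, form):
    """All tokens transcribed with the given form."""
    return [t for t in tokens if t.form == form]

def find_category(tokens, category):
    """All tokens whose gloss has the given label left of the colon."""
    return [t for t in tokens if t.gloss.partition(":")[0] == category]

def find_note(tokens, text):
    """All tokens whose note field mentions the given text."""
    return [t for t in tokens if text in t.note]

print(find_form(corpus, "va"))
print(find_category(corpus, "PRVP"))
print(find_category(corpus, "DIR"))  # empty: nothing is glossed DIR yet
print(find_note(corpus, "DIR"))      # finds va through its note
```

Searching the note field as well as the gloss is what lets a hypothesis such as DIR be recovered before it has been promoted to the gloss itself.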

Second pass analysis

After considering va with other pre-verbal markers in the language, we have decided that we are confident in calling it a directional marker. From here, we revise our category to DIR, but we aren’t sure what subcategory to use. Rather than leave it blank, we mark it with uk to indicate that it is unknown information.

DIR:uk

Better yet, we can mark other information, such as that the morpheme indicates that the associated motion is going, or that it indicates movement across level ground, or other pieces of potentially relevant information.

DIR:going

DIR:level

With increasing nuance, we can create more accurate glosses.
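The successive passes can be modelled as updates that keep a record of earlier glosses. The following Python sketch is an assumption about how such a record might be kept, not a description of how FLEx stores revisions; Token, regloss, and regloss_form are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    form: str
    gloss: str
    history: list = field(default_factory=list)

    def regloss(self, new_gloss):
        """Replace the gloss, remembering the previous analysis."""
        self.history.append(self.gloss)
        self.gloss = new_gloss

word = [
    Token("a", "AGR:3SBJ"),
    Token("pa", "AGR:1OBJ"),
    Token("va", "PRVP"),   # first pass: positional placeholder
    Token("ming", "watch"),
]

def regloss_form(tokens, form, new_gloss):
    """Update every token with the given form; return how many changed."""
    changed = 0
    for token in tokens:
        if token.form == form and token.gloss != new_gloss:
            token.regloss(new_gloss)
            changed += 1
    return changed

regloss_form(word, "va", "DIR:uk")     # second pass: category decided
regloss_form(word, "va", "DIR:going")  # later: semantics added

gloss_line = "-".join(token.gloss for token in word)
print(gloss_line)
print(word[2].history)
```

Updating by form alone is a simplification: in a real corpus, homophonous morphemes would need to be told apart before a bulk change like this is safe.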

Activity: Thinking about a language you are currently working on, make a list of at least 15 bound morphemes. What are the meanings of these morphemes? How would you gloss these using hierarchical glossing? What general grammatical category do they belong to? What specific semantics do they have?

6.5 Glossing Hakha Lai

With the proposed workflow in mind, let’s work back through the Hakha Lai sentence previously divided into phrases in 6.2.

[tii vaa pooŋ pa khat aʔ] [ruul lee hŋerʔ tee] [an rak um] [an tii]

[Near a river] [snake and ant] [there was] [they say]

“There were a snake and an ant near a river, they say.”

Let’s consider each phrase individually, assuming we have a native speaker to ask questions of.

“Near a River”

A native speaker offers the following translations of individual words.

tiivaa pooŋ pakhat aʔ

river near one at

“Near a river”

tiivaa

The native speaker indicates that tiivaa means ‘river’, so we have combined the syllables tii and vaa. The word tii means ‘water’, but vaa has no clear and apparent meaning, so we decide that the two may be related historically but that the word is not clearly decomposable in the modern language. We combine the syllables into the same word and gloss it together as ‘river’.

pooŋ

This morpheme appears after the noun with the meaning ‘near’, but the phrase still needs the post-position aʔ, currently translated as ‘at’. So, it is likely pooŋ is a noun with the translation ‘near’, meaning something like ‘the area near’. So, we’ll just translate it as ‘near’.

pakhat

Our free translation for this word is ‘a’, a function word in English that expresses that the noun is indefinite. Without any other information about this language, we can assume that a translation of ‘a’ is likely not appropriate based on its syntactic category and semantic meaning. To start, we do not yet know whether the language encodes definiteness at all, let alone how it is marked. Next, if we ask for other numbers, we get pahnih ‘two’, pathum ‘three’, and pali ‘four’. Since each word starts with pa-, we can hypothesize that pa- is a separate morpheme, likely a numeral classifier. We can also guess that if there is one numeral classifier, there will be others. So, we should leave room to update our annotation as our understanding of the phenomenon builds. For this one, we’ll separate the word into two morphemes and gloss them:

pa-khat

CLF:uk-one

This option captures the information that:

(a) pa- is a numeral classifier (CLF)

(b) We don’t know what contexts this CLF is used in, and what other CLFs we’ll come across, but we want to leave a placeholder to later add that information (CLF:uk)
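The reasoning that led us to split off pa- can be sketched in Python using the four numerals elicited above. A shared initial string is only a hint toward a hypothesis, not evidence of a morpheme boundary on its own; the variable names here are illustrative.

```python
from os.path import commonprefix

# Numerals elicited from the speaker, with their translations.
numerals = {"pakhat": "one", "pahnih": "two", "pathum": "three", "pali": "four"}

# commonprefix compares strings character by character.
prefix = commonprefix(list(numerals))

# Propose a segmentation and a placeholder gloss for each numeral.
segmented = {}
if prefix:
    for form, meaning in numerals.items():
        root = form[len(prefix):]
        segmented[form] = (f"{prefix}-{root}", f"CLF:uk-{meaning}")

print(prefix)
for form, (morphemes, gloss) in segmented.items():
    print(morphemes, gloss)
```

With more numerals, or numerals that take a different classifier, the shared string may shrink or disappear, which is itself useful information about the classifier system.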

aʔ

Our translation for this word is another function word, ‘at’. In English, ‘at’ has really variable and idiosyncratic use (e.g. ‘at school’, ‘at 1pm’, ‘at last’), so we know a translation will not be an accurate way to represent this Hakha Lai word. What we do know is that it marks a noun phrase as an adjunct (not a required part of the sentence), so we can call it an oblique case marker (OBLCM). Since, at least in this context, it is a reference to a location, we will tentatively label it ‘locative’ (LOC). So, our gloss is OBLCM:LOC.

Here is the new gloss:

tiivaa pooŋ pa-khat aʔ

river near CLF:uk-one OBLCM:LOC

“Near a river”

“A snake and an ant”

A native speaker offers the following translations of individual words.

ruul lee hŋerʔtee.

snake and ant

“a snake and an ant”

hŋerʔtee

These two syllables together are translated as ‘ant’ so we’ve combined syllables in our transcription accordingly.

lee

The translation we got for this word is ‘and’, which is a conjunction. Conjunctions may have more or less limited contexts where they can show up, and a language may have many conjunctions. So, rather than gloss this (and likely other conjunctions) as ‘and’, we can gloss this as a conjunction (CONJ). To note that there may be more relevant information to add to this gloss, we will add ‘uk’ as a subcategory and leave it to future study (CONJ:uk).

Here is the new gloss:

ruul lee hŋerʔtee.

snake CONJ:uk ant

“a snake and an ant”

“There was”

A native speaker offers the following translations of individual words.

an rak um

? ? was

“there was”

an & rak

Currently, we do not have many clues from translation to figure out the function of these two morphemes. We can call these both pre-verbal particles (PRVP:uk) and move on.

um

This verb is currently translated as ‘was’. Words translated with a form of ‘be’ are often some type of copula (COP). We are still unsure how past tense is represented. We do know that here the verb is reporting the existence of a snake and an ant at a place, so we can add that information in the subcategory gloss (COP:exist).

Here is the new gloss:

an rak um

PRVP:uk PRVP:uk COP:exist

“there was”

“They say”

A native speaker offers the following translations of individual words.

an tii

? say

“they say”

an

We saw this morpheme in the previous line, but here we get the clue that it should correspond with the subject ‘they’. This lets us know that it is likely a pronoun, pronominal clitic, or agreement marker. For now, without other information about the language, we can add subject information to the subcategory gloss (PRVP:3plsubj), keeping this morpheme in the ‘pre-verbal particle’ category so it is easily compared against other PRVPs. (If we have enough information to call this, for example, an agreement marker, we could use AGR:3plsubj).

Here is the new gloss:

an tii

PRVP:3plsubj say

“they say”

Working towards linguistic analysis

Here is our updated gloss:

tiivaa pooŋ pa-khat aʔ

river near CLF:uk-one OBLCM:LOC

‘near a river’

ruul lee hŋerʔtee.

snake CONJ:uk ant

‘a snake and an ant’

an rak um

PRVP:uk PRVP:uk COP:exist

‘there was’

an tii

PRVP:3plsubj say

‘they say’

Some of these glosses are good enough to work for us until we have more examples of the same word or other words in the same category for analysis. OBLCM:LOC and COP:exist, for example, have both a grammatical category and semantic category gloss, getting us close to the gloss we may ultimately settle on for our corpus. CONJ:uk and CLF:uk are also likely good enough until we have a lot of examples of conjunctions and numeral classifiers for a larger and more detailed look at these categories.

The gloss PRVP, however, only works as a temporary landing spot for morphemes showing up before the verb, within the same roughly segmented phrase. We can leave these and move on, but after a bit more consideration, we may be able to move these morphemes to more apt categories.

an

The first problem we can see is that an is glossed as both PRVP:uk and PRVP:3plsubj. In the first instance, the translation has an expletive (or “dummy”) subject ‘there’ and in the second instance, it is glossed as ‘they’ (referring to unknown or unspecified people). More examples will be needed to determine the grammatical category.
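Inconsistencies of this kind can be surfaced automatically. The Python sketch below lists every form that carries more than one gloss in the updated glosses above; the function name forms_with_multiple_glosses is hypothetical.

```python
from collections import defaultdict

# (form, gloss) pairs from the updated gloss of the Hakha Lai sentence.
tokens = [
    ("tiivaa", "river"), ("pooŋ", "near"), ("pa", "CLF:uk"), ("khat", "one"),
    ("aʔ", "OBLCM:LOC"),
    ("ruul", "snake"), ("lee", "CONJ:uk"), ("hŋerʔtee", "ant"),
    ("an", "PRVP:uk"), ("rak", "PRVP:uk"), ("um", "COP:exist"),
    ("an", "PRVP:3plsubj"), ("tii", "say"),
]

def forms_with_multiple_glosses(pairs):
    """Map each form that has more than one distinct gloss to those glosses."""
    seen = defaultdict(set)
    for form, gloss in pairs:
        seen[form].add(gloss)
    return {form: sorted(glosses)
            for form, glosses in seen.items() if len(glosses) > 1}

conflicts = forms_with_multiple_glosses(tokens)
print(conflicts)
```

A form flagged here is not necessarily an error: it may be two homophonous morphemes, or one morpheme whose analysis is still in progress, as with an.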

Visit the CoRSAL website to find links to IGT examples. We invite you to try out this method with IGT at your disposal. What are the shortcomings of this method? What are the advantages?

In the next section, we will review the use of the program FLEx where annotation can be stored and revised to record the growing understanding of the annotator.

License

From Source to Analysis: A language documenter's guide to annotating text by University of North Texas is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.
