4 Transcription and Translation
Learning Objectives
In this chapter, we will:
- Review decisions about orthography
- Install and use Keyman for special keyboard characters
- Install and use SayMore for transcription and translation
4.1 Introduction
By creating metadata for your audio, video, scanned text files, and photographs, you have made it possible to organize these files in many different ways: by date, by genre, by creator, and so on. You have also made it possible to search for content easily. This is the first step to providing access to your collection.
In this section, we review how you can provide a deeper level of access through creating and sharing free transcriptions and translations.
In the past, linguists used pen and paper or word processing software to write down transcriptions and translations. This is no longer necessary. In fact, if you talk to anyone who does this kind of work regularly, they will strongly suggest using software specifically designed for translation and transcription of words and text from under-documented languages. It speeds up the process, ensures uniformity in data formats for export, and allows large teams to work in unison. The software we recommend for non-specialists is SayMore. This software is free, runs on Windows, is frequently updated, has a community of users, is easy to learn, includes a data management tool, and exports to frequently used formats.
To summarize, SayMore or a similar tool is useful to:
- Manage your files (group related files in the same folders, make file names consistent)
- Create metadata (add additional contextual information to your files)
- Automatically segment audio into manageable chunks
- Replay, transcribe, and translate chunks of audio
You can read more about SayMore here:
Moeller, S. (2014). SayMore, a tool for Language Documentation Productivity. In Language Documentation & Conservation, 8, (pp. 66-74). Honolulu, Hawaii: University of Hawaii Press. Available from https://scholarspace.manoa.hawaii.edu/handle/10125/4610
You can read about developments for language documentation software here:
Arkhipov, A. and N. Thieberger. (2018). Reflections on software and technology for language documentation. In B. McDonnell, A. L. Berez-Kroeker, and G. Holton. (Eds.). Reflections on Language Documentation 20 Years after Himmelmann 1998. Language Documentation & Conservation Special Publication, 15, (pp. 140-149). Honolulu, Hawai‘i: University of Hawai‘i Press. Available from https://scholarspace.manoa.hawaii.edu/bitstream/10125/24821/ldc-sp15-arkhipov.pdf
4.2 Transcription and Translation
The central concern for transcription is consistency. No matter the purpose of your data collection and analysis, keeping to a predictable representation of the sounds and words in a language makes data more usable for yourself, community members, and future researchers.
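One way to keep transcriptions consistent is to check each line against the set of characters the team has agreed to use. The sketch below is only an illustration: the `ALLOWED` inventory and the function name `unexpected_characters` are hypothetical, and the second sample line contains an invented stray accent to show what a flagged character looks like.

```python
import unicodedata

# Hypothetical inventory: replace with the characters your community has agreed on.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz' ")

def unexpected_characters(line, allowed=ALLOWED):
    """Return the characters in `line` that fall outside the agreed inventory."""
    # Normalize first so that the same letter is always stored the same way.
    normalized = unicodedata.normalize("NFC", line.lower())
    return sorted({ch for ch in normalized if ch not in allowed})

# The second line has an invented accent, used here only to trigger the check.
transcriptions = ["mthungbi tren", "mthungbi tr\u00e9n"]
for line in transcriptions:
    print(line, unexpected_characters(line))
```

A check like this does not decide what the orthography should be. It only reports where a transcription departs from the conventions already chosen, so that a person can review those places.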
Discussion: Orthography and the Practice of Writing
You can read about orthography development here:
Lüpke, F. (2011). Orthography development. In P. Austin and J. Sallabank (Eds.). The Cambridge Handbook of Endangered Languages, (pp. 312-336). Cambridge, England: Cambridge University Press. Available from https://assets.cambridge.org/97805218/82156/frontmatter/9780521882156_frontmatter.pdf
Then discuss the following: How do you or would you use your language for written purposes? What written materials exist? What is the orthography used? What discussion has the community had about spelling and characters? Here are some common considerations in developing an orthography. Use this list to help with your discussion:
Shallow vs. Deep Representation: The terms shallow and deep, when referring to orthographies, indicate how much detailed phonetic information the orthography conveys. It may seem tempting to opt for as much phonetic information as possible. But overspecifying predictable phonetic features may overburden the reader and make a passage hard to read. For example, in Lamkang, a language spoken in parts of India and Burma, long vowels are tense. However, the practical orthography indicates only the length distinctions, not the vowel quality distinctions. For example: tren [tr?n] ‘buy’ versus treen [tr?n] ‘cut’.
Typographic Ease: Ease of use refers to how easily speakers can actually employ the orthography in their day-to-day activities and how easily they can learn to read and write it. This is another reason that too shallow an orthography can be problematic: it can require more symbols to be learned and more diacritics (e.g., accent marks) to be written.
Another consideration is, how easily can it be typed? It is a digital world where smartphones and other devices can provide platforms for improving literacy. How easily can the orthography be written using a smartphone or standard computer keyboard? Certain special characters can easily be written by hand but can be troublesome to type. A useful development is that some language communities now have apps that let users input less conventional characters via phone and computer.
Spacing and Punctuation: One of the tough decisions for a language that is not written frequently is, what constitutes a word? In English, we write going to as two separate words even though we often pronounce the sequence as one word, i.e., gonna. This follows a convention that has been around for hundreds of years. But for under-resourced languages, conventions are new and not followed by all writers, so we find variation in what is considered one word or more than one word. In Lamkang, we see the following variation, for example: mthungbi or mthung bi ‘then’.
Familiarity: Something else to keep in mind is the familiarity of the orthography being used. Is the orthography similar to the orthography of the dominant language that the community may already be used to using? This may impact ease of learning, use, and acceptance.
Aesthetics: Another concern with orthographic choice is aesthetics. Some language communities want orthographies that remind them of major language orthographies, like English or Hindi. Another example from Lamkang is the representation of repeated vowel sounds. Some community groups wish to replace the second vowel sound with another representation because they perceive it as more aesthetically pleasing, like kruung vs. kruwng ‘lord, god’.
Differences across dialects: Another difficult issue with orthographies is linguistic variation across different dialects. Which variety should be represented in the spelling system? If there are only two or three varieties and the differences are predictable, then one might be able to develop conventions for each variety. However, the community’s thoughts on the standardization and prestige of one variety over another will be the deciding factor.
Acceptance/Adoption: Even when developed by community members, it may take years for spelling conventions to be widely accepted and implemented by the community. Orthographic choices turn out to be a highly visible sign of affiliation, be it to an individual, religion, or community history. For example, the Lamkang community has a Latin-based orthography due to their predominantly Christian beliefs. Attempting to use the Devanagari script or another orthography associated with non-Christian cultures may not be successful. Another example of how politics and culture impact orthography use is Chechen, spoken in the North Caucasus and modern-day Chechnya. Chechen was written in Arabic, Cyrillic, and Latin scripts at different points over the past century due to the political and religious influences in the region.
To provide access to your collection, you can transcribe in IPA, the International Phonetic Alphabet, but keep in mind that non-specialists won’t be able to read this easily. Therefore, a consistent practical orthography that is approved, at least for the most part, by the community would be useful.
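Variation of the kind described under Spacing and Punctuation (mthungbi or mthung bi) can be surfaced automatically once you have a word list. The following is a minimal sketch, assuming the forms are available as plain text. The function name `spacing_variants` is hypothetical, and the sample list reuses only the Lamkang forms quoted in this chapter.

```python
from collections import defaultdict

def spacing_variants(forms):
    """Group written forms that differ only in where spaces are placed."""
    groups = defaultdict(set)
    for form in forms:
        pieces = form.split()
        key = "".join(pieces)              # the form with all spaces removed
        groups[key].add(" ".join(pieces))  # the form with tidy single spaces
    # Keep only the keys that were written in more than one way.
    return {key: sorted(variants)
            for key, variants in groups.items() if len(variants) > 1}

forms = ["mthungbi", "mthung bi", "tren", "treen"]
print(spacing_variants(forms))
```

The output lists candidates for discussion, not errors. Whether a sequence should be written as one word or two remains a decision for the community.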
When using language documentation software like SayMore, or even just typing examples in a Word document in order to share your findings, you may need to type special characters not found on your keyboard. If you only need to include a few symbols, you can go to ipa.typeit.org to insert symbols as needed. However, if you have to type a lot of special characters, you will want to download software that lets you use combinations of keystrokes to insert symbols more quickly. Keyman is one keyboard program that works well with the software packages we will be using: FLEx and SayMore. Keyman can be used with Windows, Mac, and Linux systems. Visit the CoRSAL site for a guide on installing and configuring Keyman.
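Whichever input method you use, it can help to confirm which characters actually ended up in your text, because some accented letters can be stored in two different ways that look identical on screen. The sketch below uses Python's standard `unicodedata` module. The function name `describe` is hypothetical, and the sample characters are chosen only for illustration.

```python
import unicodedata

def describe(text):
    """List each character in `text` with its code point and Unicode name."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

precomposed = "\u00e9"   # e with acute accent stored as a single character
combining = "e\u0301"    # plain e followed by a combining acute accent

print(describe("\u014b"))          # the letter eng, common in phonetic transcription
print(describe(combining))
print(precomposed == combining)    # False: they look alike but are stored differently
print(unicodedata.normalize("NFC", combining) == precomposed)  # True after normalization
```

Normalizing all text to one form (NFC in this sketch) before searching or comparing is one way to avoid treating visually identical words as different.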
At this point, you may want to read the page on SayMore installation and use on the CoRSAL site. There you can learn how to use SayMore for creating transcriptions, translations, and metadata, as well as for data management. The program allows for automatic or manual segmentation of the sound. If you have a video file, the program will strip the audio from the video. The division or segmentation of the file is time-stamped so you can see exactly where in the recording each chunk of speech occurs. Once the file is divided into chunks, you can link to each chunk, transcribe it, and provide a rough translation. This side-by-side transcription and translation can be exported for use in further annotation programs like FLEx or formatted and used for presentation in different forms. You will also find instructions on the use of another program, ELAN, with which you can also create time-aligned annotations. ELAN is useful if you are working with a Mac.
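To make the idea of time-stamped chunks concrete, the sketch below reads a small ELAN-style annotation document and lists each chunk with its start and end time in milliseconds. It is an illustration under stated assumptions: the sample is a hand-written stand-in rather than a real export, the tier name and annotation text are invented, real .eaf files carry additional header and tier information, and the function name `read_segments` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for an .eaf file; tier name and text are invented.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="Transcription">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first chunk</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second chunk</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

def read_segments(eaf_text):
    """Return (tier, start_ms, end_ms, text) for each time-aligned annotation."""
    root = ET.fromstring(eaf_text)
    # Map each time slot ID to its position in the recording.
    times = {slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
             for slot in root.iter("TIME_SLOT")
             if slot.get("TIME_VALUE") is not None}
    segments = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = times.get(ann.get("TIME_SLOT_REF1"))
            end = times.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            segments.append((tier.get("TIER_ID"), start, end, text))
    return segments

for tier, start, end, text in read_segments(SAMPLE_EAF):
    print(f"{tier}: {start}-{end} ms  {text}")
```

To try this on your own export, you could read the file's contents into a string and pass it to `read_segments`. Tier names in a real file depend on the program that created it.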
4.3 Project: Creating a SayMore session and importing to ELAN
- Create a Session folder in SayMore for video documents and add a video recording
- Create an .EAF annotation for your video file
- Export the annotation file as a .EAF
- Open a new ELAN project
- Import your SayMore file
- Add a Translation tier and a Word tier in ELAN