3 Data Management and Metadata Creation

Learning Objectives

In this section we will learn:

  1. How to name files and organize folders
  2. How to create metadata for digital language items
  3. How to use SayMore for data management

 

3.1 Data Management & Metadata

Let’s consider best practices in managing information about your audio, video, text, and photograph files. We will call these your source files.  It is very easy to forget basic information about files such as who created the file, when it was created, how files are related to each other, and who provided the file.  You will need this information when you write about the source files or when you send your files to an archive.

In this section, we provide some suggestions on how you can organize your materials (referred to as data management) and keep track of these details (referred to as metadata). Data management and metadata creation are not difficult if you start practicing them from the very beginning of your efforts at creating a collection.  The opposite is also true –  it is very difficult to find and use your source files without thoughtful data management.

In this chapter, we review data management and metadata creation. It’s easy and fun. Let’s get started!

Activity and Discussion:  Creating Metadata

Explore one or more language archives (e.g., AILLA, PARADISEC, CoRSAL). What metadata is shown in these archives? What metadata in these archives is most relevant to your project? Is there any metadata you feel the archive entries you looked at are lacking? What types of searches can you do? Are the search results what you expected?

 

3.2 File Naming

We provide some principles here on file naming:

  • Keep the pattern consistent. We suggest starting with the ISO 639-3 code followed by the date of recording or writing. If you don’t know the date of creation, you could use the date of acquisition instead.
  • Keep file names short and unique. The file name should not be the place where you store all possible metadata. That information should be on a separate list that is linked to your file name.
  • Use numerals with leading zeros (01, 02, 03) so that they can be sorted correctly.
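To see why leading zeros matter, here is a minimal Python sketch that compares a plain alphabetical sort of file names with and without padding. The file names are made up for illustration, and your own file browser may sort differently from this simple sort.

```python
# Hypothetical file names, used only to illustrate sorting behaviour.
unpadded = ["lmk2018-09-21_1.wav", "lmk2018-09-21_2.wav", "lmk2018-09-21_10.wav"]
padded = ["lmk2018-09-21_01.wav", "lmk2018-09-21_02.wav", "lmk2018-09-21_10.wav"]

# A plain alphabetical sort compares names character by character.
sorted_unpadded = sorted(unpadded)
sorted_padded = sorted(padded)

# Without padding, recording 10 lands between recordings 1 and 2.
print(sorted_unpadded)
# With padding, the recordings stay in the order they were made.
print(sorted_padded)
```

With padded numerals, the alphabetical order and the recording order are the same, so no special sorting tool is needed.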

Here is an example of a file name that provides four types of information, going from general information to more specific.

  1. ISO 639-3 code (3 characters)
  2. Date of creation (YYYY-MM-DD) (10 characters)
  3. Number of the recording for that day (_0#) (3 characters)
  4. File type extension (e.g., .wav, .mp4, .pdf, .txt, .eaf) (~3 characters)

Here is an example: lmk2018-09-21_01.wav

You may create additional files based on this file, such as transcriptions or translations. These files will have the same name, but different file type extensions. These may look as follows:

  • If you have a transcription of a recording: lmk2018-09-21_01.txt
  • If you have an ELAN transcription of a recording: lmk2018-09-21_01.eaf
  • If you have a Praat annotation grid: lmk2018-09-21_01.textgrid
  • If you have a FLEx interlinear gloss file: lmk2018-09-21_01.flextext

The exact file naming convention will differ from project to project. We suggest keeping the name short (20 characters or fewer), consistent, and focused on chronology (when did I create this? and, on that day, which recording was this?).
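The naming pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_file_name` and `is_valid_file_name` are hypothetical, and the pattern assumes exactly the four parts listed earlier (three-letter code, date, two-digit number, extension). Adjust it if your project uses a different convention.

```python
import re
from datetime import date

# Assumed pattern: 3-letter ISO 639-3 code, YYYY-MM-DD, _NN, then an extension.
FILE_NAME_PATTERN = re.compile(r"[a-z]{3}\d{4}-\d{2}-\d{2}_\d{2}\.[A-Za-z0-9]+")


def build_file_name(iso_code: str, created: date, number: int, extension: str) -> str:
    """Assemble a file name from the four parts, going from general to specific."""
    if not re.fullmatch(r"[a-z]{3}", iso_code):
        raise ValueError("The ISO 639-3 code should be three lowercase letters.")
    if not 1 <= number <= 99:
        raise ValueError("The recording number should be between 1 and 99.")
    # isoformat() gives YYYY-MM-DD; :02d adds the leading zero.
    return f"{iso_code}{created.isoformat()}_{number:02d}.{extension.lstrip('.')}"


def is_valid_file_name(name: str) -> bool:
    """Check whether an existing file name follows the assumed pattern."""
    return FILE_NAME_PATTERN.fullmatch(name) is not None


print(build_file_name("lmk", date(2018, 9, 21), 1, "wav"))
print(is_valid_file_name("lmk2018-09-21_01.eaf"))
print(is_valid_file_name("recording final version.wav"))
```

A check like `is_valid_file_name` can be run over a whole folder from time to time to catch names that have drifted away from the pattern.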

Next, let’s look at how we are going to organize those files into folders.

3.3 Foldering for Project Management

It might be helpful to place all related files in one folder. For example, your folder contents will include a source file, like an audio recording, and other items directly pertaining to that file, such as transcriptions, translations, and handwritten notes. Here is an example of a list of folders from Dr. Sadaf Munshi’s project on the Mankiyali language.

 

Figure: A fieldworker’s folder for all elicitation sessions (project management folders).

These files are then placed in the Source Sound Files folder.

 

Figure: The inside of one of the elicitation folders (project management folder contents).

For each file, you will want to note information that will be necessary for archiving and for any reuse of that file.  In addition, you will want to know if there are any related files. For example, a source video file will have related subtitle files. We can use the data management tool SayMore to manage folders of related files. These folders are called “session” folders in SayMore. In Chapter 4, you will have an activity to experiment with creating a session folder.

 

Figure: The contents of a SayMore project session.

Whether you create folders yourself or use SayMore to create the organization, you will find that managing your files with considered naming and foldering practices will make it easier to find your files, make it easier not to lose your files, and make archiving a much more pleasant experience.
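If you organize folders yourself, the idea of keeping a source file together with its related files can be sketched as follows. This is a minimal illustration, not a description of how SayMore works internally: the function name `plan_session_folders` is hypothetical, and it assumes that related files share the same base name and differ only in their extension, as in the file naming section above.

```python
from collections import defaultdict
from pathlib import PurePath


def plan_session_folders(file_names):
    """Group file names by their shared base name (the part before the extension)."""
    sessions = defaultdict(list)
    for name in file_names:
        base = PurePath(name).stem  # "lmk2018-09-21_01.eaf" -> "lmk2018-09-21_01"
        sessions[base].append(name)
    # Sort folders and their contents so the plan is easy to read.
    return {base: sorted(names) for base, names in sorted(sessions.items())}


# Hypothetical file names for illustration.
files = [
    "lmk2018-09-21_01.wav",
    "lmk2018-09-21_01.eaf",
    "lmk2018-09-21_02.wav",
    "lmk2018-09-21_01.txt",
]
for folder, contents in plan_session_folders(files).items():
    print(folder, contents)
```

The function only produces a plan. It does not move any files, so you can review the grouping before creating folders by hand or with another tool.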

Discussion: What are your current methods of file naming? How do you use folders to organize materials? Where and how do you keep track of the who, what, where, which, and when related to each file’s content?

 

3.4 Metadata

In the following sections, we discuss specific metadata fields you may see in an archive such as CoRSAL. We do this to show the following:

  1. What types of information are often collected for items in a digital language collection?
  2. How do these types of information fit into the library metadata vocabulary? Remember that library cataloging serves many different fields, and digital linguistic objects are new to this world. We are fitting our linguistic practices into a cataloging world that has been analog until recently.
  3. How can the CoRSAL metadata serve as a template or guide for metadata creation? Using a guide of this sort brings uniformity to your collection, which makes browsing and searching easier. It also helps ensure that all users and depositors interpret metadata fields in the same way.

Item level metadata

For each of your files, you will want to keep track of the following information. These terms were not created specifically for language or linguistic information, but are used in many library metadata schemes. First, we list the item-level metadata needed (information about each item), and then we provide examples and templates for how to think about and note down that information.

  • Titles
    • Main title
    • Parallel title
    • Series title
  • Creator
    • Roles
  • Contributor
    • Roles
  • Coverage
  • Language
  • Date
  • Content Description
  • Physical Description
  • Subject and Keywords

While you are collecting your source files, you can record this information in an Excel spreadsheet, a text document in list form, or into a SayMore project. This is really your choice. At an archive like CoRSAL, a staff member will ask you for this information when they help you build your metadata for archiving. For other archives, you will enter this information into their system. The more information you have and the more complete your metadata is, the easier your material will be to find in the future.  Talk to your archivist before you do a lot of collecting so you know what information you need and how best to store it.
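If you choose a spreadsheet, one possible layout is one row per file and one column per metadata field. The Python sketch below writes such a table as CSV text, which Excel and similar programs can open. The column names and the function name `metadata_to_csv` are our own assumptions for illustration, not a format required by CoRSAL or any other archive, and the sample record is invented.

```python
import csv
import io

# Assumed column names, based on the item-level fields listed above.
FIELDS = [
    "file_name", "main_title", "parallel_title", "series_title",
    "creator", "creator_role", "contributor", "contributor_role",
    "coverage", "language", "date",
    "content_description", "physical_description", "keywords",
]


def metadata_to_csv(records):
    """Turn a list of dictionaries into CSV text with one row per item."""
    buffer = io.StringIO()
    # Fields missing from a record are left as empty cells (restval="").
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# An invented record for illustration only.
sample = [{
    "file_name": "lmk2018-09-21_01.wav",
    "main_title": "Retelling of the Pear Story",
    "date": "2018-09-21",
    "physical_description": "1 recording (7 mins., 42 secs.)",
}]
print(metadata_to_csv(sample))
```

Leaving a cell empty when you do not yet know the information keeps the table honest, and the empty cells show you what to ask about later.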

Here we provide some details on what information is needed for each file for the CoRSAL archive.

Titles

Main Title

All items must have a short, informative title in English that accurately describes what the item is. By using this list as a guide, you can keep your titles clear and consistent throughout your collection.

| Type of item | Example title |
| --- | --- |
| Retelling of: use this when there is a well-known story or a picture book or video that is retold | Retelling of the Pear Story |
| Traditional story about: use this when the story is a folktale | Traditional story about the squirrel and bat |
| Monologue on: use this for one person speaking about opinions, history, interpretation | Monologue on the state of the Lamkang language now and in the past |
| Description of: use this for descriptions of procedures such as cooking, fishing, or making baskets | Description of how medicines are made from the plants in kitchen gardens |
| Performance of: use this for someone singing, dancing, reciting a poem | Performance of the ritual blessing song |
| Personal narrative on: use this for a person talking about their personal experience | Personal narrative on education and work |
| Conversation about: use this for two or more people talking | Conversation about the festivals in town |
| Elicitation of: use this when you are listing vocabulary or grammar examples and getting responses from speakers | Elicitation of words about vegetables |
| Analytical discussion about: use this for a guided conversation between linguist and speaker about specific aspects of the language, more detail about previously collected stories, etc. | Analytical discussion about modals |
| Speech about: use this for a talk prepared ahead of time and performed/spoken in front of a large group | Speech about the elections |
| Reading of: use this for readings of wordlists or when written stories are read out loud | Reading of the wordlist on verbs |
| Photograph of: use this for photographs | Photograph of a traditional house |
| Unpublished manuscript about: use this for handwritten or typed notes that have not been published | Unpublished manuscript about the spelling system |
| Analytical notes on: use this for written paradigms, notes on technical linguistic subjects, etc. | Analytical notes on verb paradigms |
| Letter to: use this for letters, audio or written | Letter to Rex’s brother |
| Published books, pamphlets, or magazines: use their official title | The 6th Triennial Fellowship, Lamkang Naga Baptist Association: Hymn Book |
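If you keep your titles in a spreadsheet or list, a small script can help you spot titles that do not start with one of the openings in the table above. The following is a minimal sketch. The function name `title_prefix` is hypothetical, and the check is exact and case-sensitive, which is our own simplification. Remember that published books, pamphlets, and magazines keep their official titles, so a missing prefix is a prompt to look again, not necessarily an error.

```python
# Title openings taken from the table above.
TITLE_PREFIXES = [
    "Retelling of", "Traditional story about", "Monologue on", "Description of",
    "Performance of", "Personal narrative on", "Conversation about",
    "Elicitation of", "Analytical discussion about", "Speech about",
    "Reading of", "Photograph of", "Unpublished manuscript about",
    "Analytical notes on", "Letter to",
]


def title_prefix(title):
    """Return the matching opening, or None if the title uses none of them."""
    for prefix in TITLE_PREFIXES:
        if title.startswith(prefix + " "):
            return prefix
    return None


for title in ["Retelling of the Pear Story", "Pear Story"]:
    print(title, "->", title_prefix(title))
```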

Parallel Title

The Parallel Title is the title translated into the source language. For example, in the Lamkang Language Resource, where the Main Title is the “United Nations Declarations on the Rights of Indigenous Peoples,” the Parallel Title is “Chaatti Kunpun ni Mdopandandok Indigenous Miirek ki Ruhtanna.” An item such as a story or a poem will likely have a name in the source language. Other items, such as a word list or your notes on grammar, may have only an English title; that is okay. This field is optional.

 

Series Title

We use the Series Title field to group items within a collection. For example, the Lamkang Language Resource includes the Shobhana Chelliah Collection and the Rex Khullar Collection, among others. This makes it easier to find just those items contributed by Rex Khullar. This field is optional.

Creator Field

The Creator field is used to indicate who made the item. We have provided the most common roles for the Creator field for our language documentation projects. You will find that identifying the role of the creator becomes much easier after you have used this list a few times.

| Role | Use for |
| --- | --- |
| Analyst | A person or group that provided linguistic analysis of language data. |
| Author | A person that wrote the textual item (e.g., book, pamphlet, poem). |
| Collector | A person who has brought together material for the collection. They may have recorded or written the materials themselves, or brought together recordings and materials by someone else. If someone records a speech event, but does no other speaking or questioning, you may also use the collector role. |
| Interviewer | A person who elicits language data through engaging in conversation or questions couched in conversation. |
| Research team member | A person who participated in a research project, but whose role did not involve direction or management of it (use sparingly). |
| Researcher | A person or organization responsible for performing research (use sparingly). |
| Research team head | A person who spearheads the intellectual activities of a research project (use sparingly). |
| Photographer | A person or organization responsible for taking photographs. |
| Transcriber | A person who prepares a handwritten or typed transcript of an audio or video recording, or re-types field notes or legacy materials. |
| Translator | A person who renders a text from one language into another, or from an older form of a language into the modern form. |

 

Contributor Field

Here, we need to say who else is involved in the making of the recording, like the speaker, or someone who helps with the recording equipment.

| Role | Use for |
| --- | --- |
| Analyst | A person who provides linguistic analysis of language data. |
| Consultant | A person who provides specialized knowledge, anything from acoustic analysis or videography to cultural expertise. Note that this is separate from a linguistic consultant; see “Speaker” below. |
| Interviewee | A person who provides language data through informal conversation or in response to informal questions. |
| Performer | A person who participates in a staged performance, e.g., by singing songs, reciting a poem, or acting in a play. |
| Research team head | A person who directs or manages the research project leading to the recording (use sparingly). |
| Research team member | A person who participates in a research project, but whose role did not involve direction or management of it (use sparingly). |
| Researcher | A person who conducts research which includes creating this recording. For example, this could be someone hired by a grant to help with research (use sparingly). |
| Speaker | A person who contributes to a resource by speaking in the language of this resource collection, most likely a native speaker of the language. |
| Transcriber | A person who prepares a handwritten or typed transcript of an audio or video recording, or re-types field notes or legacy materials. |
| Translator | A person who renders a text from one language into another, or from an older form of a language into the modern form. |

Activity: Come up with at least 5 scenarios in which source data is recorded. For example, a fieldworker is recording word lists to understand the phonetic inventory of the language, or a young woman is recording her grandmother’s retelling of folk stories. In each case, think of additional people who may create derivative files from the source file. For example, the folk stories may be transcribed and translated by a PhD student with the help of the young woman who did the recording. In the situations you have listed, who are the Creators and who are the Contributors?

 

Since you will need to provide the names of the creators and contributors, you will want those names to be listed and spelled consistently. This can be tricky with South Asian names, as these may or may not include fully spelled out or abbreviated caste names, family names, nicknames, and clan names. It is very helpful to keep a list of the names of the contributors and creators. You can decide how you would like each name to be displayed in the archive.

Here is what such a list might look like:

| The name | How to write it in the DL | What each part of the name means |
| --- | --- | --- |
| Shobhana Lakshmi Chelliah | Chelliah, Shobhana Lakshmi | Family Name, Given Name (first), Given Name (middle) |
| Rex Rengpu Khullar | Khullar, Rex Rengpu | Clan Name, Given Name, birth order name |
| Th. Harimohon Singh | Singh, Harimohon Thounaojam | Caste title, Given Name, Family Name |
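Because the parts of a name cannot be rearranged by a general rule, one workable approach is to keep the list as a lookup table and always copy the archive form from it. Here is a minimal sketch of that idea using the three names from the list above. The function name `archive_name` is hypothetical. Note that the function refuses to guess: a name that is not on the list raises an error so that someone can decide how it should be written.

```python
# Lookup table built from the name list above: everyday form -> archive form.
NAME_LIST = {
    "Shobhana Lakshmi Chelliah": "Chelliah, Shobhana Lakshmi",
    "Rex Rengpu Khullar": "Khullar, Rex Rengpu",
    "Th. Harimohon Singh": "Singh, Harimohon Thounaojam",
}


def archive_name(name):
    """Return the agreed archive spelling, or raise an error for unknown names."""
    cleaned = " ".join(name.split())  # remove stray spaces before looking up
    if cleaned not in NAME_LIST:
        raise KeyError(f"'{cleaned}' is not on the name list yet; add it first.")
    return NAME_LIST[cleaned]


print(archive_name("Rex Rengpu  Khullar"))
```

Raising an error instead of guessing keeps spellings consistent, because every new name has to be added to the list once, on purpose.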

Note: If you have already listed someone under Creator, do not repeat that person under Contributor. For example, if someone records themselves singing a song in their language, they will be the Creator of the item (Collector), even though they are also the Contributor (Performer, in this case).

Read and discuss the issues of naming raised in the following article: Burke, M., & Chelliah, S. (2021). Challenges to representing personal names and language names in language archives: Examples from Northeast India. Proceedings of the 1st International Workshop on Digital Language Archives 2021. https://hdl.handle.net/2142/111713.

 

Coverage

This field is for the geographical area where the language is spoken. Coverage is a repeatable field–this means there can be multiple geographical areas listed on one item. For example, many items in the Mankiyali Language Resource include the name of the village in Pakistan where the language is spoken in the Coverage field, even if the recording was made at the University of North Texas in Denton, TX, USA. It is also possible to include other geographical areas discussed in the item, like a story about a trip to Nagaland. Even though the source language is spoken primarily in Assam, we can also include ‘Nagaland’ to indicate the subject of the story.

Deciding what goes into the Coverage field can be confusing. The area where the language is spoken is the best place to start. CoRSAL staff can help you think through the other place names to include.

 

Date

This field reflects when the item was created. Write the year first, then the month, then the day. So, for October 10, 2008, the field will look like this: 2008-10-10. You may not know the whole date, but provide as much as you can. If you are not sure of the date, add a question mark after it, like this: 2008-10-10?. To mark a date as approximate, add a tilde (~) after it, like this: 2008-10-10~.
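The date rules above can be sketched as a small Python function. The function name `format_item_date` is hypothetical. Two points are our own assumptions and should be checked with your archivist: that a partial date is written as just the year, or the year and month, and that a date is marked as either uncertain or approximate, not both.

```python
from datetime import date


def format_item_date(year, month=None, day=None, uncertain=False, approximate=False):
    """Write a date year-first, adding ? for uncertain or ~ for approximate."""
    if day is not None and month is None:
        raise ValueError("A day can only be given together with a month.")
    if uncertain and approximate:
        raise ValueError("Choose either uncertain or approximate (assumption).")
    if month is None:
        text = f"{year:04d}"  # only the year is known
    elif day is None:
        if not 1 <= month <= 12:
            raise ValueError("The month should be between 1 and 12.")
        text = f"{year:04d}-{month:02d}"  # year and month are known
    else:
        # date() raises ValueError for impossible dates such as 2008-02-30.
        text = date(year, month, day).isoformat()
    suffix = "?" if uncertain else "~" if approximate else ""
    return text + suffix


print(format_item_date(2008, 10, 10))
print(format_item_date(2008, 10, 10, uncertain=True))
print(format_item_date(2008, 10, 10, approximate=True))
```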

Content Description

Although not required, we have found it helpful to provide our depositors at CoRSAL with the following template for content description. This template ensures some agreement, both between different collections and within the same collection, on what is meant by a ‘monologue’ or ‘performance’, for example. Before we created the template, our experience was that different researchers had different ideas about what each of these terms meant. As a result, there was no way to search across language deposits for retellings of the Pear Story, for example. Some descriptions included a story plot while others did not; this unevenness in the ‘content description’ field made browsing the collection less predictable and more taxing.

In the content description, you say what the item is about. If it is a photograph, describe who and/or what is photographed and what the context is (e.g., a festival or special event). If it is a story, describe what happens in the story, or say who the story is about. You can also give more information on who is telling it, like what village or country they are from. You can also add further information about the item, like whether a story is only told on special occasions, or what makes the item unique. Here are some templates to help you describe your materials.

| Genre | Description template | Example |
| --- | --- | --- |
| Retelling | This is a retelling of {name of the story} narrated by {speaker}. The story is about {maximum 100 word description}. | This is a retelling of the Pear Story narrated by Beshot Khullar. In this story a boy steals a basket of pears and meets and shares pears with three other boys. |
| Traditional story | This is a traditional story about {maximum 100 word description}. | This is a traditional story about when Benglam chases a tiger and traps him in a tree, after which the tiger makes Benglam answer riddles. |
| Monologue | This is a monologue on {maximum 100 word description}. | This is a monologue on why it is important to document a language. |
| Description | This is a description of {maximum 100 word description}. | This is a description of how to harvest rice from the paddy. First, they gather the rice on a winnowing fan to separate the husk. After that, the rice is stored in a granary. |
| Performance | This is a performance of {maximum 100 word description}. | This is a performance of housewarming songs. These songs are performed as part of the spring festival by men and women. |
| Personal narrative | This is a personal narrative about {maximum 100 word description}. | This is a personal narrative about the time Sumshot Khular played soccer in Texas. |
| Conversation | This is a conversation about {maximum 100 word description}. | This is a conversation about preparations for a festival. Father and son discuss what needs to be cooked for the feast. |
| Discussion | This is a discussion about {maximum 100 word description}. | This is a discussion about how to weave fishing baskets in the traditional way. |
| Elicitation | This is an elicitation of/about {maximum 100 word description}. | This is an elicitation about modal verbs based on pictures and responsive sentences. |
| Analytical discussion | This is an analytical discussion about {maximum 100 word description}. | This is an analytical discussion about modals based on 20 sentences collected through elicitation. The speaker and researcher identify substitutions for the modals that occur in a previously elicited list of sentences. |
| Speech | This is a speech about {topic} given at {event, setting, time, place}. | This is a speech about the importance of language documentation given at a workshop for digital archiving at IGNCA in June 2019. |
| Reading | This is a reading of {maximum 100 word description}. | This is a reading of a word list on tone in compounds. |
| Interview | This is an interview of/about {maximum 100 word description}. | This is an interview of two young Manipuri girls about going to school in Manipur. |
| Photograph | This is a photograph of {maximum 100 word description}. | This is a photograph of chickens in a field in Chandel during harvest time. They are of interest because of their unique plumage. |
| Manuscript | This is a manuscript of/about {maximum 100 word description}. | This is a manuscript of handwritten field notes on the Lamkang language from three Lamkang researchers collected at the 2013 orthography workshop at Don Bosco. |
| Analytical notes | These are analytical notes on {maximum 100 word description}. | These are analytical notes on discourse markers in Lamkang. |
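If you are describing many items, the templates above can be filled in with a short script. The sketch below covers only four of the genres to keep it brief. The function name `describe` and the placeholder names (`story`, `speaker`, `description`, `topic`, `setting`) are our own simplifications of the wording in the templates, and the 100-word limit is checked by counting words separated by spaces.

```python
# A few of the templates above, with simplified placeholder names (assumption).
TEMPLATES = {
    "retelling": "This is a retelling of {story} narrated by {speaker}. "
                 "The story is about {description}.",
    "traditional story": "This is a traditional story about {description}.",
    "monologue": "This is a monologue on {description}.",
    "speech": "This is a speech about {topic} given at {setting}.",
}
MAX_WORDS = 100


def describe(genre, **fields):
    """Fill in the template for a genre, keeping the description to 100 words."""
    template = TEMPLATES[genre.strip().lower()]
    description = fields.get("description", "")
    if len(description.split()) > MAX_WORDS:
        raise ValueError(f"The description is longer than {MAX_WORDS} words.")
    # format() raises KeyError if a placeholder the template needs is missing.
    return template.format(**fields)


print(describe("Monologue", description="why it is important to document a language"))
```

Filling in templates this way keeps the opening words of every description identical, which is what makes searching across a collection predictable.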

Here is a brief list of other things the depositor may want to include in the description:

  1. Birth year/gender/occupation/village the speaker is from
  2. The specific variety being spoken (Yasin Burushaski v. Nagar Burushaski)
  3. Whether there is a transcription or any analysis of the recording, or any related items
  4. Notes on the setting/context of recording (e.g., recorded during a wedding)

Sometimes you do not have all the information about an item, and that is okay. Include as much as you can.

 

Physical Description

For audio and video recordings, this is the duration of the recording. For a textual item, like a transcription, translation, or book, this is the number of pages. Use the following examples:

Audio and video recordings: 1 recording (7 mins., 42 secs.)

Textual items: 18 p. ; pdf

Photographs: 1 photograph : digital, col.
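Audio software often reports the length of a recording in seconds, so the recording example above can be produced with a little arithmetic. This is a minimal sketch: the function names `recording_extent` and `text_extent` are hypothetical, and the wording simply copies the examples above. The examples do not show what to write for a single minute or second, or for recordings longer than an hour, so check those cases with your archivist.

```python
def recording_extent(total_seconds):
    """Format a duration in seconds like the audio and video example above."""
    if total_seconds < 0:
        raise ValueError("A duration cannot be negative.")
    minutes, seconds = divmod(total_seconds, 60)  # 462 -> (7, 42)
    return f"1 recording ({minutes} mins., {seconds} secs.)"


def text_extent(pages, file_format="pdf"):
    """Format a page count like the textual item example above."""
    return f"{pages} p. ; {file_format}"


print(recording_extent(462))
print(text_extent(18))
```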

 

Subject and Keywords

The Subject and Keyword fields are tools you can use to increase the use of your item. You might use keywords to provide the following types of information:

  1. Linguistic information  (e.g., code-switching, serial verbs, relative clauses, agglutination)
  2. Ethnographic information (e.g., rice cultivation, bamboo, festivals, hunting, medicinal plants, dances, implements, clothing)
  3. Communicative patterns (e.g., questions, exclamations, warnings, greetings)

 

Discussion: What metadata might make a collection useful to different users? For example, a language teacher, a botanist, a phonetician? What kinds of information about grammar could be included in subject or keyword fields? How about for everyday traditional practices like cooking, weaving, and hunting?

 

Traditional Knowledge Labels

Traditional Knowledge labels help translate traditional concepts into protocols for cultural knowledge sharing and use. Using these Traditional Knowledge Labels conveys important information on how the knowledge you are sharing is viewed by the culture from which it originates. The labels can convey proper use, guidelines for action, and responsible stewardship (Local Contexts, 2017). Traditional Knowledge labels can be viewed in more detail on the Local Contexts page, but here are some examples:

  • The TK CO label indicates that this material is traditionally and usually not publicly available.

“This label is being used to indicate that this material is traditionally and usually not publicly available. The label is correcting a misunderstanding about the circulation options for this material and letting any users know that this material has specific conditions for circulation within the community. It is not, and never was, free, public and available for everyone at anytime. This label asks you to think about how you are going to use this material and to respect different cultural values and expectations about circulation and use.”

  • TK NC label indicates that material has been designated as being available for non-commercial use.

“This material has been designated as being available for non-commercial use. You are allowed to use this material for non-commercial purposes including for research, study or public presentation and/or online in blogs or non-commercial websites. This label asks you to think and act with fairness and responsibility towards this material and the original custodians.”

Even if an archive does not have a systematic way of indicating these labels, the information can be included in a notes field.

 

3.5 Project: Create Metadata for an item

Look at the following and consider the metadata created by the depositor: http://sevensistersmusic.com/?s=chakpa.  

Here is an example:

  • What is your name? Marjing
  • Who took this video? Chakpa Andro Panam Ningthou Meihoubirol Cultural Association
  • Give this video a title: Indigenous Dance and Songs of Chakpa Andro
  • Where was this video taken?
    • Village/Town/City: Andro
    • State: Manipur
    • Country: India
  • When was this video taken?
    • Year: 2017
    • Month: December
    • Day: 12
  • Please describe this item and let us know why it is important to you: One song from the album Chakpa Abdri Yekhou Jagoi – Eshei. This Manipuri folk dance is accompanied by two pena players. The pena players don’t use a standardized key, just a natural harmony to go along with their singing. The dance and the song are about the Langmeiching, which is a mythical mountain significant in the Meetei religion. There is a belief that the Meetei people originate from this mountain. This is a place that they worship. There is a transcription of the song and a translation into English.
  • Keywords
    • Andro
    • Manipur
    • Pena
    • Folk dance
    • Musical instrument
    • Meitei religion
    • Indigenous Pre-Vaishnavite
