

[From our man back from the Netherlands, Peter K. Austin, Endangered Languages Academic Programme, SOAS]

We had an interesting discussion about documentation corpora in the course I taught last week for the LOT winter school at the Universiteit van Tilburg.

In the course I took the somewhat strong view that a documentary corpus minimally consists of: (a) media or text recordings (inscriptions), with (b) time-aligned transcription, (c) time-aligned translation, and (d) relevant metadata about the documentation and communicative context. Thus, on this view, the 150 hours of untranscribed video collected by a project that one of the students is involved in is not part of any corpus (though it might be what Himmelmann (2006:10) calls 'primary data' ("recordings of observable linguistic behaviour and metalinguistic knowledge"), or what OLAC calls 'a resource', and it might become part of the corpus when it is worked on in the future). Neither is the audio recording of a 6-person conversation that another student made in Sri Lanka, which neither he nor his consultants are able to transcribe. Media recordings without transcription or translation thus do not constitute data by themselves and don't document anything. This view of what a corpus is also appears in the DoBeS guidelines as presented in Brugman 2003, available here, and on the HRELP website. A corpus can be enriched by annotation (see Bird and Liberman 2001) with the addition of linguistic information like morphemic analysis, morpheme-by-morpheme glosses, part of speech tags etc. (see Schultze-Berndt 2006), or non-linguistic information like kinship relations or cultural practices etc. (see Franchetto 2006).
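To make the minimal definition concrete, here is a rough sketch in Python of how the four components (a) to (d) might be modelled, with a check for whether a given recording would count as part of the corpus on this view. The class names, field names and file name are my own illustrative assumptions; this is not the DoBeS or HRELP schema, nor the format of any particular annotation tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A stretch of text tied to a time span (in seconds) of the recording."""
    start: float
    end: float
    text: str


@dataclass
class Session:
    """One recording plus whatever work has been done on it so far.

    All names here are hypothetical, chosen only for illustration.
    """
    media_file: str                                                  # (a) inscription
    transcription: list[Annotation] = field(default_factory=list)   # (b)
    translation: list[Annotation] = field(default_factory=list)     # (c)
    metadata: dict[str, str] = field(default_factory=dict)          # (d)

    def missing_components(self) -> list[str]:
        """Name the components of the minimal definition that are absent."""
        missing = []
        if not self.transcription:
            missing.append("transcription")
        if not self.translation:
            missing.append("translation")
        if not self.metadata:
            missing.append("metadata")
        return missing

    def in_corpus(self) -> bool:
        """True only when the recording and all three other components exist."""
        return bool(self.media_file) and not self.missing_components()


# An untranscribed recording: a resource, but not yet part of the corpus.
raw = Session(media_file="conversation.wav")
print(raw.in_corpus())           # False
print(raw.missing_components())  # ['transcription', 'translation', 'metadata']
```

On this sketch the untranscribed video and the Sri Lankan conversation would both be `Session` objects for which `in_corpus()` is false until transcription, translation and metadata are added.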

I suggested in an earlier post that size may not be a useful criterion for determining the value of a documentary corpus. In class last week, we talked about what some evaluative criteria might be. Through discussions over a number of years, Robert Munro, David Nathan and I have come up with the following list of possible qualitative evaluative dimensions (in no particular order, and recognising that some may be in conflict) that could be applied to a documentary corpus:

  1. comprehensiveness -- to what degree does the corpus represent a range of speech event types and situations in which the language is used?
  2. uniqueness -- does the corpus contain material that is unusual or special in some way, or material that cannot be easily reproduced or collected again?
  3. novelty -- to what extent is the content of the corpus new, containing material never collected before?
  4. usefulness and adaptability -- can the corpus be used for a range of purposes? What range of potential users does it serve? Can the corpus be modified for uses other than those intended by the original collector? Can it be converted into other formats? To what degree does it meet the needs of stakeholders other than the collector?
  5. ethics -- was the corpus collected in a responsible manner in accordance with clearly stated ethical procedures? Are there explicitly stated protocols for access and use of the corpus? (See Holton 2005 [.pdf])
  6. organisation and management -- here we might identify several dimensions (see Gibbon 2002 [.pdf] for discussion of one possible model for fieldwork linguists; there are also useful resources here, especially Nick Thieberger's presentation):
    • explicitness and robustness -- is the corpus stored in a well-structured format that is portable (in the sense of Bird and Simons 2003) and transparent to other users? Are there explicit links between information in different parts of the corpus?
    • consistency -- are the annotation schemes (for transcription, glossing etc) applied rigorously across the whole corpus? Are the media recorded in a consistent manner?
    • meaningfulness -- do the annotation schemes have clear semantic interpretations?
    • conventionality -- is the representation of the data in the corpus in some commonly used or standardised format?
    • preservability -- is the corpus stored in such a way that it can be archived and preserved for future users? Munro 2005 sets out a 6-point scale of corpus archivability.
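
As a rough illustration of how these dimensions could become a shared descriptive vocabulary, the sketch below records a reviewer's free-text comments against each top-level dimension and reports which dimensions have not yet been addressed. The names `Dimension` and `unassessed`, and the sample comments, are hypothetical; no scoring scale is assumed, since the dimensions are qualitative.

```python
from __future__ import annotations

from enum import Enum


class Dimension(Enum):
    """The six top-level evaluative dimensions listed above."""
    COMPREHENSIVENESS = "comprehensiveness"
    UNIQUENESS = "uniqueness"
    NOVELTY = "novelty"
    USEFULNESS = "usefulness and adaptability"
    ETHICS = "ethics"
    ORGANISATION = "organisation and management"


def unassessed(review: dict[Dimension, str]) -> list[str]:
    """Return the names of dimensions that have no reviewer comment yet."""
    return [d.value for d in Dimension if not review.get(d, "").strip()]


# Invented example comments, for illustration only.
review = {
    Dimension.ETHICS: "Access protocol is stated explicitly.",
    Dimension.NOVELTY: "",  # left blank, so still counts as unassessed
}
print(unassessed(review))
```

Keeping the comments as free text rather than numbers reflects the fact that some dimensions may conflict and cannot sensibly be summed into a single score.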

Might these, and perhaps other dimensions, serve as a basis for a descriptive vocabulary for talking about documentary corpora? One of the concerns I heard expressed by the students in my LOT course, and by students and post-doctoral and other junior researchers at SOAS and elsewhere, is that corpus preparation work does not 'count' for the purposes of research evaluation, as demanded by our academic audit culture for things like job applications, promotion, tenure review etc. (Interestingly, despite the commodification of endangered languages research, it is one product that appears to have no value to the accounting system.)

As the LOT students noted, corpora tend to be left 'messy' or 'incomplete' or 'half-done' because researchers determine that time should be 'better spent' on the writing and publication of descriptive or theoretical materials which will be counted by the audit. If a review process that categorised corpora along these dimensions (or others) could be established, and given institutional backing, then perhaps the resulting evaluations would be a spur to getting corpus preparation judged more positively by everyone.

Note: thanks to Alexandra, Felix, Sebastian and Sonja for lively discussion in Tilburg. Rob Munro and David Nathan are not to be held responsible for any misuse of their ideas that I may have made.


Bird, Steven & Mark Liberman 2001 A formal framework for linguistic annotation. Speech Communication 33:23-60.
Bird, Steven & Gary Simons 2003 Seven dimensions of portability. Language 79:557-582.
Franchetto, Bruna 2006 Ethnography in language documentation. In Jost Gippert, Nikolaus Himmelmann & Ulrike Mosel (eds.) Essentials of Language Documentation, 183-211. Berlin: Mouton de Gruyter.
Gibbon, Dafydd 2002 Ubiquitous multilingual corpus management in computational fieldwork. LREC Proceedings.
Himmelmann, Nikolaus 2006 Language documentation: what is it and what is it good for? In Jost Gippert, Nikolaus Himmelmann & Ulrike Mosel (eds.) Essentials of Language Documentation, 1-30. Berlin: Mouton de Gruyter.
Holton, Gary 2005 Ethical practices in language documentation and archiving. OLAC Tutorial Archiving and linguistic resources or How to keep your data from becoming endangered. Linguistic Society of America annual meeting, Oakland CA.
Munro, Robert 2005 The digital skills of language documentation. In Peter K. Austin (ed.) Language Documentation and Description, Volume 3, 141-156. London: SOAS.
Schultze-Berndt, Eva 2006 Linguistic annotation. In Jost Gippert, Nikolaus Himmelmann & Ulrike Mosel (eds.) Essentials of Language Documentation, 213-251. Berlin: Mouton de Gruyter.

About the Blog

The Transient Building, symbolising the impermanence of language, houses both the Linguistics Department at Sydney University and PARADISEC, a digital archive for endangered Pacific languages and music.