Peter K. Austin
Linguistics Department, SOAS
8th August 2010

Forty-five years ago the annual fieldwork reports of some of the researchers funded by the then Australian Institute of Aboriginal Studies (now AIATSIS) included specifications of how much research had been completed in terms of the number of feet of tapes that had been recorded during the project year ("this year was especially productive with 45 feet 3 inches of tape being recorded"). The modern measure of this kind of quantitative nonsense is the number of gigabytes of digital files (soon to be terabytes) created by the researcher. Don't mind the quality, it's the length/bytes that count.

My colleague David Nathan, Director of the Endangered Languages Archive (ELAR) at SOAS, has been approached on several occasions by researchers (both those funded by ELDP and those not (yet)) asking how much data they would be allowed to deposit in the archive. "Would it be OK if I deposit 500 gigabytes of data?" they ask. When you think about it for a moment or two, this is a truly odd request, but one driven by part of what David (in Nathan 2004, see also Dobrin, Austin and Nathan 2007, 2009) has termed "archivism". This is the tendency for researchers to think that an archive should determine their project outcomes. Parameters stated in terms of audio resolution and sampling rate, file format, and encoding standards take the place of discussions of documentation hypotheses, goals, or methods that are aligned with a project's actual needs and intentions. David's response to such a question is usually: if the material to be deposited is "good quality" (stated in terms of some parameters (not volume!) established by the project in discussion with ELAR) then the archive will be interested in taking it.

Another quantity that comes up in this context (and in the context of grant applications as well) is the statement that "10% of the deposited archival data will be analysed". The remainder of the archive deposit will be, in the worst case, a bunch of media files, or in the best case, media files plus transcription (and/or translation). Where does this magical 10% come from? It seems to have originated around 10 years ago with the DOBES project which established a set of guidelines for language documentation during its pilot phase in 2000. As Wittenburg and Mosel (2004:1) state:

"During a pilot year intensive discussions ... took place amongst the participants. The participants agreed upon a number of basic guidelines for language documentation projects. ... For some material a deep linguistic analysis should be provided such that later researchers will be able to reconstruct the (grammar of the) language"

Similarly, the guidelines for ELDP grant applications (downloadable here) include the following:

"Note that audio and video are not usable, accessible or archivable without accompanying textual materials such as transcription, annotation, or notes about content and participants. While you are encouraged to transcribe and annotate as much of the material as possible, we recognise that this is very time-consuming and you may not be able to do this for all recorded materials. However, you must provide some text indication of the content of all recordings. This does not have to be the linguistic content and could include, for example, description of the topics or events (e.g. names of songs), or names of participants, preferably with time alignment (indication of where they occur in the recording)."

No actual figure is given of how much "some material" (for DOBES) or "as much of the material as possible" (for ELDP) amounts to. In earlier published versions of advice to applicants both DOBES and ELDP did mention 10%.

Interestingly, Wittenburg (2009, slide 34) has done an analysis of the language documentation data collected by DOBES projects between 2000 and 2009, and he notes that the average project team has recorded 131 hours of media (59 hours of audio, 72 hours of video), transcribed 50 hours of this, and translated 29 hours. Linguistic analysis on average exists for 14 hours of recordings -- strikingly this is about 10.7% of the average corpus!
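The proportions implied by these averages are easy to check. The following is only a rough sketch: the figures are the ones quoted above from Wittenburg's slide (presumably rounded), and the names `stages` and `share_of_corpus` are made up for illustration rather than taken from any DOBES tool.

```python
# Average per-project DOBES figures as quoted above from Wittenburg (2009, slide 34).
audio_hours = 59
video_hours = 72
recorded_hours = audio_hours + video_hours  # 131 hours of media in total

# Hours of the average corpus that reached each processing stage.
stages = {"transcribed": 50, "translated": 29, "analysed": 14}


def share_of_corpus(hours, total=recorded_hours):
    """Return the given hours as a percentage of the total recorded hours."""
    return 100 * hours / total


for stage, hours in stages.items():
    print(f"{stage}: {hours} h = {share_of_corpus(hours):.1f}% of {recorded_hours} h")
```

On these numbers roughly 38% of the average corpus is transcribed, 22% translated and just under 11% analysed, so the analysed share sits close to the 10% figure while the transcribed share is nearly four times larger.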

How much of the corpus needs to be linguistically annotated so that "later researchers will be able to reconstruct the (grammar of the) language" or indeed so that the rest of the corpus can be parsed? Well, it depends on a range of factors, including the nature of the language(s) being documented. Some Austronesian languages, like Sasak or Toratan, have relatively little morphology, and what morphology does exist has pretty straightforward morpho-phonemics, so a relatively small amount of morpheme-by-morpheme glossed material in conjunction with a lexicon would enable users to bootstrap the morphological analysis of other parts of a transcribed corpus in those languages. Other languages, like Athapaskan tongues with their fiendishly complex verb morphology, might need more annotated data to help the user deal with the whole corpus.

This is however an empirical question, and one that to my knowledge has not been addressed so far. There are now a number of documentary corpora available, with more coming on stream, and it should be possible to establish whether the "magical 10%" is a real goal to be aimed for, or just a figure that researchers have created and continue to repeat to one another.


Thanks to Anthony Jukes, David Nathan and Mandana Seyfeddinipur for discussion of some of the ideas presented here; none of them is responsible for the opinions expressed however.

Dobrin, Lise, Peter K. Austin and David Nathan. 2007. Dying to be counted: commodification of endangered languages. In Peter K. Austin, Oliver Bond and David Nathan (eds.) Proceedings of Conference on Language Documentation and Linguistic Theory, 59-68. London: SOAS. (online here)
Dobrin, Lise, Peter K. Austin and David Nathan. 2009. Dying to be counted: the commodification of endangered languages in language documentation. In Peter K. Austin (ed.) Language Documentation and Description, Volume 6, 37-52. London: SOAS.
Nathan, David. 2004. 'Documentary linguistics: alarm bells and whistles?', Seminar presentation, SOAS. 23 November 2004.
Wittenburg, Peter. 2009. Introduction to DOBES - Overview. Powerpoint slides, DOBES training course, June 2009.
Wittenburg, Peter and Ulrike Mosel. 2004. The DOBES Programme and its Contribution to Standardization and Revitalization. Paper presented at Linguapax 2004. (online here)


Thanks for mentioning this issue, Peter. Fieldwork is notoriously hard to “measure”, and outcomes may not be seen for years after linguists have had time to mentally digest all the material and immaterial findings that come from fieldwork. So it’s obvious that granting agencies would look for something measurable to use in their reports and hence to serve as a requirement for linguists working on grants.

I think that most linguists don’t take the measurement of their fieldwork output very seriously, instead seeing it as yet another grant hurdle to overcome. There are already so many seemingly arbitrary requirements in filing for grants and in fulfilling grant obligations once awarded that the archive quantity and 10% requirements are just one more thing in the pile. Certainly a good fieldworker knows intuitively when they’ve collected a reasonable amount of material, and when they have done intensive transcription and translation of enough to be somehow useful to others.

On the other hand, there are plenty of people who aren’t experienced fieldworkers who will take such “measurements” of fieldwork productivity as adequate for evaluation of one’s work. That issue leaves me with a sense of disquiet.

Something else to consider is that a percentage is a relative quantity. Some folks I know have well over 1000 hours of recorded material, and I think nowhere near ten percent of that has been transcribed. Asking someone to transcribe ten percent of a collection that size before an archive is willing to accept it is a bit unreasonable.

In addition, the “good quality” requirement needs to be looked past sometimes too. If you stumble upon a recording made in the 1960s on a lousy reel-to-reel which has since been converted to cassette, and it’s got a bunch of noise in the background, but it’s of a person speaking a dialect long since lost or perhaps a person speaking in a style that is now extinct, wouldn’t this recording be far more valuable in some sense than a recording made just yesterday?

Regarding those "pesky Athapaskan tongues with their fiendishly complex verb morphology", I think we might find that the high degree of cognacy across the prefix complex puts a lower burden on the user to parse untranscribed verbs. With some knowledge of the language-specific prefix phonology an Athabaskanist can probably tease out the prefixes even in an unfamiliar Athabaskan language. This is much more difficult when there are huge differences in morphological structure across the family.


Many thanks for your detailed comments -- I am currently preparing another post where I address the issue of depositing "1000 hours of recorded material" less than 10% of which is transcribed. As for "good quality", note that I did not define what parameters the depositor and the archive might agree on to decide that (other than that data volume is not one of them). For some thoughts from January 2008 about what the possible parameters might be, have a look here.
