Peter K. Austin
Linguistics Department, SOAS
8th August 2010
Forty-five years ago the annual fieldwork reports of some of the researchers funded by the then Australian Institute of Aboriginal Studies (now AIATSIS) included specifications of how much research had been completed in terms of the number of feet of tapes that had been recorded during the project year ("this year was especially productive with 45 feet 3 inches of tape being recorded"). The modern measure of this kind of quantitative nonsense is the number of gigabytes of digital files (soon to be terabytes) created by the researcher. Don't mind the quality, it's the length/bytes that count.
My colleague David Nathan, Director of the Endangered Languages Archive (ELAR) at SOAS, has been approached on several occasions by researchers (both those funded by ELDP and those not yet funded) asking how much data they would be allowed to deposit in the archive. "Would it be OK if I deposit 500 gigabytes of data?" they ask. When you think about it for a moment or two, this is a truly odd request, but one driven by part of what David (in Nathan 2004, see also Dobrin, Austin and Nathan 2007, 2009) has termed "archivism". This is the tendency for researchers to think that an archive should determine their project outcomes. Parameters stated in terms of audio resolution and sampling rate, file format, and encoding standards take the place of discussions of documentation hypotheses, goals, or methods that are aligned with a project's actual needs and intentions. David's response to such a question is usually: if the material to be deposited is of "good quality", as defined by parameters (not volume!) that the project establishes in discussion with ELAR, then the archive will be interested in taking it.
Another quantity that comes up in this context (and in the context of grant applications as well) is the statement that "10% of the deposited archival data will be analysed". The remainder of the archive deposit will be, in the worst case, a bunch of media files, or in the best case, media files plus transcription (and/or translation). Where does this magical 10% come from? It seems to have originated around 10 years ago with the DOBES project which established a set of guidelines for language documentation during its pilot phase in 2000. As Wittenburg and Mosel (2004:1) state:
"During a pilot year intensive discussions ... took place amongst the participants. The participants agreed upon a number of basic guidelines for language documentation projects. ... For some material a deep linguistic analysis should be provided such that later researchers will be able to reconstruct the (grammar of the) language"
Similarly, the guidelines for ELDP grant applications include the following:
"Note that audio and video are not usable, accessible or archivable without accompanying textual materials such as transcription, annotation, or notes about content and participants. While you are encouraged to transcribe and annotate as much of the material as possible, we recognise that this is very time-consuming and you may not be able to do this for all recorded materials. However, you must provide some text indication of the content of all recordings. This does not have to be the linguistic content and could include, for example, description of the topics or events (e.g. names of songs), or names of participants, preferably with time alignment (indication of where they occur in the recording)."
No actual figure is given of how much "some material" (for DOBES) or "as much of the material as possible" (for ELDP) amounts to. In earlier published versions of advice to applicants both DOBES and ELDP did mention 10%.
Interestingly, Wittenburg (2009, slide 34) has analysed the language documentation data collected by DOBES projects between 2000 and 2009. He notes that the average project team has recorded 131 hours of media (59 hours of audio, 72 hours of video), transcribed 50 hours of this, and translated 29 hours. Linguistic analysis exists on average for 14 hours of recordings. Strikingly, that is roughly 10.7% of the average corpus (14 of 131 hours)!
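As a quick check on the arithmetic, the following Python sketch computes each processing stage as a share of the average corpus, using only the hour counts quoted from Wittenburg (2009). The variable and function names are illustrative, and the reported averages may themselves be rounded, so the percentages should be read as approximate.

```python
# Average per-project figures quoted in the text (hours of recordings).
audio_hours = 59
video_hours = 72
total_hours = audio_hours + video_hours  # 131 hours of media in total

# Hours of the corpus that reached each processing stage.
stages = {
    "transcribed": 50,
    "translated": 29,
    "linguistically analysed": 14,
}


def share_of_corpus(hours, total=total_hours):
    """Return `hours` as a percentage of the total recorded corpus."""
    return 100 * hours / total


for stage, hours in stages.items():
    print(f"{stage}: {hours} h = {share_of_corpus(hours):.1f}% of {total_hours} h")
```

On these figures, 14 of 131 hours comes to about 10.7%, with transcription at roughly 38% and translation at roughly 22% of the average corpus.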
How much of the corpus needs to be linguistically annotated so that "later researchers will be able to reconstruct the (grammar of the) language" or indeed so that the rest of the corpus can be parsed? Well, it depends on a range of factors, including the nature of the language(s) being documented. Some Austronesian languages, like Sasak or Toratan, have relatively little morphology, and what morphology does exist has pretty straightforward morpho-phonemics. For such languages, a relatively small amount of morpheme-by-morpheme glossed material, in conjunction with a lexicon, would enable users to bootstrap the morphological analysis of other parts of a transcribed corpus. Other languages, like Athapaskan tongues with their fiendishly complex verb morphology, might need more annotated data to help the user deal with the whole corpus.
This is, however, an empirical question, and one that to my knowledge has not been addressed so far. There are now a number of documentary corpora available, with more coming on stream, and it should be possible to establish whether the "magical 10%" is a real goal to be aimed for, or just a figure that researchers have created and continue to repeat to one another.