

[From Peter K. Austin, Endangered Languages Academic Programme, SOAS]

There has been an interesting discussion on the LINGTYP linguistic typology list over the past week about publishing fieldwork data (archived here). David Gil argued that:

“One's collection of transcribed texts constitutes a set of complete objects, each of which could (if there were a willing publisher) stand alone as an electronic or hardcopy publication. Barring the discovery and correction of errata, once the text is transcribed, that's it, it's done.”

I responded that:

“From my experience, and that of other researchers I have spoken to, understanding/analysis of a given "text" (in the sense of inscription of a particular linguistic performance) evolves over time and is not "fixed" at any point, not even the point where it is "published" (in whatever version). Secondly, textual annotation (of which the 'traditional' interlinear format is but one particular type) is hypertextual and, these days, multimedia in nature - this is hardly a new insight - see this Kairos article [JHS: apologies to earlier readers - I caused an html oblitto cut] for a discussion of the hypertextual nature of annotation in the Talmudic tradition. Developments in Web 2.0 publishing also mean that multiple annotations of texts by multiple (distributed) contributors is now possible.”
I pointed out that there are two excellent papers dealing with these topics that will be coming out in Language Documentation and Description Vol 4 to be published at SOAS later this month:

• Nick Evans and Hans-Jürgen Sasse 'Searching for meaning in the Library of Babel: field semantics and problems of digital archiving'
• Anthony C. Woodbury 'On thick translation in linguistic documentation'

Both papers emphasise the ongoing, contingent, interpretive, hermeneutical quality of the documentation of languages, especially meaning in texts.

David Gil’s reply to my intervention included the following:

“My claim is merely that with respect to texts, there still exists a kind of basic intuitive level of transcription plus annotation -- comprising things such as orthographic transcription, phonetic transcription, interlinear gloss, free translation into English (or some other language) -- that, once accomplished, provides a natural point at which the text may be published. Even if one chooses to add or amend things later.”

Now, maybe my experience doing fieldwork and analysis over the past 35 years is unusual, but I have real doubts about the existence of such a “basic intuitive level of transcription plus annotation”, and real reservations about there being “a natural point at which the text may be published”. Transcription and analysis of texts, and their publication, involve a range of decisions taken at a particular point in time about what to include or not include. And we often publish for pragmatic reasons: there is an opportunity to do so, or we are moving on to other projects, or we need the publication for a job or promotion. In addition, my experience in preparing my Jiwarli text corpus for publication (eventually published by Tokyo University of Foreign Studies in 2006 – email me at pa2 AT soas.ac.uk if you want a free copy) was that serious editing in collaboration with my native-speaker consultant had to take place before he would approve publication of my transcriptions and translations of the stories we had recorded together. This included: deletion of repetitions and false starts, revisions to word order, insertion of contextual material, and elimination of loan words.

Looking back 11 years later I can imagine a whole set of different decisions that I would now make at various points about all areas of the published texts, from transcription to morpheme-by-morpheme glossing to running translation. What I think I understand about Jiwarli has been constantly changing over the years as I do more work on it and other languages.

Some researchers also seem to believe that there is a kind of endpoint when a whole language documentation is ‘done and dusted’ as they say here in the UK. A draft paper entitled ‘Adequacy of Documentation’ was discussed at the meeting of the Linguistic Society of America Committee on Endangered Languages and their Preservation in Los Angeles in January. In that paper, the authors wonder about: “the conditions that must be met for a language to be considered adequately documented”. They suggest it is possible to measure “how far along one has come in documenting a language” and to determine “how far there is to go” by using an “accounting function of analysis”. So:

“How do we know when we’ve gotten all the phonology? When we’ve done the phonological analysis and our non-directed elicitation isn’t producing any new phonology. How do we know when we’ve gotten all the morphology? When we’ve done the morphological analysis, when our non-directed elicitation isn’t producing any new forms, and when—crucially for inflected languages—we have elicited all the implicit inflected forms that didn’t happen to come up in non-directed elicitation.”

They recognise, however, that work on the lexicon and texts is very different but still insist that:

“even in the more open-ended aspects of syntax and lexicon, we know we are coming to an endpoint when new constructions and lexical items become rarer and rarer in non-directed elicitation”

Unfortunately, this raises the question of what we mean by “rarer and rarer”: is 1 million words of annotated text like the Brown Corpus enough to judge? What about 100 million like the British National Corpus? Or the billion-word Oxford English Corpus?
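To make the “rarer and rarer” criterion concrete, here is a minimal sketch in Python of one way it might be measured: counting, for each transcribed text in order, how many word forms have not appeared in any earlier text. The function name `new_types_per_text` and the toy corpus are illustrative assumptions, not anything proposed in the draft paper, and whitespace-separated word forms are only a crude stand-in for lexical items.

```python
from typing import Iterable


def new_types_per_text(texts: Iterable[list[str]]) -> list[int]:
    """For each text, in order, count the word forms not seen in any earlier text."""
    seen: set[str] = set()
    counts: list[int] = []
    for tokens in texts:
        # Case-fold so that "The" and "the" count as the same form.
        types = {token.lower() for token in tokens}
        new = types - seen
        counts.append(len(new))
        seen |= new
    return counts


# Hypothetical toy corpus: three short "texts", already tokenised.
corpus = [
    "the dog saw the man".split(),
    "the man saw the dog run".split(),
    "a dog and a man run".split(),
]

counts = new_types_per_text(corpus)
print(counts)  # [4, 1, 2]
```

Even in this tiny example the count falls and then rises again, so a run of low numbers does not by itself show that an endpoint is near. The result also depends on the order in which texts are processed, and on what is counted as a "new" item in the first place.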

Now, researchers studying lesser-known languages may never record and analyse textual corpora of this size, but I wonder if it is only exhaustion or death that brings a “natural point” at which the linguist’s work is done.


Re "rarer and rarer" and diminishing returns in text: I agree (or maybe I'm atypical too) - sure, transcription becomes easier, and the number of new words per text declines as the number of textual recordings increases, but I've never reached the point, even with 80 hours or so of Bardi recordings, where I've felt that the payoff in new vocab wasn't worth the effort of transcribing the text in full. There's always new vocab, new complex predicates, and new senses of previously discovered words (and this from a language that I know reasonably well and can hold a conversation in). I think my transcribed corpus is about 40,000 words at this point, and I'm about a third to half the way through. But there's always new and interesting stuff, even in the old and interesting stuff...

It's also worth noting that a transcribed text, particularly in an endangered language, can have endless applications for education/applied linguistics/language revitalisation. There are some texts that we rehash again and again and that are eternally useful, with each incarnation offering something of value... I don't see an endpoint there.

Plus there are other texts where, every time I think we've nailed it, the author/speaker wants to change this word or that word or remembers a better word... surely any so-called endpoint is always going to be arbitrary. Or am I missing the point here?

No, Wamut, that is precisely the point -- I was suggesting that publication and archiving of data take place at relatively arbitrary points in time, and that what's in the corpus represents the linguist's understanding at that time. One problem is that published (or archival) versions of material can become canonical, and information from such 'canonical texts' can get used in secondary academic writings or typological research and become reified as representing "the language" or "the grammar of X". I'm just suggesting: "there's no such thing".


