[From our man back from Chicago, Peter K. Austin, Endangered Languages Academic Programme, SOAS]

At the recent Linguistic Society of America annual meeting in Chicago, Sandra Chung from the University of California, Santa Cruz gave an invited plenary address on the topic “How much can understudied languages really tell us about how language works?” She argued, among other things, that data from understudied languages should play a crucial role in the development of linguistic theory, since only by including them can we get a full picture of the array of phenomena found in human languages that need to be accounted for. She illustrated her talk with examples from her work on Chamorro, an endangered Austronesian language spoken on Guam.

During the question time following Sandy’s talk, one person commented something along the following lines (I paraphrase, since I was rather stunned to hear the opinion being openly expressed before a linguistics audience, and don’t recall the exact formulation):

“Linguistic research needs to concentrate on working with corpora and for the sort of languages you were talking about, like Chamorro, you will never be able to put together a corpus of sufficient size to be able to do anything meaningful. We should give up on the small (and disappearing) languages and concentrate on ones where we are likely to be able to get a decent sized corpus.”

There was quite a corpus buzz at the meeting (John Goldsmith gave an invited plenary talk entitled “Towards a new empiricism for linguistics” presenting his ideas about statistical corpus-based research), and I imagine many people had in mind ‘big language’ corpora in the 1-100 million word range (or perhaps even the two billion word corpus of English that the Oxford Dictionary folks have just compiled). At the Symposium on “Mobilizing Linguistic Resources Within Speaker Communities” (held after Sandy Chung’s talk) one of the presenters, Andrew Garrett, was explicitly asked by an audience member how big the text corpus was for Yurok, the indigenous Californian language that he has been working on for some years and which has been the focus of recent language revitalization and teaching efforts.

So, should we just pack up, stop wasting our time, and leave the small languages alone? How big does a corpus have to be in order to be useful?

A partial answer can be found in Friederike Luepke’s 2005 paper entitled “Small is beautiful: contributions of field-based corpora to different linguistic disciplines, illustrated by Jalonke” published in Language Documentation and Description, Volume 3. Friederike shows how her Jalonke corpus of 7,000 intonation units (roughly 6,000 clauses) of transcribed and glossed text data can be explored quantitatively and qualitatively to uncover significant information on verb argument structure and alternations, genre-based variation, language contact phenomena, and language standardization tendencies. It is an impressive demonstration of the value of a richly annotated ‘small’ corpus.

Alternatively, there is Andrew Garrett’s response to the LSA Symposium question: the Yurok corpus of audio and text data is larger than the corpus for Luwian, an extinct Indo-European language that has played an important role in elucidating the Anatolian branch. It’s also bigger than that for Palaic, or several other languages that are ‘well respected’ in historical linguistics research.

Size is just one measure of value, and a pretty poor one it seems to me when it comes to endangered languages corpora in particular.


Andrew Taylor contacted me with the following information, which he has given me permission to reproduce here:

"Your recent contribution on corpus size reminded me of a paper given by Leonard Newell of SIL Philippines at a conference on lexicography in Manila in 1992, in which he discussed this issue. I no longer have the paper, alas, but if I remember correctly he then suggested aiming for a corpus of a million words. The paper, 'Computer processing of texts for lexical analysis', was published in the conference proceedings (Papers from the first Asia International Lexicography Conference. Manila: Linguistic Society of the Philippines Special Monograph No. 35).

Then, in his Handbook on Lexicography for Philippine and Other Languages (Linguistic Society of the Philippines, Special Monograph No. 36, 1995) the third chapter, Developing a textual corpus, deals with a range of issues involved in compiling a useful corpus. The last section is 3.8, The size of the corpus for a modest project. By this time, his suggestion was for a somewhat larger corpus.

He estimated that a keyboarder could, conservatively, collect, enter, and do a spelling edit on about one million words of text in a year and goes on to say 'Based on the experience of the Romblomanon project, a corpus yielding about three million morphemes is considered both attainable and adequate to meet the needs of a modest lexicographic project on a lesser-known language' (p.43). However, he does acknowledge the limitations of human and financial resources which usually apply to projects on languages with small numbers of speakers. (I notice the change from words to morphemes in his paragraph, which would affect the count.)

I am not suggesting his view is correct, and he may well have changed it subsequently, but it is an interesting early attempt to quantify the problem."

