
[From our man back from Chicago, Peter K. Austin, Endangered Languages Academic Programme, SOAS]

At the recent Linguistic Society of America annual meeting in Chicago, Sandra Chung from the University of California, Santa Cruz gave an invited plenary address on the topic “How much can understudied languages really tell us about how language works?” She argued, among other things, that data from understudied languages should play a crucial role in the development of linguistic theory, since only by including them can we get a full picture of the array of phenomena found in human languages that need to be accounted for. She illustrated her talk with examples from her work on Chamorro, an endangered Austronesian language spoken on Guam.

During the question time following Sandy’s talk, one person commented something along the following lines (I paraphrase, since I was rather stunned to hear the opinion being openly expressed before a linguistics audience, and don’t recall the exact formulation):

“Linguistic research needs to concentrate on working with corpora and for the sort of languages you were talking about, like Chamorro, you will never be able to put together a corpus of sufficient size to be able to do anything meaningful. We should give up on the small (and disappearing) languages and concentrate on ones where we are likely to be able to get a decent sized corpus.”

There was quite a corpus buzz at the meeting (John Goldsmith gave an invited plenary talk entitled “Towards a new empiricism for linguistics” presenting his ideas about statistical corpus-based research), and I imagine many people had in mind ‘big language’ corpora in the 1-100 million word range (or perhaps even the two billion word corpus of English that the Oxford Dictionary folks have just compiled). At the Symposium on “Mobilizing Linguistic Resources Within Speaker Communities” (held after Sandy Chung’s talk) one of the presenters, Andrew Garrett, was explicitly asked by an audience member how big the text corpus was for Yurok, the indigenous Californian language that he has been working on for some years and which has been the focus of recent language revitalization and teaching efforts.

So, should we just pack up, stop wasting our time, and leave the small languages alone? How big does a corpus have to be in order to be useful?

A partial answer can be found in Friederike Luepke’s 2005 paper entitled “Small is beautiful: contributions of field-based corpora to different linguistic disciplines, illustrated by Jalonke” published in Language Documentation and Description, Volume 3. Friederike shows how her Jalonke corpus of 7,000 intonation units (roughly 6,000 clauses) of transcribed and glossed text data can be explored quantitatively and qualitatively to uncover significant information on verb argument structure and alternations, genre-based variation, language contact phenomena, and language standardization tendencies. It is an impressive demonstration of the value of a richly annotated ‘small’ corpus.

Alternatively, there is Andrew Garrett’s response to the LSA Symposium question: the Yurok corpus of audio and text data is larger than the corpus for Luwian, an extinct Indo-European language that has played an important role in elucidating the Anatolian branch. It’s also bigger than that for Palaic, or several other languages that are ‘well respected’ in historical linguistics research.

Size is just one measure of value, and a pretty poor one, it seems to me, when it comes to endangered language corpora in particular.

Comments

Andrew Taylor contacted me with the following information which he has given me permission to reproduce here:

"Your recent contribution on corpus size reminded me of a paper given by Leonard Newell of SIL Philippines at a conference on lexicography in Manila in 1992, in which he discussed this issue. I no longer have the paper, alas, but if I remember correctly he then suggested aiming for a corpus of a million words. The paper, 'Computer processing of texts for lexical analysis' was published in the conference proceedings (Papers from the first Asia International Lexicography Conference. Manila: Linguistic Society of the Philippines Special Monograph No. 35).

Then, in his Handbook on Lexicography for Philippine and Other Languages (Linguistic Society of the Philippines, Special Monograph No. 36, 1995) the third chapter, Developing a textual corpus, deals with a range of issues involved in compiling a useful corpus. The last section is 3.8, The size of the corpus for a modest project. By this time, his suggestion was for a somewhat larger corpus.

He estimated that a keyboarder could, conservatively, collect, enter, and do a spelling edit on about one million words of text in a year and goes on to say 'Based on the experience of the Romblomanon project, a corpus yielding about three million morphemes is considered both attainable and adequate to meet the needs of a modest lexicographic project on a lesser-known language' (p.43). However, he does acknowledge the limitations of human and financial resources which usually apply to projects on languages with small numbers of speakers. (I notice the change from words to morphemes in his paragraph, which would affect the count.)

I am not suggesting his view is correct, and he may well have changed it subsequently, but it is an interesting early attempt to quantify the problem."

