Peter K. Austin
Department of Linguistics, SOAS
29th July 2009

At the Linguistic Society of America Summer Institute in Berkeley last week (17-19th July) the National Science Foundation sponsored Cyberling 2009, a workshop exploring how computational infrastructure (called "cyberinfrastructure" in the US, and e-Science or e-Humanities in the UK) can support linguistic research in a variety of fields. There was a panel discussion about data sharing that looked at the proposal:

"A cyberinfrastructure for linguistic data would allow unprecedented access [to] the empirical base of our field, but only if we collectively build that empirical base by contributing data. This panel addresses the benefits of data sharing and the obstacles to the widespread adoption of sharing practices, from the perspective of a variety of subfields"

But the bulk of the workshop was given over to closed discussion sessions by seven working groups looking at annotation standards, other standards, new multi-purpose software (so-called "killer apps"), data reliability and provenance, models from other fields, funding sources, and collaboration structure. The group discussions and resulting final day presentations are available on the Cyberling Wiki.

I was co-chair of Working Group 4, which was charged with discussing "protecting data reliability and provenance", i.e. how to keep track of the creation of data and analysis and its passage through the electronic infrastructure as researchers access and use each other's materials. As the Cyberling Wiki says, this is crucial

"for data creators (who need credit for the work they have done and the academic contribution of collecting, curating and annotating data) and the data users (who need to know where the data has come from so they can form an opinion of how much credence to give it and how to give proper credit to the originator of the data)".

We also looked at how to establish a culture of data sharing and what mechanisms might be put in place to encourage people to share data. Clearly, for endangered language research where data are unique and fragile, these are very important issues.

After two and a half days of intense discussions our group came up with a set of proposals relating to data reliability and provenance that can be summarised as follows:

  • Curated data as publication -- the best way to ensure reliability and provenance would be to treat data that has been curated (selected, structured and analysed, with associated metadata) as a form of publication. The technology to do this is already available; however, to do it successfully there needs to be institutional and social engagement so that creators will receive recognition and credit, and users will properly use and cite other researchers' materials.
  • Handles -- we need to set up a system of globally unique, persistent identifiers for entities (people, organisations, roles -- similar to, but much broader than, the OpenID system for customer identification), documents (since URLs are volatile), and mashups and views (combinations of data, often generated on the fly from a range of sources, e.g. the forthcoming Rosetta Platform which will draw data about languages and speakers from Freebase, classification information from Ethnologue, and locations from Google Maps, using language codes and GPS references as the 'glue').
  • Software as a Service -- we need provision of software on the web to analyse, restructure and repurpose data, while keeping track of its provenance and reliability. Again, some of this service provision is currently available, but more would make collaboration and data sharing a real possibility.
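To make the idea of handles plus provenance tracking a little more concrete, here is a minimal sketch in Python. It is only an illustration and not something the working group proposed: the `Record` class, the `derive` and `provenance` functions, and the `hdl:example/...` identifiers are all hypothetical names invented for this example. Each record carries a persistent identifier and a pointer to the record it was derived from, so a user can walk back to the original data and see whom to credit at each step.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    """One curated data set or derived view, named by a persistent identifier."""
    handle: str                      # hypothetical identifier, e.g. "hdl:example/1"
    creator: str                     # person or organisation to be credited
    description: str                 # what was done to produce this record
    derived_from: Optional["Record"] = None  # the record this one was built from


def derive(source: Record, handle: str, creator: str, description: str) -> Record:
    """Create a new record that remembers which record it was derived from."""
    return Record(handle, creator, description, derived_from=source)


def provenance(record: Record) -> List[Record]:
    """Walk back from a record to the original data, newest first."""
    chain = []
    current: Optional[Record] = record
    while current is not None:
        chain.append(current)
        current = current.derived_from
    return chain


# Placeholder data: an original recording and an annotated version built on it.
recording = Record("hdl:example/1", "Fieldworker A", "original recording and transcription")
glossed = derive(recording, "hdl:example/2", "Annotator B", "interlinear glossing added")

for step in provenance(glossed):
    print(step.handle, "|", step.creator, "|", step.description)
```

A real system would of course need the identifiers to be issued and resolved by some registry, which this sketch leaves out entirely.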

Our group also proposed some first steps that could lead to more sharing and collaboration among linguists:

  • proactive education to ensure that all linguists understand the value of data sharing (as well as the ways in which access and use can be controlled and proper citation ensured). In the case of endangered languages materials, there is the added importance of bringing out into the open materials that are unique and fragile
  • mentors (e.g. PhD supervisors) should publish and share their data sets as models for the next generation
  • websites should provide a "cite as" button with their data views so that proper referencing can be maintained -- this could be ideally extended later using the new linguistic identifier (handle) system
  • more service provision for data structure, integrity validation, and conversion -- this already exists in some areas, e.g. ELAR at SOAS provides these services on a case-by-case basis to ELDP grantees
  • publishers and editors could require provenance information when linguistic material, e.g. example sentences, is included in books and articles
  • editors and funding agencies could encourage data sets to be published
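As a rough illustration of what a "cite as" button might do behind the scenes, the following Python sketch builds a citation string from the metadata attached to a data view. The `DataView` fields, the `cite_as` function, the citation layout and all the values are assumptions made up for this example; no particular website or citation standard is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataView:
    """Metadata a website would need to hold in order to offer a citation."""
    creator: str      # who collected or curated the data
    year: int         # year the data set was made available
    title: str        # name of the data set or view
    host: str         # archive or website serving the data
    identifier: str   # persistent identifier (hypothetical "hdl:example/..." form)


def cite_as(view: DataView, accessed: str) -> str:
    """Format a citation for a data view; the layout is one possible convention."""
    return (f"{view.creator}. {view.year}. {view.title}. "
            f"{view.host}. {view.identifier} (accessed {accessed}).")


# Placeholder values only, not a real data set.
example = DataView("A. Researcher", 2009, "Annotated texts", "Example Archive", "hdl:example/2")
print(cite_as(example, "2009-07-29"))
```

Because the citation is generated from the stored metadata, swapping a URL for a persistent handle later would only mean changing the `identifier` field.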

On the final point, publication of data sets currently exists in some areas of science, such as Earth System Science Data which publishes articles on "the planning, instrumentation and execution of experiments or collection of data. Any interpretation of data is outside the scope of regular articles." The first realisation of this in linguistics can be seen in the newly established on-line Journal of Experimental Linguistics which aims to publish:

"reproducible computational experiments on topics related to speech and language. These experiments may involve the analysis of previously published corpus data, or of experiment-specific data that is published for the occasion ... In all cases, JEL articles will be accompanied by executable recipes for re-creating all figures, tables, numbers and other results. These recipes will be in the form of source code that runs in some generally available computational environment" [emphasis mine, PKA]

Perhaps the time is ripe for this approach to be applied to endangered languages research, with full publication of media files and annotation sets. Some individual researchers are already doing this, e.g. Stuart McGill's Cicipu texts website presents time-aligned, annotated and glossed texts that are available to other researchers to check the analyses presented in his recently submitted PhD thesis. However, establishment of an edited journal that publishes endangered languages data could do much to promote collaborative research, and open up the field to replicability and testability of results in a way not seen so far on a large scale.


Peter, I'm wondering how your participation in and laudatory reporting of this workshop sit with your earlier strident critique of many of these same ideas (quantification of endangerment, location of bounded languages, enumeration of speakers etc.) in your LSA presentation and blog? Or was that stirring the possum?

Cyberling was about the idea of developing an infrastructure for sharing data and analyses between researchers and other interested stake-holders, which is a separate issue from commodification of languages that turns them into objects to be counted and measured. The arguments that Lise Dobrin, David Nathan and I have made (the most recent and detailed version of which is in a paper published in June in Language Documentation and Description, Volume 6) involve pointing out this trend in endangered languages research and arguing that it is due to forces of our Western audit culture that dominate research methods. We argue that (p47):

"What is needed is an explicit recognition that the singularity of languages is irreducible, and that the methods used to study them must be singular as well. Each research situation is unique, and documentary work derives its quality from its appropriateness to the particularities of that situation"

None of this denies the possibility that researchers should be encouraged to share their data and analyses, nor prevents us from thinking about how such sharing might be developed or achieved, i.e. what kind of infrastructure might support it. Indeed, one of the issues that came up repeatedly in discussions at Cyberling was the blockages thrown up by the very audit-based approach to authorship credit and citation that Dobrin, Nathan and I were pointing to.

If I have misunderstood your comment I'd be pleased to learn more.

