[From Peter K. Austin
Linguistics Department, SOAS
30 November 2010]

In commenting on a recent blog post of mine about SOAS publication plans, Nick Thieberger raises a number of relevant and important issues for anyone publishing in our field. Getting comments like this is manna to me as a blog author since so many of my posts go uncommented upon (I know people are reading them because I can track redirects from Facebook and my home page via bitly.com, and just occasionally someone references the content of a blog post, as in the recently published Handbook of Descriptive Linguistic Fieldwork by Shobhana L. Chelliah and William J. De Reuse). It is also good to be challenged to clarify one's own thinking about issues, so thanks for the feedback, Nick.

I identified the following four main points in Nick's comments:

  • 1. LDD should "move to an Open Access model for [its] content in the future"
  • 2. content should be free and online because that makes it available to people who cannot pay and who would otherwise not be able to access it
  • 3. having content online means you can measure downloads and the number of downloads measures impact
  • 4. the current LDD business model should be replaced

I will respond to each of these points in turn.

I recently attended a symposium titled "Models for capacity development in language documentation and conservation" hosted by ILCAA at the Tokyo University of Foreign Studies. The symposium brought together a group of people involved in supporting language work in the Asia-Pacific region in various ways (see the website for a full list): academic (Institute of Linguistics, Minhsiung, Taiwan; Beijing, China; Goroka, PNG; Batchelor, Australia; Bangkok, Thailand) and community-based (Manokwari, West Papua; Tshanglalo, Bhutan; Bhasha Research Centre and Adivasi Academy, Gujarat, India; Miromaa, Australia), using film (Sorosoro, France), or archiving language records (PARADISEC). The aim of the meeting was to build a network that would continue to link training activities supporting language work: the Consortium on Training in Language Documentation and Conservation (CTLDC), whose planning group members are listed here.

Stateline has a good interview (and transcript) with various staff of the Australian Institute of Aboriginal and Torres Strait Islander Studies [thanks Sarah!]. It's about the audiovisual archive, and you can hear snippets of recordings and also hear about the problems with machinery becoming obsolete, and the importance of metadata.

AIATSIS (misspelled 'IATSIS' alas in the transcript) actually has a fantastic print/manuscript collection as well. We're lucky in Australia that nearly 50 years ago, some far-sighted enthusiasts got the Government to set up AIATSIS (then AIAS) and pay for the archiving and dissemination of materials related to Australian Aborigines and Torres Strait Islanders.


[from Peter K. Austin
Linguistics Department, SOAS
24th September 2010 ]

Last month I wrote a blog post about quantification in language documentation and "[h]ow much of the corpus needs to be linguistically annotated so that 'later researchers will be able to reconstruct the (grammar of the) language' or indeed so that the rest of the corpus can be parsed". Note that I was talking about linguistic annotation (not just transcription) here, but in his very useful comments on my post, James Crippen wrote:

"Some folks I know have well over 1000 hours of recorded material, and I think nowhere near ten percent of that has been transcribed. Asking for someone to do the ten percent for this before being willing to accept it is a bit unreasonable."

Well, the first thing I have to say is: 1000 hours is an awful lot of recordings. It's about 7.5 times the average DoBeS corpus (based on the figure I mentioned in my previous post) and if it's video it's equivalent to around 550 feature length movies (which average around 110 minutes each). If you spent every waking hour of the working week, with no time for eating, bathing, shopping, checking e-mail etc, it would take you six and a half months to merely watch or listen to it all, let alone create any metadata, analysis, transcription, or index (and remember that this is probably going to be in a language you don't understand and with no subtitles). You'd want to have a good reason to do so, I reckon.
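The back-of-envelope figures above can be reproduced with a few lines of Python. This is only a sketch: the 131-hour average is the DoBeS figure reported by Wittenburg (2009), while the 35-hour working week and the 52/12 weeks per month are assumptions chosen for illustration, not numbers stated in the post.

```python
# Rough check of the "1000 hours" comparisons.
TOTAL_HOURS = 1000
AVERAGE_DOBES_CORPUS_HOURS = 131   # Wittenburg (2009) average per project
FEATURE_FILM_MINUTES = 110         # average film length used in the post
HOURS_PER_WORKING_WEEK = 35        # ASSUMPTION: not stated in the post
WEEKS_PER_MONTH = 52 / 12          # ASSUMPTION: simple calendar average

# How many average DoBeS corpora fit into 1000 hours?
ratio_to_dobes = TOTAL_HOURS / AVERAGE_DOBES_CORPUS_HOURS

# How many feature-length films is that, if it were all video?
feature_films = TOTAL_HOURS * 60 / FEATURE_FILM_MINUTES

# How many months of working weeks would it take just to play it back?
months_to_play = TOTAL_HOURS / HOURS_PER_WORKING_WEEK / WEEKS_PER_MONTH

print(f"{ratio_to_dobes:.1f} times the average DoBeS corpus")
print(f"{feature_films:.0f} feature-length films")
print(f"{months_to_play:.1f} months of playback")
```

Changing HOURS_PER_WORKING_WEEK shows how sensitive the "months" estimate is to what counts as a working week; the other two figures depend only on numbers quoted in the text.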

Anyway, be that as it may, James' comment prompted me to seek some empirical data about this issue, so I wrote to five colleagues who are responsible for archives of materials on endangered languages, namely Peter Wittenburg of the DoBeS archive, Heidi Johnson of the Archive of the Indigenous Languages of Latin America (AILLA), Gary Holton of the Alaska Native Language Archive (ANLA), Nick Thieberger of the Pacific And Regional Archive for Digital Sources in Endangered Cultures (PARADISEC), and David Nathan of the Endangered Languages Archive (ELAR) at SOAS. I asked them the following questions:

"If someone approached you about depositing 1000 hours of recorded digital data on some language, less than 10% of which was transcribed, what advice would you (Archive_Name) give them? What would be the minimal requirements that you would have in order to accept the materials for deposit?"

I've been meaning to express my love and gratitude for the excellent Hugo Schuchardt Archiv at the Uni Graz for a while now. I was thinking of maybe saying a little something about Schuchardt for his birthday or Todestag, but the dates passed and in any case I come to exhume Schuchardt, not to praise him.

You can read all about Schuchardt yourself at the archive. There are freely accessible scans of all his published works, a growing full-text searchable database of some of the correspondence he received, some secondary materials, and pointers to further resources. More online archives like this would be great!

Peter K. Austin
Linguistics Department, SOAS
8th August 2010

Forty-five years ago the annual fieldwork reports of some of the researchers funded by the then Australian Institute of Aboriginal Studies (now AIATSIS) included specifications of how much research had been completed in terms of the number of feet of tapes that had been recorded during the project year ("this year was especially productive with 45 feet 3 inches of tape being recorded"). The modern measure of this kind of quantitative nonsense is the number of gigabytes of digital files (soon to be terabytes) created by the researcher. Don't mind the quality, it's the length/bytes that count.

My colleague David Nathan, Director of the Endangered Languages Archive (ELAR) at SOAS, has been approached on several occasions by researchers (both those funded by ELDP and those not (yet)) asking how much data they would be allowed to deposit in the archive. "Would it be OK if I deposit 500 gigabytes of data?" they ask. When you think about it for a moment or two, this is a truly odd request, but one driven by part of what David (in Nathan 2004, see also Dobrin, Austin and Nathan 2007, 2009) has termed "archivism". This is the tendency for researchers to think that an archive should determine their project outcomes. Parameters stated in terms of audio resolution and sampling rate, file format, and encoding standards take the place of discussions of documentation hypotheses, goals, or methods that are aligned with a project's actual needs and intentions. David's response to such a question is usually: if the material to be deposited is "good quality" (stated in terms of some parameters (not volume!) established by the project in discussion with ELAR) then the archive will be interested in taking it.

Another quantity that comes up in this context (and in the context of grant applications as well) is the statement that "10% of the deposited archival data will be analysed". The remainder of the archive deposit will be, in the worst case, a bunch of media files, or in the best case, media files plus transcription (and/or translation). Where does this magical 10% come from? It seems to have originated around 10 years ago with the DOBES project which established a set of guidelines for language documentation during its pilot phase in 2000. As Wittenburg and Mosel (2004:1) state:

"During a pilot year intensive discussions ... took place amongst the participants. The participants agreed upon a number of basic guidelines for language documentation projects. ... For some material a deep linguistic analysis should be provided such that later researchers will be able to reconstruct the (grammar of the) language"

Similarly, the guidelines for ELDP grant applications (downloadable here) include the following:

"Note that audio and video are not usable, accessible or archivable without accompanying textual materials such as transcription, annotation, or notes about content and participants. While you are encouraged to transcribe and annotate as much of the material as possible, we recognise that this is very time-consuming and you may not be able to do this for all recorded materials. However, you must provide some text indication of the content of all recordings. This does not have to be the linguistic content and could include, for example, description of the topics or events (e.g. names of songs), or names of participants, preferably with time alignment (indication of where they occur in the recording)."

No actual figure is given of how much "some material" (for DOBES) or "as much of the material as possible" (for ELDP) amounts to. In earlier published versions of advice to applicants both DOBES and ELDP did mention 10%.

Interestingly, Wittenburg (2009, slide 34) has done an analysis of the language documentation data collected by DOBES projects between 2000 and 2009, and he notes that the average project team has recorded 131 hours of media (59 hours of audio, 72 hours of video), transcribed 50 hours of this, and translated 29 hours. Linguistic analysis on average exists for 14 hours of recordings; strikingly, this is about 10.7% of the average corpus!
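The proportions implied by these averages are easy to check. The following is a minimal sketch using only the hour counts quoted from Wittenburg (2009); the variable and function names are made up for illustration.

```python
# Average DoBeS project figures as quoted from Wittenburg (2009, slide 34).
audio_hours = 59
video_hours = 72
recorded_hours = audio_hours + video_hours  # total hours of recorded media

# Hours of the corpus that reached each stage of processing.
stages = {
    "transcribed": 50,
    "translated": 29,
    "linguistically analysed": 14,
}


def share_of_corpus(hours, total=recorded_hours):
    """Return `hours` as a percentage of the total recorded corpus."""
    return 100 * hours / total


for stage, hours in stages.items():
    print(f"{stage}: {hours} h = {share_of_corpus(hours):.1f}% of the corpus")
```

Running the same calculation over individual corpora, rather than over the average, would be one way to approach the empirical question of whether 10% is a meaningful target.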

How much of the corpus needs to be linguistically annotated so that "later researchers will be able to reconstruct the (grammar of the) language" or indeed so that the rest of the corpus can be parsed? Well, it depends on a range of factors, including the nature of the language(s) being documented. Some Austronesian languages, like Sasak or Toratan, have relatively little morphology, with pretty straightforward morpho-phonemics for such morphology as does exist, and so a relatively small amount of morpheme-by-morpheme glossed material in conjunction with a lexicon would enable users to bootstrap the morphological analysis of other parts of a transcribed corpus in those languages. Other languages, like Athapaskan tongues with their fiendishly complex verb morphology, might need more annotated data to help the user deal with the whole corpus.

This is however an empirical question, and one that to my knowledge has not been addressed so far. There are now a number of documentary corpora available, with more coming on stream, and it should be possible to establish whether the "magical 10%" is a real goal to be aimed for, or just a figure that researchers have created and continue to repeat to one another.

from David Nathan, SOAS, London
29 June 2010

On Wednesday 30 June, the Endangered Languages Archive (ELAR) at SOAS, University of London, will launch the new version of our website. The site now offers access to endangered languages (EL) resources, subject to conditions applied by depositors.

ELAR implements a new approach to the archiving and dissemination of EL resources. Our system uses a “Web 2.0” or social networking model, where information owners and information users can interact with and via the web-based system. For example, depositors can update their deposits, and manage and monitor access to them. Registered users can apply to depositors for access to restricted materials. The archive becomes redefined as a forum where users can negotiate with depositors (initially, about access; we plan to add to the possibilities of depositor/user interaction in the coming months).

ELAR’s archive system is designed specifically to meet the needs of EL community members and researchers. The processes, conventions, and interfaces of social networking are a good fit with our understanding of the needs of endangered languages documentation and its various stakeholders. While protocol (collection and observance of sensitivities and restrictions) is important for documentary linguistics, conditions of access can be diverse, yet need to be accountably managed by a “system”. Using a flexible, web-based facility makes access control, monitoring and authorisation more flexible, nuanced, and dynamic. In fact, the majority of our depositors have already indicated that they prefer to allow access as a result of application from potential users on a case-by-case basis.

We also felt that the existing genre of EL archives finds it difficult to fully meet the often-expressed, but rarely met, goal of making it equally feasible for community members to access resources. For example, viewing deposits at ELAR will show resources easily accessed by the language speakers’ names, enabling community members to locate resources in terms of their own community/social perspectives, rather than technical or linguistic ones.

While hoping that ELAR will make a significant contribution to the development of documentary linguistics, we understand that fixes and improvements will be needed, so we sincerely invite feedback and suggestions for improvement of our site (we’ll do our best to respond, even though we are a team of only 2.5 people!).

ELAR’s online system is built around a set of protocol categories, derived from the categories on our deposit form. There are 4 major categories, each of which matches rights to access resources with the recognised roles of users. The categories are U (User - allows access to all non-restricted resources), R (Researcher - allows access to those recognised as researchers in an area relevant to linguistics or endangered languages), C (Community member - those affiliated with the language documented), and S (Subscriber - those granted individual rights to access by the depositor).

What to see: anyone can see the top pages of the site, with metadata for the deposits. For those deposits which are access-enabled (see the front page, elar.soas.ac.uk), anyone can see basic information about the deposit. To go further and see resources, all users have to register with ELAR. If you already have an ELAR user account then you can access resources, subject to their restrictions, and any additional user roles you may attain.

While about 30% of materials are completely open to access, all but a few resources in the collection are accessible to those who become recognised as a community member (decided by each depositor or their delegate), a researcher (decided by ELAR) or as a subscriber to a particular deposit (decided by the depositor or delegate upon application).
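The four protocol categories described above amount to a simple role-based access check. The sketch below is purely illustrative and is not ELAR's implementation: the class names, fields, and deposit identifiers are all hypothetical, and it assumes that every registered account holds the U role while R, C and S are granted as the text describes.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    USER = "U"        # any registered user
    RESEARCHER = "R"  # recognised by the archive as a researcher
    COMMUNITY = "C"   # recognised as a community member for a deposit
    SUBSCRIBER = "S"  # granted access to a deposit by its depositor


@dataclass
class Account:
    """A registered account (hypothetical model, not ELAR's schema)."""
    name: str
    is_researcher: bool = False
    community_of: set = field(default_factory=set)   # deposit ids
    subscribed_to: set = field(default_factory=set)  # deposit ids

    def roles_for(self, deposit_id):
        """Return the set of roles this account holds for one deposit."""
        roles = {Role.USER}
        if self.is_researcher:
            roles.add(Role.RESEARCHER)
        if deposit_id in self.community_of:
            roles.add(Role.COMMUNITY)
        if deposit_id in self.subscribed_to:
            roles.add(Role.SUBSCRIBER)
        return roles


@dataclass
class Resource:
    """A resource plus the roles its depositor allows to open it."""
    deposit_id: str
    allowed: frozenset


def can_access(account, resource):
    """Access is granted if the account holds any allowed role."""
    return bool(account.roles_for(resource.deposit_id) & resource.allowed)


# Hypothetical example: one open resource, one subscriber-only resource.
open_item = Resource("deposit-a", frozenset({Role.USER}))
restricted_item = Resource("deposit-b", frozenset({Role.SUBSCRIBER}))
visitor = Account("visitor")
subscriber = Account("subscriber", subscribed_to={"deposit-b"})
print(can_access(visitor, open_item), can_access(visitor, restricted_item))
print(can_access(subscriber, restricted_item))
```

Keeping community and subscriber status per deposit, rather than per account, mirrors the point that those roles are decided by each depositor or their delegate.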

Initially, we have prepared 12 deposits for release, with the co-operation of their depositors. These deposits cover many geographic areas and a range of access types. Deposits by Lahaussois (Koyi Rai, Nepal), Chambers (Kumbokota, Solomon Islands), Jukes (Ratahan, Indonesia) and Morey (Singpho and Turung, India) are open. Deposits by Bowern (Yan-nhangu, Australia), Jany (Chuxnabán Mixe, Mexico), Morrison (Bena, Tanzania), Caballero (Choguita Rarámuri, Mexico), Mendez (Ayutla Mixe, Mexico) and Budd (Bierebo, Vanuatu) are accessible upon application for subscription approved by the depositors. Kono’s deposit (Kiksht, USA) is currently accessible to recognised community members only. Throughout the year we will release further deposits at the rate of about one per week.

For a beautifully organised site run by a small group, check out Sarah Colley's new site: the Sydney fish project. What fish have been found in archaeological sites in Sydney? What do the bits look like, what does the whole fish look like (i.e. a reference skeleton)? What fish did Aborigines eat at what period? What did settlers eat? How did they eat them? You need access to such collections to be able to interpret finds at different sites.

I so like having all the detailed information about the picture visible with the picture, seeing multiple images of different bones on one screen, seeing at a glance how many examples there are of a particular taxon, the hierarchical views of the taxa, and the different views of the bones.

The ability to access a collection by image on the web gives far more people access to the collection. This one's been done in conjunction with the University of Sydney Library, which has done a lot of interesting things making stuff accessible on the web. And because it is housed by a major library, the archive is more sustainable.


Further to the discussion of making online material discoverable (using standard metadata or via a more elaborate infrastructure proposed by ELIIP), other useful sources of free online grammars or dictionaries include 'Online Books' and the Project Gutenberg sites. These are 'free' as in unencumbered by intellectual property or copyright concerns, typically because the authors have been dead for over 50 years, not because they were placed in an open access archive. A sample of the files available follows, but wouldn't it be great to have a way of announcing these items using standard metadata terms so they could all be searched via a dedicated language service? For example, the entry for Sgau Karen below is followed by Sgaw Karen, so google searching on Sgaw will only give you one of these three items.

[From Nick Thieberger, University of Melbourne]

On the topic of trying to locate material in a small language, I was reading Kaisa Maliniemi's 2009 article on the discovery of new linguistic material in Kven and Sámi in Norway's public records archives. She discusses the fact that the records have been publicly available for some time and that a number of researchers must have worked with them in the past, but that activity left no trace of the fact that the records included considerable amounts of information in these two minority languages. She argues that archives can make available to 'the other' those voices and knowledge marginalized by the western-dominated global mainstream. But the point that the article made strongly for me is that we should be able to provide a means for tagging such collections so that they can be located by others interested in those languages (this was also a topic at the ELIIP conference reported on by Jane Simpson here and here).

The suggestion that we can use Wikipedia [in Peter Austin's reply to Jane's blog] is only part of a solution. I have put links to South Efate material into a Wikipedia entry here as a way to make the information available. We can, however, do better than an unstructured language page that is made by hand, as in the Wikipedia approach, rather than being automatically populated by web-based information in Web2 style. Using Web2 technologies, the Open Language Archives Community (OLAC) harvests information from participating collections and then establishes a page for every language represented in those collections, like this one, where the three-letter language code (ISO-639-3) designates the language, in this case 'erk' = South Efate (Vanuatu). Of course there are languages without ISO standard codes and they need to be brought into the system too.
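The idea of building one page per language from harvested records can be sketched in a few lines. This is a toy illustration, not OLAC's harvester: the record fields, archive names and titles are hypothetical, and the only thing taken from the text is that a three-letter ISO 639-3 code such as 'erk' identifies the language, and that items lacking a code still need a home.

```python
import re
from collections import defaultdict

# ISO 639-3 identifiers are three lowercase letters (e.g. 'erk').
ISO_639_3 = re.compile(r"[a-z]{3}")


def build_language_pages(records):
    """Group harvested records by language code.

    Returns (pages, uncoded): `pages` maps each valid code to the titles
    found for it; `uncoded` lists titles with no usable code.
    """
    pages = defaultdict(list)
    uncoded = []
    for record in records:
        code = record.get("language_code")
        if code and ISO_639_3.fullmatch(code):
            pages[code].append(record["title"])
        else:
            uncoded.append(record["title"])
    return dict(pages), uncoded


# Hypothetical records harvested from two participating collections.
harvested = [
    {"archive": "Archive A", "title": "Field notes", "language_code": "erk"},
    {"archive": "Archive B", "title": "Story recordings", "language_code": "erk"},
    {"archive": "Archive B", "title": "Word list", "language_code": None},
]

pages, uncoded = build_language_pages(harvested)
print(pages)
print(uncoded)
```

Because the grouping key is the code rather than a language name, variant spellings of a name end up on the same page, which is the ambiguity that hand-made pages cannot easily resolve.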

A focus of our archive, PARADISEC, is to make previously unlocatable material available, and we have done this in several ways. The first, and most straightforward, is to provide an online catalog of material in our own collection. The catalog, using standard terms like country names, language names and the metadata given by the Open Language Archives Community, allows depositors to enter their own metadata. For many, this is the first time they have actually systematised their collection. Because the catalog is part of the OLAC federation, it is accessible via their search mechanisms, and is also locatable via Google.

Second we have made material available by taking scans of around 14,000 pages of notes and placing them online, with enough contextual information to allow them to be located [see Arthur Capell's notes here, or Stephen Wurm's notes here, or Calvin Roesler's notes here]. If you look at the OLAC page with South Efate material listed you will also find a number of references and links to Arthur Capell's notes which we put online.

Third, we can enter a record in our catalog to make an existing resource more widely available, and, as our catalog is harvested by the Open Language Archives Community, it will then be more generally locatable. For example, George Grace is a linguist who has worked in various parts of the western Pacific, and his fieldnotes have been scanned and put online at the University of Hawai'i (UH) library. If you know that it is there and you search for his name, then you can find it in Google. However, there is no provision made by UH for standardising language names by use of the three-letter code (or ISO-639-3) that reduces ambiguity in searching. The UH library catalog currently does not list these items, nor does their 'Online resources' catalog. By entering a record into the PARADISEC catalog (here) the information is then propagated through to OLAC:

[Screenshot: OLAC search results for Waropen]

A Google search for one of the languages mentioned in this collection, 'Waropen', locates our record (hit number 3) in OLAC:

[Screenshot: Google search results locating the OLAC record for Waropen]

The item at UH comes in at hit number 57:

[Screenshot: Google search results showing the UH item at hit number 57]

OLAC's language pages are an excellent source of information, and if we can add to each page by providing a fairly minimal pointer in an OLAC-compliant record then that may also solve the problem for the Kven and Sámi material that Maliniemi discovered.

Maliniemi, Kaisa. 2009. Public records and minorities: problems and possibilities for Sámi and Kven. Archival Science. Vol. 9, Numbers 1-2: 15-27 DOI 10.1007/s10502-009-9104-3

From: Peter K. Austin
Department of Linguistics, SOAS

9 February 2009

David Nathan, Director of the Endangered Languages Archive, at SOAS, and I are back in Tokyo at the invitation of Toshihide Nakayama of ILCAA, the Institute for Languages and Cultures of Asia and Africa, at Tokyo University of Foreign Studies for 10 days to run a workshop on language documentation that follows up our 2008 workshop. This year we are taking a different tack and focusing the week of seminars and practical sessions on the principles and practices of archiving endangered languages materials. The week begins on Monday (today) with preparations in the morning and David's public lecture on "Archiving endangered language materials" in the afternoon. Classes begin in earnest on Tuesday and run until Friday, with sessions from 10am to 5pm each day. There will be 15 attendees, mostly students who are doing fieldwork in various locations around the world. Details of the workshop can be found here.

The topics we plan to cover include:

  • Language documentation and language archiving - major issues
  • Audio - good practices refresher
  • Audio recording - how to make great audio
  • Data and metadata - good practices refresher
  • Data management practical
  • Workflow for archiving
  • Mobilisation and delivery of language materials
  • Transcription, annotation, translation - good practices refresher
  • IP and ethical issues in the delivery, usage, and archiving of materials

There will be group work in the practical sessions and a final discussion with presentations by the attendees on the last day. If time and energy permit I will blog about how the workshop goes and report on some of the outcomes.


Peter K. Austin and David Nathan
Linguistics Department, SOAS
6th January 2009

The Endangered Languages Archive (ELAR) was established at SOAS in January 2004, with the first deposits accepted in late 2005. Our initial priority was on preservation but recently the ELAR public catalogue was released and it will soon extend to providing access to materials (where permissions allow). To date, ELAR has received over 50 deposits and stores about 4 terabytes of data. Audio recordings make up about 60% of this (both in terms of the total number of files and the total volume of data).

ELAR was established primarily to preserve and disseminate data collected by grantees from the Endangered Languages Documentation Programme (ELDP) and by staff and students from the Endangered Languages Academic Programme (ELAP). Because language documentation is an emerging area that relies a lot on new techniques and technologies, ELAR also provides training, advice and support to ELDP grantees, ELAP staff and students, and others through international training workshops (see, for example, the various workshops organised by ELAR and taught by ELAR and ELAP staff and students and additional experts). ELAR staff also manage the research facilities of the Rausing Room, the Linguistics Resources Room, and the pool of fieldwork equipment available to ELAP staff and students.

ELAR now has four staff, with David Nathan and Ed Garrett being card-carrying linguists and IT professionals, and technicians Tom Castle and Bernard Howard having specialist skills in digital and analogue audio techniques and equipment.

With these resources, skills and experience, ELAR is able to help people who want to archive resources for endangered languages, including individual and retired researchers who may not have alternative sources of equipment or advice. Dietrich Schüller, the former Director of the Austrian Phonogrammarchiv, has warned in a recent paper[.pdf] that the great majority of the world's human cultural heritage is sitting unpreserved and uncatalogued on the shelves of individual researchers. We can help these researchers with preparing materials, including digitising and converting audio, as well as providing advice and training in how to create metadata and cataloguing information.

Over the last few years ELAR has collaborated with a number of individual researchers in preparing their materials for deposit:


[From our man in Hawai'i and Melbourne - Nick Thieberger]

The Australian government has millions of dollars that it will be spending on what it calls the National Collaborative Research Infrastructure Strategy (NCRIS) to support new technologies in research in Australia.

"Through NCRIS, the Government is providing $542 million over 2005-2011 to provide researchers with major research facilities, supporting infrastructure and networks necessary for world-class research."

DEST released a paper outlining what it called 'capabilities' which it proposed to fund, and they were ALL in the sciences, including lots of shiny pointy instruments (synchrotron, new telescopes and so on) to do the whizzbang experiments that are so popular and capture the imagination of politicians. While the physical science community has amazing capacity to pull in big research dollars, there are not that many of them, and even fewer who actually want to use each of these very expensive instruments.

On the other hand, the Humanities, Arts and Social Science (HASS) community is huge, and also does the kind of work that, in the main, is immediately relevant to those who fund it (taxpayers). So, in the consultation that followed, the clamour of HASS proponents resulted in a new 'capability' being added to the 'roadmap', but without any funding (yet) associated with it. There will be an 'Innovation White Paper' announcement before the end of 2008, and the current roadmap leads to the White Paper.

All of this is important for us, as it is the bucket from which national infrastructure like a National Data Service may be funded, and where policies on standards for data repositories like PARADISEC will be set. It is where funding will come from for the national computer facility that houses the online version of the PARADISEC collection.

Early this morning, a delivery of audio files was quietly sent from Paradisec's local server at the University of Sydney to permanent near-line tape storage at the Australian Partnership for Advanced Computing in Canberra. This happens on many days, as you might imagine, but what makes today's delivery special is that somewhere in that bunch of files was our 2000th archived hour of audio.

Moreover, we will soon be celebrating five years of operations, in which case, 2000 hours might not seem so impressive - it's just 400 hours per year after all - but we at Paradisec are very proud of our collection. Especially given that just about everything here is done on a shoestring budget and there have been some lengthy hiatuses of funding lately.

Speaking of which, this may be an opportune time to mention that we are always amenable to generous donations from people wishing to sponsor the digitisation and preservation of a collection of data. See our website for more details.

So, just which file was the lucky 2000th hour? Well, we can't really be sure, but we do know that it was among a collection of Mark Durie's research into the dialects of Aceh, an area that was devastated by the Indian Ocean tsunami of Boxing Day 2004.

To help us celebrate both these milestones, Mark has kindly written a small piece for us about Aceh's dialects, his research on them and the importance of preserving the collection. He has also allowed a small portion of one of these recordings to be posted with this piece, which you can download here.

A lot of work has been happening at the University of Sydney over the past six months, and at the end of last year the top floor of the Transient Building, which houses Linguistics, Paradisec and a few other offices, got renovated. Unfortunately, since the entire exterior of the building is composed of fibrous asbestos, it's unlikely that the University will outlay the mammoth insurance costs to do any exterior work. But anyone who knows the Transient building knows that the best option would be to demolish the whole thing and start again from scratch.


After some effort PARADISEC has finally established a streaming server that can be used in normal web pages. This means that an online dictionary, for example, can have example headwords and sentences spoken, or video clips presented to illustrate a given word. You can see the trial version here (NB: this will only work with the Firefox browser, and you will also need to pre-install the Annodex plugin).

For some time it has been troubling that we have no simple way of presenting media online in association with transcripts, especially when an archived field recording may be the only recording of a particular language. It should have been simple enough to access media on the web. After all, we do it on YouTube and other places. But we have been further constrained by really wanting all of this to be open source (freely available software) so that anyone with the right skills can replicate this setup and not have to pay. And we also wanted the process for getting material into an online presentation to follow on from normal fieldwork outcomes, in line with output from the tools typically used by a professional linguist (one who keeps up to date with the methods of their profession). When the archival form of the media exists in a repository, it should then be an automatable process to put it into a streaming server for access.

Yesterday (27 October) was the first celebration of UNESCO's world day of audio-visual heritage. The trailer on that website, put together from the holdings of various audio-visual archives around the world, gives a flavour of the kind of material that is held in audio and film/video archives worldwide. Australia is fortunate to have many cultural institutions that hold and look after material recorded in Australia: the National Film and Sound Archive (NFSA), the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS), the National Library of Australia (NLA), the National Archives of Australia (NAA) and many others.


Check out 'Language Archives Newsletter' (LAN) No. 10 (edited by David Nathan, Marcus Uneson, Paul Trilsbeek). It features articles on the role of video in language documentation by Patrick McConvell and Peter Wittenburg, as well as reviews of audio recorders including the Zoom H4.

LAN 10 Contents:
Video - A Linguist's View (A Reply to David Nathan), by Patrick McConvell
Video - A Technologist's View (A Reply to David Nathan), by Peter Wittenburg
Review: Audio Recorders Zoom H4 and Korg MR-1, by Paul Trilsbeek, Gerd Klaas
Review: Audio Recorder iRiver H320, by Bernard Howard
CLARIN Research Infrastructure Initiative, by Peter Wittenburg
Announcements etc

So you want to preserve that MSWord novel, those spreadsheets, those AppleWorks fieldnotes forever?

The National Archives of Australia are ahead of you - they've developed free and open source software to help in the long term preservation of digital records. Xena! (XML Electronic Normalising for Archives - and I bet they thought hard to come up with the N).

I saw a demo of Xena a couple of years ago, and was greatly impressed by the potential of streamlining the workflow in digital text archives - by detecting the file formats of digital objects, and then converting them into open formats like XML for preservation. Databases remain the nightmare of course.
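
To give a feel for the "detecting the file formats" step, here is a minimal Python sketch that guesses a format from the first few bytes of a file. It is emphatically not how Xena itself works (I haven't seen its internals); the function names and the short signature table are my own illustration, and a real tool would check far more formats and then hand each file to a converter.

```python
# Illustrative only: NOT Xena's actual detection logic.
# The signature table and function names are hypothetical.
MAGIC_SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP container (e.g. OpenDocument, .docx)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (e.g. legacy MS Word or Excel)"),
    (b"<?xml", "XML text"),
]


def guess_format(header: bytes) -> str:
    """Return a rough format label based on the leading bytes of a file."""
    for signature, label in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return label
    return "unknown"


def guess_file_format(path: str) -> str:
    """Read the first 16 bytes of the file at `path` and guess its format."""
    with open(path, "rb") as handle:
        return guess_format(handle.read(16))
```

Anything that comes back as "unknown" is exactly the sort of object (old databases, obscure word processors) that still needs a human to look at it.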

Anyway, there's a new release - and here are the details.


Digital archives of photos, films and recordings are springing up in Indigenous communities, and some of them are even Getting Funding, hurrah! The Bill and Melinda Gates Foundation is giving a million US dollars to the Northern Territory State Library System:

"a 2007 Access to Learning Award recognizes the Northern Territory Library for providing free computer and Internet access and training to impoverished indigenous communities... The award honours the innovative Libraries and Knowledge Centres (LKC) Program, which provides communities with free access to computers and the Internet, and helps Indigenous Territorians to build digital collections of their culture through the Our Story database."

They've got Knowledge Centres at Milingimbi, Wadeye, Peppimenarti, Umbakumba, Angurugu, Pirlangimpi, Milikapiti, Barunga, Ti Tree, and Ltyentye Apurte.

.....As well, "Microsoft, a Global Libraries initiative partner, will donate US $224,000 in software and technology training curriculum to upgrade the organization’s 300 library computers." [Weep for us Mac users]

The Our Story database is an adaptation of the classic Filemaker Pro Ara Irititja program developed by the artist and historian John Dallwitz for the Anangu Pitjantjatjara.

Ara Irititja, a project of the Pitjantjatjara Council, commenced in 1994 when it was realised that a large amount of archival material about Anangu was not controlled by or accessible to them. This material was held in museums, libraries and private collections. Items held by private individuals were often at risk of being damaged or irretrievably lost. To date, a major focus of Ara Irititja’s work has been retrieving and securing such records for the benefit of Anangu and the broader Australian community.

The great advantage of Filemaker Pro was that it was off-the-shelf and fairly easy for people to use. There have been elaborate proposals, but going beyond glamour to making things work in remote communities is a very large step.

[ from Nick Thieberger, PARADISEC, Melbourne University branch ]

I am a firm believer in open access to information, especially research information that has been created by taxpayers' funds. Thus it came as something of a surprise to find myself likened to the main man of the dark forces of corporate information ownership on a site formerly known as the 'Stolen Grammars' site.

Constructed by a linguist in Stockholm, the site offered downloadable versions of many grammars which had been copied from various locations ("Browse my collection of stolen .pdf reference grammars if you'd rather not pay.")

Last Friday was a bit of a milestone for me, since, in the 6 or so months that I have been involved in the audio preservation side of things at PARADISEC, I hadn't yet actually cleaned a damaged audio tape. Unfortunately for me, the process isn't quite as straightforward as it is for a CD - warm soapy water, a non-abrasive cloth, wipe across the grain - rather, the entire process can take weeks, depending on how badly affected the tapes are.


[From Peter K. Austin, Endangered Languages Academic Programme, SOAS]

On Wednesday last week (25th April) during Endangered Languages Week at SOAS there was a presentation on the "Dawes online" project at SOAS which aims to make an interactive digital facsimile of William Dawes' notebooks of the Sydney language available on the web. The project has produced high resolution digital images of the notebooks written by Dawes in 1790 and is developing searchable transcriptions of the manuscripts that will include the linguistic analysis made by Jaky Troy (published in 1993) along with topic maps (using the XTM standard for XML topic maps). This will enable users to search by topic, such as “animals” or “names” as well as linguistic topics, such as verb paradigms.

This project brings together knowledge and skills from archive studies, philology, linguistic analysis, and information and multimedia technologies. It is one of the more technically sophisticated of a series of projects that have emerged over the past several years to work on archival materials of Australian and Pacific languages, especially languages that have no or very few speakers. This work has parallels in the richly elaborated studies of Old English manuscripts published by Bernard Muir of Melbourne University as CDs and DVDs. The goal of both Muir’s work and the Dawes project is to present the original materials in an interactive format along with layers of standoff analytical markup.

A related kind of study is what we could call “second generation language documentation” (2GLD) where it is linguists’ fieldnotes and transcriptions which form the basis for documentation rather than speech events or speaker knowledge (usually because it is no longer possible to access such knowledge or events). PARADISEC has photographed over 10,000 pages of fieldnotes on a wide range of languages for 2GLD purposes using the system developed at the Australian Science and Technology Heritage Centre. This includes Arthur Capell’s notes on Pacific languages.

All over Australia now people are writing reports on the progress of their grants - to attach to their begging-letters for more grants. Reading the reports gives you the sense that Australia is a garden of projects, each a mass of bright blossoms fragrant with success. (So why haven't we solved world poverty or climate change yet?) That's why it was really really good to go along to the ARC E-Research post-funding workshop (14-15 February), where participants were encouraged to report on the problems they encountered in their projects...

It's Australian grant application time! Joy, rapture! (Skips lightly around the room)
If you're thinking about what to spend your requested squillions on, here are two thoughts:

Be realistic about how much it will cost to prepare your recordings for archiving, and then the cost of archiving itself - if you don't have a large friendly archive to hand. PARADISEC gives some guidelines on costs. And Dave Nathan has some shudder-inducing remarks on the current cost of archiving video. [1]

Getting manuscripts ready for publication
Many linguistics publishers do NOTHING about proof-reading or copy-editing your masterpiece. Your baby, you wash the nappies. And non-commercial linguistics publishers that do take copy-editing and proof-reading seriously, like Pacific Linguistics, need all the help you can give them - such as a publication subsidy to defray the costs of copy-editing. So imagine how many hours it might take to copy-edit your dictionary, double it, multiply by a suitable hourly rate - and build it into your application if you can.
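
If you like your back-of-the-envelope sums spelled out, the rule of thumb above (guess the hours, double them, multiply by the rate) looks like this in Python. The function name and the figures in the example are made up purely for illustration; plug in your own guess and whatever rate applies to you.

```python
def copy_editing_budget(estimated_hours: float, hourly_rate: float,
                        safety_factor: float = 2.0) -> float:
    """Estimated hours, multiplied by a safety factor (doubling by default), times the rate."""
    return estimated_hours * safety_factor * hourly_rate


# Hypothetical figures: 120 hours guessed, at 50 dollars an hour.
print(copy_editing_budget(120, 50))
```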

[1] Video and Language Documentation: panacea or madness? presented at the DELAMAN IV meeting. 2 November 2006, SOAS.

[ Barry sent this in response to the Artefacts, labels and linguists post. He is the curator responsible for the Pacific Cultures Gallery at the South Australian Museum, and has a brilliant website for an Upper Sepik-Central New Guinea project on the relations between material culture and language, geographical propinquity, population, subsistence and environment.]

The outcome of the renovation of the Pacific Gallery is a compromise between the enormous task of upgrading and relabelling an exhibition of 3000 artefacts and the available funding. A lot of money went into removing the 1960s ceiling, replacing aircon, carpet and lighting, and a paint job. I did not agree with the shiny black but the Goths had the numbers in the committee.

We have begun the difficult task of providing renewed labelling in the wall cases - difficult because one case may have around a hundred items and in such instances we can't provide a label for each individual item - instead we will try to say something about the collectivity of objects that gives the viewer some sense of what they are looking at, in terms of types and geographical distribution (such as in the display of over 80 stone headed clubs). Electronic means of providing information will not be limited in this way and individual items will be provided with full data, including language groups (speech communities) from which the objects have come (where known).


Whether languages can be property has generated further discussion, on Language Log, and on several anthropology blogs (thanks Kimberly!). Two themes emerged: power, and the potential conflict with open access.


The surprise for me from the Sustainable Data from Digital Fieldwork workshop (aka Suzzy Data..) was how much plant taxonomists and field linguists have in common. And how much we need to work together with librarians and archivists. We both have to look after records - the decaying recordings of the languages, and the dried specimens in the herbariums. We both work with the living communities, the trees that will get logged and the communities that live with the trees, and the families and children who will switch to speaking another language.

Dear ELAN Workshop attendees, and anyone who might find this of interest,

There were a few loose ends left at the end of the ELAN workshop last week. I'd particularly like to address one, the question as to whether we should aim for a standard set of ELAN templates which everyone uses.

I wandered into the office today to see Jane and Mark with a large map of part of the Northern Territory rolled out on the floor, discussing the issue of isoglosses and boundaries. Maps maps maps. They're just everywhere at the moment!


Last week, one of my favourite blogs, BoingBoing, had an interesting link to a new web-based research tool. I've been having a go over the weekend.

Check out the latest Language Archives Network News [sorry Dave!] newsletter here. It's got helpful information on how the Max Planck Institute (Nijmegen) can help you set up a local archive, a system of cataloguing linguistics information (IMDI) about your recordings, and on getting permanent unique resource identifiers for stuff stored on the web. And it's also got an article on recording information about plants and animals in the field that you might read in conjunction with Tom's post on this topic.


Our December conference is almost full, so if you were thinking of coming along, now is the time to register! The preliminary schedule is up, papers have been reviewed, everything is going along nicely (touch wood).

The third day of the conference is a workshop, with sections on audio and video recording, transcribing and managing your data, and producing outputs from this data. If this is more your thing you can come to just that. If you're interested in ELAN for transcribing or Shoebox/Toolbox, I thoroughly recommend it, but there'll be plenty of other useful stuff.

The Australian Research Council's website today has survived the pressure of everyone wanting to know whether they've got winning tickets. I was in a few syndicates (PARADISEC, continuing the Aboriginal Child Language Acquisition (ACLA) project, and a new project on Indonesian). And the lucky winners are...

The preliminary schedule for the conference "Sustainable data from digital fieldwork: from creation to archive and back" is now up. There look to be some really interesting projects on display. I had a sneak peek at EOPAS, a project to create a workflow and display interlinearised texts, and Annodex, a project to display multiple streams of visual, audio and textual data, both of which look great. I'll also be talking about the FieldHelper tool I've been working on this year, a tool to aid in the tagging of arbitrary metadata to field work data, amongst other things.

Our registration quota of 40 places is fast filling up. Please register now if you wish to come. Also note that you can choose to come to just the third day workshop if your interest is more in practical experience with current digital field work tools.


RNLD in collaboration with the conference "Sustainable data from fieldwork" is offering a day-long session on the creation, organisation, annotation and display of digital media. I highly recommend this to anyone interested in making digital recordings and annotating them. If you're new to Shoebox or ELAN and have any questions about using it, and you have your own data, then bring along your laptop. The workshop will be held at Sydney University on Wednesday, December 6, 2006.

Read on for the specifics


Every dead ethnographer (Indigenous or non-Indigenous) had a tin trunk in which all the information on the people, the language, the culture, anything, yes anything you want to know, could be found. But, I'm sorry, aunty died last week, and we don't know WHERE that tin trunk is now. (Source of observation: Michael Walsh). The anthropologist Ursula McConnel who worked with Wik Mungkan people on Cape York Peninsula, died in 1957, and people have been looking for her trunk ever since.

I got inspired to preen our blogroll, by following up blogrolls on other linguistics blogs (notably Language Log). This meant hours of pleasure going through musings, dead blogs, frozen blogs, (very!) personal blogs, e-learning blogs exhorting us to use blogs in teaching, e-learning blogs exhorting us not to use them, pictures of cats, gardens, parrots, business blogs, meta-blogs...

The results?

Jane's last post and a post on the ever excellent Language Log have got me thinking about permanence and accountability in the internet age. It's a theme that I encounter again and again, working for a digital archive.

if you want to spend three years thinking and writing about languages and cultures of Australia and the Asia-Pacific region ...
Nod to Ethics committee: HEALTH WARNING: and you're not ESPECIALLY worried about whether you'll find an interesting job afterwards....

... applications for the 2007 APA/UPA scholarships at the University of Sydney are now open. Information and an application can be downloaded from:


Many academic disciplines depend on analysis of primary data captured during fieldwork. Increasingly, researchers today are using digital methods for the whole life cycle of their primary data, from capture to organisation, submission to a repository or archive, and later access and dissemination in publications, teaching resources and conference presentations. This conference and workshop will showcase a number of projects that have been developing innovative and sustainable ways of managing such data.

Those of you in Canberra this week might be interested in the Australasian Sound Recordings Association annual conference "Listening" to be held 23-24 August at the National Film and Sound Archive. Among the several presentations of likely interest to readers of this blog will be the session on "Listening, Language and Culture" on 23 August, and a highlight will be the Alice Moyle lecture to be delivered by Gupapungu elder Joe Gumbula (Galiwin'ku Knowledge Centre, Elcho Island) on 24 August at 9.15. See the conference programme for full details!

On Thursday 29 June 2006 I joined heaps of overcoated people in the large, airy Reading Room of the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS) in Canberra. We were celebrating the launch of "Indigitisation" - a three year funded digitisation program for sound, text, film, and photographs. The view of lake, sky and trees and some determined ducks was a distraction from the speeches, but some things stuck - 40,000 hours of sound recordings of Indigenous languages to digitise, lots of expensive machines, some enthusiastic staff, and as yet no off-site backup. Storage problems mean they're not digitising everything at 24-bit, 96 kHz. They're planning to deliver some sound files through the web, where communities have given permission. So in future you should be able to click on some on-line catalogue entries and download sound files.
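
To see why storage bites, here is a rough Python sum of what 40,000 hours would occupy as uncompressed PCM at 24-bit, 96 kHz. The channel count is my assumption (I don't know how many of the recordings are mono or stereo), the function name is just for illustration, and the figure ignores file headers, backups and any derived access copies.

```python
def pcm_bytes_per_hour(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Uncompressed PCM size of one hour of audio, in bytes."""
    bytes_per_sample = bit_depth // 8
    return sample_rate_hz * bytes_per_sample * channels * 3600


HOURS = 40_000  # the figure mentioned at the launch

for channels in (1, 2):  # mono and stereo, both assumed for illustration
    total = pcm_bytes_per_hour(96_000, 24, channels) * HOURS
    print(f"{channels} channel(s): {total / 1e12:.1f} TB")
```

That works out at roughly 41 terabytes if everything were mono and roughly 83 if everything were stereo, before a single backup copy is made.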

The AIATSIS Library staff showed "Collectors of words" - a web presentation of the nineteenth century word-lists of Australian languages from E. M. Curr and Victorian and Tasmanian languages from R. B. Brough Smyth. They're available as pdfs, organised alphabetically according to the place the words were attributed to, and linked to maps. A nice feature is the linking to the AIATSIS catalogue, so that you can find other materials referring to the same language group. Unfortunately the pdfs are only images - you can't search for text in them. If you want text copies of Curr, go for the transcribed copies in AIATSIS's electronic text archive ASEDA. These aren't yet linked to the scanned images - a job for the future!


Pacific and Regional Archive for Digital Sources in Endangered Cultures / Sydney Humanities and Social Sciences e-Research Initiative Workshop

Presenters: Dr Linda Barwick, Director, PARADISEC and Frank Davey, Audio Preservation Officer, PARADISEC.
A free workshop covering: the range of research applications for recording and analysis of digital audiovisual media; questions of sustainability and archiving of audiovisual data; tools and resources for archiving, analysis and presentation of digital audio; the role of recordings in humanities disciplines; and using audio recordings in presentations and teaching. Includes hands-on sessions using Audacity sound editing software and Transcriber speech annotation software.


