[ from Nick Thieberger, PARADISEC, Melbourne University branch ]

I am a firm believer in open access to information, especially research information that has been created by taxpayers' funds. Thus it came as something of a surprise to find myself likened to the main man of the dark forces of corporate information ownership on a site formerly known as the 'Stolen Grammars' site.

Constructed by a linguist in Stockholm, the site offered downloadable versions of many grammars which had been copied from various locations ("Browse my collection of stolen .pdf reference grammars if you'd rather not pay.")

In tandem with open access there has to be a mechanism for recognising creative effort, otherwise no-one will put their work online. 'Stolen Grammars' did not link to existing open access resources, but copied them without proper attribution. What kind of researcher wants data that is not properly attributed? If you want to cite an electronic grammar in a paper, then you want to cite its proper URL - citing 'Stolen Grammars' is not going to impress the average publisher.

David Nash and I both wrote to 'Stolen Grammars' and asked that our work, which was already in open access repositories, be linked to rather than copied. I was also concerned that the site had appropriated PARADISEC material from the Arthur Capell papers, which we had painstakingly imaged (14,000 photos) and entered metadata for. While 'Stolen Grammars' had signed an access form stating that they would not further distribute this material, they did so, and did not link to or acknowledge the work of PARADISEC.

Linguistic archives rely on the good faith of those signing agreements about how they will use data from the archive. Depositors have a right to trust that the material they deposit will not be misused.

We encourage the use of open access repositories that provide proper infrastructure so that the collection is well described using standard descriptors (like standard language names) and will be available into the future (with persistent identification of the items in the collection). I also suggested to 'Stolen Grammars' that they could become an OLAC repository of links to exactly the same information that they had copied, but using a set of metadata and links that would make their repository add value to research efforts.

Anyway, on receipt of our unthreatening request to link to our work rather than appropriating it, 'Stolen Grammars' picked up their bat and went home (to use an Australian colloquial expression), closing down access to all items on their site, not just the ones we had queried, and putting the note on their site that has since caused us to receive a message likening us to the Sheriff of Nottingham to the Robin Hood of 'Stolen Grammars'.

We hope that 'Stolen Grammars' may return in a new form, perhaps as 'Found Grammars' with links to resources on the web.


Interestingly, Harald Hammarström, to whose homepage the 'Stolen Grammars' page is linked, recently contributed extensive sets of annotated links to descriptive linguistic materials to the LING-TYP list (see http://listserv.linguistlist.org/cgi-bin/wa?A2=ind0705c&L=lingtyp&D=1&F=&S=&X=0196E36722CD2F5231&Y=pa2%40soas.ac.uk&P=230 and http://listserv.linguistlist.org/cgi-bin/wa?A2=ind0705c&L=lingtyp&D=1&F=&S=&X=0196E36722CD2F5231&Y=pa2%40soas.ac.uk&P=1239). PARADISEC and the Melbourne e-prints repository are among them.

Good to see you and David Nash putting him back on the straight and narrow.

Dear Dr. Thieberger and other readers,
I'm happy that the debate is re-opened and that I was invited to share my perspective, but I am a bit disappointed in Dr. Thieberger's quite ignorant blogpost (e.g. I am neither a linguist, nor am I from or have ever lived in Stockholm, and many relevant pieces of information and argumentation are absent). Now, I have to open this response with a boring section of clarifications and apologetics that I thought we were beyond.


I have never likened Dr. Thieberger to "the main man of the dark forces" or anything similar, now, formerly, on the web or in any email. Another linguist wrote such things in an email where I was cc:ed, but I did not write that, nor do I endorse it. The other linguist wrote to ask me why the site was down, before there was any note explaining why, which I explained, and he emailed you of his own accord. This linguist said he was just back from Leipzig where many others had apparently wondered the same thing, and their guess was way off (thinking some magna publisher had sued me). During the time the site had been down, I had already received some fifty emails asking the same question, so after I had replied to the linguist I put up the same note there. Dr. Thieberger and Dr. Nash were informed about the note from the first second and I asked for corrections if there was any misrepresentation in the description from their perspective (no such correction was offered). There is in fact nothing at all wrong or even misleading in that note, and the reason the former stolen grammars page has not been converted into a page with only links is clearly stated: I don't have the time.

It is fully true that I copied files of Capell notes from the PARADISEC site, as well as files from Melbourne ePrints, MIT dSpace and many other open access repositories, and put local copies of them on my site. All bibliographic information, author, year, publisher and so on, was posted and available for download in .bib format. There was no information on where I found the file or who made the document digital originally, i.e. before I copied it, but there were no indications of any kind that any credit for authoring or digitizing the documents should go to me. I know that this entails infringing on copyright (where one does not have the right to spread), but I did not think that was a significant difference in practice, because all those files were available anyway, and were posted originally because the posters wanted people to read them freely. My argument was: what's the difference? Two clicks or one? Five clicks or one?

The files were taken down as soon as I found out there were people disagreeing (Dr. Thieberger and Dr. Nash's emails were the first such emails I got during the circa two years the site was up = 22157 visits to the index page). I should have asked before, of course, and not waited until someone alerted me. I responded with a rough indication of who had downloaded and how many downloads there were, and suggested that legal action should be taken if any money had been lost (especially for PARADISEC, where according to Dr. Thieberger future funds were at risk). I failed to appreciate the wish of the original depositors to be credited in some form. Offering to compensate monetarily or otherwise is the best I can do after having made a serious and gratuitous mistake in interpreting the functioning of open access repositories. In his response Dr. Thieberger appeared happy with my taking down the files and has later informed me that there was never a threat to sue. Thus there appears to be no conflict as to the former existence of the files on the site.

So why do you say I picked up the bat and that I cause emails to be sent to you? I could not cause other people to write emails even if I wanted to. (Btw. there is no "they" -- it's just me.)

Similarly, ca 200, out of a total of 1177, items were not bound by any copyright agreements and are not affected by arguments presented by Dr. Nash and Dr. Thieberger -- the "fault" that they are no longer there is fully mine (I cannot bother to sort them out) and can't be said to have anything to do with them, and so should not be used to cast darkness or whatever on them, as the other linguist (again, not me) may have done in an email.

Dr. Thieberger also highlighted the fact that my stolen grammars site was not annotated with metadata, and that this would have made the site much more useful. I responded that I fully agree, and that my first priority was SIL-codes and sorting. The reason I hadn't provided such metadata already was that I simply hadn't had the time. Dr. Thieberger fully understands this reason since he is working on a shoestring budget, while I was working on no budget at all.


The stolen grammars site is down and I have no plans to repost it if that is against anyone's will. I gained nothing from it myself, so I have no personal problems with this situation. There are no local copies of any files, so you should be happy too.

That being said, let me try to explain here why I think your arguments about local copies of descriptive language materials, or open slather if you wish, make little sense. This is a debate I tried to initiate with you by email but you did not reply for a while and, surprisingly, you did not address those points in the blogpost. I'll make use of some references to the old stolen grammars site as it happens to make good examples, but this is not an attempt to justify its previous existence or argue for its re-opening.

We probably both agree that we want everyone to read and learn as much as they like about the languages of the world, that's one of the reasons we became researchers. The question is, how should this be organized so that incentive to generate more is reproduced, and the cost, in money or hassle, to the reader is no more than necessary?

In short, my argument is that open slather access serves our goal better because:
1. It significantly increases spread
2. It significantly spreads the incentive to get more people to join the open access/slather
3. It works against the formation of selective networks of shady exchanges
The only real disbenefit I know that cannot easily be overcome is that of credit in the form of hit counts, but they are not crucial in general.

In long, see below. Again, I have no personal benefit to collect if open slather becomes the norm -- I can get anything I want anyway since I am well-connected and I happen to be at a rich Western university.

1. You claim that open access does not work unless the creators of the information get credit for their effort. In the case of digital documents on languages of the world, there would be two creators, one who authored the information and one who made the information digital. There is no question that authors need their credit to go on to produce, especially as most of them are dependent on this for their future employment, and this is currently well-handled by placing the author's name on the title page and around any reference to the document authored. (Since data on lesser-known languages have no commercial potential, I guess this is the only kind of recompense possible.) When it comes to those who make documents digital, there are those who do it and need credit for it in order to go on, and those who don't. I don't know the ratio between the two, but the more of any kind, the better.

Those who make information digital without claiming credit do it for this reason: they enjoyed reading the material in question and they want to allow their friends and others to have the same experience more easily. Maybe they also think that if they spend a little effort, it will be more likely that others do too, which will benefit them in accessing similar materials.

For example, when teenagers convert files from their record collection to .mp3:s and share them on the internet, they don't do it for this reason. In my case, on a world-wide level no-one has better track of grammars on and off the net, so I could help other linguists save time with relatively little extra effort. I also hoped to increase interest in the kind of linguistics I like best.

It's simple enough, the credit for those who digitize could be issued in the same way as for authors, namely with the necessary information on the first page of a .pdf file.

You hold that the only way to ensure proper crediting is if all access has to travel through the digital originator's internet site. For this you have offered the following arguments so far, which I comment on below:

-Updates and version control
This functionality is superfluous: if all access does not travel through the originator's site then you still have the same possibilities for updates and version control as a regular article or book in a university library. By your reasoning university libraries would not function. Readers do not go through the publisher's or author's site when they pick up a book to read.

-Metadata and stable locator id
Metadata and a stable locator can preferably be posted within the electronic document itself.

-Credit in the form of hit counts
I know too little to know what the role of hit counts is for PARADISEC (perhaps you can expand?), but in general hit counts seem to have no essential influence on the incentive to produce. University libraries have survived long enough without them, and almost all of the sites from which I used to steal files clearly put the files there for open and easy reading, regardless of who reads what and how many times. If one wants hit counts it's simple to advertise, and up goes your hit count with the amount of effort spent on advertising. But maybe hit counts are good for something, if only for the spirits of the originator.

Against this, we may weigh the benefits of open slather access. Open slather access has little or no extra infrastructure costs, allows for users to repackage (e.g. together with something else), convert to another format, add more metadata and so on, but above all facilitates spreading to more readers. How do I know that spreading would increase significantly with open slather access? Collaborative filtering, as it is known technically, is an incredibly effective way to let users locate the items they want (and vice versa). It is used ubiquitously in automatic recommendation algorithms and is provably much more effective than content-based recommendations. Now, are people more likely to share links or share files (which corresponds to the difference between open access and open slather)? Of course, some people will bother to share only links, others will bother only with files. We should take advantage of both kinds, the more the better.

We may also note that many (but not all) sites that provide open access do it with access restrictions, user-unfriendly formats, or require users to perform some kind of pirouette before granting access. Many of these measures are totally meaningless, and open slather policies will solve them more efficiently. Regular open access sites will not always care to fix this, or be able to afford to. (What in fact happened in the case of the PARADISEC data is that one linguist asked his friend, who is a more experienced programmer, if the page-by-page .jpg:s could not be automatically downloaded and converted into .pdf:s. This he did and shared underground, not knowing that another linguist had done the same thing already.)

Perhaps the clearest proof that open slather does work well in science is to look at other disciplines. As I have pointed out already, in computer science almost everyone keeps papers and theses freely online on their homepages. Sites like CiteSeer index many of these, and provide cross-citation statistics and local copies in many different formats. There is no sign of the problems Dr. Thieberger has predicted for linguistics. According to Dr. Thieberger's view, computer scientists wouldn't do this unless he annotated their files with metadata, kept them under register-to-get-access only etc. Why should linguistics be any different, I wonder?

2. Open slather spreads the incentive and effort to digitize further materials over many more people. Inasmuch as the author does not object (or is deceased) and the publisher does not either (e.g. fails to respond to email), this is a good thing. For example, as a direct result of the former stolen grammars page, people from various parts of the world came forward and arranged for the artes of Chiquitano and Lule, early Papuan wordlists, some killer Brazilian phd:s, Stevenson's phd and some 200 other valuable items to be digitized. All were from third world countries and showed no interest in "credit for their valuable efforts". As I was writing this very sentence, I got an email with the perfect example, which is pasted below. What open access repository do you think this individual would have submitted to?

Date: Tue, 22 May 2007 00:25:35 +0700 (ICT)
To: harald2@cs.chalmers.se
Subject: This weekend

Hi Harald,

I will continue to send the rest pages of the book this weekend. I caught a flu, but anyway, not bird flu. Indonesia is number one in this case. It has the most bird-flu victims in the world. Well, I think, I have been working too hard to make my dream come true. Lately, I go to sleep at about 03.00 or even 04.30 a.m. and have to get up at 06.30 a.m.
My problem is lack of materials because university libraries in Indonesia do not keep grammar books of various world languages, let alone endangered ones. Indonesia is a developing country and has a lot of socio-economic problems. Corruption and bribery are ubiquitous.
I am dreaming of living close to a big library where I can learn a lot of grammar books of world languages after hour, from the evening till dawn. The years I spent when I studied in the USA were unforgetable, because I could read books I wanted. Book collection of the libraries of University of Kentucky is really impressive, make me envy day and night.

Greetings from Indonesia,

However, almost all the people who used to download from the stolen grammars site were not struck by any contagious incentive. And to be fair, maybe the stolen grammars site has discouraged other individuals from making their works digitally available. I don't know of any (except possibly PARADISEC), but this is hard for me to find out.

3. Never forget that when you are catering to the linguistics community you are catering to hyenas. Linguists will copy grammars with little regard to proper this and proper that, and share them with their friends in informal networks. I am one of the worst sinners, but I have also done a lot to have Swedish university libraries buy the same items, to list and systematize lesser-known grammars, and I wish I could do even more to grant everyone access under the same conditions. Maybe there are linguists without sin, but if there are, they must be a clear minority, and high-profile or rich Western linguists are no exception. For example, from my weblogs I computed the number of downloads of items from the former stolen grammars site issued from some well-known domains:

*.leidenuniv.nl 966
*.su.se 20315
*.cnrs.fr 903
*.edu (i.e. USA) 10929
*.soas.co.uk 1183
*.eva.mpg.de 3158
*.edu.au 1927

Of the last entry, *.unimelb.edu.au has 691 downloads and *.anu.edu.au has 689.

Another example is the CD with the WALS grammars, which has been copied a gazillion times in private circles, with few qualms about copyrights and open slather.

I don't need to educate you on the malicious impact on research if those old in the game and well-connected can acquire materials much more easily than others, let alone the time-waste involved. If there is a culture of open slather, there will be enough people willing to break such friends-only chains. If copyright is to be broken anyway with inefficient hurdles attached, why not make it fair for all and remove the hurdles?

Linguist X who is interested in some hard-to-find material on some language will go about copying it in some way or another, the next linguist will spend effort doing the same in parallel, and so it goes on, when they really could have shared the effort once and for all.

I offered to help set up an OLAC annotated list of links for placement under LINGUIST LIST; you can consider this offer withdrawn. According to you, I did not want to do this unless I got some recompense for my creative work (I really don't know what it is that I want? A virtual statue maybe).

I am still happy to share the old stolen grammars index files if anyone wants to spend their time changing them to links. I could help with clarifications on what item came from where (preferably over the phone). I already got a semi-offer (not sure) the other day so it might happen, but if there is anyone else in addition, that's fine too.

Looking forward to hearing counterarguments,

I believe this appears at the top of every page that contains Capell's materials:

Please do not copy material from this site for further distribution but rather link to this site. PARADISEC has raised funds to digitise this collection and would like to be recognised for the work that we have put into developing the online presentation of fieldnotes. If you copy and distribute this data and do not acknowledge PARADISEC's work then we will have to put password protection on the data.

I don't know if it was put up specifically in response to this issue, but if it was there before then it seems pretty unequivocal to me.

There was also this disclaimer, which I would think would have been there for a while:

Copyright: Paradisec believes that many of the items provided through this guide are no longer the subject of copyright restrictions, or have been cleared for display in this service by the Copyright owners. However, Paradisec invites any individuals who believe they hold current rights over items provided through this service to make contact.

What happens if someone does come forward who demonstrably holds rights over the items? We'd have to not only change our access conditions, but potentially track down anyone who had copied the materials and request similar procedures, which, in an open-slather situation, could be difficult.

I agree fully with an open-source informational environment, but wouldn't go as far as the 'open-slather' environment that you suggest. Multiple, freely-available, digital copies are a good thing. But any sites that store local copies for free distribution should be constrained by the conditions of the originator. I don't think this is too much to ask, for a potential distributor of grammars to have express permission to distribute from the creator of the material. Since the material is open-source, the creator would probably have no issue with it, and it means that if any issue arises, they know who is distributing their intellectual property and can act accordingly.

To sum it up, 'freedom of information' to me, means that it costs the end-user nothing monetarily, but not necessarily that they don't have to go through the normal hoops in finding that material.

Perhaps it's nothing to be surprised about but so far the arguments I posted have been ignored. Why raise the issue in a blogpost if you do not want to debate the issue?

Aidan's post misses the point. First, as to the unequivocal copyright notice, I don't remember seeing that particular formulation earlier, but as I have stated clearly, I knew anyway that it was technically not legal to distribute local copies. So there is no issue whatsoever there. Second, the fact that PARADISEC too illegally spreads copyrighted material is no argument against an open slather policy. The invitation for holders to come forward is worth nothing legally, and subsequent spreading by entities other than the original spreader constitutes separate cases of copyright violation.

Third, yes of course any users of open-access materials should abide by the intentions of the originator. The question is, if the originator is a researcher and the material he/she produced is not commercially interesting, what kind of restrictions would he/she want to put on it? For example, the originator might want that any reader should perform a somersault (clearly non-monetary) before reading. No originator of research data I know does this, because it would be completely alien to his/her goals. Similarly, it should lie in the originator's interest to remove any other restrictions that are also alien to his/her goals. Whether a hurdle is normal or traditional has no relevance at all. Open access is great, but as I have argued, open slather is even better. These arguments are intended to convince originators in the field of linguistics that open slather is a superior policy. They are not arguments that sanction gratuitous misuse of intentions and agreements of existing resources.


