I've never seen that before. I broke the BLAST server:
An error has occurred on the server, Informational Message: Too many HSPs to save all
Informational Message: [blastsrv4.REAL]: Error: CPU usage limit was exceeded, resulting in SIGXCPU (24).
My young apprentice (who, according to the Black Queen, is spending too much time in my presence and picking up "bad habits". Pah, I say, it's all part of the training), Beta Gal, sent me a file of primer sequences yesterday. Obviously she needs more training, because the file was created in Microsoft Word. Oh well, she's still young.
Now the thing is, I've been using iWork for the last year or so. One of the many gorgeous things about the word processing and page layout component is how it handles Word documents. With the latest version (to which I treated myself last month) Pages will track changes and handle comments seamlessly between itself and Word. It consistently handles Word documents better than Microsoft Word (a pattern develops: $BOSS sent me a PowerPoint file that Microsoft PowerPoint screwed up - lost the images, formatting, completely buggered. Keynote, with the exception of ls, possibly the most beautiful program I've ever used, didn't even blink).
So Beta Gal sends me a .docx file. The version of Office (2004) I'm running on this machine doesn't recognize it.

It's a sodding Word document for pity's sake. What kind of crack-brained pot-smoking monkeys are doing this to a simple text-processing file format? Even my trusty code-wrench BBEdit throws up its character sets in despair. The only text it shows me that isn't binary gobbledegook says
[Content_Types].xml
Pages, of course, smiles sweetly and shows me what my young Padawan has been up to.
Breathe, BK; think sunny thoughts, find your happy place.
Members of Joel Sussman's lab at the Weizmann Institute have developed Proteopedia (direct link), an online tool for making structural biology clearer for chemists and biologists by linking textual content to 3D structures.
Impressive.
For a born-again structural biologist like myself, this looks like an invaluable research and teaching aid. I shall follow its career with interest.
(via Peter MR, who reminds me what great fun BioMOO was, back in the day)
Note: This post is somewhere to publish my scalar version of the Fisher-Yates shuffle, in Perl. It gets a bit geeky, but that's the price of Google-fu.
We had need, just before Christmas, to search genomic sequences for reasonably short (≤ 13 bases) sequences and see if the found sequence was in an exon, intron, or across the splice junctions. Such a tool did not seem readily available, so I read the Ensembl API documentation, dusted off my somewhat rusty bush Perl, and coded it myself.
It was a learning experience. But in the end I have a few little Perl programs that do what we want and have given us interesting data (the question, What does it all mean?, is still begging, however). While thinking about this and hacking away I realized that I needed appropriate control searches. My idea was to perform Monte Carlo-type simulations, by either randomizing the input or the entire human genome (ouch. . .) and seeing what numbers fell out.
My first attempt at this was to scramble the query sequence a fair number of times (a thousand, say) and search the genome with it. I invented a little subroutine that looked like this
sub scalar_shuffle {
    my $ordered = shift;
    my $count = length ($ordered);
    my $shuffled;
    while ($count) {
        $shuffled .= substr($ordered, int rand($count), 1, '');
        $count--;
    }
    return $shuffled;
}
What this does is to take the string containing the query sequence ($ordered) and sample it at random, putting the base it finds each time onto the end of the new string $shuffled. And it worked, for short sequences.
But the results I got were not really usable: the query strings were too short to get sensible standard deviations. For an eleven base sequence, I calculated that we should see 1430 hits, but I was getting (for a 1,000 x shuffle) anywhere between four and forty thousand hits, with a standard deviation of 2360. So I decided that shuffling the genome itself was a better bet.
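A figure of that size falls out of a back-of-envelope estimate. The sketch below assumes a genome of about 3 x 10^9 bases, both strands searched, and all four bases equally likely at every position. These inputs, and the function name expected_hits, are assumptions for illustration and not necessarily what went into the calculation above.

```perl
use strict;
use warnings;

# Expected number of exact matches for a k-mer in random sequence.
# Assumptions (illustrative): uniform base frequencies, independent positions.
sub expected_hits {
    my ($genome_length, $k, $strands) = @_;
    my $positions = ($genome_length - $k + 1) * $strands;
    return $positions / (4 ** $k);
}

# Assumed genome size of 3e9 bases, both strands searched.
my $hits = expected_hits(3e9, 11, 2);
printf "Expected hits for an 11-mer: %.1f\n", $hits;
```

With those inputs the estimate lands at roughly 1430, so the observed spread of four to forty thousand is wide by any measure.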
I tried it out on the first chromosome, and things started to be a bit s l o w. Very slow, in fact. This was probably something to do with having two strings of 300 MB each. That's a lot of RAM.
So I googled and discovered the Fisher-Yates shuffle, which shuffles an array in situ (by swapping randomly paired elements of that array) and is easily implemented in Perl.
But all the implementations I could find were for arrays, not scalars (strings). And in Perl, an array is an array of scalars, which means they get very expensive in terms of memory and processing when you are looking at hundreds of millions of elements. Moreover, when my code reads in the individual chromosome sequences it puts them into scalars (precisely to avoid array overload, as well as to make regular expression searches easier) and converting them into arrays would have been yet more processing overhead.
So I sat, and thought, and looked at my original but highly inefficient pseudo-random shuffle, and the Fisher-Yates code, and then it struck me. Why not just substitute scalar expressions and operations? So I did, and this little subroutine positively blazed when I tested it:
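sub fisher_yates_shuffle {
    my $chromosome = shift;
    my $count = length($chromosome);
    while ($count--) {
        # pick a random position from 0 to $count inclusive
        my $j = int rand($count+1);
        # swap the bases at positions $j and $count
        my $swapper = substr($chromosome, $j, 1);
        substr($chromosome, $j, 1, substr($chromosome, $count, 1));
        substr($chromosome, $count, 1, $swapper);
    }
    return $chromosome;
}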
Here, we work backwards from the end of the string: at each step we choose a random position at or before the current one and swap the two bases. We keep doing this until every position has had its turn. For a chromosome, it's random enough, and executes on a timescale that means I can shuffle the entire genome ten times before tea.
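A quick sanity check for any shuffle of this kind is that it only rearranges bases and never creates or loses any. The sketch below writes out a swap-based string shuffle in full so that it runs on its own; the names shuffle_string and base_counts and the test sequence are made up for illustration.

```perl
use strict;
use warnings;

# Fisher-Yates shuffle on a string, swapping single characters with substr.
sub shuffle_string {
    my $seq = shift;
    my $i = length($seq);
    while ($i--) {
        my $j = int rand($i + 1);    # random position in 0..$i
        next if $j == $i;            # swapping a base with itself is a no-op
        my $tmp = substr($seq, $i, 1);
        substr($seq, $i, 1, substr($seq, $j, 1));
        substr($seq, $j, 1, $tmp);
    }
    return $seq;
}

# Summarise base composition as a sorted string, e.g. "A=4,C=4,...".
sub base_counts {
    my $seq = shift;
    my %count;
    $count{$_}++ for split //, $seq;
    return join ',', map { "$_=$count{$_}" } sort keys %count;
}

my $original = 'ACGTACGTTTGACCA';
my $shuffled = shuffle_string($original);
my $same     = base_counts($original) eq base_counts($shuffled);
my $verdict  = $same ? 'composition preserved' : 'composition changed';
print "$original\n$shuffled\n$verdict\n";
```

If the composition ever changes, the shuffle is overwriting bases rather than swapping them.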
If this post helps one other person with a similar problem, then it was worth the effort. If any of you Perl ninjas out there can see a better/more efficient way of doing it, then do let me know, although I probably don't need to implement it now.
Ian York is approaching the 1.0 release of XPlasMap (which I have written about previously). He's actively soliciting bug reports and feature requests.
Go to it, team.
Now that's what I call a scary error message.
BioPerl is, apparently, distributed under the Perl Artistic License, and not as I read it first, the 'Perl Autistic Licence'.
(now the ears of my ears awake and
now the eyes of my eyes are opened)
When Ricardipus isn't whinging about his inability to manage his time, he's making comments on other people's weblogs.
He makes the very good point that the general public is never going to read science weblogs, but is tied to 'conventional' (as opposed to nuclear? I think we should be told) media. My thesis is that 'conventional' journalists probably are not going to have the time nor the inclination to use material from those same weblogs, so it all seems a bit pointless. However, I prefer e. e. cummings over Susan Musgrave and am more optimistic than Ricardipus, both when it comes to time management and the effects of science weblogs.
Journalists who report on science generally have to make do with press releases and interviews with heads of labs, from which they need to distill pithy soundbites. These are frequently misrepresented (deliberately or accidentally) and devoid of important context. We know that the scientific literature is impenetrable and execrably written and that this makes it difficult for scientists themselves, let alone the educated layman, to understand published work. Believe me, this week I am editing undergraduate theses and I know, I just know there must exist a compulsory Pompous and Ineffectual Writing 101 course.
What hope, then? What hope for the layman or the trained writer crafting a press release to understand the primary literature? What hope for editors of school textbooks to get the science right? What hope for seeing balanced and accurate science writing in the papers?
Weblogs.
Hang on a minute BK, what are you smoking? you might well ask. Bear with me.
Dr Ludbrook says
It seems to me that there are two components to writing well in science and medicine. The first, and most fundamental, is the need for a solid grounding in English grammar, usage and style.
This obviously starts at school, and we might in a moment of heady optimism imagine that universities offer writing courses that actually are effective. But this involves time and money, and is probably too late for most of us.
Ludbrook suggests that we should make use of freelance editors. This is obviously a marketing ploy and I shall pay it no further attention. But here's the crux:
The second is constant practice in writing, especially in the format required by biomedical journals.
"Constant practice in writing". Now we see how weblogs can be useful. Not necessarily as a means of direct or indirect (via traditional media) communication with 'the public', but as a tool for improving writing.
If practice makes perfect, then write to get better at writing. I don't know if he does it with this aim in mind, but Ian's weblog is a fantastic example of writing about science constantly, with no hope of reaching the 'general public'. But Ian has succeeded in making immunology more accessible to me, and I bet his writing has improved because he does it.
I'll be the first to admit that a lot of the 'science' weblogs out there are terrible, and that the authors would rather take cheap shots at people who disagree with them than actually tackle anything with substance. But the beauty of the weblog format is the opportunity for feedback, for instant peer review. Which means that you and me, the readers, have the chance to point out that someone is being unintelligible, or overly verbose, or stupid, or narrow-minded. And maybe, when they learn to write betterer as a result, when they write their papers or their summaries for the PR office, they'll thank you.
In the meantime (because this won't happen overnight), if you're writing a weblog, or a paper, or a school newsletter, then check out this little Flesch-Kincaid readability gadget. I should run it on these theses and see if I can break it.
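The gadget itself isn't reproduced here, but the two standard Flesch formulas behind such tools are simple enough to sketch. The function names and the example counts are made up for illustration; counting syllables reliably is the hard part and is left out.

```perl
use strict;
use warnings;

# Flesch Reading Ease: higher scores mean easier text.
sub flesch_reading_ease {
    my ($words, $sentences, $syllables) = @_;
    return 206.835 - 1.015 * ($words / $sentences) - 84.6 * ($syllables / $words);
}

# Flesch-Kincaid Grade Level: approximate US school grade.
sub flesch_kincaid_grade {
    my ($words, $sentences, $syllables) = @_;
    return 0.39 * ($words / $sentences) + 11.8 * ($syllables / $words) - 15.59;
}

# Hypothetical counts for a short passage: 100 words, 5 sentences, 150 syllables.
printf "Reading ease: %.1f\n", flesch_reading_ease(100, 5, 150);
printf "Grade level:  %.1f\n", flesch_kincaid_grade(100, 5, 150);
```

Long sentences and long words both push the grade level up, which is exactly what a pompous thesis does.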
Pithy soundbite: Just because it's a blog doesn't mean it has to be crap.
Back in the glory days of System 7 it seemed that everybody had a plasmid-drawing tool (I was originally going to say 'widget', but you'll see why I changed my mind in a minute). Then something went horribly wrong and all these useful little programs broke or went outrageously commercial (you know who you are) and consequently out of the reach of many of us.
So for a while there we were using Illustrator or similar to draw plasmid maps, which is a little bit like using a howitzer to take out a cockroach — it's expensive, there's a lot of collateral damage and you're never quite sure if you actually got it.
OK, so maybe that analogy is flawed. Never mind. Where was I? Oh, yes. Plasmid maps.
The incomparable but fairly well-hidden Ian York not only writes intelligently about immunology, thereby demonstrating that he is much, much smarter than I am, but I am reminded (because I promised to do this months ago) that he's written a little DNA-mapping program, XPlasMap, which as far as I'm concerned proves that he's superhuman.

I heartily endorse this product or service
This gadget (which would, I'm convinced, make an excellent OS X widget) slurps a sequence, finds the open reading frames and restriction sites, and makes a quick'n'dirty plasmid (or linear) map in a mere couple of clicks. It is dead easy to insert fragments and to change the labels and colours, although removing sites you don't want to see is a little tiresome, and it would be nice to be able to zoom into crowded bits of the map. It also would be handy to be able to insert fragments by restriction site rather than base number, and to export the entire sequence (making the save file a package, as does EnzymeX, would help here), but for a simple to learn, easy to use tool for making records of all your plasmids, you can't really go wrong with XPlasMap.
Oh, and it outputs JPEG and PNG files for further furtling, although the reason for the JPEGs being 4,000 pixels wide is probably merely further proof of Ian's superiority.
Wednesday's post was born of frustration. You see, I had a nice script hacked together, but it just would not 'print' what I expected it to during my debugging. Which made me think I was doing something boneheaded. I implemented Paul's suggestion, and I was still not getting the expected output.
Then it struck me. If my code was good, the CSV itself (produced by FileMaker) must be knackered. And indeed, playing with Text::CSV showed that perl was barfing on, and thus unable to parse, the very first line.
I left the problem alone yesterday, mainly because we had a group meeting with a visiting speaker and I had to put together a spiel for it, but also because it felt like my brain was oozing out of my ears.
So this morning, I went to talk to my cells, came back to my desk, sipped the coffee that the Black Queen had kindly bought for me, and constructed a very simple CSV file:
"r1c1","r1c2","r1c3","r1c4"
"r2c1","r2c2","r2c3","r2c4"
Then I showed it to my perl script (irrelevant bits stripped; yes I'm using 'strict'):
open (CSV, $probefile) || die "Can't open probefile! $!\n";
while (<CSV>) {
    chomp;
    ($probeSetID, $chr, $start, $stop) = split (/,/,$_);
    push (@probeSetIDs, $probeSetID);
    push (@chrs, $chr);
    push (@starts, $start);
    push (@stops, $stop);
}
close CSV;
print join("\t", @probeSetIDs), "\n";
print join("\t", @chrs), "\n";
print join("\t", @starts), "\n";
print join("\t", @stops), "\n";
And BINGO! Four arrays, each with the appropriate column:
"r1c1" "r2c1"
"r1c2" "r2c2"
"r1c3" "r2c3"
"r1c4" "r2c4"
The conclusion is that FileMaker Pro can't make proper CSV files. There appears to be an invisible character in the first line. Oh hum. There are ways around this, I hope, but damn, it'd be nice to be able to trust something occasionally.
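The culprit was never pinned down, but two usual suspects in exported text files are a byte-order mark at the very start of the file and classic Mac carriage-return line endings, which make a line-by-line read see the whole file as one line. The sketch below guards against both, on the assumption that one of them is the problem; the helper name clean_lines and the sample data are made up.

```perl
use strict;
use warnings;

# Split raw file content into lines, tolerating a leading byte-order mark
# and Unix (\n), DOS (\r\n) or classic Mac (\r) line endings.
sub clean_lines {
    my $raw = shift;
    $raw =~ s/^\xEF\xBB\xBF//;    # UTF-8 BOM as raw bytes
    return grep { length } split /\r\n|\r|\n/, $raw;
}

# Simulated export: BOM, then two records separated by bare carriage returns.
my $raw = "\xEF\xBB\xBF" . qq{"r1c1","r1c2"\r"r2c1","r2c2"\r};
my @lines = clean_lines($raw);
print scalar(@lines), " lines; first is $lines[0]\n";
```

Reading the file in one gulp and splitting it this way sidesteps the question of which line ending the exporting program chose.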
As they say,
"AHS, ASS".
One of the things that annoys me about bioinformaticians is their propensity to say "look, I've got a cool algorithm/fast computer/nifty widget/whatever; look what I can do with it!" instead of the rather more useful "Do you have any interesting biological problems I can work on, please?". I think this is because most bioinformaticians are computer geeks first, and biologists only third, if at all. What we really want is biologists who know how to wrangle code to good effect.
Which is a convoluted way of saying that if there's a small computational problem I need addressing then I'm more likely to have a go at it myself rather than muck around trying to get someone else interested. And this usually works, at least at the level of perl scripts and web pages.
Unfortunately, I have a need for a small perl script that is giving me a bit of a headache. You see, I have these microarrays, and I need to get at the chromosome sequence that corresponds to the probe signals that go up and down. I thought this would be easy: output a CSV file that contains columns of probeID, chromosome number, start location and stop location, and build URLs from that information.
But it's turning out to be a bit more complicated than I thought it would be, and my rusty perl skills are feeling the strain. I can open the CSV, read it in to an array (actually an array of arrays; n rows of four fields) but then getting those n rows into four sub-arrays (one for each of probeID, chromosome number, start location and stop location) each of n elements is proving tricky. I'd rather not use the Text::CSV and what-have-you modules, because I want the code to be portable, but even if I did I still don't see how to do it.
Constructing the actual URLs (a 0 . . n loop with concatenation of scalars from the four sub-arrays) will be trivial, once I can get the four sub-arrays sorted.
Ideas?
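One module-free way to do this is to peel each column off the array of arrays with map, then walk the columns by index. A sketch, with made-up rows and a placeholder URL template (example.org and its parameter names are hypothetical, not a real genome browser address):

```perl
use strict;
use warnings;

# Made-up rows: probe ID, chromosome, start, stop.
my @rows = (
    [ 'probe_1', '1', 1000, 1500 ],
    [ 'probe_2', 'X', 2000, 2600 ],
);

# Column N is element N of every row.
my @probeSetIDs = map { $_->[0] } @rows;
my @chrs        = map { $_->[1] } @rows;
my @starts      = map { $_->[2] } @rows;
my @stops       = map { $_->[3] } @rows;

# Placeholder URL template; substitute the real browser address.
my @urls;
for my $i (0 .. $#rows) {
    push @urls, "http://example.org/region?chr=$chrs[$i]&start=$starts[$i]&end=$stops[$i]";
}
print "$probeSetIDs[$_]\t$urls[$_]\n" for 0 .. $#urls;
```

The four column arrays all have n elements in the same order as the rows, so a single index ties them back together.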
When a colleague sends an email that begins
Sadly, you need to use Internet Explorer for this
and then goes on to say
technical problems should be reported to me.... and, on that note, just be aware that when looking at a list of courses or people, the list is restricted to showing FIVE items at a time. We've already had some "Mac" issues .... deep joy....
I hope that they realize that the 'issues' are not with the Mac, but with coders who think that "Internet Explorer" and "adequate web browser" are intersecting sets.
Conversation between fresh-faced young Honours student and myself:
PFY: What primer design program should I use?
BK: Your brain and a pencil and paper. I haven't used a primer design program in over ten years.
PFY: Gosh. I didn't know they existed back then.
An interesting (and I mean that nicely) article over at MacResearch for anyone considering using Mac OS X rather than Linux. Or 'in addition to' as well, I guess.
. . .there must be quite a few scientists that have been brought up on Linux or Unix and have either been forced into using Mac OS X, or have chosen that path themselves. Apple has plenty of information for Windows switchers, but Linux switchers seem to be a bit neglected.
I did not know about pbpaste or pbcopy, so the article is good even for hardcore Macheads like myself.
In my copious spare time (hah) I would have liked to get to grips with programming, in a real language such as C or Objective-C. I did have a fiddle just before leaving the UK, and indeed was able to wrestle Perl to create some useful (no, really) web-based applications such as the Protein Calculator and the Codon Usage Wrangler.
Perl has its uses but I'm really interested in learning an object-oriented language that I can easily integrate with Apple's wonderful Xcode environment to make shiny applications for my shiny Mac. It's a hobby thing. The problem I have had (aside from the lack of time) is the absence of a worthy adversary, I mean project. So the Protein Calculator and the Codon Wrangler came about because there was a need for those functions in the lab, and how I implemented them seemed the fastest and most accessible way at the time (we had some Other OS users in the lab and I was not going to beat my head about cross-platform code).
This imposter is not really a structural biologist; he's one of our IT support trolls testing the 3D capabilities of Coot:

"Hello, my name is John and I'm a scientist. I haven't been able to clone anything for three months. I'm taking it one day at a time."
When I get to the bottom I go back to the top of the slide
Where I stop and turn
and I go for a ride
A bit of a frustrating week.
The thermal cycler broke down just before it had completed a crucial experiment (but I think it went far enough to get some useful data), someone put agar plates with no antibiotic into the ampicillin plate bag (screwing up my cloning), a Western blot is tantalizingly inconclusive and a beautiful hypothesis appears to have been brutally slain.
EMBOSS (the freeborn son of GCG) hits version 4. Download site is here.
Lots of goodies in the new version, here are some highlights:
