One of the things that annoys me about bioinformaticians is their propensity to say "look, I've got a cool algorithm/fast computer/nifty widget/whatever; look what I can do with it!" instead of the rather more useful "Do you have any interesting biological problems I can work on, please?". I think this is because most bioinformaticians are computer geeks first, and biologists only third, if at all. What we really want is biologists who know how to wrangle code to good effect.
Which is a convoluted way of saying that if there's a small computational problem I need addressing then I'm more likely to have a go at it myself rather than muck around trying to get someone else interested. And this usually works, at least at the level of perl scripts and web pages.
Unfortunately, I have a need for a small perl script that is giving me a bit of a headache. You see, I have these microarrays, and I need to get at the chromosome sequence that corresponds to the probe signals that go up and down. I thought this would be easy: output a CSV file that contains columns of probeID, chromosome number, start location and stop location, and build URLs from that information.
But it's turning out to be a bit more complicated than I thought it would be, and my rusty perl skills are feeling the strain. I can open the CSV, read it into an array (actually an array of arrays; n rows of four fields) but then getting those n rows into four sub-arrays (one for each of probeID, chromosome number, start location and stop location) each of n elements is proving tricky. I'd rather not use Text::CSV and what-have-you modules, because I want the code to be portable, but even if I did I still don't see how to do it.
Constructing the actual URLs (a 0 . . n loop with concatenation of scalars from the four sub-arrays) will be trivial, once I can get the four sub-arrays sorted.
Ideas?




Comments
Gee, I'd love to be of assistance, but I'm too busy staring blankly at my cool algorithm/fast computer/nifty widget. Do you always start a request for help by insulting the very people who are among the most likely to be able to help you?
Posted by: BioCompGeek | September 19, 2007 11:21 PM
I'm not sure I understand your problem. Why are you reading them into an array of arrays if your final goal is to get them into four arrays? Why can't you just do
while (<>)
{
    my ($probeid, $cn, $start, $stop) = split ",";
    push @probeids, $probeid;
    push @cns, $cn;
    ...
}
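Spelled out a little further, the same idea together with the URL loop from the post might look like the sketch below. It is only a sketch under assumptions: the base URL and its query parameter names are hypothetical placeholders (the post never says which genome browser is the target), the sub names read_probes and build_urls are made up for illustration, and the CSV is assumed to have plain fields with no quoting or embedded commas.

```perl
use strict;
use warnings;

# Hypothetical base URL; substitute the genome browser you actually use.
my $BASE_URL = 'http://example.org/browser';

# Split CSV lines (probeID,chromosome,start,stop) into four parallel arrays.
# Assumes plain fields: no quotes and no commas inside a field.
sub read_probes {
    my (@lines) = @_;
    my (@probeids, @cns, @starts, @stops);
    for my $line (@lines) {
        $line =~ s/\r?\n\z//;              # strip Unix or Windows line ending
        next if $line =~ /^\s*$/;          # skip blank lines
        my ($probeid, $cn, $start, $stop) = split /,/, $line;
        push @probeids, $probeid;
        push @cns,      $cn;
        push @starts,   $start;
        push @stops,    $stop;
    }
    return (\@probeids, \@cns, \@starts, \@stops);
}

# Build one URL per row by walking the four arrays in step.
# The parameter names (probe, chr, start, stop) are assumptions.
sub build_urls {
    my ($probeids, $cns, $starts, $stops) = @_;
    my @urls;
    for my $i (0 .. $#{$probeids}) {
        push @urls, "$BASE_URL?probe=$probeids->[$i]&chr=$cns->[$i]"
                  . "&start=$starts->[$i]&stop=$stops->[$i]";
    }
    return @urls;
}

# Made-up rows standing in for the real CSV file.
my @urls = build_urls(read_probes("P001,1,100,200\n", "P002,X,5000,5600\n"));
print "$_\n" for @urls;
```

To read a real file instead of the made-up rows, pass the lines from the filehandle, for example read_probes(<>) with the file name given on the command line.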
Posted by: Paul Tomblin | September 20, 2007 02:21 AM
I do enjoy pushing hot buttons.
Paul, I didn't think of that. Thanks. I was stuck in a readline/accumulator nightmare.
Posted by: BK | September 20, 2007 06:25 AM
One thing that annoys me about biologists is their unwillingness to learn how to wrangle code to good effect ;) There are plenty of biology-trained bioinformaticians, able and willing to help (I'm one). Most often, nobody asks or if they do, they ask badly.
I often find that Perl complex data structures (hashes of hashes, hashes of arrays...) are the way to store biological data. You have a chromosome; on that chromosome is a probe; the probe has a position. The trick is to map that description into a data structure. The unique identifiers - chromosome and probe - would make good hash keys. The start/end makes a good array. Result - a hash of hashes of arrays. Something like:
@{$hash{$cn}{$probeid}} = ($start, $end)
Probes matching in more than one location? No problem; push [$start, $end] instead, giving a hash (key chromosome) of hashes (key probeids) of arrays (number of positions) of arrays (position starts/ends).
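As a rough sketch of that last variant (hash of hashes of arrays of arrays), something like the following should work. The rows are made up for illustration, and %probes is just a name chosen here.

```perl
use strict;
use warnings;

my %probes;   # $probes{chromosome}{probeID} = [ [start, end], ... ]

# Made-up rows for illustration: probeID, chromosome, start, end.
my @rows = (
    [ 'P001', '1', 100,  200  ],
    [ 'P002', 'X', 5000, 5600 ],
    [ 'P001', '1', 900,  1000 ],   # same probe, second location
);

for my $row (@rows) {
    my ($probeid, $cn, $start, $end) = @$row;
    # Autovivification creates the inner hash and array on first use.
    push @{ $probes{$cn}{$probeid} }, [ $start, $end ];
}

# Walk the structure: chromosome, then probe, then each position.
for my $cn (sort keys %probes) {
    for my $probeid (sort keys %{ $probes{$cn} }) {
        for my $pos (@{ $probes{$cn}{$probeid} }) {
            my ($start, $end) = @$pos;
            print "chr$cn $probeid $start-$end\n";
        }
    }
}
```

The nice part is that looking up every position for one probe on one chromosome is a single expression, @{ $probes{$cn}{$probeid} }, with no searching through parallel arrays.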
Posted by: Neil | September 20, 2007 12:15 PM
Precisely Neil - I believe you started out as a biologist so you understand the problem (from both sides). Paul (check out his website) is a smart computer cookie, who doesn't pretend to know anything biological, so again, someone I'd be happy to go to for help. It's the "ooh, no, why don't you do this because it's more interesting?" attitude that bugs me - and that I've come up against a lot.
Which means I often find it less hassle in the long term to roll up my sleeves and try to hack out something. Naturally, I was cautious to use 'most' in my first paragraph precisely so that I could filter out the unhelpful ones. . .
You've given me something to think about, Neil. I might step back and have a serious think about the workflow. The lab will be quiet next week - Combio :)
Posted by: BK | September 20, 2007 12:35 PM
Yeah, I'm exposed regularly to both sides of the coin. Bioinformatics should be about solving biological problems and far too much of it is not - see any issue of the journal Bioinformatics, for example! On the other hand, a lot of biologists are very poor at engaging with bioinformaticians, for a multitude of reasons.
We need far more people who are comfortable speaking "biology" and "programming", since bioinformatics is really just the process of converting back and forth between biological "objects" and their representation as code. In many cases, there's a direct mapping (sequence - string of characters, for a trivial example).
I'll be manning POS-MON-86, 1:30-2:30 next Monday.
Posted by: Neil | September 20, 2007 01:57 PM
Neil wrote: "if they do, they ask badly" and I fully agree from the biologist side of things. We don't usually have the words to ask the questions right!
In the chronic shortage of bioinformaticians in my institution, we biologists had to use the few tools we could manage - basics of Excel and Access (sorry about FMPro, but we're slaves to Microsoft and untrained in programming) to do something that any proper programmer could probably have done in one tenth the time and with far more elegance and efficiency. This was similar to BK's task - take a long list of gene IDs that are interesting (SAGE data this time), map the physical interval of the whole gene as known; on the other side, convert the list of intervals corresponding to human disease loci, available from OMIM, into physical coordinates, and find out which of our interesting genes fall within the known disease intervals (with a margin).
I am describing this task as a biologist would, see?
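For comparison, the core of that task in code might be sketched as below. This is only an illustration under assumptions: the coordinates and names are made up, "falls within" is taken to mean that the gene overlaps the disease interval once the interval is widened by the margin on both sides, and in_locus is a name invented here.

```perl
use strict;
use warnings;

# True if gene [g_start, g_end] overlaps locus [l_start, l_end]
# after the locus is widened by $margin on each side.
sub in_locus {
    my ($g_start, $g_end, $l_start, $l_end, $margin) = @_;
    return $g_start <= $l_end + $margin && $g_end >= $l_start - $margin;
}

# Made-up data: gene => [chromosome, start, end].
my %genes = (
    geneA => [ '1', 1_000,  2_000  ],
    geneB => [ '1', 50_000, 51_000 ],
);
# Made-up disease loci: [name, chromosome, start, end].
my @loci   = ( [ 'locus1', '1', 2_500, 10_000 ] );
my $margin = 1_000;

for my $gene (sort keys %genes) {
    my ($g_chr, $g_start, $g_end) = @{ $genes{$gene} };
    for my $locus (@loci) {
        my ($name, $l_chr, $l_start, $l_end) = @$locus;
        next unless $g_chr eq $l_chr;      # only compare within a chromosome
        print "$gene falls within $name\n"
            if in_locus($g_start, $g_end, $l_start, $l_end, $margin);
    }
}
```

Comparing every gene against every locus is fine for lists of a few thousand entries; for much larger lists, sorting both by chromosome and start position first would cut the work down.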
What would be great is to have some sort of list on OpenWetWare of interested bioinformaticians around the world who say, sure, I'm available and would be happy to collaborate (making sure it is considered a true collaboration and not just the usual use-the-tech attitude). Anyone?
Posted by: Alethea | September 24, 2007 05:07 PM