Note: This post is somewhere to publish my scalar version of the Fisher-Yates shuffle, in Perl. It gets a bit geeky, but that's the price of Google-fu.
We had need, just before Christmas, to search genomic sequences for reasonably short ( ≤ 13 bases) and see if the found sequence was in an exon, intron, or across the splice junctions. Such a tool did not seem readily available, so I read the Ensembl API documentation, dusted off my somewhat rusty bush Perl , and coded it myself.
It was a learning experience. But in the end I have a few little Perl programs that do what we want and have given us interesting data (the question, What does it all mean? , is still begging, however). While thinking about this and hacking away I realized that I needed appropriate control searches. My idea was to perform Monte Carlo-type simulations, by either randomizing the input or the entire human genome (ouch. . .) and seeing what numbers fell out.
My first attempt at this was to scramble the query sequence a fair number of times (a thousand, say) and search the genome with it. I invented a little subroutine that looked like this
sub scalar_shuffle {
my $ordered = shift;
my $count = length ($ordered);
my $shuffled;
while ($count) {
$shuffled .= substr($ordered, int rand($count), 1, '');
$count--;
}
return $shuffled;
}
What this does is is to take the string containing the query sequence ($ordered) and sample it at random, putting the base it finds each time onto the end of the new string $shuffled. And it worked, for short sequences.
But the results I got were not really usable: the query strings were too short to get sensible standard deviations. For an eleven base sequence, I calculated that we should see 1430 hits, but I was getting (for a 1,000 x shuffle) anywhere between four and forty thousand hits, with a standard deviation of 2360. So I decided that shuffling the genome itself was a better bet.
I tried it out on the first chromosome, and things started to be a bit s l o w. Very slow, in fact. This was probably something to do with having two strings of 300 MB each. That's a lot of RAM.
So I googled and discovered the Fisher-Yates shuffle, which shuffles an array in situ (by swapping randomly paired elements of that array) and is easily implemented in Perl.
But all the implementations I could find were for arrays, not scalars (strings). And in Perl, an array is an array of scalars, which means they get very expensive in terms of memory and processing when you are looking at hundreds of millions of elements. Moreover, when my code reads in the individual chromosome sequences it puts them into scalars (precisely to avoid array overload, as well as to make regular expression searches easier) and converting them into arrays would have been yet more processing overhead.
So I sat, and thought, and looked at my original but highly inefficient pseudo-random shuffle, and the Fisher-Yates code, and then it struck me. Why not just substitute scalar expressions and operations? So I did, and this little subroutine positively blazed when I tested it:
Scalar Fisher-Yates Shuffle
sub fisher_yates_shuffle {
my $chromosome = shift;
my $count = length($chromosome);
while ($count--) {
my $j = int rand($count+1);
my $swapper = substr($chromosome, $j, 1);
substr($chromosome, $count, 1, $swapper);
}
return $chromosome;
}
Here, we choose a random position in the string and swap it with the first base. We keep doing this until we've swapped all except the first base with something. For a chromosome, it's random enough, and executes on a timescale that means I can shuffle the entire genome ten times before tea.
If this post helps one other person with a similar problem, then it was worth the effort. If any of you Perl ninjas out there can see a better/more efficient way of doing it, then do let me know, although I probably don't need to implement it now.