Genomics 101

Nearly four months ago, I started working at the Wellcome Trust Sanger Institute. To say it’s a sweet gig would be something of an understatement! I can only think of a handful of cooler things: e.g., ESA, NASA, CERN; mostly by virtue of me understanding physics better than I do biology. Indeed, I’m not a genomicist — or even a bioinformatician — I’m just a run-of-the-mill software engineer (grunt). To be honest, being surrounded by alphageeks is a bit intimidating (think Stuart from The Big Bang Theory) with my barely-layman’s knowledge of genetics, but that’s entirely my problem.

Anyway, since joining, I’ve had a question about genomics that’s been bugging me. I think I’ve figured it out… (I probably should have asked my colleagues from the start, but in retrospect — presuming my understanding is correct — it’s a bit of a stupid question!)

If everyone’s DNA is different, how can there be a single human genome?

The clue to this is largely in the name: genomics is about genes, not nucleotides. DNA is a very long string of nucleotide pairs — the A, C, G and Ts you’re probably familiar with — but certain substrings of basepairs function to delimit genes. To analogise, written sentences begin with a capital letter and end with a full stop; likewise genes begin with a particular sequence and end with another. Genes are what define species and that’s what genomics is all about: Determining the manifest of genes for a species, what they do and how they interact. So every human has the same set of genes, but the “parameters” of those genes will be different. For simplification’s sake, if there were a single gene for eye colour: every person would have it, but some would have the blue version (allele), while others would have brown, etc.

To continue with my literary analogy, think of a genome for a particular species as a constrained poem, like a sonnet or limerick. The sentences (genes) within the poem must have the same meter, syllable count or rhyming scheme, but the actual words can be different. Thus we can have a multitude of poems with the same structure, but with vastly different content. Such is the genome to DNA.

I actually have another, related query, which is probably more of a thought experiment than an answerable question…

The DNA molecule is a polymer, like plastic or sugar. Indeed, the two strands that hold the nucleotides in its quintessential double helix are built from a sugar, deoxyribose (alternating with phosphate groups), hence the name deoxyribonucleic acid. In principle, polymers can be arbitrarily long (the human genome is over three billion basepairs), but are there non-chemical constraints on its length? For example, is there a point where it becomes too long to hold itself together, or where biochemical processes become too inefficient to be useful?

The reason I ask is that, as we are operating over an alphabet of just four symbols, any theoretical maximum genome size \(N\) would give us an upper bound on all varieties of DNA-based life at \(4^N\). Granted, this upper bound would far exceed the number of atoms in the universe, but presumably vast swathes of DNA don’t translate to viable lifeforms.

My point being that life can’t therefore be infinite in variety.
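To get a feel for the scale of that bound, here is a rough sketch in Python. It counts the decimal digits of \(4^N\) for sequences of exactly \(N\) bases, taking \(N\) to be three billion (the human figure quoted above, not a real maximum). The \(10^{80}\) figure for atoms in the observable universe is a commonly quoted estimate that I am assuming here, and the function name is my own.

```python
import math


def decimal_digits_of_power(base: int, exponent: int) -> int:
    """Number of decimal digits in base**exponent, without computing the power."""
    return math.floor(exponent * math.log10(base)) + 1


# Assumption: use the human genome length as a stand-in for N.
GENOME_LENGTH = 3_000_000_000

# Assumption: roughly 10**80 atoms in the observable universe, i.e. an 81-digit number.
ATOMS_DIGITS = 81

digits = decimal_digits_of_power(4, GENOME_LENGTH)
print(f"4^N has about {digits:,} decimal digits")
print(f"That is {digits // ATOMS_DIGITS:,} times as many digits as the atom count has")
```

So the bound is finite, but its digit count alone runs into the billions.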

Found 404s

When you get a 404 Not Found error on StackOverflow, you are presented with some obfuscated code:

Even with syntax highlighting turned off, the preprocessor macros immediately mark this as C to me, but with extra bits encoded into comments. Let’s deal with the C first:

The first line aliases putchar to v. putchar is part of the C standard library: it takes an int representing a character code (ASCII, here), prints that character to standard output and returns the value it wrote. It’s kind of like an identity function with IO side-effects! Then, in the second line, the print macro is defined so that it expands to the whole main function, taking a parameter x. The important thing to note is that x is not used anywhere in the macro’s body, so it is effectively redundant. Anyway, the main function deobfuscates to:

A function’s arguments have to be evaluated before it can be called, so the innermost call, putchar(52), gets evaluated first. ASCII point 52 is the character “4”, which is printed to the screen. This returns 52 and so putchar(52 - 4) outputs ASCII point 48 (the character “0”). Then, in the final call, we simply add four back to get ASCII 52. That is, the print(202*2) call at line 4 just outputs the string “404” to the screen and then exits.
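The evaluation order can be checked with a small simulation. This is a Python sketch, not the original C: the putchar stand-in below is my own, writing to an in-memory buffer and returning its argument, and the nested expression is the one described in the paragraph above.

```python
import io


def make_putchar(stream):
    """Build a stand-in for C's putchar that writes to `stream`."""
    def putchar(code: int) -> int:
        stream.write(chr(code))  # emit the character for this code point
        return code              # hand the value back, as putchar does on success
    return putchar


def run_main() -> str:
    stream = io.StringIO()
    v = make_putchar(stream)  # mirrors the macro that aliases putchar to v
    # Innermost call first: "4" (52), then "0" (52 - 4), then "4" (48 + 4).
    v(4 + v(v(52) - 4))
    return stream.getvalue()


print(run_main())
```

Each call can only run once its argument is known, which forces the inside-out order.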

As I said, the argument passed to print is irrelevant to C, but in some scripting languages “#” starts a comment, so the preprocessor lines are ignored and the print call is executed for real. This therefore makes the source in its entirety valid Python and, I think, Perl:

i.e., it outputs \(202\times 2 = 404\).

Next up is the third line, which sits inside a comment in the C source that starts at the end of the second. It’s in Brainfuck:

In Brainfuck, the only valid symbols are “>” (move up a memory address), “<” (move down an address), “+” (increase the value at the current address), “-” (decrease the value), “[” (loop until zero), “]” (close loop), “.” (output current address value) and “,” (get input); anything else (like that random “4”) should just be ignored. So we have:

  1. Move to address 1
  2. Increase the value by 8
  3. Start loop
  4. Move to address 2
  5. Increase by 6
  6. Return to address 1
  7. Decrease by one
  8. Loop until zero
  9. Move to address 2
  10. Increase by 4
  11. Output
  12. Decrease by 4
  13. Output
  14. Increase by 4
  15. Output

In steps 1 to 8, we create an iteration count (of eight) in address 1 and increase address 2 by six on each pass through the loop. Thus, when we exit the loop, the value at address 2 is \(8\times 6 = 48\). We then add four to that (i.e., 52, ASCII “4” again) and output, then reduce by four and output (ASCII “0”), then add four once more and output the final “4”.
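The fifteen steps can be replayed with a tiny interpreter. This is a minimal Python sketch under a few assumptions of mine: 8-bit wrapping cells, a 30,000-cell tape and input that yields zero when exhausted. The program string is rebuilt from the step list, not copied from the page.

```python
def run_brainfuck(source: str, input_data: str = "") -> str:
    """Interpret Brainfuck source and return everything it outputs."""
    code = [c for c in source if c in "><+-[].,"]  # any other character is ignored

    # Pre-compute the position of each bracket's partner.
    jumps, stack = {}, []
    for pos, op in enumerate(code):
        if op == "[":
            stack.append(pos)
        elif op == "]":
            start = stack.pop()
            jumps[start], jumps[pos] = pos, start

    tape = [0] * 30000
    ptr = pc = 0
    pending = iter(input_data)
    out = []
    while pc < len(code):
        op = code[pc]
        if op == ">":
            ptr += 1
        elif op == "<":
            ptr -= 1
        elif op == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif op == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif op == ".":
            out.append(chr(tape[ptr]))
        elif op == ",":
            tape[ptr] = ord(next(pending, "\0"))
        elif op == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # skip the loop body
        elif op == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # go round again
        pc += 1
    return "".join(out)


# Steps 1 to 15, written out with a plain run of eight "+" for step 2.
STEPS = ">++++++++[>++++++<-]>++++.----.++++."
print(run_brainfuck(STEPS))
```

Because unknown characters are filtered out up front, a stray digit in the middle of the program makes no difference to the result.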

In the entirety of this source, there are only four other valid Brainfuck symbols: the “+” and “-” on line 2, which increment then decrement the value at address 0 (i.e., do nothing); then the “>.” on line 5, which outputs the value at address 3 (i.e., nothing visible, because the value here is 0 and ASCII point 0 is just the null string terminator). Thus, again, the source in its entirety can be run through a Brainfuck interpreter to output “404”.

Now there’s the question of the final line: In C, Python, Perl and Brainfuck, it has no effect. There’s also the curious spacing of the #define macros on the first two lines. Also, what’s up with that random “4” in the Brainfuck code on line 3, anyway?… It may all just be a red herring, but I have a feeling that there is a bit more to this!

Either way, this is neat. Kudos to whoever wrote it! It reminds me of those “How many triangles can you see?” puzzles, but an order of magnitude more complex.

Movies of 2014

Last year’s rating system was an attempt to give more justifiable scores, but it was too complicated. Moreover, neither of us is a technical expert on some of the things we were judging, so that seemed a tad disingenuous. I was tempted to introduce an even more sophisticated system, but Mrs. Xoph and I had a very busy year and ended up rating everything we saw retrospectively (i.e., just now), using a simple ten-point scale!

It works well enough :)


Syntactic Shibboleths

A shibboleth is a word or phrase that can be used to identify a speaker’s background based on the way they pronounce it. So, for example, one could vaguely determine which end of the UK someone was from by how they pronounce the vowel in the word “bath”: [æ] would point towards the north, while it’s [ɑ] in the south (east). It’s actually possible to isolate a person’s origin quite precisely by testing a number of shibboleths together, each with its own region, and intersecting the results. A kind of linguistic Guess Who, if you like.

Four years ago, for my MA dissertation (available on request), I conducted an experiment that called for native English speakers. However, because it was disseminated over the Internet, I had no way of certifying this, while still needing a way to improve the signal-to-noise ratio in my data. My innovation was to test subjects’ responses (i.e., grammaticality judgements) on subtly warped syntactic structures which are peculiar to English (or, at least, Germanic languages). Subjects who baulk at the dodgy constructions are more likely to be native speakers, against those who accept them because they look “about right”.

These are my syntactic shibboleths (actually, “morphosyntactic”, but that breaks the alliteration). The first four have been “battle tested” in research — and proved very useful in sanitising my data — the others (and there are certainly more) seem like viable candidates from my experience. For each, I give three examples of corruption, largely to make native speakers cringe!


London Calling

Fortunately for me, I have now escaped from London. Not so fortunately, I still work there…which is a bit of an epic commute, but I digress!

Anyway, regarding the day job, recently people have been rather confused about how they can determine a London address. I immediately suggested using the postcode, as they generally follow a regular pattern and, in the capital, they correspond to the compass points in the “outward” part. That is:

Outward  Area
E        East
EC       East Central
WC       West Central
W        West
N        North
NW       North West
SE       South East
SW       South West

Presuming the postcodes have already been validated, this can easily be turned into a regular expression that identifies the London ones:
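One possible form of such an expression, sketched in Python. The pattern and function names are mine, and the only assumption beyond the table is that the area letters are immediately followed by a digit, which is what stops “E” from also catching areas such as EN or EX.

```python
import re

# Longer prefixes first for readability; the trailing \d does the real
# disambiguation (e.g. "EN1" fails because "E" is not followed by a digit).
INNER_LONDON = re.compile(r"^(?:EC|WC|NW|SE|SW|E|W|N)\d")


def is_inner_london(postcode: str) -> bool:
    """True if an already-validated postcode has a London compass-point area."""
    return INNER_LONDON.match(postcode.strip().upper()) is not None


for example in ("SW19 5AE", "EC1A 1BB", "EN1 1AA", "NE1 4ST"):
    print(example, is_inner_london(example))
```

It deliberately does not check the rest of the postcode, since validation is presumed to have happened already.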

This is fine for Inner London, but as the city slowly absorbs its suburban neighbours, this won’t work for the foetid glory that is its greater metropolitan area. We have to expand into the orbiting postal code regions:

Outward  Area
BR       Bromley
CM       Chelmsford
CR       Croydon
DA       Dartford
EN       Enfield
HA       Harrow
IG       Ilford
KT       Kingston-upon-Thames
RM       Romford
SM       Sutton
TW       Twickenham
UB       Southall
WD       Watford

Now it would be straightforward to put together an alternation group of these thirteen (unlucky for some) codes, but apparently that would overgeneralise as some areas covered are not considered within Greater London. To solve this problem, I was given a 64MB CSV file of every valid London postcode and let loose!

I wasn’t about to upload over 300,000 postcode records into a database table, so I figured I would proceed down my regular expression route. Fortunately, with a bit of command line fu, I was able to reduce those records down to 130 unique, Greater London outward postcodes:
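For anyone without the shell to hand, the same reduction might look something like this in Python. It is only a sketch: the file layout is an assumption (postcode in the first column, no header row), and the function names are mine. The outward part is taken to be everything except the last three characters, which matches the inward format of a digit followed by two letters.

```python
import csv


def outward_code(postcode: str) -> str:
    """Drop the three-character inward part, ignoring spaces and case."""
    compact = postcode.replace(" ", "").upper()
    return compact[:-3]


def unique_outwards(postcodes):
    """Sorted, de-duplicated outward codes from an iterable of postcodes."""
    return sorted({outward_code(p) for p in postcodes if p.strip()})


def outwards_from_csv(path: str, column: int = 0):
    """Assumption: postcodes live in `column` and there is no header row."""
    with open(path, newline="") as handle:
        return unique_outwards(row[column] for row in csv.reader(handle) if row)


print(unique_outwards(["BR1 1AA", "br1  2BB", "SW19 5AE", "SW195AF"]))
```

Filtering the result down to the thirteen areas in question is then a matter of checking each code's leading letters.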

The 130 outwards that I obtained represent the actually valid Greater London postcodes for the above districts. As postcodes follow a quite simple format, I was able to condense these into 13 regular expressions:

Outward  Area                  Regular Expression
BR       Bromley               ^BR[1-8]
CM       Chelmsford            ^CM1[34]
CR       Croydon               ^CR([02-9]|44|90)
DA       Dartford              ^DA([15-8]|1[4-8])
EN       Enfield               ^EN[1-9]
HA       Harrow                ^HA\d
IG       Ilford                ^IG([1-9]|11)
KT       Kingston-upon-Thames  ^KT([1-9]|1[7-9]|22)
RM       Romford               ^RM([1-9]|1[0-5]|50)
SM       Sutton                ^SM[1-7]
TW       Twickenham            ^TW([1-9]|1[0-59])
UB       Southall              ^UB([1-9]|1[018])
WD       Watford               ^WD([236]|23)

We know that the inward part of a postcode is always of the format \d[A-Z]{2}$ and that it should be separated from the outward by a space, although this is often omitted or doubled up. So we can take the alternation group of the above outwards, factor it and append the inward pattern to obtain this beast:
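A sketch of how such an expression could be assembled in Python from the thirteen patterns in the table. It is not factored the way a hand-written version would be. The named groups, the non-capturing inner groups and the use of \s* for the missing or doubled space are my own choices, as are all the identifiers.

```python
import re

# Area code -> (area name, outward pattern from the table without the leading "^").
OUTWARDS = {
    "BR": ("Bromley", r"BR[1-8]"),
    "CM": ("Chelmsford", r"CM1[34]"),
    "CR": ("Croydon", r"CR(?:[02-9]|44|90)"),
    "DA": ("Dartford", r"DA(?:[15-8]|1[4-8])"),
    "EN": ("Enfield", r"EN[1-9]"),
    "HA": ("Harrow", r"HA\d"),
    "IG": ("Ilford", r"IG(?:[1-9]|11)"),
    "KT": ("Kingston-upon-Thames", r"KT(?:[1-9]|1[7-9]|22)"),
    "RM": ("Romford", r"RM(?:[1-9]|1[0-5]|50)"),
    "SM": ("Sutton", r"SM[1-7]"),
    "TW": ("Twickenham", r"TW(?:[1-9]|1[0-59])"),
    "UB": ("Southall", r"UB(?:[1-9]|1[018])"),
    "WD": ("Watford", r"WD(?:[236]|23)"),
}

# One named group per area, so a match also tells us which area it was.
alternation = "|".join(
    f"(?P<{code}>{pattern})" for code, (_, pattern) in OUTWARDS.items()
)

# Outward alternation, optional whitespace, then the inward part.
GREATER_LONDON = re.compile(rf"^(?:{alternation})\s*\d[A-Z]{{2}}$")


def classify(postcode: str):
    """Return the area name for a Greater London postcode, else None."""
    match = GREATER_LONDON.match(postcode.strip().upper())
    if match is None:
        return None
    code = next(name for name, value in match.groupdict().items() if value is not None)
    return OUTWARDS[code][0]


for example in ("SM1 1AA", "KT22 7AA", "KT10 9AA", "TW195AA"):
    print(example, classify(example))
```

Anchoring the inward part at the end is what keeps a short pattern such as BR[1-8] from also accepting a longer outward code that merely starts the same way.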

Now we can both determine and classify London addresses without having to resort to an enormous lookup table :)

Not to gloat, but it’s most satisfying to know one’s tools well. What would have taken my colleagues (even those who claim to be developers) literally days, I was able to do in less than an hour.


The Beast of Wimbledon

About three weeks ago, in the dead of a cold and ironically hackneyed foggy night, my wife and I were stirred from our bed by a scream. An ear-piercing series of short bursts of terror that chilled us to the bone. The sound of a woman — albeit somewhat caricatured — in immediate peril. We lay there in confused silence, our duvet huddled around to protect us from whatever horror our imaginations concocted outside our bedroom window.

The screams died away oddly. A few minutes passed. Timidly, I got up from bed and peered through a crack in the curtain, careful not to cause too much movement that would compromise my surveillance. Nothing. Just a cold, misty night. Other than the screams, there had been no sign of a struggle. No hasty footsteps. No clattering of discarded weapons. No thuds of lifeless bodies. Just the terrible screams and terrible silence.

Convinced that it was some poor soul about to meet their maker, I called the police. I had no evidence — there was no body — but it sounded so much like the death throes of my fellow man, to my ear, I was compelled to play it safe. Some fifteen minutes later, the flashing blue lights of a police car shone through the chinks between our curtains, dancing on our bedroom ceiling. We heard the officers’ deliberate footsteps outside, while their radios buzzed and crackled static. Nothing. They pulled away, leaving us trying to sleep.

My wife wasn’t convinced that it was the screams of a person. Given the obvious lack of foul play, I was inclined to agree with her, but what was it? An urban fox? I searched for videos, to compare the noises they make with that now etched into my psyche. They bark and howl like dogs; admittedly, at a higher pitch, but nothing like what we heard that night. Nor owls. Nor cats.

Last night, I heard it again. My wife slept soundly, while that same petrifying series of womanly screeches resonated through my slumber. They lasted longer than the first time; becoming distinctly less human-like as they progressed. An unholy retching that, again, petered out into the night, leaving no trace. I eventually fell back to sleep.

I’m rational enough to weigh up the evidence. There is none that impinges on my or my wife’s safety. However, the mystery remains to disturb my sanity: What could it be? A feral womble? The spirit of the departed, perhaps brutally murdered and forced to haunt SW19 until her death is avenged?

The stuff local folklore is made of.