Pad Thai Fluke

Despite what I said last year, my New Year’s Resolution for 2015 has been to cook a new dish once a week. Even if I only got a 10% success rate, that would still be five new recipes added to my repertoire. So far, it’s proven to be extremely successful: just one or two rejects; mostly good stuff and a few really great new things.

Maybe I’ll do a post about all the new food I’ve been cooking at some point, but let’s focus on today. The week before last’s (4th April) new recipe was pad thai — a classic Thai noodle dish — based on, and largely simplified from, this article I found in the Guardian. The result wasn’t bad, but I was convinced I could do better…and better, I did!

Let me share with you my recipe!

n.b., “Fluke” as in lucky, rather than the parasitic flatworm…probably shouldn’t have shared that 😕


Journalistic Integrity

Yesterday, Newsbeat ran a story about the phenomenon of young Europeans — particularly girls and women — defecting to Syria. During it, a counter-terrorism specialist lambasted their use of cryptographic software and urged that — while “privacy is important to us”™ — backdoors for law enforcement should be readily available to make their jobs easier.

Won't somebody please think of the children!

Now I don’t want to paint myself as a target — and, by that, I mean with respect to certain signals intelligence agencies — but as a digital rights advocate with a passing knowledge of cryptography, it troubled me that the report completely neglected any discussion of the potential repercussions of compromising such systems. Setting a precedent for surveillance is a slippery slope, the consequences of which ought to be made abundantly clear to all.

Maybe you think Newsbeat is dumbed-down, so I shouldn’t expect too much. I think that’s a bit patronising: Newsbeat may cater to its demographic (e.g., today’s story on the new, racially diverse emoji), but I wouldn’t say it’s dumbed-down. Indeed, the reason I listen to Radio One on my commute is because it’s entertaining. I gave Radio Four a try and, quite apart from being very dry, it only served to highlight the problem I have with the BBC’s news and current affairs: It panders to its audience. While it’s true that my views often align with this side, it does nothing to broaden my — or others’ — horizons by preaching to the choir.

This isn’t the first time such one-sidedness has been peddled by them. Indeed, I notice that it’s often the case when I have some non-trivial knowledge of the subject in question. (I’ve heard similar things from others: Mrs. Xoph often questions the impression of her kin given by the BBC.) As such, without evidence to the contrary, I think it would be reasonable to extrapolate this bias to any of their stories.

Of course, it goes without saying that one shouldn’t believe everything they read, hear or see — critical thinking and all — but in my mind, the BBC always had a reputation for impartiality. When did this change?

Genomics 101

Nearly four months ago, I started working at the Wellcome Trust Sanger Institute. To say it’s a sweet gig would be something of an understatement! I can only think of a handful of cooler things: e.g., ESA, NASA, CERN; mostly by virtue of me understanding physics better than I do biology. Indeed, I’m not a genomicist — or even a bioinformatician — I’m just a run-of-the-mill software engineer (grunt). To be honest, being surrounded by alphageeks is a bit intimidating (think Stuart from The Big Bang Theory) with my barely-layman’s knowledge of genetics, but that’s entirely my problem.

Anyway, since joining, I’ve had a question about genomics that’s been bugging me. I think I’ve figured it out… (I probably should have asked my colleagues from the start, but in retrospect — presuming my understanding is correct — it’s a bit of a stupid question!)

If everyone’s DNA is different, how can there be a single human genome?

The clue to this is largely in the name: genomics is about genes, not nucleotides. DNA is a very long string of nucleotide pairs — the A, C, G and Ts you’re probably familiar with — but certain substrings of basepairs function to delimit genes. To analogise, written sentences begin with a capital letter and end with a full stop; likewise, genes begin with a particular sequence and end with another. Genes are what define species and that’s what genomics is all about: Determining the manifest of genes for a species, what they do and how they interact. So every human has the same set of genes, but the “parameters” of those genes will be different. For simplification’s sake, if there were a single gene for eye colour, every person would have it, but some would have the blue version (allele), while others would have brown, etc.

To continue with my literary analogy, think of a genome for a particular species as a constricted poem, like a sonnet or limerick. The sentences (genes) within the poem must have the same meter, syllable count or rhyming scheme, but the actual words can be different. Thus we can have a multitude of poems with the same structure, but with vastly different content. Such is the genome to DNA.

I actually have another, related query that is probably more of a thought experiment than something answerable…

The DNA molecule is a polymer, like plastic or sugar. Indeed, the two strands that hold the nucleotides in its quintessential double helix are sugars: deoxyribose, hence the name deoxyribonucleic acid. In principle, polymers can be arbitrarily long — the human genome is over three billion basepairs — but are there non-chemical constraints on its length? For example, is there a point where it becomes too long to hold itself together, or where biochemical processes become too inefficient to be useful?

The reason I ask is because, as we are operating over an alphabet of just four symbols, any theoretical maximum genome size \(N\) would give us an upper bound on all varieties of DNA-based life at \(\frac{4}{3}(4^N - 1)\). Granted, this upper bound would unimaginably exceed the number of atoms in the universe, but presumably vast swathes of DNA don’t translate to viable lifeforms.
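To see where that expression comes from: summing over every possible sequence length up to \(N\) is just a geometric series,

\[
\sum_{k=1}^{N} 4^k = \frac{4^{N+1} - 4}{3} = \frac{4}{3}\left(4^N - 1\right).
\]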

My point being that, if this were true, then in a very nit-picky way, life couldn’t be truly infinite in variety.

Found 404s

When you get a 404 Not Found error on StackOverflow, you are presented with some obfuscated code:

Even with syntax highlighting turned off, the preprocessor macros immediately mark this as C to me, but with extra bits encoded into comments. Let’s deal with the C first:

The first line aliases putchar to v. putchar is part of the C standard library and simply takes an integer, representing an ASCII character, which it prints to the screen before returning the original input value. It’s kind of like an identity function with IO side-effects! Then, in the second line, the main function is simply aliased to print, taking a parameter x. The important thing to note is that x is not bound to anything in this macro, so it is effectively redundant. Anyway, the main function deobfuscates to:

The nested calls are evaluated from the inside out, so putchar(52) goes first. ASCII point 52 is the character “4”, which is printed to the screen. This returns 52 and so putchar(52 - 4) outputs ASCII point 48 (the character “0”). Then, in the final call, we simply add four back to get ASCII 52. That is, the print(202*2) call on line 4 just outputs the string “404” to the screen and then exits.
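If you want to convince yourself of that chain of calls, here is a throwaway Python stand-in; the putchar below is my own helper, mimicking the C function’s print-a-character-and-return-it behaviour:

```python
# A sketch only: this putchar mimics C's putchar (print the character for a
# given code, then return that code), so the nested calls mirror the macro above.
def putchar(code):
    print(chr(code), end="")
    return code

putchar(4 + putchar(putchar(52) - 4))  # prints "4", then "0", then "4"
print()                                # trailing newline, for tidiness
```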

As I said, the argument passed to print is irrelevant to C, but in some scripting languages the “#” is interpreted as a comment. This therefore makes the source in its entirety valid Python and, I think, Perl:

i.e., Outputs \(202\times 2 = 404\).

Next up is the third line, which is a comment in the C source that starts at the end of the second. It’s in Brainfuck:

In Brainfuck, the only valid symbols are “>” (move up a memory address), “<” (move down an address), “+” (increase the value at the current address), “-” (decrease the value), “[” (loop until zero), “]” (close loop), “.” (output current address value) and “,” (get input); anything else (like that random “4”) should just be ignored. So we have:

  1. Move to address 1
  2. Increase the value by 8
  3. Start loop
  4. Move to address 2
  5. Increase by 6
  6. Return to address 1
  7. Decrease by one
  8. Loop until zero
  9. Move to address 2
  10. Increase by 4
  11. Output
  12. Decrease by 4
  13. Output
  14. Increase by 4
  15. Output

In steps 1 to 8, we create an iteration count of eight in address 1 and increase address 2 by six in each loop. Thus, when we exit the loop, the value at address 2 is \(8\times 6 = 48\). We then add four to that (i.e., 52, ASCII “4” again) and output it, then reduce by four and output (ASCII “0”), then set up the final “4”.
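To check the arithmetic, here is a minimal Brainfuck interpreter in Python; the program string is my reconstruction of the fragment from the steps above, with the stray “4” dropped (it would be ignored anyway) and input (“,”) left out since it isn’t needed:

```python
# A sketch of a Brainfuck interpreter: just enough of the language to run the
# reconstructed fragment and confirm it prints "404".
def brainfuck(program, tape_size=30000):
    tape, ptr, pc, out = [0] * tape_size, 0, 0, []
    while pc < len(program):
        op = program[pc]
        if op == ">":
            ptr += 1
        elif op == "<":
            ptr -= 1
        elif op == "+":
            tape[ptr] += 1
        elif op == "-":
            tape[ptr] -= 1
        elif op == ".":
            out.append(chr(tape[ptr]))
        elif op == "[" and tape[ptr] == 0:
            depth = 1
            while depth:                 # jump forward past the matching "]"
                pc += 1
                depth += {"[": 1, "]": -1}.get(program[pc], 0)
        elif op == "]" and tape[ptr] != 0:
            depth = 1
            while depth:                 # jump back to the matching "["
                pc -= 1
                depth += {"]": 1, "[": -1}.get(program[pc], 0)
        pc += 1                          # anything unrecognised is ignored
    return "".join(out)

print(brainfuck(">++++++++[>++++++<-]>++++.----.++++."))  # 404
```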

In the entirety of this source, there are only four other valid Brainfuck symbols: the “+” and “-” on line 2, which increment then decrement the value at address 0 (i.e., a net no-op); then the “>.” on line 5, which will output the value at address 3 (again, effectively nothing, because the value there is 0 and ASCII point 0 is just the null terminator). Thus, again, the source in its entirety can be run through a Brainfuck interpreter to output “404”.

Now there’s the question of the final line: In C, Python, Perl and Brainfuck, it has no effect. There’s also the curious spacing of the #define macros on the first two lines. Also, what’s up with that random “4” in the Brainfuck code on line 3, anyway?… It may all just be a red herring, but I have a feeling that there is a bit more to this!

Either way, this is neat. Kudos to whoever wrote it! It reminds me of those “How many triangles can you see?” puzzles, but an order of magnitude more complex.

Movies of 2014

Last year’s rating system was an attempt to give more justifiable scores, but it was too complicated. Moreover, neither of us are technical experts on some of the things we were judging, so that seemed a tad disingenuous. I was tempted to introduce an even more sophisticated system, but Mrs. Xoph and I had a very busy year and ended up rating everything we saw retrospectively (i.e., just now), using a simple ten point scale!

It works well enough :)


Syntactic Shibboleths

A shibboleth is a word or phrase that can be used to identify a speaker’s background from the way they pronounce it. So, for example, one could vaguely determine which end of the UK someone is from by how they pronounce the vowel in the word “bath”: [æ] would err towards the north, while it’s [ɑ] in the south (east). It’s actually possible to isolate a person’s origin quite precisely by testing a number of shibboleths together and intersecting their respective regions. A kind of linguistic Guess Who, if you like.
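As a toy illustration of that Guess Who idea (the words, pronunciations and regions below are entirely made up for the sake of the example), each shibboleth maps a pronunciation to a set of candidate regions, and intersecting the sets narrows down the speaker’s likely origin:

```python
# A toy sketch: intersect the regions implied by each observed pronunciation.
SHIBBOLETHS = {
    "bath": {"[æ]": {"North", "Midlands"}, "[ɑ]": {"South East"}},
    "scone": {"rhymes with gone": {"North", "South East"},
              "rhymes with bone": {"Midlands"}},
}

def narrow(observations):
    """Return the regions consistent with every observed pronunciation."""
    regions = None
    for word, pronunciation in observations.items():
        candidates = SHIBBOLETHS[word][pronunciation]
        regions = candidates if regions is None else regions & candidates
    return regions

print(narrow({"bath": "[æ]", "scone": "rhymes with gone"}))  # {'North'}
```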

Four years ago, for my MA dissertation (available on request), I conducted an experiment that called for native English speakers. However, because it was disseminated over the Internet, I had no way of certifying this, while still needing a way to improve the signal-to-noise ratio in my data. My innovation was to test subjects’ responses (i.e., grammaticality judgements) to subtly warped syntactic structures which are peculiar to English (or, at least, Germanic languages). Subjects who baulk at the dodgy constructions are more likely to be native speakers, as opposed to those who accept them because they look “about right”.

These are my syntactic shibboleths (actually, “morphosyntactic”, but that breaks the alliteration). The first four have been “battle tested” in research — and proved very useful in sanitising my data — the others (and there are certainly more) seem like viable candidates from my experience. For each, I give three examples of corruption, largely to make native speakers cringe!


London Calling

Fortunately for me, I have now escaped from London. Not so fortunately, I still work there…which is a bit of an epic commute, but I digress!

Anyway, regarding the day job: recently, people have been rather confused about how they can determine a London address. I immediately suggested using the postcode, as postcodes generally follow a regular pattern and, in the capital, the “outward” part corresponds to the compass points. That is:

Outward   Area
E         East
EC        East Central
WC        West Central
W         West
N         North
NW        North West
SE        South East
SW        South West

This can easily be turned into a regular expression that, presuming the postcodes themselves are already validated, can be used to identify London postcodes:
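In Python, a sketch of that check (the exact pattern and the example postcodes are mine, purely illustrative) might look like this:

```python
import re

# A sketch of the inner-London check: the outward part must start with one of
# the eight compass prefixes above, followed by a digit. Longer prefixes come
# first so that, e.g., "EC" isn't swallowed by plain "E".
INNER_LONDON = re.compile(r"^(?:EC|WC|NW|SE|SW|[ENW])\d", re.IGNORECASE)

print(bool(INNER_LONDON.match("SW1A 1AA")))  # True
print(bool(INNER_LONDON.match("CB10 1SA")))  # False
```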

This is fine for Inner London, but as the city slowly absorbs its suburban neighbours, this won’t work for the foetid glory that is its greater metropolitan area. We have to expand into the orbiting postal code regions:

Outward   Area
BR        Bromley
CM        Chelmsford
CR        Croydon
DA        Dartford
EN        Enfield
HA        Harrow
IG        Ilford
KT        Kingston-upon-Thames
RM        Romford
SM        Sutton
TW        Twickenham
UB        Southall
WD        Watford

Now it would be straightforward to put together an alternation group of these thirteen (unlucky for some) codes, but apparently that would overgeneralise, as some of the areas covered are not considered part of Greater London. To solve this problem, I was given a 64MB CSV file of every valid London postcode and let loose!

I wasn’t about to upload over 300,000 postcode records into a database table, so I figured I would proceed down my regular expression route. Fortunately, with a bit of command-line fu, I was able to reduce those records down to 130 unique Greater London outward codes:
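For the curious, the same reduction is only a few lines of Python; the column layout and file name below are assumptions for illustration:

```python
import csv

# A sketch of the reduction: collect the distinct outward codes, assuming the
# full postcode sits in the first column of a (hypothetical) london_postcodes.csv.
outward_codes = set()
with open("london_postcodes.csv", newline="") as handle:
    for row in csv.reader(handle):
        parts = row[0].split() if row else []
        if parts:
            outward_codes.add(parts[0])  # keep just the outward part

print(len(outward_codes))     # 130, in my case
print(sorted(outward_codes))
```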

The 130 outward codes that I obtained represent the parts of the above districts that genuinely fall within Greater London. As postcodes follow quite a simple format, I was able to condense these into 13 regular expressions:

Outward   Area                   Regular Expression
BR        Bromley                ^BR[1-8]
CM        Chelmsford             ^CM1[34]
CR        Croydon                ^CR([02-9]|44|90)
DA        Dartford               ^DA([15-8]|1[4-8])
EN        Enfield                ^EN[1-9]
HA        Harrow                 ^HA\d
IG        Ilford                 ^IG([1-9]|11)
KT        Kingston-upon-Thames   ^KT([1-9]|1[7-9]|22)
RM        Romford                ^RM([1-9]|1[0-5]|50)
SM        Sutton                 ^SM[1-7]
TW        Twickenham             ^TW([1-9]|1[0-59])
UB        Southall               ^UB([1-9]|1[018])
WD        Watford                ^WD([236]|23)

We know that the inward part of a postcode is always of the format \d[A-Z]{2}$ and that it should follow the outward part, separated by a space, although the space is often missed or doubled up. So we can take the alternation group of the above outwards, factor it and include the inward pattern to obtain this beast:
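I won’t pretend the following is the hand-factored original, but as a sketch, the thirteen outward patterns can be combined mechanically with the inward part; named groups even let us classify the district at the same time:

```python
import re

# A sketch: an alternation of the thirteen outward patterns (as named groups),
# followed by zero-to-two spaces and the inward \d[A-Z]{2} part.
OUTWARDS = {
    "BR": r"BR[1-8]",               "CM": r"CM1[34]",
    "CR": r"CR(?:[02-9]|44|90)",    "DA": r"DA(?:[15-8]|1[4-8])",
    "EN": r"EN[1-9]",               "HA": r"HA\d",
    "IG": r"IG(?:[1-9]|11)",        "KT": r"KT(?:[1-9]|1[7-9]|22)",
    "RM": r"RM(?:[1-9]|1[0-5]|50)", "SM": r"SM[1-7]",
    "TW": r"TW(?:[1-9]|1[0-59])",   "UB": r"UB(?:[1-9]|1[018])",
    "WD": r"WD(?:[236]|23)",
}
GREATER_LONDON = re.compile(
    "^(?:" + "|".join(f"(?P<{k}>{v})" for k, v in OUTWARDS.items())
    + r") {0,2}\d[A-Z]{2}$",
    re.IGNORECASE,
)

match = GREATER_LONDON.match("KT22 9XX")
print(match.lastgroup if match else None)  # KT
```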

Now we can both determine and classify London addresses without having to resort to an enormous lookup table :)

Not to gloat, but it’s most satisfying to know one’s tools well. What would have taken my colleagues — even those who claim to be developers — literally days, I was able to do in less than an hour.

This is my blog.
There are many like it, but this one is mine.
My blog, without me, is useless.
Without my blog, I am useless.