Syntactic Shibboleths

A shibboleth is a word or phrase whose pronunciation can be used to identify a speaker’s background. So, for example, one could roughly determine which end of the UK someone is from by how they pronounce the vowel in the word “bath”: [æ] would point towards the north, while it’s [ɑ] in the south (east). It’s actually possible to isolate a person’s origin quite precisely by testing a number of shibboleths together, each with its own region, and intersecting the results. A kind of linguistic Guess Who, if you like.

Four years ago, for my MA dissertation (available on request), I conducted an experiment that called for native English speakers. However, because it was disseminated over the Internet, I had no way of verifying this, yet I still needed a way to improve the signal-to-noise ratio in my data. My innovation was to test subjects’ responses (i.e., grammaticality judgements) on subtly warped syntactic structures which are peculiar to English (or, at least, Germanic languages). Subjects who baulk at the dodgy constructions are more likely to be native speakers than those who accept them because they look “about right”.

These are my syntactic shibboleths (actually, “morphosyntactic”, but that breaks the alliteration). The first four have been “battle tested” in research — and proved very useful in sanitising my data — the others (and there are certainly more) seem like viable candidates from my experience. For each, I give three examples of corruption, largely to make native speakers cringe!

London Calling

Fortunately for me, I have now escaped from London. Not so fortunately, I still work there…which is a bit of an epic commute, but I digress!

Anyway, regarding the day job, recently people have been rather confused about how they can determine a London address. I immediately suggested using the postcode, as postcodes generally follow a regular pattern and, in the capital, the “outward” part corresponds to the compass points. That is:

Outward Area
E East
EC East Central
WC West Central
W West
N North
NW North West
SE South East
SW South West

This can easily be turned into a regular expression that identifies London postcodes, presuming the postcodes have already been validated:
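The idea can be sketched in Python as follows. This is only an illustration, not necessarily the expression used at the time: it assumes an Inner London outward code is one of the eight area prefixes in the table followed by at least one digit, and the names INNER_LONDON and is_inner_london are made up for the example.

```python
import re

# Assumption: an Inner London outward code is one of the eight compass
# prefixes from the table, immediately followed by a district digit.
# Requiring the digit stops "E" matching EN (Enfield), "N" matching
# NE (Newcastle), "W" matching WD (Watford), and so on.
INNER_LONDON = re.compile(r"^(?:EC|E|WC|W|NW|N|SE|SW)\d")


def is_inner_london(postcode: str) -> bool:
    """Return True if an already-validated postcode looks like Inner London."""
    return INNER_LONDON.match(postcode.strip().upper()) is not None
```

For instance, is_inner_london("SW19 1AA") holds, while is_inner_london("EN1 1AA") does not, because the character after the "E" is a letter rather than a digit.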

This is fine for Inner London, but as the city slowly absorbs its suburban neighbours, this won’t work for the foetid glory that is its greater metropolitan area. We have to expand into the orbiting postal code regions:

Outward Area
BR Bromley
CM Chelmsford
CR Croydon
DA Dartford
EN Enfield
HA Harrow
IG Ilford
KT Kingston-upon-Thames
RM Romford
SM Sutton
TW Twickenham
UB Southall
WD Watford

Now it would be straightforward to put together an alternation group of these thirteen (unlucky for some) codes, but apparently that would overgeneralise as some areas covered are not considered within Greater London. To solve this problem, I was given a 64MB CSV file of every valid London postcode and let loose!

I wasn’t about to upload over 300,000 postcode records into a database table, so I figured I would carry on down my regular expression route. Fortunately, with a bit of command-line fu, I was able to reduce those records down to 130 unique Greater London outward postcodes:
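That reduction might look something like this in Python rather than at the shell. It is a sketch under assumptions: the postcode is taken to be in the first CSV column, only the thirteen area prefixes above are kept, and unique_outwards is a hypothetical helper name.

```python
import csv
import re
from typing import Iterable, Set

# The thirteen orbiting postcode areas from the table above.
AREAS = {"BR", "CM", "CR", "DA", "EN", "HA", "IG",
         "KT", "RM", "SM", "TW", "UB", "WD"}

# Inward part (digit plus two letters) with any preceding whitespace.
INWARD = re.compile(r"\s*\d[A-Z]{2}$")
# Leading one or two letters of the outward part.
AREA_PREFIX = re.compile(r"[A-Z]{1,2}")


def unique_outwards(lines: Iterable[str], column: int = 0) -> Set[str]:
    """Collect the distinct outward codes found in CSV lines.

    Assumption: the postcode sits in `column` of each row; rows that do not
    end in an inward part (headers, blanks) are skipped.
    """
    outwards: Set[str] = set()
    for row in csv.reader(lines):
        if len(row) <= column:
            continue
        postcode = row[column].strip().upper()
        outward, stripped = INWARD.subn("", postcode)
        if not stripped:
            continue
        area = AREA_PREFIX.match(outward)
        if area and area.group() in AREAS:
            outwards.add(outward)
    return outwards
```

Because csv.reader accepts any iterable of strings, an open file object can be passed straight in, so the whole file never has to be held in memory.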

The 130 outwards that I obtained are the ones that are actually valid Greater London postcodes for the above districts. As postcodes follow quite a simple format, I was able to condense these into 13 regular expressions:

Outward Area Regular Expression
BR Bromley ^BR[1-8]
CM Chelmsford ^CM1[34]
CR Croydon ^CR([02-9]|44|90)
DA Dartford ^DA([15-8]|1[4-8])
EN Enfield ^EN[1-9]
HA Harrow ^HA\d
IG Ilford ^IG([1-9]|11)
KT Kingston-upon-Thames ^KT([1-9]|1[7-9]|22)
RM Romford ^RM([1-9]|1[0-5]|50)
SM Sutton ^SM[1-7]
TW Twickenham ^TW([1-9]|1[0-59])
UB Southall ^UB([1-9]|1[018])
WD Watford ^WD([236]|23)

We know that the inward part of a postcode is always of the format \d[A-Z]{2}$ and it should be separated from the outward part by a space, although this is often missed or doubled up. So we can take the alternation group of the above outwards, factor it, and append the inward pattern to obtain this beast:
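A hedged sketch of how such an expression could be assembled in Python from the table above. The district patterns are copied from the table with the area letters factored out; the names OUTWARDS, GREATER_LONDON and classify are assumptions for illustration, and the result is not claimed to be the exact expression originally used.

```python
import re
from typing import Optional

# Area letters -> (area name, district pattern), taken from the table above.
OUTWARDS = {
    "BR": ("Bromley", r"[1-8]"),
    "CM": ("Chelmsford", r"1[34]"),
    "CR": ("Croydon", r"(?:[02-9]|44|90)"),
    "DA": ("Dartford", r"(?:[15-8]|1[4-8])"),
    "EN": ("Enfield", r"[1-9]"),
    "HA": ("Harrow", r"\d"),
    "IG": ("Ilford", r"(?:[1-9]|11)"),
    "KT": ("Kingston-upon-Thames", r"(?:[1-9]|1[7-9]|22)"),
    "RM": ("Romford", r"(?:[1-9]|1[0-5]|50)"),
    "SM": ("Sutton", r"[1-7]"),
    "TW": ("Twickenham", r"(?:[1-9]|1[0-59])"),
    "UB": ("Southall", r"(?:[1-9]|1[018])"),
    "WD": ("Watford", r"(?:[236]|23)"),
}

# One alternation of all outwards, then the inward part. \s* tolerates a
# missing or doubled-up space between the two halves.
GREATER_LONDON = re.compile(
    "^(?:"
    + "|".join(area + districts for area, (_, districts) in OUTWARDS.items())
    + r")\s*\d[A-Z]{2}$"
)


def classify(postcode: str) -> Optional[str]:
    """Return the area name for an outer Greater London postcode, else None."""
    candidate = postcode.strip().upper()
    if GREATER_LONDON.match(candidate) is None:
        return None
    # Every area in the table has a distinct two-letter prefix.
    return OUTWARDS[candidate[:2]][0]
```

Anchoring both ends matters here: with the inward pattern pinned to the end of the string, the regex engine backtracks from a one-digit district such as CR4 to the two-digit CR44 when the remainder would not otherwise fit.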

Now we can both determine and classify London addresses without having to resort to an enormous lookup table :)

Not to gloat, but it’s most satisfying to know one’s tools well. What would have taken my colleagues (even those who claim to be developers) literally days, I was able to do in less than an hour.

The Beast of Wimbledon

About three weeks ago, in the dead of a cold and ironically hackneyed foggy night, my wife and I were stirred from our bed by a scream. An ear-piercing series of short bursts of terror that chilled us to the bone. The sound of a woman — albeit somewhat caricatured — in immediate peril. We lay there in confused silence, our duvet huddled around to protect us from whatever horror our imaginations concocted outside our bedroom window.

The screams died away oddly. A few minutes passed. Timidly, I got up from bed and peered through a crack in the curtain, careful not to cause too much movement that would compromise my surveillance. Nothing. Just a cold, misty night. Other than the screams, there had been no sign of a struggle. No hasty footsteps. No clattering of discarded weapons. No thuds of lifeless bodies. Just the terrible screams and terrible silence.

Convinced that it was some poor soul about to meet their maker, I called the police. I had no evidence (there was no body), but it sounded so much like the death throes of my fellow man, to my ear, that I was compelled to play it safe. Some fifteen minutes later, the flashing blue lights of a police car shone through the chinks between our curtains, dancing on our bedroom ceiling. We heard the officers’ deliberate footsteps outside, while their radios buzzed and crackled static. Nothing. They pulled away, leaving us trying to sleep.

My wife wasn’t convinced that it was the screams of a person. Given the obvious lack of foul play, I was inclined to agree with her, but what was it? An urban fox? I searched for videos, to compare the noises they make with that now etched into my psyche. They bark and howl like dogs; admittedly, at a higher pitch, but nothing like what we heard that night. Nor owls. Nor cats.

Last night, I heard it again. My wife slept soundly, while that same petrifying series of womanly screeches resonated through my slumber. They lasted longer than the first time, becoming distinctly less human-like as they progressed. An unholy retching that, again, petered out into the night, leaving no trace. I eventually fell back to sleep.

I’m rational enough to weigh up the evidence. There is none that impinges on my or my wife’s safety. However, the mystery remains to disturb my sanity: What could it be? A feral womble? The spirit of the departed, perhaps brutally murdered and forced to haunt SW19 until her death is avenged?

The stuff local folklore is made of.

The Royal We

Many languages, particularly those of the Pacific islands and of Native Americans, make a morphological distinction in the first person plural. They have an inclusive form, which includes both the speaker (and possibly the speaker’s concerns) and the listener, and an exclusive form, which only relates to the speaker and, necessarily, their concerns, but doesn’t include the listener. It occurs to me that, while morphologically unmarked in the verb conjugation, we make a similar distinction in English and use it in exactly the same way.

In the Event of Thermonuclear War

Risk Assessment

Institutionalised education in the arts will finally be ousted as frivolity by a skirmish on such a scale that the survivors will envy the dead. From the cold ashes of the ensuing nuclear winter, people will learn to create of their own volition, without the trappings of self-indulgence and pretence. Art for art’s sake while sipping on rare earth mineral tea is yesterday’s black. Systems that survived the electromagnetic saturation will succumb to ritual sacrifice (to serve as a warning unto others) as what is left of our species attempts to rekindle its humanity.

Status: Untested; Won’t fix

