## Distance between Words

Which pair is more different?

• `keyboard | keyb`ard`
• `keyboard | keybpard`
• `keyboard | keebored`

Of course, in mathematics we get to choose among many definitions of size, and there is no “correct” answer, just whatever suits the application.

I can think of two approaches to defining distance measures between words:

• sound-based — `d(Hirzbruch, Hierzebrush) < d(Hirzbruch, Hirabruc)`
• keyboard-based — `d(u,y) < d(u,o)`
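As a rough illustration of the keyboard-based idea, here is a minimal sketch that places each letter on a grid and measures straight-line distance between keys. The layout (QWERTY), the row stagger values, and the names `KEY_POS` and `key_distance` are all assumptions made for this example, not a standard.

```python
import math

# Assumed layout: the three letter rows of a QWERTY keyboard.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
# Hypothetical horizontal offset of each row, in key widths.
STAGGER = [0.0, 0.25, 0.75]

# Map each letter to an (x, y) position on the assumed grid.
KEY_POS = {
    ch: (col + STAGGER[row], float(row))
    for row, letters in enumerate(ROWS)
    for col, ch in enumerate(letters)
}


def key_distance(a: str, b: str) -> float:
    """Euclidean distance between two letter keys on the assumed layout."""
    (x1, y1), (x2, y2) = KEY_POS[a.lower()], KEY_POS[b.lower()]
    return math.hypot(x2 - x1, y2 - y1)


print(key_distance("u", "y"))  # 1.0 (neighbouring keys)
print(key_distance("u", "o"))  # 2.0 (one key in between)
```

A word-level distance could then weight each substituted letter by `key_distance` instead of charging a flat cost of 1, though how to weight insertions and deletions is left open here.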

Reading online fora (including YCombinator, tsk tsk), the only distance functions I hear about are the ones with Wikipedia pages: Hamming distance and Levenshtein distance.

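Both are defined in terms of how many editing operations are required to correct a mis-typed word. Hamming distance counts only the positions at which two equal-length strings differ. Levenshtein distance counts single-character insertions, deletions, and substitutions; the Damerau-Levenshtein variant also counts swapping two adjacent letters. In word-processing terms: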

• How many letters do I need to insert?
• How many letters do I need to delete?
• How many letter-pairs do I need to swap?
• How many `vim` keystrokes do I need?

and so on—those kinds of ideas.
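For concreteness, here is a minimal sketch of both distances in plain Python, using the textbook definitions (the function names are my own). It is applied to the three `keyboard` pairs from the top of the post.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]


for typo in ["keyb`ard", "keybpard", "keebored"]:
    print(typo, hamming("keyboard", typo), levenshtein("keyboard", typo))
# keyb`ard 1 1
# keybpard 1 1
# keebored 3 3
```

Both measures put `` keyb`ard `` and `keybpard` at the same distance from `keyboard`, so neither can say which of the two slips is the more natural one on a real keyboard.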

#### inter-letter interaction effects

If we could get conditional probabilities of various kinds of errors — like

• Am I more likely to mis-type `ous` while writing `varoius`, `precarious`, or `imperious`? There could be some kind of finger- or hand-based reason, like if I’ve just been using right-handed fingers near my `ous` fingers, or that I have to angle my hand weirdly in order to hit the previous couple strokes in some other word?
• Am i more likely to mis-type `reflexive` as `reflexible` when the document topic is gymnastics?
• Am i more likely to make a typo in google if I’m typing fast?
• What if you can catch me mis-placing my hand on the homerow/ `how dp upi apwaus fomd tjos crazu stiff?` That’s almost like just one error. (It’s certainly less distance from the real sentence than a random string of characters of equal length.)
• Or if I click the mouse in the wrong place before correcting my spelling? `d(Norschwanstein, Ndorschwanstein)` or `d(rehabilitation, rehabitatiilon)`
• Am i more likely to isnert a common cliche rather than what i actually mean after a word that begins a common cliche/
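The misplaced-hand case in the list above can be modelled as a single transformation rather than many separate errors. Here is a minimal sketch, assuming a QWERTY layout and a conventional touch-typing split of which keys belong to the right hand; `RIGHT_HAND_ROWS` and `shift_right_hand` are hypothetical names for this example.

```python
# Assumed right-hand keys on QWERTY, left to right within each row.
RIGHT_HAND_ROWS = ["yuiop[", "hjkl;'", "nm,./"]

# Each right-hand key maps to its right-hand neighbour in the same row.
# The last key of each row has no neighbour listed and is left unchanged.
SHIFT_RIGHT = {
    row[i]: row[i + 1]
    for row in RIGHT_HAND_ROWS
    for i in range(len(row) - 1)
}


def shift_right_hand(text: str) -> str:
    """Simulate typing with the right hand resting one key too far right."""
    return "".join(SHIFT_RIGHT.get(ch, ch) for ch in text)


print(shift_right_hand("do you"))                 # dp upi
print(shift_right_hand("find this crazy stuff"))  # fomd tjos crazu stiff
```

This reproduces most of the garbled sentence but not all of it (`how` is typed correctly there, and `apwaus` is not a pure one-key shift), so a distance built on this idea would need to allow the shift to start part-way through and to coexist with ordinary typos.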

#### A Bit Of Forensics

EDIT: Once I got about halfway throguh this article, I stopped correcting my typoes, so you can see the kind that I make. I was typing on a flat keyboard, asymmetrically holding a smallish non-Mac laptop (bigger than an Eee) with my elbows out, head down — except when I type fast and interchange letters, with perfect posture, “playing the piano” with my ten finger muscles rather than moving my wrists — at an ergonomic keyboard with a broken M. I actually don’t recall which way i wrote this article. I may hav eeven written it in shifts.

Here are some nice ones as well. Look at the comments section. By the posting times (and text) you can see that the debate was feverish: no time for corrections, and the correspondents were steamed up emotionally. Their typoes really have personalities. For example, Kien makes a lot of errors with his right middle finger moving up (`did → dud`, `is → us`, `promoted → promotied`, `inquisition → iquisition`, `mean → meaqn`, `Church → Chruch`, `because → becuase`, `Copernican → Ceprican`, `your → you`, `clearly → cleary`) but also some errors of spelling with no sound-distance (`Pythagoras → Pythagorus`) and uses both the sounds `disingenious` and `disingenuous`. Letter-switching, ilke I do, is common; a few fat-fingers (`meaqn`) or forgotten letters, but this `iou` stuff seems unusual and possibly characteristic of something.
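To put rough numbers on a typist's “personality”, one could sort each typo into a simple category and tally them. Below is a minimal sketch using the pairs quoted above, reading each one as intended word → typed word. The four categories and the `classify` function are my own simplification; anything that is not a single substitution, adjacent swap, insertion, or omission falls into `other`.

```python
from collections import Counter

# (intended, typed) pairs taken from the examples quoted in the text.
PAIRS = [
    ("did", "dud"), ("is", "us"), ("promoted", "promotied"),
    ("inquisition", "iquisition"), ("mean", "meaqn"), ("Church", "Chruch"),
    ("because", "becuase"), ("Copernican", "Ceprican"), ("your", "you"),
    ("clearly", "cleary"), ("Pythagoras", "Pythagorus"),
]


def classify(intended: str, typed: str) -> str:
    """Label a typo as one single-step error type, else 'other'."""
    if len(intended) == len(typed):
        diffs = [i for i, (x, y) in enumerate(zip(intended, typed)) if x != y]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and intended[diffs[0]] == typed[diffs[1]]
                and intended[diffs[1]] == typed[diffs[0]]):
            return "transposition"
    elif len(typed) == len(intended) + 1:
        # Dropping one typed character recovers the intended word.
        if any(typed[:i] + typed[i + 1:] == intended for i in range(len(typed))):
            return "insertion"
    elif len(intended) == len(typed) + 1:
        # Dropping one intended character gives what was typed.
        if any(intended[:i] + intended[i + 1:] == typed
               for i in range(len(intended))):
            return "omission"
    return "other"


counts = Counter(classify(intended, typed) for intended, typed in PAIRS)
print(counts["substitution"], counts["insertion"], counts["omission"],
      counts["transposition"], counts["other"])
# 3 2 3 2 1
```

Eleven pairs are far too few to conclude anything, but the same tally over a larger sample per writer would make the comparison between correspondents less impressionistic.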

Other participants make different sorts of errors, or at least with different frequencies (they’re relatively more likely to omit or switch letters than to use the wrong letter, for example). But let’s just focus on Ken because so many errors of the typoes are localised to that right middle finger. I wonder if Ken has a problem with that finger? Or maybe his keyboard is shaped in such a way that it’s difficult to correctly strike those keys specifically? (Maybe certain ergonomic keyboards would fit this — or an Eee Pc with the elbows out and “pigeon-toed” hands. But why would the errors then be localised to the right middle finger? It’s more mobile than pinky & ring fingers and we’re not taught to stick it to the homerow like the index finger.) I rule out the theory that his right hand hovers above the keyboard rather than sitting on the homerow because then he should make similar errors with `yuiop` and maybe `bnm,.hjkl;` as well. Also, notice that he doesn’t make comparable errors with `ewr` as with `iou`. How do we know he sits symmetrically? I have a tough time deciphering why there are more errors with that finger on a first read-through.

We could find more of Ken’s writing here and see how he types when he’s less agitated. I bet there are no `Ceprican`’s there but `Pythagorus` would still be. As for `Chruch`? Hmmm. Don’t know.

#### Big Data vs Models

Now the big-data-ists (the other half of Leo Breiman’s partition of statistical modellers -vs- data miners) would probably say “Google has a jillion search results including measurements of people correcting themselves and including time series of the letters people type — so just throw some naive Bayes at that pile and watch it come to the correct answer!” Maybe they’re right.

If someone wants to mess around with this stuff with me — leave me a comment. We could grab tweets and analyse typoes within differnet text-…[by which tool] was used to send the tweet. For example the Twitter website means it was keyboard-typed, certain mobile devices have Swype, other errors we might be able to guess tha tis …[that it’s] a T9 mobile keyboard.

• Could we tell if a person is left-handed by their keyboard mistkaes?
• Could we guess their education level/
• Could we tell what tweeting platform they used by their errors rather than by
• Could we tell where they’re from? Or any other stalky information that advertisers/HR want to know but web browsers want to hide about themselves? (Say goodbye to mandatory drug testing in the workplace, say hello to your boss getting an email when a statistics company that monitors your twitter feed guesses you smoked pot last night based on the spelling and timing of your Facebook posts.)