The Dancing Men

In Arthur Conan Doyle's "The Adventure of the Dancing Men," Doctor Watson was once again amazed at his companion's penetrating insights, this time into the cryptic messages hidden in the dancing men, seemingly a child's scrawl of dancing stick figures.

When first presented with the mystery, Holmes could do nothing---a short encrypted message could mean anything: "These hieroglyphics have evidently a meaning. If it is a purely arbitrary one, it may be impossible for us to solve it. If, on the other hand, it is systematic, I have no doubt that we shall get to the bottom of it." Once presented with a few more secret messages, however, he rapidly broke the encryption, for the conspirators had foolishly used a simple encryption scheme. At the risk of spoiling the story of the dancing men, I'll give Holmes's description of how he decrypted the messages:

"The first message submitted to me was so short that it was impossible for me to do more than say, with some confidence, that the symbol stood for E."

Clever Holmes had seen, just as Young had with the Rosetta Stone, that changing each letter (or word) into another symbol may make the message look odd, but it doesn't change each letter's frequency. Moreover, like Scrabble players today, he knew that e is the most common letter in English. So he simply found the most frequent dancing man and assumed that it stood for e.
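Holmes's first step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scrambled alphabet, the sample sentence, and the function names are all made up here, and ordinary letters stand in for the dancing men.

```python
from collections import Counter

# Hypothetical substitution key, invented for this example:
# each plain letter is replaced by the letter in the same position below.
PLAIN = "abcdefghijklmnopqrstuvwxyz"
SCRAMBLED = "qwertyuiopasdfghjklzxcvbnm"


def encrypt(message):
    """Replace each letter using the made-up key; other characters pass through."""
    return message.lower().translate(str.maketrans(PLAIN, SCRAMBLED))


def most_frequent_symbol(ciphertext):
    """Return the symbol that occurs most often, ignoring spaces and punctuation."""
    counts = Counter(ch for ch in ciphertext if ch.isalpha())
    return counts.most_common(1)[0][0]


# A made-up sample sentence, chosen because it is rich in e's.
secret = encrypt("never settle here, meet me there")
guess = most_frequent_symbol(secret)  # the symbol we would assume stands for e
```

The guess works here only because the sample sentence is full of e's. On a message as short as Holmes's first one, the most frequent symbol could easily be something else, which is why he claimed no more than "some confidence."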

"Now, in the single word I have already got the two E's coming second and fourth in a word of five letters. It might be 'sever,' or 'lever,' or 'never.' There can be no question that the latter as a reply is most probable.... Accepting it as correct, we are now able to say that the symbols stand respectively for N, V, and R."

And so, by bits and pieces, Holmes broke the encryption and unmasked the clueless conspirators.

Everyone keeps secrets. Over two thousand years ago, for example, Julius Caesar encrypted messages to his generals far afield by cyclically mapping letters to the third letter on in the alphabet: a became d, b became e, and so on, and z became c. Thus, the message 'attack at dawn' would become 'dwwdfn dw gdzq.'
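Caesar's scheme is simple enough to write out in full. The following is a minimal sketch, assuming lowercase English letters only. The function names are chosen here for illustration, and anything that is not a letter (such as a space) is left unchanged.

```python
import string


def caesar_encrypt(message, shift=3):
    """Cyclically shift each letter `shift` places along the alphabet."""
    alphabet = string.ascii_lowercase
    shift %= len(alphabet)  # keeps negative or large shifts in range
    shifted = alphabet[shift:] + alphabet[:shift]
    return message.lower().translate(str.maketrans(alphabet, shifted))


def caesar_decrypt(ciphertext, shift=3):
    """Undo the shift by moving each letter the same distance backward."""
    return caesar_encrypt(ciphertext, -shift)


secret = caesar_encrypt("attack at dawn")
recovered = caesar_decrypt(secret)
```

With a shift of three, 'attack at dawn' comes out as 'dwwdfn dw gdzq', and decrypting returns the original. Since there are only twenty-five useful shifts, anyone who suspects the scheme can simply try them all.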

Over fifteen hundred years later, Mary, Queen of Scots, used such an encryption to plot with Spain the assassination of her cousin, Queen Elizabeth I. Sadly for Mary, Elizabeth's secret agents, who had cleverly instigated the conspiracy in the first place, used a statistical analysis to quickly break the encryption. So at eight in the morning of Wednesday, February 8, 1587, Mary lost her head.

Governments around the world took note and never again used anything as simple as a Caesar encryption. But while the new schemes they came up with were more complex, they still only substituted and rearranged letters and other symbols. Nobody could think of anything better.

So although practical secrecy advanced over the centuries---particularly during the fury of technological development we now call the Second World War---modern secrecy really only started in 1948 when Claude Shannon, a brilliant researcher at AT&T Bell Laboratories, put his finger on the real problem. Shannon saw that roughly seven in every ten letters in a long English message are redundant. For example, the three letters e, t, and a alone account for well over a quarter of all letters used in English. If English were not redundant, all letters would occur with equal frequency.

Similarly, just nine words (the, of, and, to, it, you, be, have, and will) account for a quarter of all words used. It says something about us that I, me, my, and mine didn't make the top nine, but you did.

Because all words and letters aren't equally likely, each letter of a word, or each word of a sentence, builds context for the next one, thereby reducing the choices. For this reason, the first few letters of a word (the first few words of a sentence, the first few sentences of a book) are usually more important (that is, less redundant) than later ones.

For example, in English 'q' is always followed by 'u'. Also, if a word starts with 'th', the next letter is almost surely a vowel, the pseudovowel 'y', or an 'r'. If a word starts with 'the', the chances are almost one in two that the next letter is 'r'. And if a word starts with 'ther', the next letter is almost surely a vowel or 'm'. Similarly, if we hear the word 'the', we usually expect the next word to be a noun, or the beginning of a noun phrase. If we hear the phrase 'the cat ate the', we expect the next word to be 'mouse'. Of course, the next word could just as well be 'television'. The more previous context there is, the fewer the possible continuations---and the greater our surprise when the next word isn't what we expected.

Languages are so redundant because redundancy helps us understand garbled communication: Lik* al* hu*an *ang*age*, E*gli*h i* ex*rem*ly *edu*dan*. But that massive redundancy makes long encrypted messages, even after massive rearrangements and replacements, pretty easy to break by statistical analysis.

Of course, even a simple encryption can take time to break; for dramatic reasons, Doyle shortened Holmes's task considerably. Besides the three possibilities he identified, the word he interpreted as never could have been aedes, bedew, beget, beret, or scores of others. Today, of course, we can find all such possibilities in millionths of a second with a computer.
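To see how such a search might go, here is a rough sketch in Python. The tiny word list is a stand-in assumed for the example (a real search would load a full dictionary), and the `?` wildcard notation and the function name are likewise invented here.

```python
import re


def matching_words(pattern, words):
    """Return the words that fit a pattern in which '?' marks an unknown letter."""
    regex = re.compile(pattern.replace("?", "[a-z]"))
    return [word for word in words if regex.fullmatch(word)]


# A tiny stand-in word list; a real search would use a full dictionary.
WORDS = ["never", "sever", "lever", "aedes", "bedew",
         "beget", "beret", "seven", "holmes", "dance"]

# A five-letter word with e in the second and fourth positions.
candidates = matching_words("?e?e?", WORDS)
```

This is a simplification: it lets an unknown position match any letter, whereas in a true substitution cipher the unknown symbols could not stand for e, and repeated symbols would have to stand for repeated letters. Adding those constraints would shrink the list of candidates further.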

Secret writers took all those lessons to heart but didn't know what to do about them. The only answer seemed to be to use their computers to pile on more and more rearrangements and replacements in greater and greater profusion, hoping that the secrecy breakers' computers weren't fast enough to keep up. But given the way computers were improving, they knew that was a losing proposition.

One If by Land, Two If by Sea