Cryptanalysis of the Vigenere cipher

The Vigenere cipher is probably the most commonly known example of a polyalphabetic cipher, attributed (incorrectly) to Blaise de Vigenere in the 16th century.  It has been used and regarded as being "practically unbreakable" as late as World War I, in spite of knowledge of a general solution since at least the mid 1850s.

The Vigenere cipher is interesting not because it has been broken, but because its solution involved, or at least stimulated, the first systematic application of statistical methods to cryptanalysis.  A brief description of these methods is presented, including a Java applet (with source code) which automates the solution of a Vigenere cipher.

Frequency analysis

A monoalphabetic cipher is a simple substitution, as in the cryptograms in a daily newspaper.  Each occurrence of a particular letter is replaced by another throughout the message, so that, for example, plaintext 'a' becomes ciphertext 'K', 'b' becomes 'S', etc.  (The convention of using lowercase for plaintext and uppercase for ciphertext will be used throughout.)  Solution of a monoalphabetic cipher typically begins by guessing the assignments of some ciphertext letters based on their frequency of occurrence in the message.  This is possible due to the non-uniform distribution of letters in language, specifically English (used in the examples throughout).  Since the letter 'e' is by far the most frequently occurring letter in English text (see the chart below), it is reasonable to guess that the most frequently occurring ciphertext letter is assigned to plaintext 'e'.

[Graphics:HTMLFiles/index_1.gif]

A polyalphabetic cipher is more complex in that a particular plaintext letter is not always replaced by the same ciphertext letter.  The Vigenere cipher is a special case of a periodic polyalphabetic cipher, and works as follows.  A message is encrypted using a keyword; imagine writing the keyword repeatedly underneath and alongside the message, and "shifting" each plaintext letter in the alphabet according to the corresponding key letter.  If letters are represented by integers modulo 26, then this encryption scheme is given by c = p + k (mod 26), where c, p and k are letters of the ciphertext, plaintext and key, respectively.  For example:


plaintext   thatwasthecuriousincident
   key      blazeblazeblazeblazeblaze
-------------------------------------
ciphertext  USASABDTGIDFRHSVDIMGJOEMX

Note that plaintext 't' is assigned to several different ciphertext letters ('U', 'S', 'T' and 'X').  The effect is to "flatten" the distribution of ciphertext letter frequencies, making them appear more random; the longer the keyword, the more uniform the resulting distribution tends to be, providing less information for the cryptanalyst.

Index of coincidence

Recall the 5-letter keyword from the previous example.  Each of the 1st, 6th, 11th, etc., letters of the message is encrypted using the same key letter 'b'.  That is, they are encrypted using a simple monoalphabetic substitution, and so the corresponding frequency distribution for those letters should resemble that shown above.  The same applies to the subset of message letters encrypted by each of the other 4 key letters.

More generally, if m is the keyword length, then each of the m blocks of equally-spaced ciphertext letters should have a frequency distribution that is not random, but resembling that for English text.  This suggests that the cryptanalyst may try all possible keyword lengths, selecting the value for which each of the resulting ciphertext blocks is sufficiently non-random.

The key observation is that "sufficiently non-random" is a measurable quality, using a metric called the index of coincidence.  It is defined to be the probability that two randomly selected letters will be identical, but it is effectively a measure of "distance" from the uniform distribution.  For completely random text, this probability is about 0.038.  For English text (or monoalphabetically encrypted English text), it is about 0.065.

In addition to finding the length of the keyword, a variant of the index of coincidence may also be used to determine the actual letters of the keyword.  The mutual index of coincidence for two blocks of text is the probability that two randomly selected letters, one from each block, will be identical.  The utility of this metric is based on the following claim: if two blocks of ciphertext are encrypted with the same key letter, the mutual index of coincidence of the blocks will be significantly greater than if they are encrypted with two different key letters.  By successively "shifting" the letters in one of the blocks (or equivalently, repeatedly encrypting with key 'b'), and noting the shift which results in the greatest mutual index of coincidence, the relationship between the two corresponding key letters may be determined.  By examining all pairs of ciphertext blocks, the key letters may almost be derived directly.

"Almost," because only the relationships between key letters may be derived this way.  Without a known particular key letter, there are still 26 possible cyclic shifts of the actual keyword.  However, simple monoalphabetic frequency analysis on each ciphertext block may "lock" the keyword into one of those possibilities.  For example, if 'F' is by far the most frequently occurring ciphertext letter in a particular block, this suggests that 'F' is assigned to plaintext 'e', or equivalently, that the corresponding key letter is 'b' ('F' "minus" 'e').

Automatic cryptanalysis

Use the Java applet below to experiment with the Vigenere encryption and decryption process.  Paste a selection of plaintext or ciphertext into the text area, enter a keyword in the indicated field, and press the "Encrypt" or "Decrypt" button as appropriate.  Note that spaces, punctuation, etc., will be automatically stripped off.

More interesting is the "Guess key" button.  Paste a selection of Vigenere-encrypted ciphertext into the text area (for convenience, a sample is included below), and press "Guess key."  An estimate of the keyword is given based on the combination of index of coincidence tests and frequency analysis described above.  Press "Decrypt" to see the resulting plaintext.  Note that the program is not perfect, but its accuracy improves with longer ciphertext (see the performance analysis in the next section).  As a guide, a selection of text that fills the entire text area is usually enough.

Error! You must use a Java-enabled browser.

Sample ciphertext

DZLNRFWRNWTDTHPMHOQIBYEWTSPSRMPYWGMDSSGSXPDSLFAONVPAIMMPYWGMDSHDLBOFNV
NPDNJNJCNQQLNHSODAAMMTTXFVEIREXMYSLFTNRTFNTNVTQABIUSASLJDASXFYTHSOSACF
FPNJIFYLXESZURIEJOTGPYSHHFCTGEUEOAIJXPNVULNSLFLSJIEPXBIFOIMKMJSNMTEHDV
FLNXTPTNSXPHHHGIJOTAPFLCAJDHSSECAVQZLTSIOEINRUZTGIDFRHSVDIMGJOEMXPQTGI
EZGHRUSEMMHSTSMNPTGIEZGCMEYOSLJYGHRUSEMMHSTSMNPTGEUHARXIPCTVJZURMONICI
OERDQBCKDHTSEQPPNKGSMXER

Algorithm performance

For a given ciphertext, the algorithm implemented in the Java applet involves several computational steps.  First, the length of the keyword is estimated based on the index of coincidence test.  Then the relationships between keyletters are computed using mutual indices of coincidence.  Finally, a particular cyclic shift of the keyword is selected based on frequency analysis.  At each step, a single choice is made on the path of decisions that leads to a single estimate of the keyword.  This single guess is not always accurate, as the following chart shows.  The probability of successful decryption was estimated by bootstrapping 1000 random keywords and random plaintext selections of the appropriate length from "The Hound of the Baskervilles" (a total of about 250,000 characters).  As should be expected, the cryptanalysis is made more difficult by shorter ciphertext and/or a longer keyword.

[Graphics:HTMLFiles/index_2.gif]

References

Kahn, David.  The Codebreakers.  New York: Scribner, 1996.

Stinson, Douglas.  Cryptography: Theory and Practice.  Boca Raton: CRC Press, 1995.

Download Java source

Eric Farmer
erfarmer201@comcast.net


Created by Mathematica  (July 8, 2003)