For a few years now I’ve had a website listing the longest words in English, so when I saw Google Ngram, I thought it could be fun to poke around with.
The data is based on the scans by the Google Books project, covering roughly the last 200 years and 500,000 published books, and contains the frequency of words and phrases, broken down by year.
What Google don’t seem to mention is which phrases are the most common, so I decided to work them out. I wrote some ugly but workable scripts, downloaded the data to a Rackspace cloud server, and added up the results, limited to two-word phrases (or word grams).
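The adding-up step can be sketched roughly as below. This is only an illustration, not the scripts mentioned above. It assumes each line of the 2-gram files is tab-separated with the phrase, the year, and the match count as the first three columns, and that phrases are lowercased before counting. The function name `total_counts` and the sample rows are made up for the example.

```python
import io
from collections import Counter


def total_counts(lines):
    """Sum the match counts over all years for each phrase.

    Assumed line layout (tab-separated): phrase, year, match_count, ...
    Any columns after the third are ignored.
    """
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        phrase, match_count = fields[0], fields[2]
        # Lowercasing is an assumption, so "New York" and "new york" merge.
        totals[phrase.lower()] += int(match_count)
    return totals


# Made-up sample rows, standing in for one of the downloaded files.
sample = io.StringIO(
    "of the\t1999\t120\t100\t50\n"
    "of the\t2000\t130\t110\t55\n"
    "in the\t1999\t60\t50\t40\n"
    "New York\t2000\t7\t7\t5\n"
)

totals = total_counts(sample)
for phrase, count in totals.most_common(3):
    print(phrase, count)
```

For the real files, the same function could be fed an open file handle one file at a time, with the resulting counters added together.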
From the results, I’ve found the most common two-word combinations used in English are:
Most common 2 part Word Grams
of the
in the
to the
and the
on the
See the graph output of the phrases. As you can see, “of the” appears more than twice as often as the next most common, “in the”.
Kind of boring, right? So I looked further down for the most common phrases that weren’t just pairs of very short words, and found these inside the top 500:
united states
new york
See the graph output
So there you have it: it seems that the United States is the most written about thing ever, closely followed by New York!
If you’re interested in how I did the technical bits and pieces, let me know and I’ll tidy up the scripts a bit and upload them to GitHub. Overall, the script took around 12 hours to download the 25 GB of files, uncompress them, and compile the raw data into something quicker to query.
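As a rough idea of what “something quicker to query” could look like, here is a minimal sketch that loads per-phrase totals into SQLite and pulls out the top entries. SQLite is an assumption chosen for illustration, not necessarily what the scripts above used. The table name `bigrams`, the helper names, and the sample numbers are all hypothetical.

```python
import sqlite3


def build_index(totals, path=":memory:"):
    """Store {phrase: total} pairs in a small SQLite table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bigrams ("
        "phrase TEXT PRIMARY KEY, total INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO bigrams (phrase, total) VALUES (?, ?)",
        totals.items(),
    )
    # An index on the total keeps "top N" lookups fast on a large table.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_bigrams_total ON bigrams (total DESC)"
    )
    conn.commit()
    return conn


def top_phrases(conn, limit=5):
    """Return the most frequent phrases as (phrase, total) tuples."""
    cursor = conn.execute(
        "SELECT phrase, total FROM bigrams ORDER BY total DESC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()


# Made-up totals, only to show the shape of the data.
example_totals = {"of the": 250, "in the": 60, "new york": 7}
conn = build_index(example_totals)
top_two = top_phrases(conn, limit=2)
print(top_two)
```

Passing a file path instead of `":memory:"` would keep the table on disk, so the slow download and counting work only has to happen once.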