The popular literature on “big data” is still fairly sparse. “Uncharted” by Erez Aiden and Jean-Baptiste Michel is a great addition. Although their focus is on applying big data to the humanities, they offer a fresh perspective that is accessible and useful for other fields of research, including healthcare. They touch on a variety of big data methodologies, but the topic that gets by far the most attention is N-gram analysis of text. N-gram analysis is a technique for measuring how often a word or phrase appears as a percentage of the total body of a text or collection of texts. The “N” in N-gram refers to the number of words in the phrase. For example, “Thomas Edison” has an N of 2, “Edison” an N of 1, and “Thomas Alva Edison” an N of 3.
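To make the definition concrete, here is a minimal Python sketch of an N-gram frequency count. It is my own illustration, not how Google builds its data: the function name `ngram_frequencies`, the lowercase whitespace tokenization, and the choice to divide by the number of N-grams of the same length are all assumptions made for the example.

```python
from collections import Counter

def ngram_frequencies(text, n):
    """Return each n-gram's share of all n-grams in the text, as a percent.

    Assumption: words are split on whitespace and lowercased; real tools
    handle punctuation and capitalization far more carefully.
    """
    words = text.lower().split()
    # Slide a window of n words across the text to build every n-gram.
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    counts = Counter(grams)
    return {gram: 100.0 * count / total for gram, count in counts.items()}

# Toy sentence invented for the example.
sample = "Thomas Edison met Thomas Alva Edison and Edison smiled"
unigrams = ngram_frequencies(sample, 1)   # N = 1
bigrams = ngram_frequencies(sample, 2)    # N = 2
print(unigrams["edison"])
print(bigrams["thomas edison"])
```

Running the same function over the text for each publication year, rather than over one sentence, would give the kind of frequency-over-time curve the book describes.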
The exciting development that the authors share is the power of Google Books to enable a wide variety of N-gram analyses. Google has been scanning paper and digital texts into a service called “Google Books”. The volume of text scanned so far surpasses the paper contents of the Library of Congress. Google has generated N-grams from this body of text and released a public tool for querying them, covering books dating back to the 1600s. This depth lets users track changes in attitude and interest across a wide variety of topics. The user interface for Google N-grams is very accessible, but it also provides a wide variety of options that enable considerable flexibility and control over the queries.
“Uncharted” shares a wide variety of fascinating examples, including clear evidence of Nazi censorship and the increasing rate at which modern celebrities become famous compared to historic figures. The authors also show how the wildcard “*” allows comparison of expressions that begin with the same words, for example “University of”, to show how frequently related terms are mentioned relative to one another.
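The wildcard idea can be sketched in a few lines of Python. This is only an illustration of the concept over a small local table, not the Google tool itself: the function name `expand_wildcard` is hypothetical, and the counts below are made-up toy numbers rather than real data.

```python
from collections import Counter

def expand_wildcard(pattern, counts, top=3):
    """Rank the n-grams that match a pattern whose last word is '*'.

    Assumption: the wildcard stands for exactly one trailing word.
    """
    prefix = pattern.lower().split()[:-1]
    matches = {}
    for gram, count in counts.items():
        words = gram.split()
        # Keep n-grams that start with the prefix and add one more word.
        if len(words) == len(prefix) + 1 and words[:len(prefix)] == prefix:
            matches[gram] = count
    # Most frequent first; ties are broken alphabetically.
    return sorted(matches.items(), key=lambda kv: (-kv[1], kv[0]))[:top]

# Toy counts invented for the example.
trigram_counts = Counter({
    "university of california": 5,
    "university of chicago": 3,
    "university of oxford": 2,
    "college of arts": 4,
})
ranked = expand_wildcard("University of *", trigram_counts, top=2)
print(ranked)
```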
As I have experimented with N-grams, I have noticed a few important limitations. First, by its nature, Google Books is populated by text found in books, not journal articles or online media content. This can skew the balance of term usage toward the types of discussions that make it into books, so a news event that is heavily discussed in the media for a brief spike may never be mentioned in books. The second limitation is the time lag: right now Google Books is only current to 2008. In the technology world, a 5-6 year gap can be an eternity. The third limitation is that an N-gram is ultimately a measure of word usage. As such, the accuracy or quality of the word associations should not be over-interpreted. It is an observational tool.
I highly recommend this book. It sparked my imagination and will surely lead you into some interesting explorations. I have dedicated a new #Research blog posting to a few of my healthcare-related N-grams, including an analysis of the shift from “electronic medical record” to “electronic health record” and from “health care” to “healthcare”.