Categorical Data: How to analyze word frequencies

To analyze word appearance frequencies in Keshif, follow the steps below. We'll be using a list of quotes from the Star Wars movies as an example, and you can find the dataset

1) Split the text into words at the white-space between them


Note: In the example above, we converted all letters to lower case and removed all punctuation marks from the dataset. To detect simple variations of the same word, such as "Peace" and "peace", update your data source to use only lower-case letters, and remove punctuation marks and numbers as well if appropriate.
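If you prepare the data source with a script before loading it, the clean-up described in the note can be sketched as below. This is a minimal sketch that runs outside Keshif; the function name tokenize is hypothetical, and the sample sentence is only an illustration, not a row from the dataset.

```python
import string

def tokenize(line: str) -> list[str]:
    """Lower-case a line, drop punctuation and digits, then split on white-space."""
    # Translation table that deletes ASCII punctuation marks and digits.
    # Assumption: curly quotes and other non-ASCII marks are not handled here.
    drop = str.maketrans("", "", string.punctuation + string.digits)
    return line.lower().translate(drop).split()

# Illustrative input only.
print(tokenize("May the Force be with you."))
# ['may', 'the', 'force', 'be', 'with', 'you']
```

Note that deleting apostrophes merges contractions ("don't" becomes "dont"); adjust the set of removed characters if that matters for your text.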

2) Remove the stop words

When you split the original text of speech or documents into words, the result includes many common words that contribute little to the analysis, such as "a", "the", and "you". This is also the case for our dataset, where "you", "the", "i", "to", and "a" are the 5 most common words.


Below, we manually edited the dataset to remove these common words from the text. "Luke" is now the most common word used in the original trilogy! By highlighting "Galactic Empire", we can see that "now" and "yes" are used proportionately more by the characters affiliated with the Empire.

Keshif does not detect stop words (common words that carry little information), nor does it parse word structure to detect words that share a root. You should manually remove the most common stop words from the source dataset. If you need stop-word support inside the tool for analyzing word frequencies, please contact us for feature development.
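Because stop words have to be removed from the source data, a small script can do the manual edit for you. The following is a minimal sketch, assuming the text is already lower-case and free of punctuation. The stop-word list contains only the five words named above and should be extended for real use; the names STOP_WORDS and remove_stop_words are hypothetical.

```python
# Illustrative stop-word list: only the five most common words named above.
STOP_WORDS = frozenset({"you", "the", "i", "to", "a"})

def remove_stop_words(line: str, stop_words=STOP_WORDS) -> str:
    """Return the line with every stop word dropped.

    Assumes the line is already lower-case and punctuation-free,
    so a plain white-space split yields the words.
    """
    return " ".join(word for word in line.split() if word not in stop_words)

# Illustrative input only, not rows from the dataset.
lines = ["i have a bad feeling about this", "may the force be with you"]
cleaned = [remove_stop_words(line) for line in lines]
```

Writing the cleaned lines back to the data source keeps the stop words out of the word chart without any change inside the tool.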

3) Filter out words with low frequencies

When you visualize the words, it is likely that you'll end up with a lot of words appearing only once. A high number of categories will slow down the interactivity of your dashboard, so you may want to limit the number of words in this chart.

In the screenshot below, there are 2798 words detected, and over 2200 of them appear at most 5 times.

By using the "remove uncommon" option in the configuration panel, and specifying "5" as the minimum number of lines that a word needs to appear in, we reduce the list to 511 words!
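To preview how many words would survive a given threshold before configuring the dashboard, the same rule can be approximated in a script. This is a sketch under assumptions, not Keshif's implementation: it counts the number of lines each word appears in (not total occurrences) and keeps words that meet a minimum. The names line_frequencies, keep_common, and min_lines are hypothetical.

```python
from collections import Counter

def line_frequencies(lines):
    """Count, for each word, how many lines it appears in."""
    counts = Counter()
    for line in lines:
        # set() ensures a word repeated within one line is counted once.
        counts.update(set(line.split()))
    return counts

def keep_common(lines, min_lines=5):
    """Return {word: line_count} for words appearing in at least min_lines lines."""
    return {
        word: count
        for word, count in line_frequencies(lines).items()
        if count >= min_lines
    }

# Illustrative input only; a low threshold is used because the sample is tiny.
sample = ["luke luke vader", "luke", "vader lord"]
common = keep_common(sample, min_lines=2)
```

Comparing len(line_frequencies(lines)) with len(keep_common(lines)) gives a before-and-after word count for the chosen threshold.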

4) Analyze which words frequently appear together

With Keshif, you can even analyze which words appear together! Below, by extending the categorical bar chart to its set matrix, we can see which words appeared together more often (by their circles), and even select and filter word pairs. For example, "lord" and "vader" are among the most frequent word pairs, and one of these quotes also includes the word "power".
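The idea behind the pair counts can be illustrated with a short script. This is a minimal sketch of counting word co-occurrence per line and does not reflect how Keshif computes its set matrix; the function name pair_counts is hypothetical and the sample lines are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_counts(lines):
    """Count how many lines each unordered word pair appears in together."""
    counts = Counter()
    for line in lines:
        # Sort the unique words so each pair has one canonical order,
        # e.g. always ("lord", "vader") and never ("vader", "lord").
        words = sorted(set(line.split()))
        counts.update(combinations(words, 2))
    return counts

# Illustrative input only, not rows from the dataset.
sample = ["lord vader power", "lord vader", "vader luke"]
pairs = pair_counts(sample)
print(pairs.most_common(1))
# [(('lord', 'vader'), 2)]
```

The number of possible pairs grows quadratically with the number of distinct words, so this kind of analysis benefits from removing stop words and uncommon words first.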

For details on our multi-categorical analysis features, please see  https://help.keshif.me/article/54-analyzing-multi-categorical-data


Discussions about Word Clouds

While word clouds can be pleasing to the eye at first look, they are inefficient for the primary goal of Keshif: accurate analysis of information. ("Our experiments seem to suggest you will be doing just fine with simple lists", from research by Enrico Bertini's team.) Therefore, Keshif chooses not to support word cloud charts. When needed, you can create static word clouds of a specific text corpus using EdWordle.
