Text Mining on Twitter

Khristian Kotov

Introduction

The Twitter platform has gained millions of users from all around the world and generates an endless stream of data. Although Twitter is also used routinely by the PR departments of corporations, universities, and heads of state to propagate their agendas to the public, the majority of tweets still come from regular users and can potentially serve as a probe of the "public mood" on a variety of subjects.

A simple Shiny application allows one to perform basic text mining operations on a sample of recent tweets coming from various places. Below I'll give you an overview of how to use this application. If you are not familiar with the Term Frequency - Inverse Document Frequency (TF-IDF) method, think of it as a reweighting technique that ranks down "low information" words like "the", "and", or "have" and ranks up infrequent, possibly informative words.
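To make the reweighting concrete, here is a minimal sketch of TF-IDF in plain Python (the application itself is built with R/Shiny; this toy corpus and the implementation details are illustrative only, not the app's actual code):

```python
import math

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n_docs = len(docs)
    # document frequency: in how many documents each term appears
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        # term frequency within the document, times log inverse document frequency
        tf = {t: doc.count(t) / len(doc) for t in set(doc)}
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    "the cat and the dog".split(),
    "the dog and the vodka".split(),
    "the cat likes nutella".split(),
]
w = tf_idf(docs)
```

Since "the" appears in every document, its inverse document frequency is log(3/3) = 0, so its weight vanishes, while a rare term like "vodka" keeps a positive weight.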

Unfortunately, I do not have a Shiny server of my own. The application data is updated every time you open the link above, which results in a noticeable delay of up to half a minute; please bear with it.

Daily activity

The first two tabs show the stacked and decomposed daily activity of Twitter users:


High Weight Words


A list of the 100 highest-weight TF-IDF terms is displayed on the High Weight Words sub-tab.

Document frequency is the number of documents (tweets) containing the term, divided by the total number of documents.

Associations

One can type a term and query its associations (TF-IDF weighted co-occurrence) along these lines:


Although the term 'vodka' also appears to be associated with the Russian president, it follows the term 'nutella', which, to my taste, makes associations at this threshold not very credible.
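One common way to score such associations is to correlate a query term's column of the weighted document-term matrix with every other term's column. The sketch below (again illustrative Python, not the app's R code; the toy matrix and threshold are assumptions) ranks terms by Pearson correlation with the query term:

```python
import math

def associations(dtm, terms, query, threshold=0.0):
    """Rank terms by Pearson correlation of their weighted columns
    with the query term's column across documents (rows of dtm)."""
    qi = terms.index(query)
    q = [row[qi] for row in dtm]

    def corr(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = math.sqrt(sum((a - mx) ** 2 for a in x))
        sy = math.sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy) if sx and sy else 0.0

    scores = {t: corr([row[j] for row in dtm], q)
              for j, t in enumerate(terms) if t != query}
    return sorted(((t, s) for t, s in scores.items() if s >= threshold),
                  key=lambda p: -p[1])

# toy weighted document-term matrix: rows are tweets, columns are terms
terms = ["vodka", "nutella", "president"]
dtm = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
top = associations(dtm, terms, "vodka")
```

Here 'nutella' co-occurs with 'vodka' in every document, so it comes out as the top (and, with a threshold of zero, the only) association.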

Summary

The sole purpose of this exercise was to practice text mining with the TF-IDF method. I wanted to group tweets on similar subjects and identify major discussion trends (not based merely on hashtags). Unfortunately, Twitter turned out to be not very suitable for my purposes, as it limits the length of a document to a few hundred characters. Such a strong limit results in a very sparse document-term matrix and makes the distance between tweets (cosine distance, in this case) driven by the random coincidence of a single term.
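The single-term problem is easy to demonstrate. In the sketch below (hypothetical TF-IDF vectors over a tiny shared vocabulary; the weights are made up for illustration), two otherwise unrelated short tweets share exactly one term, yet their cosine similarity comes out well above half:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# hypothetical TF-IDF vectors over a 4-term vocabulary:
# the two tweets overlap only in the first term
a = [0.9, 0.0, 0.8, 0.0]   # tweet 1
b = [0.9, 0.7, 0.0, 0.0]   # tweet 2
sim = cosine_similarity(a, b)
```

With vectors this sparse, a single shared high-weight term dominates the dot product, which is exactly why short tweets cluster on coincidences rather than on genuine topical similarity.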