Exploring Emotion in Reddit: Know Your Data

Paulo Malvar
6 min read · Oct 23, 2017

Introduction

Last year (2016) we worked on training an emotion classifier that we integrated into Courier. Initially this classifier was trained using the Support Vector Machine (SVM) algorithm provided in scikit-learn. Since then, the classifier has evolved and is now trained using a Deep Learning architecture that we put together with the fantastic combo that is Keras+TensorFlow. For a detailed discussion of our emotion classifier you can read my previous post, “Learning Emotions from Reddit.”

Even though this post is based on the SVM version of the classifier, we still think it would be fun to tell the story we found ourselves in back when we started working on recognizing emotions in informal, conversational text from Reddit.

So, after training, testing, and refining this classifier, we decided to have some fun with it and explore the frequency distribution of emotions in the version of the Reddit corpus that we downloaded around July 3rd, 2015. We specify the actual download date because several larger versions were released in the subsequent days, until the final version of the corpus was published, which contains around 1.7 billion Reddit comments and weighs 250GB compressed. Our version of this corpus contains around 54 million comments and weighs 5.1GB compressed.

So let’s explore this corpus!

Pre-processing Steps

Pre-processing of the Reddit corpus was performed in several modular stages. The first stage entailed the extraction of comments from the JSON objects in which they were encapsulated along with some metadata to contextualize them.
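As a rough sketch, this extraction step can look like the following, assuming the dump is distributed as bz2-compressed, newline-delimited JSON with the comment text in a ‘body’ field (the metadata field names follow the list given a bit further below; adjust them if the dump’s actual schema differs):

```python
import bz2
import json

# Metadata signals kept alongside each comment's text.
METADATA_FIELDS = [
    "subreddit_name", "subreddit_id", "created_utc", "retrieved_on",
    "author", "parent_id", "ups", "downs", "controversiality",
]

def extract_comments(path):
    """Yield (comment_text, metadata) pairs from a Reddit comment dump."""
    with bz2.open(path, mode="rt", encoding="utf-8") as dump:
        for line in dump:
            obj = json.loads(line)
            metadata = {field: obj.get(field) for field in METADATA_FIELDS}
            yield obj.get("body", ""), metadata
```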

Extracted comments were subsequently processed using Python’s markdown module to convert them into HTML and BeautifulSoup to strip the markup down to plain text.
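This conversion boils down to a couple of calls; a minimal sketch:

```python
import markdown
from bs4 import BeautifulSoup

def markdown_to_text(md_source):
    """Render Reddit Markdown to HTML, then strip the tags to plain text."""
    html = markdown.markdown(md_source)
    return BeautifulSoup(html, "html.parser").get_text()
```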

All comments without actual textual information, that is, comments consisting only of ASCII art or URLs, were discarded. Comments that passed this filter were finally tokenized and sentence-split using NLP-processing utilities that we have developed in-house.
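Our actual filtering rules live in our in-house tooling, but a simple heuristic along these lines captures the idea (the 0.5 alphabetic-character threshold is an illustrative choice, not our production value):

```python
import re

URL_RE = re.compile(r"https?://\S+")

def has_textual_content(comment, min_alpha_ratio=0.5):
    """Discard comments that are only URLs or ASCII art: after removing
    URLs, require that enough of the remaining non-whitespace characters
    are alphabetic."""
    stripped = URL_RE.sub("", comment)
    chars = [c for c in stripped if not c.isspace()]
    if not chars:
        return False
    alpha = sum(c.isalpha() for c in chars)
    return alpha / len(chars) >= min_alpha_ratio
```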

Metadata signals extracted along with Reddit comments were: ‘subreddit_name’, ‘subreddit_id’, ‘created_utc’, ‘retrieved_on’, ‘author’, ‘parent_id’, ‘ups’, ‘downs’, and ‘controversiality’.

During the second pre-processing stage we further analyzed the extracted comments using additional NLP techniques. In particular, this stage focused on lemmatizing and POS-tagging the extracted comments using a Python port of the Volsunga POS-tagging algorithm (DeRose, 1988).
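The Volsunga algorithm uses dynamic programming to pick the tag sequence that maximizes the product of word-given-tag and tag-transition probabilities. The toy sketch below only illustrates that idea; it is not our port, and the probability tables are assumed to have been estimated from a tagged corpus beforehand:

```python
def tag_sentence(words, tags, lexical_p, transition_p, floor=1e-8):
    """Toy dynamic-programming tagger in the spirit of DeRose (1988).

    lexical_p[(word, tag)] ~ P(word | tag); transition_p[(prev, tag)]
    ~ P(tag | prev); unseen events fall back to a small floor value.
    """
    # best[tag] = (score of the best tag path ending in `tag`, that path)
    best = {t: (lexical_p.get((words[0], t), floor), [t]) for t in tags}
    for word in words[1:]:
        new_best = {}
        for t in tags:
            candidates = [
                (prev_score
                 * transition_p.get((prev, t), floor)
                 * lexical_p.get((word, t), floor),
                 prev_path + [t])
                for prev, (prev_score, prev_path) in best.items()
            ]
            new_best[t] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]
```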

Finally, the third pre-processing stage focused on applying our emotion classifier to each individual sentence in order to determine whether it carried any emotional content.

All of this information was saved in a local sqlite3 database for easy storage and retrieval.
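Glued together, the last two stages amount to something like the sketch below. The table schema, the database file name, and the classifier’s predict interface are hypothetical stand-ins for our in-house code:

```python
import sqlite3

conn = sqlite3.connect("reddit_emotions.db")  # hypothetical file name
conn.execute("""
    CREATE TABLE IF NOT EXISTS sentences (
        sentence         TEXT,
        emotion          TEXT,     -- NULL when no emotional content was found
        subreddit_name   TEXT,
        author           TEXT,
        created_utc      INTEGER,
        controversiality INTEGER
    )
""")

def classify_and_store(sentences, meta, emotion_clf):
    # emotion_clf.predict is a stand-in: assume it returns an emotion
    # label such as 'joy' or 'anger', or None for emotionless sentences.
    rows = [
        (s, emotion_clf.predict(s), meta["subreddit_name"], meta["author"],
         meta["created_utc"], meta["controversiality"])
        for s in sentences
    ]
    conn.executemany("INSERT INTO sentences VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
```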

Exploring Emotions. What the…?

After all the pre-processing steps were applied to our Reddit corpus, it was time for us to start exploring how emotion frequency is distributed. For this we conducted several experiments using the ‘created_utc’ signal to plot emotions over time with Python’s matplotlib module.

We plotted all emotions in one graph using their ‘created_utc’ timestamps, smoothing the curves with the Exponential Moving Average algorithm, with N=23.

Emotions over created_utc timestamps
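The smoothing and plotting come down to a few lines; a sketch, assuming the per-timestamp emotion counts have already been aggregated from the database:

```python
import matplotlib.pyplot as plt

def exponential_moving_average(values, n=23):
    """Standard EMA with smoothing factor alpha = 2 / (n + 1)."""
    alpha = 2.0 / (n + 1)
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed

def plot_emotions(timestamps, counts_by_emotion, n=23):
    """counts_by_emotion maps an emotion label to its per-timestamp counts."""
    for emotion, counts in counts_by_emotion.items():
        plt.plot(timestamps, exponential_moving_average(counts, n), label=emotion)
    plt.xlabel("created_utc")
    plt.ylabel("sentences carrying the emotion")
    plt.legend()
    plt.show()
```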

Then we plotted individual emotions using the same ‘created_utc’ timestamps and same smoothing value, N=23. Below, we only show graphs for Disgust and Surprise.

Disgust over created_utc timestamps
Surprise over created_utc timestamps

Leaving aside the exceptionally large emotion peaks that can be observed on some days, we noticed something bizarre in these plots. Why do all of these graphs have a perfect wave-like form? Were we doing something wrong?

We started reviewing the code that we had put together to generate the graphs. Everything looked OK…

So then we converted all timestamps to regular dates and printed the minimum and maximum values we had.
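The sanity check was as simple as it gets; assuming timestamps holds every ‘created_utc’ value pulled from the database:

```python
from datetime import datetime, timezone

dates = [datetime.fromtimestamp(ts, tz=timezone.utc).date() for ts in timestamps]
print("Minimum date:", min(dates))
print("Maximum date:", max(dates))
```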

Minimum date: ‘2015-01-01’

Maximum date: ‘2015-01-31’

Now things started to make sense. It turned out, because we hadn’t checked that before 😅🔫, that the version of the Reddit corpus that we downloaded only had data for the month of January 2015. So we were looking at emotions over time for only one month.

In order to make sense of what the peaks and valleys that we observed in the graphs meant, we picked one day of that month, January 15th, 2015, and plotted the data to see how emotions evolved during each hour of that day. Below we only show graphs for Anger, Disgust, and Fear, but all emotions displayed the same pattern.

Anger distribution during Jan 15th 2015
Disgust distribution during Jan 15th 2015
Fear distribution during Jan 15th 2015

Aha!! As these graphs clearly show, emotion peaks are reached during evenings and late nights, and valleys correspond to earlier times of the day. So what does this mean?

One, we believe that what we are seeing, with emotion frequency distributions over time serving as a proxy, are user interaction patterns with Reddit.

And two, as the one-day graphs clearly indicate, user interaction valleys coincide with the early morning hours of the day (possibly when most people are getting things done at work?) and peaks coincide with the later hours of the day, evenings and nights (most likely after work, when people get back home?).
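For reference, the per-hour grouping behind these one-day plots is simple to reproduce; a minimal sketch, assuming rows holds (created_utc, emotion_label) pairs pulled from our sqlite3 database:

```python
from collections import Counter
from datetime import datetime, timezone

def hourly_counts(rows, emotion, day="2015-01-15"):
    """Count sentences labeled `emotion` in each hour of `day`."""
    counts = Counter()
    for ts, label in rows:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        if label == emotion and dt.strftime("%Y-%m-%d") == day:
            counts[dt.hour] += 1
    return [counts[h] for h in range(24)]
```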

In order to further analyze user interaction with Reddit, we decided to plot emotions by converting all timestamps to dates and grouping them by date to see what those patterns looked like. We also smoothed these graphs using the same Exponential Moving Average algorithm, but this time with N=2 (a quick sketch of this grouping follows). Below we only show graphs for Joy and Sadness.
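A sketch of that date grouping, reusing rows, the imports, and the EMA helper from the earlier sketches, and taking Joy as an example:

```python
# Count Joy-labeled sentences per calendar date, then smooth with N=2.
daily = Counter(
    datetime.fromtimestamp(ts, tz=timezone.utc).date()
    for ts, label in rows
    if label == "joy"
)
days = sorted(daily)
smoothed = exponential_moving_average([daily[d] for d in days], n=2)
```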

Joy distribution during Jan 2015
Sadness distribution during Jan 2015

Again, wave-like patterns emerged! So let’s look at the January 2015 calendar to see how these patterns align.

It turns out that people don’t interact with Reddit as much during the weekends (see, for example, the valley for January 10th and 11th, Saturday and Sunday, and the valley for January 17th and 18th, again Saturday and Sunday). Peaks occur during weekdays, reaching their maxima mid-week.

Conclusion

Our main conclusion is that people really need to know their data. Not having checked the date range of our version of the Reddit corpus led us to believe that we were doing something wrong.

It was a pleasant surprise, though, when we started making sense of the patterns in the graphs we generated and understood exactly what we were looking at. Sometimes data works as a proxy that reveals hidden gems. In this case we were able to visualize daily, weekly, and monthly user interaction patterns with Reddit.

Fun, right?

References

DeRose, S. J. (1988). Grammatical category disambiguation by statistical optimization. Computational Linguistics, 14, 31–39.
