Exploring what predicts viral Reddit posts [OC]
Data Source: Reddit API
Tools: PRAW, Python, sklearn, Infogram
Bonjour Reddit! Earlier this week I shared a post showing a post’s likelihood of making the front page at different scores at the 30-minute mark, and it generated a lot of interest and follow-up questions. I wanted to answer these with a follow-up post containing more visualizations (along the lines of what people were requesting) as well as a better breakdown of my methodology for finding and plotting this data.
To start with, let me define what I mean when I say a post goes viral:
For my project, “going viral” just means that a post received more than 8000 net upvotes. I considered making the definition relative to the subreddit the post is in, but I figured that how viral a post is does not need to be a relative concept and works better as an absolute one.
The model I created with sklearn for these predictions was an SVM.
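For anyone curious what that setup can look like, here is a minimal sketch, not my exact code. The data is synthetic (generated on the spot, not the real Reddit sample), the single feature is the score at 30 minutes, and the names `VIRAL_THRESHOLD` and `viral_probability` are made up for illustration. Using `probability=True` to get probabilities out of the SVM is one common option and is an assumption here.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

VIRAL_THRESHOLD = 8000  # net upvotes, per the definition above

# Synthetic stand-in data: NOT the real sample of Reddit posts.
rng = np.random.default_rng(0)
score_at_30 = rng.integers(0, 200, size=400)
# Hypothetical relationship: final score scales with the early score, plus noise.
final_score = score_at_30 * 80 + rng.normal(0, 3000, size=400)

X = score_at_30.reshape(-1, 1).astype(float)
y = (final_score > VIRAL_THRESHOLD).astype(int)  # 1 = went viral

# Scale the feature, then fit an SVM that can also report class probabilities.
model = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", probability=True, random_state=0),
)
model.fit(X, y)

def viral_probability(score):
    """Estimated chance of going viral for a given score at 30 minutes."""
    return float(model.predict_proba([[float(score)]])[0, 1])
```

Calling `viral_probability(120)` then gives one point on a "score at 30 minutes vs % chance of going viral" curve; sweeping over a range of scores gives the whole line.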
Also, to keep the scope of this project more manageable, I only collected data on viral posts from some of the more major subreddits. In total I collected 7675 posts from the following subs:
Art, AskReddit, aww, books, dataisbeautiful, DIY, Documentaries, EarthPorn, explainlikeimfive, food, funny, gaming, gifs, jokes, LifeProTips, movies, music, pics, ShowerThoughts, space, sports, tifu, todayilearned, videos and worldnews.
I also broke these subs down into the following groups, with sample sizes in parentheses, for some more in-depth visualizations:
Visual Media (2675): Art, aww, dataisbeautiful, EarthPorn, food, gifs, videos, pics, Documentaries
Text Information (3180): AskReddit, books, explainlikeimfive, jokes, LifeProTips, todayilearned, ShowerThoughts, world news
News Oriented (820): worldnews, sports, gaming
Now let me touch on each of the visualizations I have provided:
1) Post Score at 30 Minutes VS % Chance of Going Viral
This chart is most similar to my original post: it shows a post's score when it is 30 minutes old on the x axis and the probability of it going viral on the y axis. As you can see, the original aggregated line is in green, and I added other lines for the different subreddit groups that I defined. The data seems to suggest that the type of subreddit something is posted in affects how many upvotes the post needs after 30 minutes to make it likely to go viral. Furthermore, if we define the competitiveness of a subreddit as how many upvotes a post needs at the 30-minute mark to have a greater than 50% chance of going viral, we can see that visual media subreddits seem to be the least competitive while news subreddits seem to be the most competitive.
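To make that definition of competitiveness concrete, here is a rough sketch of how the 50% crossing point could be read off a probability curve. The curves below are made-up logistic functions standing in for a fitted model, and their midpoints are NOT values from my charts; `logistic_curve` and `competitiveness` are hypothetical helper names.

```python
import math

def logistic_curve(midpoint, steepness):
    """Return a hypothetical probability curve; stands in for a fitted model."""
    def prob(score):
        return 1.0 / (1.0 + math.exp(-steepness * (score - midpoint)))
    return prob

def competitiveness(prob_fn, max_score=1000):
    """Smallest 30-minute score whose viral probability is strictly above 50%."""
    for score in range(max_score + 1):
        if prob_fn(score) > 0.5:
            return score
    return None  # the curve never crosses 50% in the scanned range

# Made-up curves for illustration only.
curves = {
    "Visual Media": logistic_curve(midpoint=40, steepness=0.1),
    "News Oriented": logistic_curve(midpoint=90, steepness=0.1),
}
thresholds = {group: competitiveness(fn) for group, fn in curves.items()}
```

A higher threshold means a post needs more upvotes by the 30-minute mark, so under this definition the group with the larger number is the more competitive one.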
2) Posts Age When it Got X Up Votes VS % Chance of Going Viral
This chart shows how quickly a post reaches different scores and how that affects its probability of going viral. As you can see, these all seem to be roughly linear, proportional relationships.
3) Up Votes Minus Age in Minutes VS % Chance of Going Viral
This visualization was more of a last-minute addition. As you can see, as the difference between a post's score and its age in minutes increases, the post is more likely to go viral. Oddly, though, this data is not smooth and seems to swing back and forth as it goes up. This could be a result of not having enough data for accurate, smoothed results, or it could be indicative of some insight into Reddit’s underlying algorithm or some other unaccounted-for variable.
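One way to check the "not enough data" explanation is to bin posts by score minus age and look at how many posts land in each bin alongside the viral rate. This is only a sketch: the sample posts are invented, and `binned_viral_rate` and the bin width of 20 are hypothetical choices, not what I used for the chart.

```python
from collections import defaultdict

def binned_viral_rate(posts, bin_width=20):
    """Group posts by (score - age in minutes) and return, per bin,
    a tuple of (fraction that went viral, number of posts in the bin).

    posts: iterable of (score, age_minutes, went_viral) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # bin start -> [viral count, total count]
    for score, age_minutes, went_viral in posts:
        diff = score - age_minutes
        bin_start = (diff // bin_width) * bin_width
        counts[bin_start][0] += int(went_viral)
        counts[bin_start][1] += 1
    return {b: (viral / total, total) for b, (viral, total) in sorted(counts.items())}

# Tiny made-up sample, for illustration only.
posts = [
    (50, 30, False),   # difference 20
    (55, 30, True),    # difference 25
    (100, 30, True),   # difference 70
    (10, 30, False),   # difference -20
]
rates = binned_viral_rate(posts)
```

Bins with only a handful of posts can jump between 0% and 100% from one bin to the next, which would look a lot like the swings in the chart.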
Lastly, I know that in my last post some people were flaming the shit out of me for having no clue what I am doing, and specifically for using sklearn for this project. Those criticisms were all fair, as I am relatively new to ML and sklearn and honestly don’t have the best clue what I am doing. This project was actually a way for me to teach myself ML, and I figured if I could share some interesting visualizations of data along the way, great.