Extract posts from any subreddit in Python using PRAW
Hey fellow Python coder! In this tutorial, we will be learning how to extract data from various Subreddits present on Reddit. If you haven’t heard about Reddit before, let me explain what the exact purpose of the website is.
Reddit is a social media platform where people can discuss and share content on a wide range of topics. Reddit is divided into different “subreddits”, each of which groups content around a particular topic.
Prerequisites for Extracting Subreddit Posts
Before you can extract subreddit posts using Python, you have to put in a little extra effort and set up Reddit API access, which you will use to extract the posts in the later sections. How do you achieve that? Follow the steps below:
Navigate to Reddit’s Application Page. If you are not logged in or don’t have a Reddit account yet, log in or sign up first to reach the page. You will see a button that says “Create Another App” or “Create New App”. Simply click on the button, and you will see a form as shown below:
You can fill in the exact data that I have shown below or change it according to your preferences. But make sure that the Redirect URI is set to http://localhost:8080. This is common practice when developing a local application with the Reddit API, and it makes the authentication process easier.
After your app is created, you can find the client ID and client secret on the app details page. I am sharing the location of both in the screenshot below:
Once you have both your client ID and secret key, you are ready to get right into the coding in the upcoming section!
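Since the secret key works a bit like a password, you may prefer not to paste it directly into your script. Here is a minimal sketch of one option, reading the values from environment variables. The variable names (REDDIT_CLIENT_ID and so on) and the load_credentials helper are my own assumptions for illustration, not something Reddit or praw requires.

```python
import os


def load_credentials(env=os.environ):
    """Read the Reddit credentials from environment variables.

    The variable names below are hypothetical; use whatever names you like.
    """
    names = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "username": "REDDIT_USERNAME",
    }
    # Collect every variable that is unset or empty so the error is helpful
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


# Example usage (raises RuntimeError if the variables are not set):
# creds = load_credentials()
# print(creds["client_id"])
```

The helper accepts any dictionary-like object, so you can also pass a plain dict while experimenting.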
Scraping Subreddit Posts using the praw library
The praw library is a Python wrapper for the Reddit API that makes it easy to interact with Reddit from Python. So first of all, let’s make sure the library is installed on our system by running the command pip install praw in either the command prompt or a Jupyter Notebook / Google Colab notebook (inside a notebook cell, prefix the command with an exclamation mark: !pip install praw).
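If you are not sure whether the installation worked, a quick check like the sketch below can tell you before you run the rest of the code. The is_installed helper is a name I made up for this example; it only relies on Python's standard importlib module.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("praw"):
    print("praw is not installed yet; run: pip install praw")
```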
Setting Data Variables
Now if you are all set, then let’s start by setting some data values which include the following: clientID, secretKEY, userName, and the subReddit that needs extraction. Have a look at the code below:
clientID = 'Your_client_id_here'
secretKEY = 'Your_secret_key_here'
userName = 'Your_username_here'
subReddit = input("Enter the Sub-Reddit whose posts you need to extract : ")
Make sure you replace the dummy values for the ID, key, and username with your own data. For the subreddit value, let’s take the name as input from the user to make our code more dynamic.
The user only needs to put the name of the Sub-Reddit whose posts need to be extracted as shown in the output below. For this tutorial, to make things fun I have chosen the subReddit as ‘Memes’.
Enter the Sub-Reddit whose posts you need to extract : Memes
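People often type a subreddit name with the r/ prefix or with stray spaces. The sketch below tidies the input before it is used. It assumes that you want the bare name (for example Memes rather than r/Memes), and normalize_subreddit_name is a hypothetical helper, not part of praw.

```python
def normalize_subreddit_name(raw: str) -> str:
    """Strip whitespace, a leading 'r/' or '/r/', and any trailing slash."""
    name = raw.strip()
    for prefix in ("/r/", "r/"):
        if name.lower().startswith(prefix):
            name = name[len(prefix):]
            break
    name = name.strip("/")
    if not name:
        raise ValueError("Subreddit name must not be empty")
    return name


print(normalize_subreddit_name("  r/Memes "))
```

You could wrap the input call with it, for example subReddit = normalize_subreddit_name(input(...)).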
Creating the Reddit and Subreddit Bots
Next, let’s create a bot instance that will extract posts for us and make our lives easier. To create the bot instance, we will make use of praw.Reddit, which takes the data we set earlier as parameters. Along with this, we will also create a subreddit bot from the Reddit bot, which takes your subreddit name as a parameter.
import praw

clientID = 'Your_client_id_here'
secretKEY = 'Your_secret_key_here'
userName = 'Your_username_here'
subReddit = input("Enter the Sub-Reddit whose posts you need to extract : ")

# Create the Reddit bot from the credentials set above
redditBot = praw.Reddit(client_id=clientID,
                        client_secret=secretKEY,
                        user_agent=userName)

# Create the subreddit bot for the chosen subreddit
subredditBot = redditBot.subreddit(subReddit)
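The code above passes the username as the user_agent, which works for this tutorial. Reddit's API guidelines suggest a more descriptive string in the form platform:app ID:version (by /u/username), so here is a small sketch that builds one. The build_user_agent helper, the app name, and the version are all made up for illustration.

```python
def build_user_agent(platform: str, app_id: str, version: str, username: str) -> str:
    """Build a descriptive user agent in the format Reddit's API guidelines suggest."""
    return f"{platform}:{app_id}:{version} (by /u/{username})"


# Hypothetical app name and version; replace them with your own
userAgent = build_user_agent("python", "subreddit-extractor", "v0.1", "Your_username_here")
print(userAgent)
```

The resulting string can then be passed as user_agent=userAgent when creating the Reddit bot.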
Extracting Data from the SubReddit Bot
When it comes to Reddit, there are various categories of posts available, such as ‘New Posts’, ‘Top Posts’, ‘Hot Posts’, and others. The functions for getting each type of post are named just as you would expect: to get the new posts we have the function new and similarly, for the top and hot posts, we have the functions top and hot respectively.
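Because the three functions share the same shape, you can pick one by name and cap how many posts come back. The sketch below assumes an object that behaves like the subreddit bot from this tutorial, meaning it has new, hot, and top methods that accept a limit argument (as praw's listings do). The fetch_posts helper itself is hypothetical.

```python
def fetch_posts(subreddit, listing="new", limit=10):
    """Return up to `limit` posts from the chosen listing as a list."""
    allowed = ("new", "hot", "top")  # the three listings named in this tutorial
    if listing not in allowed:
        raise ValueError(f"listing must be one of {allowed}")
    listing_method = getattr(subreddit, listing)  # e.g. subreddit.new
    return list(listing_method(limit=limit))


# Example usage with the tutorial's bot (needs valid credentials):
# for post in fetch_posts(subredditBot, "hot", limit=5):
#     print(post.title)
```

When no limit is passed, praw fetches up to 100 posts by default, so setting a smaller limit keeps the output short while you experiment.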
Let’s start off by extracting the new posts from the Memes subreddit using the code below. Let’s try and print the title of each post.
# Loop over the new posts and print each title
for post in subredditBot.new():
    print(post.title)
The output of the code execution looks like this:
You can see that many post titles were printed when we ran the code. If we extract more data about the posts, the whole output will get quite messy. To solve the problem, why not make use of the DataFrame feature from the pandas module to make our presentation great?
Let’s try to achieve that now using the code below. And while we are at it, let’s add one more piece of data about each post as well, i.e. the unique ID of the post.
import pandas as pd

post_IDs = []
post_Titles = []

# Collect the ID and title of every new post
for post in subredditBot.new():
    post_IDs.append(post.id)
    post_Titles.append(post.title)

# Build a DataFrame with one column per list
final_Data = {'Post ID': post_IDs, 'Post Title': post_Titles}
dataFrame = pd.DataFrame(final_Data)
print(dataFrame)
The output of the code looks like this now:
The output looks pretty sorted and organized now. Right?
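Keeping one list per column works fine. As an alternative, you can collect one dictionary per post, which keeps each post's values together. The sketch below is only one way to do it; posts_to_records is a hypothetical helper, and it copies plain attributes such as id and title as they are, so the keys are the attribute names rather than the nicer column titles used above.

```python
def posts_to_records(posts, fields=("id", "title")):
    """Turn post objects into a list of dicts, one dict per post."""
    return [{field: getattr(post, field) for field in fields} for post in posts]


# Example usage with the tutorial's bot (needs valid credentials):
# records = posts_to_records(subredditBot.new())
# print(records[0])
```

A list of dictionaries like this can be passed straight to pd.DataFrame, which uses the dictionary keys as column names.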
Next, let’s not restrict ourselves to just two data points about the post. Have a look at the code below to extract more data about the posts as well.
import pandas as pd

post_IDs = []
post_Titles = []
post_Authors = []
post_NumComments = []
post_UpvoteRatio = []

# Collect several details about every new post
for post in subredditBot.new():
    post_IDs.append(post.id)
    post_Titles.append(post.title)
    # The author is missing for deleted accounts, so fall back to None
    post_Authors.append(post.author.name if post.author else None)
    post_NumComments.append(post.num_comments)
    post_UpvoteRatio.append(post.upvote_ratio)

final_Data = {
    'Post ID': post_IDs,
    'Post Title': post_Titles,
    'Author': post_Authors,
    'Num Comments': post_NumComments,
    'Upvote Ratio': post_UpvoteRatio
}
dataFrame = pd.DataFrame(final_Data)
print(dataFrame)
The final output looks like this:
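Did you notice that the printed output now looks much messier and not exactly organized? To organize it properly, let’s save the data to a CSV file and read it back using the pandas library with the code below:
# Save the DataFrame to a CSV file named after the subreddit
# (pass index=False to to_csv if you don't want the extra index column)
dataFrame.to_csv(subReddit+" newPosts.csv")

# Read the CSV file back and display the first five rows
readDF = pd.read_csv(subReddit+" newPosts.csv")
readDF.head()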
The final output looks like this:
Now it’s all organized and looks pretty good right? Fabulous!
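If you are curious about what actually ends up in the CSV file, the sketch below does a similar round trip with Python's built-in csv module instead of pandas. This is a swapped-in technique for illustration only, not what the pandas calls above do internally; the helper names and the sample record are made up, and the data is written to an in-memory string rather than a file.

```python
import csv
import io


def records_to_csv_text(records, fieldnames):
    """Write a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def csv_text_to_records(text):
    """Read CSV text back into a list of dicts (every value is a string)."""
    return list(csv.DictReader(io.StringIO(text)))


# Made-up sample record, just to show the round trip
sample = [{"Post ID": "abc", "Num Comments": 5}]
text = records_to_csv_text(sample, ["Post ID", "Num Comments"])
print(text)
print(csv_text_to_records(text))
```

Notice that the csv module hands every value back as a string, whereas pandas tries to infer a type for each column when it reads the file.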
Conclusion
In this tutorial, we’ve explored how to extract subreddit posts using Python and the praw library. By setting up the Reddit API and the praw library, we were able to retrieve various details from the new posts of a specified subreddit. Why not give the hot and top posts a shot as well? All you need to do is change one single function call in the entire code.
Hope you liked the tutorial and learned something new through it.
Happy Scraping!