Extract posts from any subreddit in Python using PRAW

Hey fellow Python coder! In this tutorial, we will be learning how to extract data from various Subreddits present on Reddit. If you haven’t heard about Reddit before, let me explain what the exact purpose of the website is.

Reddit is a social media platform where people can discuss and share content on a wide range of topics. Reddit is divided into different “subreddits” which help to distribute the content into various categories focused on a certain topic.

Prerequisites for Extracting Subreddit Posts

Before you can extract subreddit posts with Python, you need to register an app with Reddit to get API credentials, which the code in the later sections relies on. How do you do that? Follow these steps:

Navigate to Reddit’s Application Page. If you are not logged in, or don’t have a Reddit account yet, log in or sign up first to reach the page. There you will see a button labeled “Create Another App” or “Create New App”. Click it, and a form like the one shown below appears:

Subreddits Posts Extraction using Python

You can fill in the same values that I have shown below or change them to suit your preferences, but make sure the Redirect URI is set to http://localhost:8080. This is common practice when developing a local application with the Reddit API, and it keeps the authentication process simple.

Subreddits Posts Extraction using Python

After your app is created, you can find the client ID and client secret on the app details page. I am sharing the location of both in the screenshot below:

Subreddits Posts Extraction using Python

Once you have both your client ID and secret key, you have everything you need to start coding in the next section!

Scraping Subreddit Posts using the praw library

The praw library is a Python wrapper for the Reddit API that makes it easy to interact with Reddit from Python. First, make sure the library is installed by running the command pip install praw in the command prompt. In a Jupyter or Google Colab notebook, run it in a cell as !pip install praw.

Setting Data Variables

With everything in place, let’s start by setting a few values: clientID, secretKEY, userName, and subReddit, the subreddit whose posts we want to extract. Have a look at the code below:

clientID = 'Your_client_id_here'
secretKEY = 'Your_secret_key_here'
userName = 'Your_username_here'
subReddit = input("Enter the Sub-Reddit whose posts you need to extract : ")

Replace the dummy values for the ID, key, and username with your own. The subreddit name is read from the user at run time, which makes the code more flexible.

The user only needs to enter the name of the subreddit whose posts should be extracted, as shown in the output below. For this tutorial, to make things fun, I have chosen ‘Memes’ as the subreddit.

Enter the Sub-Reddit whose posts you need to extract : Memes
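
Hard-coding the client ID and secret is fine for a quick experiment, but it is easy to leak them when you share the script. One option is to read them from environment variables instead. The sketch below is only an illustration: the variable names REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET, and REDDIT_USERNAME and the helper load_credentials are assumptions made for this example, not something praw or Reddit requires.

```python
import os


def load_credentials(env=None):
    """Read the Reddit app credentials from environment variables.

    The variable names below are hypothetical; use whatever names suit your setup.
    """
    # Fall back to the real environment when no mapping is passed in.
    env = os.environ if env is None else env
    names = {
        "clientID": "REDDIT_CLIENT_ID",
        "secretKEY": "REDDIT_CLIENT_SECRET",
        "userName": "REDDIT_USERNAME",
    }
    # Fail early with a clear message if any value is absent or empty.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}
```

After setting the three variables in your shell, creds = load_credentials() returns a dictionary, and creds["clientID"], creds["secretKEY"], and creds["userName"] can stand in for the hard-coded values.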

Creating Reddit and SubReddits Bot

Next, let’s create a bot instance that will extract the posts for us. To create it, we use the praw.Reddit class, which takes the values we set above as parameters. Note that the user_agent parameter is a string that identifies your script to Reddit; the code below simply passes the username for it. From the Reddit bot we then create a subreddit bot by passing the subreddit name to its subreddit method.

import praw

clientID = 'Your_client_id_here'
secretKEY = 'Your_secret_key_here'
userName = 'Your_username_here'
subReddit = input("Enter the Sub-Reddit whose posts you need to extract : ")

redditBot = praw.Reddit(client_id=clientID, client_secret=secretKEY, user_agent=userName)
subredditBot = redditBot.subreddit(subReddit)
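
Reddit’s API guidelines recommend a descriptive user agent in the form <platform>:<app ID>:<version string> (by /u/<reddit username>) rather than a bare username. Here is a minimal sketch of building one and passing it to praw.Reddit. The helper names build_user_agent and create_bots, and the example values, are hypothetical and only serve the illustration.

```python
def build_user_agent(platform, app_id, version, username):
    """Return a user agent string in the format Reddit's API guidelines recommend."""
    return f"{platform}:{app_id}:{version} (by /u/{username})"


def create_bots(client_id, client_secret, user_agent, subreddit_name):
    """Create the Reddit bot and the subreddit bot, mirroring the tutorial code."""
    import praw  # imported here so build_user_agent works even without praw installed

    reddit_bot = praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )
    # No username/password is passed, so praw runs in read-only mode,
    # which is enough for fetching public posts.
    return reddit_bot, reddit_bot.subreddit(subreddit_name)
```

For example, build_user_agent("python", "subreddit-extractor", "v0.1", userName) produces a string you can pass as user_agent in place of the plain username.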

Extracting Data from the SubReddit Bot

Reddit offers several listings of posts, such as ‘New Posts’, ‘Top Posts’, ‘Hot Posts’, and others. The method names for fetching them are self-explanatory: new returns the new posts, and top and hot return the top and hot posts respectively.

Let’s start off by extracting the new posts from the Memes subreddit using the code below, printing the title of each post.

for post in subredditBot.new():
    print(post.title)

The output of the code execution looks like this:

SubReddit All New Posts Output
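
The same loop works for the other listings, and praw’s listing methods accept a limit keyword to cap how many posts come back (without it, praw fetches up to 100). The sketch below wraps that in a small helper. The name fetch_titles and the choice to allow only new, hot, and top are assumptions made for this example.

```python
def fetch_titles(subreddit_bot, listing="new", limit=10):
    """Return the titles of up to `limit` posts from the chosen listing."""
    allowed = ("new", "hot", "top")
    if listing not in allowed:
        raise ValueError(f"listing must be one of {allowed}")
    # Look up subreddit_bot.new, .hot, or .top by name and call it.
    listing_method = getattr(subreddit_bot, listing)
    return [post.title for post in listing_method(limit=limit)]
```

Calling fetch_titles(subredditBot, "hot", limit=5) would then return the titles of five hot posts as a list.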

As you can see, running the code prints many post titles. Once we start extracting more data about each post, though, the printed output will get quite messy. To solve that, why not use the DataFrame feature of the pandas module to present the data neatly?

Let’s try to achieve that now using the code below. While we are at it, let’s also collect one more piece of data about each post: its unique ID.

import pandas as pd

post_IDs = []
post_Titles = []

for post in subredditBot.new():
    post_IDs.append(post.id)
    post_Titles.append(post.title)
    
final_Data = {'Post ID': post_IDs, 'Post Title': post_Titles}
dataFrame = pd.DataFrame(final_Data)

print(dataFrame)

The output of the code looks like this now:

SubReddit All New Posts DataFrame Output

The output looks pretty sorted and organized now. Right?

Next, let’s not restrict ourselves to just two data points about the post. Have a look at the code below to extract more data about the posts as well.

import pandas as pd

post_IDs = []
post_Titles = []
post_Authors = []
post_NumComments = []
post_UpvoteRatio = []

for post in subredditBot.new():
    post_IDs.append(post.id)
    post_Titles.append(post.title)
    post_Authors.append(post.author.name if post.author else None)
    post_NumComments.append(post.num_comments)
    post_UpvoteRatio.append(post.upvote_ratio)
    
final_Data = {
    'Post ID': post_IDs,
    'Post Title': post_Titles,
    'Author': post_Authors,
    'Num Comments': post_NumComments,
    'Upvote Ratio':post_UpvoteRatio
}

dataFrame = pd.DataFrame(final_Data)
print(dataFrame)

The final output looks like this:

SubReddit All New Posts DataFrame Output
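
Keeping one list per column works, but the lists can drift out of sync as more fields are added. An alternative is to build one dictionary per post and hand the list of dictionaries to pandas. The sketch below is one way to do that. The helper names post_to_row and posts_to_rows are hypothetical, and the 'Created (UTC)' column, taken from the post’s created_utc timestamp, is an extra field added for this example.

```python
from datetime import datetime, timezone


def post_to_row(post):
    """Turn one post into a dictionary whose keys become DataFrame columns."""
    created = datetime.fromtimestamp(post.created_utc, tz=timezone.utc)
    return {
        "Post ID": post.id,
        "Post Title": post.title,
        # Deleted accounts have no author, so fall back to None.
        "Author": post.author.name if post.author else None,
        "Num Comments": post.num_comments,
        "Upvote Ratio": post.upvote_ratio,
        "Created (UTC)": created.isoformat(),
    }


def posts_to_rows(posts):
    """Convert an iterable of posts into a list of row dictionaries."""
    return [post_to_row(post) for post in posts]
```

With pandas imported as pd, pd.DataFrame(posts_to_rows(subredditBot.new())) builds the same kind of table, since the DataFrame constructor accepts a list of dictionaries.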

Did you notice that the printed output looks messier now that there are more columns? To tidy it up, let’s save the DataFrame to a CSV file with pandas, read it back, and display the first few rows using the code below:

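# index=False keeps the row index out of the file, so no extra
# "Unnamed: 0" column appears when the file is read back
csv_name = subReddit + " newPosts.csv"
dataFrame.to_csv(csv_name, index=False)
readDF = pd.read_csv(csv_name)
readDF.head()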

The final output looks like this:

SubReddit All New Posts DataFrame Output
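
If you would rather not depend on pandas for the export step, the standard library’s csv module can write the same data. The sketch below assumes the posts are available as a list of dictionaries, one per post, with identical keys. The helper name write_posts_csv is hypothetical.

```python
import csv


def write_posts_csv(rows, file_obj):
    """Write a list of row dictionaries to an open text file as CSV.

    Returns the number of data rows written.
    """
    if not rows:
        return 0
    # The keys of the first row become the header line.
    writer = csv.DictWriter(file_obj, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)
```

To write to disk, open the file with open(subReddit + " newPosts.csv", "w", newline="", encoding="utf-8") and pass the file object as file_obj. The newline="" argument is what the csv module’s documentation recommends when opening files for writing.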

Now it’s all organized and looks pretty good right? Fabulous!

Conclusion

In this tutorial, we’ve explored how to extract subreddit posts using Python and the praw library. By setting up a Reddit API app and the praw library, we were able to retrieve various details from the new posts of a specified subreddit. Why not give the hot and top posts a shot as well? All you need to do is change a single method call in the code.

Hope you liked the tutorial and learned something new through it.

Also Read:

  1. Scrape HTML Table from a web page or URL in Python
  2. Extract current stock price using web scraping in Python
  3. Web Scraping using lxml in Python

Happy Scraping!
