Web Scraping using C++

So in this post, I’ll be teaching you how web scraping is done in C++ with the help of its libraries. Before starting anything, web scraping refers to collecting useful information from a website and converting it into a form more suitable for analysis. We will be using two libraries, namely libcurl and libxml2, to accomplish our task.

You first need to install both libraries. The complete installation steps can be found on their respective official websites.
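On Debian or Ubuntu, for example, installing the development packages with sudo apt-get install libcurl4-openssl-dev libxml2-dev typically does the job; package names on other platforms may differ.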

Required Steps for Web Scraping

  1. Download the page whose information you want to scrape, using libcurl.
  2. Using libxml2, parse the HTML document.
  3. Export the parsed data into a file.

Step 1: Download the Web Page

What we will do first is run a GET request (you can also use a POST request, but for such purposes a GET request is generally used). You can use the following code as a boilerplate; the function returns the HTML document as a string. (Note that we will be using a sample web page for our scraping purposes.)

#include <iostream>
#include <stdexcept>
#include <string>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
#include <curl/curl.h>

using namespace std;

string fetch_web_page_content(const string& web_page_url) {
    CURL* curl_handle = curl_easy_init();
    string downloaded_content;

    if (curl_handle) {
        try {
            curl_easy_setopt(curl_handle, CURLOPT_URL, web_page_url.c_str());
            // The unary + converts the capture-less lambda to a plain function
            // pointer, which is required when passing it through the variadic
            // curl_easy_setopt().
            curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION,
                +[](char* buffer, size_t size, size_t nmemb, void* userdata) -> size_t {
                    auto* response_data = static_cast<string*>(userdata);
                    response_data->append(buffer, size * nmemb);
                    return size * nmemb;
                });
            curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, &downloaded_content);
            CURLcode result = curl_easy_perform(curl_handle);
            if (result != CURLE_OK) {
                throw runtime_error(curl_easy_strerror(result));
            }
        } catch (const exception& e) {
            cerr << "Error fetching URL: " << e.what() << endl;
            downloaded_content = ""; // Clear output on error
        }
        curl_easy_cleanup(curl_handle);
    }

    return downloaded_content;
}

int main() {
    curl_global_init(CURL_GLOBAL_ALL);

    string retrieved_html = fetch_web_page_content("https://scrapeme.live/shop/");
    if (!retrieved_html.empty()) {
        // Process the HTML document here (e.g., using libxml for parsing)
        cout << retrieved_html;
    }

    curl_global_cleanup();
    return 0;
}


The output of the above code will be the raw HTML of the target URL.
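To try it out, compile and link against both libraries. Assuming a typical installation, a command along the lines of g++ scraper.cpp -o scraper -lcurl $(xml2-config --cflags --libs) should work, though the exact flags depend on your setup (scraper.cpp is just a placeholder name here).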

Step 2: Parse the HTML

Now, the rest of the job will be done by libxml2.
First of all, we can create a data structure for storing the data that we are going to scrape.

struct Product {
    string url;
    string image;
    string name;
    string price;
};

The next step is to feed the HTML document to libxml2: we parse the HTML string using htmlReadMemory(), so that we can apply XPath selectors to the parsed document.

Finally, based on the data we need, we can apply our selector strategy: we first collect the complete list of products into an array, then iterate over each product and extract the useful information. The markup the selectors rely on is sketched below.
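Judging from the XPath expressions used in the code, each product card on the sample page is assumed to look roughly like this (simplified):

<li class="product">
  <a href="...">
    <img src="..." />
    <h2>Product name</h2>
    <span>Price</span>
  </a>
</li>

If the target page uses a different structure, the expressions below will need to be adjusted accordingly.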

The code for the above purpose looks like this:

#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
#include <stdexcept>
#include <string>
#include <vector>

using namespace std;

// Declared here, defined below: extracts the first string result of an
// XPath expression evaluated relative to the current context node.
string xpath_extract_string(xmlXPathContextPtr context, const char* xpath_expr);

vector<Product> parse_products_from_html(const string& html_document) {
    // Parse the HTML document
    htmlDocPtr doc = htmlReadMemory(html_document.c_str(), static_cast<int>(html_document.length()), nullptr, nullptr, HTML_PARSE_NOERROR);
    if (!doc) {
        throw runtime_error("Failed to parse HTML document");
    }

    // Create XPath context
    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    if (!context) {
        xmlFreeDoc(doc);
        throw runtime_error("Failed to create XPath context");
    }

    // Find product elements
    xmlXPathObjectPtr product_html_elements = xmlXPathEvalExpression((xmlChar*)"//li[contains(@class, 'product')]", context);
    if (!product_html_elements) {
        xmlXPathFreeContext(context);
        xmlFreeDoc(doc);
        throw runtime_error("Failed to evaluate XPath expression");
    }

    // Extract product information
    vector<Product> products;
    xmlNodeSetPtr product_nodes = product_html_elements->nodesetval;
    for (int i = 0; product_nodes && i < product_nodes->nodeNr; ++i) {
        xmlNodePtr product_html_element = product_nodes->nodeTab[i];
        xmlXPathSetContextNode(product_html_element, context);

        // Extract product details
        string url = xpath_extract_string(context, ".//a/@href");
        string image = xpath_extract_string(context, ".//a/img/@src");
        string name = xpath_extract_string(context, ".//a/h2/text()");
        string price = xpath_extract_string(context, ".//a/span/text()");

        products.push_back({url, image, name, price});
    }

    // Free resources
    xmlXPathFreeObject(product_html_elements);
    xmlXPathFreeContext(context);
    xmlFreeDoc(doc);

    return products;
}

string xpath_extract_string(xmlXPathContextPtr context, const char* xpath_expr) {
    xmlXPathObjectPtr result = xmlXPathEvalExpression((xmlChar*)xpath_expr, context);
    if (!result) {
        return "";
    }
    if (!result->nodesetval || result->nodesetval->nodeNr == 0) {
        // Handle missing or empty results; free the XPath object to avoid a leak
        xmlXPathFreeObject(result);
        return "";
    }

    xmlChar* content = xmlNodeGetContent(result->nodesetval->nodeTab[0]);
    string value = content ? string(reinterpret_cast<char*>(content)) : "";
    xmlFree(content);
    xmlXPathFreeObject(result);
    return value;
}

Step 3: Export Data

The last step is to export the scraped data into a more useful format like CSV, so that it can be reused later if required. A simple function for the above purpose is:

#include <fstream>

void save_to_csv(const vector<Product>& products, const string& filename) {
    ofstream csv_file(filename);
    csv_file << "url,image,name,price" << endl;

    // Write one CSV record per product (note: fields containing commas
    // would need quoting in a production-grade CSV writer)
    for (const auto& product : products) {
        csv_file << product.url << "," << product.image << ","
                 << product.name << "," << product.price << endl;
    }

    csv_file.close();
}
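Putting it all together, we can replace the main() from Step 1 with a version that wires up all three steps. This is a minimal sketch, assuming the functions above are all in the same source file; the output file name products.csv is just an example:

int main() {
    curl_global_init(CURL_GLOBAL_ALL);

    // Step 1: download the page
    string html = fetch_web_page_content("https://scrapeme.live/shop/");
    if (!html.empty()) {
        try {
            // Step 2: parse the product list out of the HTML
            vector<Product> products = parse_products_from_html(html);

            // Step 3: export the products to a CSV file
            save_to_csv(products, "products.csv");
        } catch (const exception& e) {
            cerr << "Scraping failed: " << e.what() << endl;
        }
    }

    curl_global_cleanup();
    return 0;
}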

That was it for this post. Feel free to comment down below if you have any queries.
