Extract URL from a string in C++

In this post, we are going to learn how to extract all URLs from a given string in C++. The C++ standard library provides regular expression support through the <regex> header (available since C++11), and we will use it to achieve our goal.

Steps involved in extracting the URLs:

  1. First, create a regular expression that describes what a URL looks like. It will be used as the pattern for finding URLs in the given string. Many websites publish ready-made regular expressions that you can use directly in your code. Since you are an ardent reader of our posts, I’ll provide you with a regex later on in this post.
  2. Use a regex iterator (std::regex_iterator) to find all URLs in the given string (this will be clear when you see the code). The iterator walks the string from left to right and yields each non-overlapping match in turn.
  3. Store all the matched substrings inside a vector for future use.

The following code shows how to do this with <regex> in C++:

#include <iostream>
#include <regex>
#include <string>
#include <vector>

using namespace std;

// Appends every URL-like substring of text to urls.
void URL(const string& text, vector<string>& urls) {
  // Regex string: an optional http://, https:// or ftp:// scheme, followed by
  // two runs of URL characters separated by a dot
  static const regex url_regex(R"((?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+)");
  // Regex iterator that visits each match in the input string, left to right
  sregex_iterator iter(text.begin(), text.end(), url_regex);
  // A default-constructed iterator marks the end of the matches
  sregex_iterator end;

  // Store all URLs inside the vector
  for (; iter != end; ++iter) {
    urls.push_back(iter->str());
  }
}

int main() {
  // Given string str
  string str = "are bhai https://codeforces.com mast website hai magar www.codespeedy.com best hai";
  vector<string> urls;
  // Function call
  URL(str, urls);
  // Print each extracted URL on its own line
  for (const auto& url : urls) {
    cout << url << endl;
  }

  return 0;
}

The output of the above code will be:

https://codeforces.com
www.codespeedy.com
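
Each element the iterator yields is a match object, so besides the matched text you can also ask where in the string the match starts. The following is a minimal sketch of that idea. The function name find_urls_with_positions is hypothetical and not part of the code above; it reuses the same regex.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of the original example): returns each match
// together with the index at which it starts in text.
std::vector<std::pair<std::size_t, std::string>>
find_urls_with_positions(const std::string& text) {
  // Same pattern as in the example above
  static const std::regex url_regex(
      R"((?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+)");

  std::vector<std::pair<std::size_t, std::string>> found;
  // sregex_iterator is the standard alias for
  // regex_iterator<std::string::const_iterator>
  for (std::sregex_iterator it(text.begin(), text.end(), url_regex), end;
       it != end; ++it) {
    // *it is a std::smatch: position() is the start offset, str() the text
    found.emplace_back(static_cast<std::size_t>(it->position()), it->str());
  }
  return found;
}
```

For the input "go to www.a.com", this should return a single pair holding the offset 6 and the text "www.a.com". Keeping the offset is handy if you later want to replace or highlight the URLs in the original string.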

Some points to remember:

  1. A given regex may not suit every kind of input string. For example, the pattern used above also matches any dotted token, such as a file name or a decimal number. So choose your regex only after testing it thoroughly.
  2. Test your code with a variety of inputs to ensure correctness.
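
To make the testing advice concrete, here is a small sketch of how you might check a regex against known inputs. The names extract_urls and matches_exactly are hypothetical helpers introduced only for this illustration; the pattern is the one used earlier in this post.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every match of the post's pattern in text.
std::vector<std::string> extract_urls(const std::string& text) {
  static const std::regex url_regex(
      R"((?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+)");
  std::vector<std::string> urls;
  for (std::sregex_iterator it(text.begin(), text.end(), url_regex), end;
       it != end; ++it) {
    urls.push_back(it->str());
  }
  return urls;
}

// Hypothetical check: true when the matches found in text are exactly the
// expected ones, in the same order.
bool matches_exactly(const std::string& text,
                     const std::vector<std::string>& expected) {
  return extract_urls(text) == expected;
}
```

With these helpers, matches_exactly("visit https://codeforces.com now", {"https://codeforces.com"}) and matches_exactly("no links here", {}) both hold, as you would hope. The same check also exposes two weaknesses of this pattern: "pi is 3.14" yields the false positive "3.14", and "see www.codespeedy.com." yields "www.codespeedy.com." with the sentence's trailing period attached. Cases like these are worth collecting before you settle on a regex.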

 

That was all for this post, and I hope it was clear. Feel free to leave a comment if you have any questions or doubts.

