
RSS (which, in its latest format, stands for “Really Simple Syndication”) is a family of web feed formats used to publish frequently updated content such as blog entries, news headlines or podcasts. An RSS document, which is called a “feed,” “web feed,” or “channel,” contains either a summary of content from an associated web site or the full text. RSS makes it possible for people to keep up with their favorite web sites in an automated manner that’s easier than checking them manually.
Many applications consume the information contained in RSS feeds. One of the challenges of this “site feed scraping” is finding the feed URLs in a web page. That’s why, after a little research and a little coding, we came up with a PHP class that does just that. When passed a URL, it returns all the RSS feed URLs that appear on that page, converting any relative links it finds to absolute ones. The class needs only one thing to work: the URL of the page you want to scan for RSS feeds. This URL is also used to resolve relative RSS links.
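To make the relative-link handling concrete, here is a minimal sketch of how a relative feed link could be resolved against the page URL. The function name `resolveUrl` is hypothetical and is not taken from the class itself, and the logic is a simplification rather than a full RFC 3986 resolver.

```php
<?php
// Hypothetical helper (not the class's actual method): resolve $href
// against the URL of the page it was found on.
function resolveUrl(string $base, string $href): string
{
    // A link that already carries a scheme is absolute: return it unchanged.
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return $href;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $host   = $parts['host'] ?? '';
    $port   = isset($parts['port']) ? ':' . $parts['port'] : '';
    $root   = $scheme . '://' . $host . $port;

    // Protocol-relative link, e.g. //cdn.example.com/feed.xml
    if (strpos($href, '//') === 0) {
        return $scheme . ':' . $href;
    }
    // Root-relative link, e.g. /feed.xml
    if (strpos($href, '/') === 0) {
        return $root . $href;
    }

    // Path-relative link: start from the directory of the base path.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    $full = $dir . $href;

    // Keep any query string or fragment out of the segment handling.
    $cut      = strcspn($full, '?#');
    $pathPart = substr($full, 0, $cut);
    $tail     = substr($full, $cut);

    // Collapse "." and ".." segments.
    $segments = [];
    foreach (explode('/', $pathPart) as $seg) {
        if ($seg === '..') {
            array_pop($segments);
        } elseif ($seg !== '.' && $seg !== '') {
            $segments[] = $seg;
        }
    }

    $resolved = '/' . implode('/', $segments);
    if (substr($pathPart, -1) === '/' && $resolved !== '/') {
        $resolved .= '/'; // preserve a trailing slash
    }
    return $root . $resolved . $tail;
}
```

With a page at `http://example.com/blog/index.html`, a link of `feed.xml` would resolve to `http://example.com/blog/feed.xml`, while `/rss` would resolve to `http://example.com/rss`.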
Behind the scenes, the class performs three main steps. First, the cURL library is used to fetch the content at the URL the user passed. Second, all the <link> tags are extracted; since PHP doesn’t have an SGML parser built in like Python seems to, this has to be done manually, but a few regular expressions and some simple string splitting made it easy. Last but not least, the class goes through all the links found, figures out which ones belong to RSS feeds, resolves them to absolute URLs if necessary, and stores them in an array, skipping any link that is already listed to prevent duplicates (e.g., when the same feed appears more than once in the page).
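The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the class’s actual source: the function names `fetchPage` and `extractFeedLinks` are hypothetical, a feed link is assumed to be a <link> tag with `rel="alternate"` and `type="application/rss+xml"`, and resolving relative links is left out to keep the example short.

```php
<?php
// Step 1 (hypothetical name): fetch the page body with cURL.
// Returns null if the request fails.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Steps 2 and 3 (hypothetical name): pull out the <link> tags, keep the
// ones that advertise an RSS feed, and drop duplicates.
function extractFeedLinks(string $html): array
{
    $feeds = [];

    // Grab every <link ...> tag, case-insensitively.
    preg_match_all('/<link\b[^>]*>/i', $html, $tags);

    foreach ($tags[0] as $tag) {
        // Split the tag into name => value pairs.
        // Only quoted attribute values are handled in this sketch.
        preg_match_all(
            '/([\w-]+)\s*=\s*(?:"([^"]*)"|\'([^\']*)\')/',
            $tag,
            $matches,
            PREG_SET_ORDER
        );
        $attrs = [];
        foreach ($matches as $m) {
            $value = $m[2] !== '' ? $m[2] : ($m[3] ?? '');
            $attrs[strtolower($m[1])] = html_entity_decode($value);
        }

        // Assumed test for "this link is an RSS feed".
        $rel  = preg_split('/\s+/', strtolower($attrs['rel'] ?? ''));
        $type = strtolower(trim($attrs['type'] ?? ''));
        $href = trim($attrs['href'] ?? '');
        if (!in_array('alternate', $rel, true) || $type !== 'application/rss+xml' || $href === '') {
            continue;
        }

        // Skip links that are already listed.
        if (!in_array($href, $feeds, true)) {
            $feeds[] = $href;
        }
    }
    return $feeds;
}
```

A caller would pass the result of `fetchPage($url)` to `extractFeedLinks()` after checking that it is not null. Keeping the fetch separate from the parsing also makes the parsing easy to try against a plain HTML string.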
All of this happens internally, so users of the class are shielded from these complexities. Currently, the class only looks for feeds in <link> tags, so if a page has an <a> tag linking to a feed, the class won’t be able to find it. This is not too big a problem, though, as most sites use <link> tags for their feeds so that search engines can easily find them.
You can download the code for this class here. The ZIP package contains the fully documented class source code, as well as a simple example that illustrates how to use the class. This class is featured on Freshmeat.