This function takes a URL and returns a plain-text version of the page. It uses cURL to retrieve the page and a combination of regular expressions to strip the markup and all unwanted whitespace. It even strips the text inside STYLE and SCRIPT tags, which functions such as strip_tags leave behind (they strip only the tags, leaving the text in the middle intact).
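To illustrate the difference, here is a minimal sketch comparing strip_tags on its own with a pass that removes SCRIPT and STYLE elements first. The sample markup and the variable names are made up for the example; only the two regular expressions mirror the ones used in the function.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<p>Hello</p><script>var x = 1;</script><style>p { color: red; }</style>';

// strip_tags() drops the tags but keeps the text between them,
// so the script and style bodies leak into the output.
$naive = strip_tags($html);

// Removing the SCRIPT and STYLE elements first discards their contents too.
$cleaned = preg_replace(
    array('@<script[^>]*?>.*?</script>@si', '@<style[^>]*?>.*?</style>@si'),
    '',
    $html
);
$better = strip_tags($cleaned);

echo $naive, "\n";  // Hellovar x = 1;p { color: red; }
echo $better, "\n"; // Hello
```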

The trimming operations take place in 2 stages. This avoids deleting single carriage returns (which are also matched by \s) while still deleting all blank lines and runs of multiple linefeeds or spaces.
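A small sketch of the idea, using a made-up input string: collapsing only runs of two or more whitespace characters keeps a lone line break intact, whereas a single \s+ pass would flatten everything onto one line. Note that a run of spaces inside a line also becomes a newline with this approach.

```php
<?php
// Hypothetical input with blank lines, runs of spaces, and one single line break.
$text = "  First line\n\n\n   Second    line\nThird line  \n";

// Stage 1: collapse every run of two or more whitespace characters into one newline.
$stage1 = preg_replace('/\s{2,}/', "\n", $text);

// Stage 2: trim whatever whitespace is left at the very start and end.
$stage2 = trim($stage1);

// For comparison: a single pass over \s+ also swallows the lone line break.
$flat = trim(preg_replace('/\s+/', ' ', $text));

echo $stage2, "\n";
echo $flat, "\n";
```

Here $stage2 is "First line\nSecond\nline\nThird line", while $flat is "First line Second line Third line".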

function webpage2txt($url)
{
    $user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";

    $ch = curl_init(); // initialize handle
    curl_setopt($ch, CURLOPT_URL, $url); // set URL to fetch
    curl_setopt($ch, CURLOPT_FAILONERROR, 1); // fail on HTTP errors (status >= 400)
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // allow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return into a variable
    curl_setopt($ch, CURLOPT_TIMEOUT, 15); // times out after 15s
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);

    $document = curl_exec($ch);

    if ($document === false) {
        return false; // request failed or timed out
    }

    // First stage: strip markup and collapse runs of whitespace
    $search = array(
        '@<script[^>]*?>.*?</script>@si', // Strip out JavaScript
        '@<style[^>]*?>.*?</style>@si', // Strip style tags properly
        '@<[\/\!]*?[^<>]*?>@si', // Strip out tags
        '@<![\s\S]*?--[ \t\n\r]*>@', // Strip multi-line comments including CDATA
        '/\s{2,}/', // Collapse runs of two or more whitespace characters
    );

    $text = preg_replace($search, "\n", $document);

    // Decode entities after stripping, so escaped markup such as &lt;b&gt; survives as text
    $text = html_entity_decode($text);

    // Second stage: drop whitespace left at the start and end of each line
    $pat = array('/^\s+/m', '/\s+$/m');
    $text = preg_replace($pat, '', trim($text));

    return $text;
}

Potential uses of this function include extracting content from a webpage, counting words, and things like that. If you find it useful, drop us a comment and let us know where you used it.
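As a rough sketch of the word-counting use, the snippet below works on a hard-coded string that stands in for the return value of webpage2txt(), so it runs without a network connection. The sample text and variable names are assumptions for the example.

```php
<?php
// Hypothetical plain text, standing in for the return value of webpage2txt().
$text = "The quick brown fox\njumps over the lazy dog\nThe end";

// Total number of words in the text.
$total = str_word_count($text);

// Frequency of each word, case-insensitive, most frequent first.
$freq = array_count_values(str_word_count(strtolower($text), 1));
arsort($freq);

// The most frequent word is now the first key (array_key_first needs PHP 7.3+).
$top = array_key_first($freq);

echo "Total words: $total\n";
echo "Most frequent: $top ({$freq[$top]})\n";
```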
