Programmers often need to strip all HTML from a document, leaving only plain text. PHP provides functions for stripping tags out of (X)HTML documents (strip_tags and fgetss, the latter deprecated in PHP 7.3 and removed in PHP 8.0), but they don’t remove every unwanted part of a document: the text between the <script> and </script> tags, for example, is left behind. Looking for a solution to this problem, I found a nifty one, originally posted by uersoy at tnn dot net, that deals with the text inside <script> and <style> elements and with comments, as well as regular tags.
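<?php
function get_plain_text($document) {
    // Patterns are applied in order, each one to the result of the previous.
    $search = array(
        '@<script[^>]*?>.*?</script>@si', // Strip out javascript
        // No U modifier here: it inverts greediness, so .*? would match
        // from the first <style> to the last </style>.
        '@<style[^>]*?>.*?</style>@si',   // Strip style tags properly
        '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
        '@<![\s\S]*?--[ \t\n\r]*>@'       // Strip remaining multi-line comments
    );
    $text = preg_replace($search, '', $document);
    return $text;
}
?>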
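As a quick check, here is a minimal usage sketch. The sample document is made up for illustration. The function is restated with plain ASCII quotes and without the U modifier so the snippet runs on its own, and the definition is skipped if get_plain_text already exists.

```php
<?php
// Restated from the article so this snippet is self-contained.
if (!function_exists('get_plain_text')) {
    function get_plain_text($document) {
        $search = array(
            '@<script[^>]*?>.*?</script>@si', // javascript
            '@<style[^>]*?>.*?</style>@si',   // style blocks
            '@<[\/\!]*?[^<>]*?>@si',          // HTML tags
            '@<![\s\S]*?--[ \t\n\r]*>@',      // remaining comments
        );
        return preg_replace($search, '', $document);
    }
}

// Hypothetical sample document, not taken from the original post.
$sample = '<html><head><style>p { color: red; }</style>'
        . '<script>alert("hi");</script></head>'
        . '<body><!-- note --><p>Hello <b>world</b></p></body></html>';

$plain = get_plain_text($sample);
echo $plain, "\n"; // Hello world
```

Note that the function only removes markup. Character entities such as &amp; are left untouched in the result.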
This function strips scripts, styles, comments and tags from your document. Should you need to keep some of them, just modify the list of regular expressions in $search. Still, if you only want to get rid of some HTML tags, you should use PHP’s native strip_tags function instead: it is faster and accepts a list of tags to keep.
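For comparison, here is a small sketch of the native approach, using strip_tags with and without its optional second argument (a string listing the tags to keep). The sample markup is hypothetical. It also shows the limitation mentioned at the start: the tags go away, but the script body stays in the output.

```php
<?php
// Hypothetical sample markup for illustration.
$html = '<p>Hello <b>world</b></p><script>alert(1);</script>';

// No allow-list: every tag is removed, but the script body survives as text.
$all_stripped = strip_tags($html);
echo $all_stripped, "\n"; // Hello worldalert(1);

// Allow-list: <b> is kept, all other tags are removed.
$bold_kept = strip_tags($html, '<b>');
echo $bold_kept, "\n"; // Hello <b>world</b>alert(1);
```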