15 April 2013

This snippet shows how to parse HTML with PHP. Parsing HTML is hard because people very often write syntactically incorrect HTML. The code below makes DOMDocument read the HTML and later write it back out, so your changes are applied and saved correctly. There are many different ways to parse HTML, but DOMDocument is bundled with PHP and doesn't require any extra libraries, so it might be a good choice.

// Assumes $html holds the HTML fragment and $urls maps url_id values to URLs.

// Load HTML as a DOMDocument.
$dom = new DOMDocument();

// Ignore warnings while loading possibly bad HTML syntax.
// The wrapper declares UTF-8 so the fragment is read with the right encoding.
@$dom->loadHTML(
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>' .
    $html .
    '</body></html>'
);

// Change the HTML: replace each url_id attribute with the matching href.
// Read the value before removing the attribute, and avoid modifying the
// attribute list while iterating over it.
foreach ($dom->getElementsByTagName('a') as $node) {
    if ($node->hasAttribute('url_id')) {
        $urlId = $node->getAttribute('url_id');
        $node->removeAttribute('url_id');
        $node->setAttribute('href', $urls[$urlId]);
    }
}

// Save the changed HTML, stripping the wrapper added above (doctype, html,
// head, the charset meta tag and body). \b keeps tags such as <header> intact.
$result = trim(preg_replace('~<(?:!DOCTYPE|\?xml|/?(?:html|head|meta|body)\b)[^>]*>\s*~i', '', $dom->saveHTML()));
Programming Language: PHP
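
To try the approach end to end, here is a minimal sketch that wraps the same steps in a function and runs it on a sample fragment. The function name rewriteLinks, its parameters, and the sample data are hypothetical and not part of the original snippet. It also swaps the @ operator for libxml_use_internal_errors(), which collects parser warnings instead of silencing them, and it assumes that unknown url_id values should be left untouched.

```php
<?php
// Hypothetical helper: the name and signature are illustrative only.
function rewriteLinks(string $html, array $urls): string
{
    $dom = new DOMDocument();

    // Collect libxml warnings internally instead of suppressing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML(
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
        . '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
        . '</head><body>' . $html . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $node) {
        if (!$node->hasAttribute('url_id')) {
            continue;
        }
        $urlId = $node->getAttribute('url_id');
        if (!isset($urls[$urlId])) {
            continue; // Assumption: leave links with unknown ids unchanged.
        }
        $node->removeAttribute('url_id');
        $node->setAttribute('href', $urls[$urlId]);
    }

    // Strip the wrapper (doctype, html, head, meta, body) from the output.
    return trim(preg_replace(
        '~<(?:!DOCTYPE|\?xml|/?(?:html|head|meta|body)\b)[^>]*>\s*~i',
        '',
        $dom->saveHTML()
    ));
}

// Sample input with a missing </p>, to mimic sloppy real-world markup.
$fragment = '<p>Visit <a url_id="home">our site</a>';
echo rewriteLinks($fragment, ['home' => 'https://example.com/']), "\n";
```

Note that the regular expression also removes any html, head, meta or body tags that appear inside the fragment itself, so this sketch only suits body-level fragments.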