HTML parsing has long been a pain point in PHP, especially when dealing with real-world, imperfect markup. This article shows how recent PHP releases change that story, introducing a modern HTML parser that accurately handles today’s web content, robust CSS selector support for element retrieval, and new DOM classes that align more closely with the official DOM specification. We will walk through practical examples in web scraping, content extraction, and HTML transformation, and you will learn how to migrate existing parsing code to benefit from these native improvements.

What is HTML parsing? It’s essentially a process to turn an HTML string into an object that we can work with more easily. It makes the elements inside easier to access. Your browser does this every time you load a web page.

Illustration of an HTML snippet being converted into a Document Object Model tree. The code on the left contains html, head, title, body, h1, and p elements; the diagram on the right shows those nodes as a hierarchy with text nodes like “Sample Page,” “Hello World,” and “This is a paragraph.”

The HTML is parsed into a DOM (Document Object Model) object. When you use your browser’s developer tools to inspect an element on a webpage, you are relying on the DOM. We can also turn a DOM object back into an HTML string if we need to save it somewhere.

Illustration of a Document Object Model tree on the left being converted back into HTML markup on the right. The tree contains html, head, title, body, h1, and p nodes, and the output shows the equivalent HTML source.
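As a minimal sketch of that round trip, the snippet below parses a string into a DOM object, reads one element, and serializes the tree back to a string. It uses the Dom\HTMLDocument class (PHP 8.4+) covered later in this article; the sample markup is an assumption modelled on the illustrations.

```php
<?php
// Assumed sample markup, modelled on the illustration above.
$html = '<!DOCTYPE html><html><head><title>Sample Page</title></head>'
      . '<body><h1>Hello World</h1><p>This is a paragraph.</p></body></html>';

// Parse: HTML string -> DOM object
$dom = Dom\HTMLDocument::createFromString($html);

// Work with the tree: read the text of the first <h1>
$heading = $dom->querySelector('h1')->textContent;
echo $heading . PHP_EOL;

// Serialize: DOM object -> HTML string
$roundTripped = $dom->saveHtml();
echo $roundTripped . PHP_EOL;
```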

For many developers, parsing HTML has long been a source of frustration. While PHP’s DOMDocument class has allowed us to do this, its reliance on the libxml2 library meant it could not handle the kind of HTML that browsers today have to deal with. It tripped up on certain inline JavaScript, leaking the code into other parts of the resulting DOM, and often required unreliable hacks to produce a decent result.

Note: There has been an open issue to add HTML5 support in libxml2 for a while. There also appears to have been some progress in the last year. At the same time, however, the maintainer of the library, Nick Wellnhofer, has announced he is stepping down. So at the time of writing, it’s unclear what the future holds.

With the release of PHP 8.4 in November 2024, that era is finally over.

PHP 8.4 introduces a massive overhaul to the DOM extension, featuring a new, standards-compliant HTML5 parser [2], native CSS selector support, and a modernized set of DOM classes. These additions make PHP a viable, high-performance choice for web scraping, content extraction, and HTML transformation tasks that previously required slower third-party libraries based on older specs.

Note: From this point on, we will simply use “HTML” rather than “HTML5.” The current standard is called the “HTML Living Standard,” maintained by WHATWG.

I’m slowly updating the PHP Readability library, used for article extraction, to use the new DOM API in PHP. In this article, we’ll explore what’s new, and walk through practical examples of how to migrate your own code.

Technical Foundation

The secret sauce behind PHP 8.4’s parsing capabilities is Lexbor, a high-performance HTML parser written in C by Alexander Borisov. Unlike libxml, Lexbor is based on the WHATWG HTML Living Standard. This means it parses HTML more like a modern web browser does – handling unclosed tags and quirky markup.

Because Lexbor is a C library, it is incredibly fast. It eliminates the overhead of userland parsers (like the popular html5-php library) and often outperforms the libxml parser while providing significantly better accuracy. It is included in the DOM extension by default, requiring no extra configuration or external dependencies.

The integration of Lexbor into PHP, along with the other DOM changes described in this article, came about thanks to Niels Dossche. Niels is a PHP core contributor and researcher at Ghent University.

For backward compatibility, he has ensured that the changes do not affect existing code, by providing the new DOM classes under the Dom namespace.

The Old Way: Parsing with libxml

Three-column comparison of LIBXML, HTML5-PHP, and LEXBOR. LIBXML is described as a fast C-based parser from 1999 based on HTML4 with partial HTML5 support; HTML5-PHP as a PHP library from 2013 based on an older W3C HTML5 standard; and LEXBOR as a fast C-based parser from 2018 based on the newer WHATWG standard, available by default in most PHP 8.4 installations.

To really appreciate the upgrade, let’s first look at the old way. The native PHP way has been to rely on the DOMDocument class, which uses libxml under the hood. While libxml is excellent for XML, it predates the current HTML Living Standard and has struggled with modern markup for a long time.

Consider the following HTML document. It contains two paragraphs and a script element in between them.

<!DOCTYPE html>
<title>Valid HTML Document</title>
<p>Paragraph 1</p>
<script>console.log("</html>Console log text");</script>
<p>Paragraph 2</p>

This is valid HTML. A browser knows that </html> is inside a string in a script element and should be treated as text. However, DOMDocument gets confused.

Note: Many opening and closing tags can be omitted, e.g. <html>, <head>, <body>. The parser will infer them automatically. I want to stress that all the HTML I’m presenting in this article is valid, conforming to the current HTML standard. While the HTML parser spec goes into details about how to handle invalid, non-conforming HTML, we’re setting a lower bar here when comparing parsing results, using HTML you’ll encounter in the wild.

$dom = new DOMDocument();
$dom->loadHTML($html);
$paragraphs = $dom->getElementsByTagName('p');
echo "Found {$paragraphs->length} paragraphs.";
// Output: Found 3 paragraphs.

Why 3 paragraphs, and not 2? Because DOMDocument sees the </html> inside the script, assumes the document has ended, and then treats the remaining text (Console log text");) and the second paragraph as new content outside the body, mangling the structure entirely. If you serialize this back to HTML, you get a broken mess:

<html>
<body>
<p>Paragraph 1</p>
<script>console.log("</script>
</body>
</html>
<html>
<p>Console log text");</p>
<p>Paragraph 2</p>
</html>

Two-column table mapping CSS selectors to XPath 1.0 expressions, with examples such as div.content, article#main, [src*="avatar"], article p, and article > p, plus two emoji-marked rows showing more awkward XPath equivalents for matching text and links.

The Workarounds

So what have developers done about this? Historically, there have been two main approaches:

Use a better parser: The popular library html5-php implements an older W3C HTML5 parsing spec in pure PHP. It’s an improvement over libxml, but it hasn’t kept up with the latest spec (WHATWG’s HTML Living Standard). Additionally, being a PHP implementation means it is slower than C-based parsers like libxml.
Clean the HTML before parsing: Some developers used the Tidy extension to repair and clean markup before parsing it.

With the new parser, neither of these should be needed now.

Note: Tidy re-writes the HTML in a way that older parsers can sometimes parse better. But not always. I’ve encountered HTML which either Tidy itself struggles with, or in which Tidy’s output doesn’t produce better results when passed to PHP’s DOMDocument.

The New Way: Parsing with Lexbor

With the release of PHP 8.4, PHP introduces the new Dom\HTMLDocument class. When you parse HTML using this class, you are using Lexbor, PHP’s new HTML parser.

Here is how we parse the same document with Lexbor, using the new class:

$dom = Dom\HTMLDocument::createFromString($html);
$paragraphs = $dom->getElementsByTagName('p');
echo "Found {$paragraphs->length} paragraphs.";
// Output: Found 2 paragraphs.

The new parser correctly identifies the context of the script tag and preserves the document structure.

Diagram showing the same source HTML processed with the new PHP 8.4+ Dom\HTMLDocument::createFromString() parser. It correctly reports 2 paragraph elements and serializes the document without corrupting the script content, demonstrating proper HTML5 parsing.
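To check this for yourself, a short sketch like the following inspects the parsed script element directly. It reuses the sample document from earlier; the variable names are only illustrative.

```php
<?php
$html = <<<'HTML'
<!DOCTYPE html>
<title>Valid HTML Document</title>
<p>Paragraph 1</p>
<script>console.log("</html>Console log text");</script>
<p>Paragraph 2</p>
HTML;

$dom = Dom\HTMLDocument::createFromString($html);

// The script's text is kept intact, including the "</html>" inside the string
$script = $dom->querySelector('script');
$scriptText = $script->textContent;
echo $scriptText . PHP_EOL;

// The script stays inside <body>, and no extra paragraph is invented
$insideBody = $dom->body->contains($script);
$paragraphCount = $dom->querySelectorAll('p')->length;
echo "Paragraphs: {$paragraphCount}" . PHP_EOL;
```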

Performance and Standards

Comparing PHP’s new parser with the html5-php library, I found the native PHP implementation is approximately 3.6x faster on average for typical news and blog pages. For larger, more complex documents, users should find it even faster.

More importantly, it adheres to a more recent HTML standard. HTML today is a “Living Standard” maintained by the WHATWG, meaning it has no version numbers and changes over time. Both libxml and html5-php are based on older standards. Lexbor, PHP’s new parser, is based on the more recent WHATWG standard, so its results are closer to what a modern browser produces.

New DOM Classes

To support the new features without breaking decades of existing code, PHP 8.4 introduces a new set of DOM classes under the Dom namespace. These live alongside the existing global classes (like DOMDocument), allowing both APIs to coexist in the same application.

Here is how the key classes map to the new namespace:

DOMDocument → Dom\HTMLDocument (there is also Dom\XMLDocument for XML)
DOMElement → Dom\Element
DOMNode → Dom\Node
DOMText → Dom\Text
DOMAttr → Dom\Attr
DOMXPath → Dom\XPath

Why create new classes instead of fixing the old ones? Niels found that attempts to fix bugs in the old DOM classes caused too many issues because many of us have had to rely on the incorrect behavior. By creating a fresh namespace, the new classes can adhere strictly to the spec while the old classes remain untouched for legacy code.

Migration and Interoperability

Thankfully, migration doesn’t have to be all-or-nothing. You can mix both APIs in the same codebase. And if you need to, you can import legacy DOMNode objects with the importLegacyNode method:

$oldDom = new DOMDocument();
$oldDom->loadHTML('<p>Old node</p>');
$oldElement = $oldDom->getElementsByTagName('p')->item(0);
echo "Old element class: " . $oldElement::class . PHP_EOL;
// Output: Old element class: DOMElement

$newDom = Dom\HTMLDocument::createFromString('<!DOCTYPE html>');
$newElement = $newDom->importLegacyNode($oldElement, deep: true);
echo "New element class: " . $newElement::class . PHP_EOL;
// Output: New element class: Dom\Element
$newDom->body->append($newElement);
// Serialise to HTML
echo $newDom->body->innerHTML;
// Output: <p>Old node</p>

DOM Properties and innerHTML

The new API introduces several quality-of-life improvements that reduce boilerplate code.

Top-Level Properties

You no longer need to traverse the tree to find the <body> or <head> tags. They are now exposed as first-class properties on the document object:

$html = '<!DOCTYPE html>
<title>Old title</title>
<h1>Hello</h1>';

$dom = Dom\HTMLDocument::createFromString($html);
// Access convenience elements directly
echo $dom->head::class . PHP_EOL; // Dom\HTMLElement
echo $dom->body::class . PHP_EOL; // Dom\HTMLElement
// Read or write the title directly
echo $dom->title . PHP_EOL;
// Output: Old title
$dom->title = "New Title";
echo $dom->head->innerHTML;
// Output: <title>New Title</title>

innerHTML Support

What about getting the HTML content of an element? We now have native innerHTML support.

It works just like JavaScript:

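// Assumed sample document containing a <div> to work with
$dom = Dom\HTMLDocument::createFromString('<!DOCTYPE html><div><p>Original content</p></div>');
$div = $dom->querySelector('div');

// Read content
echo $div->innerHTML;
// Output: <p>Original content</p>

// Write content (automatically parses the string into nodes)
$div->innerHTML = '<p>Replaced content</p>';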

Note: While innerHTML is supported, outerHTML is not yet available in this release.
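Until outerHTML is available, one possible stand-in is to pass the element to the document's saveHtml() method, which serializes that node together with its children. The sketch below assumes a small sample document; the class name box is made up for illustration.

```php
<?php
// Assumed sample markup; the "box" class is only for illustration.
$dom = Dom\HTMLDocument::createFromString(
    '<!DOCTYPE html><div class="box"><p>Original</p></div>'
);

$div = $dom->querySelector('div.box');

// innerHTML: only the element's children
$inner = $div->innerHTML;
echo $inner . PHP_EOL;

// Stand-in for outerHTML: serialize the element itself plus its children
$outer = $dom->saveHtml($div);
echo $outer . PHP_EOL;
```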

CSS Selector Support

Perhaps the most exciting feature for web scraping is native support for CSS selectors. You can finally say goodbye to getElementsByTagName and the complexity of DOMXPath.

The new classes implement querySelector and querySelectorAll, behaving identically to their JavaScript counterparts:

querySelector($selectors) — Returns the first descendant element that matches the CSS selectors
querySelectorAll($selectors) — Returns a NodeList containing all descendant elements that match the CSS selectors

$dom = Dom\HTMLDocument::createFromString($html);

// Find the first matching element
$article = $dom->querySelector('article.main');

// Find all matching elements (returns a NodeList)
$links = $dom->querySelectorAll('ul.nav > li > a');
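The two methods differ in how they report "nothing found", which matters when scraping pages you don't control. Here is a small sketch, assuming some sample navigation markup, that checks for a missing element and loops over a list of matches:

```php
<?php
// Assumed sample markup for illustration.
$html = <<<'HTML'
<!DOCTYPE html>
<ul class="nav">
  <li><a href="/">Home</a></li>
  <li><a href="/blog">Blog</a></li>
</ul>
<article class="main"><p>Body text</p></article>
HTML;

$dom = Dom\HTMLDocument::createFromString($html);

// querySelector() returns null when nothing matches, so check before use
$sidebar = $dom->querySelector('aside.sidebar');
$hasSidebar = $sidebar !== null;

// querySelectorAll() always returns a NodeList, which may be empty
$links = $dom->querySelectorAll('ul.nav > li > a');
$hrefs = [];
foreach ($links as $link) {
    $hrefs[] = $link->getAttribute('href');
}

echo "Sidebar present: " . ($hasSidebar ? 'yes' : 'no') . PHP_EOL;
echo "Found {$links->length} links: " . implode(', ', $hrefs) . PHP_EOL;
```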

Powerful Selectors

You aren’t limited to basic class or ID selectors. You have access to modern, complex CSS selectors:

Multiple Element Types: Select headers and paragraphs in one go:

$elements = $dom->querySelectorAll('h1, h2, h3, p');

Functional pseudo-classes (:is, :where): Simplify complex grouping:

// Select paragraphs and main headings that are direct children of article
$elements = $dom->querySelectorAll('article > :is(p, h1, h2)');
// Same as
// $elements = $dom->querySelectorAll('article > p, article > h1, article > h2');

State Selectors (:empty, :not):

// Find all paragraphs that are NOT empty
$elements = $dom->querySelectorAll('p:not(:empty)');

Relational Pseudo-class (:has): Get h1 headings that are followed immediately by an h2 heading:

$headings = $dom->querySelectorAll('h1:has(+ h2)');

Get all paragraphs in an article that have at least one link inside them:

$paragraphsWithLinks = $dom->querySelectorAll('article p:has(a)');

Attribute Selectors: Target specific attribute values, including partial matches (note the ‘i’ to signal case-insensitive matching):

// Find secure external links
$secureLinks = $dom->querySelectorAll('a[href^="https://" i]:not([href*="example.com" i])');
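To see how these selectors behave together, the following sketch runs each of them against one small document and prints the number of matches. The markup is an assumption written for this example, so the counts only hold for this input.

```php
<?php
// Assumed sample markup, written to exercise each selector above.
$html = <<<'HTML'
<!DOCTYPE html>
<article>
  <h1>Title</h1>
  <h2>Subtitle</h2>
  <p>Intro with a <a href="https://example.com/about">local link</a>.</p>
  <p></p>
  <p>See <a href="HTTPS://php.net/">PHP</a>.</p>
  <div><p>Nested paragraph</p></div>
</article>
HTML;

$dom = Dom\HTMLDocument::createFromString($html);

$selectors = [
    'article > :is(p, h1, h2)',
    'p:not(:empty)',
    'h1:has(+ h2)',
    'article p:has(a)',
    'a[href^="https://" i]:not([href*="example.com" i])',
];

// Count the matches for each selector
$counts = [];
foreach ($selectors as $selector) {
    $counts[$selector] = $dom->querySelectorAll($selector)->length;
    echo str_pad($selector, 55) . $counts[$selector] . PHP_EOL;
}
```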

Note: One missing feature is the :scope pseudo-class, which can be used to refer to the current element when there’s a need to use a combinator, as in $article->querySelectorAll(':scope > p'). Using it currently throws a DOMException. This is a known limitation in Lexbor, and it is being worked on.
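In the meantime, one possible workaround for the direct-child case is to walk the element's children yourself and test each one with matches(). This is only a sketch with assumed sample markup, not an official replacement for :scope.

```php
<?php
// Assumed sample markup: two direct <p> children and one nested <p>.
$dom = Dom\HTMLDocument::createFromString(
    '<!DOCTYPE html><article><p>Direct child</p>'
    . '<div><p>Nested</p></div><p>Another direct child</p></article>'
);

$article = $dom->querySelector('article');

// Equivalent in spirit to ':scope > p': keep only direct element children that match 'p'
$directParagraphs = [];
for ($child = $article->firstElementChild; $child !== null; $child = $child->nextElementSibling) {
    if ($child->matches('p')) {
        $directParagraphs[] = $child->textContent;
    }
}

print_r($directParagraphs);
```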

XPath Selectors

While CSS selectors are an excellent new addition, XPath remains available. I recommend using CSS selectors whenever you can, as they’re usually easier and more concise to write.

In the past, people would turn to XPath because CSS selectors were not as powerful as they are today, and they were not available in PHP natively. Those who wanted to use CSS selectors in PHP had to rely on libraries that converted CSS to XPath under the hood, such as Symfony’s CssSelector component.

Nonetheless, XPath can still be used if you need more complex logic in your selectors or if you’re migrating code that already relies on XPath.

Common CSS/XPath selectors
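As a rough illustration of how the two notations line up, the sketch below runs a few CSS selectors and their XPath 1.0 counterparts against the same document and compares the match counts. The XPath expressions are standard equivalents that I've written for this example, not copied from a table, and the h prefix is an assumption needed because of the namespace handling described in the next section.

```php
<?php
// Assumed sample markup for illustration.
$html = <<<'HTML'
<!DOCTYPE html>
<article id="main">
  <p>Direct</p>
  <div class="content wide"><p>Nested</p><img src="/img/avatar-1.png" alt=""></div>
</article>
HTML;

$dom = Dom\HTMLDocument::createFromString($html);
$xpath = new Dom\XPath($dom);
// HTML elements live in the XHTML namespace, so XPath needs a prefix ("h" is arbitrary)
$xpath->registerNamespace('h', 'http://www.w3.org/1999/xhtml');

// CSS selector => XPath 1.0 equivalent
$pairs = [
    'div.content'     => '//h:div[contains(concat(" ", normalize-space(@class), " "), " content ")]',
    'article#main'    => '//h:article[@id="main"]',
    '[src*="avatar"]' => '//*[contains(@src, "avatar")]',
    'article p'       => '//h:article//h:p',
    'article > p'     => '//h:article/h:p',
];

$results = [];
foreach ($pairs as $css => $expression) {
    $results[$css] = [
        'css'   => $dom->querySelectorAll($css)->length,
        'xpath' => $xpath->query($expression)->length,
    ];
    echo "{$css}: CSS={$results[$css]['css']}, XPath={$results[$css]['xpath']}" . PHP_EOL;
}
```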

Namespace warning

It’s important to note that if you’ve previously used XPath with HTML parsed with PHP’s DOMDocument, switching to Dom\HTMLDocument will require that you pay attention to namespaces.

The new parser assigns namespaces to HTML, SVG and MathML elements, in line with the HTML standard. This means XPath queries that worked before may return empty results. Consider this HTML with an embedded SVG:

<article>
<svg width="200" height="100">
<text x="100" y="50">Hello SVG</text>
</svg>
</article>

With the old DOMDocument, a simple XPath query works without any namespace handling:

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$texts = $xpath->query('//article//svg');
// Works: returns the <svg> element

However, with Dom\HTMLDocument, the same query returns nothing because the article and SVG elements are now placed in the HTML and SVG namespaces:

$dom = Dom\HTMLDocument::createFromString($html);
$xpath = new Dom\XPath($dom);
$texts = $xpath->query('//article//svg');
// Returns empty! The elements are in a namespace.

To fix this, you must register the namespace and use a prefix in your XPath:

$dom = Dom\HTMLDocument::createFromString($html);
$xpath = new Dom\XPath($dom);
$xpath->registerNamespace('h', 'http://www.w3.org/1999/xhtml');
$xpath->registerNamespace('s', 'http://www.w3.org/2000/svg');
$texts = $xpath->query('//h:article//s:svg');
// Works: returns the <svg> element

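If you would rather not rewrite your XPath queries, you can instead pass the Dom\HTML_NO_DEFAULT_NS option when parsing, which stops the parser from assigning namespaces. However, this does have a downside: if your source HTML uses <template> elements, the contents of those elements will no longer be hidden when working with the DOM.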
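For code that relies on unprefixed XPath, the Dom\HTML_NO_DEFAULT_NS parser option can be passed as the second argument to createFromString(). The sketch below compares the same query with and without it, using assumed sample markup limited to plain HTML elements.

```php
<?php
// Assumed sample markup for illustration.
$html = '<!DOCTYPE html><article><p>Hello</p></article>';

// Default parsing: elements are namespaced, so the unprefixed query finds nothing
$namespaced = Dom\HTMLDocument::createFromString($html);
$withNamespaces = (new Dom\XPath($namespaced))->query('//article//p')->length;

// With Dom\HTML_NO_DEFAULT_NS: no namespaces are set, so the same query matches
$plain = Dom\HTMLDocument::createFromString($html, Dom\HTML_NO_DEFAULT_NS);
$withoutNamespaces = (new Dom\XPath($plain))->query('//article//p')->length;

echo "Default: {$withNamespaces}, HTML_NO_DEFAULT_NS: {$withoutNamespaces}" . PHP_EOL;
```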

Serialize to HTML

Serialization is turning the DOM object you’ve been working with into an HTML string. If you need to save the results as an HTML file or store it in a database, you’ll want to serialize.

// Save the entire document
echo $dom->saveHtml();

// Save a specific node (and its children)
echo $dom->saveHtml($dom->body);
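If the goal is a file on disk, the document can also be written out directly with saveHtmlFile(). Here is a minimal sketch, assuming a temporary file as the destination (the path and sample markup are only for illustration):

```php
<?php
// Assumed sample markup for illustration.
$dom = Dom\HTMLDocument::createFromString(
    '<!DOCTYPE html><title>Saved page</title><p>Hello</p>'
);

// Hypothetical destination: a temporary file created just for this example
$path = tempnam(sys_get_temp_dir(), 'dom_');

// Write the whole document to disk; the return value is the number of bytes written
$bytes = $dom->saveHtmlFile($path);

// Read it back to confirm what was stored, then clean up
$saved = file_get_contents($path);
unlink($path);

echo "Wrote {$bytes} bytes" . PHP_EOL;
echo $saved . PHP_EOL;
```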

Practical Examples

Now that we understand the API, let’s put it to work in some real-world scenarios.

Web Scraping

Here is an example of extracting quotes and their authors into an array:

use Symfony\Component\HttpClient\HttpClient;

// Fetching the content
$client = HttpClient::create();
$response = $client->request('GET', 'https://quotes.toscrape.com/');
$html = $response->getContent();

// Parsing HTML with Lexbor
$dom = Dom\HTMLDocument::createFromString($html);

// Extract quotes using CSS selectors
$quotes = [];
foreach ($dom->querySelectorAll('.quote') as $element) {
    $quote = $element->querySelector('.text')->textContent;
    $author = $element->querySelector('.author')->textContent;
    $authorUrl = $element->querySelector('a[href ^= "/author/"]')->getAttribute('href');
    $quotes[] = [
        'quote' => mb_trim($quote),
        'author' => mb_trim($author),
        'authorUrl' => $authorUrl
    ];
}

print_r($quotes);
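Real pages are rarely this uniform, and querySelector() returns null when an element is missing. A more defensive variant, sketched below with assumed inline markup instead of a live request, uses the nullsafe operator so that one incomplete entry doesn't stop the whole scrape:

```php
<?php
// Assumed sample markup; the second quote deliberately lacks an author and link.
$html = <<<'HTML'
<!DOCTYPE html>
<div class="quote">
  <span class="text">First quote</span>
  <small class="author">Author One</small>
  <a href="/author/Author-One">(about)</a>
</div>
<div class="quote">
  <span class="text">Second quote</span>
</div>
HTML;

$dom = Dom\HTMLDocument::createFromString($html);

$quotes = [];
foreach ($dom->querySelectorAll('.quote') as $element) {
    // ?-> yields null instead of an error when querySelector() finds nothing
    $text = $element->querySelector('.text')?->textContent;
    if ($text === null) {
        continue; // skip entries without any quote text
    }
    $quotes[] = [
        'quote'     => trim($text),
        'author'    => trim($element->querySelector('.author')?->textContent ?? ''),
        'authorUrl' => $element->querySelector('a[href^="/author/"]')?->getAttribute('href'),
    ];
}

print_r($quotes);
```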

JavaScript-rendered HTML

When working with real-world web pages, you will likely encounter HTML that contains shell elements that are then filled with content after JavaScript has been executed in your browser.

If the content you’re after requires JavaScript rendering, you will want to use a headless browser. There are services you can use for this, or if you’re testing locally, you can use Chrome’s --dump-dom flag:

chrome --headless --dump-dom https://quotes.toscrape.com/js/

You can capture the output in PHP with the following:

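$url = 'https://quotes.toscrape.com/js/';
$command = 'chrome --headless --dump-dom ' . escapeshellarg($url);
$html = shell_exec($command);
// shell_exec() returns null or false when the command fails or prints nothing
if (!is_string($html)) { throw new RuntimeException('Headless Chrome returned no output.'); }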

Removing elements

A common task when working with HTML is to remove the bloat that is often interleaved with the content that you want to extract. This can be ads, related links, social media share buttons, and so on.

With CSS selectors, it’s easy to target all these in one comma-separated selector list.

$dom = Dom\HTMLDocument::createFromString($html);

// Remove clutter (scripts, styles, navs, footers)
$selector = 'script, style, nav, footer, aside, .ad-banner, .social-share';
foreach ($dom->querySelectorAll($selector) as $clutter) {
    $clutter->remove();
}
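Here is the same clean-up as a self-contained sketch, so you can see what is left afterwards. The markup is an assumption for illustration; it includes a nav nested inside a footer to show that removing an ancestor first does not cause problems for the loop.

```php
<?php
// Assumed sample markup for illustration.
$html = <<<'HTML'
<!DOCTYPE html>
<body><nav><a href="/">Home</a></nav><article><p>Content</p><div class="ad-banner">Ad</div></article><footer>Footer <nav>links</nav></footer><script>track();</script></body>
HTML;

$dom = Dom\HTMLDocument::createFromString($html);

// Remove clutter (scripts, styles, navs, footers)
$selector = 'script, style, nav, footer, aside, .ad-banner, .social-share';
foreach ($dom->querySelectorAll($selector) as $clutter) {
    $clutter->remove();
}

$cleaned = $dom->body->innerHTML;
echo $cleaned . PHP_EOL;
// Output: <article><p>Content</p></article>
```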

Extracting article content

If you’re working with web articles (e.g., news stories, blog posts), I maintain the PHP port of Readability.js, which can be useful to isolate the content HTML automatically before you parse and work on it further.

use fivefilters\Readability\Readability;
use fivefilters\Readability\Configuration;

// Article URL
$url = 'https://www.medialens.org/2020/cogitation-meditation-in-an-age-of-cataclysms/';
// for simplicity we'll use file_get_contents() here
$html = file_get_contents($url);

// Configure Readability
$configuration = new Configuration([
    'fixRelativeURLs' => true,
    'originalURL' => $url,
]);

// Detect and extract article body
$readability = new Readability($configuration);
$readability->parse($html);
$contentHtml = $readability->getContent();
$dom = Dom\HTMLDocument::createFromString($contentHtml);

HTML Sanitization

When working with HTML you have not produced yourself (e.g., HTML you have fetched, or user-submitted content), you are handling untrusted HTML. Before outputting it for display, you should sanitize it to prevent XSS attacks. Symfony’s HTML Sanitizer component is designed for this.

use Symfony\Component\HtmlSanitizer\HtmlSanitizer;
use Symfony\Component\HtmlSanitizer\HtmlSanitizerConfig;

$config = new HtmlSanitizerConfig()->allowSafeElements()->allowRelativeLinks();
$sanitizer = new HtmlSanitizer($config);

$dirty = '<a href="/page" onclick="alert(\'XSS\')">Click</a>';
echo $sanitizer->sanitize($dirty);
// Output: <a href="/page">Click</a>

The sanitizer automatically strips dangerous attributes like onclick while preserving allowed elements and attributes.

Migration Guide

Migrating to the new API is generally straightforward, but there are a few key differences to be aware of.

Class Mapping

new DOMDocument() → Dom\HTMLDocument::createEmpty()
$dom->loadHTML($html) → Dom\HTMLDocument::createFromString($html)
$dom->loadHTMLFile($file) → Dom\HTMLDocument::createFromFile($file)
DOMElement → Dom\Element
DOMXPath → Dom\XPath

Replacing HTML5-PHP

If you are currently using the HTML5-PHP library, you can likely remove it entirely.

Old way (HTML5-PHP):

$html5 = new Masterminds\HTML5();
$dom = $html5->loadHTML($html);

New way (Native):

$dom = Dom\HTMLDocument::createFromString($html);

Not only is the code cleaner, but you will also see an immediate performance improvement. Note, however, that HTML5-PHP returns the legacy DOMDocument object after parsing, while the new code above returns Dom\HTMLDocument. So you might notice some differences in the API.

PHP 8.5 and the new URI Extension

PHP 8.5 introduces an updated URL parser. When working with HTML, we often also work with URLs. If you’ve used the parse_url function in the past, I recommend switching to the new URI extension.

Conclusion

The introduction of Dom\HTMLDocument in PHP 8.4 is a major update to PHP’s HTML capabilities. It transforms PHP from a language that could do HTML parsing (with enough caveats and libraries) into a language that really excels at it.

Whether you are building a simple scraper or a complex content transformation engine, there has never been a better time to do it in PHP.

Special thanks to Niels Dossche for his incredible work on this extension and Alexander Borisov for the Lexbor project.

Better HTML Parsing in PHP: Modern Techniques and Tools

Modern tools and techniques for parsing HTML in PHP
