How to use Symfony DomCrawler with examples

The Symfony DomCrawler component is a standalone Symfony package for parsing HTML and XML documents, which makes it a popular choice for web scraping. It provides a convenient API for traversing a document and extracting specific elements, attributes, and text from web pages.

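To use Symfony DomCrawler, install it with Composer. The filter() method used throughout this article relies on CSS selectors, which also require the symfony/css-selector package:

$ composer require symfony/dom-crawler symfony/css-selector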

Then, you need to construct the Crawler object by feeding it with HTML code.
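
As a minimal sketch of that step, the snippet below builds a Crawler from an HTML string. The markup is an assumption chosen to match the div query used in this section; any HTML string can be passed in the same way.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Sample markup (assumed for illustration)
$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
        <div>Hello world!</div>
    </body>
</html>
HTML;

// The constructor parses the string into a DOM tree that can be queried
$crawler = new Crawler($html);
```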

Now you can use the crawler object to query the HTML. Let's extract the inner text of the div:

$divText = $crawler->filter('body > div')->innerText();
var_dump($divText);

This will output:

string(12) "Hello world!"

DomCrawler examples

Get attribute

You can use ->attr() to get the value of an attribute from the first matched element.

$crawler = new Crawler('<div data-test="test-value"></div>');
echo $crawler->filter('div')->attr('data-test');
// test-value
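
Two edge cases are worth knowing when reading attributes: the attribute may be absent, or the selector may match nothing at all. In current versions of the component, attr() returns null in the first case and throws an InvalidArgumentException in the second, so a count() check can help. A small sketch, where the data-other attribute and the span selector are made up for illustration:

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler('<div data-test="test-value"></div>');

// The element exists but has no such attribute: attr() returns null
$missing = $crawler->filter('div')->attr('data-other');
var_dump($missing); // NULL

// The selector matches nothing: check count() before calling attr()
$spans = $crawler->filter('span');
$value = $spans->count() > 0 ? $spans->attr('data-test') : null;
var_dump($value); // NULL
```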

Using filter

The filter() method in Symfony DomCrawler allows you to select elements from the parsed HTML or XML content based on CSS selectors. It returns a new Crawler instance containing the matched elements, which you can then work with further. Here's how you can use the filter() method:

$crawler = new Crawler("<div><a class='link' href='https://google.com'>anchor</a></div>");

// Find elements with a specific CSS class
echo $crawler->filter('a.link')->attr('href');
// https://google.com

echo $crawler->filter('a.link')->innerText();
// anchor
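
Because filter() returns another Crawler, a selector can match several elements and the result can be narrowed further. The sketch below uses a made-up list to show count(), eq(), first(), last(), and chained filter() calls:

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup for illustration
$crawler = new Crawler(
    '<ul><li class="item">One</li><li class="item">Two</li><li>Three</li></ul>'
);

$items = $crawler->filter('li.item');

echo count($items), PHP_EOL;           // 2
echo $items->first()->text(), PHP_EOL; // One
echo $items->eq(1)->text(), PHP_EOL;   // Two (eq() is zero-based)

// Chained filters search inside the previous match
echo $crawler->filter('ul')->filter('li')->last()->text(), PHP_EOL; // Three
```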

Get the URL/href of a link

$crawler = new Crawler("<div><a class='link' href='https://google.com'>anchor</a></div>");

// Find elements with a specific CSS class
echo $crawler->filter('a.link')->attr('href');
// https://google.com

// or
$link = $crawler->filter('a.link')->link();
echo $link->getUri(); // https://google.com
echo $link->getMethod(); // GET
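
The link() approach is most useful when the href is relative. If the Crawler is given the URI of the page as its second constructor argument, getUri() resolves the link against it, while attr('href') returns the raw value. A sketch, assuming a page fetched from https://example.com/blog/ (a placeholder URL):

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// The second argument is the URI the document came from (placeholder here)
$crawler = new Crawler(
    '<a class="link" href="/about">About</a>',
    'https://example.com/blog/'
);

$raw = $crawler->filter('a.link')->attr('href');
$resolved = $crawler->filter('a.link')->link()->getUri();

echo $raw, PHP_EOL;      // /about
echo $resolved, PHP_EOL; // https://example.com/about
```

Without a base URI, calling link() on an element with a relative href throws an exception, because the component has nothing to resolve it against.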

Get the HTML

$crawler = new Crawler("<div><a class='link' href='https://google.com'>anchor</a></div>");

// Find elements with a specific CSS class
echo $crawler->filter('a.link')->outerHtml();
// <a class="link" href="https://google.com">anchor</a>
echo $crawler->filter('a.link')->html();
// anchor
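
Both html() and text() throw an exception when the selector matches nothing. They accept an optional default value that is returned instead, which keeps scraping code short when an element may be missing. A sketch, where the .missing selector and the fallback strings are made up:

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler("<div><a class='link' href='https://google.com'>anchor</a></div>");

// Nothing matches '.missing', so the defaults are returned
$text = $crawler->filter('.missing')->text('no text');
$html = $crawler->filter('.missing')->html('<em>no html</em>');

echo $text, PHP_EOL; // no text
echo $html, PHP_EOL; // <em>no html</em>
```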

Get all the links

$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
        <div>
            <a href='https://google.com'>anchor1</a>
            <div>
                <a href='https://youtube.com'>anchor2</a>
            </div>
        </div>
    </body>
</html>
HTML;

$crawler = new Crawler($html);
// each() returns an array holding the callback's return value for every matched node
$links = $crawler->filter('a')->each(function (Crawler $link) {
    return ['href' => $link->attr('href'), 'anchor' => $link->text()];
});
print_r($links);

This outputs:

Array
(
    [0] => Array
        (
            [href] => https://google.com
            [anchor] => anchor1
        )

    [1] => Array
        (
            [href] => https://youtube.com
            [anchor] => anchor2
        )

)
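
When only attribute values and text are needed, extract() is a shorter alternative. The sketch below uses a trimmed-down version of the markup above. Note that the result is a list of positional arrays rather than arrays with named keys.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<div>
    <a href="https://google.com">anchor1</a>
    <div><a href="https://youtube.com">anchor2</a></div>
</div>
HTML;

$crawler = new Crawler($html);

// '_text' is a special key for the node text; other entries are attribute names
$rows = $crawler->filter('a')->extract(['href', '_text']);
// [['https://google.com', 'anchor1'], ['https://youtube.com', 'anchor2']]

// With a single entry, extract() returns a flat list
$hrefs = $crawler->filter('a')->extract(['href']);
// ['https://google.com', 'https://youtube.com']

print_r($rows);
print_r($hrefs);
```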

Remove node

$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
        <div>
            <a href='https://google.com'>anchor1</a>
            <div class='inner'>
                <a href='https://youtube.com'>anchor2</a>
            </div>
        </div>
    </body>
</html>
HTML;

$crawler = new Crawler($html);
// getNode() returns the underlying DOMElement, which exposes the native DOM API
$parentNode = $crawler->filter('body > div')->getNode(0);
$childNode = $crawler->filter('div.inner')->getNode(0);
$parentNode->removeChild($childNode);
var_dump($crawler->outerHtml());

This outputs:

string(132) "<html>
    <body>
        <div>
            <a href="https://google.com">anchor1</a>

        </div>
    </body>
</html>"
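
The same idea extends to removing every node that matches a selector. Iterating a Crawler yields plain DOM nodes, and each node can be detached through its own parentNode, so the parent does not have to be looked up separately. A sketch with made-up markup and an assumed span.ad selector:

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup for illustration
$crawler = new Crawler(
    '<div><p>Keep me</p><span class="ad">Ad 1</span><span class="ad">Ad 2</span></div>'
);

// Detach every matched node from its parent
foreach ($crawler->filter('span.ad') as $node) {
    $node->parentNode->removeChild($node);
}

$remaining = $crawler->filter('span.ad')->count();
echo $remaining, PHP_EOL;                      // 0
echo $crawler->filter('div')->html(), PHP_EOL; // <p>Keep me</p>
```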

XPath examples

XPath (XML Path Language) is a query language for selecting nodes in XML and HTML documents. The filterXPath() method lets you target elements by their position in the document tree, their attributes, and their content, including conditions that CSS selectors cannot express, such as matching on an element's text.

use Symfony\Component\DomCrawler\Crawler;

// Assuming $html contains your HTML content
$crawler = new Crawler($html);

// Using XPath expression to filter elements
$elements = $crawler->filterXPath('//div[@class="my-class"]');

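// Iterate through the matched elements.
// Each $element is a plain DOMElement here, not a Crawler instance
foreach ($elements as $element) {
    // Do something with each $element, e.g. read $element->textContent
}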
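
For a version that runs on its own, the sketch below applies the same expression to a small made-up document and adds a text-based match. The class names and contents are assumptions for illustration.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup for illustration
$html = '<div class="my-class">First</div>'
    . '<div class="other">Second</div>'
    . '<div class="my-class">Third</div>';

$crawler = new Crawler($html);

// Select by attribute value and collect the text of every match
$texts = $crawler
    ->filterXPath('//div[@class="my-class"]')
    ->each(fn (Crawler $node) => $node->text());
print_r($texts); // ['First', 'Third']

// Select by text content and read an attribute of the match
$class = $crawler->filterXPath('//div[contains(text(), "Sec")]')->attr('class');
echo $class, PHP_EOL; // other
```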

Using DomCrawler with Symfony Panther

Symfony Panther is a browser testing and web scraping library that drives a real browser and exposes it through Symfony's BrowserKit and DomCrawler APIs.

The Symfony DomCrawler component is integrated within Symfony Panther to facilitate interaction with the HTML structure of the web page. It allows you to easily locate elements, extract data, and perform actions like form submission or link clicking.

Here's a brief overview of how you can use Symfony DomCrawler in Panther:

use Symfony\Component\Panther\PantherTestCase;

class WebScrapingTest extends PantherTestCase
{
    public function testWebScraping()
    {
        $client = static::createPantherClient();

        // Navigate to a webpage
        $crawler = $client->request('GET', 'https://example.com');

        // Use DomCrawler to extract data
        $headingText = $crawler->filter('h1')->text();

        // Output the extracted data
        echo "Heading: $headingText\n";
    }
}

Click a button

use Symfony\Component\Panther\PantherTestCase;

class ButtonClickTest extends PantherTestCase
{
    public function testButtonClick()
    {
        $client = static::createPantherClient();

        // Navigate to a webpage with the button
        $crawler = $client->request('GET', 'https://example.com');

        // Find the button using DomCrawler
        $buttonCrawler = $crawler->filter('button.my-button-class');

        // Submit the form that the button belongs to
        // (form() only works when the button is inside a <form> element)
        $client->submit($buttonCrawler->form());

        // Now you can interact with the updated page after clicking the button
        // For example, you can assert things or extract data from the new page state
    }
}

Can Symfony DomCrawler be used in Laravel?

Yes, you can use the Symfony DomCrawler component in Laravel. The DomCrawler component is a standalone package provided by Symfony and can be used in any PHP application, including Laravel.

Conclusion

Symfony DomCrawler is a versatile component for parsing HTML and XML and extracting data from them, which makes it well suited to web scraping and automated testing.

With CSS selectors and XPath expressions, you can navigate complex document structures and pull out exactly the elements, attributes, and text you need.

Because it is a standalone package, it works in any PHP project, including Laravel applications.

Symfony Panther builds on the same Crawler API to drive a real browser, so the techniques shown here carry over to browser automation and end-to-end tests.