
by benibela 2145 days ago

I have been using pattern matching for web scraping. I think it is more robust than XPath, or at least more reliable at detecting invalid input.

Let's look at some of the Robula test cases:

Input:

      <head></head><body><h1 class="false"></h1><h1 class="false"></h1><h1 class="true"></h1><h1 class="false"></h1></body>

Task: get the true element, <h1 class="true"></h1>

XPath:

       //*[@class='true']

Pattern matching:

       <h1 class="true">{.}</h1>

Input:

       <head></head><body><h1 class="false" title="foo"></h1><h1 class="false" title="bar"></h1><h1 class="true" title="foo"></h1><h1 class="true" title="bar"></h1></body>

Get <h1 class="true" title="foo"></h1>

XPath:

       //*[@class='true' and @title='foo']

Pattern matching:

       <h1 class="true" title="foo">{.}</h1>

As you see, you do not need a new syntax for attributes. Input and pattern are the same!
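To make the "input and pattern are the same" idea concrete, here is a rough Python sketch using only the standard library. It is not the actual pattern-matching tool: `attrs_match` is a made-up helper, the `{.}` capture marker is left out, and the input is wrapped in a single `<body>` root so it parses as XML. The pattern is parsed exactly like the input, and an element matches when it has the same tag and every attribute the pattern lists.

```python
import xml.etree.ElementTree as ET


def attrs_match(pattern: ET.Element, candidate: ET.Element) -> bool:
    """True if candidate has the pattern's tag and every attribute the pattern lists."""
    return pattern.tag == candidate.tag and all(
        candidate.get(name) == value for name, value in pattern.attrib.items()
    )


doc = ET.fromstring(
    '<body><h1 class="false" title="foo"></h1><h1 class="false" title="bar"></h1>'
    '<h1 class="true" title="foo"></h1><h1 class="true" title="bar"></h1></body>'
)

# The pattern is written in the same syntax as the input and parsed the same way.
pattern = ET.fromstring('<h1 class="true" title="foo"></h1>')
hits = [el for el in doc.iter() if attrs_match(pattern, el)]

# ElementTree's XPath subset has no "and", so the two predicates are chained instead.
xpath_hits = doc.findall(".//*[@class='true'][@title='foo']")

print(len(hits), hits == xpath_hits)
```

Both approaches select the same single element here. The difference is only that the pattern needed no separate predicate syntax.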

Input:

       <h1></h1><h1></h1><h1></h1><h1></h1>

Get the third element.

XPath:

       //*[3]

Pattern matching:

       <h1></h1><h1></h1><h1>{.}</h1>

Input:

       <head></head><body><h1></h1><h1></h1><div><h1></h1></div><h1></h1></body>

Get the h1 in the div.

XPath:

       //div/*

Pattern matching:

      <div><h1>{.}</h1></div>

This last example gets to the real point of pattern matching: every part of the pattern must match. If the div is missing, it will report "div not found". If the h1 is missing in the div, it will report "h1 not found". The XPath, by contrast, can only report "found these elements" or "found nothing".
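A minimal Python sketch of that error-reporting behaviour, standard library only. This is a simplification and not how the real tool works: `match`, `report` and `PatternError` are hypothetical names, sibling order and the `{.}` capture are ignored, and the inputs are wrapped in an `<html>` root so they parse as XML. The matcher looks for each pattern element below the current scope and raises an error naming the first element it cannot find.

```python
import xml.etree.ElementTree as ET


class PatternError(Exception):
    """Carries the name of the pattern element that could not be matched."""


def match(pattern: ET.Element, scope: ET.Element) -> ET.Element:
    """Match pattern somewhere below scope; return the innermost matched element."""
    candidates = [
        el for el in scope.iter(pattern.tag)
        if el is not scope
        and all(el.get(k) == v for k, v in pattern.attrib.items())
    ]
    if not candidates:
        raise PatternError(f"{pattern.tag} not found")
    last_error = None
    for candidate in candidates:
        try:
            result = candidate
            for child in pattern:  # every child of the pattern must match too
                result = match(child, candidate)
            return result
        except PatternError as err:
            last_error = err  # try the next candidate, remember why this one failed
    raise last_error


def report(doc: ET.Element, pattern: ET.Element) -> str:
    """Return the matched tag name, or the error message if the pattern fails."""
    try:
        return match(pattern, doc).tag
    except PatternError as err:
        return str(err)


pattern = ET.fromstring("<div><h1></h1></div>")
good = ET.fromstring(
    "<html><head></head><body><h1></h1><h1></h1>"
    "<div><h1></h1></div><h1></h1></body></html>"
)
no_div = ET.fromstring("<html><body><h1></h1></body></html>")
no_h1 = ET.fromstring("<html><body><div></div></body></html>")

for doc in (good, no_div, no_h1):
    # The XPath side only ever yields a (possibly empty) list.
    print(report(doc, pattern), "| XPath found", len(doc.findall(".//div/*")))
```

For the two broken inputs the XPath result is the same empty list, while the matcher distinguishes "div not found" from "h1 not found".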