| HN Mirror

by anonytrary 2818 days ago

I was expecting an ML-driven framework where you write the HTML you want to scrape, and the framework diffs the trees and attempts to extract the information from the target tree as best it can to match your input tree. That's what pops into mind when I think of "declarative" scraping.

  LET google = DOCUMENT("https://www.google.com/", true)

  INPUT(google, 'input[name="q"]', "ferret")
  CLICK(google, 'input[name="btnK"]')
  WAIT_NAVIGATION(google)
  LET result = (
    FOR result IN ELEMENTS(google, '.g')
      RETURN {
        title: ELEMENT(result, 'h3 > a'),
        description: ELEMENT(result, '.st'),
        url: ELEMENT(result, 'cite')
      }
  )
  RETURN (
    FOR page IN result
    FILTER page.title != NONE
    RETURN page
  )

Looks an awful lot like:

  const { document, input, click, elements, waitNavigation } = require("your-library")
  const scrape = () => {
    let google = document("...", true)
    input(google, "...", "...")
    click(google, "...")
    waitNavigation(google)
    return elements(google, ".g")
      .map(r => {...})
      .filter(p => {..})
  }
  scrape();

Am I missing something here? I don't see anything declarative about the first one over the second; both of these look nearly identical and rather imperative to me. Is "declarative" becoming a buzzword (thanks to React, maybe?), or am I missing something?
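For contrast, here is a rough sketch of the "write the HTML you want" idea from the top of this comment, with plain structural matching standing in for the ML part. Everything in it is made up for illustration: the `{ tag, text, children }` tree shape, the `$name` placeholder convention, and the `match` / `extractAll` helpers are assumptions, not part of Ferret or any real library.

```javascript
// Hypothetical sketch: trees are plain objects { tag, text?, children? }.
// A template node whose text starts with "$" captures the target node's text.
function match(template, target, out = {}) {
  if (!target || template.tag !== target.tag) return null;
  if (typeof template.text === "string" && template.text.startsWith("$")) {
    out[template.text.slice(1)] = target.text ?? null;
  }
  const kids = target.children ?? [];
  let from = 0;
  for (const child of template.children ?? []) {
    // Find the next target child, in order, that matches this template child.
    // Extra target children in between are simply skipped.
    let found = false;
    for (let i = from; i < kids.length; i++) {
      const attempt = match(child, kids[i], { ...out });
      if (attempt) {
        Object.assign(out, attempt);
        from = i + 1;
        found = true;
        break;
      }
    }
    if (!found) return null;
  }
  return out;
}

// Try the template against every node in the target tree.
function extractAll(template, root) {
  const results = [];
  const visit = (node) => {
    const hit = match(template, node);
    if (hit) results.push(hit);
    (node.children ?? []).forEach(visit);
  };
  visit(root);
  return results;
}

// The "input tree": the shape of the data you want, nothing about how to get it.
const template = {
  tag: "div",
  children: [
    { tag: "h3", children: [{ tag: "a", text: "$title" }] },
    { tag: "cite", text: "$url" },
  ],
};

// A stand-in for an already-parsed page.
const page = {
  tag: "body",
  children: [
    {
      tag: "div",
      children: [
        { tag: "h3", children: [{ tag: "a", text: "Ferret" }] },
        { tag: "span", text: "ad" },
        { tag: "cite", text: "https://example.com/ferret" },
      ],
    },
    { tag: "div", children: [{ tag: "p", text: "no result here" }] },
  ],
};

const results = extractAll(template, page);
console.log(results);
```

The point of the sketch is that the template only states what the output should look like; there is no click, wait, loop, or filter step written by the user, which is closer to what I would call declarative.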