by ianbicking 2818 days ago

I think this could go further in terms of making it declarative.

A simple declarative approach could be taking this:

    LET google = DOCUMENT("https://www.google.com/", true)

and instead of thinking about it as an action (get this page), think about it as giving you an object. The result is a tuple of the URL, the time fetched, and maybe other information (like User-Agent). This helps with exploratory scraping, where you want to be able to repeat actions without always re-fetching the documents. And you'll be constructing a program, unlike a REPL where you always write the program top-to-bottom, including all your intermediate bugs.
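To make the "document as an object" idea concrete, here is a minimal Python sketch (not Ferret code). It assumes a hypothetical `fetch(url, user_agent)` callable supplied by the caller; `DocumentRef` and `DocumentStore` are illustrative names, not part of any real library. Asking for the same document twice returns the same value without re-fetching.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class DocumentRef:
    """A fetched page described as a value: what was fetched, when, and how."""
    url: str
    fetched_at: float
    user_agent: str


class DocumentStore:
    """Hands out DocumentRef values and only fetches on a cache miss."""

    def __init__(self, fetch: Callable[[str, str], str],
                 user_agent: str = "sketch/0.1"):
        # `fetch` is a hypothetical fetcher: (url, user_agent) -> page body.
        self._fetch = fetch
        self._user_agent = user_agent
        self._refs: Dict[Tuple[str, str], DocumentRef] = {}
        self._bodies: Dict[DocumentRef, str] = {}

    def document(self, url: str) -> DocumentRef:
        key = (url, self._user_agent)
        if key not in self._refs:
            # The only side effect: happens once per (url, user_agent).
            body = self._fetch(url, self._user_agent)
            ref = DocumentRef(url, time.time(), self._user_agent)
            self._refs[key] = ref
            self._bodies[ref] = body
        return self._refs[key]

    def body(self, ref: DocumentRef) -> str:
        """Look up the stored body for a previously returned reference."""
        return self._bodies[ref]
```

During exploratory scraping, re-running the same `document(...)` line then costs nothing, because the network is touched only the first time.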

Changing DOCUMENT() is easy enough. Things like CLICK() are a bit harder, though if you extend the data structures you can have a document that is the result of clicking a certain element in a certain previous document. Again, to do it the first time you have to actually DO the action, but later on perhaps not. And you'll be constructing interstitial objects that are great for debugging.
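One way to picture "a document that is the result of clicking an element in a previous document" is a small tree of values. The Python sketch below is an illustration under assumptions, not how Ferret works: `Fetched`, `Clicked`, `Session`, and `trace` are hypothetical names, and the `fetch` and `click` callables stand in for a real browser driver.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass(frozen=True)
class Fetched:
    """A document obtained by loading a URL."""
    url: str


@dataclass(frozen=True)
class Clicked:
    """A document obtained by clicking `selector` in `parent`."""
    parent: "Node"
    selector: str


Node = Union[Fetched, Clicked]


class Session:
    """Performs each action at most once and remembers the result."""

    def __init__(self, fetch: Callable[[str], str],
                 click: Callable[[str, str], str]):
        # Both callables are hypothetical stand-ins for a browser driver.
        self._fetch = fetch
        self._click = click
        self._cache: Dict[Node, str] = {}

    def resolve(self, node: Node) -> str:
        if node not in self._cache:
            if isinstance(node, Fetched):
                self._cache[node] = self._fetch(node.url)
            else:
                parent_html = self.resolve(node.parent)
                self._cache[node] = self._click(parent_html, node.selector)
        return self._cache[node]


def trace(node: Node) -> List[Node]:
    """Chain of intermediate documents from the root fetch down to `node`."""
    chain: List[Node] = []
    while isinstance(node, Clicked):
        chain.append(node)
        node = node.parent
    chain.append(node)
    return list(reversed(chain))
```

The values returned by `trace` are the intermediate objects: each one can be resolved and inspected on its own when a later step misbehaves.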

Then what could make it feel really declarative is having more than one presentation of an execution. You can package up a scraping, and then you can answer questions about WHY you ended up with certain results.

3 comments

ziflex 2818 days ago

That's what you can do right now :)

https://github.com/MontFerret/ferret/blob/master/docs/exampl...

The document returned from the DOCUMENT() function represents an open browser tab, which allows you to do all interactions with the page.


ianbicking 2818 days ago

Well, that's what I'm saying... right now, making it represent an open browser tab with a specific state and where everything DOES something isn't declarative. But it could be declarative if you changed how those commands are implemented.

Or, to phrase it another way: if the program represents a PLAN then it's declarative. If it represents a series of things to DO then it's imperative. It seems like it's doing things, but it could plan things with the same syntax.
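The PLAN versus DO distinction can be sketched in Python as two phases: calls that only record steps, and a separate runner that performs them. This is a toy illustration with hypothetical names (`Planner`, `run`), and the `fetch` and `click` callables are assumed to be supplied by the caller.

```python
from typing import Callable, List, Optional, Tuple

# (operation, argument, index of the parent step or None)
Step = Tuple[str, str, Optional[int]]


class Planner:
    """Looks like an API that does things, but only records a plan."""

    def __init__(self) -> None:
        self.steps: List[Step] = []

    def document(self, url: str) -> int:
        self.steps.append(("document", url, None))
        return len(self.steps) - 1  # handle = index of the recorded step

    def click(self, doc: int, selector: str) -> int:
        self.steps.append(("click", selector, doc))
        return len(self.steps) - 1


def run(steps: List[Step],
        fetch: Callable[[str], str],
        click: Callable[[str, str], str]) -> List[str]:
    """Execute a recorded plan and return one result per step."""
    results: List[str] = []
    for op, arg, parent in steps:
        if op == "document":
            results.append(fetch(arg))
        elif op == "click":
            results.append(click(results[parent], arg))
        else:
            raise ValueError("unknown step: " + op)
    return results
```

Because the plan is plain data, it can be inspected, stored, or replayed before anything touches the network.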


ziflex 2817 days ago

Oh yes. The reason for this is that, for now, the language itself is DOM agnostic; it's just a port of an existing one (https://docs.arangodb.com/3.4/AQL/). So the entire DOM thing is implemented by the standard library, which is pluggable. In the future, I might extend the language to make it less DOM agnostic by introducing new keywords for dealing with that. But for now you have to move the document object around, which is not that bad, because you may open as many pages as you want in a single query.


nerdponx 2818 days ago

I don't see why web scraping should be declarative at all. XPath is declarative, and hard to use. When a human browses a website, they do one thing at a time. That is inherently imperative. A DSL for highly-imperative "human-style" web scraping is a nice idea in my opinion. That's exactly what Ferret appears to be.


ziflex 2818 days ago

And you can open as many pages as you want in a single query (or as your memory allows you :) )
