by kh_hk 4463 days ago
I like that there are people working to make scraping easier and friendlier for everyone. Sadly (IMHO), the cases where these tools will probably fail are the same sites that are not really open to providing the data directly. Most scraper-unfriendly sites make you request another page first to capture a cookie, set cookies or a referer entry in the request headers, or resort to regex magic to extract information from JavaScript code in the HTML. I guess it's just a matter of time until some tool provides such methods, though.
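
To make the "regex magic" case concrete, here is a minimal sketch using only the Python standard library. The page content, the variable name `stations` and the helper `extract_js_array` are all made up for illustration; it also assumes the JavaScript literal happens to be valid JSON, which is often but not always the case.

```python
import json
import re

# Hypothetical page: the data only exists as a JavaScript assignment.
HTML = """
<html><body>
<script>
  var stations = [{"name": "Foo", "bikes": 3}, {"name": "Bar", "bikes": 0}];
  initMap(stations);
</script>
</body></html>
"""


def extract_js_array(html, variable):
    """Return the array assigned to `variable` in an inline script.

    Assumes the assignment looks like `var NAME = [...];` and that the
    literal is valid JSON. The non-greedy match stops at the first `]`
    followed by `;`, so nested arrays or a "];" inside a string would
    need a real JavaScript parser instead.
    """
    pattern = r"var\s+" + re.escape(variable) + r"\s*=\s*(\[.*?\])\s*;"
    match = re.search(pattern, html, re.DOTALL)
    if match is None:
        raise ValueError(f"no array assignment found for {variable!r}")
    return json.loads(match.group(1))


stations = extract_js_array(HTML, "stations")
for station in stations:
    print(station["name"], station["bikes"])
```

It works until the site changes its markup, which is exactly why these sources end up needing hand-written scrapers.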

For my project I write all the scrapers manually (that is, in Python, using requests and the amazing lxml) because there's always one source that makes you build the whole architecture around it. Something that I think is needed for public APIs is a domain-specific language that avoids the need for intermediate servers by telling the engine how to understand a data source:

An API producer wants to keep serving the data themselves (traffic, context and statistics), but someone wants a standard way of accessing more than one source (let's say, 140 different sources). If only, instead of building an intermediate service that provides this standardized version, one could provide templates that a client module would use to understand the data under the same abstraction.

The data consumer would be accessing the source server directly, and the producer would not need to ban over 9000 different scrapers. Of course this would only make sense for public APIs. (Real) scraping should never be done on the client: it is slow, prone to crashes and can breach security on the device.


Surely there are difficulties in expecting data providers to produce their data in standard formats across industries and countries? I am naive as to how much and what data is available, but that seems a stretch.
If interested, take a look at my project on unifying bike sharing network data. Besides providing a public API, we are also providing a Python library that accesses and abstracts different sources under the same model [1, 2].

There are a lot of accessible sources (though not documented), but there are also clear examples of how one should never provide a service! Some examples: [3, 4].

What I was referring to, though, was a way to avoid having to build an intermediate server that scrapes services that are perfectly usable (JSON, XML) just because we (all) prefer to build clients that understand one (standard) type of feed.

Maybe it's not about designing a language, but just a new way of doing things. Let's say I provide the client with clear instructions on how to use a service: its format, and where the fields that the client understands are located (in an XPath-like syntax).

That should be enough to avoid periodically scraping well-behaved servers while still being able to build client apps without having to implement all the differences between feeds. Besides, it would avoid being banned for accessing a service too many times, and would give data providers insight into who is really using their data.

Let's say we want to unify the data in Feed A and Feed B. The model is about foos and bars:

    Feed A:
    {
      "status": "ok",
      "foobars": [
        {
          "name": "Foo",
          "bar": "Baz"
        }, ...
      ]
    }

    Feed B:
    [{"n": "foo", "info": {"b": "baz"}}, ...]

    We could provide:
    {
      "feeds": [
        {
          "name": "Feed A",
          "url": "http://feed.a",
          "format": "json",
          "fields": {
            "name": "/foobars//name",
            "bar": "/foobars//bar"
          }
        },
        {
          "name": "Feed B",
          "url": "http://feed.b",
          "format": "json",
          "fields": {
            "name": "//n",
            "bar": "//info/b"
          }
        }
      ]
    }
    Instead of providing a service ourselves that accesses Feed A and Feed B
    every minute just because we want to ease things on the client.
Not sure if that's what you asked, though.
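
To show how a client could consume such a template, here is a rough sketch in Python. The path semantics are my own assumption, since the idea above only says "XPath-like": `/key` selects a direct child, `//key` selects a match at any depth, and JSON lists are treated as transparent containers. The names `parse_path`, `select` and `extract_records` are hypothetical, and fetching each feed's `url` is left out so the example runs offline.

```python
import re


def parse_path(path):
    """Split "/a//b" into [("/", "a"), ("//", "b")]."""
    steps = re.findall(r"(//|/)([^/]+)", path)
    if not steps or "".join(axis + name for axis, name in steps) != path:
        raise ValueError(f"unsupported path: {path!r}")
    return steps


def child_values(node, name):
    """Yield values stored under `name` directly in node (lists are transparent)."""
    if isinstance(node, dict):
        if name in node:
            yield node[name]
    elif isinstance(node, list):
        for item in node:
            yield from child_values(item, name)


def descendant_values(node, name):
    """Yield values stored under `name` at any depth below node."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == name:
                yield value
            yield from descendant_values(value, name)
    elif isinstance(node, list):
        for item in node:
            yield from descendant_values(item, name)


def select(doc, path):
    """Apply each step of the path to the current set of nodes."""
    nodes = [doc]
    for axis, name in parse_path(path):
        find = descendant_values if axis == "//" else child_values
        nodes = [match for node in nodes for match in find(node, name)]
    return nodes


def extract_records(template, doc):
    """Build unified records; assumes every item carries every field."""
    columns = {field: select(doc, path)
               for field, path in template["fields"].items()}
    return [dict(zip(columns, row)) for row in zip(*columns.values())]


# Sample documents shaped like Feed A and Feed B above.
feed_a = {"status": "ok", "foobars": [{"name": "Foo", "bar": "Baz"},
                                      {"name": "Qux", "bar": "Quux"}]}
feed_b = [{"n": "foo", "info": {"b": "baz"}}]

template_a = {"fields": {"name": "/foobars//name", "bar": "/foobars//bar"}}
template_b = {"fields": {"name": "//n", "bar": "//info/b"}}

records_a = extract_records(template_a, feed_a)
records_b = extract_records(template_b, feed_b)
print(records_a)
print(records_b)
```

Zipping the columns together is the weak point: if one item is missing a field, the records shift out of alignment, so a real client would probably want a per-item "root" path in the template as well.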

[1]: http://citybik.es

[2]: http://github.com/eskerda/pybikes

[3]: https://github.com/eskerda/PyBikes/blob/experimental/pybikes...

[4]: https://github.com/eskerda/PyBikes/blob/experimental/pybikes...

OK: different feeds, same domain. Unifying the model is feasible, either as an intermediate service or as a client "template thing".

Thank you - makes sense. I was thinking of different data feeds in different domains.