by floatingatoll 2533 days ago
Reducing your business costs by scraping a public access website is often considered an alternative to paying the business costs of the website operator.

Are you saving money at the expense of the site operator by scraping their site for public records, or are you saving money as well as the site operator?

If you're costing them money to reduce your own bottom line without their express written consent, that makes you "the bad guy". Offsetting costs onto an unwitting, non-consenting third party is an unethical approach to doing business.

I interpret your request as a similar problem to "help me with my homework problem". I could dig up papers and studies, but at the end of the day, you need to go do your homework. Reach out to each municipality and figure out a business arrangement with them that satisfies your needs. It's possible they do not wish you to perform this activity, in which case you will either need to violate their intent for your own profit using scraping or accede to their wishes and stop scraping their municipality. That's your homework as a for-profit business.


I don't empathize with your viewpoint because, whether it's a web scraper, or a person, the work is exactly the same. There's no additional volume, or extra steps. We just emulate a worker.

We measure the value in FTEs, and when a researcher quits, we do not replace them if the appropriate FTEs have been reached with projects.

It's a major benefit to the business not only because we don't have to pay another employee, but we can reduce training costs, and costs incurred by mistakes. We can also adjust execution of one of these agents, which normally would require rearrangement of work instructions, and retraining.

These are public records, 90% of them do not have integrations for automated systems, and those that do, we utilize. They are typically search boxes with results. We are not circumventing any type of cost that would otherwise be incurred.

We do not log any of the results, store them locally, or maintain any of the PII with each search. If a case was searched 20 minutes ago, and comes up again, we rerun the entire thing just as a human would.

Finally, to your point about 'help me with my homework', I consider posting on the HN forums homework for this type of research. There is a diverse set of talented developers on here with esoteric experience. The fact that an article related to the work I do came up on here, I thought, was an excellent opportunity to seek advice and perspective.

Don't be discouraged by the spiteful kneejerk reactions in this thread. HN is a diverse place and some commenters get triggered by an association with one of their pet peeves and launch into a rant without taking time to assess the nuance of your position. I've been the butt of this behavior a few times and it can be pretty toxic.
Sadly, you are correct to have realized that many posters on HN are so naive that they will offer you $0/hour consulting for your for-profit business. Posting on the HN forums means you "don't have to pay another employee" that's an expert in the field. I can't do much to prevent this, but I don't much respect it, either.
What you call being naïve, I'd call being a good human being. Skilled professionals willing to freely share knowledge are a great thing. BTW, it's literally the foundation of our industry and the whole point of the Open Source movement.

If it reduces market for some consultants, well, sucks to be them, they'd better find a different way of providing value. Not every value needs to be captured and priced. A world in which all value was captured and priced would really suck.

I'm glad that sites like Wikipedia, StackOverflow, and HN exist. I don't think the world is a worse place because they exist, and I respect the people who post there.

This is the same attitude that says, "why would someone just give away Open Source software when they could build a SaaS business instead?"

I don’t think Stackoverflow for “how can I avoid paying a municipality a reasonable public records fee” should exist, but I do endorse Stackoverflow in general. You’ll have to do what you will with that; generalizing my point to “all Stackoverflow” is certainly wrong, though.
>Posting on the HN forums means you "don't have to pay another employee" that's an expert in the field. I can't do much to prevent this

Sometimes the answer tells you much more about what skills you need to be hiring. Sometimes they give you a lead.

Public records are public.

The fact that some government organizations make it hard to retrieve public records is a flaw in the system. I'd be in favor of a national law requiring all public records to be published in machine-readable form.

In the meantime, it is our civic responsibility to conspire to circumvent these misbehaving public services.

If such a national law were passed with funding guaranteed for open publication of records, I would endorse your point of view.

No such funding exists, and municipalities are regularly denied tax increases by their voters for any reason — much less public records publication that would often embarrass and humiliate those same voters.

So in essence you're asking them to cut public services and staffing in order to give hundreds of dollars of IT costs a month to for-profit businesses who can't be bothered to pay some small fraction of their revenue for the costs of delivering those records.

It is our civic responsibility to republish those records for free as citizens. Doing so for profit at the expense of citizens is unethical.

If OP republishes all records received in a freely-downloadable, unrestricted form, then I would happily help them fix their scrapers. They, of course, do not.

Often what the municipalities are doing for public records is harder and more expensive than just publishing an API. So the funding excuse doesn't really pass muster with me.
Can you name a single for-profit public records scraper who republishes the parsed data scraped without charging for data access?

The public records are public. Charging for them is, by the above arguments, immoral. Therefore, not only the municipalities but also the businesses profiting from those public records owe us their scraped data, for free, without regard for profit concerns.

Not one for-profit business does so. Why is their immoral action acceptable, when the same action by a municipality is not?

There's nothing immoral about charging for content that you've aggregated. People sell dictionaries.

The problem here is that instead of building APIs (or just posting to FTP sites), governments are building offices and funding staff to answer snail mail requests. Or building sophisticated web forms and search engines.

It's obvious how we got to this point (before the internet, you obtained public records by walking into an office) but it's long past time to change. We don't need fancy web forms to search and find data; cut all that out and just provide data in machine readable form to anyone who wants it.

Someone will build a pretty commercial interface to public records data. Chances are, they can do it for less than the 8-figure sum required for UI development in the public sector. Win-win.

It is not obvious to me that reducing the cost to consult public data is necessarily a good thing. Just because this data is accessible does not mean it should always be accessible inexpensively. For example: trial records should be public, but it would probably not be nice to have your entire judicial record displayed in people's glasses.
Some "public" records are in a gray area, where whether they should be published is not black and white. Take salaries: an employer might forbid disclosing them, but anyone can request anyone's salary from the government because it's public. But if they could be downloaded from an FTP ...
> Can you name a single for-profit public records scraper who republishes the parsed data scraped without charging for data access?

Currently? Not off the top of my head. But there was one that scraped municipal records in a large midwest city and made them public for free because they were confusing to get to otherwise.

Unfortunately, the company was bought by a larger company and that portion of what they did was shut down.

Loveland (now apparently called Landgrid). https://landgrid.com/
Loveland is such a cool organization
Public records are published based on certain demand assumptions.

If the real-world demand for, say, some GIS data is hundreds of requests per day, then a crawler that comes in with hundreds of requests PER MINUTE will obviously stress the infrastructure. Adjusting infrastructure to cope is not an instant process, nor is it a sure thing to begin with given all the budgeting formalities. So your "civic duty" will ultimately result in destruction of these services, because they simply don't have the means to deal with such thoughtless activism.

You've made an unfounded assumption -- that is, that the person you're responding to is scraping irresponsibly. If they are, as they say, simply replacing human researchers with the equivalent bots, then the net load change from automation is zero, or possibly even negative.
Imagine if search engines had to "reach out to each [site owner] and figure out a business arrangement with them." The world decided that opt out via robots.txt was a better approach.

If the municipality wants to get the information out, this could be a win-win, just like search engines were. Do check for robots.txt, though!
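
For the robots.txt check suggested above, Python's standard library ships a parser. This is only a minimal sketch: the bot name `example-records-bot`, the host `records.example.gov`, and the robots.txt body are all made up for illustration, and a real scraper would fetch the site's own file instead.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content (an assumption, not taken from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


BOT = "example-records-bot"  # hypothetical user agent string

print(is_allowed(ROBOTS_TXT, BOT, "https://records.example.gov/search/case-123"))
print(is_allowed(ROBOTS_TXT, BOT, "https://records.example.gov/about"))
```

With this sample file the first URL should be refused and the second permitted, since only paths under /search are disallowed.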

We found at one job that approximately one quarter of well-known search engines blatantly use robots.txt noindex declarations as a list of URLs to index, and one openly mocked us for asking them to stop.

Voluntary honor systems don’t work, because there’s no way to compel non-compliers to stop other than standard “anti-attacker arms race” approaches, such as the obstacle described at the head of this thread.

It sounds like scraping is a big problem for you guys. What kind of outfit is it, if you don't mind me asking?
Drop me an email and I’m happy to describe further.
Well, the search engines decided that robots.txt was the better approach for them. Which makes sense, since they want control over as much data as possible, that's their profit motive. The jury is still out on whether that's a long-term win-win social contract between search engine companies and the world.
> Well, the search engines decided that robots.txt was the better approach for them. Which makes sense, since they want control over as much data as possible, that's their profit motive. The jury is still out on whether that's a long-term win-win social contract between search engine companies and the world.

Are you really arguing that the internet would be _more_ accessible if search engines had to reach out to every site they wanted to crawl?

How many companies out there complain about being scraped by Google? How many companies benefit from search-driven traffic?

The alternative would have been opt-in instead of opt-out. Everything excluded by default, except what robots.txt allows you to index.

Naturally, Google didn't want that.
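
A site owner can approximate the opt-in behaviour described above inside today's opt-out protocol by disallowing everything and then allowing specific paths. A rough sketch, assuming a hypothetical site `example.org`, a hypothetical bot name `example-bot`, and an invented `/public/` path; the Allow line comes first because Python's standard-library parser applies the first matching rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical default-deny robots.txt: only /public/ is opened to crawlers.
# Allow is listed before Disallow because this parser uses first-match order.
OPT_IN_ROBOTS = """\
User-agent: *
Allow: /public/
Disallow: /
"""

parser = RobotFileParser()
parser.parse(OPT_IN_ROBOTS.splitlines())


def may_crawl(path: str) -> bool:
    """Return True if the hypothetical bot may fetch the given path."""
    return parser.can_fetch("example-bot", "https://example.org" + path)


print(may_crawl("/public/index.html"))
print(may_crawl("/private/data"))
```

This only changes the default for one site; it does nothing about crawlers that ignore robots.txt altogether.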

I would assume that any site that was implementing JS-level blocks also has the appropriate robots.txt file in place.
That's not true in the actual web, however.

The best example is a large number of unimportant sites that send 429 errors for /robots.txt if they think it's a scraper. A 4xx result for robots.txt is considered to mean no robots.txt for most crawlers. So the website is getting the reverse of what it thought it was getting.
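
To make the point about status codes concrete, here is a sketch of how a crawler might map the HTTP status of a /robots.txt fetch to a crawl policy. It assumes the convention described in RFC 9309 (4xx means the file is unavailable, 5xx means the server is unreachable); the function name and the policy labels are invented for illustration, and individual crawlers may special-case codes such as 429.

```python
def robots_policy(status: int) -> str:
    """Map the HTTP status of a /robots.txt fetch to a crawl policy.

    Assumption: follows the RFC 9309 convention; real crawlers differ.
    """
    if 200 <= status < 300:
        return "parse-rules"   # use the rules found in the file
    if 400 <= status < 500:
        return "allow-all"     # file treated as unavailable: no restrictions apply
    if 500 <= status < 600:
        return "disallow-all"  # server unreachable: assume nothing may be crawled
    return "undefined"         # redirects and other codes are outside this sketch


for code in (200, 404, 429, 503):
    print(code, robots_policy(code))
```

Under this convention a 429 lands in the same bucket as a 404, which is the reverse of what a site sending it to suspected scrapers presumably intends.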

Why privilege traffic based on its source (whether it's from a human or Selenium)? If some resources are expensive to serve, you can rate limit them.
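
One common way to rate limit an expensive resource without caring who the client is would be a token bucket. The sketch below is an illustration under assumed parameters, not anything a site in this thread is known to run; the class name, `rate`, `capacity`, and the injectable `clock` are all choices made for the example.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        """Return True and spend a token if a request may proceed now."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=1.0, capacity=5.0)
print([bucket.allow() for _ in range(7)])
```

A server would typically keep one bucket per client key (IP address, API key, or session) and answer with 429 when `allow()` returns False, so a human and a Selenium script at the same pace get the same treatment.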
Because some information is more valuable than the sum of its parts.