Hacker News
by echelon 2440 days ago
This is fantastic. I would like to see wider legislation making scraping of IMDB, Genius, Reddit, Facebook, and Google legal. These services receive free input from users. The data should remain free.

Edit (sort of off topic): There's still value in building and providing services at scale, but this lowers the barrier to cross the moat for small players. The first step is data liberation. Then we can work to bring down the other cost barriers. It's a lot easier to build services that scale in 2019 than it was in 2005.

The semantic web was misguided in 200X, but we might want to take another swing at it in the future.

7 comments

If you add .json to the end of a Reddit URL, it will return JSON data. For example: https://www.reddit.com/r/ubuntu.json . It also works with comment threads and posts.
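A minimal sketch of the .json trick described above, using only the Python standard library. The helper names `to_json_url` and `fetch_listing` and the User-Agent string are made up for illustration, and sending a descriptive User-Agent is an assumption about what Reddit expects from scripts; the shape of the returned JSON is not covered here.

```python
import json
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def to_json_url(url):
    """Append .json to the path of a URL, keeping any query string intact."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def fetch_listing(url, user_agent="example-script/0.1"):
    """Fetch the JSON form of a page and return the parsed data.

    This makes a real network request; the User-Agent value is a placeholder.
    """
    request = urllib.request.Request(
        to_json_url(url), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only builds the URL; call fetch_listing(...) to actually download it.
    print(to_json_url("https://www.reddit.com/r/ubuntu/"))
```

The same helper should work for a comment thread or post URL, since the suffix goes on the path and any query string is left in place.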
Wonderful feature also used by Trello https://trello.com/b/rq2mYJNn/public-trello-boards.json
Now that is an ergonomic API.
The .xml and .rss suffixes seem to return the exact same output.
I have adopted this in other projects and added the functionality there as well; it is a brilliant idea.
Yeah, no need to scrape Reddit; their content is accessible via their API.
PRAW is also a great Python Reddit "scraper" that lets you pull data via their API very easily.
Another side of this is that the entity doing the scraping is more often than not another company. Which means that if your proposal is implemented, a user can voluntarily give their personal data to Google/Reddit/Facebook etc but that company then has to make the user's personal data available to another company.
It's not quite like that. The first company cannot prevent scraping by individuals or another company of information that it already shows to everyone. Which, to me, is a good thing. My 2c.
Eh. I want my picture and name uploaded to LinkedIn, since it's a professional network and people use it to find me for good reasons. It may seem dumb, but not having a LinkedIn with a good picture can genuinely hurt your career.

I do NOT want my picture run through facial recognition software, or my name/email sold to marketers who will add it to a drip campaign.

Then don't make the data public. You can't have your cake and eat it too. Scraping is irrelevant here: a human can just as well take your picture from your LinkedIn page and include it in their face-recognition DB.
No they can't, not legally.
How's that? Obviously they can't keep the photo. But I don't see what would stop them from "viewing" the publicly available photo and saving markers that let them recognize the face again. After all, that's what any person does when they look at a photo.
The huge difference is that having a human do it at scale is cost-prohibitive.
PII and biometrics are special. So if I upload my photo to LinkedIn, I want it to be available when viewing my LinkedIn profile, but I expect that any other entity that scrapes it off LinkedIn can't process it without my explicit consent (thanks to GDPR). Similarly with other data that's about me, a person.

But all other data, I'd argue, should be fair play. If an e-commerce site publishes a list of products and prices, I believe it's desirable for other parties to be able to scrape it and process it, e.g. for offering a price comparison service.

Exactly. I don't like or "enjoy" LinkedIn but I do find it useful professionally.

Now it sounds like this ruling implies that by creating a profile on one platform, I have to accept that every company that comes along can include me in their corpus.

Maybe I should be able to set a pass-through GDPR flag on my profile such that third parties (subject to that regulation) will have to exclude me from their datasets.

So Google would need to allow scraping of search results? That would be a huge change, they currently prevent that pretty aggressively.
This is a problem because you're talking about personal data, not because of scraping. Personal / personally identifiable data is special and special protections apply to it. But regular data would fare just fine under GP's proposal.
It only applies to data displayed publicly, though. If Facebook and such started requiring logins to see personal data would that be such a bad thing?
I'm not certain, but it kind of sounds like even things behind a login are still scrapable, assuming the general public can get a login easily anyway. Basically, just requiring an account is not enough to forbid scraping.
When you talk about data here, these are people. This HiQ software is actually a bit scary. What if it gives a false signal which ends in an employee’s termination? Data on people should not be freely attainable; the person should give explicit access. If I don’t want HiQ processing my information (I don’t) then they shouldn’t be able to. Especially now with some employers requiring a LinkedIn profile.
Reddit has a decent API

The golden rule is to use the API before you start raw scraping.

For IMDb, they have a lot of data that is easily accessible, though I'm not sure what is missing: https://datasets.imdbws.com/...
Only for personal and non-commercial use, which is probably not what startups need.
process it on your personal computer and use the output in your startup
That intermediate step doesn't get around the license.
it depends
Why would they have to make it available to startups in an easily accessible manner?
IMDB was originally crowdsourced, wasn't it?
So, have the data available. But I see no reason why they should have to go out of their way to change their site to make it easy for someone to get it.
It started as lists and shell scripts to query them in the newsgroup rec.arts.movies.
It's already legal. Adding law adds restrictions.
You're not wrong. In the general case though, adding law can mandate that a certain already occurring activity must be done.
What gives you a rightful claim to information that I gave to someone else, if neither I nor they consent?
You did consent. This is about information you gave to LinkedIn and told them to give to the general public.