

by echelon 2440 days ago

This is fantastic. I would like to see wider legislation that makes scraping of IMDB, Genius, Reddit, Facebook, and Google legal. These services receive free input from users. The data should remain free.

Edit (sort of off topic): There's still value in building and providing services at scale, but this lowers the barrier to crossing the moat for small players. The first step is data liberation. Then we can work to bring down the other cost barriers. It's a lot easier to build services that scale in 2019 than it was in 2005.

The semantic web was misguided in 200X, but we might want to take another swing at it in the future.

7 comments

polygot 2440 days ago

If you add .json to the end of a Reddit URL, it will return JSON data. For example: https://www.reddit.com/r/ubuntu.json . It also works with comment threads and posts.
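As a minimal sketch of the trick described above, the snippet below appends `.json` to a Reddit page URL and fetches the result with the Python standard library. The helper names `json_url` and `fetch_listing` and the `example-script/0.1` User-Agent string are made up for illustration, and whether Reddit serves an unauthenticated request like this today is an assumption, not something confirmed in this thread.

```python
import json
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def json_url(page_url: str) -> str:
    """Return the page URL with '.json' appended to its path (hypothetical helper)."""
    parts = urlsplit(page_url)
    path = parts.path.rstrip("/")  # drop a trailing slash before adding the suffix
    if not path.endswith(".json"):
        path += ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def fetch_listing(page_url: str, user_agent: str = "example-script/0.1"):
    """Fetch and decode the JSON version of a page.

    Assumption: the server accepts an unauthenticated GET with a custom
    User-Agent header; it may rate-limit or reject such requests.
    """
    request = urllib.request.Request(
        json_url(page_url), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    print(json_url("https://www.reddit.com/r/ubuntu"))
```

Calling `fetch_listing("https://www.reddit.com/r/ubuntu")` would perform the actual network request; the URL rewriting works without network access.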

avip 2439 days ago

Wonderful feature also used by Trello https://trello.com/b/rq2mYJNn/public-trello-boards.json

ludamad 2440 days ago

Now that is an ergonomic API.

polygot 2439 days ago

Also, it outputs XML and RSS too: https://www.reddit.com/r/ubuntu.rss and https://www.reddit.com/r/ubuntu.xml
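A rough sketch of reading post titles from such a feed, using only the standard library. I am assuming the feed is either RSS 2.0 (`channel/item/title`) or Atom (`entry/title`), so the hypothetical helper `feed_titles` checks both layouts; the exact format Reddit returns is not stated here.

```python
import xml.etree.ElementTree as ET

# XML namespace used by Atom feeds.
ATOM = "{http://www.w3.org/2005/Atom}"


def feed_titles(xml_text: str) -> list[str]:
    """Return entry titles from an RSS 2.0 or Atom document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # RSS 2.0 layout: <rss><channel><item><title>
    rss_titles = root.findall("channel/item/title")
    # Atom layout: <feed><entry><title>, all in the Atom namespace
    atom_titles = root.findall(f"{ATOM}entry/{ATOM}title")
    return [node.text or "" for node in rss_titles + atom_titles]


if __name__ == "__main__":
    sample = (
        '<feed xmlns="http://www.w3.org/2005/Atom">'
        "<entry><title>Example post</title></entry>"
        "</feed>"
    )
    print(feed_titles(sample))
```

The feed-level title is skipped on purpose; only per-entry titles are collected.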

ahbyb 2439 days ago

The .xml and .rss URLs seem to return exactly the same output.

sneak 2439 days ago

I have adopted this in other projects and added the functionality there as well; it is a brilliant idea.

diminoten 2439 days ago

Yeah no need to scrape Reddit, their content is accessible via their API.

Raidion 2440 days ago

PRAW is also a great Python Reddit "scraper" that lets you pull data via their API very easily.
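For illustration, a small sketch of what pulling data with PRAW can look like. It assumes `praw` is installed and that you have registered API credentials with Reddit; `hot_titles` and `format_submission` are hypothetical helper names, and the credential values are placeholders you would supply yourself.

```python
def format_submission(submission) -> str:
    """Format any object with .score and .title as one line (hypothetical helper)."""
    return f"{submission.score:>5} {submission.title}"


def hot_titles(subreddit_name: str, limit: int = 5, **credentials) -> list[str]:
    """List formatted hot posts of a subreddit via PRAW.

    Assumption: `credentials` holds client_id, client_secret and user_agent
    for an app registered with Reddit.
    """
    import praw  # third-party package; imported lazily so the rest runs without it

    reddit = praw.Reddit(**credentials)
    return [
        format_submission(submission)
        for submission in reddit.subreddit(subreddit_name).hot(limit=limit)
    ]
```

A call would look like `hot_titles("ubuntu", client_id="...", client_secret="...", user_agent="example-script/0.1")`, with real credentials in place of the placeholders.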

psv1 2440 days ago

Another side of this is that the entity doing the scraping is more often than not another company. That means that if your proposal is implemented, a user can voluntarily give their personal data to Google/Reddit/Facebook, etc., but that company then has to make the user's personal data available to another company.

ptero 2440 days ago

It's not quite like that. The first company cannot prevent scraping by individuals or another company of information that it already shows to everyone. Which, to me, is a good thing. My 2c.

gkoberger 2440 days ago

Eh. I want my picture and name uploaded to LinkedIn, since it's a professional network and people use it to find me for good reasons. It may seem dumb, but not having a LinkedIn profile with a good picture can genuinely hurt your career.

I do NOT want my picture run through facial recognition software, or my name/email sold to marketers who will add it to a drip campaign.

Nextgrid 2439 days ago

Then don't make the data public. You can't have your cake and eat it too. Scraping is irrelevant here: a human can just as well take your picture from your LinkedIn page and include it in their face-recognition DB.

Dayshine 2439 days ago

No they can't, not legally.

wpietri 2439 days ago

How's that? Obviously they can't keep the photo. But I don't see what would stop them from "viewing" the publicly available photo and saving markers that let them recognize the face again. After all, that's what any person does when they look at a photo.

tekknik 2439 days ago

The huge difference is that having a human do it at scale is cost-prohibitive.

TeMPOraL 2439 days ago

PII and biometrics are special. So if I upload my photo to LinkedIn, I want it to be available when viewing my LinkedIn profile, but I expect that any other entity that scrapes it off LinkedIn can't process it without my explicit consent (thanks to GDPR). Similarly with other data that's about me, a person.

But all other data, I'd argue, should be fair play. If an e-commerce site publishes a list of products and prices, I believe it's desirable for other parties to be able to scrape it and process it, e.g. for offering a price comparison service.

rch 2440 days ago

Exactly. I don't like or "enjoy" LinkedIn but I do find it useful professionally.

Now it sounds like this ruling implies that by creating a profile on one platform, I have to accept that every company that comes along can include me in their corpus.

Maybe I should be able to set a pass-through GDPR flag on my profile such that third parties (subject to that regulation) will have to exclude me from their datasets.

dx034 2439 days ago

So Google would need to allow scraping of search results? That would be a huge change, they currently prevent that pretty aggressively.

TeMPOraL 2439 days ago

This is a problem because you're talking about personal data, not because of scraping. Personal / personally identifiable data is special and special protections apply to it. But regular data would fare just fine under GP's proposal.

bobthepanda 2440 days ago

It only applies to data displayed publicly, though. If Facebook and such started requiring logins to see personal data, would that be such a bad thing?

jjeaff 2440 days ago

I'm not certain, but it kind of sounds like even things behind a login are still scrapable. Assuming the general public can get a login easily anyway. Basically, just requiring an account is not enough to forbid scraping.

tekknik 2439 days ago

When you talk about data here, you are talking about people. This HiQ software is actually a bit scary. What if it gives a false signal that ends in an employee's termination? Data on people should not be freely attainable; the person should give explicit access. If I don't want HiQ processing my information (I don't), then they shouldn't be able to. Especially now, with some employers requiring a LinkedIn profile.

sireat 2440 days ago

Reddit has a decent API.

The golden rule is to use the API before you start raw scraping.

OrgNet 2440 days ago

IMDb has a lot of data that is easily accessible, though I'm not sure what is missing: https://datasets.imdbws.com/...

tdhoot 2440 days ago

Only for personal and non-commercial use, which is probably not what startups need.

OrgNet 2440 days ago

Process it on your personal computer and use the output in your startup.

dwaltrip 2439 days ago

That intermediate step doesn't get around the license.

OrgNet 2438 days ago

it depends

JustSomeNobody 2440 days ago

Why would they have to make it available to startups in an easily accessible manner?

nitrogen 2440 days ago

IMDB was originally crowdsourced, wasn't it?

JustSomeNobody 2439 days ago

So, have the data available. But I see no reason why they should have to go out of their way to change their site to make it easy for someone to get it.

bewuethr 2440 days ago

It started as lists and shell scripts to query them in the newsgroup rec.arts.movies.

jakeogh 2440 days ago

It's already legal. Adding law adds restrictions.

pharrington 2440 days ago

You're not wrong. In the general case though, adding law can mandate that a certain already occurring activity must be done.

lonelappde 2439 days ago

What gives you a rightful claim to information that I gave to someone else, if neither I nor they consent?

austhrow743 2439 days ago

You did consent. This is about information you gave to LinkedIn and told them to give to the general public.