by whoisjuan 2471 days ago
This has been tried many times and it never seems to gain enough traction to become a relevant concept. Off the top of my head, I remember Kimono Labs, which looked quite promising. Then it was acquired by Palantir and shut down. I have also seen many similar solutions (basically most scraping companies, like Diffbot, which also claims to use machine learning for its extraction techniques).

What's the plan here to really become differentiated? Why is now the right time for this concept and not before, when others tried it? Also, how do you plan to address the concerns of companies that don't want their data to be accessed programmatically? That seems like a big challenge to overcome in order to become commercially successful.

3 comments

Thanks for your feedback! We talked to the co-founders of Kimono Labs and their approach was a bit different. Our goal is to automate processes on the Internet and scraping is just the first step.

The timing is perfect because to do that, you need a robust headless browser and a smart way to locally identify the elements on the page if you don't want to maintain your scripts. That's why we use Puppeteer and TensorflowJS, which didn't exist 2-3 years ago.

But sure, there are website owners who don't want an API for their website. Our plan is not to fight against them but to start with owners who are already convinced that they could benefit from automating the usage of their website. The banking sector understood that, and that's why Yodlee and Plaid are so successful today.

And if you step back, there are tons of websites that don't have the resources to create an API (30% of websites have been created using WordPress) and don't know the value they could generate from it.

So yes, we'll have to overcome a lot of challenges to build this technology and make it accessible to everyone, but we are convinced that the Internet will be used more and more programmatically in the future and we are just paving the way for it ;)

With regard to your question about companies' concerns: if the data is made publicly available (i.e. the web page is not behind authentication), then why should it matter how it's accessed?

If you can access it programmatically, then you can access it at scale, which means you can quickly scrape content and replicate it somewhere else. Many businesses rely on a model where the data or information they generate is meant to be consumed by a human.

For example, Google temporarily bans your IP when you hit things like Google Play URLs multiple times in a few minutes. This is clearly an attempt to block anyone but a human from extracting information from the Play store.
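
To make the idea concrete, here is a minimal sketch of how a site could throttle per-IP access with a sliding window and a temporary ban. This is not how Google actually does it (that isn't public as far as I know); the class name, the thresholds, and the ban duration are all made up for illustration.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical per-IP limiter: too many hits in a window triggers a temporary ban."""

    def __init__(self, max_hits, window_seconds, ban_seconds):
        self.max_hits = max_hits              # assumed threshold, e.g. 3 hits
        self.window_seconds = window_seconds  # assumed window, e.g. 60 s
        self.ban_seconds = ban_seconds        # assumed ban length, e.g. 300 s
        self._hits = defaultdict(deque)       # ip -> timestamps of recent allowed hits
        self._banned_until = {}               # ip -> time at which the ban expires

    def allow(self, ip, now):
        """Return True if a request from `ip` at time `now` (seconds) is allowed."""
        # Reject immediately while a ban is still active.
        if self._banned_until.get(ip, 0.0) > now:
            return False

        hits = self._hits[ip]
        # Forget hits that have fallen out of the sliding window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()

        # Window is already full: start a temporary ban and reject this request.
        if len(hits) >= self.max_hits:
            self._banned_until[ip] = now + self.ban_seconds
            hits.clear()
            return False

        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_hits=3, window_seconds=60, ban_seconds=300)
results = [limiter.allow("203.0.113.7", t) for t in (0, 1, 2, 3, 100, 303)]
```

A human clicking around rarely trips a limit like this, while a script fetching hundreds of pages a minute does, which is the whole point of the block.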

I can imagine some companies wanting that data to be accessed in a specific delivery format (i.e. with branding experience attached).

They also might be concerned about inaccuracies, from variable pricing models for example. There are a few reasons why you may not want it accessible, which is one of the reasons why CORS is even a thing.

The API would bypass ads on the page?

I feel like this would have the same sort of friction that RSS had.

Which is to say, it could certainly still work.