by tsergiu 2424 days ago
I'm the founder of parsehub.

We are doing well and are independently owned.

I think there are 3 things that contribute to this:

1. It is very easy to make a prototype that looks "magical" but very hard to build something that works in real applications. Browsers tolerate an enormous number of quirks, and each site you encounter will use a different set of them. Sites also tend to be unreliable, so whatever you build has to be very resistant to errors.

2. There is a technological wall that every company in this space reaches where it is not yet possible to mass-specialize for different websites. So even if you're able to build a tool that works very well on any individual website, the technology is not there yet to generalize the instructions across websites in the same category. So if a customer wants to scrape 1000 websites, they still have to build custom instructions for each website (5-10x reduction in labor vs scripting), when what they really want, and what is economically viable for them, is to build a single set of instructions that will work for all similar websites (10000x reduction in labor vs scripting). This is something that we're working on for the next version of parsehub, but it is still a couple of years away from launch.

3. Many of the YC startups you hear about have raised funding from investors and have short term pressures to exit.

The combination of the three makes it very tempting to give up and sell.
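
To make the "very resistant to errors" part of point 1 concrete, here is a minimal sketch of retrying a flaky fetch with exponential backoff. It is only an illustration under assumptions, not how parsehub works: `fetch_with_retries`, its parameters, and the idea that the caller passes in its own `fetch` function are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or attempts run out.

    The wait doubles after each failure: base_delay, 2*base_delay, ...
    `sleep` is injectable so the backoff can be tested without waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            # Assumption: every error is treated as transient. A real
            # scraper would likely retry only timeouts, 5xx responses, etc.
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Retrying only covers sites that are temporarily unreliable. It does nothing for the browser-quirk problem, which is the harder half of point 1.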

2 comments

#2 is what would transform this from a nice niche tool to something very valuable. In the ecommerce space, tracking competitor pricing is a great example of this type of thing. I can also see use cases for recipes, finance, healthcare, you name it. Those b2b use cases are worth real money.

Just curious, in your experimentation, have you found it necessary to train a new model for each "category"? Or have you found a way to generalize it?

Training a new model for each category is already possible today, but doesn't achieve the goal (mass-specialization).

The problem is that when you pre-train a model, you can only solve for the lowest common denominator of what every customer might want.

In ecommerce, for example, you might pre-train to get price, product name, reviews, and a few other things that are general to all ecommerce. But you won't pre-train it to get the mAh rating of batteries, because that's not common to the vast majority of customers (even within ecommerce). It turns out that most customers need at least a few of these long-tail properties that are different than what almost every other customer wants, even if most of the properties they need are common.

And so the challenge is to dynamically train a model that generalizes to all "battery sites" based on the (very limited) input from a customer making a few clicks on a single "battery site".
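
A tiny sketch of the split described above, assuming a made-up set of pre-trained fields. `COMMON_FIELDS` and `split_request` are hypothetical names for illustration only. The point is that whatever falls into the second set cannot come from a pre-trained model and has to be learned from the customer's few clicks.

```python
# Hypothetical set of attributes a pre-trained ecommerce model would cover.
COMMON_FIELDS = frozenset({"price", "product_name", "reviews"})


def split_request(requested):
    """Split a customer's requested attributes into (common, long_tail)."""
    requested = set(requested)
    common = requested & COMMON_FIELDS      # covered by the pre-trained model
    long_tail = requested - COMMON_FIELDS   # needs per-customer training
    return common, long_tail


# A battery retailer wants the usual fields plus one long-tail attribute.
common, long_tail = split_request(["price", "product_name", "mah_rating"])
```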

I worked on this for a long time -

1. it's possible to make it "easy to switch" by having common building blocks and only changing the "selector" across sites - lots of companies in the space do this

2. it's impossible to do "just DOM" or "just vision/text" if you want to be able to generalize "get the price of the items"

- DOM doesn't represent spatial positioning very well (see: fixed/absolute positioning, IDs and DOM changing without the visuals changing, ...) so you'd need the equivalent of an entire browser rendering engine in your "model" anyway!

- vision/text is messed up by random marketing popups (see: medium, amazon, walmart, ...), it's significantly more computationally expensive to do, and it can't currently get >95% accuracy (which makes it useless, since scraping needs very close to 100% accuracy in most use cases)
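
For point 1, a rough sketch of what "common building blocks, only the selector changes" can look like. Everything here is an assumption for illustration: the site names, the `SITE_SELECTORS` table and the `extract` function are made up, and it uses the limited XPath support in Python's standard `xml.etree.ElementTree` on well-formed snippets. Real pages are rarely well-formed XML, so in practice you'd need a proper HTML parser or a browser.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site selectors: the only part that changes between sites.
SITE_SELECTORS = {
    "shop-a.example": {
        "name": ".//h1[@class='title']",
        "price": ".//span[@class='price']",
    },
    "shop-b.example": {
        "name": ".//div[@id='product']/h2",
        "price": ".//b[@class='cost']",
    },
}


def extract(site: str, markup: str) -> dict:
    """Shared extraction step: same code for every site, different selectors."""
    root = ET.fromstring(markup)  # assumes well-formed markup
    record = {}
    for field, path in SITE_SELECTORS[site].items():
        node = root.find(path)
        # A missing element becomes None instead of raising, so one changed
        # selector does not take down the whole record.
        record[field] = node.text.strip() if node is not None and node.text else None
    return record


page_a = "<html><body><h1 class='title'>AA battery</h1><span class='price'>$4.99</span></body></html>"
page_b = "<html><body><div id='product'><h2>AA battery</h2><b class='cost'>$5.49</b></div></body></html>"
record_a = extract("shop-a.example", page_a)
record_b = extract("shop-b.example", page_b)
```

This makes switching sites cheap, but someone still has to write one selector table per site, which is exactly the part that doesn't generalize.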

> So if a customer wants to scrape 1000 websites, they still have to build custom instructions for each website...

Can't this be crowdsourced in some way? Having each individual entity reinvent the same wheel feels like the main problem to me. What if there was a marketplace? The ability to buy / trade / sell? Maybe subscription based in some way?

If I wanted to scrape 100 sites, it might be worth $1 per year per site. Those who put in the time make money. Those who don't have the time would pay.

This isn't a technology issue per se. It's scaling a solution to the final gap the technology can't cover. A different kind of mechanical turk?

Crowdsourcing works in cases where lots of customers are interested in the same set of attributes to extract.

But by definition, customers interested in long-tail attributes (i.e. virtually all of them) don't have others to source those from.

Yes. But there might be some who would not be interested but still do it for minimal pay.

It would also lower the barrier to entry and thus increase the size of the market. Imagine if the first X sites I tried all needed more work. I'd likely quit. But if that didn't happen, I'd more likely continue.

Crowdsourcing isn't The Answer. But it's certainly a step in the right direction.

Yes, it can! See https://apify.com/marketplace

Disclaimer: I'm a co-founder of Apify :)