| HN Mirror


by kmike84 1880 days ago

That's interesting. We're working on web data extraction at Zyte (formerly Scrapinghub); we have an Automatic Extraction product (https://docs.zyte.com/automatic-extraction-get-started.html) which combines ML and metadata to get data from websites automatically. Our learnings from building it:

1) metadata is helpful: not all of it, but some; 2) ML is obviously needed when metadata is missing, and metadata is missing very often; 3) even when metadata is present, pure ML-based extraction often beats it in quality, given the right ML models. A combination of ML + metadata fallbacks is even better.
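To make the "ML + metadata fallbacks" idea concrete, here is a toy sketch. It is not our actual pipeline: the `combine` function, the confidence score, and the 0.5 threshold are all made up for illustration, and a real system would treat metadata as one input among many rather than as a simple if/else.

```python
from typing import Optional


def combine(
    ml_value: Optional[str],
    ml_confidence: float,
    metadata_value: Optional[str],
    threshold: float = 0.5,  # hypothetical cut-off, chosen only for this sketch
) -> Optional[str]:
    """Prefer the ML prediction; fall back to metadata when ML is missing or unsure."""
    # A confident ML prediction wins, even if metadata is present.
    if ml_value is not None and ml_confidence >= threshold:
        return ml_value
    # Otherwise use the site-provided metadata, if there is any.
    if metadata_value:
        return metadata_value
    # No usable metadata: a low-confidence ML guess (possibly None) is all we have.
    return ml_value
```

For example, `combine("Acme Kettle", 0.9, "Home | Acme")` keeps the ML value, while `combine("Acme", 0.2, "Acme Kettle 1.7L")` falls back to the metadata.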

Website creators often make mistakes when providing metadata: they may misunderstand the schema and the purpose of various fields, have metadata auto-generated incorrectly, etc. For the tasks we're working on, it is rarely about deliberate deception (though that may also happen).

So I don't see Zyte falling back to pure metadata analysis. ML models are already better than this human-provided metadata, but metadata is still helpful as one of the inputs.

We're going to publish a product extraction benchmark soon, where, among other things, we compare automatic extraction with metadata-based extraction. In this evaluation, ML + metadata came out better than metadata alone not only overall (which is expected), but on precision as well.

I wonder whether the reasons metadata is sometimes preferred have little to do with quality or with failures of ML approaches. If Google relies on metadata and doesn't get the data right, it is no longer Google's fault; it is the website's fault.