Hacker News
by jotto 4864 days ago
I haven't tested their data, but one reason this is hard is the SKUs at Walmart and Best Buy. There may be a Samsung 42-inch TV that exists at Amazon, Walmart and Best Buy but has slightly modified specs at each of those mega retailers... and on top of that, Best Buy has a lot of SKUs that simply die. So once semantics3 has "reconciled" that Samsung 42-inch TV across retailers, they'll have to continuously check whether any of the retailers has changed its SKU and/or URL.

source: I do this at dealzon.com for a very limited set of data where it's practical

2 comments

Yes, this is a big problem, and we have put a lot of effort into tackling it.

We try to calculate a 'hash' for the product that is independent of the SKU, factoring in all the structured metadata available: normalized dimensions (height, length, width), weight, model, manufacturer, etc. We also account for small variations in the numerical data points.

So even if SKUs change with slightly different specs, the 'hash' remains the same and we can identify and reassign them.
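
A minimal Python sketch of the idea, not semantics3's actual implementation: the field names, the bucket sizes (`dim_step`, `weight_step`), and the choice of SHA-1 are all assumptions made for illustration. Numeric fields are snapped to coarse buckets so that small variations collapse to the same value, and the SKU is deliberately left out of the key.

```python
import hashlib


def _bucket(value, step):
    # Snap a measurement to the nearest multiple of `step` so that
    # small variations between retailers land in the same bucket.
    return round(value / step)


def product_fingerprint(record, dim_step=0.5, weight_step=0.5):
    """Build a SKU-independent key from structured metadata.

    `record` is a hypothetical dict with manufacturer, model, height,
    length, width, and weight. The SKU is intentionally ignored.
    """
    # Sort the bucketed dimensions so the key does not depend on which
    # axis a retailer happens to label as height, length, or width.
    dims = sorted(
        _bucket(record[k], dim_step) for k in ("height", "length", "width")
    )
    parts = [
        record["manufacturer"].strip().lower(),
        record["model"].strip().lower(),
        "x".join(str(d) for d in dims),
        str(_bucket(record["weight"], weight_step)),
    ]
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


# Two made-up listings of the same TV with different SKUs and
# slightly different specs.
listing_a = {"sku": "A-1", "manufacturer": "Samsung", "model": "UN42-X",
             "height": 22.1, "length": 37.6, "width": 3.1, "weight": 20.9}
listing_b = {"sku": "B-2", "manufacturer": "SAMSUNG ", "model": "un42-x",
             "height": 22.0, "length": 37.5, "width": 3.0, "weight": 21.0}

same_product = product_fingerprint(listing_a) == product_fingerprint(listing_b)
```

One caveat with plain bucketing: two values that sit on either side of a bucket boundary still produce different keys, so a real system would likely need a fuzzier comparison on top of this.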

Drop me a note at varun [at] semantics3.com - we could swap notes :)

[Edit: Added extra information]

We're working on this problem at Datafiniti (https://www.datafiniti.net). Since we index hundreds of sources for a similar service, we can leverage some basic string comparison techniques to normalize records from different sources and fill in attributes that are missing from any one source.
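
A rough Python sketch of what "basic string comparison" plus attribute filling could look like. This is an assumption about the general technique, not Datafiniti's code: it uses the standard library's `difflib.SequenceMatcher` as the similarity measure, and the record fields and the 0.8 threshold are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Case-insensitive similarity ratio in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_records(records, threshold=0.8):
    """Group records whose names look alike and fill in missing attributes.

    Each record is a hypothetical dict with a "name" key; attributes that
    are absent or None in the first-seen record are filled from later matches.
    """
    merged = []
    for rec in records:
        for group in merged:
            if similarity(rec["name"], group["name"]) >= threshold:
                for key, value in rec.items():
                    if group.get(key) is None:
                        group[key] = value
                break
        else:
            # No existing group was close enough, so start a new one.
            merged.append(dict(rec))
    return merged


# Made-up records from three sources.
records = [
    {"name": "Samsung 42-Inch LED HDTV", "height": 22.0, "weight": None},
    {"name": "samsung 42-inch led hdtv (black)", "weight": 21.0},
    {"name": "Canon EOS Rebel T3i camera", "weight": 1.3},
]
merged = merge_records(records)
```

Here the two Samsung listings collapse into one record that carries both the height from the first source and the weight from the second, while the camera stays separate.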