Hacker News
by photon_off 2311 days ago
Amazon has been my wheelhouse, full-time, for the past 7 months. I'm building a site that looks at most listings per category and determines "true" scores for each one. The site also lets you filter and sort, instantly, on a variety of attributes, like "price", "unit price", "shipping time", "recent price drop %", "used price discount", "popularity", "brand quality", etc. Spending 7 months collecting and analyzing data from several Amazon categories has been exhausting, but quite revealing.

As you can imagine, dealing with low quality products with fake reviews is a challenge -- but it turns out it's not too hard to handle, even with my dataset, which is far more limited than Amazon's. Without looking at any reviews or any metadata of reviews (author, count, chronology, etc.), one could filter out "impostor" products with 95%+ accuracy.

Here's a neat trick: Next time you're unsure if a product has fake reviews, click on the brand of the product and see what else they sell. If you're looking at binoculars, and that same brand also sells dog food bowls, then maybe you should reconsider.
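As a rough sketch of how that trick could be automated (not the actual method behind the 95% figure above), one could count how many distinct categories each brand sells in and flag the outliers. The field names, the threshold, and the brand names below are all made up for illustration:

```python
from collections import defaultdict


def brand_category_spread(listings):
    """Map each brand to the set of categories it has listings in."""
    spread = defaultdict(set)
    for item in listings:
        spread[item["brand"]].add(item["category"])
    return dict(spread)


def suspicious_brands(listings, max_categories=2):
    """Return brands that sell in more categories than the (assumed) threshold."""
    spread = brand_category_spread(listings)
    return sorted(b for b, cats in spread.items() if len(cats) > max_categories)


# Hypothetical listings; "OptiView" and "ZXQWY" are invented brand names.
listings = [
    {"brand": "OptiView", "category": "Optics"},
    {"brand": "OptiView", "category": "Optics"},
    {"brand": "ZXQWY", "category": "Optics"},
    {"brand": "ZXQWY", "category": "Pet Supplies"},
    {"brand": "ZXQWY", "category": "Kitchen"},
]

flagged = suspicious_brands(listings)
```

A raw category count is crude, since large legitimate brands also span many categories, so in practice this would need to be combined with other signals.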

I've concluded that Amazon really doesn't care about fake reviews -- they will show users whichever listing has the maximum Expected Value (conversion rate * revenue) for your context (search term, category, or both). Even if a product has obvious fake reviews, if enough other people are buying it, it will float to the top, and Amazon is fine with that.
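To make the Expected Value idea concrete, here is a minimal sketch of ranking by conversion rate * revenue. This is only an illustration of the formula above, not Amazon's actual ranking code; the `Listing` fields and the numbers are assumptions:

```python
from dataclasses import dataclass


@dataclass
class Listing:
    title: str
    conversion_rate: float  # assumed: fraction of views that end in a purchase
    revenue: float          # assumed: revenue per sale


def expected_value(listing):
    """Expected revenue per view: conversion rate * revenue."""
    return listing.conversion_rate * listing.revenue


def rank_by_expected_value(listings):
    """Highest expected value first, regardless of review quality."""
    return sorted(listings, key=expected_value, reverse=True)


# Hypothetical numbers for one search context.
candidates = [
    Listing("A", conversion_rate=0.10, revenue=20.0),
    Listing("B", conversion_rate=0.02, revenue=50.0),
    Listing("C", conversion_rate=0.08, revenue=30.0),
]

ranked_titles = [l.title for l in rank_by_expected_value(candidates)]
```

Note that nothing in this score penalizes fake reviews: if they raise the conversion rate, they raise the rank.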

1 comment

> unit price

This is quite an important feature. Amazon already shows this information (sometimes at least) and yet they don't let you sort by it.

The main problem they have, I imagine, is sparse data. There are only certain fields (depending on category) which they force sellers to populate, e.g. "name", "brand", etc. Item weight (which is distinct from _package_ weight) and "number of units" do not seem to be required, so not many items have that information filled in.

So, with sparse data, they have three choices:

1) Allow filter/sort by "unit price" and do not show the X% of listings that are missing this data -- many of which the user may actually be aware of and/or interested in.

2) Don't allow the option at all, and just rely on your customers doing the comparisons manually.

3) Try to derive the number of units from text cues in the product name, features, and description, then do #1.

It seems they chose #2. I'm going for #3.
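For a flavor of what #3 might look like, here is a minimal sketch that pulls a unit count out of a product title with a couple of regular expressions and then sorts by unit price, putting listings with no detectable count last instead of hiding them. The patterns and sample titles are assumptions for illustration; real listings would need many more cues (features, description, units of measure):

```python
import re

# Assumed text cues; a real system would need a much longer list.
_COUNT_PATTERNS = [
    re.compile(r"\bpack\s+of\s+(\d+)\b", re.IGNORECASE),                   # "Pack of 12"
    re.compile(r"\b(\d+)[\s-]*(?:pack|count|ct|pcs|pieces)\b", re.IGNORECASE),  # "6-Pack", "500 Count"
]


def extract_unit_count(title):
    """Return the number of units found in the title, or None if no cue matches."""
    for pattern in _COUNT_PATTERNS:
        match = pattern.search(title)
        if match:
            return int(match.group(1))
    return None


def unit_price(price, title):
    """Price per unit, or None when the unit count cannot be derived."""
    count = extract_unit_count(title)
    if not count:
        return None
    return price / count


def sort_by_unit_price(items):
    """Sort (price, title) pairs by unit price; unknown unit prices go last."""
    def key(item):
        up = unit_price(*item)
        return (up is None, up if up is not None else 0.0)
    return sorted(items, key=key)


# Hypothetical listings as (price, title) pairs.
items = [
    (12.00, "AA Batteries, Pack of 12"),
    (9.00, "Paper Towels 6-Pack"),
    (8.00, "Stainless Steel Dog Bowl"),
    (5.00, "Cotton Swabs 500 Count"),
]
sorted_titles = [title for _, title in sort_by_unit_price(items)]
```

Sorting the unknowns last is one way to soften the downside of #1, since nothing disappears from the results.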