Hacker News
by RasmusFromDK 681 days ago
Nice writeup. I've been through problems similar to yours with my contact lens price comparison website https://lenspricer.com/, which I run in ~30 countries. I have found, like you, that websites changing their HTML is a pain.

One of my biggest hurdles initially was matching products across 100+ websites. Even though you think a product has a unique name, everyone puts their own twist on it. Most can be handled with regexes, but I had to manually map many of these (I used AI for some of it, but had to manually verify all of it).
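
A minimal sketch of the regex-plus-manual-mapping idea, not the actual code behind the site. The patterns, the canonical names chosen, and the `MANUAL_MAP` table are all hypothetical and only illustrate the shape of the approach:

```python
import re

# Hypothetical patterns: each maps a family of site-specific spellings
# to one canonical product name. Order matters: more specific first.
PATTERNS = [
    (re.compile(r"acuvue\s+oasys.*\b1[\s-]?day\b", re.I), "Acuvue Oasys 1-Day"),
    (re.compile(r"acuvue\s+oasys", re.I), "Acuvue Oasys"),
]

# Hypothetical manual overrides for names the regexes cannot handle.
# Keys are lowercased with whitespace collapsed.
MANUAL_MAP = {
    "oasys daily 30pk": "Acuvue Oasys 1-Day",
}


def canonical_name(raw):
    """Return a canonical product name, or None if nothing matches."""
    key = " ".join(raw.lower().split())  # lowercase, collapse whitespace
    if key in MANUAL_MAP:
        return MANUAL_MAP[key]
    for pattern, name in PATTERNS:
        if pattern.search(key):
            return name
    return None  # unmatched: queue for manual review
```

Returning `None` instead of guessing keeps the unmatched names visible, so they can be mapped by hand and verified.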

I've found that building the scrapers and infrastructure is somewhat the easy part. The hard part is maintaining all of the scrapers and figuring out, when a product disappears from a site, why: is it because my scraper has an error, is my scraper being blocked, did the site make a change, was the site randomly down for maintenance when I scraped it, etc.
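
One way to start untangling those causes is a first-pass triage on each scrape result. The sketch below is only a heuristic under stated assumptions: the `ScrapeResult` fields, the status-code rules, and the label names are all invented for illustration, and real sites can return 200 with a block page, which this would not catch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScrapeResult:
    status: Optional[int]  # HTTP status, None if the request itself failed
    products_found: int    # how many products the parser extracted
    target_found: bool     # whether the product in question was among them


def diagnose(result, expected_min_products=1):
    """Return a rough guess at why a product is missing from a scrape."""
    if result.status is None:
        return "network_error"
    if result.status in (403, 429):
        return "probably_blocked"
    if result.status >= 500:
        return "site_down_or_maintenance"
    if result.status == 404:
        return "page_removed"
    if result.products_found < expected_min_products:
        # Page loaded but the parser found too little: likely an HTML change.
        return "parser_broken_or_layout_changed"
    if not result.target_found:
        return "product_delisted"
    return "ok"
```

Labels such as `site_down_or_maintenance` are the ones worth retrying later before treating the product as gone.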

A fun project, but challenging at times, and annoying problems to fix.

6 comments

Doing the work we need. Every year I get fucked by my insurance company when buying a basic thing - contacts. Pricing is all over the place and coverage is usually 30%, done by mail-in reimbursement. Thanks!
Thanks for the nice words!
I'm curious, can you wear contact lenses while working? I notice my eyes get tired when I look at a monitor for too long. Have you found any solutions for that?
I use contact lenses basically every day, and I have had no problems working in front of screens. There's a huge difference between the different brands. Mine is one of the more expensive ones (Acuvue Oasys 1-Day), so that might be part of it, but each eye is compatible with different lenses.

If I were you I would go to an optometrist and talk about this. They can also often give you free trials for different contacts and you can find one that works for you.

FWIW, that is the same brand that I use and was specifically recommended for dry-eyes by my optometrist. I still wear glasses most of the time because my eyes also get strained from looking at a monitor with contacts in.

I'd recommend a trial of the lenses to see how they work for you before committing to a bigger purchase.

> Acuvue Oasys 1-Day

I don't often wear contacts at work but I can second that these are great for "all day" wear.

Age is an important factor here, not just contact brands.

As you get older, your eyes get drier. Also, having done Lasik and needing contacts after many years is a recipe for dry eyes.

This is very likely age-dependent.

When I was in my 20s, this was absolutely not a problem.

When I hit my 30s, I started wearing glasses instead of contacts basically all the time, and it wasn't a problem.

Now that I'm in my 40s, I'm having to take my glasses off to read a monitor and most things that are closer than my arm's reach.

Wait until you get to 50 and you have to take OFF your glasses to read things that are small or close.

This is the most annoying part of all my vision problems.

Hah, I'm already there in my 40s! I'm seriously considering getting a strap for my glasses - right now I just hook them into my shirt, but they'll occasionally fall out when I bend over for something, and it's only a matter of time before they break or go into a sewer.
My eye doctor recommended wearing “screen glasses”. They are a small prescription (maybe 0.25 or 0.5) with blue blocking. It’s small but it does help; I wear normal glasses at night (so my eyes can rest) and contacts + screen glasses during the day, and the two are really close.
Go try an E-Ink device. B&N Nooks are small Android tablets in disguise, you just need to install a launcher. Boox devices are also Android.

I can use an E-Ink device all day without my eyes getting tired.

I cannot, personally. They dry out.
For Germany, below the prices it says "some links may be sponsored", but it does not mark which ones. Is that even legal? Also there seem to be very few shops, are maybe all the links sponsored? Also idealo.de finds lower prices.
When I decided to put the text like that, I had looked at maybe 10-20 of the biggest price comparison websites across different countries because I of course want to make sure I respect all regulations that there are. I found that many of them don't even write anywhere that the links may be sponsored, and you have to go to the "about" page or similar to find this. I think that I actually go further than most of them when it comes to making it known that some links may be sponsored.

Now that you mention idealo, there seems to be no mention at all on a product page that they are paid by the stores, you have to click the "rank" link in the footer to be brought to a page https://www.idealo.de/aktion/ranking where they write this.

Fair enough, I had assumed the rules would be similar to those for search engines.
> One of my biggest hurdles initially was matching products across 100+ websites. Even though you think a product has a unique name, everyone puts their own twist on it. Most can be handled with regexes, but I had to manually map many of these (I used AI for some of it, but had to manually verify all of it)

In the U.S. at least, big retailers will have product suppliers build slightly different SKUs for them to make price comparisons tricky. Costco is somewhat notorious for this: almost everything in electronics (and many other products) sold in their stores is a custom SKU -- often with a slightly different product configuration.

Costco does this for sure, but Costco also creates their own products. For instance there are some variations of a package set that can only be bought at Costco, so you aren't getting the exact same box and items as anywhere else.
Would that still matter if you just compare by description?
Isn’t this a use-case where LLMs could really help?
Yeah, it is to some degree. I tried to use it as much as possible, but there are always those annoying edge cases that make me not trust the results, so I have to check everything, and it ended up being faster to just build a simple UI where I can easily classify the name myself.

Part of the problem is simply due to bad data from the websites. Just as an example - there's a 2-week contact lens called "Acuvue Oasys". And there's a completely different 1-day contact lens called "Acuvue Oasys 1-Day". Some sites have been bad at writing this properly, so both variants may be called "Acuvue Oasys" (or close to it), and the way to distinguish them is to look at the image to see which actual lens they mean, look at the price etc.

It's true that this could probably also be handled by AI, but in the end, classifying the lenses takes like 1-2% of the time it takes to make a scraper for a website so I found it was not worth trying to build a very good LLM classifier for this.
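
To illustrate the "look at the price" check for the ambiguous "Acuvue Oasys" name, here is a rough sketch. The reference prices and the tolerance are invented numbers, and the function is hypothetical; pack sizes and currencies would complicate this in practice. Anything that does not clearly fit exactly one variant is left for manual classification rather than guessed:

```python
# Hypothetical reference prices per box; real values would come from
# listings that are already classified.
REFERENCE_PRICES = {
    "Acuvue Oasys": 25.0,
    "Acuvue Oasys 1-Day": 35.0,
}


def guess_variant(price, candidates=REFERENCE_PRICES, tolerance=0.15):
    """Return the one candidate whose reference price is within `tolerance`
    (relative difference) of `price`, or None if zero or several fit."""
    matches = [
        name
        for name, ref in candidates.items()
        if abs(price - ref) / ref <= tolerance
    ]
    return matches[0] if len(matches) == 1 else None
```

A `None` result is the signal to fall back to a human looking at the image.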

> It's true that this could probably also be handled by AI, but in the end, classifying the lenses takes like 1-2% of the time it takes to make a scraper for a website so I found it was not worth trying to build a very good LLM classifier for this.

This is true for technology in general (in addition to specifically for LLMs).

In my experience, the 80/20 rule comes into play in that MOST of the edge cases can be handled by a couple of lines of code or a regex. There is then an asymptotic curve where each additional line of code handles a rarer and rarer edge case.

And, of course, I always seem to end up on a project where even a small, rare edge case has some huge negative impact if it gets hit, so you have to keep adding defensive code and/or build a catch-all bucket that alerts you to the issue without crashing the entire system, etc.
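
A minimal sketch of that catch-all pattern, using pack-size parsing as a made-up example. The `HANDLERS` list, the `unhandled` bucket, and the logger name are all assumptions for illustration; a real system would likely send the alert somewhere more visible than a log line:

```python
import logging
import re

logger = logging.getLogger("scraper")

# Hypothetical handlers: (pattern, function) pairs tried in order.
# The first couple of regexes cover the common cases.
HANDLERS = [
    (re.compile(r"^(\d+)\s*(?:pk|pack)$", re.I), lambda m: int(m.group(1))),
    (re.compile(r"^(\d+)\s*lenses$", re.I), lambda m: int(m.group(1))),
]

unhandled = []  # catch-all bucket, to be reviewed by a human later


def parse_pack_size(text):
    """Parse a pack size, or record the input and return None."""
    text = text.strip()
    for pattern, handler in HANDLERS:
        match = pattern.match(text)
        if match:
            try:
                return handler(match)
            except Exception:
                # Defensive: a buggy handler must not take the run down.
                logger.exception("handler failed for %r", text)
                break
    # Nothing matched (or a handler failed): alert and keep going.
    unhandled.append(text)
    logger.warning("unhandled pack size: %r", text)
    return None
```

The point is that a rare input ends up in `unhandled` with a warning instead of raising halfway through a scrape.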

Do you support Canada?