Hacker News
API that uses neural networks to scrape product data (mlscrapedemo.herokuapp.com)
40 points by Buneme 2209 days ago
4 comments

Hello,

Over the past few months I've been working on a neural-network based web scraper for e-commerce websites. The aim is to be able to scrape product data from any product page (so far it extracts the name, price, main image URL and technical specification of the product).

I've developed a working prototype of the API along with a demo page (with rate limits, so please use it within reason!) in order to get some feedback before I carry on with the project.

Because the API is only a prototype, it has some limitations that will be addressed later on. For example:
• Only English-language sites that use GBP, EUR or USD are supported.
• I haven't finished integrating my computer vision algorithms, which means that in some specific situations the API might not detect a strikethrough and will therefore mix up the "current price" of the product with the "old price".
• The service is running on a hobby Heroku server, so the API takes a few seconds longer than it otherwise would.
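
To make the currency limitation concrete, here is a minimal sketch of parsing a price string when only GBP, EUR and USD are recognised. This is not the API's actual code: `parse_price` and `CURRENCY_SYMBOLS` are hypothetical names, and the sketch assumes a leading currency symbol with "," for thousands and "." for decimals. It says nothing about which of two prices on a page is the current one.

```python
import re
from decimal import Decimal
from typing import Optional, Tuple

# Hypothetical symbol table: only the three currencies listed above.
CURRENCY_SYMBOLS = {"£": "GBP", "€": "EUR", "$": "USD"}

# Assumption: leading symbol, "," as thousands separator, "." as decimal point.
PRICE_RE = re.compile(r"([£€$])\s*(\d+(?:,\d{3})*(?:\.\d{1,2})?)")


def parse_price(text: str) -> Optional[Tuple[str, Decimal]]:
    """Return (currency code, amount) for the first price found, else None."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    symbol, amount = match.groups()
    return CURRENCY_SYMBOLS[symbol], Decimal(amount.replace(",", ""))
```

Text in any other currency, or with a different number format, simply returns `None` in this sketch.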

I would appreciate any feedback on the API, in particular:
• Is there any other product data that you would like to see (e.g. product ID, delivery costs, etc.)?
• What sort of applications would you use this API for once it's fully developed?
• Apart from e-commerce sites, what other types of websites would you like to see an API for (e.g. news websites, real estate listings, etc.)?

We really need a service like this.

We are a furniture e-commerce company. Our vendors don't provide detailed product feeds, so we have to rely on scraping.

The most difficult part of scraping the data is that we need to scrape all the product options (material, color, size, ...).

Each option is a different SKU. See https://www.article.com/product/11833/sven-charme-tan-sofa
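
As a rough illustration of why this grows quickly, the options can be treated as a Cartesian product in which every combination of option values is its own SKU. The option names and values below are made up, and `expand_variants` is a hypothetical helper, not part of any API mentioned in this thread.

```python
from itertools import product
from typing import Dict, List


def expand_variants(options: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Return one dict per combination of option values (one per SKU)."""
    names = list(options)
    value_lists = [options[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Made-up example options for a sofa.
options = {
    "Material": ["Leather", "Fabric"],
    "Color": ["Tan", "Grey"],
    "Size": ["2-seater", "3-seater"],
}
variants = expand_variants(options)
print(len(variants))  # 2 * 2 * 2 = 8 combinations
```

A scraper would need to visit or reconstruct each of these combinations to capture per-SKU prices and availability.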

We also need to build NLP models to understand product dimensions and weight (useful when estimating shipping fees).
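
For comparison, a much simpler baseline than an NLP model is a pair of regular expressions. The sketch below is only that: it assumes dimensions are written as three numbers separated by "x" followed by a unit, assumes the order is width, depth, height, and uses hypothetical names (`parse_dimensions`, `parse_weight`). Real product pages vary far more than this, which is presumably why a learned model is needed.

```python
import re
from typing import Dict, Optional, Tuple, Union

NUM = r"(\d+(?:\.\d+)?)"

# Assumed format: "224 x 97 x 86 cm" (unit written once, after the last number).
DIMENSIONS_RE = re.compile(
    NUM + r"\s*[x×]\s*" + NUM + r"\s*[x×]\s*" + NUM + r"\s*(cm|mm|in)\b",
    re.IGNORECASE,
)
# Assumed format: "45.5 kg", "100 lb", "100 lbs" or "500 g".
WEIGHT_RE = re.compile(NUM + r"\s*(kg|lbs?|g)\b", re.IGNORECASE)


def parse_dimensions(text: str) -> Optional[Dict[str, Union[float, str]]]:
    """Return width/depth/height and unit, or None if the pattern is absent."""
    match = DIMENSIONS_RE.search(text)
    if match is None:
        return None
    width, depth, height, unit = match.groups()
    # Assumption: the numbers appear in width, depth, height order.
    return {
        "width": float(width),
        "depth": float(depth),
        "height": float(height),
        "unit": unit.lower(),
    }


def parse_weight(text: str) -> Optional[Tuple[float, str]]:
    """Return (value, unit) for the first weight found, or None."""
    match = WEIGHT_RE.search(text)
    if match is None:
        return None
    value, unit = match.groups()
    return float(value), unit.lower()
```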

Hey - if this is true, my company already has a pretty good solution for getting product info in a standard format from tens of thousands of websites. My company also has to gather, format, and estimate dimensions and weight because we only do international shipping. Try out a random product URL on zipx.com for an example.

Should we talk? My email is in my profile. I think there might be several ways your company and mine can help each other, actually...

Thanks for getting in touch, I've just sent over an email.
I've heard of the SKU problem that e-commerce stores have to face with their vendors. Would you mind if I contacted you to learn more about it?
No problem, you can send an email to me@buneme.com :)
That's really useful feedback, especially the part about NLP models, thanks!!
Cool project, but I have no idea what I could use it for. Is it something you would use to build a price tracker?
Yes, a universal price tracker is a great example! Another potential use case would be competitor price analysis, so that you can react in real time to changes in your competitors' prices.
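
A minimal sketch of the price-tracker idea, assuming prices have already been scraped into two snapshots keyed by product URL. The snapshots and the `price_changes` helper are hypothetical; how the prices are fetched is left out entirely.

```python
from decimal import Decimal
from typing import Dict, List, Tuple

Snapshot = Dict[str, Decimal]  # product URL -> price seen at scrape time


def price_changes(previous: Snapshot, current: Snapshot) -> List[Tuple[str, Decimal, Decimal]]:
    """Return (url, old_price, new_price) for every URL whose price changed."""
    return [
        (url, previous[url], price)
        for url, price in current.items()
        if url in previous and previous[url] != price
    ]


# Made-up snapshots from two scraping runs.
yesterday = {
    "https://example.com/sofa": Decimal("999.00"),
    "https://example.com/lamp": Decimal("49.00"),
}
today = {
    "https://example.com/sofa": Decimal("899.00"),
    "https://example.com/lamp": Decimal("49.00"),
}
changes = price_changes(yesterday, today)
```

Here only the sofa is reported, since the lamp's price is unchanged between the two runs.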
Well done for getting it set up! How are you training your model? Feeding in scraped product pages alongside metadata from an API to train it? And what are your training sources?

Very nice idea, looking forward to seeing how it develops!

Yes, that's pretty much it - using a mix of e-commerce APIs and manual work to create data, and then using that as training data.
Sounds like a lot of room for costly errors. See the DoorDash pizza arbitrage story.
Interesting project.

Are there any prerequisites for the product page URL?

I just tested it out on a few e-commerce websites (fashion) and all the values returned from the product pages were null.

It's currently a prototype and I haven't finished integrating all the computer vision algorithms, so it may miss some data in certain situations. The purpose at this stage was just to see whether this is something people would genuinely use, and to get general feedback to help me decide which features to prioritise going forward.
Got it.

I have a side project I've been meaning to finish, where I would definitely use something like this (basically identifying price arbitrage opportunities across luxury fashion retailers).

I will bookmark and keep an eye out for your progress.