| Thank you so much! I'm not that experienced at building large-scale projects, so I really appreciate your replies in this post. I'm quite surprised that Shopify doesn't have hotlink protection for images!
>I match products to binomial names through a process that uses a really complex regex, plus some manual labeling. Experimenting with using more ML here.
That's what I was wondering about. I'm building something similar, region-specific and for books, and sometimes the names are a little off, partial, or alternate names. I currently do a string comparison that requires at least 80-90% of the words in the title to match, which works okay for now. So thank you for the ideas. Your product update frequency is very interesting; I always thought scraping for price aggregation meant having to update very frequently.
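For anyone curious what the word-level title comparison mentioned above might look like, here is a minimal sketch. The function names, the tokenization rule, and the 0.8 default threshold are assumptions for illustration, not the actual code behind either project.

```python
import re


def tokenize(title: str) -> set:
    """Lowercase a title and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def word_overlap(query: str, candidate: str) -> float:
    """Fraction of the query's words that also appear in the candidate title."""
    query_words = tokenize(query)
    if not query_words:
        return 0.0
    return len(query_words & tokenize(candidate)) / len(query_words)


def is_match(query: str, candidate: str, threshold: float = 0.8) -> bool:
    """Treat two titles as the same book when enough query words overlap."""
    return word_overlap(query, candidate) >= threshold


if __name__ == "__main__":
    print(is_match("Harry Potter and the Chamber of Secrets",
                   "Harry Potter & the Chamber of Secrets"))
    print(is_match("The Hobbit", "The Silmarillion"))
```

Note that the score is asymmetric (it is measured against the query's words), so a short query can fully match a longer listing title. Whether that is desirable depends on how partial and alternate names show up in the data.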
My approach is a bit different: it only scrapes on search, so it isn't really scraping entire sites.
It's not the best approach, but scraping complete websites and that much data is scary to me lol
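A scrape-on-search setup like this is typically paired with a short-lived in-memory cache. The following is only a rough sketch under assumptions: `SearchCache`, the `fetch` callback, and the injected `clock` are hypothetical names, and the 30-minute TTL mirrors the figure mentioned in this comment.

```python
import time

CACHE_TTL_SECONDS = 30 * 60  # 30 minutes


class SearchCache:
    """Caches scrape results per search query for a fixed time-to-live."""

    def __init__(self, fetch, ttl=CACHE_TTL_SECONDS, clock=time.monotonic):
        self._fetch = fetch    # hypothetical callable that scrapes for one query
        self._ttl = ttl
        self._clock = clock    # injectable so expiry can be tested without waiting
        self._entries = {}     # normalized query -> (stored_at, results)

    def get(self, query):
        key = query.strip().lower()
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]      # still fresh: skip the scrape
        results = self._fetch(key)
        self._entries[key] = (now, results)
        return results
```

Everything here lives in process memory, so it is lost on restart and is not shared between workers, which is one reason a database becomes attractive as the project grows.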
I'm currently not using a db either, just scraping and caching that specific item for 30 mins, which, now that I think about it, is a bad idea if I want to make this a scalable project. I should indeed start using a database. Some feedback on the UI/UX: instead of having 'All plants' selected on the homepage, it would be nice to have a smaller grid of plants from each type/tag on the homepage itself. Selecting any of the tags would work the same as now, but the homepage would have more to explore; currently it's just overwhelming to do anything there, so I only look in specific tags or search. Edit: This is a great resource for adding more info about pet-friendly plants to the listed plants.
https://www.aspca.org/pet-care/animal-poison-control/toxic-a... |
One other tip - many sites have APIs that will give you their product data. You may need to contact them about getting access. Or it may be publicly available. But that is better than scraping if it is possible.