Show HN: Crawling car dealerships for a real time consumer search engine (demanjo.com)
72 points by webtrill 4830 days ago
20 comments

I have built something very similar to this in the past. Here are a couple of observations:

Scraping:

- Dealer web sites are run by a handful of different data brokers. For the most part, if you find a good way to scrape one (say, a dealer who uses http://www.dealer.com/), then you can extend your scraper to get the others.

- Dealer web sites, in general, are horrible to view.

- Learn to love VIN explosion/decoding (http://www.researchmaniacs.com/VIN/VIN-Decoder.html). The dealers enter features in so many different ways that it is your best chance to normalize the data.

- Normalize the ext. color: create a distinct list of all the crazy colors car makers give their cars and all the shorthand dealers use for them. Create a map of the colors and apply it when scraping.

- Scraping is like farming, a lot of initial work, but there is constant upkeep for changing sites
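
The color map described in the list above could look roughly like this. It is only a minimal sketch: the `COLOR_MAP` entries and the `normalize_color` helper are invented for illustration and do not come from any real dealer feed.

```python
import re

# Hypothetical mapping from dealer shorthand / marketing names to a small
# set of canonical colors. A real map would be built from the scraped data.
COLOR_MAP = {
    "blk": "black",
    "black": "black",
    "midnight black metallic": "black",
    "wht": "white",
    "white": "white",
    "oxford white": "white",
    "slv": "silver",
    "silver": "silver",
    "ingot silver": "silver",
    "red": "red",
    "race red": "red",
}

# Canonical names used as a fallback when the full string is not in the map.
CANONICAL = sorted(set(COLOR_MAP.values()))


def normalize_color(raw):
    """Return a canonical color for a raw dealer string, or None if unknown."""
    # Lowercase, drop punctuation/digits, collapse whitespace.
    key = re.sub(r"[^a-z ]+", " ", raw.lower())
    key = re.sub(r"\s+", " ", key).strip()
    if key in COLOR_MAP:
        return COLOR_MAP[key]
    # Fallback: look for a canonical color word anywhere in the string.
    words = key.split()
    for color in CANONICAL:
        if color in words:
            return color
    return None
```

Strings that still come back as `None` are the ones worth reviewing by hand and adding to the map.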

Display:

- Users don't want to search by city as much as by what is close to them. You should geocode the dealerships and display distance. For instance, if I search West Des Moines, I would expect inventory in Des Moines to come up also.

- Add searching by zip code; you can easily find a database of the centroids. It can also be a cheap way of geocoding the dealerships.

- Switch Mileage to use miles instead of KM. It looks like most of the inventory is in the US, and that is what the user will expect.

- Use an ip2geo lookup to set the initial location of the search; right now it looks like it is all over the place. Check whether the browser supports geolocation and optionally set the initial search by that.
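
As a rough illustration of the zip-centroid idea above, a radius search can be done with the haversine formula. The `ZIP_CENTROIDS` table holds approximate, hand-entered coordinates for illustration only, and `dealers_within` is a hypothetical helper, not anything the site is known to use.

```python
import math

# Hypothetical zip-code centroids (latitude, longitude). Values are
# approximate; a real table would come from a zip centroid database.
ZIP_CENTROIDS = {
    "50309": (41.585, -93.620),  # Des Moines, IA (approximate)
    "50266": (41.570, -93.800),  # West Des Moines, IA (approximate)
    "68102": (41.260, -95.935),  # Omaha, NE (approximate)
}

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def dealers_within(search_zip, dealer_zips, radius_miles):
    """Return (zip, distance) pairs inside the radius, nearest first."""
    origin = ZIP_CENTROIDS[search_zip]
    hits = []
    for z in dealer_zips:
        d = haversine_miles(origin, ZIP_CENTROIDS[z])
        if d <= radius_miles:
            hits.append((z, round(d, 1)))
    return sorted(hits, key=lambda item: item[1])
```

With these sample coordinates, a 25-mile search from the West Des Moines zip picks up the Des Moines dealer and leaves out Omaha.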

Changing sites is not a problem, as the algorithms are generic and can target any website!! We do not develop algorithms for each provider out there; that would not be feasible and, quite frankly, a waste of time on our part.

For geolocation searches, we auto-detect a user's location. I cannot guarantee the accuracy of a user's location, as the geo data source is a free version. You can, however, easily select the miles to increase the radius of your search.

How did you do bulk VIN explosion? Pay for it (rates are high in my research) or scrape one of the free sites?
Built my own by scraping some of the free sites. You can do basic make and year off the plain VIN; I then scraped some of the free sites with breakdowns of the various VINs to build a map.
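
A minimal sketch of the "basic make and year off the plain VIN" part, assuming standard 17-character VINs: the first three characters are the manufacturer identifier (WMI) and the 10th character is the model-year code, which repeats on a 30-year cycle. `SAMPLE_WMI` is a tiny illustrative table and `decode_basic` is a hypothetical helper; trim and options still need a fuller data source.

```python
# Model-year codes in order, starting at 1980 (I, O, Q, U, Z and 0 are unused).
YEAR_CODES = "ABCDEFGHJKLMNPRSTVWXY123456789"

# Tiny illustrative sample of manufacturer prefixes; a real table is far larger.
SAMPLE_WMI = {"1HG": "Honda", "WBA": "BMW", "JF1": "Subaru", "1FT": "Ford"}


def decode_basic(vin):
    """Return the WMI, a make guess, and candidate model years for a VIN."""
    vin = vin.strip().upper()
    if len(vin) != 17 or any(c in "IOQ" for c in vin):
        raise ValueError("not a 17-character VIN")
    offset = YEAR_CODES.find(vin[9])  # 10th character is the year code
    # The code repeats every 30 years, so both cycles are returned and the
    # caller has to disambiguate (for example, from the listing itself).
    years = [] if offset < 0 else [1980 + offset, 2010 + offset]
    return {
        "wmi": vin[:3],
        "make": SAMPLE_WMI.get(vin[:3]),
        "candidate_years": years,
    }
```

For the commonly cited sample VIN `1HGCM82633A004352`, this yields the `1HG` prefix and candidate years 2003 and 2033.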
I think this is nice, and I already found a deal I haven't come across yet on a vehicle I'm interested in, so thanks!

A few initial thoughts:

- I can't click the radio buttons themselves within each options group; only the link/label itself is clickable

- I would rather the mileage filters be < 10k, < 20k, < 36k, etc. If I can't have this, I want to be able to select multiple mileage filters at once

- I want to select multiples of other filters too. Year for example. eBay allows me to enter "2009-2011" or "2009-". I can only view one year at a time on this site.

- It took me a minute to realize I had to select a make before a model (yes, I feel stupid for this); showing an empty stub where the options will appear once I have selected a make is not intuitive - how about not showing the not-yet-ready filters until they are relevant?

- Seeing the "Contact Dealer" red button directly below the phone number made me initially think I was going to be calling the dealer. I finally clicked it after seeing no other place to view the dealer's own listing. I'd make getting to the dealer's site a bit more prominent

A few thoughts:

- I think CarFax has an affiliate program that might let you link through to their reports, keyed by VIN. These reports are a pretty valuable tool to shoppers as the condition of a used car is driving 50% of the buying decision.

- One data point that is really important to dealers is how long the car has been on the lot. Hypothetically you could just track how long the same car has appeared at the same dealership. The longer the car has been sitting there, the more motivated they'll be to sell it. I remember hearing 30 days is a long time for dealerships to sit on a car. For buyers this could be good information to have.

- Getting the mobile experience right would be a huge win. So often the cars the dealer lists aren't actually the cars they have on the site (I'm not sure how much of this is intentional and how much is a matter of how fast inventory turns over). So when you get to the dealership, you want to quickly and easily be able to comparison shop - get that right and you cover people in that really critical uncomfortable moment.
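
The time-on-lot idea in the list above could be tracked by remembering the first crawl in which each car appeared at each dealership. This is a hypothetical sketch: `update_first_seen` and `days_on_lot` are made-up names, and a real crawler would persist the mapping instead of keeping it in memory.

```python
from datetime import date


def update_first_seen(first_seen, crawl_date, listings):
    """Record the first crawl date for each (dealer, VIN) pair seen today.

    `listings` is an iterable of (dealer_id, vin) pairs from one crawl.
    Pairs that are already known keep their original date.
    """
    for dealer_id, vin in listings:
        first_seen.setdefault((dealer_id, vin), crawl_date)
    return first_seen


def days_on_lot(first_seen, dealer_id, vin, today):
    """Days since this car was first seen at this dealership."""
    return (today - first_seen[(dealer_id, vin)]).days


# Usage: two crawls, about five weeks apart.
seen = {}
update_first_seen(seen, date(2012, 1, 1), [("dealer-1", "VIN-A")])
update_first_seen(seen, date(2012, 2, 5), [("dealer-1", "VIN-A"), ("dealer-1", "VIN-B")])
```

This only measures time since the crawler first saw the car, so it is a lower bound on the real time on the lot.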

What about trim levels? Trim levels are crucial in the buying decision. For example, if you search for a Ford F-150 on your site, there is no way to see if it's an XLT, King Ranch, Raptor, etc. The price and equipment difference in those vehicles is huge.

What about options and equipment? Does the car have navigation? Sunroof? Most consumers have specific option packages in mind when searching for car online.

This is why VIN explosion is necessary for any serious automotive shopping site. If a consumer can't narrow the vehicles down to a trim and option package level then it won't get wide adoption.

VIN explosion doesn't always return truck trims, as many times the actual truck beds are added after the vehicle rolled off the assembly line.

So they will probably need more than VIN explosion for that; there are some companies that do provide that data though.

Trucks are certainly the most difficult to decode. But if you are using something like ChromeData for the VIN data, and combine that with the info from the dealer's site then you can usually narrow the vehicle down to a specific trim level.

Not always however. Dealers frequently have incorrect or missing information on their websites, so garbage-in, garbage out.

This is why scraping dealer websites for data is always going to be problematic. Far better to work with the providers to have them send you the data. It's faster, easier and you get far better data.

I used to work for a competitor in the same space as AutoRevo =) Chrome was ok, but there was another provider that had exact VIN matches in their catalog. It was a bit more expensive, but made it so the trim field was a non-issue. I don't remember the name; it's been too long.
It might have been AutoData, but they have merged with Chrome. Edmunds has a decoder, but it's pretty laughable. I'm not aware of any other major players in that field outside of those.

Chrome offers 1-1 matches on VIN to style IDs for most OEMs, but it's an additional cost.

We'll be adding the trim levels in the near future. We agree it can be very useful when buying a car. Thanks
Suggest a map that shows which dealer and where a car is located. There's a subset of car buyers where the car itself is less important than the garage they bought it from, in case they need to bring it back for repairs / tuning etc. This is especially true of used cars. And very important on mobile
Maybe it's something country-specific, but why would you do repairs in the same garage? Or did you mean the dealer's guarantee period?
For folks who lease cars (a poor financial decision long-term, but lower monthly payments) you need to take the car back to the dealership you bought it from at regular intervals (3-10K miles).
Overall, impressive data, but UI/UX could be improved.

'Refine By' menu shows an arrow on mouseover, but user can only click on words, not arrow or high-lighted area.

'Refine By' pop-out menus show a square, value as hyperlink, count. Square looks like a checkbox (with rounded corners) but clicking on square produces no result (this user expected a checkbox response, & ability to select multiple models, colors, etc.)

Change of other filters does not require a 'go' button, but change of search radius does.

Results appear to have a wide radius, but pull-down for location does not show what default/pre-selected radius is (by experimentation, 10 miles).

Generally, the user should be able to tell what filters are currently active, and what their values are.

Clear Price filter action appeared to clear all filters.

US states and placenames with multi-word names should have all elements capitalized: s/New mexico/New Mexico/ s/San luis obispo/San Luis Obispo/

Price filter should allow either end-point to be absent.

Mileage should allow arbitrary range end-points, like price.

Year should allow a range, or checkboxes.

Consider having Refine by Make hide the less popular makes behind a 'more' button. So, by default display top N makes & 'more', 'more' displays top N2 (or all) makes & 'more', top N4 or all, top N*8 or all, etc. (otherwise the menu may grow to many dozens of obscure makes)

Support 'open in a new tab' on the hyperlinks shown below the first search box. Searching <color> <make> <letter> displayed 3 links with counts, but clicking them displays nothing - sigh. (perhaps the back-end function is not yet implemented?)

Again, impressive data.

Great job! Nicely done.

A suggestion, also a source of competitive advantage: Allow selecting more than one Make per search. Seems nobody does this. I would like to see all SUVs except those by the big North American manufacturers. To do this I need to execute many separate searches.

Thanks!!

Will deploy multiple selections of filters by week's end.

Servers couldn't handle it.
I really wish people would test their sites before doing this. I can understand if someone else submits your stuff and you had no idea it was going to happen (but even then...), but if you create an account with no history for the explicit purpose of submitting to HN as the OP did, you should at least test it under some semblance of load.
> I really wish people would test their sites before doing this.

I'm conflicted; whilst ultimately you're right, it's very easy to type those words, and not as easy to test for a realistic load.

It's not as simple as throwing ab at your website, you need to use a proper tool (e.g. JMeter), make sure you're testing realistic user behaviour (even basics such as whether images and CSS have an impact), and ensure that you're not getting a false sense of security (e.g. how many connections are actually hitting the server at the same time?).

So yeah, whilst I kinda agree with you, I think it's a lot easier to say than to carry out.

I totally agree, and if I tried to do it I'd probably botch it myself. I wouldn't even know where to start, as I've never built anything that needs to be under this kind of load where the deployment isn't handled by people better at it than I.
amplified log playback works best in my experience.

of course, this requires at least some public traffic to play back. i've only used proprietary tools for this in my personal experience, but i believe jmeter has this functionality (log sampling).
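
A rough sketch of what amplified log playback can mean in practice: parse an access log, then replay each request several times with the original gaps compressed. This is not the in-house tooling mentioned above; `build_replay_schedule` is a hypothetical helper that assumes Common Log Format lines and only builds the schedule, leaving the actual request sending to whatever HTTP client you use.

```python
import re
from datetime import datetime

# Pulls the timestamp, method and path out of a Common Log Format line.
LOG_LINE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+)')


def build_replay_schedule(log_lines, amplify=1, speedup=1.0):
    """Return (delay_seconds, path) pairs for replaying logged GET requests.

    Each logged GET is emitted `amplify` times, and the gaps between
    requests are divided by `speedup`.
    """
    events = []
    for line in log_lines:
        m = LOG_LINE.search(line)
        if not m or m.group("method") != "GET":
            continue  # skip unparseable lines and non-idempotent requests
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        events.append((ts, m.group("path")))
    events.sort(key=lambda e: e[0])
    if not events:
        return []
    start = events[0][0]
    schedule = []
    for ts, path in events:
        delay = (ts - start).total_seconds() / speedup
        schedule.extend([(delay, path)] * amplify)
    return schedule
```

Only GETs are replayed here so that the playback does not resubmit forms or other state-changing requests.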

This seems really interesting. When you say proprietary tools, I assume you mean more custom-built solutions and less paid tools available to anyone?
yeah, tools built in-house to test specific apps, not commercial apps
You're assuming the HN submission isn't the load test.
I would love to see a post on how to test your site for load issues like ones you would face from hn
I posted an Ask HN a few months ago asking what folks do to test but the best answer I got was "load test your website" :P
I understand what you mean, but we have actually tested the site several times before posting it on HN. We are working hard at this and will be back shortly. Thanks.
Looks like everything's working now. Can I ask why when searching for cars in the US all the mileages are in Km?
True, will have to change this.
Just a weird error. If I accept the location tracking and use this url: http://www.demanjo.com/new/search?3=3670447604&0=3526732...

Then it will show the search, but go to another search afterwards.

Hi, can you provide more details on this error? What location are you in?

Works perfectly for me from my location.

Will be very helpful.

Searched for WRX, but got a bunch of Hondas. Looks like Honda is a fallback if model is not recognized...
It's clean and fast. How do you fetch the data and what's the revenue model? Cheers.
Funny, I made something very similar when I was looking to change my car 2 years ago or so, but didn't open source it. The UI was awful, but it had email notifications when a search query matched.
Seems very useful (looking for a car right now), but the filter by price range function seems to return zero results with no regard to the values entered.
Noticed, deploying a fix.

Thanks!

How did you manage to come into agreement with car dealers to crawl their sites and use their photos in your aggregator? Congratulations!
I would guess they did so in the same way that google asks every website operator before crawling and caching. (I.e., I suspect they didn't come to any explicit agreement. If google doesn't, why should they need to?)
Given that Google has been sued over that countless times, as a small startup I would be wary of following their example.
This is probably the wrong attitude towards founding startups. In general, you shouldn't unnecessarily risk the business -- but if people are throwing roadblocks in your way, lots of startups seem to generally do pretty well when they play fast-and-loose with rules. The logic is that nobody's going to bother to sue you until you get big and can defend yourself. Obviously taking a big risk like this isn't ideal, but you shouldn't let it stop you from moving forward with a business.
“Listings are currently sourced from several delearship websites by means of crawling and extracting relevant content available on the host application. If you are a dealer wishing to list and/or promote your inventory on Demanjo, we can help you drive qualified, local shoppers to your dealership.”

and “The selection and placement of listings on this page, except featured listings, were determined automatically by a computer program. For premium placement, please contact us.”

It doesn’t sound like he asked permission.

Yep and as I worked on sites for a big Audi dealer in the past they would not be happy giving stuff to scrapers as opposed to getting the lead direct.

BTW for information in the UK a lead for our Audi B2C site was worth around £60.

Sounds dodgy from a Google perspective republishing other peoples content - though I know that Google looked at doing a niche car product - like they have with hotels etc so might not be a viable long term business.

> Sounds dodgy from a Google perspective republishing other peoples content

Actually, it is just publishing compiled facts. It might violate the ToS, but it probably doesn't violate copyright.

Do you know of any good articles demonstrating the repercussions of violating a ToS vs. violating a copyright?

I'm guessing violating a copyright is more likely to result in aggressive legal action whereas a ToS violation would just get you banned from the service or sent some sort of cease and desist.

Yeah, it's a bit of a grey area. I suspect that the big players don't want to be the first to start legal proceedings; they want someone else to pull the trigger.

I used to work for Reed Elsevier and there was rampant scraping and plagiarizing going on, usually to create crappy MFA sites or to insert middlemen (offering no social value) into the job board market.

I could possibly see EU-based recruitment companies going after Indeed - maybe if StepStone are up for a fight.

Matt Cutts, what's the deal with allowing Indeed's search results into your index? I thought you did not like other search engines' results in Google's index.

As long as they respect the robots.txt, and aren't hitting the site constantly, there shouldn't be a problem technology-wise.
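
For reference, checking robots.txt before fetching can be done with Python's standard `urllib.robotparser`. This is a minimal sketch: `make_checker`, the `example-crawler` user agent and the `dealer.example` URLs are placeholders, and nothing here reflects how the site's crawler actually behaves.

```python
import urllib.robotparser


def make_checker(robots_txt, user_agent):
    """Build an `allowed(url)` function and crawl delay from robots.txt text.

    A real crawler would download robots.txt from each site; the text is
    passed in directly here so the sketch needs no network access.
    """
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Crawl-delay is None when the file does not set one for this agent.
    delay = parser.crawl_delay(user_agent)

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed, delay


# Usage with a hypothetical robots.txt.
ROBOTS = "User-agent: *\nDisallow: /admin/\nCrawl-delay: 5\n"
allowed, delay = make_checker(ROBOTS, "example-crawler")
```

Sleeping for `delay` seconds between requests to the same host covers the "aren't hitting the site constantly" part.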

Now, whether this business wants a potential competitor doing this, that's a different story.

I'm seeing a fair few duplicates in the results, probably need to work on the algorithm for filtering these out.
Noticed!! Some dealerships are operating under several distinct domain names, which creates these duplicates. We are currently working on a solution for this.
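
One possible way to collapse such duplicates, assuming each listing carries a VIN, is to key on the VIN rather than the URL. This is a hypothetical sketch, not the solution the site ended up with; `dedupe_by_vin` and the listing fields are invented for illustration.

```python
def dedupe_by_vin(listings):
    """Keep the first listing seen for each VIN.

    `listings` is a list of dicts with at least a "url" key and, usually,
    a "vin" key. Listings without a VIN fall back to their URL as the key,
    so they are never merged with each other by accident.
    """
    seen = {}
    for item in listings:
        vin = (item.get("vin") or "").strip().upper()
        key = vin if vin else ("url", item["url"])
        seen.setdefault(key, item)
    return list(seen.values())


# Usage: the same car listed under two domains of one dealership.
cars = [
    {"vin": "1HGCM82633A004352", "url": "http://dealer-a.example/car/1"},
    {"vin": "1hgcm82633a004352 ", "url": "http://dealer-a-autos.example/car/1"},
    {"vin": None, "url": "http://dealer-b.example/car/9"},
]
unique = dedupe_by_vin(cars)
```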
There is a similar site in Russia: http://auto.yandex.ru/
Is there any reference for designing classified websites? In terms of usability and experience.
How come there are no cars in San Francisco? Maybe you should think about parsing Craigslist?
Our crawler is currently in motion and I see about 239 cars in SF.

We are solely focusing on dealership websites.

The only suggestions I have when I type "San Fra" are: San Clara, MB; San Josef Bay, BC; San Josef, BC; San Joseph Bay, BC
Your IP address resolves to Canada.

You can only search for cars in your country.

Will have to change this if required at some point.

That's even more weird because I'm in France, using a French ISP without any proxy or whatever, and I can see cars from Texas, Nevada, ...
You should be able to click the image on the homepage to go to the listing.
Thanks.

Will incorporate.

Cheers!

Great, the website is a really good idea!
Great job. Love the idea. Can you tell us which technologies you used ?
Everything is proprietary!! That is our big advantage.

Data store -- > Code name "Saycron". Sits on Demanjo's distributed file system.

Indexing -- > Uses a data structure I call "Octo-tree". Outperforms any multidimensional index I came across in any academic papers.

Web server -- > Code name "SlimAPE". Simple, non-blocking.

Web framework -- > Code name "HerikX".

I will write about all of these technologies later on my blog, whenever I get around to setting one up.

I'd like to get in touch. Looked at your profile and no email so can you drop me one if you get the opportunity? Thanks!