Hacker News
by mosseater 1533 days ago
I used to work at a startup that aggregated apartment listings. Long gone now, we couldn't compete with Zillow or Apartments.com. But what we were doing is just aggregating all of the rental websites we could possibly scrape into one interface.

It was hard. There are so many scams out there. If you are not hand-curating listings, you have to rely on somewhat novel approaches to filter out all the bad data. For example, any mention of 'Jesus' or 'God' automatically blocked the listing. Sure, there might have been legitimate listings, but whenever I went in and checked, 99% of the time it was a scam. You also have scams where people list units they don't own or have any relation to. They ask for a security deposit up front and just pocket it, leaving the new tenant to have an awkward conversation with the real resident.
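
A minimal sketch of that kind of keyword block, in Python. The two terms come from the comment above; the whole-word matching, the function name `is_blocked`, and the idea of keeping the terms in a tuple are assumptions for illustration, not a description of what the startup actually ran.

```python
import re

# Terms mentioned in the comment; a real blocklist would likely be longer (assumption).
BLOCKED_TERMS = ("jesus", "god")

# \b anchors match whole words only, so a street name like "Godfrey" does not trigger.
_BLOCKED_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in BLOCKED_TERMS) + r")\b",
    re.IGNORECASE,
)

def is_blocked(listing_text: str) -> bool:
    """Return True if the listing mentions any blocked term as a whole word."""
    return _BLOCKED_RE.search(listing_text) is not None
```

A filter like this trades a few false positives for a large drop in scams, which matches the trade-off described above.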

Data is often unformatted too. Even scraping out bedrooms and bathrooms can be a challenge on websites like craigslist, where the listing is just one big paragraph. (Not anymore; craigslist has come a long way, but that's how it was 10 years ago.) Often you had to just search for the closest number around the word "bed" or "bath". Don't even think about getting features like "driveway" or "laundry" out of them.
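
The "closest number around the word" heuristic could look roughly like this in Python. The function name `nearest_number`, measuring distance in characters, and using only the first occurrence of the keyword are all assumptions made for the sketch.

```python
import re
from typing import Optional

# Matches integers and simple decimals such as "2" or "1.5".
_NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")

def nearest_number(text: str, keyword: str) -> Optional[float]:
    """Return the number closest (by character distance) to the first
    occurrence of `keyword`, or None if the keyword or any number is missing."""
    kw = re.search(re.escape(keyword), text, re.IGNORECASE)
    if kw is None:
        return None
    best = None
    best_dist = None
    for num in _NUMBER_RE.finditer(text):
        # Gap between the number and the keyword, whichever side it is on.
        if num.end() <= kw.start():
            dist = kw.start() - num.end()
        else:
            dist = num.start() - kw.end()
        # On a tie the earlier number wins, because the comparison is strict.
        if best_dist is None or dist < best_dist:
            best, best_dist = float(num.group()), dist
    return best

listing = "Sunny 2 bed, 1.5 bath apartment near downtown, $1200/mo"
beds = nearest_number(listing, "bed")
baths = nearest_number(listing, "bath")
```

It is easy to fool: a price, a phone number, or "5 min to the beach" sitting next to the keyword will win, which is part of why this kind of extraction was so unreliable.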

In the end, we used some pretty intense ETL pipelines to collate historical data, census information, property assessment data, and other things to try to get a more accurate picture of our listings.

But that didn't win out. What won out are the sites like Apartments.com or Zillow, where legitimate property owners can post their listings in a formatted searchable way. We could scrape them and post the same listings on our site, but at that point we were just pushing our customers to another platform that honestly worked better than our own.

We couldn't have the most up to date data, that was determined by how fast we could go back to scrape a listing. And often times we were knee deep in a battle to avoid being blocked by these companies. Often times, after we had exhausted our proxies, the only thing left to use was Tor.
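
To put rough numbers on the freshness problem, here is a back-of-the-envelope sketch in Python. The function name and the figures in the note below are hypothetical, chosen only to show the arithmetic.

```python
def full_refresh_hours(num_listings: int, requests_per_second: float) -> float:
    """Hours needed to revisit every listing once at a steady request rate."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return num_listings / requests_per_second / 3600
```

With a hypothetical 360,000 listings and a sustained 10 requests per second, `full_refresh_hours(360_000, 10)` gives 10 hours, so in the worst case a listing could be that far out of date. Getting blocked or throttled lowers the rate and stretches that window further.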

2 comments

> I used to work at a startup that aggregated apartment listings. Long gone now, we couldn't compete with Zillow or Apartments.com. But what we were doing is just aggregating all of the rental websites we could possibly scrape into one interface.

[...]

> We couldn't have the most up to date data, that was determined by how fast we could go back to scrape a listing. And often times we were knee deep in a battle to avoid being blocked by these companies. Often times, after we had exhausted our proxies, the only thing left to use was Tor.

The flip side is that if you are looking at listings from small property management companies, they are probably low on resources. Their websites aren't very well optimized to serve thousands of concurrent users. Every request to search inventory hits SQL Server, which is probably one small box with no automatic failover. I don't like it, but unless we somehow help these people better optimize their websites (how?), they will continue using heavy-handed tactics like blocking scrapers.

SQL Server should have no problem serving hundreds to thousands of queries per second unless the query is really pathological (which it might be!).

What sort of scam would mention Jesus or God in the description?

They say they're out of the country on a mission trip and are OK renting the place out below market to a responsible person (faith preferred but not required) who will take good care of it. Then they ask for a "viewing fee" before they send you a code for the non-existent lockbox with the keys.

Source: I once got to the "send me a viewing fee in iTunes gift cards" stage of one of these scams. I Googled the initial email and it's a common scam template.

The same as orthographic mistakes in postings or scam emails. They filter for more naive and gullible people.

The kind that is trying to build trust with religious people.