by Tanjreeve 1212 days ago
>Great point, thank you. However, I think this leaves out many companies - those that don't have job postings

Unless you're planning to cold call people and get them to pinky swear to tell you honestly what they're using, or you have some other plan, you're somewhat stuck anyway for companies that don't post jobs.

It's also worth noting that plenty of companies won't really tell you anyway. E.g. plenty of companies will use language like "systems programming language" or "object oriented language" when they could be using anything from the C family to Haskell (leaving aside how secretive many Haskell jobs are, or how many are hidden in custom dialects). You are going to run into all kinds of human BS. It'll be fun, but a can of worms nonetheless.

>I think a job board like startup.jobs would solve this by creating a job archive - then it would be prime scraping material. But it's only a job board with (mainly) current jobs

Not sure how much experience you have modelling data, but capturing postings by date can also be trickier than expected, even leaving aside the fun of unstructured data, differences in models between platforms, and the judgement calls you need to make about where you're crawling.

I cut my teeth scraping property listings off competitor websites, and you come to realise that most boards incentivise people to delete and repost ads to boost their recency score and appear higher in search. So now you have duplicates messing up your data, which you'll want to deal with if you're trying to create value from it.

The classified site doesn't like this either, so it will try to stop people gaming the system, and that game of cat and mouse will normally mess up your scraping and dedupe logic too.
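To make the repost problem concrete, here is a minimal Python sketch of one way to collapse delete-and-repost duplicates: fingerprint each listing by its normalized content rather than by the board's ID, then keep first and last seen dates. The `Posting` fields and the exact-match fingerprint are assumptions for illustration, not anything a particular board provides, and a reposter who tweaks the wording would slip past it.

```python
import hashlib
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Posting:
    posting_id: str  # hypothetical: the board's own ID, which changes on repost
    company: str
    title: str
    body: str
    posted_on: date


def fingerprint(p: Posting) -> str:
    """Hash the normalized content, ignoring the board's ID and the date."""
    text = " ".join([p.company, p.title, p.body]).lower()
    # Collapse punctuation and whitespace so cosmetic edits don't matter.
    text = re.sub(r"[^a-z0-9]+", " ", text).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedupe(postings):
    """Collapse reposts into one record with first/last seen dates."""
    merged = {}
    for p in postings:
        key = fingerprint(p)
        if key not in merged:
            merged[key] = {
                "company": p.company,
                "title": p.title,
                "first_seen": p.posted_on,
                "last_seen": p.posted_on,
                "repost_count": 0,
            }
        else:
            rec = merged[key]
            rec["first_seen"] = min(rec["first_seen"], p.posted_on)
            rec["last_seen"] = max(rec["last_seen"], p.posted_on)
            rec["repost_count"] += 1
    return list(merged.values())
```

Keeping `first_seen` and `last_seen` instead of a single date is the point: a repost then extends the life of one record rather than creating a new one. Catching reworded reposts would need fuzzy matching on top of this, which is where the cat and mouse starts.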

As I said, it's a potentially fun can of worms to open. I was just making a joke about HN commenters' tendency to massively underestimate the oceans of complexity that separate their hello world project from an enterprise grade "just a CRUD app" system that people pay for. E.g. all the people who could totally build Twitter with a SQLite DB and some bash scripts + sellotape etc.

1 comment

Thanks for taking the time to explain!

>You're somewhat stuck anyway for companies that don't post jobs.

Good news on this front. I have manually compiled a limited tech stack DB for roughly 5,000 US startups over the last decade (limited because it only has Frontend and Backend languages/frameworks for each company). Much of this data is current too, thanks to the 2021 boom in jobs. And the majority of startups share their technologies in job descriptions, either by simply listing them or by alluding to them with fun statements like "we welcome skills in Python, Ruby, or JavaScript/Node.js (but Ruby would be ideal)." It's big tech companies that are more likely to be vague, because you could end up being hired into one of a multitude of product groups using different technologies.

On a side note: that Ruby example is one reason scraping will be insufficient, and why language-specific job seekers pull their hair out using most job boards. If you search for Python positions, that Ruby company will pop up because its posting contains the word Python.
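A rough Python sketch of that false positive, using the posting text quoted above. The language list, the preference cue words, and the clause-splitting rule are all assumptions made up for illustration; real postings would need far more than this crude heuristic.

```python
import re

# Illustrative only: a real list would be much longer.
LANGUAGES = ["Python", "Ruby", "JavaScript", "Node.js"]

# Hypothetical cue words suggesting which language the company really wants.
PREFERENCE_CUE = re.compile(r"\b(ideal|preferred|primarily|mainly)\b", re.IGNORECASE)


def naive_match(posting: str, language: str) -> bool:
    """What most job boards effectively do: plain substring search."""
    return language.lower() in posting.lower()


def mentioned_languages(text: str) -> list:
    """Languages from LANGUAGES that appear as whole tokens in the text."""
    found = []
    for lang in LANGUAGES:
        pattern = r"(?<!\w)" + re.escape(lang) + r"(?!\w)"
        if re.search(pattern, text, re.IGNORECASE):
            found.append(lang)
    return found


def preferred_languages(posting: str) -> list:
    """Languages that share a clause with a preference cue."""
    preferred = []
    # Treat parentheses and semicolons as clause boundaries (an assumption).
    for clause in re.split(r"[();]", posting):
        if PREFERENCE_CUE.search(clause):
            for lang in mentioned_languages(clause):
                if lang not in preferred:
                    preferred.append(lang)
    return preferred


def matches_search(posting: str, language: str) -> bool:
    """Match on the stated preference if there is one, else on any mention."""
    preferred = preferred_languages(posting)
    if preferred:
        return language in preferred
    return language in mentioned_languages(posting)


posting = ("we welcome skills in Python, Ruby, or JavaScript/Node.js "
           "(but Ruby would be ideal).")
print(naive_match(posting, "Python"))     # True: the false positive
print(matches_search(posting, "Python"))  # False
print(matches_search(posting, "Ruby"))    # True
```

The heuristic only works here because the preference sits in its own parenthetical. Postings that phrase it differently would need a human or something smarter, which is the argument for compiling this data by hand.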

If this were a viable idea, I'd really need to get deeper into the stack (e.g. "preferred" experience with technologies like AWS, RabbitMQ, Spark, etc.). This is crucial: it gives a hiring manager or recruiter the data not just to hit the basic requirements in candidate sourcing, but to exceed them by delivering those "pluses." But I digress. Perhaps I should wait to see if this idea even has legs before committing the time to dig for these "secondary" technologies.