by genuinelydang 596 days ago
”you could almost build a new kind of Job Search Service that matches job descriptions to job candidates”

The key word being ”almost”. Yes, you can get similarity matches between job requirements and candidate resumes, but those matches are not useful for the task of finding an optimal candidate for a job.

For example, say a job requires A and B.

Candidate 1 is a junior who has done some work with A, B and C.

Candidate 2 is a senior and knows A, B, C, D, E and F by heart. All are relevant to the job and would make 2 the optimal candidate, even though C–F are not explicitly stated in the job requirements.

Candidate 1 would seem a much better candidate than 2, because 1’s embedding vector is closer to the job embedding vector.
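
A toy sketch of the effect described above. It assumes a made-up "embedding" with one dimension per skill and invented proficiency levels (0.3 for the junior, 1.0 for the senior); real text embeddings are dense and learned, so this only illustrates the geometry, not any particular model.

```python
import math

# Hypothetical stand-in for an embedding: one axis per skill.
SKILLS = ["A", "B", "C", "D", "E", "F"]

def skill_vector(levels):
    """Map {skill: proficiency} to a fixed-order vector."""
    return [float(levels.get(s, 0.0)) for s in SKILLS]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

job = skill_vector({"A": 1.0, "B": 1.0})
# Junior: some exposure to A, B and C (assumed level 0.3).
candidate_1 = skill_vector({"A": 0.3, "B": 0.3, "C": 0.3})
# Senior: strong in all six skills (assumed level 1.0).
candidate_2 = skill_vector({s: 1.0 for s in SKILLS})

sim_1 = cosine(job, candidate_1)
sim_2 = cosine(job, candidate_2)
print(f"candidate 1: {sim_1:.3f}, candidate 2: {sim_2:.3f}")
```

Cosine similarity ignores vector length, so the junior's lower proficiency does not count against them, while each extra skill the senior has pulls their vector away from the job's direction.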

5 comments

Even that is just static information.

We don't know if Candidate 2 really "knows A, B, C, D, E and F by heart", just that they claim to. They could be adding whatever to their skill list, even though they hardly used it, just because it's a buzzword.

So Candidate 1 could still blow them out of the water in performance, and even be able to trivially learn D and E in a short while on the job if needed.

The skill vector won't tell much by itself, and can even prevent finding the better candidate if it's used for screening.

> We don't know if Candidate 2 really "knows A, B, C, D, E and F by heart", just that they claim to. They could be adding whatever to their skill list, even though they hardly used it, just because it's a buzzword.

That is indeed a problem. I have been thinking about a possible solution to the very same problem for a while.

The fact: people lie on their resumes, and they do it for different reasons. There are white lies (e.g. pumping something up because they aspire to it but were never given an opportunity to do it, yet they are eager to skill up, learn and do it if given the chance). Then there are other lies. Generally speaking, lies are never black or white, true or false; they are a shade of grey.

So the best idea I have been able to come up with so far is a hybrid solution that combines text embeddings (the skills similarity match and search) with sentiment analysis (to score the sincerity of the information stated on a resume) to gain extra insight into the candidate's intentions. Granted, sentiment analysis is an ethically murky area…

Sincerity score on a resume? I can't tell if you're joking or not. I mean yeah, any sentence that ends in something like "...yeah, that's the ticket." would be detectable for sure, but I'm not sure everyone is as bad a liar as Jon Lovitz.

Are you speaking hypothetically or from your own experience? Sentiment analysis is a thing, and it mostly works – I have tested it with satisfactory results on sample datasets. It is relatively easy to extract the emotional context from a corpus of text, less so when it comes to resumes due to their inherently more condensed content. Which is precisely why I mentioned ethical considerations in my previous response. With extra effort and fine-tuning, it should be possible to overcome most of the false negatives though.

Sure, AI can detect emotional tones (being positive, being negative, even sarcasm sometimes) in writing, so if you mean something like detecting negativity in a resume so it can be thrown immediately in the trash, then I agree that can work. Any negative emotionality is always a red flag.

But insofar as detecting lies in sentences, that simply cannot be done, because even if it ever did work the failure rate would still be 99%, so you're better off flipping a coin.

So your point is that LLMs can't tell when job candidates are lying on their resume? Well, that's true, but neither can humans. lol.

> The key word being ”almost”. Yes, you can get similarity matches between job requirements and candidate resumes, but those matches are not useful for the task of finding an optimal candidate for a job.

Text embeddings are not about matching, they are about extracting the semantic topics and the semantic context. Matching comes next, if required.

If an LLM is used to generate the text embeddings, it would «expand» the semantic context for each keyword. E.g. «GenAI» would make the LLM expand the term into directly and loosely related semantic topics, say, «LLM», «NLP» (with a lesser relevance though), «artificial intelligence», «statistics» (more distant) and so forth. The generated embeddings will result in a much richer semantic context that allows for straightforward similarity search as well as for exploratory radial search. It also works well across languages, provided the LLM was trained on a sufficiently large and linguistically diverse corpus.

Fun fact: I have recently delivered an LLM-assisted (to generate text embeddings) k-NN similarity search for a client of mine. For the hell of it, we searched for «the meaning of life» in Cantonese, English, Korean, Russian and Vietnamese.

It pulled up the same top search result across the entire dataset for the query in English, Korean and Russian. Effectively, it turned into a Babelfish of search.

The Cantonese and Vietnamese versions diverged and were less relevant, as the LLM did not have a substantial corpus in either language. This can be easily fixed in the future: once a new LLM version trained on a better corpus in both Cantonese and Vietnamese is available, the text embeddings on the dataset can simply be regenerated. The implementation won't have to change.
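
A minimal sketch of the k-NN step, assuming the embeddings have already been produced by some model. The three-dimensional vectors, the document ids and the two query vectors are invented for illustration; in practice they would come from whatever embedding model is in use, which is why the search code itself would not need to change when the embeddings are regenerated.

```python
import heapq
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def knn(query_vec, index, k=3):
    """Return the k (doc_id, score) pairs most similar to query_vec."""
    scored = ((doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical precomputed document embeddings.
index = {
    "doc_philosophy": [0.9, 0.1, 0.0],
    "doc_cooking": [0.0, 0.2, 0.9],
    "doc_biology": [0.5, 0.8, 0.1],
}

# Hypothetical embeddings of the same query phrased in two languages.
# A multilingual model is assumed to place them close together.
query_en = [0.80, 0.20, 0.10]
query_ko = [0.85, 0.15, 0.05]

top_en = knn(query_en, index, k=1)
top_ko = knn(query_ko, index, k=1)
print(top_en[0][0], top_ko[0][0])
```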

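The trick is to evaluate the score for each skill, also weighing it by the years of experience with the skill, then sum the evaluations. This would address the Candidate 1 vs. Candidate 2 problem, since every additional relevant skill adds to the total instead of pulling the vector away from the job.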

Also, what a candidate claims as a skill is totally irrelevant and can be a lie. It is the work experience that matters, and skills can be extracted from it.
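
A rough sketch of that scoring, under assumptions of my own: the per-skill relevance values are invented here (they could come from embedding similarity between each skill and the job description), and the years of experience are made-up numbers for the two candidates from the example upthread.

```python
# Hypothetical relevance of each skill to the job (1.0 = explicitly required).
relevance = {"A": 1.0, "B": 1.0, "C": 0.6, "D": 0.5, "E": 0.5, "F": 0.4}

def candidate_score(experience, relevance):
    """Sum of relevance * years over every skill found in the work history."""
    return sum(relevance.get(skill, 0.0) * years for skill, years in experience.items())

# Years of experience per skill, assumed to be extracted from work history.
candidate_1 = {"A": 1, "B": 1, "C": 1}
candidate_2 = {"A": 8, "B": 8, "C": 8, "D": 8, "E": 8, "F": 8}

score_1 = candidate_score(candidate_1, relevance)
score_2 = candidate_score(candidate_2, relevance)
print(score_1, score_2)
```

Because the score is a sum rather than an angle, the senior's extra relevant skills raise the total. Whether raw years is the right weight (as opposed to, say, a capped or log-scaled value) is a separate design choice.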

That's not accurate. You can explicitly bake in these types of search behaviors with model training.

People do this in ecommerce with the concept of user embeddings and product embeddings, where the result of personalized recommendations is just a user embedding search.

> not useful for the task of finding an optimal candidate

That statement is just flat out incorrect on its face, however it did make me think of something I hadn't thought of before, which is this:

Embedding vectors can be made to have a "scale" (multiplier) on specific terms which represents the amount of "weight" to add to that term. For example, if I have 10 years of experience in Java Web Development, then we can take the actual components of that vector embedding (i.e. for the string "Java Web Development") and multiply them by something proportional to 10, and that results in a vector that is "further" in that direction. The length then represents an "amount" of experience along the Java Web direction.

So this means even with vector embeddings we can scale out to specific amounts of experience. Now here's the cool part. You can then take all THOSE scaled vectors (one for each individual job candidate skill) and average them to get a single point in space which CAN be compared as a single scalar distance from what the Job Requirements specify.
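
A small sketch of the scale-then-average idea, with invented three-dimensional vectors standing in for real skill embeddings and invented year counts. The job requirement is assumed to be expressed the same way (skill direction times desired years).

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def profile(skills):
    """Average of unit skill directions, each scaled by years of experience.

    skills: list of (embedding, years) pairs.
    """
    scaled = [[x * years for x in normalize(vec)] for vec, years in skills]
    return [sum(col) / len(scaled) for col in zip(*scaled)]

# Hypothetical embeddings for two skills.
java_web = [0.6, 0.8, 0.0]
sql = [0.0, 0.6, 0.8]

job = profile([(java_web, 8), (sql, 4)])      # what the job asks for
senior = profile([(java_web, 10), (sql, 5)])
junior = profile([(java_web, 1), (sql, 1)])

dist_senior = math.dist(job, senior)
dist_junior = math.dist(job, junior)
print(f"senior: {dist_senior:.2f}, junior: {dist_junior:.2f}")
```

Here the comparison uses Euclidean distance, because the magnitude now carries the experience signal and cosine similarity would discard it.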

Then you would have to renormalize the vectors. You really, really want to keep them at unit length, because that is the special case where cosine similarity equals the dot product and Euclidean distance is a monotonic function of it (d² = 2 − 2·cos), so all three produce the same ranking.
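
A quick numerical check of the unit-length special case, using two arbitrary example vectors: after normalization the dot product equals the cosine similarity, and the squared Euclidean distance equals 2 − 2·cos.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Arbitrary example vectors, normalized to unit length.
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

cos_ab = cosine(a, b)
dot_ab = dot(a, b)
dist_sq = math.dist(a, b) ** 2

print(cos_ab, dot_ab, dist_sq, 2 - 2 * cos_ab)
```

Once vectors are scaled away from unit length, these three measures no longer agree on ranking, so the choice of metric starts to matter.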

I meant the normalized hyperspace direction (unit vector) represents a particular "skill" and the distance into that direction (extending outside the unit hypersphere) is years of experience.

This is geometrically "meaningful", semantically. It would apply to not just a time vector (experience) but in other contexts it could mean other things. Like for example, money invested into a particular sector (Hedge fund apps).

This makes me realize we could design a new type of Perceptron (MLP) where specific scalars for particular things (money, time, etc.) could be wired into the actual NN architecture, in such a way that a specific input "neuron" would be fed a scalar for time, and a different neuron a scalar for money, etc. You'd have to "prefilter" each training input to generate the individual scalars, but then input them into the same "neuron" every time during training. This would have to improve overall "Intelligence" by a big amount.
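
One way to read the "same neuron every time" idea is as a fixed input layout: the text embedding occupies the first slots and each pre-extracted scalar always lands in its own dedicated slot. A minimal sketch of that prefilter step, with a hypothetical `build_input` helper and made-up values; it only builds the input vector and says nothing about whether this improves a model.

```python
def build_input(text_embedding, years, money):
    """Fixed layout: embedding dims first, then one dedicated slot per scalar."""
    return list(text_embedding) + [float(years), float(money)]

# Hypothetical 3-dimensional embedding plus two pre-extracted scalars.
x = build_input([0.6, 0.8, 0.0], years=10, money=0)

YEARS_SLOT = 3   # index of the "time" input
MONEY_SLOT = 4   # index of the "money" input
print(x[YEARS_SLOT], x[MONEY_SLOT])
```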