Ask HN: What kind of scoring algorithms exist?
24 points by esflow 2310 days ago
Hi HN, in my project I need to sort apartments, and there is a lot of data about each of them. I give each parameter a score, e.g. the size of the apartment gets a certain score and so does the location, but it doesn't work well because I don't understand how to weight each parameter. Question: are there guidelines for scoring/sorting things, or algorithms that might help? Or do you have suggestions on where to look for information about such things? Thanks!
9 comments

It's tricky. Read up on Pareto Optimization.

You can't really trade off square footage vs commute length linearly because there is no objective criterion.

What you can do is prove that Apartment A has fewer square feet than Apartment B and a longer commute so A is dominated by B.

Out of your complete set of apartments, you can show that there is a small set that no other apartment dominates. Once you are down to that set, you can make your personal choice from it.
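A minimal sketch of that dominance filter, assuming just two criteria (more square feet is better, a shorter commute is better). The sample apartments and the field names `sqft` and `commute_min` are made up for illustration.

```python
from typing import Dict, List

# Hypothetical sample data: larger size is better, shorter commute is better.
apartments: List[Dict] = [
    {"name": "A", "sqft": 600, "commute_min": 45},
    {"name": "B", "sqft": 800, "commute_min": 30},
    {"name": "C", "sqft": 1200, "commute_min": 50},
    {"name": "D", "sqft": 500, "commute_min": 15},
    {"name": "E", "sqft": 700, "commute_min": 35},
]


def dominates(x: Dict, y: Dict) -> bool:
    """True if x is at least as good as y on both criteria and strictly better on one."""
    at_least_as_good = x["sqft"] >= y["sqft"] and x["commute_min"] <= y["commute_min"]
    strictly_better = x["sqft"] > y["sqft"] or x["commute_min"] < y["commute_min"]
    return at_least_as_good and strictly_better


def pareto_front(items: List[Dict]) -> List[Dict]:
    """Keep every item that no other item dominates (simple O(n^2) scan)."""
    return [a for a in items if not any(dominates(b, a) for b in items)]


front_names = [a["name"] for a in pareto_front(apartments)]
print(front_names)
```

Here A and E drop out because B is both bigger and closer; B, C and D remain, and choosing among them is a matter of personal preference.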

At the moment, this is how I do it: the initial score is 0. Based on the year of construction (newer is better, of course) I add 0-36 points; I multiply the number of square meters by 0.5 and add that as extra points; based on how prestigious the district is I add an extra 0-4 points; and there are many other things.

So why isn't it possible to give a score to each apartment like that, if I find the right weight for each parameter? How does Google rate websites, for example? I guess it gives some kind of score to each of them, or not?

An obvious problem is that the linear score doesn't represent the value I feel I get from the attribute.

For instance, I live in a 2000 sq ft space. I would have to store things if I moved into a 1500 sq ft space, and sell them or throw them out if I moved to a 500 sq ft space.

A 3000 sq ft space would feel spacious to me but I would not get 10x the utility if I had a 30,000 sq ft space because I don't have enough stuff to fill it.

See https://en.wikipedia.org/wiki/Utility
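One way to put the diminishing-returns idea into a score is a saturating utility curve. The sketch below is only an illustration: the exponential shape and the 2000 sq ft scale are assumptions, and any concave function would make the same point.

```python
import math


def size_utility(sqft: float, scale_sqft: float = 2000.0) -> float:
    """Saturating utility of floor area, between 0 and 1.

    The exponential shape and the 2000 sq ft scale are assumptions chosen
    for illustration; they are not taken from the thread.
    """
    if sqft < 0:
        raise ValueError("sqft must be non-negative")
    return 1.0 - math.exp(-sqft / scale_sqft)


for sqft in (500, 1500, 2000, 3000, 30000):
    print(f"{sqft:>6} sq ft -> utility {size_utility(sqft):.3f}")
```

With this curve, going from 500 to 1500 sq ft changes the score a lot, while going from 3000 to 30,000 sq ft barely moves it, which a linear "points per square foot" rule cannot express.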

That's why I'm wondering how to score it. If I give too much weight to the surface area, it definitely won't make sense. At the beginning, surface area and price were the most important, so the first results were a lot of really big, cheap apartments in the suburbs, which didn't make sense. Then I introduced a rating based on district and a few other parameters, and that improved the ranking a lot, but it's still not good enough. I will check out Utility, thanks.
For real estate, a quick and dirty solution that comes to mind is asking price per square foot, assuming that the market itself has taken into account all the common relevant factors. Unless price is something you intend to test against, of course.
But even in the same neighborhood, a very well-furnished apartment and a shell apartment will have very different prices per square foot. The price per square foot can also differ greatly depending on the construction year.
Hopefully I understand your problem correctly...

You can use window functions to do things like dense rank, rank, percentiles, etc on each parameter in order to normalize the data. E.g., this one is at the 63rd percentile in size, 20th in distance, etc. This doesn't work so well if you have lots of 0s in your data.

Or you can find the min and max of each and divide by the max. This one is 42% of max, etc.

In each case you're trying to normalize different parameters to represent something comparable (x/max, percentile, etc.) so you can combine them. You can also do intermediate operations like taking logs or z-scores if you're trying to muffle the effects of outliers.
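The normalizations mentioned above can be sketched in plain Python (in SQL, window functions such as PERCENT_RANK play a similar role). The sample values, the helper names and the 0.6/0.4 weights are all assumptions for illustration.

```python
import math
from statistics import mean, pstdev
from typing import List


def max_scaled(values: List[float]) -> List[float]:
    """Divide by the maximum, so the largest value maps to 1.0."""
    top = max(values)
    return [v / top for v in values]


def percentile_ranks(values: List[float]) -> List[float]:
    """Fraction of values that are less than or equal to each value."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]


def z_scores(values: List[float]) -> List[float]:
    """Distance from the mean in units of (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


sizes_m2 = [40.0, 55.0, 70.0, 300.0]   # hypothetical; the last value is an outlier
distance_km = [5.0, 2.0, 25.0, 10.0]   # hypothetical; smaller is better

size_pct = percentile_ranks(sizes_m2)
# Negating distance flips the order, so the closest apartment ranks highest.
closeness_pct = percentile_ranks([-d for d in distance_km])

# Hypothetical weights; choosing them is still a judgment call.
W_SIZE, W_CLOSENESS = 0.6, 0.4
scores = [W_SIZE * s + W_CLOSENESS * c for s, c in zip(size_pct, closeness_pct)]

print(max_scaled(sizes_m2))                         # the outlier squeezes the rest toward 0
print(max_scaled([math.log(v) for v in sizes_m2]))  # logs spread them out again
print(z_scores(sizes_m2))
print(scores)
```

Once every parameter lives on a comparable 0-to-1 scale, a weight of 0.6 versus 0.4 actually means something, which is not the case when raw square meters are added to raw district points.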

Thanks a lot, I will try it and test how well it works.
The problem you describe is known as "multi-objective optimization".

Normalizing the input data to similar ranges usually helps, but there is no single golden rule for how to choose the weights. It depends on what you want to accomplish.

But regardless of any weighting, in multi-objective optimization problems there is a subset of all items (apartments in your case) that are not dominated by any other item. This set is called the "Pareto front". There are methods to compute this set.

You can't decide which item of the Pareto front is better than another; it is a rock, scissors, paper situation where each item beats the others on some objective and loses on another. But the Pareto front can exclude a lot of items that you then don't need to consider. Each excluded item is dominated by at least one item from the Pareto front: it is no better in any aspect (optimization objective) and worse in at least one.

As computer science students, we often used population-based optimization methods for multi-objective optimization, for example ant colony optimization or evolutionary algorithms.

In these kinds of things I usually start by multiplying (or dividing) the parameters to get a score (weight=1). Then I sort them, see if they look "good" and add weights as needed. My thinking is that the first result should match my preferences and the last result should not. I know it's a kind of confirmation bias, but...
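A rough sketch of that multiply-and-divide approach. One assumption here: the weights are applied as exponents, because multiplying a factor by a constant would not change the order of a product. The listings and field names are made up.

```python
from typing import Dict, List

# Hypothetical listings: "benefit" attributes are multiplied, "cost" ones divided.
listings: List[Dict] = [
    {"name": "A", "size_m2": 50.0, "price": 100_000.0},
    {"name": "B", "size_m2": 80.0, "price": 200_000.0},
    {"name": "C", "size_m2": 65.0, "price": 120_000.0},
]


def product_score(item: Dict, w_size: float = 1.0, w_price: float = 1.0) -> float:
    """Size divided by price, with each factor raised to its weight (assumed form)."""
    return item["size_m2"] ** w_size / item["price"] ** w_price


# Start with every weight at 1, look at the order, then adjust.
ranked_default = [l["name"] for l in sorted(listings, key=product_score, reverse=True)]

# Giving size more weight moves the large apartment B ahead of A.
ranked_size_heavy = [
    l["name"]
    for l in sorted(listings, key=lambda l: product_score(l, w_size=2.0), reverse=True)
]

print(ranked_default)
print(ranked_size_heavy)
```

The loop of "sort, eyeball the top and bottom, adjust a weight" is exactly the manual tuning described above, so the confirmation-bias caveat applies to it as well.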

When I bought my car years ago, I had some parameters, but comparing prices was harder because of depreciation. I remember assuming a 15% yearly depreciation so that I could compare prices. For instance, a car from 2000 priced at 1000€ was roughly equivalent in price to a car from 1999 priced at 850€.
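The car comparison can be written as a small adjustment function. This is a sketch that assumes the 15% compounds for each extra year of age; the comment only gives the one-year case, and the function name is hypothetical.

```python
def equivalent_price(price: float, model_year: int, reference_year: int,
                     yearly_depreciation: float = 0.15) -> float:
    """Price the car would be expected to have as a reference_year model.

    Assumes value shrinks by a fixed fraction per year of age (compounding),
    which is an illustrative assumption.
    """
    years_older = reference_year - model_year
    return price / (1.0 - yearly_depreciation) ** years_older


# The example from the comment: a 1999 car at 850 vs. a 2000 car at 1000.
adjusted_1999 = equivalent_price(850.0, model_year=1999, reference_year=2000)
adjusted_2000 = equivalent_price(1000.0, model_year=2000, reference_year=2000)
print(round(adjusted_1999, 2), round(adjusted_2000, 2))
```

After the adjustment both cars land at about the same price, so any remaining difference between two listings can be attributed to the other parameters. The same trick could be applied to construction year for apartments, if a plausible rate is known.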

I make sure to always include a "Preference" (aka bias) parameter and give it more or less weight depending on how much it harms my results.

If I were you, I would look into PCA and cluster analysis. PCA can help find the key (reduced) set of drivers/dimensions that explain the majority of your variability. Cluster analysis could help you group your data into a meaningful, reduced set of groups; you can then measure dimensions within and between clusters to help you come up with a strong scoring system. Again, this assumes you have a relatively high-dimensional problem with a lot of data.
Yes, there are several thousand apartments. What should I cluster them on?
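To make the PCA suggestion concrete, here is a dependency-free sketch that approximates only the first principal component by power iteration. In practice a library such as scikit-learn or NumPy would be the usual choice. The three columns (standardized size, rooms, floor) and their values are made up.

```python
import math
from typing import List


def first_principal_component(rows: List[List[float]], iterations: int = 200) -> List[float]:
    """Approximate the top eigenvector of the covariance matrix by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    vec = [1.0] * d  # arbitrary starting direction
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


# Hypothetical, already standardized columns: size, rooms, floor.
rows = [
    [-1.2, -1.0, 0.5],
    [-0.4, -0.5, -1.0],
    [0.3, 0.4, 1.0],
    [1.3, 1.1, -0.5],
]
pc1 = first_principal_component(rows)
print([round(x, 3) for x in pc1])
```

In this toy data, size and rooms move together, so the first component loads mostly on those two and very little on floor. That suggests they could be treated as one "how big is it" dimension instead of being weighted twice.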
Well, rank based on what? You need a metric to optimise for.

Maybe that's price, or maybe that's user clicks or bookings if you're Airbnb.

Rank based on apartment data, like surface area, district, price, year of construction, number of rooms, etc. So, based on that, rate each apartment and, in the end, sort them from best to worst :)
Again, you don't have a metric to optimize.

You could, for a simple example, label a number of these properties yourself, with values. Say you could go rate 1000 of them with a rating out of 10.

Then you could learn the importance for each of these parameters by trying to approximate your rating (regression of some kind).
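As a sketch of "regression of some kind", the snippet below fits ordinary least squares to a handful of labeled apartments using only the standard library. The features, the ratings and the helper names are hypothetical; the ratings were constructed as 2 + 5*size + 3*closeness so that the recovered weights can be checked.

```python
from typing import List


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_weights(features: List[List[float]], ratings: List[float]) -> List[float]:
    """Ordinary least squares via the normal equations: [intercept, w_1, ..., w_d]."""
    rows = [[1.0] + f for f in features]  # prepend a constant for the intercept
    d = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * y for r, y in zip(rows, ratings)) for i in range(d)]
    return solve_linear_system(xtx, xty)


# Hypothetical normalized features per apartment: [size, closeness].
features = [[0.2, 0.9], [0.5, 0.5], [0.8, 0.1], [0.4, 0.7], [0.9, 0.6], [0.1, 0.2]]
# Made-up hand ratings out of 10, built from 2 + 5*size + 3*closeness.
hand_ratings = [5.7, 6.0, 6.3, 6.1, 8.3, 3.1]

intercept, w_size, w_closeness = fit_weights(features, hand_ratings)
print(round(intercept, 3), round(w_size, 3), round(w_closeness, 3))
```

Real hand ratings would be noisy and the fit would not be exact, but the learned coefficients would still answer the original question of how much weight each parameter deserves, relative to the labels provided.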

Alternatively, because labeling a lot of these accurately is a non-trivial task, you can look into generating these ratings.

In learning to rank for search, for example, you can extract this information from the interactions users have with your items.

Cosine similarity is a standard method of searching multi-dimensional data. https://en.wikipedia.org/wiki/Cosine_similarity
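A minimal sketch of cosine similarity used to rank apartments against an "ideal" profile. The profile, the listings and the three normalized features (size, closeness, newness) are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical normalized features: [size, closeness, newness].
ideal_profile = [0.8, 0.9, 0.5]
candidates: Dict[str, List[float]] = {
    "A": [0.9, 0.2, 0.4],
    "B": [0.7, 0.8, 0.5],
    "C": [0.1, 0.9, 0.9],
}

by_similarity = sorted(
    candidates,
    key=lambda name: cosine_similarity(candidates[name], ideal_profile),
    reverse=True,
)
print(by_similarity)
```

Cosine similarity compares direction only, not magnitude: an apartment with every feature halved scores the same as the original. The features therefore need to be normalized to comparable ranges first.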
Thanks, I will check it out!
The 37% rule from the book Algorithms to Live By.
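For context, the 37% rule is an optimal-stopping heuristic. When options arrive one at a time and cannot be revisited, you look at roughly the first 37% (1/e) without committing, then take the first option that beats everything seen so far. A small sketch, with made-up scores and a hypothetical function name:

```python
import math
from typing import List


def pick_with_37_percent_rule(scores: List[float]) -> int:
    """Return the index chosen by the look-then-leap rule.

    Scores are revealed in order; a rejected option cannot be revisited.
    """
    if not scores:
        raise ValueError("need at least one option")
    n = len(scores)
    cutoff = int(n / math.e)  # roughly the first 37% are only observed
    best_seen = max(scores[:cutoff], default=float("-inf"))
    for i in range(cutoff, n):
        if scores[i] > best_seen:
            return i
    return n - 1  # nothing beat the benchmark, so the last option is taken


viewings = [3.0, 7.0, 5.0, 9.0, 4.0, 8.0, 6.0, 2.0, 10.0, 1.0]  # hypothetical
chosen = pick_with_37_percent_rule(viewings)
print(chosen, viewings[chosen])
```

This addresses a different situation from sorting a full list. It is about when to stop looking during sequential viewings, so it complements a scoring function rather than replacing one.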
Thanks, sounds interesting, I will check it.