Hacker News
by mgraczyk 1716 days ago
I worked on the things you mentioned at Instagram, doing the tuning this-and-that for about a year on a few different surfaces including the home feed.

You are right that for the most part, people look at metrics and make decisions about what to ship, and only the best engineers and data scientists spend time thinking about the actual product.

However, you're wrong about what metrics are important. Since at least early 2020, and to some extent since 2016, there is a hard and enforced constraint on so-called "wellbeing" and "integrity" metrics. Facebook actively measures the sort of things reported in the WSJ piece (self-reported wellbeing) as well as many others like "bullying" (as measured by human reviewers), "known misinformation", "hate speech", etc.

When engineers make changes to feed, they are generally not allowed to regress these non-engagement metrics. The focus of many shipping conversations is how to address even unmeasured potential risks to these metrics. A huge number of experiments are run specifically targeted at improving these metrics.

4 comments

Appreciate the insights. It makes sense that this class of metrics has made it into the decision-making process, and I am glad it is happening.

The two concerns that come to mind, without deeply understanding the problem, are: 1) Measuring a qualitative, nebulous metric like "wellbeing" (which could mean different things for different people) is likely very hard to do right. 2) In my experience, things tend to move fast, and experiments often don't run for _that_ long. I would hypothesize that Facebook's negative effects on users compound and only emerge over the scale of months. Sure, you can leave a small % of users in a holdout group of your experiment, but how often is that getting revisited?

I do like the idea that there are teams out there that are taking it as a goal to positively move these non-engagement metrics. If FB is going to correct course then steps like this are a big part of that.

Yes, wellbeing is a very hard thing to measure. I didn't work on it directly so I can't really weigh in on the high level philosophy, but the general strategy seems to be to measure a lot of things.

As for the holdouts, people do revisit them extremely often. I'd say Instagram does holdouts better than any other place I'm familiar with (better than most of Facebook). For higher-level engineers and product managers (5-6+), the holdouts are one of the biggest signals for performance review.

Others pointed out that it's hard to measure subjective factors well. I'd point to a different issue: if you have a metric for, say, wellbeing that is correlated with what you actually care about, putting pressure on other parts of the system (like maximizing engagement) will systematically warp that metric and make it less accurate.

For an extensive discussion of how this can happen, see: https://arxiv.org/abs/1803.04585

That's a good point, and definitely happens in places where I've worked.

On the other hand, I think Facebook is pretty good about constantly reevaluating metrics and trying to make sure that they track what the company actually cares about. The mechanism for this is partially embarrassment avoidance. If there are obvious egregious examples of violations that are not tracked by metrics, employees loudly complain and the company culture expects those responsible to explain what went wrong and how it will be fixed (better metrics).

From what I've seen in practice, this usually results in changing the engagement metrics rather than the well-being metrics. For example FB changed most raw engagement metrics to "authentic engagement" metrics at some point while I was there. Instead of counting total likes, you count likes from accounts that are not deemed to have participated in "inauthentic engagement" (you can read FB's blog for definitions).
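
To make the "authentic engagement" idea concrete, here is a minimal sketch of the difference between the two counts. Everything in it is an assumption for illustration: the `Like` record, the function names, and the idea that flagged accounts arrive as a plain set. It is not how FB actually implements it.

```python
from typing import AbstractSet, Iterable, NamedTuple


class Like(NamedTuple):
    """Hypothetical like event: which account liked which post."""
    account_id: str
    post_id: str


def raw_likes(likes: Iterable[Like], post_id: str) -> int:
    """Count every like on the post, regardless of who it came from."""
    return sum(1 for like in likes if like.post_id == post_id)


def authentic_likes(
    likes: Iterable[Like], post_id: str, flagged_accounts: AbstractSet[str]
) -> int:
    """Count likes on the post, skipping accounts flagged as inauthentic."""
    return sum(
        1
        for like in likes
        if like.post_id == post_id and like.account_id not in flagged_accounts
    )


# Toy data: two ordinary accounts and two flagged ones like post "p1".
likes = [
    Like("a", "p1"),
    Like("b", "p1"),
    Like("bot1", "p1"),
    Like("bot2", "p1"),
    Like("a", "p2"),
]
flagged = {"bot1", "bot2"}

print(raw_likes(likes, "p1"))                 # 4
print(authentic_likes(likes, "p1", flagged))  # 2
```

The point is that a ranking change that only attracts flagged accounts moves the first number and not the second.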

Can you then explain why, as the whistleblower alleges, all the programs to keep the newsfeed "clean" for the 2020 election campaign were turned off a month or two after the election? This would certainly have led to a degradation of all the non-engagement metrics. It seems inconsistent with what is being leaked right now from within FB.

In my opinion, what you describe is what Facebook wants people to believe, while it actively undermines it and prevents it from happening internally. In other words, tracking wellbeing / non-engagement metrics and everything around it is PR that seemingly even employees are made to believe.

I think that's a mischaracterization of what happened.

I can't speak to most of the product surfaces, but for those that I'm familiar with, the most accurate description of what happened is that approved, tested changes that were known to affect so-called "civic integrity" were delayed until after the election, to avoid breaking anything or regressing civic integrity beforehand.

For this next part I'm mostly just speculating, but I think I have a more informed opinion on this than most outside of Facebook: It's important to understand how FB measures civic integrity. Facebook generally uses "prevalence" metrics for these things, which look something like "the percent of sessions in which the viewer saw at least one item classified as X", where here X would be something like "civic misinformation" or "inauthentic civic engagement". After the election, bots and bad actors were much less active and invested, so prevalence automatically went down. Since FB makes shipping decisions in part based on prevalence, this decrease means that there is more "budget" to regress these metrics.

Put another way, Facebook sets goals about the overall prevalence of bad content, so when that bad content goes away for exogenous reasons, Facebook can do more things that trade off engagement metrics for prevalence of bad content.
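
Here is a rough sketch of the prevalence definition quoted above, plus the "budget" reasoning. The data layout (a dict from session ID to the items seen), the 40% goal, and the `regression_budget` helper are all made up for illustration. The real pipelines and thresholds are not something I can describe.

```python
from typing import Dict, List, Set


def prevalence(sessions: Dict[str, List[str]], flagged_items: Set[str]) -> float:
    """Percent of sessions in which the viewer saw at least one flagged item."""
    if not sessions:
        return 0.0
    hits = sum(
        1
        for items in sessions.values()
        if any(item in flagged_items for item in items)
    )
    return 100.0 * hits / len(sessions)


def regression_budget(goal_pct: float, current_pct: float) -> float:
    """Hypothetical headroom: how far prevalence can rise before hitting the goal."""
    return max(0.0, goal_pct - current_pct)


# Toy data: four sessions and the item IDs each viewer saw.
sessions = {
    "s1": ["a", "b"],
    "s2": ["c"],
    "s3": ["d", "x"],
    "s4": ["e"],
}

# During the election, both "b" and "x" are classified as bad content.
flagged_during = {"b", "x"}
# Afterwards, bad actors are less active, so fewer items get classified.
flagged_after = {"x"}

GOAL_PCT = 40.0  # made-up goal for illustration

during = prevalence(sessions, flagged_during)  # 50.0
after = prevalence(sessions, flagged_after)    # 25.0

print(regression_budget(GOAL_PCT, during))  # 0.0, no room to regress
print(regression_budget(GOAL_PCT, after))   # 15.0, room opened up exogenously
```

Nothing about the product changed between the two measurements. The headroom appears only because less bad content is being produced.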

> The focus of many shipping conversations is how to address even unmeasured potential risks to these metrics. A huge number of experiments are run specifically targeted at improving these metrics.

How do you run tests on unmeasured metrics?

How to "address", not "test". For example, you can add logging for specific cases or patterns that you expect to be problematic. You can spot check scores on individual ranked entities. You can do additional analysis or run experiments on certain subsets of the userbase to measure the impact on important groups of users.

How do you "measure the impact on important groups of users"?
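
One simple reading of the subset analysis mentioned above is to break an experiment's results out by user group instead of only looking at the overall average. This is only a sketch under assumptions: the group names, the `(group, arm, value)` row format, and the metric (imagine some self-reported score) are all hypothetical, and a real analysis would also need confidence intervals.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

Row = Tuple[str, str, float]  # (group, arm, metric value); arm is "control" or "test"


def overall_delta(rows: Iterable[Row]) -> float:
    """Mean of the test arm minus mean of the control arm, across all users."""
    rows = list(rows)
    control = [value for _, arm, value in rows if arm == "control"]
    test = [value for _, arm, value in rows if arm == "test"]
    return mean(test) - mean(control)


def impact_by_group(rows: Iterable[Row]) -> Dict[str, float]:
    """Same delta, computed separately for each user group."""
    values = defaultdict(lambda: {"control": [], "test": []})
    for group, arm, value in rows:
        values[group][arm].append(value)
    return {
        group: mean(arms["test"]) - mean(arms["control"])
        for group, arms in values.items()
        if arms["control"] and arms["test"]  # skip groups missing an arm
    }


# Toy data: the overall effect is flat, but the two groups move in opposite directions.
rows = [
    ("new_users", "control", 4.0),
    ("new_users", "control", 6.0),
    ("new_users", "test", 3.0),
    ("new_users", "test", 5.0),
    ("long_time_users", "control", 2.0),
    ("long_time_users", "control", 4.0),
    ("long_time_users", "test", 4.0),
    ("long_time_users", "test", 4.0),
]

print(overall_delta(rows))    # 0.0
print(impact_by_group(rows))  # {'new_users': -1.0, 'long_time_users': 1.0}
```

A change that looks neutral in aggregate can still regress a group you care about, which is why the per-group breakdown matters.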