

	by calrueb 1716 days ago
	I like to imagine there are a bunch of Facebook Homefeed engineers out there who ran dozens of experiments, tuning this-and-that, and found the ones that boosted revenue and user engagement by X%. No one thought too deeply about _why_ some experiments were successful. Perhaps it was chalked up to "our new ML model is matching users to content they care about more efficiently". Then they shipped it, and everyone celebrated their success. My theory is that experiment dashboards, data visualizations, and "metrics" allow employees to almost fully dissociate from their product and the end users. Engineers at large companies don't need to use, or even be familiar with, the products they spend all day building. Yes, leadership and small pockets (researchers) can see the full picture. Everyone else stays willfully in the dark, doesn't look too closely, and clocks out at 5, happy because their dashboards look good.

11 comments

sp332 1716 days ago

I think over the years there have been enough news articles about Facebook recommending conspiracy theory groups for most employees to have a good idea of how engagement was driven. Every news organization has researched how to get clicks, so it's no secret that articles with more negative headlines get more engagement. Remember the employee walkout? There is plenty of awareness within the company about how the sausage gets made.

scns 1716 days ago

My guess would be that weighting negative stimuli more heavily was an evolutionary advantage. The individuals who listened to the birds did not survive; those wary of predators did.

mgraczyk 1716 days ago

I worked on the things you mentioned at Instagram, doing the this-and-that tuning for about a year on a few different surfaces, including the home feed.

You are right that for the most part, people look at metrics and make decisions about what to ship, and only the best engineers and data scientists spend time thinking about the actual product.

However, you're wrong about which metrics are important. Since at least early 2020, and to some extent since 2016, there has been a hard, enforced constraint on so-called "wellbeing" and "integrity" metrics. Facebook actively measures the sort of things reported in the WSJ piece (self-reported wellbeing) as well as many others, like "bullying" (as measured by human reviewers), "known misinformation", "hate speech", etc.

When engineers make changes to feed, they are generally not allowed to regress these non-engagement metrics. The focus of many shipping conversations is how to address even unmeasured potential risks to these metrics. A huge number of experiments are run specifically targeted at improving these metrics.
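A minimal sketch of the kind of guardrail rule described here, where non-engagement metrics act as hard constraints on a launch rather than as things to trade off. The metric names, the `MetricDelta` type, the zero tolerance and the use of bare point estimates (no confidence intervals) are all hypothetical simplifications, not Facebook's actual tooling.

```python
from dataclasses import dataclass


@dataclass
class MetricDelta:
    name: str                # hypothetical metric name
    relative_change: float   # (treatment - control) / control
    higher_is_better: bool


def gain(m: MetricDelta) -> float:
    """Signed change where positive always means 'better'."""
    return m.relative_change if m.higher_is_better else -m.relative_change


def regresses(m: MetricDelta, tolerance: float = 0.0) -> bool:
    """True if the metric got worse by more than `tolerance`."""
    return -gain(m) > tolerance


def ship_decision(primary: MetricDelta, guardrails: list[MetricDelta],
                  tolerance: float = 0.0) -> bool:
    """Guardrails are hard constraints: any regression blocks the launch,
    no matter how large the engagement win is."""
    if any(regresses(g, tolerance) for g in guardrails):
        return False
    return gain(primary) > 0


# Hypothetical experiment readout.
engagement = MetricDelta("time_spent", +0.012, higher_is_better=True)
guardrails = [
    MetricDelta("bullying_prevalence", +0.03, higher_is_better=False),
    MetricDelta("self_reported_wellbeing", 0.0, higher_is_better=True),
]
print(ship_decision(engagement, guardrails))  # False: bullying prevalence went up
```

The point of the structure is that the engagement metric is only consulted after every guardrail has passed, so no engagement gain can buy back a guardrail regression.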

calrueb 1716 days ago

Appreciate the insights. It makes sense that this class of metrics has made it into the decision-making process, and I am glad it is happening.

Two concerns come to mind, without deeply understanding the problem: 1) measuring a qualitative, nebulous metric like "wellbeing" (which could mean different things to different people) is likely very hard to do right; 2) in my experience, things tend to move fast, and experiments often don't run for _that_ long. I would hypothesize that Facebook's negative effect on users is a compounding one that emerges over months. Sure, you can leave a small % of users in a holdout group of your experiment, but how often is that getting revisited?

I do like the idea that there are teams out there that are taking it as a goal to positively move these non-engagement metrics. If FB is going to correct course then steps like this are a big part of that.

mgraczyk 1716 days ago

Yes, wellbeing is a very hard thing to measure. I didn't work on it directly so I can't really weigh in on the high level philosophy, but the general strategy seems to be to measure a lot of things.

As for the holdouts, people revisit them extremely often. I'd say Instagram does holdouts better than any other place I'm familiar with (better than most of Facebook). For higher-level engineers and product managers (levels 5-6 and above), the holdouts are one of the biggest signals in performance reviews.
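For readers unfamiliar with long-running holdouts: a common way to keep a small, stable slice of users on the old experience for months is deterministic hash-based bucketing. The sketch below assumes a 5% holdout and a hypothetical experiment name; it illustrates the general technique, not how Instagram implements it.

```python
import hashlib


def in_holdout(user_id: str, experiment: str, holdout_pct: float = 5.0) -> bool:
    """Deterministically place a stable slice of users in a long-term holdout.

    Hashing (experiment, user_id) means the same user always lands in the same
    bucket, so the holdout can be revisited months later."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000   # bucket in 0..9999
    return bucket < holdout_pct * 100       # 5.0% -> buckets 0..499


# Hypothetical user ids and experiment name.
users = [f"user{i}" for i in range(100_000)]
share = sum(in_holdout(u, "feed_ranking_v2") for u in users) / len(users)
print(round(share, 3))  # roughly 0.05
```

Salting the hash with the experiment name keeps different experiments' holdouts independent of each other, while the same user stays in the same group for the life of one experiment.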

davidmanheim 1704 days ago

Others have pointed out that it's hard to measure subjective factors well. I'd point to a different issue: if you have a metric for, say, wellbeing that is correlated with what you actually care about, putting pressure on other parts of the system, like maximizing engagement, will systematically warp those metrics and make them less accurate.

For an extensive discussion of how this can happen, see: https://arxiv.org/abs/1803.04585

mgraczyk 1704 days ago

That's a good point, and definitely happens in places where I've worked.

On the other hand, I think Facebook is pretty good about constantly reevaluating metrics and trying to make sure that they track what the company actually cares about. The mechanism for this is partially embarrassment avoidance. If there are obvious egregious examples of violations that are not tracked by metrics, employees loudly complain and the company culture expects those responsible to explain what went wrong and how it will be fixed (better metrics).

From what I've seen in practice, this usually results in changing the engagement metrics rather than the wellbeing metrics. For example, FB changed most raw engagement metrics to "authentic engagement" metrics at some point while I was there. Instead of counting total likes, you count likes from accounts that are not deemed to have participated in "inauthentic engagement" (you can read FB's blog for definitions).
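The raw vs. "authentic" like count described here can be illustrated with a toy example. The data format and the flagged `bot_17` account are hypothetical, and how accounts actually get classified as inauthentic (the hard part) is not modelled at all.

```python
def like_counts(likes: list[tuple[str, str]],
                inauthentic_accounts: set[str]) -> tuple[int, int]:
    """likes: (liker_account_id, post_id) pairs (hypothetical format).

    Returns (raw_likes, authentic_likes), where authentic likes exclude
    any account flagged for inauthentic engagement."""
    raw = len(likes)
    authentic = sum(1 for account, _post in likes
                    if account not in inauthentic_accounts)
    return raw, authentic


likes = [("alice", "p1"), ("bot_17", "p1"), ("bob", "p1"), ("bot_17", "p2")]
print(like_counts(likes, inauthentic_accounts={"bot_17"}))  # (4, 2)
```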

ricw 1716 days ago

Can you then explain why, as the whistleblower alleges, all the programs to keep the newsfeed "clean" for the 2020 election campaign were turned off a month or two after the election? This would certainly have led to a degradation of all the non-engagement metrics. It seems inconsistent with what is being leaked right now from within FB.

In my opinion, what you describe is what Facebook wants people to believe, while actively undermining it and internally preventing it from happening. In other words, tracking wellbeing / non-engagement metrics and everything around it is PR that seemingly even employees are made to believe.

mgraczyk 1716 days ago

I think that's a mischaracterization of what happened.

I can't speak to most of the product surfaces, but for those I'm familiar with, the most accurate description of what happened is that approved, tested changes that were known to affect so-called "civic integrity" were delayed until after the election, to avoid breaking anything or regressing civic integrity during the election.

For this next part I'm mostly just speculating, but I think I have a more informed opinion on this than most people outside of Facebook: it's important to understand how FB measures civic integrity. Facebook generally uses "prevalence" metrics for these things, which look something like "the percent of sessions in which the viewer saw at least one item classified as X", where X would be something like "civic misinformation" or "inauthentic civic engagement". After the election, bots and bad actors were much less active and invested, so prevalence automatically went down. Since FB makes shipping decisions in part based on prevalence, this decrease means that there is more "budget" to regress these metrics.

Put another way, Facebook sets goals about the overall prevalence of bad content, so when that bad content goes away for exogenous reasons, Facebook can do more things that trade off engagement metrics for prevalence of bad content.
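The prevalence metric and the "budget" idea described above can be sketched roughly as follows. This only illustrates the commenter's description, not Facebook's code: the session/item data format, the label names, the goal value and the simple additive budget check are all assumptions.

```python
Item = set[str]        # labels a classifier assigned to one viewed item (hypothetical)
Session = list[Item]   # the items a viewer saw in one session


def prevalence(sessions: list[Session], label: str) -> float:
    """Fraction of sessions with at least one viewed item labelled `label`."""
    if not sessions:
        return 0.0
    hits = sum(1 for s in sessions if any(label in item for item in s))
    return hits / len(sessions)


def fits_budget(goal: float, baseline: float, estimated_increase: float) -> bool:
    """A change may ship only if baseline + its estimated increase stays under the goal."""
    return baseline + estimated_increase <= goal


sessions = [
    [{"sports"}, {"civic_misinfo"}],
    [{"pets"}],
    [{"civic", "civic_misinfo"}, {"civic_misinfo"}],  # counted once, not twice
    [],
]
print(prevalence(sessions, "civic_misinfo"))  # 0.5

# Hypothetical goal of 1.0% and a change estimated to add 0.2 points of prevalence.
print(fits_budget(0.010, baseline=0.009, estimated_increase=0.002))  # False
print(fits_budget(0.010, baseline=0.006, estimated_increase=0.002))  # True
```

The same change is blocked against the higher baseline and allowed against the lower one, which is the mechanism the comment speculates about: an exogenous drop in bad content frees up room under a fixed goal.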

tobltobs 1716 days ago

> The focus of many shipping conversations is how to address even unmeasured potential risks to these metrics. A huge number of experiments are run specifically targeted at improving these metrics.

How do you run tests on unmeasured metrics?

mgraczyk 1716 days ago

How to "address", not "test". For example, you can add logging for specific cases or patterns that you expect to be problematic. You can spot check scores on individual ranked entities. You can do additional analysis or run experiments on certain subsets of the userbase to measure the impact on important groups of users.
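One common form of the "additional analysis" mentioned here is breaking an A/B result down by user segment, since a pooled win can hide a loss for a small group. A rough sketch, assuming per-user rows tagged with a hypothetical segment label and a hypothetical metric value; a real analysis would also need per-segment confidence intervals.

```python
from collections import defaultdict
from statistics import mean


def per_segment_lift(rows: list[tuple[str, str, float]]) -> dict[str, float]:
    """rows: (segment, arm, value) with arm in {"control", "treatment"}.

    Returns the relative lift of the treatment mean over the control mean
    for each segment that has data in both arms."""
    groups: dict[str, dict[str, list[float]]] = defaultdict(
        lambda: {"control": [], "treatment": []})
    for segment, arm, value in rows:
        groups[segment][arm].append(value)

    lifts = {}
    for segment, arms in groups.items():
        if not arms["control"] or not arms["treatment"]:
            continue  # cannot compare without both arms
        control_mean = mean(arms["control"])
        if control_mean == 0:
            continue  # relative lift undefined
        lifts[segment] = (mean(arms["treatment"]) - control_mean) / control_mean
    return lifts


# Hypothetical data: pooled over all rows, treatment looks better (9.75 vs 9.5),
# but the small "teens" segment moves the other way.
rows = [("all_other", "control", 10), ("all_other", "treatment", 11)] * 3 + [
    ("teens", "control", 8), ("teens", "treatment", 6),
]
print(per_segment_lift(rows))  # {'all_other': 0.1, 'teens': -0.25}
```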

tobltobs 1716 days ago

How do you "measure the impact on important groups of users"?

rossdavidh 1716 days ago

Almost precisely the problem with using money as a measure of everything. I'm not opposed to money, I'm even in favor of free markets, but this is a real problem. "What gets measured gets improved". Hence, what is not measured is what gets sacrificed.

reilly3000 1716 days ago

Absolutely. Aggregates are the enemy of insights in many many cases.

samstave 1716 days ago

FB does this in spades, as does Google. Their goal with AI/ML is to have these tweaks done by AI, and let the humans just take glee in seeing their shipped code working and producing results that make Zuck's coffers happy.

FB ran a language experiment across millions of users, varying greeting sentiment and other syntax/context, to see which types of phrases got the highest engagement (I can't find the story; it was on NPR a few years ago).

When I was at FB they had a "build your own bar" event where each department was to build its own bar for happy hours, and then FB would give ~$400 or something like that to stock it up...

Down the row from us was some sort of ML/BI group. They had their bar up super fast and seemed to be having happy hours every day... a bunch of weird folks on that team; super smart, weird folks.

pradn 1716 days ago

There's also the element of stock comprising 30-50% of total compensation. Don't rock the boat too much.

annexrichmond 1716 days ago

Indeed, this is part of the reason why I left my last job. I called it EDD (experiment-driven development). That's all they did for several months. I was cynical about it for many reasons:

- the premises of the experiments were often silly, basically just growth hacking to see if we could move the needle a little

- I was hardly convinced that they were "valid" experiments: Was it a valid sample? Were we recording data correctly? Were we querying data correctly? All of this was done by engineers with minimal statistics background

- as OP said, the dissociation from users. Users were just aggregate numbers (i.e., sums of clicks). I wanted a more tangible feedback loop, like knowing more directly what users are thinking and how they are using the product.

Now I do Infrastructure.

runawaybottle 1716 days ago

To add to your point, there’s not a soul in tech that ever went ‘I wonder if the results of this A/B test are ethical?’. Our industry is just not built with this sensibility, the same way Finance is not self-aware of greed.

If it makes money in Finance and meets regulatory guidelines, all is fair.

If it’s the optimal solution in tech, and doesn’t break laws or cause a noticeable usage drop-off, all is fair. When the FB algos drop engagement, we’ll see quasi-ethics from the company (as perceived by us; to them, it’ll just be an optimization).

mgraczyk 1716 days ago

This isn't even close to being true. For example, when we ran A/B tests at Instagram, we would often dig into the results with breakdowns by important protected demographics. Engineers and data scientists who cared about justice would go out of their way to build tools and spend time ensuring that "good" changes didn't adversely affect small groups which were hard to measure.

I personally ran analysis like this to detect high and unexpected latency on people with cheaper cell phones (disproportionately minorities in the US). The results of my analysis led to changes that reduced this disparity (although the changes were minor and helpful for other reasons).
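The kind of disparity check described here can be sketched as comparing per-segment latency against the overall distribution. The device-segment labels, the use of medians, the 1.5x threshold and the sample values are all assumptions for illustration; the comment does not say how the original analysis was done.

```python
from collections import defaultdict
from statistics import median


def median_latency_by_segment(samples: list[tuple[str, float]]) -> dict[str, float]:
    """samples: (device_segment, latency_ms) pairs (hypothetical format)."""
    by_segment: dict[str, list[float]] = defaultdict(list)
    for segment, latency_ms in samples:
        by_segment[segment].append(latency_ms)
    return {seg: median(vals) for seg, vals in by_segment.items()}


def slow_segments(samples: list[tuple[str, float]], ratio: float = 1.5) -> list[str]:
    """Segments whose median latency exceeds `ratio` times the overall median."""
    overall = median(latency for _, latency in samples)
    per_segment = median_latency_by_segment(samples)
    return sorted(seg for seg, m in per_segment.items() if m > ratio * overall)


# Hypothetical measurements: most sessions come from faster devices.
samples = ([("high_end", ms) for ms in (120, 125, 130, 135, 140, 145)]
           + [("low_end", ms) for ms in (400, 410, 420)])
print(median_latency_by_segment(samples))  # {'high_end': 132.5, 'low_end': 410}
print(slow_segments(samples))              # ['low_end']
```

Because the slower segment is a minority of sessions, the overall median (140 ms here) looks healthy; only the per-segment view exposes the gap.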

prancer_or_vix 1716 days ago

> I personally ran analysis like this to detect high and unexpected latency on people with cheaper cell phones (disproportionately minorities in the US). The results of my analysis led to changes that reduced this disparity

Something tells me that "the poors" not having access to the problematic content/software isn't the ethical dilemma that's being discussed.

That's like saying "I ran the analysis that determined that powder cocaine being expensive was causing a disparity in access, so I helped invent crack cocaine so that even minorities could experience cocaine addiction".

jedberg 1716 days ago

Definitely not true. There were tests we ran at Netflix that would be considered "successful" based on metrics but were not implemented for social reasons, most often because they increased engagement with kids too much.

dleslie 1716 days ago

> To add to your point, there’s not a soul in tech that ever went ‘I wonder if the results of this A/B test are ethical?’.

Speak for yourself. I've definitely done this.

colordrops 1716 days ago

I worked with Facebook for a short period in 2011 when I was at a large social gaming company, and it was completely obvious then that they were sociopaths who intentionally took advantage of base instincts for profit. These people made a game of it, and openly slept around on their spouses as a show of being part of the in-group. I can't imagine that it's gotten more "disconnected" and better since then.

I understand that it's a nice gesture to give others the benefit of the doubt with charitable interpretations, but don't be naive about it: Facebook doesn't deserve this.

colordrops 1716 days ago

For those downvoting, I should have made it clear that the FB employees, as well as employees at the company I worked at, explicitly discussed using psychological tricks to milk people, literally talking about them like animals, such as cows and whales.

Leparamour 1716 days ago

Facebook doesn't exclusively employ engineers; they also employ psychologists and neuroscientists precisely in order to refine these parameters and increase engagement. It's not an honest mistake, the way loot boxes, gacha mechanics and dark patterns aren't honest mistakes.

mgraczyk 1716 days ago

When I worked at Instagram, no psychologist or neuroscientist was ever involved in any part of product development, experimentation, or parameter refinement in any way.

Sometimes "psychologists" (product analysts?) looked at data and wrote reports. This was generally used for very high level product direction, and had more to do with reading the pulse on what people want rather than any kind of actual "psychology".

For example, did you know that US teens don't like to see "too many" memes on their feed? These are the kinds of questions "psychologists" at Facebook are answering. There simply does not exist any technology to use "psychology" to optimize low-level parameters for engagement metrics.

zozin 1716 days ago

I don’t blame the engineers, they’re clearly very talented because their products are addictive. I don’t blame the executives either, because they’re clearly very good at managing a trillion dollar behemoth and making ungodly amounts of money for their shareholders. It’s not an employee or executive’s job to create products that are good for society.

I blame our inept and captured public officials for not regulating these products out of existence. It’s their jobs to protect the public from things that harm society. The tech industry clearly can’t regulate itself, no industry can, just like tobacco refused to use filters and car manufacturers refused to install catalytic converters. Preventing kids from getting addicted to apps and suffering from depression and low self-esteem is not in Facebook’s interest other than as a public relations problem.