Hacker News
by ocdtrekkie 1948 days ago
The only problem I have with this requirement is that it requires giving only news organizations access to this information. Your concerns can adequately be addressed by ensuring everyone has even access to that information.

Google and Facebook's algorithms should be required to be publicly disclosed. As a society, we should demand that we are able to see the algorithms that every web property lives and dies based on, that lives are built and destroyed by.

3 comments

These algorithms are not human readable code. They are massively complex interconnected systems of many black box ML models. I don't understand what clarity people think releasing the "algorithms" will bring. In fact, describing ranking as a single algorithm is pretty misleading.
As you say, explaining the intricacies of the algorithm is a fool's errand. I guess it is more reasonable if you turn it around: these changes have a drastic impact on businesses, so there is a duty to behave responsibly in administering them.

If Google really has no idea what the impact of a change will be then it is fairly irresponsible to make that change given the real world harm it can cause. But I suspect in general it does have at least a reasonable idea what the effect of changes will be - that is why it is making them.

So the more reasonable version of this is that they need to submit human interpretable descriptions of the effect of changes based on reasonable evidence and validation of their models.

Monitoring search engine and social network ranking and filtering updates should be more efficient than complaining about biased parrots (language models). This is a tip to certain ethics researchers who are raising scandals about search bias, but not in the right place - go in the field, check the fucking feeds, leave your abstract ethical tower and measure the reality.
I'm sorry, but this post sounds pretty abstract itself. What exactly do you propose they should do?
Instead of trying to argue the gender bias in "doctor - man + woman == nurse" (abstract ethical argument) they should check the search results for bias (concrete, measured effect).
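For anyone unfamiliar with the "doctor - man + woman == nurse" example, here is a minimal sketch of how that kind of analogy query works. The 2-d vectors are made up by hand purely to reproduce the skew being discussed; they do not come from any real embedding model, and `analogy` is a hypothetical helper, not a library function.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Hypothetical toy vectors: [gender direction, "medical" direction].
# Hand-built for illustration, NOT taken from a trained model.
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "doctor": [0.5, 1.0],
    "nurse": [-0.9, 1.0],
    "surgeon": [0.6, 0.9],
}

def analogy(a, b, c, vectors):
    """Return the word closest to a - b + c, excluding the three query words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("doctor", "man", "woman", toy_vectors)
print(result)
```

The point of the comment above still stands: a query like this tells you about the geometry of a vector space, not about what users actually see in a results page.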
In many cases they (Google) don't know the impact of changes until they try deploying the changes, and there's ML in the picture, not just algorithms. As I understand it, they often run tests that expose the change to a limited subset of users first.
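A rough sketch of what "expose the change to a limited subset of users" can look like in general. This is not Google's actual mechanism; the function name, experiment name, and hashing scheme are all assumptions chosen for illustration.

```python
import hashlib

def in_experiment(user_id: str, experiment: str, rollout_percent: float) -> bool:
    """Deterministically assign a user to an experiment bucket.

    Hashing (experiment, user_id) means the same user always gets the same
    answer for a given experiment, without storing any assignment table.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # bucket in 0..9999
    return bucket < rollout_percent * 100  # 1.0 percent -> buckets 0..99

# Hypothetical usage: roll a ranking change out to roughly 1% of users.
users = [f"user-{i}" for i in range(100_000)]
exposed = sum(in_experiment(u, "ranking-change-42", 1.0) for u in users)
print(exposed)  # roughly 1,000 of 100,000
```

The metrics of the exposed group are then compared with everyone else's, which is how an operator can learn the impact of a change only after a partial deployment.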
Yes, but they don't just do random stuff. They make changes with the intention to adjust the experience in certain ways, so making those intentions public is important.
If it’s ML that is doing all the work to display articles, then ML has a long way to go.
No, it's ML tasked with "user engagement" doing the work. Not ML in general.
I also believe any algorithm that isn't human-readable should be banned. If it can't be understood, nobody can validate that it isn't racist, sexist, or slanted towards encouraging violence and harm.

The fact that technology companies have been grossly negligent and irresponsible isn't a reason to not regulate them: It's proof regulation needs to be much, much stronger.

This is an incredibly naive perspective. I guess you want to ban search engines, self driving cars, automated filtering of lewd and abusive content (why do you think FB isn’t full of porn? It’s not a hand engineered algorithm), automatic speech recognition for the hearing impaired, and a vast swath of important technology I didn’t list. I don’t think you really understand the implications of what you’re asking for. Sorry - black boxes are here to stay. And they are immeasurably useful. I could spend hours listing important and crucial technologies that you want banned because you are scared of racism.
Search engines already worked before ML, neither automated filtering nor self-driving cars actually work in reality.
I agree with you but I am still scared of racism.
> I agree with you but I am still scared of racism.

My suspicion is that the concern with machine learning over racism is rooted in two things. The first is just the general modern trend of accusing anything you don't like of being racist, because everybody hates racism and wants to fight it. And the second is the fear on the part of people who make a living fighting racism that machine learning might actually put them out of a job.

Because machine learning is basically a paperclip optimizer. You tell it to maximize a thing, it maximizes the thing and minimizes everything else. Racism isn't paperclips, so the paperclip optimizer will optimize for smashing it in favor of making more paperclips. And then they're out of business.

Because when you look at the criticism of this stuff, it generally looks like this. ~12% of the population is black, only ~5% of the selected applicants are black, the algorithm is accused of racism.

But nothing is that simple, because all kinds of things like income and education level and so on correlate with race, so you have to take all of those things into account before you can tell what's going on. And taking into account all of the available data is how machine learning works.
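To make the "raw rates vs. correlated factors" point concrete, here is a small sketch with entirely hypothetical applicant counts. The groups, the `has_degree` covariate, and the numbers are invented so that the overall selection rates differ while the rates within each education stratum are identical.

```python
from collections import defaultdict

# Hypothetical records: (group, has_degree, selected_count, applicant_count).
# All numbers are invented for illustration.
records = [
    ("X", True, 40, 80),
    ("X", False, 2, 20),
    ("Y", True, 10, 20),
    ("Y", False, 8, 80),
]

def rates(records, key):
    """Selection rate per key, where key maps (group, has_degree) to a bucket."""
    selected = defaultdict(int)
    total = defaultdict(int)
    for group, has_degree, sel, n in records:
        k = key(group, has_degree)
        selected[k] += sel
        total[k] += n
    return {k: selected[k] / total[k] for k in total}

overall = rates(records, lambda g, d: g)          # rate per group only
by_stratum = rates(records, lambda g, d: (g, d))  # rate per group and education

print(overall)     # X: 0.42, Y: 0.18
print(by_stratum)  # 0.5 for both groups with a degree, 0.1 for both without
```

Equal rates within each stratum do not by themselves show the outcome is fair, since the covariate can itself reflect earlier bias. The sketch only shows why a raw rate comparison is not conclusive either way.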

Which isn't to say that you couldn't make an algorithm racist. Tell it to optimize for applicants with a particular skin color and it does. But then your problem isn't with the algorithm, it's with the jackasses who asked for that.

What to optimize for is a much more general and difficult question. (Hint: Not paperclips.)

> My suspicion is that the concern with machine learning over racism is rooted in two things. The first is just the general modern trend of accusing anything you don't like of being racist, because everybody hates racism and wants to fight it. And the second is the fear on the part of people who make a living fighting racism that machine learning might actually put them out of a job.

I don't get how you go from this statement to then explaining exactly how racism gets embedded in algorithms: by using the biased data we have in the real world...

No, the racism is a real issue, though a lot of it is caused by limited training data. Having an image recognition algorithm identify Africans and South Asians as gorillas doesn't happen because the designers intended it, but because their training data had only light-skinned human faces and dark-skinned primates. But the effect is racist even though this wasn't the intent.

Likewise, if the system is trained to duplicate human decision-making (like who gets loans), interesting things can happen: if the decision-makers unconsciously favored whites over blacks, the algorithm could wind up weighing skin color or stereotypically Black or Latino names negatively, meaning that the final model is explicitly racist, just because there is a correlation in the training data. That doesn't mean we shouldn't use deep learning, it means that it's not responsible to just fit the training data and ship without testing for such problems.

I absolutely want to ban self-driving cars that behave in ways no human can explain or understand! The mere idea that anyone would think that should be legal is borderline insane.

All you are doing here is convincing me that tech companies are just runaway trains with nobody at the controls!

> I absolutely want to ban self-driving cars that behave in ways no human can explain or understand!

Can you explain or understand the algorithms humans use to drive cars?

Screw that.

Explain to me step by step how you walk.

Humans are held responsible if they cause harm to others. If a driver hits a pedestrian on purpose he is charged with murder. Who do you charge if a self-driving car behaves in this way?
What about all the other examples he listed. What about cancer detection? Or viral spread prediction? Drug discovery or medical imaging diagnosis? Physics research?

Machine learning is very widely used in the sciences and extremely beneficial to humanity in uncountably many ways and assuredly countless more to come. Of course technologies can be used for evil but so can nearly everything that exists. I believe your proposal comes from a desire to help or better the world, but to ban all non-human-readable algorithms is frankly ridiculous and demonstrates a naive understanding of the issue. It sounds a lot like the calls by the U.S. Congress to ban encryption.

Here is what I think:

- In medical: your doctor should be responsible for your diagnosis and drug company is responsible for defective drugs, except when they get away with lobbying and hiring good lawyers.

- In physics: I'm not sure if it's as big of a problem as in social networks. But consider this case: If you cannot reproduce the result of an experiment due to a ML model being cryptic, that would lead to huge credibility issue in science.

At best, you may be able to justify black boxes providing secondary indicators: Maybe using AI to study cancer detection might lead you to a new solid discovery, but "we use AI to determine if you have cancer" should never be the mission, as it fails to generate useful information about how it is detected.
Continuing this line of thinking, would you want all algorithms banned? Might as well shut everything down :shrug:
We can't even explain all physical phenomena, so good luck with banning anything that depends on the gravity of earth to function, because we don't know what gravity is.
But gravitational laws stay unchanged for millennia, don't they? If I toss an apple, it falls down. If I throw it fast enough, it goes into orbit.
> I also believe any algorithm that isn't human-readable should be banned. If it can't be understood, nobody can validate that it isn't racist, sexist, or slanted towards encouraging violence and harm.

I'm not sure a human-readable algorithm exists for ranking all the web pages in the world based on natural language input. In fact, I'm pretty sure such an algorithm does not, and potentially cannot, exist given the absolute failure of all approaches towards NLP that weren't based on absolute masses of text data and complex models.

Are you willing to make Google 10% as effective to achieve your goal of a human-readable algorithm?

You don't need any NLP to rank webpages (in fact the entire innovation of Google was that they figured out a way to rank pages while completely ignoring their content). PageRank works fundamentally by treating the web as a graph and prioritising results based on their connections; that is to say, it ranks based on popularity and is agnostic about the content of the actual page.
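For reference, a minimal sketch of the graph-based idea described above: textbook power-iteration PageRank on a tiny made-up link graph. The damping factor of 0.85 is the value commonly quoted for the original algorithm; the toy graph and the dangling-page handling are assumptions for illustration, not how any production search engine works.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no outlinks): spread its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: a, b and d all link to c; c links back to a.
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
print(max(ranks, key=ranks.get))  # the most linked-to page
```

Note that nothing in the function ever looks at page text, which is exactly the "agnostic about content" property. It also shows why adding pages that link to a target raises the target's score.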

This generally has worked well. On the other hand, actually attempting to manipulate search results based on automated handling of content is what has given us countless censorship debates, or simple failures where even uncontroversial content is removed or downranked because it violated some strange rule by having a 'bad word' in it. On Facebook recently, clothing ads for disabled people were banned[1] because, it turns out, the ML system only cared about the wheelchair, not the person in it.

It's actually fairly straight-forward to build recommender systems on transparent, graph-based algorithms and it gives you the added advantage of not discriminating in strange ways.

[1]https://www.nytimes.com/2021/02/11/style/disabled-fashion-fa...

You've just skipped over the early days of Google where they relied primarily on PageRank and bad actors manipulated it to death.

It's trivial to generate webs of fake, inter-related content and use that specifically to feed incoming links to valuable pages. Or to comment-spam websites so aggressively it ruins them. Or all of the secret deals between high-ranking sites to feed links even though the sites weren't related. There are countless examples of black-hat techniques to break PageRank.

I am sorry but you simply can't build a sustainable search engine without deeply understanding the user intent and the meaning behind the indexed pages.

>There are countless examples of black-hat techniques to break PageRank

There are also countless adversarial examples that trick ML algorithms. In fact this is in many ways worse because of the 'idiot savant' character of ML systems, which are almost always oblivious to context and can be tricked in ways that aren't apparent from the design of the system.

In contrast to systems that are legible or even formally verifiable, ML systems are entirely unable to provide any guarantees. When someone breaks PageRank, at least it's apparent how they broke it. When an ML system mistakes a turtle with a fractal pattern on its shell for a gun, nobody knows how to fix the system in any reliable way, other than to feed it more data and pray.

Pagerank worked fine when it was invented. It's a very elegant algorithm. But in a perfect illustration of Goodhart's law, it fell apart once people realized that they could game it to increase their traffic. Google has been in a constant arms race against unscrupulous SEO practices ever since.
>Google has been in a constant arms race against unscrupulous SEO practices ever since.

One company controls 80% of what is found on the internet. They set rules, restrictions, penalties that are not public. They do not pass any sort of regulatory muster. They rip and tear through businesses standing in their way. They crush out a person's online existence through never explained reasons. They use every advantage they can to tweak a human's emotions, drive and needs to feed more and more advertisements.

You suggest that those trying to use every advantage they can to rank higher are the unscrupulous ones?

Google's fight to keep search results crisp ended soon after they began selling advertising. Google long ago quit innovating search to be better for people, they've made it better for advertisers.

what is the weather today, Google?

I agree that you don't need NLP to rank webpages (though it certainly helps), but you do need it to parse the kinds of queries given to search engines these days. The days of logical OR and NOT are long gone I'm afraid.

> It's actually fairly straight-forward to build recommender systems on transparent, graph-based algorithms and it gives you the added advantage of not discriminating in strange ways.

I think other commenters have addressed the PageRank issue, but I'd be super interested in papers doing the work you note above.

> Are you willing to make Google 10% as effective to achieve your goal of a human-readable algorithm?

Absolutely. If it can't be done responsibly and ethically, perhaps it should not be done.

what % of people do you think would be willing to stop using search engines because they are unethical?
To me, their response didn't seem to indicate that it should be directly decided by people. This is a consumer protection matter, and to stretch an analogy, like a list of ingredients on a consumable. Here we have these black boxes, and no list of ingredients, yet they drive and shape our world. A person can't decide directly even if they wanted to.
If you look at the actual data, you will find that black box models are in fact responsible for preventing the majority of abusive content including hate speech and porn on social media platforms. Ban these models and you’d find your favorite social media platform is more abusive. Most of the racism and sexism you are concerned about comes from other humans.
Do you apply the same standard to people?

Tell me, how did your brain come up with what you wrote? How do I validate that it isn't racist, sexist, or slanted towards encouraging violence and harm?

By asking them. You can't just ask an algorithm, it must be designed to show its own work. Credibility is another problem...
Why can’t you just test the algorithm? It’s not conclusive, but it’s also not worthless.
Seems to me that’s a viable answer. How can we test an algorithm like Google’s ranking though? We can’t feed it consistent data like in a software test. It relies on too much information, and what we know about it indicates we can’t extract it out to test against it—except for results in the real world.

Not to mention Facebook’s are even more difficult. Tangentially related, remember when you could use “View As” on your profile page to see what your profile looked like to others? It doesn’t work anymore, only works for Public and Yourself; you can no longer choose the person to view as.

It’d be great to test these algorithms. We can’t. They need to be designed and instrumented so this is possible.

lol. sorry, but that reminds me of a skit by an Australian comedian:

male guest: "now first of all, let me just start by saying I'm not racist..."

female guest: "pfft..."

host: "ah see you made a noise there, but a lot of people accuse him of being a racist, so I think it's very helpful to know that he actually isn't one..."

Right, like I said, credibility is a different problem. But at this point, we don't even get a lie from them, we get nothing. At least a lie can be checked and examined. There's nothing available at all currently.
Very few people have the ability to influence the success or failure of every business on the planet. Those that do are heavily scrutinized for racist or sexist behavior. (Sometimes they also don't get convicted anyways, but that's another matter.)
> Very few people have the ability to influence the success or failure of every business on the planet.

In other words the solution to this should be antitrust enforcement and decentralization of power.

> any algorithm that isn't human-readable should be banned

There's an existing term for people with this view:

https://en.wikipedia.org/wiki/Luddite

You refer to the activists who successfully protected their quality of life by refusing to let someone else use technology to ruin it.

An apt comparison.

I'm sorry I have to tell you this, but they were not successful.
The Luddites obtained numerous concessions and retired comfortably. Not clear how that’s unsuccessful.
>If it can't be understood, nobody can validate that it isn't racist, sexist, or slanted towards encouraging violence and harm.

This is quite a bizarre claim as there is famously an entire category of problems that are hard to solve but easy to verify: P vs NP

Yeah, they can give you the architecture drawn as a nice mind map, list the hyper-parameters, but that's like knowing the algorithm of the compiler, it doesn't help detect a bad program. The question is what the model is learning, not how. What are the inputs and what is it learning to output.
Explainable models do not preclude the systemic problems you highlight. Plenty of systems before the advent of non-explanatory ML models had those defects. One option is to define test and validation sets and encourage 3P validation, somewhat like how accreditation works in other contexts.
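A minimal sketch of what third-party validation against a fixed test set could look like when the model is a black box. Everything here is hypothetical: `audit_paired_inputs`, the `group` field, and `toy_score` (a deliberately biased stand-in for an opaque model) are invented for illustration and do not describe any real accreditation process.

```python
def audit_paired_inputs(score, pairs, tolerance):
    """Black-box audit: flag input pairs whose scores differ by more than tolerance.

    `score` is any callable; the auditor never looks inside it.
    Each pair holds two inputs that differ only in the attribute under test.
    """
    failures = []
    for a, b in pairs:
        gap = abs(score(a) - score(b))
        if gap > tolerance:
            failures.append((a, b, gap))
    return failures

def toy_score(applicant):
    # Stand-in for an opaque model, deliberately biased so the audit finds something.
    base = 0.1 * applicant["years_experience"]
    penalty = 0.2 if applicant["group"] == "B" else 0.0
    return base - penalty

def make_pair(years):
    # Two applicants identical except for the (hypothetical) group attribute.
    return (
        {"years_experience": years, "group": "A"},
        {"years_experience": years, "group": "B"},
    )

pairs = [make_pair(y) for y in (1, 5, 10)]
failures = audit_paired_inputs(toy_score, pairs, tolerance=0.05)
print(len(failures))  # 3: every pair shows a gap of about 0.2
```

A check like this is not conclusive, since a model can pass every pair in the test set and still misbehave elsewhere, but it needs no access to the model's internals.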
Publicly disclosing the algorithms would drastically increase the pace of gaming them and result in a pay-to-play system where the fanciest SEO wins.

Google and Facebook partially rely on obscurity to keep fighting the spam battle. IMO we don't have the technology yet to have fully open ranking algorithms that are not quickly broken.

Come to think of it, it's similar to crypto around WW2.

This isn't as true as it once was.

Google's best asset for ranking is their user data. Even if you had the exact algorithm, you couldn't game it without massive amounts of user traffic. (At least not for popular searches.)

No. Their best asset is their deep understanding of what makes a page "good" and the intent behind a search query.

You could get rid of all their user data and it would still be a great search engine.

Delegating the war against spam to Big Tech, rather than having users pick it up, doesn't seem right. Giving Big Tech such power just to relieve ourselves of a mild annoyance is destructive. This is understood in other aspects of life; hence we have local governments, which are inefficient and inconvenience people greatly. Yet it is found that selling off all our problems is counterproductive: it ends with monopolies. The answer to this isn't to charge tech companies for the privilege of dictating our lives; rather, it's greater accountability on the part of Big Tech and more responsibility on the part of consumers. The only cure for Google's domination is for the transfer of information online to become more democratised.
This excuse has been used to protect Google and Facebook for decades, but considering that disinformation campaigns, civil unrest, and outright genocide have been the cost... I think the price of using obscurity to prevent SEO tactics is way too high.
The root cause isn't algorithms, it's a lack of accountability (both of companies and of users). The problem with 8chan wasn't some inadvertently harmful AI, it's that the site and its users damaged the world for several years without facing consequences.
I'm curious how this advance update thing is supposed to work. What does disclosing those details look like, actually?

The reason I'm asking is that as these things grow in complexity, it's quite possible that even if you join the team that works on these systems it will probably take you a pretty long time to understand how they really work. Their actual behaviour is likely to still be mysterious a lot of the time because they're driven by data.

Is a high-level description in English OK? Do we need to see pseudocode? The source code? Do they have to open source it? What parts, if it's tied to internal frameworks? If there is ML, do they have to disclose all their sauce there? The trained network / weights? The training data, if the alg alone is useless without a data set?

Any human-initiated change to search algorithms is presumably human-understandable. Someone writes a rule to downrank some terms or traits of a website, they presumably document it somewhere.

That documentation will need to be shared, and the implementation of the rule change will need to be delayed until the disclosure window has passed.

Human understandable, yes, but the details of particular changes might only make sense to humans familiar with the system.

But yeah, the product manager view / documentation of intent sounds generally reasonable.

I do wonder how useful that would be to the news orgs in practice.

Honestly, first and foremost, I expect a firehose of documentation, if Google isn't lying about making dozens of changes to its algorithms every day. News companies might need a full-time guy (or team) just to sit there and read through them all.

But on the other hand, a bunch of journalists will have a ton of never-before-seen information about how the world's most powerful companies affect every other company on the planet. That alone is going to be worth some major exclusives.

Also, by the mere nature of being forced to share it, Google and Facebook will have to clean up their acts; they'll have to assume any change they make that could open them up to legal scrutiny will be found.

You underestimate the complexity here by orders of magnitude. You also overestimate the usefulness to news companies. You underestimate the harm that bad actors can do.

The search algorithm tells you the order of search results for a particular set of terms. Except that as input you need to feed it a graph of the entire indexed internet, which is re-indexed periodically as the indexed content changes. How does knowing that benefit news companies? What exactly would your hypothetical full-time guy/team, equipped with that index at huge cost, tell their company that would justify the time and expense? That they should write interesting content that lots of people consume?

Second, the general approach has been published and is well documented [1], as are its susceptibilities to attack [2]. So there's your algorithm, what does it tell you?

Third, general SEO isn't the problem, it's coordinated attacks that can poison all search results / ads markets if enough detail is known. Google invests [3] heavily to address these areas [4].

Finally, you underestimate how much of a firehose you'd have to drink from. It describes all of the internet.

[1] http://infolab.stanford.edu/~backrub/google.html

[2] https://en.wikipedia.org/wiki/PageRank#Manipulating_PageRank

[3] https://www.quora.com/What-does-the-Counter-Abuse-Technology...

[4] https://www.blog.google/around-the-globe/google-europe/meet-...

You might want to note some very important parts of your first-listed source:

> Furthermore, advertising income often provides an incentive to provide poor quality search results. For example, we noticed a major search engine would not return a large airline's homepage when the airline's name was given as a query. It so happened that the airline had placed an expensive ad, linked to the query that was its name. A better search engine would not have required this ad, and possibly resulted in the loss of the revenue from the airline to the search engine. In general, it could be argued from the consumer point of view that the better the search engine is, the fewer advertisements will be needed for the consumer to find what they want. This of course erodes the advertising supported business model of the existing search engines. However, there will always be money from advertisers who want a customer to switch products, or have something that is genuinely new. But we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm.

Larry and Sergey themselves both believed that ad-funded search was problematic, and that a transparent search engine in the academic realm was "crucial".

Unfortunately, Larry and Sergey's price was clearly billions of dollars.