by jonathanmayer 1177 days ago
Context: I teach at Princeton and study social media and recommendation systems.

From a very quick skim of the repositories, this appears to be quite limited transparency. The documentation gives a decent high-level overview of how Tweet recommendation works—no surprises—and the code tracks that roadmap. Those are meaningful positive steps. But the underlying policies and models are almost entirely missing (there are a couple valuable components in [1]). Without those, we can't evaluate the behavior and possible effects of "the algorithm."

[1] https://github.com/twitter/the-algorithm-ml

13 comments

I work on Google Assistant Suggestions and I don't think it's very practical to open-source an algorithm like that including the models and the underlying data. Both of them can live in separate services and be frequently updated.

I am assuming that open sourcing the code aims to increase transparency about the business logic of the ranking decisions. At the same time you don't want spammers to be able to easily run experiments against a cloned version of your system.

> But the underlying policies and models are almost entirely missing (there are a couple valuable components in [1]). Without those, we can't evaluate the behavior and possible effects of "the algorithm."

Haven't gone through it yet, but yeah, if that's the case, all this is, is a glorified framework to plug your own in. Not exactly what was promised.

Did you also skim the accompanying (or rather, main) repo, https://github.com/twitter/the-algorithm ?

From a quick clone and line-count, it has:

  235 kLOC .scala
  136 kLOC .java
   22 kLOC .py
    7 kLOC .rs

So I don't think you did, since you posted so quickly and that's a LOT of code.
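
For anyone who wants to reproduce a count like the one above, here is a minimal sketch. It is not necessarily how the numbers above were produced: it counts every newline, blank lines and comments included, so the totals may differ from what a dedicated tool reports. The function name and the default extension list are assumptions made for illustration.

```python
from collections import Counter
from pathlib import Path


def count_lines_by_extension(root, extensions=(".scala", ".java", ".py", ".rs")):
    """Return a Counter mapping each file extension to its total line count.

    Hypothetical helper: counts raw newlines, so blank lines and comments
    are included in the totals.
    """
    totals = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in extensions:
            # Read bytes so unusual encodings cannot raise a decode error.
            totals[path.suffix] += path.read_bytes().count(b"\n")
    return totals
```

Calling `count_lines_by_extension("the-algorithm")` on a local clone (the directory name is assumed) and dividing each total by 1000 gives kLOC figures comparable to the list above.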

I also haven't skimmed this code except very superficially, but perhaps you should since you're out there making statements with your Princeton credentials.

(I posted this comment with the heads-up a few minutes after your comment above and then expanded it as you didn't respond.)

I think you misunderstood. He's saying the training models are not there.
For example, MostRecentCombinedUserSnapshotSource seems to be influential (such as for calculating "tweepcred"), but we can't see how it's calculated.
Wouldn’t that make them easy prey for “spam SEO”? However, given the framework, isn’t it still possible to guess the models?
The spam SEO issue should have been dealt with, or at least thought about, _before_ engaging in the whole adventure, and having to guess how it could work if properly implemented defeats the "open source" spirit of it.

More credit would be given if the very idea of open sourcing the algorithm hadn't already been discussed to death, with predictions of the difficult points and of how it probably won't happen in any sane way.

And then be pilloried for not doing it, or for not doing it fast enough. Damned if you do, damned if you don’t.

I’m starting to think the problem with Elon is mostly personal: he’s just a proxy and wrong by default.

(Not that I approve of his behavior, but I can’t enjoy this whole mobbing that he’s getting. Not that he cares, so I’m not worried he’s getting traumatized in any way. What irks me is how it has become an identitarian trait for a certain group.)

Makes me wonder if a way to override people SEO hacking the algorithm is to create a market of open-source algorithms that each individual can choose and then it's not trying to hack THE algorithm but having to hack many and not knowing which algorithm an individual is using.
You don't have to target every 'algorithm' at once. You can target them one at a time. Hell, you can run A/B tests to single out the easiest targets.
Yes, but right now 100% of users are on the one algorithm (or chronological). If one doesn't know what percentage of people, or which people, are using which algorithm, it becomes harder to know which ones to try to hack for the biggest result.
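
The disagreement in the two comments above can be put into a toy calculation. This is only a sketch under made-up assumptions: `shares` is a hypothetical list of the fraction of users on each algorithm, `k` is how many algorithms a spammer can afford to game, and both function names are invented for illustration.

```python
def reach_with_known_shares(shares, k):
    """Attacker knows the usage shares and games the k most-used algorithms."""
    return sum(sorted(shares, reverse=True)[:k])


def expected_reach_blind(shares, k):
    """Attacker does not know the shares and picks k algorithms uniformly at random.

    The expected sum of k shares drawn without replacement is k times the mean share.
    """
    return k / len(shares) * sum(shares)
```

With one algorithm (`shares = [1.0]`) both functions return the full user base. With hypothetical shares of `[0.5, 0.2, 0.1, 0.1, 0.1]` and `k = 1`, an informed attacker reaches about half the users while a blind one reaches about a fifth on average, which is the gap that A/B testing for the easiest or biggest targets would try to close.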
Those look older to me. They all have last updated dates for October and November 2022.
FB open source algo looks much better, right? /s
Is it valid to focus on tracking a Dem/Rep split when that split is an exclusionary design for many Americans? Or is it not exclusionary in your view? I'm curious about a social science perspective.

Ignoring the global nature of Twitter for a moment.

So why did they open source it?
So they could pretend to be open. It's the "Open"AI model. Open-washing?
This is a very cynical take. They should be commended for publishing recommendation code at all, which no other major social network does.
Well if they say “we will open source the algorithm” and then what they really open source is a little bit of slightly relevant code that doesn’t allow us to understand the algorithm, then what we can deduce is that they are trying to weasel out of public commitments.

I can’t say for sure if that happened, but if they made a clear promise and then did something else, it’s perfectly reasonable to call that out.

Devil's advocate though: imagine you were to open source (probably with quite a short deadline) some 'algorithm' used in whatever you work on, but the rest should stay private; how would you go about that?

I don't think it's easy, there's inherently some interface(s!) where it's a hand-wavey 'get the thing from the private bit', and defining that sensibly is hard, and if you try to do it well will probably lead to a lot of meetings, scope creep, etc. - and as far as that goes it's not easy anyway, since it's highly technical and implementation-specific yet also a management/policy decision to make.

It depends on what your goal in open sourcing is. Are you looking to provide a base for others to build software on, and to provide a way for others to contribute back to your code? Then publishing the code makes sense.

Are you looking to build public trust in you and your organization? Then dumping a bunch of code with no context isn't going to help much, as it's not code but behavior that builds or destroys trust.

Are you looking to lean into a polarized partisan environment, pushing a narrative where it's you and your supporters against an unfair group of "others"? Then a big splashy move high on symbolism and low on substance that will inspire lots of high profile, divisive media coverage is a great way to go.

If you were doing it in good faith, you wouldn't need to publish the actual code. Most likely you should publish an article and a flowchart explaining how the algorithm works. Publishing a partial chunk of code just creates a story that supporters who don't understand can parrot that "they opened their algorithm".
I still hear reverse-FUD about nvidia supposedly fully open-sourcing their Linux driver, when in reality they opened a tiny kernel portion of it that allows the main proprietary blob to connect to necessary kernel interfaces. You have to call out this bullshit when you see it.
Wait, what? AFAIU what you say is true, except for the part where the “main proprietary blob” does not run on the CPU. This isn’t as glorious as an actual open-source driver would be, but it does have meaningful advantages: e.g. you now have a ghost of a chance of implementing Nvidia GPU support on a non-Linux kernel, by uploading the GPU-side blob and rewriting the CPU-side shim as required. Or is the blob license-restricted from being used like that?
The "main proprietary blob" they're talking about is the userspace portion of the driver; the portion which does all of the heavy lifting. That definitely runs on your CPU. The only part they open-sourced is the kernel portion of the driver, which just exists to facilitate communication between the userspace driver and the hardware.
Hey, we can get even more cynical. Why should we trust that this code is even similar to what they run in production currently?
I can't imagine deliberately special casing Elon's account in something they made from scratch to fool people.
Let's have reasonable goals, shall we? "Their shit doesn't stink as bad as others'" is nothing commendable, especially after such publicity.
I say "why not both". Even if they are doing it only for good PR, we encourage it by giving them praise, because we should encourage things we want. (While remembering that they are not our friend, they are an entity we should pressure, and the way we pressure is by giving praise when they do things we like, and criticism when they do not.)
I’d give them more credit if they’d been honest and kept it secret than if they lie to my face and pretend they didn’t.
They should be commended for open sourcing something they don't understand because they fired all of the people who understood it? Elon admitted as much.
This is like FB open sourcing the compiled frontend code you can see yourself using inspect.

If we commend them for this, we're helping promote and encourage this faux open source virtue signaling.

No, that's very different.
There is clearly a lot of information to share. It's worth considering this could be step 1 of n as opposed to assuming the worst possible intention.
It's healthy to have a normal amount of cynicism. They released it for a reason. "The goal of our open source endeavor is to provide full transparency to you, our users, about how our systems work."

Why be transparent (or try to appear transparent)? To convince people to trust your platform (or to recruit - which seems to be another goal of the post). Why would Twitter want or need to do this now? Well, there is a bit of context. This disclosure doesn't exist in a vacuum.

I love this take. Doomed if you do, doomed if you don't.
If we are willing to not assume some borderline "it's what they want you to think" conspiracy play, obviously there were always going to be a lot of highly interested and qualified people taking a very close look at this and, at some point, there was always going to be a very definitive conclusion about what's the deal with what they released.

If your play was "it's some source code, hence people will think we are open, and that should be really good for us", that would make you a very special kind of idiot in this space.

That was one of Elon’s core statements when he first talked about buying Twitter. If he had gotten it out sooner there would be an easier link between the two, but if you want more context just go read the old tweets and articles from the Twitter vs Elon days.
If we can't build anything with this, is it "source"?
"Does not include batteries"
You must be new to Musk's business practices.
It's no secret that Twitter, like any other social media platform, is driven by user engagement and ad revenues. The more time we spend on the platform, the more valuable it becomes for them. With this new open-source algorithm, they're essentially crowdsourcing improvements to their system to better serve us the content we crave.

This move could be seen as a strategic PR play to boost their public image amidst the growing concerns around algorithmic bias and lack of transparency. By inviting the community to collaborate and address these issues, they're not only shifting some of the responsibility onto the users but also deflecting potential criticism.

Because they let go many of the engineers working on it?
No one has mentioned this before, and I don't know if it's really related, but afaik the European Union is thinking about requiring social media platforms to be more transparent when it comes to recommendations etc. If you can already say "hey we have a lot already online!" then maybe the laws will become less strict.
Because he has no devs anymore and thinks the community will fix it for free.
PR and it was already leaked last week.
PR
> But the underlying policies and models are almost entirely missing... Without those, we can't evaluate the behavior and possible effects of "the algorithm."

And neither can spammers find and test the cracks and edge cases that would allow them to break the system. That does sound reasonable to me. If they were public there would be an arms race between spammers/those wishing to game the system and Twitter engineers.

Then don’t pretend to release “the algorithm.”
They’re explaining how it works without giving the specifics. Much like the US military explains how the nuclear deterrent works without disclosing detailed plans and control codes.
It's an open algorithm, but it's not open data! (joking)
What did you expect?
I don’t know if the parent’s expectations matter here. This is more about making sure others don’t misunderstand the meaning here.
Good point. I didn't see it like that. Thanks!
Can I audit your class for free?