
by jonathanmayer 1177 days ago

Context: I teach at Princeton and study social media and recommendation systems.

From a very quick skim of the repositories, this appears to be quite limited transparency. The documentation gives a decent high-level overview of how Tweet recommendation works—no surprises—and the code tracks that roadmap. Those are meaningful positive steps. But the underlying policies and models are almost entirely missing (there are a couple valuable components in [1]). Without those, we can't evaluate the behavior and possible effects of "the algorithm."

[1] https://github.com/twitter/the-algorithm-ml

13 comments

eterevsky 1176 days ago

I work on Google Assistant Suggestions and I don't think it's very practical to open-source an algorithm like that including the models and the underlying data. Both of them can live in separate services and be frequently updated.

I am assuming that open sourcing the code aims to increase transparency about the business logic of the ranking decisions. At the same time you don't want spammers to be able to easily run experiments against a cloned version of your system.

bilekas 1177 days ago

> But the underlying policies and models are almost entirely missing (there are a couple valuable components in [1]). Without those, we can't evaluate the behavior and possible effects of "the algorithm."

Haven't gone through it yet, but yeah, if that's the case, all this is, is a glorified framework to plug your own in. Not exactly what was promised.

tpmx 1177 days ago

Did you also skim the accompanying (or rather, main) repo, https://github.com/twitter/the-algorithm ?

From a quick clone and line-count, it has:

  235 kLOC .scala
  136 kLOC .java
  22  kLOC .py
  7   kLOC .rs

So I don't think you did, since you posted so quickly and that's a LOT of code.

I also haven't skimmed this code except very superficially, but perhaps you should since you're out there making statements with your Princeton credentials.

(I posted this comment with the heads-up a few minutes after your comment above and then expanded it as you didn't respond.)

Lord_Zero 1170 days ago

I think you misunderstood. He's saying the training models are not there.

kadavy 1177 days ago

For example, MostRecentCombinedUserSnapshotSource seems to be influential (such as for calculating "tweepcred"), but we can't see how it's calculated.

eecc 1177 days ago

Wouldn’t that make them easy prey for “spam SEO”? However, given the framework, isn’t it still possible to guess the models?

makeitdouble 1176 days ago

The spam SEO issue should have been dealt with/thought about _before_ engaging in the whole adventure, and having to guess how it could work if decently implemented defeats the "open source" spirit of it.

More credit would be given if the very idea of open sourcing the algorithm hadn't already been discussed to death with predictions of the difficult points and how it probably won't happen in any sane way.

eecc 1175 days ago

And then be pilloried for not doing it, or not doing it fast enough. Damned if you do, damned if you don’t.

I’m starting to think the problem with Elon is mostly personal; he’s just a proxy and wrong by default.

(Not that I approve of his behaviors, but I can’t enjoy this whole mobbing that he’s getting. Not that he cares, so I’m not worried he’s getting traumatized in any way; it’s just how it’s become an identitarian trait for a certain group that irks me.)

jimkleiber 1175 days ago

Makes me wonder if a way to override people SEO hacking the algorithm is to create a market of open-source algorithms that each individual can choose and then it's not trying to hack THE algorithm but having to hack many and not knowing which algorithm an individual is using.

muddi900 1175 days ago

You don't have to target each 'algorithm' all at once. You can target them one at a time. Hell, you can run A/B tests to single out the easiest targets.

jimkleiber 1174 days ago

Yes, but right now 100% of the users are using the one algorithm (or chronological). If one doesn't know what percentage or which people are using which algorithm, it becomes harder to know which ones to try to hack to have the biggest result.

modeless 1177 days ago

What about these? https://huggingface.co/Twitter

simonw 1177 days ago

Those look older to me. They all have last updated dates for October and November 2022.

EastSmith 1177 days ago

FB open source algo looks much better, right? /s

zhte415 1176 days ago

Is it valid to focus on tracking a Dem/Rep split when that split is an exclusionary design for many Americans? Or is it not exclusionary in your belief? I'm curious about a social science perspective.

Ignoring the global nature of Twitter for a moment.

meghan_rain 1177 days ago

So why did they open source it?

daveguy 1177 days ago

So they could pretend to be open. It's the "Open"AI model. Open-washing?

cubefox 1177 days ago

This is a very cynical take. They should be commended for publishing recommendation code at all, which no other major social network does.

SequoiaHope 1177 days ago

Well if they say “we will open source the algorithm” and then what they really open source is a little bit of slightly relevant code that doesn’t allow us to understand the algorithm, then what we can deduce is that they are trying to weasel out of public commitments.

I can’t say for sure if that happened, but if they made a clear promise and then did something else, it’s perfectly reasonable to call that out.

OJFord 1176 days ago

Devil's advocate though: imagine you were to open source (probably with quite a short deadline) some 'algorithm' used in whatever you work on, but the rest should stay private; how would you go about that?

I don't think it's easy, there's inherently some interface(s!) where it's a hand-wavey 'get the thing from the private bit', and defining that sensibly is hard, and if you try to do it well will probably lead to a lot of meetings, scope creep, etc. - and as far as that goes it's not easy anyway, since it's highly technical and implementation-specific yet also a management/policy decision to make.

anyonecancode 1176 days ago

It depends on what your goal in open sourcing is. Are you looking to provide a base for others to build software on, and to provide a way for others to contribute back to your code? Then publishing the code makes sense.

Are you looking to build public trust in you and your organization? Then dumping a bunch of code with no context isn't going to help much, as it's not code but behavior that builds or destroys trust.

Are you looking to lean into a polarized partisan environment, pushing a narrative where it's you and your supporters against an unfair group of "others"? Then a big splashy move high on symbolism and low on substance that will inspire lots of high profile, divisive media coverage is a great way to go.

jjeaff 1176 days ago

If you were doing it in good faith, you wouldn't need to publish the actual code. Most likely you should publish an article and a flowchart explaining how the algorithm works. Publishing a partial chunk of code just creates a story that supporters who don't understand can parrot that "they opened their algorithm".

5e92cb50239222b 1177 days ago

I still hear reverse-FUD about nvidia supposedly fully open-sourcing their Linux driver, when in reality they opened a tiny kernel portion of it that allows the main proprietary blob to connect to necessary kernel interfaces. You have to call out this bullshit when you see it.

mananaysiempre 1176 days ago

Wait, what? AFAIU what you say is true, except for the part where the “main proprietary blob” does not run on the CPU. This isn’t as glorious as an actual open-source driver would be, but it does have meaningful advantages—e.g. you now have a ghost of a chance of implementing Nvidia GPU support on a non-Linux kernel, by uploading the GPU-side blob and rewriting the CPU-side shim as required. Or is the blob license-restricted from being used like that?

mort96 1176 days ago

The "main proprietary blob" they're talking about is the userspace portion of the driver; the portion which does all of the heavy lifting. That definitely runs on your CPU. The only part they open-sourced is the kernel portion of the driver, which just exists to facilitate communication between the userspace driver and the hardware.

philote 1177 days ago

Hey, we can get even more cynical. Why should we trust that this code is even similar to what they run in production currently?

concordDance 1176 days ago

I can't imagine deliberately special casing Elon's account in something they made from scratch to fool people.

rakoo 1177 days ago

Let's have reasonable goals, shall we? "Their shit doesn't stink as bad as others'" is nothing commendable, especially after such publicity.

jrochkind1 1176 days ago

I say "why not both". Even if they are doing it only for good PR, we encourage it by giving them praise, because we should encourage things we want. (While remembering that they are not our friend, they are an entity we should pressure, and the way we pressure is by giving praise when they do things we like, and criticism when they do not.)

LastTrain 1176 days ago

I’d give them more credit if they’d been honest and kept it secret rather than lying to my face and pretending they didn’t.

raiyni 1176 days ago

They should be commended for open sourcing something they don't understand because they fired all of the people who understood it? Elon admitted as much.

hanniabu 1177 days ago

This is like FB open sourcing the compiled frontend code you can see yourself using inspect.

If we commend them for this, we're helping promote and encourage this faux open source virtue signaling.

cubefox 1177 days ago

No, that's very different.

correlator 1177 days ago

There is clearly a lot of information to share. It's worth considering this could be step 1 of n as opposed to assuming the worst possible intention.

yurodivuie 1176 days ago

It's healthy to have a normal amount of cynicism. They released it for a reason. "The goal of our open source endeavor is to provide full transparency to you, our users, about how our systems work."

Why be transparent (or try to appear transparent)? To convince people to trust your platform (or to recruit - which seems to be another goal of the post). Why would Twitter want or need to do this now? Well, there is a bit of context. This disclosure doesn't exist in a vacuum.

mirkules 1176 days ago

I love this take. Doomed if you do, doomed if you don't.

jstummbillig 1177 days ago

If we are willing to not assume some borderline "it's what they want you to think" conspiracy play, obviously there were always going to be a lot of highly interested and qualified people taking a very close look at this and, at some point, there was always going to be a very definitive conclusion about what's the deal with what they released.

If your play was "it's some source code, hence people will think we are open, and that should be really good for us", that would make you a very special kind of idiot in this space.

joshspankit 1177 days ago

That was one of Elon’s core statements when he first talked about buying Twitter. If he had gotten it out sooner there would be an easier link between the two, but if you want more context just go read the old tweets and articles from the Twitter vs Elon days.

kzrdude 1177 days ago

If we can't build anything with this, is it "source"?

bilekas 1177 days ago

"Does not include batteries"

justapassenger 1177 days ago

You must be new to Musk's business practices.

avanti 1176 days ago

It's no secret that Twitter, like any other social media platform, is driven by user engagement and ad revenues. The more time we spend on the platform, the more valuable it becomes for them. With this new open-source algorithm, they're essentially crowdsourcing improvements to their system to better serve us the content we crave.

This move could be seen as a strategic PR play to boost their public image amidst the growing concerns around algorithmic bias and lack of transparency. By inviting the community to collaborate and address these issues, they're not only shifting some of the responsibility onto the users but also deflecting potential criticism.

bradly 1177 days ago

Because they let go many of the engineers working on it?

carstenhag 1176 days ago

No one has mentioned this before - I don't know if it's really related, but afaik the European Union is thinking about requiring social media platforms to be more transparent when it comes to recommendations etc. If you can already say "hey we have a lot already online!" then maybe the laws will become less strict.

llx2 1176 days ago

bc he has no devs anymore and thinks the community will fix it for free

w0m 1177 days ago

PR and it was already leaked last week.

anigbrowl 1177 days ago

PR

helsinkiandrew 1176 days ago

> But the underlying policies and models are almost entirely missing... Without those, we can't evaluate the behavior and possible effects of "the algorithm

And neither can spammers find and test the cracks and edge cases that would allow them to break the system. That does sound reasonable to me. If they were public there would be an arms race between spammers/those wishing to game the system and Twitter engineers.

ivalm 1176 days ago

Then don’t pretend to release “the algorithm.”

helsinkiandrew 1176 days ago

They’re explaining how it works without giving the specifics. Much like the US military explains how the nuclear deterrent works without disclosing detailed plans and control codes.

novok 1176 days ago

It's an open algorithm, but it's not open data! (joking)

ngrilly 1177 days ago

What did you expect?

TaylorAlexander 1177 days ago

I don’t know if the parent’s expectations matter here. This is more about making sure others don’t misunderstand the meaning here.

ngrilly 1176 days ago

Good point. I didn't see it like that. Thanks!

bobobob420 1177 days ago

Can I audit your class for free?