by sweezyjeezy 109 days ago

> But the entire value is that it can be automated. If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns. Or none.

'Or none' is ruled out since it found the same vulnerability - I agree that there is a question on precision on the smaller model, but barring further analysis it just feels like '9500' is pure vibes from yourself? Also (out of interest) did Anthropic post their false-positive rate?

The smaller model is clearly the more automatable one IMO if it has comparable precision, since it's just so much cheaper - you could even run it multiple times for consensus.
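A minimal sketch of the "run it multiple times for consensus" idea, assuming a hypothetical `scan_once` callable that wraps one call to a cheap model and returns the findings it reported. The function name, the `runs` and `min_votes` parameters, and the string labels for findings are all made up for illustration; nothing here reflects how any vendor actually runs these scans.

```python
from collections import Counter
from typing import Callable, Iterable, Set


def consensus_findings(
    scan_once: Callable[[str], Iterable[str]],  # hypothetical: one model call
    source: str,
    runs: int = 5,
    min_votes: int = 3,
) -> Set[str]:
    """Keep only findings reported by at least `min_votes` of `runs` scans."""
    votes = Counter()
    for _ in range(runs):
        # set() so a finding repeated within a single run counts once
        votes.update(set(scan_once(source)))
    return {finding for finding, count in votes.items() if count >= min_votes}


# Usage with a canned stand-in for the model (five fake responses).
fake_outputs = iter([
    ["eval-injection", "noise-a"],
    ["eval-injection"],
    ["eval-injection", "noise-b"],
    ["noise-a"],
    [],
])
result = consensus_findings(lambda src: next(fake_outputs), "some code", runs=5, min_votes=3)
print(result)
```

Raising `min_votes` trades recall for precision: one-off hallucinated findings get filtered out, but so does a real bug the model only notices occasionally.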

3 comments

johnfn 109 days ago

Admittedly just vibes from me, having pointed small models at code and asked them questions, no extensive evaluation process or anything. For instance, I recall models thinking that every single use of `eval` in JavaScript is a security vulnerability, even something obviously benign like `eval("1 + 1")`. But then I'm only posting comments on HN, I'm not the one writing an authoritative thinkpiece saying Mythos actually isn't a big deal :-)

jorvi 108 days ago

My proof-in-pudding test is still the fact that we haven't seen gigantic mass firings at tech companies, nor a massive acceleration on quality or breadth (not quantity!) of development.

Microsoft has been going heavy on AI for 1y+ now. But then they replace their cruddy native Windows Copilot application with an Electron one. If tests and dev only have marginal cost now, why aren't they going all in on writing extremely performant, almost completely bug-free native applications everywhere?

And this repeats itself across all big tech or AI hype companies. They all have these supposed earth-shattering gains in productivity but then.. there hasn't been anything to show for that in years? Despite that whole subsect of tech plus big tech dropping trillions of dollars on it?

And then there is also the really uncomfortable question for all tech CEOs and managers: LLMs are better at 'fuzzy' things like writing specs or documentation than they are at writing code. And LLMs are supposedly godlike. Leadership is a fuzzy thing. At some point the chickens will come to roost and tech companies with LLM CEOs / managers and human developers or even completely LLM'd will outperform human-led / managed companies. The capital class will jeer about that for a while, but the cost for tokens will continue to drop to near zero. At that point, they're out of leverage too.

MidnightRider39 108 days ago

Leadership is also a very human thing. I think most people would balk at the idea of being led by an LLM.

One of the main functions of leaders (should be) is to assume responsibility for decisions and outcomes. A computer can't do that.

And finally why should someone in power choose to replace themselves?

coldtea 107 days ago

>One of the main functions of leaders (should be) is to assume responsibility for decisions and outcomes. A computer can't do that.

Sure it can. "Assuming responsibility" just means people/the law let you.

It can be totally empty too, like CEOs or politicians "assuming responsibility" for some outcome but nevertheless suffering zero consequences.

eiens 108 days ago

Someone in power doesn’t get to choose - the board of directors do. Whose job is to act in the best interest of shareholders.

Firms tend to follow peers in an industry - once one blinks the rest follow.

MidnightRider39 108 days ago

The board of directors are also people in power - why not replace them with an LLM as well if it works so well for CEOs?

eru 108 days ago

> Someone in power doesn’t get to choose - the board of directors do. Whose job is to act in the best interest of shareholders.

Alas, shareholder value is a great ideal, but it tends to be honoured in practice rather less strictly.

As you can also see when sudden competition leads to rounds of efficiency improvements, cost cutting and product enhancements: even without competition, a penny saved is a penny earned for shareholders. But only when fierce competition threatens to put managers' jobs at risk, do they really kick into overdrive.

coldtea 107 days ago

>shareholder value is a great ideal

It's one of the most horrible ideas ever, responsible for anything from market abuse and enshittification to rent seeking and patent trolling.

dbdr 108 days ago

> Someone in power doesn’t get to choose - the board of directors do

Since the board of directors can decide to replace the CEO, it's not the CEO who holds the (ultimate) power, it's the board of directors.

jsjohnst 108 days ago

Since the majority shareholder(s) can decide to replace the board of directors, it’s not the board of directors who holds the (ultimate) power, it’s the majority shareholder(s).

johnfn 108 days ago

Your proof-in-pudding test seems to assume that AI is binary -- either it accelerates everyone's development 100x ("let's rewrite every app into bug-free native applications") or nothing ("there hasn't been anything to show for that in years"). I posit reality is somewhere in between the two.

coldtea 107 days ago

Considering that we were promised "AI will replace nearly all devs", "AI will give a 100x boost" and such, it makes sense to question this.

After all, almost all hyped technology is also "somewhere between the two" extremes of not doing what it promises at all and doing it. The question is which edge it's closer to.

eiens 108 days ago

LLMs are capable of searching information spaces and generating some outputs that one can use to do their job.

But it’s not taking anyone’s job, ever. People are not bots; a lot of the work they do is tacit and goes well beyond the capabilities and abilities of LLMs.

Many tech firms are essentially mature and are currently using too much labour. This will lead to a natural cycle of layoffs if they cannot figure out projects to allocate the surplus labour. This is normal and healthy - only a deluded economist believes in ‘perfect’ stuff.

ipaddr 108 days ago

"it’s not taking anyone’s job, ever"

It has already and that doesn't mean new jobs haven't been created or that those new jobs went to those who lost their jobs.

johnfn 108 days ago

In this entire thread of conversation, I never said that LLMs would take people's jobs, and that is not something I believe.

nopinsight 108 days ago

> LLMs are better at 'fuzzy' things like writing specs or documentation than they are at writing code.

At least for writing specs, this is clearly not true. I am a startup founder/engineer who has written a lot of code, but I've written less and less code over the last couple of years and very little now. Even much of the code review can be delegated to frontier models now (if you know which ones to use for which purpose).

I still need to guide the models to write and revise specs a great deal. Current frontier LLMs are great at verifiable things (quite obvious to those who know how they're trained), including finding most bugs. They are still much less competent than expert humans at understanding many 'softer' aspects of business and user requirements.

locknitpicker 108 days ago

> Microsoft has been going heavy on AI for 1y+ now. But then they replace their cruddy native Windows Copilot application with an Electron one.

This.

Also, Microsoft is going heavy on AI but it's primarily chatbot gimmicks they call copilot agents, and they need to deeply integrate it with all their business products and have customers grant access to all their communications and business data to give something for the chatbot to work with. They go on and on in their AI tour with their example of how a company can work on agents alone, and they tell everyone their job is obsoleted by agents, but they don't seem to dogfood any of their products.

mlmonkey 108 days ago

> My proof-in-pudding test is still the fact that we haven't seen gigantic mass firings at tech companies

This assumes that companies will announce such mass firings (yeah, I'm aware of the WARN Act); when in reality they will steadily let go of people for various reasons (including "performance").

From my (tech heavy) social circle, I have noticed an uptick in the number of people suddenly becoming unemployed.

naasking 108 days ago

> My proof-in-pudding test is still the fact that we haven't seen gigantic mass firings at tech companies

Jevons paradox.

gspetr 107 days ago

For Jevons paradox to be a win-win, you need these 3 statements to be true:

1) Workers get more productive thanks to AI.

2) Higher worker productivity translates into lower prices.

3) Most importantly, consumer demand needs to explode in reaction to lower prices. And we're finding out in real-time that the demand is inelastic.

Around 1900, 40% of American workers worked in agriculture. Today, it's < 2%.

Which is similar to what we see with coding: The increase in demand has not exploded enough to offset the job-killing of each farmer being able to produce more food.
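A toy model of the three conditions above, to make the elasticity point concrete. The assumptions are mine and deliberately crude: productivity gains pass fully through to prices, demand has constant price elasticity, and labor needed is just output divided by productivity. The function name and parameters are hypothetical. Under those assumptions, labor demand changes by a factor of `gain ** (elasticity - 1)`, so jobs only grow when elasticity is above 1.

```python
def labor_multiplier(productivity_gain: float, elasticity: float) -> float:
    """Factor by which labor demand changes after a productivity gain.

    Assumptions (illustrative only): full pass-through of the gain to
    prices, and constant-elasticity demand Q ~ P ** (-elasticity).
    """
    price_multiplier = 1.0 / productivity_gain            # condition 2
    demand_multiplier = price_multiplier ** (-elasticity)  # condition 3
    # Each worker produces `productivity_gain` times more (condition 1),
    # so labor needed is output demanded divided by that gain.
    return demand_multiplier / productivity_gain


for e in (0.5, 1.0, 2.0):
    print(f"elasticity={e}: labor x{labor_multiplier(2.0, e):.2f}")
```

With a 2x productivity gain, inelastic demand (0.5) shrinks labor demand, unit elasticity leaves it flat, and only elastic demand (2.0) grows it, which is the farming story in miniature.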

ummonk 108 days ago

What's a situation where one needs to use `eval` in a benign way in JS? If something is precomputable (e.g. `eval("1 + 1")` can just be replaced by 2), then it should be precomputed. If it's not precomputable then it's dependent on input and thus hardly benign -- you'll need to carefully verify that the inputs are properly sanitized.

argee 109 days ago

With LLMs (and colleagues) it might be a legitimate problem since they would load that eval into context and maybe decide it’s an acceptable paradigm in your codebase.

bloaf 108 days ago

I remember a study from a while back that found something like "50% of 2nd graders think that french fries are made out of meat instead of potatoes. Methodology: we asked kids if french fries were meat or potatoes."

Everyone was going around acting like this meant 50% of 2nd graders were stupid with terrible parents. (Or, conversely, that 50% of 2nd graders were geniuses for "knowing" it was potatoes at all)

But I think that was the wrong conclusion.

The right conclusion was that all the kids guessed and they had a 50% chance of getting it right.

And I think there is probably an element of this going on with the small models vs big models dichotomy.
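One way to sanity-check a result like that is to ask how likely it is under pure guessing. Below is a rough sketch of a one-sided binomial check; the function name and the example numbers are made up for illustration, and nothing here comes from the study or from any model evaluation.

```python
from math import comb


def p_value_at_least(correct: int, total: int, chance: float = 0.5) -> float:
    """Probability of `correct` or more right answers out of `total`
    if every answer were an independent guess with success rate `chance`."""
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# 50 of 100 right on a two-option question: entirely consistent with guessing.
print(p_value_at_least(50, 100))
# 90 of 100 right: very hard to explain by guessing alone.
print(p_value_at_least(90, 100))
```

The same logic applies to a model that flags a known vulnerability when pointed straight at it: a hit only means something once you know how often it would "hit" on code with nothing wrong.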

Kye 108 days ago

I think it also points to the problem of implicit assumptions. Fish is meat, right? Except for historical reasons, the grocery store's marketing says "Fish & Meat."

And then there's nut meats. Coconut meat. All the kinds of meat from before meat meant the stuff in animals. The meat of the problem. Meat and potatoes issues.

If you asked that question before I'd picked up those implicit assumptions, or if I never did, I would have to guess.

roxolotl 108 days ago

I’ve got many Catholic relatives that describe themselves as vegetarians and eat fish. Language can be surprisingly imprecise and dependent upon tons of assumptions.

alwillis 108 days ago

> I’ve got many Catholic relatives that describe themselves as vegetarians and eat fish

Those are pescatarians.

It's like how a tomato is a fruit but is used as a vegetable: meat has traditionally been the flesh of warm-blooded animals. Fish is the flesh of cold-blooded animals, making it meat, but due to religious reasons it’s not considered meat.

roxolotl 108 days ago

Right exactly. The point is that dictionary definitions don’t always align with cultural ones.

idopmstuff 109 days ago

> 'Or none' is ruled out since it found the same vulnerability

It's not, though. It wasn't asked to find vulnerabilities over 10,000 files - it was asked to find a vulnerability in the one particular place in which the researchers knew there was a vulnerability. That's not proof that it would have found the vulnerability if it had been given a much larger surface area to search.

sweezyjeezy 108 days ago

I don't think the LLM was asked to check 10,000 files given these models' context windows. I suspect they went file by file too.

That's kind of the point - I think there are three scenarios here:

a) this is just the first time an LLM has done such thorough minesweeping

b) previous versions of Claude did not detect this bug (seems the least likely)

c) Anthropic have done this several times, but the false positive rate was so high that they never checked it properly

Between a) and c) I don't have a high confidence either way to be honest.

direwolf20 107 days ago

Mythos was also asked to find a vulnerability in one file, in turn for each file. Maybe the small model needs to be asked about each function instead of each file. Okay, you can still automate that.
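A rough sketch of the "ask about each function instead of each file" loop. To keep it self-contained it splits Python source with the standard `ast` module, which is an assumption for illustration (the code under discussion is presumably not Python), and `ask_model` is a hypothetical stand-in for whatever call sends a snippet to a small model.

```python
import ast
from typing import Callable, Dict


def scan_per_function(source: str, ask_model: Callable[[str], str]) -> Dict[str, str]:
    """Send each function in `source` to `ask_model` separately.

    Results are keyed by function name, so same-named functions
    (e.g. methods on different classes) would overwrite each other.
    """
    tree = ast.parse(source)
    results: Dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Exact source text of just this function
            snippet = ast.get_source_segment(source, node)
            results[node.name] = ask_model(snippet)
    return results


# Usage with a trivial stand-in "model" that flags any call to eval.
example = "def a():\n    return eval('1 + 1')\n\ndef b(x):\n    return x\n"
print(scan_per_function(example, lambda s: "flag" if "eval(" in s else "ok"))
```

Smaller units mean more calls, but each prompt is short, which is where a cheap model has the best chance of keeping up.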

jgalt212 107 days ago

or run multiple cheap models in parallel: MOE^n, in effect.