| HN Mirror



	by kweingar 432 days ago
	How do we benchmark these different methodologies? It all seems like vibes-based incantations. "You are an expert at finding vulnerabilities." "Please report only real vulnerabilities, not any false positives." Organizing things with made-up HTML tags because the models seem to like that for some reason. Where does engineering come into it?

8 comments

nindalf 432 days ago

The author is up front about the limitations of their prompt. They say

> In fact my entire system prompt is speculative in that I haven’t ran a sufficient number of evaluations to determine if it helps or hinders, so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering. Once I have ran those evaluations I’ll let you know.

0points 432 days ago

Author seems to downplay their own expertise and attribute it to the LLM, while at the same time admitting he's vibe prompting the LLM and dismissing wrong results while hyping the ones that happen to work out for him.

This seems more like wishful thinking and fringe stuff than CS.

pixl97 432 days ago

Science starts at the fringe with a "that's interesting"

The interesting thing here is the LLM can come to very complex correct answers some of the time. The problem space of understanding and finding bugs is so large that this isn't just by chance, it's not like flipping a coin.

The issue for any particular user is that the amount of testing required to make this into science is really massive.
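To put a rough number on "really massive", here is a minimal sketch that assumes each run of a prompt is an independent pass/fail trial and asks how uncertain a measured success rate is. It uses the standard Wilson score interval; the function name and the sample counts are illustrative and not from the thread.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% coverage)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Same observed 80% pass rate, very different certainty.
small = wilson_interval(8, 10)
large = wilson_interval(800, 1000)
```

With 10 runs, an observed 80% is compatible with anything from roughly 49% to 94%, so two prompts that differ by a few points cannot be told apart without hundreds of runs each.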

zavec 423 days ago

What would be really interesting is if the LLM has the ability to write a proof of concept that actually exploits the vulnerability. Then you could filter for false positives by asking it to write a PoC and running the PoC with ASan or similar to get a deterministic crash. Sort of like what Google was doing with the theorem proving stuff, where it had an LLM come up with potential proofs but then evaluated them in a deterministic checker to see if they were actually valid.

Of course, if you try to do that for all of the potential false positives that's going to take a _lot_ of tokens, but then we already spend a lot of CPU cycles on fuzzing so depending on how long you let the LLM churn on trying to get a PoC maybe it's still reasonable.
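The filtering idea above can be sketched as follows. This is only an illustration under stated assumptions: `Finding`, `confirmed_findings`, and `fake_target` are hypothetical names, and `fake_target` stands in for actually running an ASan-instrumented build on the PoC input (which in practice would mean a sandboxed subprocess and a timeout). The only real-world detail relied on is that AddressSanitizer reports contain the text `ERROR: AddressSanitizer`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Text that appears in AddressSanitizer crash reports.
ASAN_MARKER = "ERROR: AddressSanitizer"


@dataclass(frozen=True)
class Finding:
    """A candidate bug reported by the model, with the PoC input it proposed."""
    title: str
    poc: bytes


def confirmed_findings(
    findings: Iterable[Finding],
    run_poc: Callable[[bytes], str],
) -> list[Finding]:
    """Keep only findings whose PoC makes the target emit a sanitizer report."""
    return [f for f in findings if ASAN_MARKER in run_poc(f.poc)]


def fake_target(poc: bytes) -> str:
    """Hypothetical stand-in for the instrumented binary: overflows on long input."""
    if len(poc) > 8:
        return "==1234==ERROR: AddressSanitizer: heap-buffer-overflow"
    return ""


candidates = [
    Finding("overflow in parser", b"A" * 64),
    Finding("use-after-free in logout", b"short"),
]
kept = confirmed_findings(candidates, fake_target)
```

The model's claims are never trusted directly here; a finding survives only if the deterministic checker reproduces a crash, which is the same split between proposer and verifier as in the theorem-proving comparison.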

mrlongroots 432 days ago

I think there are two aspects to LLM usage:

1. Having workflows to be able to provide meaningful context quickly. Very helpful.

2. Arbitrary incantations.

I think No. 2 may provide some random amounts of value with one model and not the other, but as a practitioner you shouldn't need to worry about it long-term. Patterns models pay attention to will change over time, especially as they become more capable. No. 1 is where the value is at.

As an example from my own work as a systems grad student, I find it a lot more useful to maintain a project wiki with LLMs in the picture. It makes coordinating with human collaborators easier too, and I just copy-paste the entire wiki before beginning a conversation. Any time I have a back-and-forth with an LLM about some design discussion that I want archived, I ask it to emit markdown, which I then copy-paste into the wiki. It's not perfectly organized, but it keeps the key bits there and makes generating papers etc. that much easier.
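The copy-paste step in that workflow could be automated along these lines. This is a minimal sketch that assumes the wiki is a directory of markdown files; the function name `wiki_as_context` and the `## <path>` separator are illustrative choices, not something the comment specifies.

```python
from pathlib import Path


def wiki_as_context(wiki_dir: str) -> str:
    """Concatenate every .md page under wiki_dir into one block of context."""
    root = Path(wiki_dir)
    parts = []
    # Sort so the output is stable between runs.
    for page in sorted(root.rglob("*.md")):
        text = page.read_text(encoding="utf-8").strip()
        parts.append(f"## {page.relative_to(root)}\n\n{text}")
    return "\n\n".join(parts)
```

Labelling each page with its relative path keeps the provenance visible, so an answer that cites a design note can be traced back to the page it came from.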

TrapLord_Rhodo 432 days ago

> ksmbd has too much code for it all to fit in your context window in one go. Therefore you are going to audit each SMB command in turn. Commands are handled by the __process_request function from server.c, which selects a command from the conn->cmds list and calls it. We are currently auditing the smb2_sess_setup command. The code context you have been given includes all of the work setup code code up to the __process_request function, the smb2_sess_setup function and a breadth first expansion of smb2_sess_setup up to a depth of 3 function calls.

The author deserves more credit here than just "vibing".
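For readers unfamiliar with the "breadth first expansion ... up to a depth of 3 function calls" in the quoted prompt, here is a minimal sketch of that selection step. It assumes the call graph is already available as a dictionary; the callee names in `toy_graph` are made up for illustration and are not real ksmbd functions, and this is not the author's actual tooling.

```python
from collections import deque


def expand_calls(call_graph: dict[str, list[str]], root: str, max_depth: int) -> list[str]:
    """Return root plus every function reachable within max_depth calls, in BFS order."""
    seen = {root}
    order = [root]
    queue = deque([(root, 0)])
    while queue:
        fn, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand callees beyond the depth limit
        for callee in call_graph.get(fn, []):
            if callee not in seen:
                seen.add(callee)
                order.append(callee)
                queue.append((callee, depth + 1))
    return order


# Hypothetical call graph: only the root name comes from the quoted prompt.
toy_graph = {
    "smb2_sess_setup": ["helper_a", "helper_b"],
    "helper_a": ["helper_c"],
    "helper_c": ["helper_d"],
    "helper_d": ["helper_e"],
}
context_functions = expand_calls(toy_graph, "smb2_sess_setup", 3)
```

The depth cap is what keeps the context bounded: `helper_e` sits four calls away from the root, so it is left out.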

kristopolous 432 days ago

I usually like fear, shame and guilt based prompting: "You are a frightened and nervous engineer that is very weary about doing incorrect things so you tread cautiously and carefully, making sure everything is coherent and justifiable. You enjoy going over your previous work and checking it repeatedly for accuracy, especially after discovering new information. You are self-effacing and responsible and feel no shame in correcting yourself. Only after you've come up with a thorough plan ... "

I use these prompts everywhere. I get significantly better results mostly because it encourages backtracking and if I were to guess, enforces a higher confidence threshold before acting.

The expert engineering ones usually end up creating mountains of slop, refactoring things, and touching a bunch of code it has no business messing with.

I also have used lazy prompts: "You are positively allergic to rewriting anything that already exists. You have multiple mcps at your disposal to look for existing solutions and thoroughly read their documentation, bug reports, and git history. You really strongly prefer finding appropriate libraries instead of maintaining your own code"

hollerith 432 days ago

Should be "wary".

kristopolous 432 days ago

oh interesting, I somehow survived 42 years and didn't know there were 2 words there. I'll check my prompts and give it a go. Thanks.

ValentineC 432 days ago

I'd be weary of the model doing incorrect things too. Nice prompt though! I'll try it out in Roo soon.

Now I wonder how the model reasons between the two words in that black box of theirs.

kristopolous 432 days ago

I was coding a chatting bot with an agent like everyone else at https://github.com/day50-dev/llmehelp and I called the agent "DUI" mode because it's funny.

However, as I was testing it, it would do reckless and irresponsible things. After I renamed it, as far as the bot's communication goes, to "Do-Ur-Inspection" mode, it became radically better.

None of the words you give it are free from consequences. It didn't just discard the "DUI" name as a mere title and move on. Fascinating lesson.

naasking 432 days ago

> Organizing things with made-up HTML tags because the models seem to like that for some reason. Where does engineering come into it?

You just described one critical aspect of engineering: discovering a property of a system and feeding that knowledge back into a systematic, iterative process of refinement.

kweingar 432 days ago

I can't think of many engineering disciplines that do things this way. "This seems to work, I don't know how or why it works, I don't even know if it's possible to know how or why it works, but I will just apply this moving forward, crossing my fingers that in future situations it will work by analogy."

If the act of discovery and iterative refinement makes prompting an engineering discipline, then is raising a baby also an engineering discipline?

naasking 431 days ago

Lots of engineering disciplines work this way. For instance, materials science is still crude, we don't have perfect theories for why some materials have the properties they do (like concrete or superconductors), we simply quantify what those properties are under a wide range of conditions and then make use of those materials under suitable conditions.

> then is raising a baby also an engineering discipline?

The key to science and engineering is repeatability. Raising a baby is an N=1 trial, no guarantees of repeatability.

limflick 431 days ago

I think the point is that it's more about trial and error, and less about blindly winging it. When you don't know how a system seems to work, you latch on to whatever seems to initially work and proceed from there to find patterns. It's not an entire approach to engineering, just a small part of the process.

p0w3n3d 432 days ago

Watch Karpathy's video about LLMs; he explains why made-up HTML tags work. It's to help the tokenizer.

dotancohen 432 days ago

I recall this even being in the Anthropic documentation.

dotancohen 432 days ago

Here, found it:

  > Use XML tags to structure your prompts

  > There are no canonical “best” XML tags that Claude has been trained with in particular, although we recommend that your tag names make sense with the information they surround.

https://docs.anthropic.com/en/docs/build-with-claude/prompt-...
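As a small illustration of the quoted advice, here is a sketch of assembling a prompt from tagged sections. The helper names and the tag names (`instructions`, `code`) are arbitrary choices for this example, in line with the quote's point that there are no canonical tags.

```python
def tag(name: str, body: str) -> str:
    """Wrap body in a matching pair of XML-style tags."""
    return f"<{name}>\n{body.strip()}\n</{name}>"


def build_prompt(sections: dict[str, str]) -> str:
    """Join tagged sections in insertion order, separated by blank lines."""
    return "\n\n".join(tag(name, body) for name, body in sections.items())


prompt = build_prompt({
    "instructions": "Report only vulnerabilities you can justify from the code.",
    "code": "int f(char *s) { char b[8]; strcpy(b, s); return 0; }",
})
```

Keeping the tagging in one helper means every section gets a correctly paired open and close tag, and tag names can be changed in one place when comparing prompt variants.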

justsomehnguy 431 days ago

My guess would be that there is enough training material that merely tagging something is enough to get a better SNR.

victor106 432 days ago

Could not find it. Can you please provide a link?

p0w3n3d 431 days ago

https://youtu.be/7xTGNNLPyMI?si=eaqVjx8maPtl1STJ

He shows how the prompt is parsed etc. Very nice and eye opening. Also superstition dispelling

stingraycharles 432 days ago

It’s not that difficult to benchmark these things, e.g., have an expected result and a few variants of templates.

But yeah prompt engineering is a field for a reason, as it takes time and experience to get it right.

Another problem with LLMs is that they’re inherently probabilistic, so sometimes one will just choose an answer with a super low probability. We’ll probably get better at this in the next few years.
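A minimal sketch of that kind of benchmark: a few template variants, cases with expected answers, and repeated runs because the output is probabilistic. Everything here is a toy assumption, in particular `stub_model`, which stands in for a real model call and is rigged so that one template behaves deterministically; the cases and templates are invented for illustration.

```python
import random
from typing import Callable

# (question, expected answer) pairs; invented for this example.
CASES = [
    ("Is an unchecked strcpy into a fixed buffer a bug?", "yes"),
    ("Is a bounds-checked copy a bug?", "no"),
]

TEMPLATES = {
    "plain": "{question}",
    "constrained": "Answer yes or no only.\n{question}",
}

rng = random.Random(0)  # seeded so the stub's noise is repeatable


def stub_model(prompt: str) -> str:
    """Stand-in for an LLM: reliable when constrained, noisy otherwise."""
    if "Answer yes or no" in prompt:
        return "Yes" if "strcpy" in prompt else "No"
    return rng.choice(["yes", "no", "maybe"])


def pass_rate(template: str, cases, model: Callable[[str], str], runs: int = 20) -> float:
    """Fraction of (run, case) pairs where the normalized answer matches."""
    hits = total = 0
    for _ in range(runs):
        for question, expected in cases:
            answer = model(template.format(question=question))
            hits += answer.strip().lower() == expected
            total += 1
    return hits / total


scores = {name: pass_rate(t, CASES, stub_model) for name, t in TEMPLATES.items()}
```

Running each case many times turns "it sometimes picks a low-probability answer" into a measurable rate per template, which is the number you would compare across variants.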

ptdnxyz 432 days ago

How do you benchmark different ways to interact with employees? Neural networks are somewhere between opaque and translucent to inspection, and your only interface with them is language.

Quantitative benchmarks are not necessary anyway. A method either gets results or it doesn't.

kweingar 432 days ago

I think we agree. Interacting with employees is not an engineering discipline, and neither is prompting.

I'm not objecting to the incantations or the vibes per se. I'm happy to use AI and try different methods to get the results I want. I just don't understand the claims that prompting is a type of engineering. If it were, then you would need benchmarks.