by JustAndy 545 days ago
I'm not really sure I understand your sorting example. Maybe try it out in GPT and post the link to show exactly what you mean.

A model's refusal behavior is trained into it by RLHF, and it can also be removed again through abliteration [1].

Also, LLMs are capable of using tools right now [2].

[1]: https://huggingface.co/blog/mlabonne/abliteration

[2]: https://www.anthropic.com/news/analysis-tool


I'm deliberately blurring refusal together with having an accurate picture of its own abilities and, beyond that, having an accurate picture of what it can do given tools. Both are tested by

   "Can you X?"
With refusal, you find out just how shallow it is: it will happily answer all sorts of questions that are "helpful" in making a nuclear bomb, but when you ask it directly, it shuts up. In another sense, nothing it does is "helpful", because it's not going to hunt down some people in Central Asia who have 50 kg of U235 burning a hole in their pocket for you, which is what would actually "help".

I use tool-using LLMs frequently, but I find they often need help using their tools. It is a lot of fun to talk to Windsurf about the struggles it has with its tools, and it feels strangely satisfying to help it out.