by goodside 1283 days ago

1. I'm mostly working on Scale Spellbook, which is like OpenAI Playground but with features for evaluating and comparing variant prompts, trying out open-source LLMs like FLAN-T5, and collecting feedback on generations using Scale's network for human labeling and annotation. https://scale.com/spellbook

2. I've seen demos of this implemented in GPT-2, where the model's attention to the prompt is visualized during a generation, but I'm struggling to find them now. It can't be done with GPT-3, which is available only through OpenAI's API, and the API doesn't expose attention weights.

3. Prompt engineering can be quantitatively empirical, using benchmarks like any other area of ML. LLMs are widely used as classification models and all the usual math for performance applies. The least quantitative parts of it are my specialty — the stuff I post to Twitter (https://twitter.com/goodside) is mostly "ethnographic research", poking at the model in weird ways and posting screenshots of whatever I find interesting. I see this as the only way to identify "capability overhangs" — things the model can do that we didn't explicitly train it to do, and never thought to attempt.
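As a rough illustration of the "usual math" in point 3, here is a minimal sketch that scores two prompt variants as binary classifiers against gold labels. The labels, the variant names (`prompt_a`, `prompt_b`), and the helper `classification_metrics` are all hypothetical, made up for this example. In practice the predictions would be parsed from model outputs on a real benchmark.

```python
from collections import Counter


def classification_metrics(gold, predicted, positive="positive"):
    """Return accuracy, precision, recall and F1 for one positive class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    counts = Counter()
    for g, p in zip(gold, predicted):
        if p == positive and g == positive:
            counts["tp"] += 1
        elif p == positive:
            counts["fp"] += 1
        elif g == positive:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(gold) if gold else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical data: gold labels plus the parsed outputs of two prompt variants.
gold = ["positive", "negative", "positive", "negative", "positive", "negative"]
outputs = {
    "prompt_a": ["positive", "negative", "negative", "negative", "positive", "positive"],
    "prompt_b": ["positive", "negative", "positive", "negative", "positive", "positive"],
}

scores = {name: classification_metrics(gold, preds) for name, preds in outputs.items()}
best = max(scores, key=lambda name: scores[name]["f1"])

for name, metrics in scores.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
print("best by F1:", best)
```

Picking the winner by F1 is only one possible choice. Depending on the task, accuracy or recall at a fixed precision may be the better target.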

2 comments

larve 1283 days ago

Maybe BertViz? https://github.com/jessevig/bertviz

ilaksh 1283 days ago

How do things like FLAN-T5 or other open models compare to text-davinci-003 or code-davinci-002 for code generation?
