by goodside
1235 days ago
1. I'm mostly working on Scale Spellbook, which is like OpenAI Playground but with features for evaluating and comparing variant prompts, trying out open-source LLMs like FLAN-T5, and collecting feedback on generations using Scale's network for human labeling and annotation. https://scale.com/spellbook

2. I've seen demos of this implemented in GPT-2, where the model's attention to the prompt is visualized during a generation, but I'm struggling to find it now. It can't be done in GPT-3, which is available only via OpenAI's APIs, and those don't expose the model's attention weights.

3. Prompt engineering can be quantitatively empirical, using benchmarks like any other area of ML. LLMs are widely used as classification models and all the usual math for performance applies. The least quantitative parts of it are my specialty: the stuff I post to Twitter (https://twitter.com/goodside) is mostly "ethnographic research", poking at the model in weird ways and posting screenshots of whatever I find interesting. I see this as the only way to identify "capability overhangs", meaning things the model can do that we didn't explicitly train it to do and never thought to attempt.
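To make the "usual math for performance" in point 3 concrete, here is a minimal sketch of scoring two prompt variants as classifiers against a small labeled set. Everything in it is a made-up illustration: the labels, the two lists of model outputs, and the `binary_metrics` helper are assumptions for the example, not anything from Spellbook or a real benchmark.

```python
def binary_metrics(gold, predicted, positive="pos"):
    """Accuracy, precision, recall and F1 for one prompt's predictions."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)

    # Guard against empty denominators (e.g. a prompt that never says "pos").
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical gold labels and the parsed outputs of two prompt variants.
gold     = ["pos", "neg", "pos", "pos", "neg", "neg", "pos", "neg"]
prompt_a = ["pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg"]
prompt_b = ["pos", "neg", "pos", "pos", "neg", "neg", "neg", "neg"]

metrics_a = binary_metrics(gold, prompt_a)
metrics_b = binary_metrics(gold, prompt_b)

for name, metrics in [("prompt A", metrics_a), ("prompt B", metrics_b)]:
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

With a real evaluation you'd want far more than eight examples, plus a held-out set so the prompt isn't tuned to the same data it's scored on.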