by msp26 959 days ago
I'm interested in more testing on the context side of things.

For my NLP pipelines, I batch n articles into a single prompt and extract fields from all of them in one message; the final output looks something like {"1":[{}], "2": [{},{}]...}. Compute-wise it's inefficient, but OpenAI charges by the token so it doesn't matter. It's very reliable on gpt-4 8k.
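
For anyone wanting to try the same batching, here is a rough sketch of how such a prompt could be assembled. The function name `build_batch_prompt`, the `[Article i]` markers, and the instruction wording are all made up for illustration, not the actual prompt used here; it only builds the string and leaves the API call out.

```python
import json


def build_batch_prompt(articles, fields):
    """Number each article from 1 and ask for one JSON object keyed by number.

    Hypothetical helper: the marker format and wording are assumptions.
    """
    # Give every article an explicit 1-based label the model can echo back.
    numbered = "\n\n".join(
        f"[Article {i}]\n{text}" for i, text in enumerate(articles, start=1)
    )
    instructions = (
        f"Extract the fields {json.dumps(fields)} from each of the "
        f"{len(articles)} articles below. Reply with a single JSON object "
        'whose keys are the article numbers as strings ("1", "2", ...) and '
        "whose values are lists of extracted records."
    )
    return f"{instructions}\n\n{numbered}"
```

Stating the article count in the instructions gives the model something to count against, though whether that actually helps at long context would need testing.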

I was also pretty happy with the results on 4-turbo initially but it seems that once you go past 30k-ish tokens in context (needs way more testing), it shits itself. The indexes don't match anymore and n_final_output is different from n_articles.
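
The failure described above (indexes drifting, output count not equal to input count) is cheap to detect automatically. A minimal sketch, assuming the output is a JSON object keyed "1" to "n" as in the example format; `check_batch_output` is a hypothetical name, not something from an existing library.

```python
import json


def check_batch_output(raw, n_articles):
    """Return a list of problems; an empty list means the indexes line up."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(parsed, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    expected = {str(i) for i in range(1, n_articles + 1)}
    got = set(parsed)

    # n_final_output should equal n_articles.
    if len(parsed) != n_articles:
        problems.append(f"got {len(parsed)} entries, expected {n_articles}")
    # Keys should be exactly "1".."n": report gaps and strays separately.
    missing = sorted(expected - got, key=int)
    if missing:
        problems.append(f"missing indexes: {missing}")
    extra = sorted(got - expected)
    if extra:
        problems.append(f"unexpected indexes: {extra}")
    # Each value should be a list of extracted records.
    for key, value in parsed.items():
        if not isinstance(value, list):
            problems.append(f"entry {key!r} is not a list")
    return problems
```

Running this on every response and logging the prompt token count next to any failures would give a rough picture of where the breakdown starts.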

Still, great model and even if the limits are lower in practice I suspect I'll get good use out of it.

Edit: With better prompting, it feels stable at n=42, ~42000 prompt tokens.

1 comments

Interesting. I was skeptical about some of their claims regarding longer context, since it's been my experience that these models just get lost after enough of it.

Yeah, degraded performance on long contexts has been observed in plenty of other models [https://arxiv.org/abs/2307.03172] so I was cautious too. Unfortunately I don't have access to 4-32k. I would have liked to test that out too.