| HN Mirror

Thanks a ton! And that’s a great question!

Before starting this project, I realised that while there are a ton of resources that talk about using these models for chat inference and QnA over documents — no one did a good job of stress-testing them on sample complexity.

We all know that LLMs have the power of generalisability but how do they actually compare to the likes of BERT and Distilbert that have become household names in the world of NLP. Can these LLMs compare with them on tasks beyond chat? Like classification, Named entity recognition, etc?

If you go over to a model folder, let’s say Flan or Falcon, you will notice that the README has a rich documentation of our research findings. This, I guarantee you, you won’t find anywhere else. Additionally, the inference section has a good study of how these models fare when the number of requests go up, and the associated costs.

I will end by saying that a lot of people and repositories are just riding the wave of the buzz surrounding LLMs without answering a lot of questions that data scientists and ML engineers actually have. And those questions (4 pillars of evaluation framework) are necessary to answer for enterprises to build software. Not just slap together a chat interface, and then calling a revolutionary product.