by hn_20591249
829 days ago
The data pipeline isn't included in this release, and we already know it is a pretty simple RAG pipeline using qdrant: https://twitter.com/qdrant_engine/status/1721097971830260030. Nothing about using data in "real time" requires the model to have this many parameters, and a model this large is likely quite inefficient for their "non-woke" instructional use-case.
E.g., there is a lot of noise in social data, and worse, misinfo/spam/etc., so we spend a lot of energy on adversarial data integration. Likewise, queries are often neurosymbolic, such as restricting to a date range or applying inclusion/exclusion criteria. Pulling the top 20 tweets most similar to a query and running them through a slow, dumb, & manipulated LLM would be a bad experience. We have been pulling in ideas from agents, knowledge graphs, digital forensics & SNA, code synthesis, GNNs, etc. for our roadmap, which feels quite different from what is being shown here.
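To make the "neurosymbolic query" point concrete, here is a rough sketch of what I mean, not our actual pipeline: apply the symbolic constraints first (date range, inclusion/exclusion terms, a spam threshold), then rank only the survivors by vector similarity. Every name here (Post, hybrid_search, spam_score, max_spam) is hypothetical, the 2-d embeddings are toy values, and a real system would push the filter into the vector store rather than scan a list.

```python
from dataclasses import dataclass
from datetime import date
from math import sqrt


@dataclass
class Post:
    text: str
    posted: date
    embedding: list[float]  # toy vector; assume a real embedding model upstream
    spam_score: float       # hypothetical classifier output in [0, 1]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(posts, query_vec, start, end,
                  include=(), exclude=(), max_spam=0.5, k=20):
    """Symbolic filter first, then similarity ranking over what is left.

    include/exclude terms are expected in lowercase.
    """
    def passes(p):
        text = p.text.lower()
        return (
            start <= p.posted <= end
            and p.spam_score <= max_spam
            and all(term in text for term in include)
            and not any(term in text for term in exclude)
        )

    candidates = [p for p in posts if passes(p)]
    candidates.sort(key=lambda p: cosine(query_vec, p.embedding), reverse=True)
    return candidates[:k]


# Made-up sample data, purely for illustration.
posts = [
    Post("Flood warning issued for the river district", date(2023, 11, 2), [0.9, 0.1], 0.05),
    Post("WIN a free phone, flood of prizes!!!", date(2023, 11, 3), [0.8, 0.2], 0.97),
    Post("Flood photos from last year", date(2022, 6, 1), [0.95, 0.05], 0.10),
    Post("River district roads reopened after flood", date(2023, 11, 5), [0.7, 0.3], 0.02),
]

results = hybrid_search(
    posts,
    query_vec=[1.0, 0.0],
    start=date(2023, 11, 1),
    end=date(2023, 11, 30),
    include=("flood",),
    exclude=("prizes",),
)
for p in results:
    print(p.posted, p.text)
```

Note that the out-of-range post has the highest raw similarity to the query and the spam post is not far behind, so a plain top-k similarity pull would surface both. The filter step is what keeps them out.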
We do have pure LLM work, but it is more about fine-tuning smaller or smarter models, and we find that to be a tiny % of what people care about. Ex: Spam classifications flowing into our RAG/KG pipelines or small-model training matter more to us than them flowing into big-model training. Long-term, I do expect growing emphasis on the big models we use, but that is a more nuanced discussion.
(We have been piloting w/ gov types and are preparing for the next cohorts, in case that is useful to anyone working on real problems.)