Hi HN, I am Jan, CTO and co-founder of Pathway.com.
We’ve built an LLM microservice that answers questions about a corpus of documents while automatically reacting to newly added docs. The single, self-contained service fully replaces a complex multi-system pipeline that scans in real time for new documents, indexes them into a specialized database, and queries it to generate answers. Everyone can have their own real-time vector index now.
An index is a software building block; it becomes a database when wrapped in a data management system. We will see more and more traditional databases add a vector-search index; for instance, pgvector makes a vector database out of PostgreSQL.
The LLM App is meant to be self-sufficient and takes a "batteries included" approach to system development. Rather than combining several separate applications (databases, orchestrators, ETL pipelines) into one large deployment, it combines several software components, such as connectors and indexes, into a single app that can be deployed directly with no extra dependencies.
Such an approach should make the deployments easier (there are fewer moving parts to monitor and service), while also being more hackable - e.g. adding some more logic on top of nearest neighbor retrieval is easy and adds only a few statements to the code.
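To make the "few statements" point concrete, here is a minimal sketch in plain Python (not the LLM App's actual code, and not Pathway's API). The function names `cosine` and `retrieve`, the `min_score` parameter, and the toy three-dimensional vectors are all assumptions made for illustration. The extra logic layered on top of nearest-neighbor retrieval is a single filter line that drops weak matches.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, docs, k=2, min_score=0.0):
    """Return the texts of the k nearest documents by cosine similarity.

    docs is a list of (text, embedding) pairs. min_score is the
    hypothetical extra logic: matches below it are discarded.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in docs]
    scored = [(s, t) for s, t in scored if s >= min_score]  # the added logic
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy embeddings, made up for the example.
docs = [
    ("pathway unifies stream and batch", [1.0, 0.0, 0.0]),
    ("pgvector adds vectors to PostgreSQL", [0.0, 1.0, 0.0]),
    ("unrelated note", [0.0, 0.0, 1.0]),
]
top = retrieve([0.9, 0.1, 0.0], docs, k=2, min_score=0.5)
```

With the threshold in place, only the first document comes back even though `k=2`, because the second-nearest match scores well below 0.5.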
I see the ingested documents in the data folder don't have an id field, only a doc field.
{"doc": "Using Large Language Models in Pathway is simple: just call the functions from `pathway.stdlib.ml.nlp`!"}
What if I pass two contradictory statements? Is there a way to remove (or better update) a document with a new version?
For example, if I am ingesting some public docs and I update a doc page, how do I make it so that it only takes the answer from the latest document version?
This depends on the data source used. Some track updateable collections, some have a more "append-only" nature. For instance, tracking a database table using CDC+Debezium will support reacting to all document changes out of the box.
For file sources, we are working on supporting file versioning and integration with S3 native object versioning. Then simply deleting the file or uploading a new version would be sufficient to trigger re-indexing of the affected documents.
Pathway (https://github.com/pathwaycom/pathway) is a data processing framework we are developing that unifies stream and batch processing of large datasets. It lets developers concentrate on writing the data processing logic, without worrying about tracking changes to data and updating the results. The same code can then be run on batch data (e.g. during testing) or on real-time data streams (i.e. online query processing).
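As a rough illustration of what "updateable" means for the question above, here is a minimal plain-Python sketch of an index keyed by a document id, so that a new version replaces the old one and a deletion removes it. This is not how Pathway implements it; the class `VersionedIndex`, its methods, and the use of a file path as the id are all hypothetical, and substring matching stands in for vector search.

```python
class VersionedIndex:
    """Hypothetical toy index: keeps one entry per document id (latest version only)."""

    def __init__(self):
        self._docs = {}  # doc_id -> latest text

    def upsert(self, doc_id, text):
        # A new version under the same id overwrites the previous one.
        self._docs[doc_id] = text

    def delete(self, doc_id):
        # Deleting an unknown id is a no-op.
        self._docs.pop(doc_id, None)

    def search(self, term):
        # Case-insensitive substring match, standing in for vector search.
        term = term.lower()
        return [text for text in self._docs.values() if term in text.lower()]


index = VersionedIndex()
index.upsert("docs/limits.md", "The API limit is 10 requests per second.")
index.upsert("docs/limits.md", "The API limit is 50 requests per second.")
hits = index.search("api limit")

index.delete("docs/limits.md")
hits_after_delete = index.search("api limit")
```

Because both versions share the id `docs/limits.md`, the contradictory older statement is never returned; without a stable id (as with the doc-only JSON lines), the index has no way to tell an update from a new document.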
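The idea of writing the logic once and running it in either mode can be sketched in plain Python. This is only an illustration of the concept, not Pathway's API: the function `count_words` and the tiny corpus are assumptions, and the "stream" is simulated by feeding documents one at a time.

```python
from collections import Counter


def count_words(docs):
    """Processing logic, written once: word frequencies over a set of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


corpus = ["Pathway unifies stream and batch", "stream processing in Pathway"]

# Batch mode: the whole corpus is available up front.
batch_result = count_words(corpus)

# Simulated streaming mode: documents arrive one by one and the
# running result is updated incrementally with the same logic.
stream_result = Counter()
for doc in corpus:
    stream_result += count_words([doc])
```

Both runs end with the same counts. In a real framework the incremental bookkeeping is what the engine takes care of, so the developer only writes the equivalent of `count_words`.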
In the LLM app, Pathway allows concentrating on prompt building and querying the LLM APIs as if the corpus of documents were static, while all updates to it are handled by the framework itself.
- https://github.com/pathwaycom/llm-app/blob/main/llm_app/path... for the simplest contextless app
- https://github.com/pathwaycom/llm-app/blob/main/llm_app/path... for the default app that builds a reactive index of context documents
- https://github.com/pathwaycom/llm-app/blob/main/llm_app/path... for the contextful app reading data from S3
- https://github.com/pathwaycom/llm-app/blob/main/llm_app/path... for the app using locally available models