by mkumar10 1080 days ago
- Seen a few of these. Are you all working on providing an easy way to maybe use LLMs for chatting/search without sending my data to OpenAI? If yes, how will you verify the quality is "reasonable"?

- How is this better than Rewind, Needl, Mem, etc., all the personal search engines that have been doing the rounds lately from various knowledge bases? Is the selling point that it's Open-source? Also if Apple improves spotlight, I wonder how useful this will be.

5 comments

Hello! One of the developers of Khoj here.

The way we see it, building in the open is going to be critical for creating an aligned, trustworthy AI assistant.

Note: while all LLM tools look fairly similar on the surface these days, our specific approaches are fairly different. Give us a try and see what you think :-)

And yet you didn't answer them at all.
I can expand on that (I'm the other developer working on the project).

> Seen a few of these. Are you all working on providing an easy way to maybe use LLMs for chatting/search without sending my data to OpenAI? If yes, how will you verify the quality is "reasonable"?

We're working on building a helpful AI assistant, with or without OpenAI. We use offline SentenceTransformer models for search and OpenAI (currently) for chat.

To let users verify quality: with search, you have to look at the quality of the results returned. For chat, we pass along the references (from your docs) used to generate the response. A lot more should be done, and we're open to suggestions.
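As a rough illustration of passing references to chat, here is a minimal sketch that numbers snippets from the user's notes and places them in the prompt so a response can cite them. The `Reference` class, the `build_chat_prompt` function, and the prompt wording are all hypothetical and are not taken from Khoj's code; no model is called.

```python
from dataclasses import dataclass


@dataclass
class Reference:
    """A snippet from the user's own notes (hypothetical structure)."""
    source: str
    text: str


def build_chat_prompt(question: str, references: list[Reference]) -> str:
    """Number each reference so an answer can cite it as [1], [2], ..."""
    notes = "\n".join(
        f"[{i}] ({ref.source}) {ref.text}"
        for i, ref in enumerate(references, start=1)
    )
    return (
        "Answer using only the notes below and cite them by number.\n\n"
        f"Notes:\n{notes}\n\n"
        f"Question: {question}"
    )


# Example notes; the file names and contents are made up for illustration.
refs = [
    Reference("cars.md", "Sold the Corolla in March."),
    Reference("todo.org", "Renew car insurance."),
]
prompt = build_chat_prompt("When did I sell my car?", refs)
print(prompt)
```

Returning the same numbered list to the user alongside the answer is what lets them check each claim against their own documents.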

We also have our own chat quality test suite that "benchmarks" chat capabilities (via pytest).
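A pytest-style check of this kind might look roughly like the sketch below. It is not Khoj's actual test suite: `fake_chat` is a hypothetical stand-in for the real chat call (it just matches shared words), and `NOTES` is made-up data. Plain `test_` functions with bare `assert` statements are what pytest collects and runs.

```python
import re

# Made-up notes used as the chat "knowledge base" for the tests.
NOTES = ["Sold my car in March.", "Bought oat milk."]


def fake_chat(question: str, notes: list[str]) -> dict:
    """Hypothetical stand-in for a chat call: returns notes sharing a word with the question."""
    words = set(re.findall(r"\w+", question.lower()))
    used = [n for n in notes if words & set(re.findall(r"\w+", n.lower()))]
    return {"response": used[0] if used else "I don't know.", "references": used}


def test_answer_is_grounded_in_references():
    result = fake_chat("When did I sell my car?", NOTES)
    # The answer must come with at least one reference from the user's notes.
    assert result["references"], "expected at least one reference"
    assert all(ref in NOTES for ref in result["references"])


def test_unrelated_question_returns_no_references():
    result = fake_chat("What is the capital of France?", NOTES)
    assert result["references"] == []
```

Saved as a file such as `test_chat.py` (name assumed), this would be run with the `pytest` command.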

> How is this better than Rewind, Needl, Mem, etc., all the personal search engines that have been doing the rounds lately from various knowledge bases? Is the selling point that it's Open-source? Also if Apple improves spotlight, I wonder how useful this will be.

- I've tried Rewind. It's a neat project with a slick UI, no doubt about it. But 1. It has a cold boot problem (you can only search stuff you've opened since you installed Rewind) and 2. It's limited to Mac (M1+) machines. Khoj will index all supported files across your data sources and it can run on other machines easily.

- Needl, based on their homepage, seems to provide fuzzy/keyword-based search. Khoj search works offline and supports natural language queries (e.g. search for "sold my car for" and it'll find notes about your Toyota Corolla or Ferrari).

- Mem.ai is pretty neat as well. We'd love to add all the features they have. With Khoj you can self-host if you prefer or use Khoj cloud if you want to sync across devices. And it integrates into your existing tools (Emacs, Obsidian and Web)

In summary, Khoj being open-source is a critical differentiator for an AI assistant to be trustable (you can see what the code is doing). But all the AI assistance approaches are also different.

>Are you all working on providing an easy way to maybe use LLMs for chatting/search without sending my data to OpenAI?

From a brief look at the GitHub repo, there seems to be a need to set up an OpenAI API key, so I'm not sure whether this currently has the ability to chat/search without sending data to OpenAI or needing OpenAI API access?

Search does currently work 100% offline - none of your data would be sent to OpenAI if all you're doing is searching for your local documents. You could completely disable your internet connection and it would still work.
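To illustrate why this kind of search needs no network, here is a minimal sketch of the ranking step: embed the query and each note locally, then sort by cosine similarity. The `embed` function is a toy bag-of-words stand-in, not a SentenceTransformer model, so it only matches shared words; a real embedding model would also match by meaning. All names here are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a local embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query: str, notes: list[str], top_k: int = 1) -> list[str]:
    """Rank notes by similarity to the query; everything runs on the local machine."""
    q = embed(query)
    return sorted(notes, key=lambda n: cosine(q, embed(n)), reverse=True)[:top_k]


# Made-up notes for illustration.
notes = [
    "Sold my car, the Corolla, for a fair price",
    "Recipe for lentil soup",
    "Meeting notes from Tuesday",
]
print(search("sold my car for", notes))
```

Swapping `embed` for a locally stored embedding model would keep the same structure, since the ranking only needs the vectors.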

Chat currently is only integrated with OpenAI because it had the highest quality + lowest barrier to entry. We're experimenting with open source LLMs and hope to have an alternative available soon.

"The way we see it, building in the open is going to be critical for creating an aligned, trustworthy AI assistant."

Isn't this service just a very thin wrapper around ChatGPT? How on earth do you have any influence on alignment or trustworthiness? That's like saying your coffee cup makes your coffee fair trade.

This whole thread is very disingenuous, it's literally a simple interface for the OpenAI-API drenched in fake buzzwords boosted to the top of HN to scam investors.

You're being overly critical. You can definitely control the alignment of your assistant with prompt engineering and embeddings. They never say they control the underlying model.

It's an open source project and I don't see why you need to be so obnoxious about it.

Are they being obnoxious without cause though?

The Khoj website says, and I quote:

> Khoj's offline AI models allow you to find information using natural language queries. Search using terms that are similar to what you're looking for, rather than exact or fuzzy matches. Khoj search works offline. So if you self-host your data never leaves your machine and search works without internet.

Emphasis mine.

It seems somewhat disingenuous.

I get it, parts of it run offline, parts of it use the openai api… but that’s not what it says on the box.

Why is the project making a song and dance about self hosting and being open source when it’s just another openai app?

If it’s not just another openai wrapper, cut the openai part of it out and pitch it that way, sure.

…but as it stands, I’m pretty sceptical.

Lots of people are doing the “ai magic” tech demo stuff at the moment, but when you cut them off from the openai api the magic goes away and what’s left isn't very good or interesting.

Maybe this is different? …but it doesn’t look like it; and since they’re tied up with the openai api and you can’t use it without that, how would I even tell?

>> Khoj search works offline. So if you self-host your data never leaves your machine and search works without internet.

> Emphasis mine.

> It seems somewhat disingenuous.

I've been trying it. Khoj search does work offline. Khoj chat (they are literally separate functions in the app) requires an OpenAI key and, if you give it one, uses OpenAI.

Yes! It's a bit more than, "somewhat disingenuous," to say a system built to use the OpenAI API works with you to make sure, "your data never leaves your machine".

That's like saying I invented a new form of transportation where your feet never leave the ground but in actuality I'm just a travel agent sending you to the airport.

"your data never leaves your machine" is only mentioned in the Search section, where it is true. No one reading that would assume it applied to everything, considering the last two sentences in the Chat section above explicitly say it's using OpenAI.

Really feels like people are nitpicking and hating on this project for no good reason. I feel sorry for the authors.

It feels like you are reading too much into this. Really don't understand all the bashing here. It's open source software for building things using OpenAI. Do you think LangChain is similarly disingenuous? Or the Vercel AI SDK?
Neither of those things claim:

> So if you self-host your data never leaves your machine

Langchain works completely offline with an appropriate LLM/API backend & vector store if needed.
It says open source AI personal assistant.

The AI isn't open source, and sending your data to a third party isn't really trustworthy or personal.

I understand your concerns, but let me zoom out a little here and talk about the nature of open source.

Open source means that the source code developed for a piece of software is fully open (i.e., anyone can read, fork, or modify the code for what they are installing).

According to your definition, it would be really hard to do anything that is fully, end to end, open source. We've developed the code on Macs, hosted the code on GitHub, written plugins for Obsidian and GitHub, hosted the website on AWS. All of those are closed-source software.

https://www.redhat.com/en/topics/open-source/what-is-open-so...

That being said, we are planning to integrate an open source LLM soon. When we added chat, OpenAI just had the best one, but the space is changing so quickly. We're both super enthusiastic about seeing all the open source tooling for this stuff evolve.

The problem is not that there is "glue" to closed-source apps. It's that the essential core of your product, without which your product has no content or meaningful use — is _someone else's closed-source model._

If I market "a totally creative-commons blockbuster Hollywood movie", but my actual product is just a creative-commons-licensed set of driving directions to some nearby movie theater where you can buy tickets to see the same copyrighted movies anyone else is offering, then _the fundamental essence_ of what I'm offering is not, in fact, creative-commons. I sold people on a _movie_ with that license, and then failed to deliver.

That's what you've done here.

To be clear, the fatal flaw is that your marketing is dishonest about what your product currently is, not that your product is something nobody wants. I'd recommend either making your marketing honest, or else making your product live up to what your marketing promises.

_Then_ you do the PR push on HN.

> it would be really hard to do anything that is fully, end to end, open source. We've developed the code on Macs, hosted the code on GitHub, written plugins for Obsidian and GitHub, hosted the website on AWS.

Yeah, nobody did open source before those things existed...

I don't understand the presumption that the AI should be open source here. If I release an open source SDK for talking to an API, it's still open source even if the underlying API isn't.
> I don't understand the presumption that the AI should be open source here.

Because it literally says "open-source AI".

Exactly my thoughts. This person is just gaming the system on here
> Seen a few of these. Are you all working on providing an easy way to maybe use LLMs for chatting/search without sending my data to OpenAI?

Curious: What informs reservations about the use of OpenAI models? Their API terms state explicitly that they do not use customer data for training and that they delete it after 30 days, anyway.

> Also if Apple improves spotlight, I wonder how useful this will be.

There are 3x more Android phones and PCs than iPhones and Macs. Just sayin'

> What informs reservations about the use of OpenAI models?

Three things. For one, I have no reason to take them at their word that they aren’t saving data to train on. Two is that OpenAI will shut down one day, and thus I would like any services I run to outlive them. Third and finally, I have hardware and it’d be a waste not to use it. As a bonus, I find it hypocritical that a company that benefits so heavily from open source would hide away their models as closed source in fear of copycats.

> For one, I have no reason to take them at their word that they aren’t saving data to train on.

How are you able to trust cloud providers (even VPS or managed bare metal ones)? I have seen the same sentiment among bigger companies who happily store all their users' data in the cloud.

I don’t. Any data I purposefully store in the cloud that has any significance I store encrypted. I also do my best to minimize my exposure to non-E2EE services for important purposes, and self-host when possible.

This industry has an atrocious track record of claiming to respect privacy, and then doing something entirely different. I have no reason to think OpenAI are lying, but it would still be wise to be extremely cautious of putting sensitive data in their hands.

Given the narrative and the place (HN) you’re saying it, I’m betting you don’t use Google for storing your data either, but the vast majority of the world does. As someone who trusts Google, I trust OpenAI to almost the same level. Doesn’t mean I think they’re the good guys, but that I am not worried about the risk that much.

So, basically, you don’t care about the privacy of your digital data. That is fine, but it represents an extreme position regardless of how many people follow this path.

it’s not an extreme position. it’s a position shared by significant numbers of people worldwide, as is evident from the number of customers of these platforms you feel threatened by. it’s only considered “extreme” in the echo chamber of HN.

Sure it is, on a 10-point scale.

0 - Full privacy off the grid

9 - Brain implant with all data shared to the world

8.5 - Allowing Google to have and scan for ads and government perusal all of your personal emails, written thoughts, location info, friends and accomplices, calendar, photographs, etc

I think killing animals and eating them is an extreme position too but it’s considered obnoxious to say that, so how is this any different?
>they do not use customer data for training and that they delete it after 30 days, anyway.

I don't use X, just keep it around, 'just in case' for 30 days.

as someone who refers back to previous chats quite frequently i’m glad they do this and would use a feature to extend that period of time.
It’s API calls not their user chat portal. You can’t access the stored data, they say they keep it around for 30 days in case of abuse so they can refer back to it to verify and take action.
> Also if Apple improves spotlight, I wonder how useful this will be.

Do you really not see the usefulness of a solution that caters to the remaining 88% (desktop/notebooks) of the market?

What counts as "reasonable" from OpenAI is again subject to their whims and to changes in what they consider appropriate for you.

Haven't seen a roadmap on Spotlight to include semantic search across my entire local drive. Maybe if they integrate Journal/Freeform/Notes into one thing, then it is deliberate & works with things I explicitly want it to understand & help me work with, rather than the tools that you've listed, which just help you find stuff.

To me, this makes a significant difference.

While I would prefer that I could run the LLM locally, being able to see the code that calls the api is a clear second best. At this point in time, I am not going to trust any black box that can read my data and run "AI" on it because I find the risk too big. If I can self-host something, I might just be willing to try it out.