by sabaimran 1081 days ago
Hello! One of the developers of Khoj here.

The way we see it, building in the open is going to be critical for creating an aligned, trustworthy AI assistant.

Note: while all LLM tools look fairly similar on the surface these days, our specific approaches are fairly different. Give us a try and see what you think :-)

3 comments

And yet you didn't answer them at all.
I can expand on that (I'm the other developer working on the project).

> Seen a few of these. Are you all working on providing an easy way to maybe use LLMs for chatting/search without sending my data to OpenAI? If yes, how will you verify the quality is "reasonable"?

We're working on building a helpful AI assistant, with or without OpenAI. We use offline SentenceTransformer models for search and OpenAI (currently) for chat.

To let users verify quality: for search, you have to look at the quality of the results returned. For chat, we pass along the references (from your docs) used to generate the response. A lot more should be done, and we're open to suggestions.

We also have our own chat quality test suite that "benchmarks" chat capabilities (via pytest).
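
To make the "pass references" idea concrete, here is a minimal sketch, not Khoj's actual implementation. The names `Reference`, `build_prompt`, `answer_with_references`, and `chat_model` are hypothetical, and `chat_model` stands in for any callable that maps a prompt string to a reply.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Reference:
    """A snippet retrieved from the user's notes (hypothetical structure)."""
    source: str  # where the snippet came from, e.g. a file path
    text: str


def build_prompt(question: str, references: list[Reference]) -> str:
    """Number each retrieved snippet so the answer can cite it as [1], [2], ..."""
    notes = "\n".join(
        f"[{i}] ({ref.source}) {ref.text}"
        for i, ref in enumerate(references, start=1)
    )
    return (
        "Answer using only the notes below and cite them by number.\n"
        f"Notes:\n{notes}\n"
        f"Question: {question}"
    )


def answer_with_references(
    question: str,
    references: list[Reference],
    chat_model: Callable[[str], str],
) -> dict:
    """Return the model's answer together with the snippets it was shown."""
    prompt = build_prompt(question, references)
    return {"answer": chat_model(prompt), "references": references}
```

Because the model is passed in as a plain callable, a pytest-style check can swap in a stub and assert on the prompt and the returned references without calling any external API.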

> How is this better than Rewind, Needl, Mem, etc all the personal search engine that have been doing the rounds lately from various knowledge bases? Is the selling point that it's Open-source? Also if Apple improves spotlight, I wonder how useful this will be.

- I've tried Rewind. It's a neat project with a slick UI, no doubt about it. But 1. It has a cold boot problem (you can only search stuff you've opened since you installed Rewind) and 2. It's limited to Mac (M1+) machines. Khoj will index all supported files across your data sources and it can run on other machines easily.

- Needl, based on their homepage, seems to provide fuzzy/keyword-based search. Khoj search works offline and supports natural language queries (e.g., search for "sold my car for" and it'll find notes about your Toyota Corolla or Ferrari)

- Mem.ai is pretty neat as well. We'd love to add all the features they have. With Khoj you can self-host if you prefer or use Khoj cloud if you want to sync across devices. And it integrates into your existing tools (Emacs, Obsidian and Web)

In summary, Khoj being open-source is a critical differentiator for an AI assistant to be trustable (you can see what the code is doing). But all the AI assistance approaches are also different.

>Are you all working on providing an easy way to maybe use LLMs for chatting/search without sending my data to OpenAI?

From a brief look at the GitHub repo, there seems to be a need to set up an OpenAI API key, so I'm not sure whether this currently has the ability to chat/search without sending data to, or needing access to, the OpenAI API?

Search does currently work 100% offline - none of your data would be sent to OpenAI if all you're doing is searching for your local documents. You could completely disable your internet connection and it would still work.
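
As a rough illustration of how embedding-based search of this kind can run entirely locally (a sketch under stated assumptions, not Khoj's code): `search`, `cosine`, and the `embed` parameter are hypothetical names. `embed` is any function that turns text into a vector; a locally loaded SentenceTransformer model's `encode` method could play that role.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query: str,
    notes: list[str],
    embed: Callable[[str], Vector],
    top_k: int = 3,
) -> list[tuple[float, str]]:
    """Rank notes by similarity between their embeddings and the query's."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(note)), note) for note in notes]
    # Highest similarity first; nothing here needs a network connection.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
```

Since ranking only compares vectors, notes that share no keywords with the query can still score highly if the embedding model places them close together.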

Chat currently is only integrated with OpenAI because it had the highest quality + lowest barrier to entry. We're experimenting with open source LLMs and hope to have an alternative available soon.

"The way we see it, building in the open is going to be critical for creating an aligned, trustworthy AI assistant."

Isn't this service just a very thin wrapper around ChatGPT? How on earth do you have any influence on alignment or trustworthiness? That's like saying your coffee cup makes your coffee fair trade.

This whole thread is very disingenuous, it's literally a simple interface for the OpenAI-API drenched in fake buzzwords boosted to the top of HN to scam investors.

You're being overly critical. You can definitely control the alignment of your assistant with prompt engineering and embeddings. They never say they control the underlying model.

It's an open source project and I don't see why you need to be so obnoxious about it.

Are they being obnoxious without cause though?

The Khoj website says, and I quote:

> Khoj's offline AI models allow you to find information using natural language queries. Search using terms that are similar to what you're looking for, rather than exact or fuzzy matches. Khoj search works offline. So if you self-host your data never leaves your machine and search works without internet.

Emphasis mine.

It seems somewhat disingenuous.

I get it, parts of it run offline, parts of it use the openai api… but that’s not what it says on the box.

Why is the project making a song and dance about self hosting and being open source when it’s just another openai app.

If it’s not just another openai wrapper, cut the openai part of it out and pitch it that way, sure.

…but as it stands, I’m pretty sceptical.

Lots of people are doing the “ai magic” tech demo stuff at the moment, but when you cut them off from the openai api the magic goes away and what’s left isn't very good or interesting.

Maybe this is different? …but it doesn’t look like it; and since they’re tied up with the openai api and you can’t use it without that, how would I even tell?

>> Khoj search works offline. So if you self-host your data never leaves your machine and search works without internet.

> Emphasis mine.

> It seems somewhat disingenuous.

I've been trying it. Khoj search does work offline. Khoj chat (they are literally separate functions in the app) requires an OpenAI key and, if you give it one, uses OpenAI.

Yes! It's a bit more than, "somewhat disingenuous," to say a system built to use the OpenAI API works with you to make sure, "your data never leaves your machine".

That's like saying I invented a new form of transportation where your feet never leave the ground but in actuality I'm just a travel agent sending you to the airport.

"your data never leaves your machine" is only mentioned in the Search section, where it is true. No one reading that would assume it applied to everything, considering the last two sentences above in the Chat section explicitly say it's using OpenAI.

Really feels like people are nitpicking and hating on this project for no good reason. I feel sorry for the authors.

It feels like you are reading too much into this. I really don't understand all the bashing here. It's open source software for building things using OpenAI. Do you think LangChain is similarly disingenuous? Or the Vercel AI SDK?
Neither of those things claim:

> So if you self-host your data never leaves your machine

You're quoting the paragraph under "Search", describing their search engine. I feel you're misrepresenting it.

Anyway, I definitely don't think this deserves to be described as "a simple interface for the OpenAI-API drenched in fake buzzwords boosted to the top of HN to scam investors" or "twitter get-rich-quick-guru level lousy and fake, and is clearly boosted to the top of HN".

Horrible reactions in this thread to open source software you can fork to use whatever you want. Really disappointing.

LangChain works completely offline with an appropriate LLM/API backend and vector store if needed.
No one is preventing you from creating a PR or forking this project to add whatever backend you want. Did LangChain fully cover all backends on release? Are you not allowed to release a project that only supports OpenAI?

You really need to explain what you are hating on here.

It says open source AI personal assistant.

The AI isn't open source, and sending your data to a third party isn't really trustworthy or personal.

I understand your concerns, but let me zoom out a little here and talk about the nature of open source.

Open source means that the source code developed for a piece of software is fully open: anyone can read, fork, and modify the code for what they are installing.

According to your definition, it would be really hard to do anything that is fully, end to end, open source. We've developed the code on Macs, hosted the code on GitHub, written plugins for Obsidian and GitHub, hosted the website on AWS. All of those are closed-source software.

https://www.redhat.com/en/topics/open-source/what-is-open-so...

That being said, we are planning to integrate an open source LLM soon. When we added chat, OpenAI just had the best one, but the space is changing so quickly. We're both super enthusiastic about seeing all the open source tooling for this stuff evolve.

The problem is not that there is "glue" to closed-source apps. It's that the essential core of your product, without which your product has no content or meaningful use — is _someone else's closed-source model._

If I market "a totally creative-commons blockbuster Hollywood movie", but my actual product is just a creative-commons-licensed set of driving directions to some nearby movie theater where you can buy tickets to see the same copyrighted movies anyone else is offering, then _the fundamental essence_ of what I'm offering is not, in fact, creative-commons. I sold people on a _movie_ with that license, and then failed to deliver.

That's what you've done here.

To be clear, the fatal flaw is that your marketing is dishonest about what your product currently is, not that your product is something nobody wants. I'd recommend either making your marketing honest, or else making your product live up to what your marketing promises.

_Then_ you do the PR push on HN.

> it would be really hard to do anything that is fully, end to end, open source. We've developed the code on Macs, hosted the code on GitHub, written plugins for Obsidian and GitHub, hosted the website on AWS.

Yeah, nobody did open source before those things existed...

I don't understand the presumption that the AI should be open source here. If I release an open source SDK for talking to an API, it's still open source even if the underlying API isn't.
> I don't understand the presumption that the AI should be open source here.

Because it literally says "open-source AI".

It says "open-source AI assistant".
Exactly my thoughts. This person is just gaming the system on here.