|
|
|
|
|
by kgeist
864 days ago
|
|
Interesting, I also manage an IRC bot with multimodal capability for months now. It's not a real LMM - rather, a combination of 3 models. It uses Llava for images and Whisper for audio. The pipeline is simple: if it finds a URL which looks like an image - it feeds it to Llava (same with audio). Llava's response is injected back to the main LLM (a round robin of Solar 10.7B and Llama 13B) to provide the response in the style of the bot's character (persona) and in the context of the conversation. I run it locally on my RTX 3060 using llama.cpp. Additionally, it's also able to search on Wikipedia, in the news (provided by Yahoo RSS) and can open HTML pages (if it sees a URL which is not an image or audio). Llava is a surprisingly good model for its size. However, what I found is that it often hallucinates "2 people in the background" for many images. I made the bot just to explore how far I can go with local off-the-shelf LLMs, I never thought it could be useful for blind people, interesting. A practical idea I had on my mind was to hook it to a webcam so that if something interesting happens in front of my house, I can be notified by the bot, for example. I guess it could also be useful for blind people if the camera is mounted on the body. |
|
There's a creepypasta or an SCP entry in there. You do not recognize the people in the background.