by victorbjorklund 1034 days ago
What type of analysis do you do on the text? And how is the performance/cost of running vs more specialized models trained for the task?

This isn't our field, but it's something similar. Say some of your clients are old publications, some with articles dating back to the 1800s. Nearly all the work is digitized, but searching for something in the great categorized mess is a nightmare. As most old publications are downsizing, they don't have the manpower to curate their archives but are inundated with research requests nearly 24/7. As a service to help these publications maintain their image as organized, informative keepers of historical records, you could do the following: 1. Have an LLM make a series of tags for all the articles. 2. Make a summary for each article to improve search results. 3. Provide a service, to them or upsold to their clients, where a question/prompt can be run across every article or a section of articles.
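The three steps above can be sketched roughly as follows. This is only an illustration, not how any particular product does it: the `Article` class, the prompt wording, and the `generate` callable (any function that sends a prompt to a model and returns text) are all assumptions, and `fake_generate` is a stand-in so the sketch runs without a model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Any function that takes a prompt and returns the model's reply as text.
# How it reaches the model (local server, hosted API) is left open here.
Generate = Callable[[str], str]


@dataclass
class Article:  # hypothetical record type for one archived article
    article_id: str
    text: str
    tags: List[str] = field(default_factory=list)
    summary: str = ""


def make_tags(generate: Generate, text: str, max_tags: int = 8) -> List[str]:
    """Step 1: ask the model for comma-separated tags, then normalise them."""
    prompt = (
        f"List up to {max_tags} short topic tags for the article below, "
        "separated by commas. Reply with the tags only.\n\n" + text
    )
    seen, tags = set(), []
    for raw in generate(prompt).split(","):
        tag = raw.strip().lower()
        if tag and tag not in seen:  # drop empties and duplicates
            seen.add(tag)
            tags.append(tag)
    return tags[:max_tags]


def make_summary(generate: Generate, text: str) -> str:
    """Step 2: a short summary that can be indexed for search."""
    prompt = "Summarize the article below in two sentences.\n\n" + text
    return generate(prompt).strip()


def enrich(generate: Generate, article: Article) -> Article:
    article.tags = make_tags(generate, article.text)
    article.summary = make_summary(generate, article.text)
    return article


def ask_archive(generate: Generate, articles: List[Article], question: str) -> Dict[str, str]:
    """Step 3: run one question across every article, keyed by article id."""
    answers = {}
    for article in articles:
        prompt = (
            "Answer the question using only the article below.\n\n"
            f"Question: {question}\n\nArticle:\n{article.text}"
        )
        answers[article.article_id] = generate(prompt).strip()
    return answers


def fake_generate(prompt: str) -> str:
    """Stand-in model so the sketch runs without any LLM installed."""
    if prompt.startswith("List up to"):
        return "Railways, Strikes, railways, 1890s"
    if prompt.startswith("Summarize"):
        return " Workers walked out over pay. "
    return "Not mentioned."


article = enrich(fake_generate, Article("a1", "Rail workers struck in 1894..."))
print(article.tags, article.summary)
```

Keeping the model behind a plain callable means the same pipeline can be pointed at a small local model or a hosted one without touching the tagging logic.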

> how is the performance/cost of running vs more specialized models trained for the task

Most of the models are open-source licensed, so that's not an issue. But I imagine you meant the age-old question of hosting yourself vs using OpenAI. Truth is, as of now self-hosting is not forecasted to beat using one of the less intelligent models on OpenAI. On hardware cost alone, yes, but dev time is very expensive. Luckily we're a small company and our CEO sees this as training. Because LLMs are so new, there really isn't a large labor market for them yet. If our devs and engineers get in this early, then we can beat others to market as the technology develops and new opportunities come to light. On top of having possible HIPAA, GDPR, or other security laws to follow that OpenAI has been very shoddy about, we do not want to be at the whim of OpenAI or another SaaS provider on a mission-critical part of a vertical. They have talked about deprecating old models. They have also made content changes in their models to placate political critics, while not realizing that this pulls the rug out from under developers that need any sense of stability from their product.

Do you have any suggestions about how to start implementing something like this in-house? I'm sitting on thousands of PDFs (that can be trivially turned into text) and it would be really useful to train an LLM on them for information retrieval.

But the dev and computing cost of this feels so huge that I'm not even sure where to start.

My first way of showcasing this was by taking a spare computer sitting around the office, then writing a little Python script that used an LLM to parse information out of file names that our finance team would use to label rebilling invoices. The invoices included the client, payment date, amount, late payment status, etc., written in a convoluted and completely inconsistent file name. The little office PC had 16GB of RAM, so it was usable for an LLM via the CPU, and I just let it run for like 2 days. I continued with my normal work, and when it finished I had an intern spend 1 whole day validating just 6% of the data and found it to be 97 percent accurate. I made some obvious changes and was able to fill in that 3% gap. (Later we did find a handful of errors, but overall you could consider the validation 99% accurate.)
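A rough sketch of that kind of script, for anyone who wants a starting point. It is not the original script: the field names, the prompt wording, and the JSON reply format are assumptions made up for illustration. It shows building the prompt, defensively parsing the reply, drawing a 6% sample for a human to check, and computing accuracy from their verdicts.

```python
import json
import random
from typing import Dict, List, Optional

# Hypothetical field names; the real invoices may label these differently.
FIELDS = ["client", "payment_date", "amount", "late"]


def build_prompt(file_name: str) -> str:
    """Ask the model to return the invoice fields as one JSON object."""
    return (
        "Extract the following fields from the invoice file name and reply "
        f"with a single JSON object with keys {FIELDS}. "
        "Use null for anything that is missing.\n\n"
        f"File name: {file_name}"
    )


def parse_reply(reply: str) -> Optional[Dict[str, object]]:
    """Return the parsed fields, or None if the reply holds no usable JSON."""
    # Small models often wrap the JSON in chatter, so cut from the first
    # opening brace to the last closing brace before parsing.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    return {key: data.get(key) for key in FIELDS}


def validation_sample(items: List[str], fraction: float = 0.06, seed: int = 0) -> List[str]:
    """Pick a reproducible random subset for a person to check by hand."""
    rng = random.Random(seed)
    k = max(1, round(len(items) * fraction))
    return rng.sample(items, min(k, len(items)))


def accuracy(checked: Dict[str, bool]) -> float:
    """Share of hand-checked items that the reviewer marked correct."""
    if not checked:
        raise ValueError("no checked items")
    return sum(checked.values()) / len(checked)
```

Replies that fail `parse_reply` are worth logging separately, since they are cheap to detect and usually point at a prompt problem rather than a model limitation.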

While it really resonated with my management, I felt worried I wouldn't be able to replicate these kinds of results on other projects.

THE ONLY REAL ADVICE I CAN GIVE ON AI PROJECTS IS . . . don't let your management's expectations of LLMs outweigh their capabilities.

I'm sure I speak for many people here when I say that non-tech-fluent directors get together and think GPT4 is some sort of deity. GPT4 is smart (or used to be, at least), I'll give it that, but small locally hosted 7B/13B LLMs are very limited. People, for whatever reason, get AI infatuation: the second they finally see you show direct value in it, they will lose their shit over its assumed capabilities. You've got to be direct with them that no matter what dumb video they saw on Sam Altman, what you are proposing is not that. Be very clear about its possible scope, because there is some idiot in your organization that will assume you can programmatically answer prayers. I actually had this guy from our networking team try and raise a concern about the LLM going sentient and us having a "Skynet" problem. Granted, this was back in March 2023, so AI hysteria was a little more rampant, but still.

tl;dr my recommendation for your PDF project is to run https://github.com/oobabooga/text-generation-webui. If you can get a 30-series GPU in your company, then run a 13B 4-bit model that can pull info, assign tags, and run minor analysis on your text. Otherwise, find a spare 16GB machine and do the same, but over a longer time scale.
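If the job is going to run for days on a spare machine, it helps if it can be stopped and restarted. Below is a minimal sketch of such a batch runner. The endpoint URL and the response shape in `local_generate` are assumptions (an OpenAI-style completions endpoint on localhost, which recent versions of text-generation-webui can expose when started with its API enabled), so check what your install actually serves. `run_batch`, the prompt, and the `max_chars` cut-off are likewise made up for illustration.

```python
import json
import urllib.request
from pathlib import Path
from typing import Callable

# Assumption: a local server with an OpenAI-style completions endpoint.
API_URL = "http://127.0.0.1:5000/v1/completions"


def local_generate(prompt: str, max_tokens: int = 256) -> str:
    """Send one prompt to the assumed local endpoint and return the text."""
    payload = json.dumps(
        {"prompt": prompt, "max_tokens": max_tokens, "temperature": 0}
    ).encode("utf-8")
    request = urllib.request.Request(
        API_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    # CPU inference is slow, so allow a generous timeout per request.
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["text"].strip()


def run_batch(
    input_dir: Path,
    output_path: Path,
    generate: Callable[[str], str],
    max_chars: int = 6000,
) -> int:
    """Process each .txt file once, appending one JSON line per file.

    Files already recorded in output_path are skipped, so the job can be
    interrupted and resumed. Returns how many files this call processed.
    """
    done = set()
    if output_path.exists():
        with output_path.open(encoding="utf-8") as existing:
            done = {json.loads(line)["file"] for line in existing if line.strip()}

    processed = 0
    with output_path.open("a", encoding="utf-8") as out:
        for path in sorted(input_dir.glob("*.txt")):
            if path.name in done:
                continue
            # Crude guard against overflowing a small context window.
            text = path.read_text(encoding="utf-8", errors="replace")[:max_chars]
            prompt = "Assign up to five topic tags to this text, comma separated:\n\n" + text
            record = {"file": path.name, "reply": generate(prompt)}
            out.write(json.dumps(record) + "\n")
            out.flush()  # keep finished work on disk if the run is killed
            processed += 1
    return processed


# Example call (needs the local server running):
# run_batch(Path("pdf_text"), Path("tags.jsonl"), local_generate)
```

Writing one JSON line per file and flushing after each is a deliberate choice: a crash on day two costs one document, not the whole run.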

Run a prompt that checks for hallucinations: "Does the following text make sense?" + previous prompt + text. If yes, then keep it; else make the intern do it.
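As a sketch, that check might look like the following. The exact prompt wording, the yes/no parsing, and the function names are assumptions, and `generate` is any function that sends a prompt to your model and returns its reply. Note that a second pass by the same small model is only a coarse filter, not proof that the output is right.

```python
from typing import Callable, List, Tuple

Generate = Callable[[str], str]


def build_check_prompt(previous_prompt: str, output_text: str) -> str:
    """Wrap the earlier prompt and the model's output in a sanity check."""
    return (
        "Does the following text make sense as a response to the prompt? "
        "Answer yes or no.\n\n"
        f"Prompt:\n{previous_prompt}\n\nText:\n{output_text}"
    )


def passes_check(generate: Generate, previous_prompt: str, output_text: str) -> bool:
    """True only when the reply clearly starts with 'yes'."""
    reply = generate(build_check_prompt(previous_prompt, output_text))
    return reply.strip().lower().startswith("yes")


def triage(
    generate: Generate, results: List[Tuple[str, str]]
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split (prompt, output) pairs into kept items and items for a human."""
    kept, manual = [], []
    for previous_prompt, output_text in results:
        if passes_check(generate, previous_prompt, output_text):
            kept.append((previous_prompt, output_text))
        else:
            manual.append((previous_prompt, output_text))
    return kept, manual


def fake_checker(prompt: str) -> str:
    """Stand-in model so the sketch runs without any LLM installed."""
    return "Yes." if "Acme" in prompt else "No, the text is unrelated."


kept, manual = triage(
    fake_checker,
    [
        ("Extract client: acme_2021.pdf", "Acme"),
        ("Extract client: zz.pdf", "banana bread recipe"),
    ],
)
print(len(kept), "kept,", len(manual), "for manual review")
```

Anything that is not an unambiguous "yes" goes to the manual pile, which errs on the side of more human review rather than less.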

GPT-J-6B is still one of the best models because it has indexing & categorizing as its main purpose. Other models are great, but the core idea behind LLMs is that it's just a high-level autocomplete.