by the_real_sparky 1263 days ago
I think the actual interface with OpenAI’s platform is the easy part. Everybody and their dog will have a version of this. Just look at the comments so far - many of us have already been playing with it.

If you want a real moat, figure out how to parse existing PDF documentation that is really badly formatted. Think diagrams and tables with text floating in various places, etc. Documentation of this style is very common in industries where physical things are being built in the real world. The standards documentation (IEEE, ANSI, NFPA, etc) doesn’t usually parse cleanly, much less the messier internal documentation within the businesses.

Grobid is the best example of such a documentation parser, but it is so laser-focused on academic papers that it fails to properly process industry-style standards and SOP documentation. What the world needs right now is a Grobid that works for other kinds of messy documentation.


I think you are right: this will be the key differentiator for anyone building a service like this. As with most machine learning and data science projects, the real work is on the data engineering side of things.

One thing that all these models will lack is the ability to include diagrams (on both the input and output side). Working out a clever way to do that would be very cool.

At the moment there are some difficulties with the GPT interface, the trickiest being the limit on the length of the input prompt. I'm not sure how much fine-tuning helps with this.
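One common workaround for the prompt limit (not something this thread settles on) is to split long documents into overlapping chunks and only send the relevant ones. A minimal sketch, assuming whitespace-separated words are a rough stand-in for tokens; `chunk_words`, `max_words` and `overlap` are hypothetical names, and real token counts will differ:

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping chunks of at most max_words words.

    Words are only a crude proxy for model tokens (an assumption here),
    so leave headroom when choosing max_words.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap is there so that a sentence cut at a chunk boundary still appears whole in at least one chunk.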

But my assumption is that OpenAI will improve this, so there isn't much room to differentiate here.

Yep, my only idea so far has been to generically describe the figures in text format. Doing so through recognition at any level of detail will be extremely tough, as the drawings often differ by variations that would be difficult for a model to understand. It may not matter that much though, as the notes and headings around each figure usually provide a lot of context. So maybe you can get 75% of the way there by identifying the “block” and keeping the textual information in that area associated together so that it can be fed into the embeddings (and thus later the LLM) as a single unit of related information.
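A minimal sketch of that “block” idea, assuming some PDF extractor has already produced text fragments with bounding boxes. The `Fragment` type, the `gap` threshold and `group_blocks` are hypothetical names for illustration, not any particular library's API. Fragments whose boxes sit within `gap` units of each other get merged into one block of text:

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom


def near(a, b, gap):
    # True when the space between the two boxes is at most `gap` on both axes.
    return (a.x0 - gap <= b.x1 and b.x0 - gap <= a.x1 and
            a.y0 - gap <= b.y1 and b.y0 - gap <= a.y1)


def group_blocks(fragments, gap=12.0):
    """Merge nearby fragments into blocks using a simple union-find."""
    parent = list(range(len(fragments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(fragments)):
        for j in range(i + 1, len(fragments)):
            if near(fragments[i], fragments[j], gap):
                parent[find(i)] = find(j)

    groups = {}
    for i, frag in enumerate(fragments):
        groups.setdefault(find(i), []).append(frag)

    blocks = []
    for members in groups.values():
        # Rough reading order: top to bottom, then left to right.
        members.sort(key=lambda f: (f.y0, f.x0))
        blocks.append(" ".join(m.text for m in members))
    return blocks
```

Each returned string could then be embedded as one unit. The pairwise comparison is quadratic per page, which is likely fine for a page's worth of fragments but would need a spatial index for anything larger.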

It’s frustrating though, as there are often hundreds to thousands of pages of this stuff with diagrams and drawings randomly situated together on the pages. Documentation like this was designed to be dense for printing and to be consumed by a human who is familiar with it from regular use. I’m a bit concerned that the only solution may be paying a technical expert to sit down and convert it all to blocks of text. It would be an expensive endeavor, and even after it’s complete the converted text would have to be continually updated as the source documents change (which happens often).

If that’s the only solution then I may still go for it, as I think the value to the business of having all knowledge instantly searchable and then automatically summarized will be considerable.

You can ask ChatGPT to create SVGs and at some point in the past you could even trick it into embedding them as base64 images. Not sure if it still works since ChatGPT is unreachable for me currently.

More details:

https://www.reddit.com/r/ChatGPT/comments/zsnscy/i_asked_cha...
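For the base64 trick, the general mechanism is a data URI: the SVG markup is base64-encoded and placed inline where an image URL would go. A minimal sketch using only the Python standard library; `svg_to_data_uri` is a hypothetical helper, and whether ChatGPT's interface still renders such images is not something this shows:

```python
import base64


def svg_to_data_uri(svg):
    """Encode SVG markup as a base64 data URI usable as an image source."""
    encoded = base64.b64encode(svg.encode("utf-8")).decode("ascii")
    return f"data:image/svg+xml;base64,{encoded}"


svg = '<svg xmlns="http://www.w3.org/2000/svg" width="20" height="20"><circle cx="10" cy="10" r="8"/></svg>'
markdown_image = f"![circle]({svg_to_data_uri(svg)})"
```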

Adding diagrams as inputs is probably as easy as feeding in additional CLIP embeddings during training. The trick here will be how to get enough training data. Perhaps there are enough StackOverflow questions with images in the question. For output, you could also fine-tune some diffusion model on that data.

I’ve actually talked with ChatGPT and asked it both to output Mermaid diagrams of the architecture under discussion (the context was Kubernetes clusters, namespaces and Pods) and to read diagrams and convert them correctly into kubectl commands that build what the diagram shows.
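For anyone who hasn't seen Mermaid, here is roughly what that kind of diagram text looks like. This is a minimal sketch that generates it from a plain dict rather than from ChatGPT; `to_mermaid` and the cluster, namespace and pod names are made up for illustration:

```python
def to_mermaid(cluster, namespaces):
    """Render a cluster -> namespace -> pod hierarchy as Mermaid flowchart text."""
    lines = ["graph TD", f'    cluster["{cluster}"]']
    for ns_i, (ns, pods) in enumerate(namespaces.items()):
        ns_id = f"ns{ns_i}"  # Mermaid node ids must be simple identifiers
        lines.append(f'    cluster --> {ns_id}["namespace: {ns}"]')
        for pod_i, pod in enumerate(pods):
            lines.append(f'    {ns_id} --> {ns_id}pod{pod_i}["pod: {pod}"]')
    return "\n".join(lines)


print(to_mermaid("demo-cluster", {"default": ["web", "db"], "monitoring": ["metrics"]}))
```

Because the diagram is plain text, a text-only model can both emit it and read it back, which is probably why this works at all.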