I am currently evaluating whether to launch a simple yet effective tool that converts your data into a well-structured JSON. If there is a need for this type of tool, I can develop it.
I'm getting AI to spit out stuff in YAML, because it has trouble doing it in JSON (long story). So we'd get something like the following:
----
- action: web
- url: https://news.ycombinator.com
- text: this is the link to hacker news
But AI being what it is, it can be inconsistent. Sometimes it spits out several actions, sometimes it breaks text apart or uses bullet points, each of which gets interpreted as a separate line. Sometimes it hallucinates, like `- web: https://...`
Also most clients are designed to handle JSON perfectly and they have trouble with parsing YAML. And what if there are colons in other lines?
So a YAML to JSON tool would be nice. And better if there's some error correction. Currently, I'm manually writing the error correction, and sometimes it corrects things that aren't errors.
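For what it's worth, here is a minimal sketch of the kind of lenient conversion I mean, in plain Python with no YAML library. The key set (`action`, `url`, `text`) comes from the example above; the `KNOWN_ACTIONS` list and the rule that `- web: <url>` means "action web with that url" are my assumptions, not something the model guarantees. It only treats a line as `key: value` when the key is one the schema knows, so colons inside URLs or free text are left alone, which should cut down on "corrections" of things that aren't errors.

```python
import json

# Assumed schema, taken from the example output above.
KNOWN_KEYS = {"action", "url", "text"}
# Assumed list of valid actions; extend as needed.
KNOWN_ACTIONS = {"web"}


def parse_actions(raw: str) -> list[dict]:
    """Leniently turn '- key: value' lines into a list of action dicts."""
    actions: list[dict] = []
    last_key = None
    for line in raw.splitlines():
        stripped = line.strip()
        # Skip blank lines and separator lines made only of dashes.
        if not stripped or set(stripped) == {"-"}:
            continue
        # Drop a leading bullet dash if present.
        body = stripped[1:].strip() if stripped.startswith("-") else stripped
        # Split on the first ": " only, so "https://..." stays intact.
        key, sep, value = body.partition(": ")
        key = key.strip()
        if sep and key in KNOWN_ACTIONS:
            # Hallucinated shorthand such as "- web: <url>" (assumed meaning).
            actions.append({"action": key, "url": value.strip()})
            last_key = "url"
        elif sep and key in KNOWN_KEYS:
            # A new "action" key starts a new object.
            if key == "action" or not actions:
                actions.append({})
            actions[-1][key] = value.strip()
            last_key = key
        elif actions and last_key:
            # Unknown key or no key at all: treat as a continuation
            # of the previous value (broken-up text, stray bullets).
            actions[-1][last_key] += " " + body
        else:
            raise ValueError(f"cannot interpret line: {line!r}")
    return actions


if __name__ == "__main__":
    sample = (
        "----\n"
        "- action: web\n"
        "- url: https://news.ycombinator.com\n"
        "- text: this is the link to hacker news\n"
    )
    print(json.dumps(parse_actions(sample), indent=2))
```

The design choice is to whitelist keys rather than guess at structure: anything that doesn't match the schema is folded into the previous value instead of being "fixed".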
However, knowing AI, it could be possible that AIs would output JSON perfectly fine next week. I know OpenAI Assistants can do this to an extent and Gemini seems to be doing it okay.
Yes, I'm planning to use Gemini for this context because of the 1M context length, but 15 reqs per min seems to be a little off :(
However, my implementation (not entirely mine btw) will ensure that whatever the user gets as an output is a valid & type-safe JSON response; otherwise, it will throw an error.
You could do this with ease with an LLM and function calling. What would you be doing differently? If you can help non technical ppl then maybe but I’d argue anyone using JSON is likely technical :)
I’ve built a few of these with OpenAI function calling and it seems to do a pretty good job. I have yet to convert those over to open source models but doable.
Try it out and see what ppl think. The one thing you have to keep in mind is privacy around data. If you can somehow make this work such that user privacy is not an issue then you might be good. Lots of companies are hesitant to process their data outside of their servers but again depends on data classification policies.
Once you build it let me know I can help you test it out.
Converting tab-separated data into JSON would be useful if you can specify some grammar to skip unwanted markup or identify the HTML/CSS/JS denotation of the table in question. The output could be an array or a keyed type; transformation afterward is easy.
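For the plain tab-separated case (no markup to skip), a rough sketch with Python's standard `csv` module shows both shapes, array and keyed. The function names and the sample columns are made up for illustration, and all values stay strings since nothing here infers types.

```python
import csv
import io
import json


def tsv_to_records(tsv_text: str) -> list[dict]:
    """Array form: one dict per row, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


def key_by(records: list[dict], field: str) -> dict:
    """Keyed form: index the records by the value of one column."""
    return {rec[field]: rec for rec in records}


if __name__ == "__main__":
    sample = "id\tname\n1\tAda\n2\tLinus\n"  # hypothetical input
    records = tsv_to_records(sample)
    print(json.dumps(records, indent=2))
    print(json.dumps(key_by(records, "id"), indent=2))
```

Going from the array form to the keyed form is a one-liner, which is the "transformation afterward is easy" part.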
Maybe follow how jq and things like beautiful soup work?
In another comment you were saying you'd use an LLM to do this. If that is your solution, I'd say that a good percentage of people on HN can do this themselves - what do you have in mind for the tool that would make it a better solution vs. our own engineering?
Yes, it will require an LLM to convert your unstructured data into a type-safe JSON format. It will be a very simple hobby project that gets you a JSON response as fast as possible, so you don't have to write your own implementation or similar stuff. I plan to reduce this friction because JSON is used in so many places. If it works, I think it would save devs some time to focus on other parts of their projects :)
I do. It's great to see that you would make a tool for this use case. I'm intending to store my data in Convex. So if the output is a JSON file, then I'd convert that JSON file to JSONL, and then run a script to insert that JSONL as rows in Convex.
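The JSON to JSONL step is small enough to sketch here. This assumes the JSON file holds either a single object or an array of objects; the function name is just for illustration, and the insert into Convex is left out since that depends on your own script.

```python
import json


def json_to_jsonl(json_text: str) -> str:
    """Convert a JSON object or array of objects to JSONL (one object per line)."""
    data = json.loads(json_text)
    # A single object becomes a one-line JSONL file.
    rows = data if isinstance(data, list) else [data]
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


if __name__ == "__main__":
    print(json_to_jsonl('[{"a": 1}, {"a": 2}]'), end="")
```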
There are plenty of tools (e.g. convertcsv.com, convertjson.com) that convert structured data, but you are talking about unstructured data. It would be useful assuming the JSON had a predictable structure.
Thanks for asking. Here I'm taking sample input data from a blog post that might look like this, together with the format I've specified below. I'll feed both to an LLM and get back a JSON response, which will be type-safe and validated at the backend.
{
  "data": "Title: The Effects of Sleep on Memory, Authors: Dr. Jane Doe, Dr. John Smith, Publication Date: 2022-09-15, Journal: Journal of Neuroscience, Abstract: This study explores how sleep influences memory consolidation. Our research indicates that participants who had a full night's sleep performed better on memory tests compared to those who did not. Keywords: sleep, memory, neuroscience, cognition",
  "format": {
    "title": { "type": "string" },
    "authors": {
      "type": "array",
      "items": { "type": "string" }
    },
    "publicationDate": { "type": "string" },
    "journal": { "type": "string" },
    "abstract": { "type": "string" },
    "keywords": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}
The input could also be HTML, a general text document, an article PDF, whatever.
The expected output would be this:
{
  "title": "The Effects of Sleep on Memory",
  "authors": [
    "Dr. Jane Doe",
    "Dr. John Smith"
  ],
  "publicationDate": "2022-09-15",
  "journal": "Journal of Neuroscience",
  "abstract": "This study explores how sleep influences memory consolidation. Our research indicates that participants who had a full night's sleep performed better on memory tests compared to those who did not.",
  "keywords": [
    "sleep",
    "memory",
    "neuroscience",
    "cognition"
  ]
}
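To give an idea of the "validate or throw" step, here is a rough sketch in Python of checking an LLM response against a `format` spec like the one above. This is not the actual backend code: the function names are hypothetical, and it only understands the `type` and `items` keywords used in the example (plus `number` and `boolean`, which I'm assuming would be wanted).

```python
import json

# Supported "type" names mapped to Python types (assumed subset).
TYPE_MAP = {
    "string": str,
    "array": list,
    "number": (int, float),
    "boolean": bool,
}


def validate(value, spec: dict, path: str) -> None:
    """Raise TypeError if value does not match spec; path is used in messages."""
    type_name = spec["type"]
    expected = TYPE_MAP.get(type_name)
    if expected is None:
        raise ValueError(f"{path}: unsupported type {type_name!r}")
    # bool is a subclass of int in Python, so reject it for "number".
    if type_name == "number" and isinstance(value, bool):
        raise TypeError(f"{path}: expected number, got boolean")
    if not isinstance(value, expected):
        raise TypeError(f"{path}: expected {type_name}, got {type(value).__name__}")
    if type_name == "array" and "items" in spec:
        for i, item in enumerate(value):
            validate(item, spec["items"], f"{path}[{i}]")


def parse_and_validate(llm_output: str, fmt: dict) -> dict:
    """Parse the raw LLM text as JSON and check every field in fmt."""
    data = json.loads(llm_output)  # raises json.JSONDecodeError if not valid JSON
    if not isinstance(data, dict):
        raise TypeError("$: expected a JSON object at the top level")
    missing = sorted(fmt.keys() - data.keys())
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for key, spec in fmt.items():
        validate(data[key], spec, f"$.{key}")
    return data


if __name__ == "__main__":
    fmt = {
        "title": {"type": "string"},
        "keywords": {"type": "array", "items": {"type": "string"}},
    }
    raw = '{"title": "The Effects of Sleep on Memory", "keywords": ["sleep", "memory"]}'
    print(parse_and_validate(raw, fmt))
```

Anything the model returns either comes back as a checked dict or raises, so the caller never sees a half-valid response.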