by ses425500000 437 days ago
Thanks for the insightful comment! You’re absolutely right — organizing extracted data into a coherent, semantically meaningful structure is critical for high-quality ML training.

Right now, the pipeline focuses on generating OCR outputs optimized for ML models by cleaning, deduplicating, and segmenting content across modalities (text, tables, figures, formulas). For diagrams and tables, we add semantic tags and preserve layout relationships to aid downstream modeling.
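
As a rough illustration of the clean / deduplicate / segment steps described above, here is a minimal sketch. It is not the project's actual code: the `(modality, text)` input shape and the function names `clean` and `dedupe_and_segment` are assumptions made up for this example, and it only catches exact duplicates after whitespace and case normalization.

```python
import hashlib
import re
from collections import defaultdict


def clean(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def dedupe_and_segment(blocks):
    """Group OCR blocks by modality, dropping empty and duplicate blocks.

    `blocks` is a hypothetical list of (modality, raw_text) pairs.
    Two blocks count as duplicates when they share a modality and their
    cleaned text matches case-insensitively; the first one seen is kept.
    """
    seen = set()
    by_modality = defaultdict(list)
    for modality, raw in blocks:
        text = clean(raw)
        if not text:
            continue  # nothing left after cleaning
        key = hashlib.sha256(
            f"{modality}\x00{text.lower()}".encode("utf-8")
        ).hexdigest()
        if key in seen:
            continue  # duplicate of an earlier block
        seen.add(key)
        by_modality[modality].append(text)
    return dict(by_modality)


blocks = [
    ("text", "Results  are shown\nin Table 1."),
    ("text", "results are shown in Table 1."),
    ("table", "year | value"),
    ("formula", "E = mc^2"),
    ("text", "   "),
]
segments = dedupe_and_segment(blocks)
```

A real pipeline would likely need near-duplicate detection and layout-aware segmentation on top of this; the sketch only shows the general shape of the step.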

I’m planning to add a semantic structuring module that goes beyond basic layout analysis — something that builds hierarchical, MECE-style representations and identifies entity relationships across sections. That’s absolutely the next frontier, and I really appreciate you pointing it out.

Thanks again for the thoughtful feedback!

why are you using an LLM to reply to every comment?
Haha good catch! I’m 19 and from Korea, so I’ve been using an LLM to help with replies since my English isn’t perfect yet. But I designed and built the project myself (with help from some open models/tools) — just wanted to communicate more clearly with the community!
[Hi from Argentina!] LLM have a particular style that will make people suspictious or even angry.

One possibility is to write the answer in Korean and use autotranslation. (And post only the autotranslation.) Double check the technical terms, because autotranslation sometimes chooses the wrong synonym.

Another possibility is to write the answer in English inside Gmail, and Gmail will highlight spelling and grammar errors, so you can fix them.

Most people here will tolerate a few mistakes if the answer has your own personal style.

(Nice project, by the way.)

Yes, writing that is suspictious makes me angry.
>> suspitious

:( My phone does not have orthography correction, and I didn't have my notebook.

Edit: fixed typo: gave -> have

For that very reason, an LLM would have worked perfectly for you: laying out your thoughts just as you intended, but without the distractions caused by poor spelling or grammatical mistakes. LLMs are tools—as you well know—that are already essential and will become even more so over time. The fact that some people on this platform get irritated by their use just means they’ll eventually become the dinosaurs of the future.

Genuinely curious—could it be for the same reason you used a keyboard to write that comment? It’s efficient, it works. What’s the actual issue with using a tool that helps convey the intended message more clearly and quickly, as long as it reflects what he wanted to say?
why are you offended on behalf of this person? the hindsight that they're simply an English learner obviously makes me feel bad for asking the question and i completely understand the use case, but i don't think it was unreasonable to think that someone who speaks entirely in ChatGPT paragraphs might be a bot, spammer, or the like—particularly because, in a botnet fashion, the original reply was to a comment that also seemed to be LLM-authored
I wasn't offended at all. I was just genuinely curious, because I keep coming across this assumption that if any text is well-crafted, it must have come from an LLM. I think I understand why: we've grown so used to reading sloppy writing, everything from barely coherent text messages to articles in reputable publications filled with typos and awkward phrasing.

Personally, I've always held myself to a high standard in how I write, even in text messages. Some might see that as bordering on perfectionism, but for me, it's about respecting the principle behind communication: to be as clear and correct as possible.

Now that we have tools that help ensure that clarity, or at the very least, reduce distractions caused by grammar or spelling mistakes, of course I'm going to use them. I used to agonize over my comments on Twitter because you couldn't edit them after posting. I would first write them elsewhere and review them several times for any errors before finally posting. For context: I'm a retired 69-year-old physician, and even after witnessing decades of technological advancement, I'm still in awe of what this new technology can do.

Yes, I love beautiful, natural writing. I'm a voracious reader of the great classics. I regularly immerse myself in Shakespeare, Hardy, Eliot, Dickens, Dostoyevsky, Austen, Tolstoy, and many other literary masters. But I also fully embrace this tool that can elevate even the clumsiest writer's work to a clarity we've never had access to before. If that comes at the cost of a bit of stylistic uniformity, that's a reasonable trade-off. It's up to the user to shape the output, review it, and make sure their own voice and ideas shine through.

Back to your original point, I truly wasn't offended on his behalf. I was just curious. As it turns out, he was using an LLM, because his native language is Korean. Good for him. And just to be clear, I didn't intend to make your question seem inappropriate or to embarrass him in any way. If it came across that way, I apologize.