by andrewlu0 1 day ago
nice - did you write a custom parser for PDF/DOCX? we wrote one for XLSX after running into event loop issues with SheetJS
2 comments

Using lopdf[1] for PDF parsing, rtf-parser[2] for RTF, calamine[3] for XLSX, and I'm sure you know that DOCX/PPTX/etc. is basically just a zip file of XML + text. The LLM cares about textual data (which just gets moderately cleaned up post-extraction), so I (thankfully) didn't have to deal with rendering. But showing a preview or end-result to a user would be a huge plus, so I can see myself using your library.

[1] https://github.com/J-F-Liu/lopdf

[2] https://github.com/d0rianb/rtf-parser

[3] https://github.com/tafia/calamine
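
To make the "zip file of XML" point concrete, here is a rough sketch of text-only DOCX extraction using just the Python standard library. It is not the code from the project above (that one is in Rust); the function names `docx_text` and `make_sample_docx` are made up for illustration, and the sample archive contains only `word/document.xml`, so it is enough for the demo but not a complete DOCX that Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(data: bytes) -> str:
    """Return the text of a DOCX, one output line per <w:p> paragraph."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across <w:t> elements inside its runs.
        runs = [t.text or "" for t in p.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_sample_docx(paragraphs: list[str]) -> bytes:
    """Hypothetical helper: build a minimal in-memory archive for the demo."""
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", xml)
    return buf.getvalue()


if __name__ == "__main__":
    print(docx_text(make_sample_docx(["Hello", "World"])))
```

This ignores headers, footers, footnotes, and anything nested (a paragraph inside a table cell or text box may be picked up more than once), which is fine if the text is only going to an LLM but not if you need layout.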

What about rendering? That's the hard part.
we built a library @extend-ai/react-xlsx on top of it that renders the parsed contents onto a canvas

testing was mostly manual with a test corpus we generated. it's not perfect but it's pretty close for most files we've seen

For me, rendering was just a nice-to-have.
Sorry, I meant to ask the author of Extend UI, not you.
First of all, thanks for the great library. It is so much more than a UI kit!

We wrote (should say are writing) our own xlsx parser in Rust on IronCalc:

https://github.com/ironcalc/IronCalc/tree/main/xlsx