by arbitrandomuser 915 days ago
I wonder if they knew that they could get HTML versions of the paper just by changing the link from ...arxiv.. to ar5iv..
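
For anyone who wants to script that link change, here is a minimal sketch. It assumes the input is an ordinary `arxiv.org` link and that swapping the host for `ar5iv.org` is all that is needed. The function name `to_ar5iv` and the paper ID in the usage line are illustrative, not from the thread.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ar5iv(url: str) -> str:
    """Swap an arxiv.org host for ar5iv.org, leaving the rest of the URL alone."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Assumption: only plain arxiv.org hosts are rewritten; anything else is rejected.
    if host not in ("arxiv.org", "www.arxiv.org"):
        raise ValueError(f"not an arxiv.org URL: {url}")
    return urlunsplit(parts._replace(netloc="ar5iv.org"))


# Example with an arbitrary paper ID.
print(to_ar5iv("https://arxiv.org/abs/1706.03762"))
```

Not every paper converts cleanly to HTML, so a real pipeline would probably still want a fallback.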

I did try that at first, but it was hard to parse the HTML, organize it into logical sections (authors, references, abstract), and then clean up the text to prepare it for chunking and embedding. Once I found GROBID, I went with that route because it handled all of that for me.
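
To illustrate why the GROBID route is less work, here is a rough sketch of pulling sections out of its output. It assumes the paper has already been run through GROBID and that the result is TEI XML in the `http://www.tei-c.org/ns/1.0` namespace. The names `split_tei` and `text_of` are hypothetical helpers, and the sketch only covers the abstract and body sections, not authors or references.

```python
import xml.etree.ElementTree as ET

# Assumption: GROBID's TEI output uses the standard TEI namespace.
TEI = {"tei": "http://www.tei-c.org/ns/1.0"}


def text_of(el) -> str:
    """Flatten an element to plain text and collapse runs of whitespace."""
    return " ".join("".join(el.itertext()).split())


def split_tei(tei_xml: str) -> dict:
    """Return the abstract and the body sections, ready for chunking."""
    root = ET.fromstring(tei_xml)
    abstract = root.find(".//tei:profileDesc/tei:abstract", TEI)
    sections = []
    # Each top-level <div> in the body is treated as one logical section.
    for div in root.findall(".//tei:text/tei:body/tei:div", TEI):
        head = div.find("tei:head", TEI)
        sections.append({
            "heading": text_of(head) if head is not None else "",
            "paragraphs": [text_of(p) for p in div.findall("tei:p", TEI)],
        })
    return {
        "abstract": text_of(abstract) if abstract is not None else "",
        "sections": sections,
    }
```

Each paragraph comes back as clean text under its heading, which is roughly the shape you want before chunking and embedding.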