Hacker News
by bpchaps 2754 days ago
A little late to the comment party, but I was wondering the same thing. I'm working on a web-scraping workflow that currently uses Tika, and I'm very interested to see how well this does in comparison.

I was quite surprised by how powerful and flexible Tika can be, even though my use case was pretty basic: crawling a network drive to index project artifacts like Office docs and media files and pushing them into an Elasticsearch index.

Have you found any major problems or shortcomings in your usage?

One small problem is that it sometimes doesn't preserve newline separators properly, so text from adjacent lines gets fused together. In my use case, I was extracting email addresses from web scrapes, and some email addresses would come out as "blah@blah.comRandomWord".
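
For anyone hitting the same fused-text issue, here is a rough Python sketch of one possible workaround on the regex side. It is not something Tika provides; the pattern names and the `extract_emails` helper are made up for illustration. The heuristic assumes the top-level domain is lowercase and that the word glued onto the address starts with a capital letter, so it won't help when the fused word is lowercase, and it will miss addresses with an uppercase TLD.

```python
import re

# Naive pattern: the TLD accepts any letters, so a fused trailing word is swallowed.
NAIVE_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Heuristic pattern (assumption: lowercase TLD, fused word starts with a capital).
# The match stops at the first character that is not a lowercase letter.
HEURISTIC_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[a-z]{2,}")


def extract_emails(text, pattern=HEURISTIC_EMAIL):
    """Return every substring of `text` that matches `pattern`."""
    return pattern.findall(text)


if __name__ == "__main__":
    # Sample text mimicking extraction output where a newline was lost.
    sample = "Contact: blah@blah.comRandomWord for details"
    print(extract_emails(sample, NAIVE_EMAIL))
    print(extract_emails(sample))
```

On the sample string, the naive pattern keeps the trailing "RandomWord" while the heuristic one cuts the match off after ".com". Checking the TLD against a list of known TLDs would be more robust, at the cost of maintaining that list.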