by sbarre 2754 days ago
I was quite surprised by how powerful and flexible Tika can be, and my use case was pretty basic: crawling a network drive to index project artifacts like Office docs and media files, then pushing them into an Elasticsearch index.

Have you found any major problems or shortcomings in your usage?

1 comment

One small problem is that it sometimes doesn't preserve line breaks properly. In my use case, I was extracting email addresses from web scrapes, and some addresses would come out fused to the next word, e.g. "blah@blah.comRandomWord".
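
For anyone hitting the same thing, here is a rough Python sketch of how the fused text trips up a naive regex, plus one possible cleanup heuristic. Nothing here is a Tika API: `NAIVE_EMAIL`, `KNOWN_TLDS`, and `trim_to_known_tld` are made-up names, and the TLD list is a tiny placeholder. It assumes the fused word starts with a capital letter, which will not always hold.

```python
import re

# Naive pattern: the final [A-Za-z]+ swallows every letter after the last dot,
# so a word glued onto the address becomes part of the "TLD".
NAIVE_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]+")

# Hypothetical allow-list for illustration; a real list would be much longer.
KNOWN_TLDS = ("com", "org", "net", "edu", "io")


def trim_to_known_tld(candidate: str) -> str:
    """Cut a fused match back to a known TLD when an uppercase letter follows it.

    Heuristic only: it assumes the glued-on word starts with a capital letter.
    """
    head, _, tail = candidate.rpartition(".")
    for tld in KNOWN_TLDS:
        if tail[: len(tld)].lower() != tld:
            continue
        rest = tail[len(tld):]
        # Accept an exact TLD, or a TLD followed by a capitalized word.
        if rest == "" or rest[0].isupper():
            return f"{head}.{tld}"
    # No known TLD recognized: leave the match untouched.
    return candidate


def extract_emails(text: str) -> list[str]:
    """Find candidate addresses and trim any word fused onto the end."""
    return [trim_to_known_tld(m) for m in NAIVE_EMAIL.findall(text)]


if __name__ == "__main__":
    sample = "Contact: blah@blah.comRandomWord and ok@example.org"
    print(NAIVE_EMAIL.findall(sample))
    print(extract_emails(sample))
```

The uppercase check is there so that a genuine longer TLD (say, `.community`) isn't chopped down to `.com`. It does nothing for a fused lowercase word, so treat it as a stopgap rather than a fix for the missing newlines.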