by Skywing 5005 days ago
The company that I work for provides litigation services, such as distributed document conversion tools, review platforms and such. We've actually hosted data for reviewing attorneys of some of the larger cases over the past 15 years.

This Enron dataset is one of the standard data sets that we use to test the speed and resilience of our software.

I always liked the Enron data because the "smoking gun" terms were disguised using Star Wars terms, like "jedi" and "wookie". It does not look like this site has embedded email attachments indexed, so you may not see any interesting searches for these terms, but I did see a few questionable ones for "jedi". :)

This set also contains some of the most hilarious, typical interoffice humor emails that I've had the pleasure of debugging. I remember one day, while testing our distributed automated document conversion tool (basically, convert any document into a PDF, which is not a simple task when you think about all the possibilities), we noticed one of the workers had hung on a PowerPoint document. So the first thing I did was open the document, and it was basically a slideshow of porn images with embedded sound files. The audio files are what crashed the app, but when I opened it at the office the audio played loudly and my co-workers were like "wtf?". That was a hilarious moment.

2 comments

I also love the Enron data set for the way my antivirus software has its own little jamboree every time I extract the attachments from it.
Impressive! Did you guys compete in TREL?
Nope. I've never heard of it. I tried looking it up on Google but didn't come up with anything, either. What is TREL?