HN Mirror


	by ZeroCool2u 1793 days ago
	I've worked extensively with 10-Ks and 8-Ks for NLP. It's extremely arduous work and a clear winner in terms of profitable ideas, so kudos to the team on the launch; this is really impressive. Perhaps this gives away a bit too much of the secret sauce, but I'd love it if you could talk a bit about how you handle the wild disparities in document structure. Do you parse the XBRL?

piesauce 1793 days ago

Thanks for the kind words! We don't use XBRL at all. We did try it initially, but it was wildly inconsistent across companies. I think one thing that worked well for us was spending a lot of time on the initial stages of the pipeline (efficient sentence and word tokenization, span detection), which boded well for our models later on.
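
For readers unfamiliar with the stages named above, here is a minimal sketch of sentence and word tokenization that keeps character spans. It is an illustration only and not the pipeline described in this comment: the regexes, the `Span` type, and the function names `sentence_spans` and `word_spans` are all assumptions, and a real filing would need far more careful boundary rules (abbreviations, tables, footnotes).

```python
import re
from typing import List, Tuple

# (start, end, text) with offsets into the original document.
# Hypothetical type, introduced only for this sketch.
Span = Tuple[int, int, str]

# Naive assumption: a sentence ends at ., ! or ? followed by
# whitespace and an uppercase letter. "12.5%" is not split because
# the period is not followed by whitespace.
_SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# Tokens, tried in order: money/number/percent, word, single punctuation mark.
_WORD = re.compile(
    r"\$?\d+(?:,\d{3})*(?:\.\d+)?%?"   # $1,200.50  12.5%  2020
    r"|\w+(?:['-]\w+)*"                # words, incl. hyphens and apostrophes
    r"|[^\w\s]"                        # any other non-space character
)


def sentence_spans(text: str) -> List[Span]:
    """Split text into sentences, keeping character offsets."""
    spans: List[Span] = []
    start = 0
    for match in _SENT_BOUNDARY.finditer(text):
        end = match.start()
        if end > start:
            spans.append((start, end, text[start:end]))
        start = match.end()
    tail_end = len(text.rstrip())
    if tail_end > start:
        spans.append((start, tail_end, text[start:tail_end]))
    return spans


def word_spans(sentence: str, offset: int = 0) -> List[Span]:
    """Tokenize one sentence; offset maps spans back to the document."""
    return [
        (offset + m.start(), offset + m.end(), m.group())
        for m in _WORD.finditer(sentence)
    ]


if __name__ == "__main__":
    doc = "Revenue rose 12.5% to $1,200.50 million. Net income fell."
    for s_start, _, s_text in sentence_spans(doc):
        print([tok for _, _, tok in word_spans(s_text, s_start)])
```

Keeping offsets at every stage means any later span detection can point back to exact positions in the raw text, which matters when the input is a large blob of plain text rather than structured markup.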

ZeroCool2u 1793 days ago

Thanks! This is similar to where I ended up landing as well. It turns out using a non-standardized standard format is practically worse than dealing with giant blobs of plain text!

kbennatti 1793 days ago

So true
