by msherry 4067 days ago
It's true that the SEC makes this data available free of charge, but it's typically not useful without a lot of extra processing. XBRL (http://en.wikipedia.org/wiki/XBRL) is a standard of sorts, but there seems to be no enforcement of how it's used across different companies' filings (or even across multiple filings from the same company), and many critical pieces of information are filed under company-specific extensions rather than standard tags. It's definitely a pain working with the data in any meaningful way.
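To make the extension problem concrete, here is a minimal sketch that groups the facts in an XBRL instance by XML namespace and flags the ones that fall outside the standard taxonomies. The sample document, the `exco` prefix, and the `WidgetSubscriptionRevenue` tag are invented for illustration, and the list of "standard" namespace prefixes is an assumption, not an authoritative registry.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, hand-written instance fragment (not taken from a real filing).
SAMPLE = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
            xmlns:us-gaap="http://fasb.org/us-gaap/2014-01-31"
            xmlns:exco="http://www.example.com/20140927">
  <us-gaap:Revenues contextRef="FY2014" unitRef="USD">1000000000</us-gaap:Revenues>
  <us-gaap:NetIncomeLoss contextRef="FY2014" unitRef="USD">-25000000</us-gaap:NetIncomeLoss>
  <exco:WidgetSubscriptionRevenue contextRef="FY2014" unitRef="USD">400000000</exco:WidgetSubscriptionRevenue>
</xbrli:xbrl>
"""

# Assumption: namespaces starting with these prefixes are "standard".
STANDARD_PREFIXES = (
    "http://www.xbrl.org/",
    "http://fasb.org/",
    "http://xbrl.sec.gov/",
)


def is_extension(namespace):
    """Return True if the namespace is not one of the assumed standard ones."""
    return not namespace.startswith(STANDARD_PREFIXES)


def group_facts_by_namespace(xml_text):
    """Map each namespace URI to a list of (local tag name, text) pairs."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for elem in root:
        # ElementTree spells qualified names as "{namespace}LocalName".
        if not isinstance(elem.tag, str) or not elem.tag.startswith("{"):
            continue
        namespace, local = elem.tag[1:].split("}", 1)
        groups[namespace].append((local, (elem.text or "").strip()))
    return dict(groups)


for namespace, facts in group_facts_by_namespace(SAMPLE).items():
    label = "extension" if is_extension(namespace) else "standard"
    for name, value in facts:
        print(label, name, value)
```

Anything labeled "extension" here is a tag that a generic consumer cannot interpret without company-specific knowledge, which is where most of the extra processing goes.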

FWIW, www.downside.com had a hard time reading the latest filings for AAPL, the only one I tried. Visiting http://www.downside.com/cgi/testfinancialsextract.cgi?url=ht... threw an error "ERROR: SGML Parse error: EXCEPTION reading from net: The filing is bigger than the maximum allowed size of 5000000 bytes. at EDGAR/netutil.pm line 75."


Right. I haven't updated that in years, and HTML is more bloated than it used to be. The Downside financial extractor, written around 2000, is totally obsolete. It doesn't understand XBRL; it reads the human-readable tables, tries to make sense of them, and attempts to create an early version of XBRL from them. That hasn't worked well since HTML stopped using <table> for tables. I'd looked into building a better system years ago, but there's a patent on extracting data from financial tables where the sign of items is ambiguous and it's not obvious which lines add up to which totals. ("Net loss" in a human-readable table may be expressed as either a positive or a negative number.)

A more modern system is deep inside sitetruth.com; if SiteTruth can find the business behind a web site and tie it to an SEC filer, a button will appear to access the SEC filings for that company.

Getting SEC filing data is an absolute nightmare. Every time I think of a project that includes SEC filing data (executive names/ages, MD&A text analysis, etc.) I skip it and move on to something that's just as interesting but less time-consuming and more doable. There doesn't appear to be a scalable solution.

What's the problem finding executive names and ages? Get the SEC index for a CIK, pull the latest DEF 14A form[1], and start parsing the tables. Build a 2D data structure for each table. Look for tables that have column headings including "Name" and "Age". Then back up from the start of the table to the previous heading that's not associated with a previous table, and look for keywords in the heading such as "Director(s)" or "Executives".
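The table-matching step above can be sketched roughly as follows, using only the Python standard library. This is a simplified illustration under several assumptions: the sample HTML and the names in it are made up, nested tables are not handled, and the step of backing up to the preceding heading is left out. `TableCollector` and `find_people_tables` are hypothetical helper names, not part of any existing tool.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect each <table> as a 2D list: rows of whitespace-normalized cells.

    Simplification: nested tables are not supported.
    """

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._table = None  # rows of the table being built
        self._row = None    # cells of the row being built
        self._cell = None   # text fragments of the cell being built

    def _flush_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._flush_cell()  # tolerate a missing closing cell tag
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag == "tr" and self._row is not None:
            self._flush_cell()
            self._table.append(self._row)
            self._row = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = None


def find_people_tables(html):
    """Return (header_row, data_rows) for tables with Name and Age columns."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    matches = []
    for table in parser.tables:
        for i, row in enumerate(table):
            if {"name", "age"} <= {cell.lower() for cell in row}:
                matches.append((row, table[i + 1:]))
                break
    return matches


# Made-up sample; real proxy statements are far messier.
SAMPLE = """
<h2>Executive Officers</h2>
<table>
  <tr><th>Name</th><th>Age</th><th>Position</th></tr>
  <tr><td>Jane Doe</td><td>52</td><td>Chief Executive Officer</td></tr>
  <tr><td>John Roe</td><td>47</td><td>Chief Financial Officer</td></tr>
</table>
<table>
  <tr><th>Fee</th><th>Amount</th></tr>
  <tr><td>Audit</td><td>100</td></tr>
</table>
"""

for header, rows in find_people_tables(SAMPLE):
    lowered = [h.lower() for h in header]
    name_col, age_col = lowered.index("name"), lowered.index("age")
    for row in rows:
        if len(row) > max(name_col, age_col):  # skip short or spacer rows
            print(row[name_col], row[age_col])
```

Matching on the header text rather than on CSS classes or element positions is what lets the same code work across filers with different markup.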

It's tougher when the filer tries to be cool and doesn't use tables for tabular data.[2] Then you have to figure out which <div> items are line breaks and which aren't. Fortunately, the SEC doesn't let you put JavaScript or off-site CSS in a filing; it all has to be in one document.

Yes, dumb scraping techniques like looking for CSS class names won't help, but it's not really that hard.

[1] http://www.sec.gov/Archives/edgar/data/1288776/0001308179140...
[2] http://www.sec.gov/Archives/edgar/data/1326801/0001326801140...

You just justified the existence of OP's API with your explanation.

Their system isn't capable of extracting complex info such as executive names and ages, which is what the requestor wanted. The API only does the easy stuff, returning fields from XML.

Edgar Online was sort of a data troll. They bought FreeEdgar to make them go away. After the SEC put up their own search engine, Edgar Online was mostly unnecessary, and it was sold to RR Donnelley. There's also "secinfo.com", which someone runs as a spare-time activity and does about as much as Edgar Online. There's no need for a pay service to get this free data.