by alok-g 4783 days ago
>> I recently started work on an attempt to improve the classification of English vocabulary by grade level.

I would be interested in this. Let me know if you plan to open this. What data sources are you using?


I'll provide an API for it, won't be ready for months though. It's not a big priority yet -- part of a larger project.

The data sources aren't that interesting. After trying for a while to find something already pre-compiled, I quit and resorted to Googling for phrases like, "9th grade spelling list", and aggregating the data from the results by hand. There are a bunch of sites for teachers and home educators and the like that include tables of vocabulary for various grades. It's tedious, but it works.

Sorry if this sounds sticky, but since you have already done this, could you please share what you have? This sounds useful to me.
A bit behind schedule, but here it is:

http://www.shomisearch.com/api/vocab/grade/8/

http://www.shomisearch.com/api/vocab/grade+sources/9/

...etc. The two lists available for now are "grade" and "grade+sources"; the first list will return the vocabulary list for that grade, the second will return the vocabulary list plus the sites that the data was pulled from.

Valid grade levels are "pre-k", "k" (or "kindergarten" if you like), "1" thru "12", and "college" (or "collegiate").
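
For anyone scripting against this, here is a minimal sketch of building the request URLs from the list names and grade labels described above. The function name `vocab_url` and the choice to fold the aliases ("kindergarten", "collegiate") into a single canonical label are assumptions for illustration, not part of the API itself; only the URL pattern comes from the two example links.

```python
BASE_URL = "http://www.shomisearch.com/api/vocab"

# The two list names mentioned in the comment above.
VALID_LISTS = {"grade", "grade+sources"}

# Grade labels from the comment: "pre-k", "k", "1" thru "12", "college".
VALID_GRADES = {"pre-k", "k", "college"} | {str(n) for n in range(1, 13)}

# Assumption: aliases are normalized client-side to one canonical label.
ALIASES = {"kindergarten": "k", "collegiate": "college"}


def vocab_url(grade, list_name="grade"):
    """Build a vocab API URL (hypothetical helper, not part of the API)."""
    label = str(grade).strip().lower()
    label = ALIASES.get(label, label)
    if label not in VALID_GRADES:
        raise ValueError(f"unknown grade level: {grade!r}")
    if list_name not in VALID_LISTS:
        raise ValueError(f"unknown list: {list_name!r}")
    # The "+" in "grade+sources" is kept literal, as in the example URL.
    return f"{BASE_URL}/{list_name}/{label}/"


if __name__ == "__main__":
    print(vocab_url(8))
    print(vocab_url("9", "grade+sources"))
```

The helper only builds the string; fetching it is left to whatever HTTP client you prefer, since the response is plain text.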

Currently it just returns results as text/plain with no bells or whistles.

On my to-do list for this: documentation at /api, JSON result formatting, more lists & list options, and the ability to POST some text to the vocab API and get back the median & mode of the grade values for the text.
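
Until the POST endpoint exists, the median & mode idea can be approximated locally. This is only a rough sketch: `grade_stats` and the `word_grades` dictionary are hypothetical names, the tokenizer is deliberately crude, and it assumes grade labels have already been mapped to numbers (how "pre-k", "k", and "college" map to numbers is left open).

```python
import re
from statistics import median, multimode


def grade_stats(text, word_grades):
    """Return the median and mode(s) of grade values for known words in text.

    word_grades is a hypothetical {word: numeric grade} mapping built from
    the downloaded lists. Words not in the mapping are ignored.
    """
    words = re.findall(r"[a-z']+", text.lower())
    grades = [word_grades[w] for w in words if w in word_grades]
    if not grades:
        return None  # nothing in the text matched the vocabulary
    # multimode returns a list because several grades can tie for most common.
    return {"median": median(grades), "mode": multimode(grades)}


if __name__ == "__main__":
    sample = {"cat": 1, "dog": 1, "photosynthesis": 7}
    print(grade_stats("The cat and the dog study photosynthesis.", sample))
```

Note that repeated words count once per occurrence here; counting each distinct word once would be an equally reasonable choice.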

I don't intend to do any of that right away though, since this all just started out as a planned feature for a larger project.

There are currently ~21,000 entries in the database. Lots of duplicates and disagreements on grade levels for words, as expected.
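
The comment doesn't say how duplicates and disagreements are handled, so the following is just one possible approach, not what the database actually does: group entries by word, take the low median of the reported grades, and flag words whose sources disagree. The `(word, grade)` tuple layout and the name `consolidate` are assumptions.

```python
from collections import defaultdict
from statistics import median_low


def consolidate(entries):
    """Collapse duplicate (word, grade) entries into one record per word.

    Hypothetical helper: picks the low median so the result is always a
    grade some source actually reported, and flags disagreements.
    """
    by_word = defaultdict(list)
    for word, grade in entries:
        by_word[word.lower()].append(grade)
    return {
        word: {"grade": median_low(grades), "disagreement": len(set(grades)) > 1}
        for word, grades in by_word.items()
    }


if __name__ == "__main__":
    rows = [("Abate", 9), ("abate", 10), ("abate", 9), ("cat", 1)]
    print(consolidate(rows))
```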

If there's anything else you think is important enough for me to get on right away, let me know.

Thanks. This helps. This was a great idea. I would like to know what your larger project is when you are ready.
A natural-language search engine (hence my earlier comments on NLP). I've been using it as a news reader for a while now. Should be able to open invites for the reader part in a month or so.

The crawler collects lots of metadata from content. The vocabulary engine is part of a planned down-the-road feature that will provide additional metadata for crawled content, as well as eventually help users find other users that write comments they want to read. (It has an "anti-social" aspect planned, where user interaction will be allowed on reader content, but the software will encourage users to form loose-knit groups of around 20 or so.)

I think the number one problem of social networks right now is that they try to grow without restraint and force hundreds (or thousands) of people to all interact. But humans aren't wired like that; we don't do that well. What does seem to work well is LiveJournal-style communities, or the BB communities, where people get partitioned off into smaller groups by common interests, with other people crossing between interest groups.

I'm not super excited about the community stuff though. The NLP work has been a blast so far, and the interpreter I wrote seems to be working out well. I took a somewhat naive, very clean approach to NLP, and I think it'll support up to a few thousand different types of metadata (I can't even imagine that yet) and at least as many different phrases. I have a little more work to do on the reader, mostly front-end, and after that I'll start working on providing direct access to the search engine behind the reader. (The reader isn't an RSS reader; it's a search engine interface that lets you use the results from searches as news feeds. Right now, for example, I have a "front page of HN" news feed. Reddit content is also being crawled; I just need to edit the parser to accept a query like, "front page of HN and r/technology and r/startups".) Eventually I'll mess around with the community part.

So, for example, users with an account on the reader (and, later, the search engine) might have articles closer to their own reading level get a mild rankings boost.

OK. I'll have a basic version of the API available later today (Saturday). It will provide the word lists and sources. I need to sleep now.