| HN Mirror


by ebishop 5555 days ago

Have you considered implementing some sort of script to scan some of the large biological databases and add links/metadata for the datasets they contain?

Looking at what's in CKAN now, it seems that it's mostly datasets that are a bit more easily understood than most of the biological data that's out there, but at the same time indexing and accessing biological data is a HUGE problem for researchers in this field.

There are currently some big databases, such as the data behind the UCSC Genome Browser (genome.ucsc.edu/downloads.html) and all sorts of expression/small RNA data available from GEO (Gene Expression Omnibus, http://www.ncbi.nlm.nih.gov/geo/), plus lots of slightly more esoteric databases like flybase.org, which specializes in fruit fly data.

Truly doing a proper job of indexing/classifying all of this is a close-to-impossible task (and in many cases requires specialized knowledge), but there are an absurd number of publicly available biological datasets out there. If you wanted to rapidly expand the number of entries you have, you could use a script to index one or two of the big databases like GEO and fill in the metadata from what they already have.
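To make the "fill in the metadata from what they already have" idea a bit more concrete, here is a rough Python sketch of the mapping step only. The shape of the input record (keys like "accession", "title", "summary", "organism") is made up for illustration, and the output keys (name, title, notes, url, tags, extras) are my assumption about what a CKAN package entry wants, so treat it as a starting point rather than how CKAN or GEO actually expect to be used. It does no downloading; fetching the records from GEO would be a separate step.

```python
import re

# GEO accession pages follow this URL pattern.
GEO_ACCESSION_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={acc}"


def slugify(text):
    """Lowercase the text and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def geo_record_to_package(record):
    """Map one GEO-style record to a CKAN-style package dict.

    The input dict shape is hypothetical; the output keys are an
    assumption about the CKAN package schema.
    """
    acc = record["accession"]
    tags = ["geo", "biology"]
    organism = record.get("organism")
    if organism:
        # Use the organism as an extra tag, e.g. "drosophila-melanogaster".
        tags.append(slugify(organism))
    return {
        "name": "geo-" + acc.lower(),
        "title": record.get("title", acc),
        "notes": record.get("summary", ""),
        "url": GEO_ACCESSION_URL.format(acc=acc),
        "tags": [{"name": t} for t in tags],
        "extras": [
            {"key": "source", "value": "GEO"},
            {"key": "accession", "value": acc},
        ],
    }
```

Keeping the mapping separate from the fetching means the same function could be reused for other sources by swapping the URL pattern and the default tags.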

Of course, I can also understand why you might prefer to have the majority of the datasets in your site be the sort of thing most people (or at least, non-biologists) can interpret vs. something that's highly specialized like this. Not to mention, keeping up with all the new data, and properly filling in all the metadata could be a real can of worms.

1 comment

kindly 5553 days ago

Sorry for the late reply. Sadly, I do not understand the concerns of this field very well. There are many very large datasets referenced on CKAN, mainly links to huge triple stores. There are also many biological datasets, e.g. flybase, as mentioned. These triple stores are too big to do any decent dynamic linking against, which is a big shame.

If you get the opportunity, could you repost this to ckan-discuss@lists.okfn.org? There are people on that list who understand these issues far better than I do, and they would love to hear from anyone interested.
