Hacker News
by threedots 2051 days ago
I run a business in this space. Realistically, your chances of having data that is useful to a hedge fund (of any kind) are pretty low, so I wouldn't bank on it as a revenue source unless you have a strong reason to believe (1) your data is predictive of something an investor cares about and (2) it isn't already covered by other data.

There are also ways to reliably find and curate this data for trading firms and asset managers. But that usually requires a somewhat uncommon blend of skills, like statistics, web scraping, reverse engineering, or some domain expertise that gives you an edge in (legally) finding and using nonpublic data. Scrappiness also helps a lot.
Out of curiosity, what resolution and time scale are useful? Is it fair to assume that most hedge funds are relatively good at tracking recent information and the value is in older archives that are hard to collect?

Also, are event streams and large connected event graphs, like what Forge.AI sells, actually useful?

I'm curious because my PhD research potentially has applications for information extraction and event linking, but I'm not entirely sure whether those applications are actually valuable.

> Is it fair to assume that most hedge funds are relatively good at tracking recent information and the value is in older archives that are hard to collect?

No. Like he said, the most valuable data has a signal that is independent of other data sets, which generally means something proprietary and almost always without a long history. A cleaned archive of standard data with a long history is valuable, but that market is already pretty well served. It would be hard to compete with CRSP.
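
As a rough illustration of what "independent of other data sets" can mean in practice, here is a minimal sketch: regress the new signal on a series the fund already has, then check whether the leftover part still correlates with the thing an investor cares about. All numbers and names below are made up for illustration, and a one-variable linear fit is a simplifying assumption, not how any particular fund evaluates data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def residualize(signal, existing):
    """Remove the part of `signal` explained by a linear fit on `existing`."""
    n = len(signal)
    ms = sum(signal) / n
    me = sum(existing) / n
    beta = sum((e - me) * (s - ms) for e, s in zip(existing, signal)) / sum(
        (e - me) ** 2 for e in existing
    )
    alpha = ms - beta * me
    return [s - (alpha + beta * e) for s, e in zip(signal, existing)]


# Hypothetical toy series, invented for this example.
existing = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]        # data the buyer already has
new_signal = [1.1, 2.3, 2.8, 4.4, 4.9, 6.2]      # the data set being pitched
future_returns = [0.2, 0.1, 0.4, 0.3, 0.6, 0.5]  # what the investor cares about

overlap = pearson(new_signal, existing)          # how much is already covered
leftover = residualize(new_signal, existing)     # the part that is actually new
incremental = pearson(leftover, future_returns)  # is the new part predictive?

print(f"overlap with existing data: {overlap:.2f}")
print(f"correlation of leftover with returns: {incremental:.2f}")
```

A high overlap with a weak leftover correlation suggests the data mostly repeats what is already available; the reverse is closer to what the comment above describes as valuable.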

I’m curious about the types of data that are useful, what goes into them, and what the curators of that data might make from them.
Could I send you a sample?

I’m curious about certain data I’m scraping for a project.

If you shoot me an email I can tell you very quickly if it's viable, and direct you to specific people at firms who would buy it. I probably don't need a sample if you give an honest description of it.
Is it better to provide the raw data, or to instead provide some interesting statistics from aggregating or running a model on the data?
Usually hedge funds prefer the raw data so their own research teams can do modeling and analysis.
Once they see the raw data they'll have a good idea how to get it. Why don't they just set up their own scraper at that point, instead of paying me?
Because scraping data sucks, occasionally has compliance concerns, and is a different core competency from trading. They would rather offload all of the bullshit involved in maintaining a robust scraping operation than pay their research team to do it.

Time spent on maintaining a scraping operation is time taken away from optimizing your ETL process and producing actionable research for your trading team. You know how people pay to have their pipes unclogged even when they know how it's done? Same idea.

If all it is is some data scraped off a few websites that they could get an intern to collect in a week or three, then it's unlikely to be valuable enough for them to pay you a substantial sum of money.

The most valuable data is data that is difficult to gather. Think things like proprietary (i.e. unpublished) industry data. The canonical "sexy" alternative data set sold to hedge funds is counts of cars in retail parking lots from satellite photos.

I disseminate real-time transaction information from blockchain mempools (BTC, ETH) and flag any transactions that create large state updates. Is this useful information for any hedge funds in the crypto space?
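
A minimal sketch of what that kind of flagging could look like, assuming pending transactions have already been parsed into simple records. The `PendingTx` type, its field names, and the threshold values are hypothetical, and using transferred value as a stand-in for "large state update" is a simplification, not a description of any real node API or of the actual service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingTx:
    """Hypothetical record for an unconfirmed transaction seen in a mempool."""
    tx_hash: str
    chain: str    # "BTC" or "ETH"
    value: float  # amount moved, in the chain's native unit


# Arbitrary illustrative cutoffs, in native units per chain.
THRESHOLDS = {"BTC": 100.0, "ETH": 1000.0}


def flag_large(txs, thresholds=THRESHOLDS):
    """Return the pending transactions at or above their chain's threshold."""
    return [
        tx for tx in txs
        if tx.chain in thresholds and tx.value >= thresholds[tx.chain]
    ]


# Made-up sample data.
sample = [
    PendingTx("aa01", "BTC", 0.5),
    PendingTx("bb02", "BTC", 250.0),
    PendingTx("cc03", "ETH", 999.9),
    PendingTx("dd04", "ETH", 1500.0),
]

flagged = flag_large(sample)
print([tx.tx_hash for tx in flagged])
```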
I'm trying to get my credit union to let users opt in to selling their anonymized checking account transactions in bulk to a hedge fund. Having that much point-of-sale data would be huge.
Uh, no, please don't do that. Individual decisions that seem harmless can be incredibly destructive in the aggregate, and this is one of them.
For those who need an example, a great reminder is the Target data mining story at https://www.forbes.com/sites/kashmirhill/2012/02/16/how-targ...
I'm not saying you're wrong, but I am curious what negative trends you believe will emerge from someone actioning this data.
Would you mind elaborating on the sorts of harms you're thinking of?