


	by alexanderdaw 3764 days ago
	1. Stream your data into Kafka as flat JSON objects.
	2. Consume your Kafka feeds with a Camus MapReduce job (Camus is a library from LinkedIn that writes the data out to HDFS directories).
	3. Transform the HDFS directories into usable folders for each vertical you're interested in; think of each output directory as an individual table or database.
	4. Use Hive to create an "external table" that references the transformed directories. Ideally your transformation job will create mergeable hourly partition directories. Importantly, you will want to use the JSON SerDe in your Hive configuration.
	5. Generate your reports using Hive queries.

	This architecture will get you to massive scale and is pretty resilient to traffic spikes because of the Kafka buffer. I would avoid Mongo / MySQL like the plague in this case. A lot of designs focus on the real-time aspect for data like this, but if you take a hard look at what you really need, it's batch MapReduce on a massive scale, on a dependable schedule, with linear growth metrics. With an architecture like this deployed to AWS EMR (or even Kinesis / S3 / EMR) you could grow for years. Forget about the trendy systems and go for the dependable tool chains for big data.
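To make step 3 a bit more concrete, here is a rough sketch of how a transformation job might decide which hourly partition directory each flat JSON event belongs to. It is only an illustration, not how Camus or Hive do it: the field names `vertical` and `ts` (epoch seconds, UTC) and the `dt=.../hour=...` directory layout are assumptions made for this example, and a real job would run as MapReduce over HDFS rather than over an in-memory list.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def partition_path(vertical, ts_seconds):
    """Build an hourly partition directory name for one vertical.

    The dt=/hour= layout is an assumed convention for this sketch.
    """
    t = datetime.fromtimestamp(ts_seconds, tz=timezone.utc)
    return f"{vertical}/dt={t:%Y-%m-%d}/hour={t:%H}"


def group_by_partition(json_lines):
    """Group flat JSON event lines by their target partition directory."""
    groups = defaultdict(list)
    for line in json_lines:
        event = json.loads(line)
        # "vertical" and "ts" are hypothetical field names.
        path = partition_path(event["vertical"], event["ts"])
        groups[path].append(line)
    return dict(groups)


# Three made-up events: two verticals, two different hours.
events = [
    '{"vertical": "clicks", "ts": 1420070400, "user": "a"}',
    '{"vertical": "clicks", "ts": 1420074000, "user": "b"}',
    '{"vertical": "views", "ts": 1420070400, "user": "a"}',
]

for path, lines in sorted(group_by_partition(events).items()):
    print(path, len(lines))
```

Because each hour lands in its own directory, a late or re-run job only has to rewrite the affected hour, and each directory can then be registered as one partition of the external table.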