

	by proddata 1703 days ago
	That article takes various concepts from typical TSDB solutions and seemingly only looks at the bad sides. Time series data has many different forms, not every form works for every TSDB solution. For the 3 caveats at the top, there are already two TS solutions that look promising (QuestDB, TimescaleDB). Often an operational analytics DB (Clickhouse, CrateDB) might also be a solution.

5 comments

akulkarni 1703 days ago

(TimescaleDB co-founder)

Thanks for the mention, and I completely agree :-)

Personally, I think a lot in this article is misguided.

For example, it essentially defines "time-series database" as "metric store." As TimescaleDB users know, TimescaleDB handles a lot more than just metrics. In fact, we handle any of the data types that Postgres can handle, which I suspect is more than what Honeycomb's custom store supports.

  TSDBs are good at what they do, but high cardinality is not built into the design. The wrong tag (or simply having too many tags) leads to a combinatorial explosion of storage requirements.

This is a broad generalization. Some time-series databases are better at high cardinality than others. Also, what is "high-cardinality" - 100K? 1M? 10M? (We in fact are designed for _higher cardinalities_ than most other time-series databases [0])
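To make the "combinatorial explosion" in the quoted passage concrete, here is a minimal sketch of the arithmetic. It assumes the worst case, where every combination of tag values actually occurs; the tag names and counts are made up for illustration and are not taken from any of the products discussed.

```python
from math import prod

def series_cardinality(tag_value_counts):
    """Upper bound on distinct series: the product of distinct values per tag."""
    return prod(tag_value_counts.values())

# Hypothetical tag set: 100 hosts, 5 regions, 10 status codes.
tags = {"host": 100, "region": 5, "status_code": 10}
base = series_cardinality(tags)

# Adding a single tag with 1,000 distinct values multiplies the bound by 1,000.
with_user = series_cardinality({**tags, "user_id": 1000})

print(base, with_user)
```

Whether 5,000 or 5,000,000 series counts as "high" depends on the database, which is the point of the reply above.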

  In contrast, our distributed column store optimizes for storing raw, high-cardinality data from which you can derive contextual traces. This design is flexible and performant enough that we can support metrics and tracing using the same backend. The same cannot be said of time series databases, though, which are hyper-specialized for a specific type of data.

We just launched tracing and metrics support in the same backend - in Promscale, built on TimescaleDB [1]

I do commend the folks at Honeycomb for having a good product loved by some of my colleagues (at other companies). I also commend them for attempting to write an article aimed to educate. But I wish they had done more research - because without it, this article (IMO) ends up confusing more than educating.

For anyone curious on our definition of "time-series data" and "time-series databases": https://blog.timescale.com/blog/what-the-heck-is-time-series...

[0] https://blog.timescale.com/blog/what-is-high-cardinality-how...

[1] https://blog.timescale.com/blog/what-are-traces-and-how-sql-...

ignoramous 1703 days ago

How does timescale (a single-purpose database) hold up against single-store (a multi-purpose database)? Of course, timescale is cheaper, but other than that, have you folks compared / contrasted it against single-store as a TSDB?

PS https://www.timescale.com/papers/timescaledb.pdf is 404

akulkarni 1703 days ago

TimescaleDB performs quite well. One of our unique insights is that it is quite possible to build a best-in-class time-series database on top of Postgres (although it’s not easy ;-)

Here is one benchmark: https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-...

There are some challenges with building on Postgres - but what we’ve been able to do is build innovative capabilities that overcome these challenges (e.g. columnar compression in a row-oriented store, multi-node scale out).

We also have some exciting things that we are announcing this week. Stay tuned :-)

PS - Where did you find that PDF? Thought we took it down (it was hard to keep it up to date :-) )

ignoramous 1702 days ago

Thanks.

Re: paper: I stumbled upon it when going through other timescaledb threads on news.yc, specifically here, https://news.ycombinator.com/item?id=13943939 (5 yrs ago)

eska 1703 days ago

I had a serious case of deja vu reminding me of your article on compression in timescaledb :-D

akulkarni 1703 days ago

Thanks for reading that article :-)

oconnore 1703 days ago

> Often an operational analytics DB (Clickhouse, CrateDB) might also be a solution

This might be a bit off topic, but speaking of gaps in common observability tooling: is an OLAP database a common go-to for longer-timescale analytics (as in [1])? We're using BigQuery, but on ~600GB of log/event data I start hitting memory limits even with fairly small analytical windows.

In this context I have seen other references to: Sawzall (google), Lingo (google), MapReduce/Pig/Cascading/Scalding. Are people using Spark for this sort of thing now? Perhaps a combined workflow would be ideal: filter/group/extract interesting data in Hadoop/Spark, and then load into OLAP for ad-hoc querying?

[1]: https://danluu.com/metrics-analytics/

proddata 1703 days ago

> is an OLAP database a common go-to for longer-timescale analytics (as in [1])?

I would not consider Clickhouse or CrateDB "classic" OLAP DBs. I can speak for CrateDB (I work there), that it definitely would be able to handle 600GB and query across it in an ad-hoc manner.

We have users ingesting terabytes of events per day and running aggregations across 100 terabytes.

nicoburns 1703 days ago

What kind of hardware requirements would be needed to store and query this much data?

proddata 1703 days ago

Depends. Just inserting, indexing, storing and simple querying can be done with little memory (i.e. a 1:500 memory-to-disk ratio, 0.5GB RAM per 1TB disk). Typical production clusters with high query load are in the 1:150 range (i.e. 64GB RAM for 10TB disk).

Otherwise typical general purpose hardware (Standard SSDs, 1:4 vCPU:memory ratios, ...)

nicoburns 1703 days ago

Interesting, so that'd be about 1 vCPU and 4GB RAM per 625GB of data. That seems very price efficient. Would something like AWS's EBS be sufficient for this? Would you need one of the higher tiers? Or would you be looking at running this on a box with locally attached storage?

proddata 1702 days ago

Most CrateDB clusters run on cloud providers' hardware (Azure, AWS, Alibaba). Using EBS (GP2 or now GP3) is also quite common. Due to the indexing / storage engine, gp disks are typically sufficient and faster disks have little to no advantage.

claytonjy 1703 days ago

Wouldn't 0.5GB RAM per 1TB disk be more like 1:2000 memory-disk-ratio? Which is even better!

proddata 1702 days ago

Sorry, I mixed up the numbers: it's 2GB memory (0.5GB heap) per 1TB disk. So 1:500 is correct.
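For anyone following the ratio arithmetic in this subthread, here is a quick sketch that reproduces the figures quoted above. It assumes decimal units (1TB = 1000GB); the function name is made up for illustration.

```python
GB_PER_TB = 1000  # assumption: decimal units, as the thread's round numbers suggest

def memory_disk_ratio(ram_gb, disk_tb):
    """Return N such that the memory:disk ratio is 1:N."""
    return disk_tb * GB_PER_TB / ram_gb

first_figure = memory_disk_ratio(0.5, 1)  # the 0.5GB-per-1TB figure first quoted
corrected = memory_disk_ratio(2, 1)       # the corrected 2GB-per-1TB figure
production = memory_disk_ratio(64, 10)    # 64GB RAM for 10TB disk

# With a 1:4 vCPU:memory ratio, each vCPU comes with 4GB RAM.
disk_gb_per_vcpu = 4 * production

print(first_figure, corrected, production, disk_gb_per_vcpu)
```

So 0.5GB per 1TB would indeed be 1:2000, 2GB per 1TB gives the stated 1:500, and 64GB for 10TB is about 1:156, which is where the "625GB per vCPU" estimate comes from.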

jpgvm 1703 days ago

For longer-scale timeseries I still recommend Druid as the go-to. Mainly because if you make use of its ahead-of-time aggregations (which you can do for real-time or scale-out batch ingestion) then your ad-hoc queries can execute extremely quickly even over very large datasets.
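As a rough illustration of why ahead-of-time aggregation helps, here is a minimal sketch of the general idea in plain Python: events are collapsed at ingestion into one row per time bucket and dimension value, so later queries scan far fewer rows. This is not Druid's API or storage format; the event shape (epoch seconds, page, bytes) and the function name are assumptions made for the example.

```python
from collections import defaultdict

def rollup(events, granularity_seconds=3600):
    """Pre-aggregate (epoch_seconds, page, num_bytes) events into
    one (count, total_bytes) row per (time bucket, page)."""
    totals = defaultdict(lambda: [0, 0])
    for epoch_seconds, page, num_bytes in events:
        # Truncate the timestamp to the start of its bucket.
        bucket = epoch_seconds - epoch_seconds % granularity_seconds
        row = totals[(bucket, page)]
        row[0] += 1
        row[1] += num_bytes
    return {key: tuple(value) for key, value in totals.items()}

# Hypothetical raw events: four rows in, three rows out at hourly granularity.
events = [
    (3600, "/home", 100),
    (3700, "/home", 250),
    (3900, "/docs", 80),
    (7300, "/home", 40),
]
rolled = rollup(events)
print(rolled)
```

The trade-off is that detail below the chosen granularity is gone once the rows are merged, so the bucket size and dimensions have to be picked with the expected queries in mind.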

Druid only really has 1 downside, which is that it's still a bit of a pain to set up. It's gotten a ton better in recent times and I have been contributing changes to make it work better out of the box with common big data tooling like Avro.

For performance it's the top dog except for really naive queries that are dominated by scan performance. For those you are best off with Clickhouse; its vectorized query engine is extremely fast for simpler/scan-heavy workloads.

shaklee3 1703 days ago

We used ClickHouse on about 80TB in a RAID 10 setup. It was extremely fast.

alfiedotwtf 1703 days ago

What are the best books out there to learn about Time Series databases? (there are already a million for relational and graph, but haven't seen one for time series). Bonus on how to implement one

dominotw 1703 days ago

CMU did a time-series lecture series a couple of years ago:

https://www.youtube.com/playlist?list=PLSE8ODhjZXjY0GMWN4X8F...

Things have changed a little bit now, but not much.

alfiedotwtf 1701 days ago

Excellent, thanks

jpgvm 1703 days ago

If you want something like Honeycomb that scales better, then maybe look at Druid.

dikei 1703 days ago

Last time I checked, Druid was not very good at ad-hoc tasks because it lacked joins and its SQL support was sketchy. How is it now?

jpgvm 1703 days ago

Limited JOIN support. SQL is now very good.

JOINs vs no JOINs isn't an adhoc vs not-adhoc thing but more of a schema thing. If you try to jam a star schema into it you aren't going to have a good time. This is true for pretty much all of these more optimised stores. If you have a star schema and want to do those sorts of queries (and performance or cost aren't your #1 driving factors) then the better tool is a traditional data warehouse like BigQuery.

This probably won't be the case forever though, there is significant progress in the Presto/Trino world to enable push-down for stores like Druid which would allow you to source most of your fact table data from other sources and then join into your events/time-orientated data from Druid very efficiently.

camel_gopher 1703 days ago

Take a look at IronDB too. It's a high-scale distributed implementation.