Analytics 101: Choosing the right database | HN Mirror

Y	Hacker News new \| ask \| show \| jobs

	Analytics 101: Choosing the right database (reflect.io)
	96 points by turoczy 3738 days ago

10 comments

greggyb 3738 days ago

Analytics 101: Choosing the right database is the wrong first step.

I was excited when I saw 'choosing the right data model' as one of the rules, but they are talking about the data models the DB uses internally. The important data model is choosing how you model the data you have to analyze. I have my biases, but I'd argue that a dimensional model would be a good starting if we're really at a 101 level and extensions to the model are for future classes/development.

Starting with the end in mind is very important. When I look at this, I think in terms of data culture. Who needs to be able to do what in your organization? What types of question will you need to answer most often? What types of questions will you need to support in an ad-hoc manner?

To many organizations "analytics" means arithmetic, but with complex filtering logic and business logic, traditional BI, essentially. To others, "analytics" means R code monkeys. To others it may mean specifically visualizations and the presentation layer. There are many interpretations of the word. Regardless of the interpretation, process and culture are more important to understand before the technology.

For a rough analogy, it's like saying "Software development 101: Choosing the right programming language". Sure it matters, but knowing what your software needs to support and what the primary use cases are are more important to understand.

Ninja edit: Grammar.

p4wnc6 3738 days ago

I agree with all of this and I think it also carries over to software design too: People tend to think about which design pattern, from some preconceived menu of design patterns, they will need, instead of applying common sense to how the business workflow will look with the desired outputs, and then working backwards from there, usually with some Occam's Razor sprinkled in to make sure you don't overdo it with design mumbo jumbo.

What this has taught me over time is that the best systems, whether for data modeling and data storage, or for software, are systems that go to great lengths to ensure it is extremely cheap and easy to reconfigure, redesign, scrap everything and try something new, and adapt your designs to the real life pain points you didn't anticipate.

I've been very frustrated in the last several years because you see so many people trying to shoehorn this sort of idea into so-called "Agile" methods, but those methods put the emphasis in the wrong place. They depict agility as a property of the team of humans beings, and the particular tasks and schedules of the humans involved. They don't do anything to improve the agility of the underlying software systems, and you can have the most Agile team ever but still get burnt by committing to a bad and hard-to-change design, even if you're generating gobs of story points or other nonsense.

One of the most important and powerful planning tools is to prototype your architecture. Begin doing actual work with it. "Beta test" it with a limited set of actual business users. Collect data, like you're profiling, and make these decisions with evidence about your actual use case, not trite analogies or generalizations of it.

And place a premium on tools that demonstrably make it easy to scrap an underperforming design and replace it with a different design.

threeseed 3738 days ago

Modelling your data dimensionally is becoming less and less popular in the analytics space.

It used to be the standard in data warehouses but now the trend is to leave the data unstructured and use query tools e.g Drill or do multiple ETL into structured versions. But even the structured versions would not be relationally modelled to any great extent. Data scientists typically want access to data right now not in a few months when your database guy has finishing modelling it.

greggyb 3738 days ago

Dimensional modeling, though often associated with a traditional waterfall methodology, is not tied to this delivery methodology.

It remains the most understandable model to the largest population of end users. Analytics is a very broad term, as I mentioned in my original post, and the audience is huge. If "analytics" to you implies an audience of primarily data savvy end users, then dimensional modeling may not hold as much value.

I tend to find that the data scientists at our clients still do a lot more data wrangling than data science when they don't have clean models to work with as a baseline.

Additionally, there's a big difference between exploratory analysis of new data sources where access and low latency are key, and well-known domains that have fairly predictable needs. The former have a habit of transforming into the latter. Dimensional models remain one of the most efficient physical structures of data for a read-dominant workload.

Long story short, I think it's worthwhile to both of us to look outside our bubbles. I work for a BI and data science consultancy. My focus specifically is in core BI workloads, and so I'm definitely overexposed to more traditional modeling techniques. I can guarantee you, though, that the cycle time for a usable pilot that includes a fully realized dimensional model is more on the order of a handful of weeks than months in a typical delivery.

What's your bubble?

paxcoder 3738 days ago

Any recommendations? Maybe start a flowchart for collaborative editing? Please link me.

sandstrom 3738 days ago

Surprised that RethinkDB[1] isn't mentioned. It has support for replication and sharding, plus a query language well suited for analytics.

(I'm not affiliated with them, just think they get proportionally little coverage given their interesting product)

[1] http://rethinkdb.com/

lobster_johnson 3738 days ago

Last I checked, RethinkDB had poor aggregation performance — which is key to doing analytics, unless you want to use it purely as a master data store and do aggregations elsewhere.

Some simple "group by" selects that would take ~3 seconds with Postgres or Elasticsearch would take several minutes with RethinkDB, unless it died of RAM starvage first. It looks to me like RethinkDB is not optimized for sequential read access, nor is its caching algorithm tuned to such workloads.

I believe it also lacks many of the aggregations that you'll want to use, like multi-level bucketing on different dimensions.

This was ~1 year ago, though, so it may have massive improved since then, who knows.

danielmewes 3738 days ago

Daniel @ RethinkDB here.

We're constantly improving performance and a lot has happened within the past year. I think that at this point RethinkDB is as good a database for analytics as many of the other general-purpose databases when it comes to features and performance.

From what I can tell, there are still two main limitations that apply in some, but not all scenarios:

* Grouping big results without an associated aggregation requires the full result to fit into RAM. I believe this was the limitation that you ran into a year ago, which lead to RAM exhaustion. This limitation is still there ( https://github.com/rethinkdb/rethinkdb/issues/2719 in our issue tracker). However we're shipping a new command `fold` with the upcoming 2.3 release of RethinkDB, which can be used in the vast majority of cases to perform streaming grouped operations (in conjunction with a matching index). See https://github.com/rethinkdb/rethinkdb/issues/3736 for details.

* Scanning data sets that don't fit into memory on rotational disks is still inefficient. Most SQL databases deploy sophisticated optimizations to structure their disk layout in order to minimize the effects of high seek times. RethinkDB's disk layout it built with a stronger focus on SSDs. This limitation hence doesn't apply if the data is stored on SSDs.

sandstrom 3738 days ago

Focusing on SSDs seems reasonable.

Rotational disks are on their way out. For example, Samsung recently introduced a 15TB SSD[1], able to compete even with the largest rotational disks.

[1] https://news.samsung.com/global/samsung-now-introducing-worl...

lobster_johnson 3737 days ago

That's great to hear, thanks. I'll have to do another test.

hackerboos 3738 days ago

I love the dashboard that comes out of the box [1] - it's very polished.

[1] - https://rethinkdb.com/docs/administration-tools/

solidr53 3738 days ago

Came here for this. Rethinkdb is awesome, changefeed analytics, I'm in!

threeseed 3738 days ago

RethinkDB isn't popular in analytics space because to my knowledge there isn't any Spark or Hadoop integration. That's a deal breaker right there.

gtrubetskoy 3738 days ago

Actually, you might want to not choose any database at all, but instead focus on deciding on the data format, such as Parquet (http://parquet.io) or Avro (https://avro.apache.org/), etc. Many of the tools such as Hive, Impala, Spark, etc. support these formats natively.

You will also need to think about the schema, partitioning, compression and other parameters, and those are not trivial decisions.

threeseed 3738 days ago

The data format is important. ORC/Parquet being substantially faster then Text or Sequence files.

But the query engines are far more important in terms of performance. Just spend any time with SparkSQL and then Hive and you'll know what I mean.

pookeh 3738 days ago

Surprised http://druid.io/ wasn't mentioned. This db was made specifically for both real time analytics and batch analytics. It even has a nice front end http://imply.io/

pella 3737 days ago

NEW: Scalable & Open Source PostgreSQL extension https://www.citusdata.com ( based on PG9.4 / PG9.5 )

Github: https://github.com/citusdata/citus

HN: https://news.ycombinator.com/item?id=11353322 "Citus Unforks from PostgreSQL, Goes Open Source (citusdata.com)" ( 24th March, 2016 )

"What is Citus?

- Open-source PostgreSQL extension (not a fork)

- Scalable across multiple hosts through sharding and replication

- Distributed engine for query parallelization

- Highly available in the face of host failures "

"Citus provides users real-time responsiveness over large datasets, most commonly seen in rapidly growing event systems or with time series data . Common uses include powering real-time analytic dashboards, exploratory queries on events as they happen, session analytics, and large data set archival and reporting." https://www.citusdata.com/blog/17-ozgun-erdogan/403-citus-un...

hbcondo714 3738 days ago

This is by no means a comprehensive list of databases and that's not the intent of this article. The real intent is that it's a simple read for many companies still running solely on 'general purpose databases' and showing where newer database technologies can fit in based on their data needs. Upvote.

cdeshpande 3738 days ago

What about ElasticSearch. Even though its search engine, its growing in popularity as schemaless JSON data store

lobster_johnson 3738 days ago

Surprised Elasticsearch isn't mentioned in the article.

Unlike several of the databases mentioned, it has a data model particularly appropriate for analytics: While only apparently schemaless, its schema is extensible (no need to pre-declare it), and by default every column is indexed. Which means that there's no extra work on the client to assert the existence of indexes for new fields.

More importantly, it does complex, nested, distributed aggregations (top-K, date histograms, etc.) out of the box, and is incredibly fast at it, owing to the columnar-store-like Lucene index model. You can do complicated aggregations across millions of values over several dimensions in milliseconds.

Elasticsearch has consistency issues, though, and even with 2.x and the recent translog support you should probably never use it as a primary data store.

Some of the other databases mentioned (Cassandra, Riak and so on) are useful mostly as primary datastores that get processed into something that can do aggregations. For example, Cassandra -> Elasticsearch is probably a great combo.

threeseed 3738 days ago

ElasticSearch is very nice as a JSON data store and has great integration with Spark/Hadoop.

The only issue is that its write performance isn't great and there have historically been questions about how to get official support. It's definitely got a bright future ahead of it.

Xeoncross 3738 days ago

I've always wanted a super-compact database for storing integers on smaller setups where I don't have the resources to run a dedicated logging server.

I can represent almost everything as an int. Like time, cpu usage, line number, etc. Even just a single byte is enough for most things like which server number or custom error was thrown.

fweespee_ch 3738 days ago

Please, please don't set your text color to #888 or #999 on a white background.

I don't want to have to edit your CSS just so I can read the text.

dizzystar 3738 days ago

I use the blacken plugin on my browser, which automatically converts all text to black. I know this isn't your point, but it is much better than altering CSS to fix the web.

Otherwise, I'd just assume the designers disrespect all of us with poor vision and take it for granted the writing is done with equal care. Rather than ranting about it, I'd rather not think about it.

With that said, for some reason, blacken doesn't automatically convert this site to black, meaning I have to manually convert the text to black. Even after the author fixed the color, it is still impossible for me to read. Please don't override user-assisted plugins. Being on the fort page of HN, I'm assuming this is a decent article, but I have no desire to read it now. thanks.

bilmeswe 3738 days ago

We're going to darken the text on our next push. Thanks for pointing it out.

fweespee_ch 3738 days ago

The problem with blacken is on certain color schemes you don't want black text.

bilmeswe 3738 days ago

We'll make an adjustment. Thanks for the feedback.

Raphmedia 3738 days ago

FYI, your Contrast Ratio is 2.8:1. Aim for 4.5:1

Try using #4c4c4c

You can use this tool to test. http://webaim.org/resources/contrastchecker/

A bad contrast ratio will always look very bad on older screen, making text hard to read. It is also hard to read by people with eyes issues.

"WCAG 2.0 level AA requires a contrast ratio of 4.5:1 for normal text and 3:1 for large text (14 point and bold or larger, or 18 point or larger). Level AAA requires a contrast ratio of 7:1 for normal text and 4.5:1 for large text."

paoloiam 3738 days ago

Turoczy, thank you for sharing! Great insights.