by Thaxll 3031 days ago
Weird, did they try to use https://www.scylladb.com/?
4 comments

Why throw away something proven to run at massive scale, that you understand and trust for something that's new, has never been run at that scale, and you have no experience running? If you have a team of software engineers, and the latency problem is a software problem, fix the software problem.

When you already know Cassandra, and you already know RocksDB, and you already have an engineering team, it makes far more sense to combine the two things you know how to use at scale than to try to use some new thing NOBODY has run at scale.

> some new thing NOBODY has run at scale

Outbrain uses ScyllaDB in production at scale across multiple data centers. Not sure if it's Instagram scale, but still enough to prove its reliability and performance.

https://www.outbrain.com/techblog/2016/08/scylladb-poc-not-s...

7 hosts in that poc, that is not "at scale"
Scylla can handle 10-100x the load of Cassandra on the same servers. Scale is more than just the number of hosts.
Data density is a thing. If you're putting 10 TB on a C* host, switching to Scylla doesn’t fix the issues that putting 1 PB of data on a host would involve (i.e., backing that up). Throughput numbers measured on 100 MB of data in marketing benchmarks are rarely relevant.
Ok, but that's a different issue and nobody is suggesting 1PB of data on a single node as a good idea. The comment was that "scale" is more than just a simple count of nodes. Even if you keep the data the same size, Scylla can handle it with much better performance which is a good enough reason for many to use it.
Since that poc they're using Scylla on hundreds of machines... that's just an old post.
I was going to say the same thing. It seems pretty clear at this point that Java is not a good programming language to build a database on if you care about strong 99% latency guarantees. The engineers in the article came to this conclusion and so did the Scylla people years ago.

Scylla is AGPL for the OSS version though so testing it out would not be an option without getting a commercial license first.

> Scylla is AGPL for the OSS version though so testing it out would not be an option without getting a commercial license first.

Huh? The AGPL is not a non-commercial-use-only license.

If you have proprietary software that you would like to combine with AGPL code (i.e., not interact with as a service) and that is available to the general public over the Internet, and you want to keep your code proprietary, sure, you may not want to use the AGPL. But you could say the same thing about proprietary software you want to combine with GPL code and sell to the general public.

If you're either using the software through its existing defined public interfaces, or you're okay releasing anything you modify or link into the software, the AGPL (and the GPL) are fine. Lots of people distribute proprietary products that include GPL code, like Chromebooks, Android phones, routers, GitHub Enterprise, etc. We figured out years ago that the Linux kernel is not just a non-commercial product. Why are we having the same misconceptions about the AGPL?

The exact interpretation varies from company to company. Some companies take the strict stance of "if you use this library in any way in your application, you must open source your entire application." I've found that some libraries explicitly state that requirement within their FAQs for their community/free edition as opposed to their commercially (and paid) licensed equivalent.

At the end of the day, it's not worth risking yourself (or your company) when the owners of the library claim a software license works a certain way and you disagree. Sure, you might be right and you might even prevail in court, but the potential legal fees usually aren't worth the trouble.

I ran into this issue when I was selecting a library to generate PDFs for my internship over the summer: https://itextpdf.com/AGPL

FWIW, I’m pretty sure Facebook also uses MongoDB heavily (or at least does through its acquisition of Parse). I don’t know if they have a commercial license or not, but they aren’t strangers to the AGPL.
MongoDB also makes their interpretation of the AGPL pretty well known. Unfortunately, it’s at odds with the established interpretation of the normal GPL.

Requiring GPL/AGPL software as a dependency, even if you don’t link to it but instead talk to it over the network, does not mean you haven’t developed a derivative work in terms of the letter and spirit of the license. This is in part why I steer 100% clear of MongoDB: there’s nothing stopping them from changing their view on the license and deciding, down the road, to pursue legal action against people who use it in non-AGPL-compatible ways.

I think you answered the question yourself [1].

The wording is ambiguous, and as far as I know no court case has yet defined what constitutes a connection between the end user and the software, or whether transitive connections count. If it's ambiguous to a software developer, then corporate lawyers are definitely going to say no.

[1] https://news.ycombinator.com/item?id=16523858

EDIT: I realize that AGPL is valid for commercial use but since its terms are so onerous, especially once the lawyers get involved, it effectively makes the AGPL unusable in a larger corporation.

There are those who've deployed on Java with tight latency requirements: https://martinfowler.com/articles/lmax.html?t=1319912579 - Benchmarked at around 6 million transactions/second.

The issue isn't so much Java the language, as it is being aware of the GC, and developing with it in mind.

That removes the value proposition of Java though which is that you don't need to worry about memory management. If you need to mentally track every implicit allocation and deallocation in Java then you are essentially writing code in a kneecapped version of C++.
Even in a database, most of the code isn't performance sensitive. Making your life a bit harder in the fast path so it's easier in the slow path is at least a tradeoff worth considering.
Your fast path is still crippled by your slow paths' mess. It doesn't matter if you've isolated and optimized those paths in isolation, to the GC there's just Your Process and it's going to suspend Your Process whenever it wants for however long it needs regardless of what's currently happening.

So if you're latency sensitive then all of your code needs to be aggressive at avoiding object creation. All of your code becomes part of "the fast path", even if it's in a different thread.

Or you isolate your fast path in a different process or a non-GC'd runtime, the latter being the approach taken here by Instagram.

Based on previous research [1], almost every part of a DB is in the hot path.

[1] http://15721.courses.cs.cmu.edu/spring2018/papers/02-inmemor...

I'd definitely believe that every part of query execution is in the hot path, as this paper describes. But most code in any reasonable DB system isn't part of query execution.
> If you need to mentally track every implicit allocation and deallocation in Java then you are essentially writing code in a kneecapped version of C++.

Well, Java has the advantage of being platform (and to a certain degree, runtime) independent, plus a robust set of best practices and ecosystem when it comes to modules and library handling, which is pretty hard to get done right for C/C++ projects.

Node.js is also platform independent and has a package manager but I wouldn't use it for a High Performance / Low Latency application like a database.
> Well, Java has the advantage of being platform (and to a certain degree, runtime) independent

What is the benefit of that? Who on earth runs a DB written in Java on Windows? Any useful server software will end up using platform-native features, be it SQL Server, MySQL, HBase, ...

> Who on earth runs a DB written in Java on windows?

There still are lots of Windows-only shops.

Developing in Java with awareness of the GC doesn't mean tracking allocation and deallocation of memory; it means developers should avoid allocating lots of new Objects when possible. In practice this means creating view, cursor, or offset type Objects that map to arrays of more primitive data types.
Exactly. Developing GC-aware code is still easier and safer than C++ memory management. Particularly because when you screw up in C++ you get crashes or data loss, and when you screw up in Java you mostly just get GC pauses.
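
To make the view/cursor idea concrete, here is a minimal sketch, not taken from Cassandra or any real codebase. The names `PointColumn` and `Cursor` are made up for illustration. The records live in two primitive arrays, and a single reusable cursor object is repositioned over them instead of allocating one object per record.

```java
// Hypothetical illustration only: PointColumn and Cursor are invented names.
final class PointColumn {
    // Records are stored column-wise in primitive arrays, so the GC sees
    // two large arrays instead of one small object per record.
    private final long[] timestamps;
    private final double[] values;
    private int size;

    PointColumn(int capacity) {
        timestamps = new long[capacity];
        values = new double[capacity];
    }

    void add(long timestamp, double value) {
        timestamps[size] = timestamp;
        values[size] = value;
        size++;
    }

    int size() {
        return size;
    }

    // A reusable view onto one record. Moving it allocates nothing.
    final class Cursor {
        private int index;

        Cursor moveTo(int i) {
            if (i < 0 || i >= size) {
                throw new IndexOutOfBoundsException("index " + i);
            }
            index = i;
            return this;
        }

        long timestamp() {
            return timestamps[index];
        }

        double value() {
            return values[index];
        }
    }

    Cursor cursor() {
        return new Cursor();
    }

    // Scans every record with a single Cursor allocation.
    double sum() {
        Cursor c = cursor();
        double total = 0;
        for (int i = 0; i < size; i++) {
            total += c.moveTo(i).value();
        }
        return total;
    }

    public static void main(String[] args) {
        PointColumn col = new PointColumn(4);
        col.add(10L, 1.5);
        col.add(20L, 2.5);
        System.out.println(col.sum());
    }
}
```

The tradeoff is that a cursor is only valid until it is moved again, so callers must copy out any field they want to keep. That is the kind of discipline being described here, rather than manual allocation and deallocation.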
Mandatory mention of D and its @nogc feature (as it sounds, a compile-time guarantee that a function doesn't allocate on the garbage-collected heap)

https://dlang.org/blog/2017/06/16/life-in-the-fast-lane/

(ScyllaDB employee here)

I don't believe one would need a commercial license just to test a product in any way? They are not making that part of any product at that point, so no concerns here.

They can't test on production servers.

Fake data, non-userfacing servers, sure.

Even if that is a problem, we provide anyone that is interested with a 30-day evaluation license of Scylla Enterprise.
That puts you about 2/3rds of the way down the list of things to try. You're before all the products that have no evaluation license at all, but after all the open source options that developers can test at their leisure.

If I'm trying a product's evaluation license, you can be sure I've tried literally every other option under the sun first, including investigating the possibility of rolling my own if situationally appropriate. No form of development is slower than the kind where I have to wait for a company in another timezone to give me permission to use their software, so it's always last on my options list unless the company has frankly amazing reviews that pique my curiosity.

Instagram doesn't operate any user-facing Cassandra servers, though. They run user-facing web servers that talk to Cassandra internally.

I don't like the AGPL because it's unclear on this exact sort of thing, but it does seem to me like the obvious reading of "all users interacting with it remotely through a computer network" does not encompass the connection between Instagram end users and their internal Cassandra.

And, in any case, they released sources for the thing they came up with - which is all that the AGPL requires. If they're okay with doing that, they can definitely use the AGPL for production commercial software.

Couldn't sticking a proxy in front of any AGPL software defeat its purpose then? If you don't consider transitive connections it seems pointless to me.
That's probably a question for a lawyer, but I would not be surprised at an interpretation that a proxy that just mirrors the API of the thing it proxies doesn't insulate you from license compliance, in the same way that a library that just wraps a GPL library doesn't insulate you from license compliance. The question is whether the user is interacting with the AGPL product - if you're talking to software via a proxy, you'd likely say you're interacting with it, but if you're talking to some other software that happens to use that software, are you really interacting with it?

I guess the weird case is that when I'm using the Instagram app, I wouldn't say I'm personally interacting with even the Instagram front-end servers (the way I am in a browser), I'm just interacting with the app which happens to use the servers. And that does sound like not what the license authors would like.

The server is AGPL. The client is Apache licensed. So I don't see a problem with using the AGPL version in a commercial product.

No one claims that a product using the MySQL driver is a derivative work of the MySQL server?

Edit: Of course, IANAL...

That's specifically what the AGPL does (as opposed to the GPL.) The copyleft "infection" is deliberately transmitted via network clients, not just static linking.

So you actually can't release a permissively licensed client for an AGPL server. I mean, they did, clearly, but the AGPL itself would seem to make that inconsistent.

But then none of this has ever been litigated and both the AGPL and GPL themselves are very confusingly worded so shrug.

As I understand it, with the GPL, you must offer source code under the GPL to everyone you distribute the software to. With the AGPL, the same goes for those that use the software over the network.

So you must offer the source of the database to everyone who connects to the database over the network, under the AGPL. But if you deliver a web app, not a database-as-a-service, your users don't connect to the database. And since this database uses the Cassandra protocol, I'd say your web app isn't a derived work of the database in any way.

Of course, that last part is the sticky bit. But if applications using database servers via a well-defined protocol are judged to be derived works, we might have other problems; hence the reference to MySQL in my first post.

Requiring GPL’d software to function means your product is a derivative work, full stop, as far as the spirit and letter of the license are concerned. Using it over a network instead of linking against it doesn’t change this: if you depend on MySQL or any of the forks and distribute your software, it must be made available under a GPL-compatible license. The requirements of the AGPL become clear in this regard as well; network use is distribution with the AGPL - incorporating AGPL’d software into your application means you must consider your entire application as licensed under the AGPL or a compatible license.

MongoDB muddied the waters here by deciding to interpret the AGPL differently, but I wouldn’t risk your business on it.

Doesn't AGPL allow commercial use?
Per the AGPL preamble[0]:

"The GNU General Public License permits making a modified version and letting the public access it on a server without ever releasing its source code to the public.

The GNU Affero General Public License is designed specifically to ensure that, in such cases, the modified source code becomes available to the community. It requires the operator of a network server to provide the source code of the modified version running there to the users of that server. Therefore, public use of a modified version, on a publicly accessible server, gives the public access to the source code of the modified version."

[0] https://www.gnu.org/licenses/agpl-3.0#preamble

Yes it does, but AGPL licenced software is super banned at all major companies because it is very viral. You have to make derivative software available under AGPL even if the end user accesses it only over a network.
It does, but there are conditions that apply. Some companies don't like going that route, which in my opinion is not a thing, but I can understand the concern.

But for testing, I don't see any impediment.

- has anyone run it at FB scale? for how long?

- how many experienced scylladb devops are there globally that we can hire?

Those are the questions asked at BigTechCo before it adopts somebody else's tech.

FB already operates RocksDB and Cassandra, so there's way less technical, career, and financial risk in just hacking the two together with some aggressive refactoring.

Does FB still use Cassandra? I thought they abandoned it ages ago and then databricks picked it up?
FB abandoned Cassandra (which was really only used for message inbox indexing) when they redid how messages work years ago, but they re-adopted a large C* infra when they bought Instagram.
The article is literally written by Instagram, which is FB.
Well, it's a separate product that Facebook acquired. True or not, it's a common perception that Facebook abandoned Cassandra.

https://www.wired.com/2014/08/datastax/

> then databricks picked it up

I think it's DataStax. Databricks is the company behind Spark.

My thought exactly. Would be interesting to know if they did and if yes, why they chose to develop something in-house anyway.