I've run Kafka at large scale. I've also seen even larger scale attempts to replace it.
Just use Kafka. Seriously, it's rock solid and is practically lingua franca in backend architecture these days. Everyone understands it and every data processing framework or service supports it.
Kafka is much, much more than just distributed pub/sub. Its disk cache optimizations alone make rolling your own a terrible idea.
This is a principle our industry implements poorly. It often seems like each new generation (for very small values of the word generation) must reinvent the same thing. Perhaps because the old thing was too complex to understand immediately... complexity driven by the needs of the underlying problem... complexity which the new implementation will inevitably obtain if it survives long enough and becomes popular enough that anyone cares.
(That said, I'm highly in favor of innovation of most any kind; building new things is great, if the new thing has some plausible innovation over the old thing!)
It’s true not just in software, but in many aspects of modern society. We value innovation (even when it is not innovative) a lot more than maintenance. Just look at physical infrastructure as an example.
> Kafka is much, much more than just distributed pub/sub
In between 0% Kafka and 100% Kafka: Kafka's REST proxy gives a minimalistic API surface for using the system. Recreating that API in another language with a simpler backend is highly achievable...
My Kafka installation has a parallel relational backend that provides an onboarding story for smaller apps and groups, for example. It provides about 13% of Kafka's functionality and can't scale meaningfully, but provides baseline data streaming in a pinch and is API compatible with how we use Kafka most of the time.
This is actually interesting because Kafka is great and the protocol makes sense, but I want a binary to run so I don’t have to deal with the Java ecosystem.
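To make the "simpler relational backend" idea concrete, here is a minimal sketch of an append-only topic log on top of SQLite. Everything in it is an assumption for illustration: the `SqlLog` class, the `records` table, and the `produce`/`consume` method names are hypothetical and are not the REST proxy's API or my actual backend. It assumes a single writer and has no partitions, consumer groups, or replication.

```python
import sqlite3


class SqlLog:
    """Toy append-only topic log on a relational table (hypothetical, not Kafka's API)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS records ("
            " topic TEXT NOT NULL,"
            " off INTEGER NOT NULL,"
            " value BLOB NOT NULL,"
            " PRIMARY KEY (topic, off))"
        )

    def produce(self, topic, value):
        # Single-writer assumption: the next offset is max + 1 for this topic.
        # The primary key rejects a duplicate (topic, off) if two writers race.
        with self.db:  # commits on success, rolls back on error
            (next_off,) = self.db.execute(
                "SELECT COALESCE(MAX(off), -1) + 1 FROM records WHERE topic = ?",
                (topic,),
            ).fetchone()
            self.db.execute(
                "INSERT INTO records (topic, off, value) VALUES (?, ?, ?)",
                (topic, next_off, value),
            )
        return next_off

    def consume(self, topic, offset, max_records=100):
        # Return (offset, value) pairs starting at `offset`, in offset order.
        return self.db.execute(
            "SELECT off, value FROM records"
            " WHERE topic = ? AND off >= ? ORDER BY off LIMIT ?",
            (topic, offset, max_records),
        ).fetchall()
```

A consumer keeps track of the last offset it saw and asks for `offset + 1` next time, which is roughly the part of the Kafka model that small apps tend to rely on.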
> You can lose data on crash and there is no replication, so you have to orchestrate that yourself doing double writes or something.
> The super simple architecture allows for all kinds of hacks to do backups/replication/sharding but you have to do those yourself.
> My usecase is ok with losing some data, and we dont have money to pay for kafka+zk+monitoring(kafka,zk), nor time to learn how to optimize it for our quite big write and very big multi-read load.
> Keep in mind that there is some not-invented-here syndrome involved into making it, but I use the service in production and it works very nice :)
I'm scratching my head about "we dont have money to pay for kafka+zk+monitoring(kafka,zk)". Kafka and Zookeeper are both open source. As are monitoring and alerting tools such as Prometheus. Surely the hosting and storage costs are similar. So what does this project offer its creator, other than a great deal of infrastructural debt and all the latent bugs of a roll-your-own solution that lacks a community?
Just setting that up will cost at least 2-4 GB of RAM, and we are stretched thin as it is; 2 GB of RAM would mean we have to get one more node for our Kubernetes cluster in gcloud.
My team and I understand the 300 lines of code that go into rochefort and can twist and modify it for our needs.
Performance will make or break our startup, which deals with real-time user behaviour analytics, and having written high-performance Java for a while, I know very well how much time I would have to spend looking at G1 logs to fine-tune it.
I am sure we won't use rochefort after we scale up, but for now I think it gives us greater velocity than Kafka (just because if we want to modify Kafka we have to spend a week on a simple change).
I want to be able to add more meta information in the header, or read the files from another process, rsync them to my laptop and read them there, add custom reducers etc. All those things will take me minutes with rochefort and days with Kafka.
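For anyone wondering what "add meta information in the header and read the files from another process" can look like, here is a rough sketch of an append-only file with a length-prefixed header per record. This is not rochefort's actual on-disk format; the layout, the `PREFIX` struct, and the `append`/`read_at` helpers are all made up for illustration. There is no fsync and no replication, so a crash can lose recent writes.

```python
import json
import os
import struct

# Hypothetical record layout (an assumption, not rochefort's format):
# [4-byte header length][4-byte body length][header as JSON][body bytes]
PREFIX = struct.Struct("<II")


def append(path, body, meta=None):
    """Append one record and return the byte offset where it starts."""
    header = json.dumps(meta or {}).encode("utf-8")
    with open(path, "ab") as f:
        offset = f.seek(0, os.SEEK_END)  # current end of file
        f.write(PREFIX.pack(len(header), len(body)))
        f.write(header)
        f.write(body)
    return offset


def read_at(path, offset):
    """Read the record at `offset`; return (meta, body, offset of next record)."""
    with open(path, "rb") as f:
        f.seek(offset)
        hlen, blen = PREFIX.unpack(f.read(PREFIX.size))
        meta = json.loads(f.read(hlen).decode("utf-8"))
        body = f.read(blen)
    return meta, body, offset + PREFIX.size + hlen + blen
```

Because the file is plain bytes plus offsets, any other process (or a copy rsynced to a laptop) can walk it with `read_at` without talking to the writer.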
I'm self-taught and have a single dedicated server, with a single-instance Kafka running on top of ZK. Yes, I lose the benefits of replication, failover, etc. I don't need that though. The whole installation took me half an hour, learning Kafka took maybe 3 hours, and as long as my server's been up, Kafka's been up.
Granted, I am not monitoring Kafka, but I do monitor other processes.
The other nice thing is that now that I have ZK, other software that needs it can just reuse the same process.
I think using the maintenance cost as a reason to write your own tool is a short-sighted decision.
Even if that's the case, deploying and scaling a Kafka cluster is something that hundreds of companies have figured out and publicly written about. It's something that you can hire an experienced engineer to fix. When this thing runs into problems, they will be all new ones.
> I love the concepts Kafka defines so clearly, but the software is too complex and have dozens of knobs you have to adjust.
This is one of our biggest headaches, and it's not even that a Kafka server itself is so configurable. We have hundreds of teams writing client applications, and jumping on bridges because Kafka clients have poor configurations is getting old. Too many knobs to twiddle, but I guess that's what happens if you're expecting to be able to tweak for high performance.
I'm always very curious about the backstory of projects like this. Without that backstory there is very little chance I'd try out something like this.
Ideally the README would explain why Kafka didn't cut it, why the trade-offs the authors made were worth it (in this case), and why I should consider using this system.
Sadly I don't have enough time to read an entire repo of code to try and figure these things out.
this repository appears to be just a hair over a week old, so i am skeptical even of "I use the service in production and it works very nice". fun project i'm sure, but if i felt like breaking the rules and engaging in a little NIH of this sort - i'm not sure i'd choose HTTP (or any other network protocol) as the hub to build it around