Show HN: FastWSGI, an ultra-fast WSGI server for Python (github.com)
89 points by james_roberts 1654 days ago
9 comments

I've been developing a Python extension, written in C, that provides users with an ultra-fast WSGI server that they can run their WSGI applications on (Flask, Django, uWSGI etc).

I have also recently managed to get it working on multiple platforms (Linux, MacOS and Windows).

If you want to significantly speed up your WSGI based applications, check it out!

It is still in early development at the moment. Any feedback would be greatly appreciated!

=== [Links] ===

Github: https://github.com/jamesroberts/fastwsgi

Pypi: https://pypi.org/project/fastwsgi/

Performance comparisons against other popular WSGI servers: https://github.com/jamesroberts/fastwsgi/blob/main/performan...

With Python, the benchmarks that matter measure the maximum performance you can get out of a particular piece of hardware with the optimal configuration for the tool in use. (This was most obvious with the async framework benchmarks.)

Perhaps the other WSGI servers are not achieving 100% CPU load while FastWSGI is, and you could easily run them with multiple worker processes to get the same CPU load and comparable performance? (This is just wild speculation.)

I don't know, but that's the kind of benchmark I'd like to see: e.g. what's the maximum performance you can get out of, say, a 4-core CPU with each of them, using whatever configuration fully stresses the CPU, and what other metrics do you see (e.g. asyncio will basically net you lower memory use, but no better performance in RPS)?

It'd also be good if the benchmark tool were not running on the same CPU, as long as you've got a sufficiently fast interconnect.

Don't take this badly: the results you posted are amazing and point to greatly optimised request handling! But these benchmarks do not demonstrate the practical value over any competing tool (just that your worker threads can do more in the same time, which is more of the "good engineering" pat-on-the-back type of thing :).
Thanks for the feedback! Yeah, the benchmarks just highlight higher numbers in very simple tests. They should definitely be taken with a grain of salt. I can try to add some more practical highlights.
Yeah I don't have an optimal benchmarking setup currently. I have a 16-core CPU that I run everything off (server + benchmarks = not ideal).

I've been working on adding multiprocessing to FastWSGI (only works on Linux at the moment). The RPS scales almost linearly with the number of workers. At 4 processes it hits 260k+ RPS with decent CPU utilization.

I can definitely add more details to the benchmarks. Thanks for the feedback!

It's more a question of how the other tools need to be configured to reach a similar CPU utilization, and what RPS they would hit in that case.

I'd create a VM that you can load sufficiently well with FastWSGI, but other tools might need more worker threads or worker processes added to put the same load on that VM.

Basically, what matters in practice is the maximum you can pull out of this hardware, whatever the configuration is. Those numbers are then really comparable.

Thanks for the advice! I'll try this out
Ideally, it turns out that all those other tools load the same hardware just the same, but they simply suck compared to FastWSGI.

Then we can all get collectively excited :)

It would be good to eventually see some comparisons running some average/unoptimized code. IME benchmarks seem to focus on either the very basics, or select areas where they are faster than other apps. This is important to cover, but having something that's closer to a real app is more convincing, even if the performance margins drop somewhat.

To come back to average code: I may start an app and try to optimize things as much as is reasonable. Eventually I'm only able to focus on functionality, and a slide in performance can happen. As more developers are added, this can happen more quickly in the average organization that focuses primarily on functionality. Of course, we should analyze the perf problems and improve the code, but projects like yours may offer a huge perf boost for teams that are struggling here.

I love Python, but I've always enjoyed the extra headroom that the .NET platform provided around performance when code was less than optimal. If you can really speed up WSGI this much it'll be a huge boon.

Thanks for the feedback! Yeah, some more "real world" benchmarks should be added.

And yes, if you have an application that you've tried to optimize somewhat but haven't managed to get the performance you want, or if you simply haven't had the time to rewrite components to be faster, ideally you could use FastWSGI as a drop-in replacement for your current WSGI server and get the extra perf boost for "free". It's still in early development, but this is ultimately one of the main goals of the project.
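To make the "drop-in" idea concrete, here is a minimal sketch of what swapping the server could look like. The application callable stays untouched and only the function that serves it changes. The `fastwsgi.run(wsgi_app=..., host=..., port=...)` call is an assumption based on the project README and may change while the project is in early development; the `serve` helper and its stdlib fallback are hypothetical names added for illustration.

```python
from wsgiref.simple_server import make_server


def app(environ, start_response):
    """A plain WSGI callable; nothing in here is specific to any server."""
    body = b"Hello, World!"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def serve(wsgi_app, host="127.0.0.1", port=5000):
    """Hypothetical helper: use FastWSGI if it is installed, else the stdlib server."""
    try:
        import fastwsgi  # third-party; assumed to be installed from PyPI
    except ImportError:
        # Fallback: the slow but always-available reference server.
        with make_server(host, port, wsgi_app) as httpd:
            httpd.serve_forever()
    else:
        # Assumed entry point (from the README); blocks until interrupted.
        fastwsgi.run(wsgi_app=wsgi_app, host=host, port=port)
```

Calling `serve(app)` blocks and starts listening. A Flask application object is itself a WSGI callable, so in principle it could be passed in place of `app` without other changes.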

> provides users with an ultra-fast WSGI server that they can run their WSGI applications on (Flask, Django, uWSGI etc)

I thought this was an equivalent to uWSGI, what's the benefit of running uWSGI on top of FastWSGI?

Oops. This is a typo... should be WSGI apps. Not the uWSGI server
In real-world performance it doesn't appear to be that much faster than Bjoern, and you also surmised that Flask is the bottleneck.

Are there other Python WSGI frameworks that can take advantage of FastWSGI?

Why is Flask as slow as it is?

Since this is C, what security mitigations have you put into place?

In theory, any framework that follows the WSGI guidelines should be able to run on top of FastWSGI and take advantage of its speed.

There are many frameworks out there. I do intend to test out some more. For now I've only tested the popular Flask framework and a simple bare bones WSGI app.

Flask was never developed to be lightning fast. Even still, I am quite surprised at how slow it is now that I've seen what kind of numbers can be achieved. I haven't looked deep into it to see where the issues might be.

As for security, that is a work in progress... I definitely wouldn't use FastWSGI in production in the project's current state. It's still early days in terms of development.

Is it to do with the router? I was reading this article:

https://www.slideshare.net/kwatch/how-to-make-the-fastest-ro...

which led me to this code:

https://github.com/kwatch/router-sample/blob/master/minikeig...

and I was going to consider trying to benchmark different routers. I started a repo here which drops Flask and uses some random router off PyPI... https://github.com/byteface/fastwsgitest/blob/master/app.py

and I think I just need to merge that state machine router into it for a test. I believe it's part of the architecture for a templating engine called Tenjin that predates even Jinja?

Found your benchmark repo, but I haven't installed wrk yet to test.

Looks great, thanks. I was going through the Performance Benchmarks and noticed that the mod_wsgi Apache module - https://github.com/GrahamDumpleton/mod_wsgi - is missing? Please consider including it in the benchmark too - would love to see how your module matches against it.
Will do! Thanks for the feedback
FYI, you have a typo in the graph `Requests served in 60 seconds`. server'd.
Thanks, nice catch! I will update that.
Cool! Looks pretty lightweight. Skimming the code, I see some uses of strtok that look unsafe...I think a URL consisting of entirely `?` will crash.

Yep:

  Server listening at http://0.0.0.0:5000
  Parse error: HPE_INVALID_URL Unexpected start char in url
  free(): double free detected in tcache 2
  [1]    70550 IOT instruction (core dumped)  
(Opened an issue on github)
Thanks so much for reporting a bug! You created the first issue for the project! I'll definitely have a look into this and push a fix
Bjoern author here.

The code looks very nice, looks like it’s inspired by bjoern but much cleaner. I’ve wanted to clean up the mess that the bjoern code is for years but never came around to actually doing it.

That being said, I don’t think it provides a lot of value in practice. Same with bjoern. With a reasonably fast server implementation, around 99% of the time is spent in your Python application, not the server, so optimizing the server doesn’t gain you much. But it’s a nice project for learning how to write an HTTP and WSGI server :)

Hey, thanks! Great job on Bjoern by the way!

Yeah, I did get some inspiration from Bjoern for this project. Was also a learning project for me. I used this project to pick up C and learn a bit more about HTTP!

Hoping to add a few more features to the server side of things, still not finished, but I agree, there is only so much you can do to optimize things when you're making calls to Python anyway.

While you are completely right, I applaud both James and you for working to potentially remove one bottleneck. If wins are significant enough and people are not blocked by request processing anymore, we'll see them work on frameworks which are better optimized too.
This was my first thought - Python itself is always the bottleneck.
I'm sick of everyone hating on python speed. IO is the bottleneck. By my estimate, 95% of applications are spending <5% of their response time executing python. Maybe your application is compute heavy and is better written in rust, but python gets shit done with minimal effort and the extra cycles are rarely an issue in most workloads.
I write a LOT of Python. I’m not hating on it.

You can definitely outrun I/O but you can never outrun the GIL.

Sure, IO is often the bottleneck, but the python interpreter can most definitely add a lot of overhead, which can add a lot of operational costs. Personally, even in IO bound applications, I prefer something like Go or Elixir, and with those languages, it's not clear what python's productivity advantage is or is not (and I've known python since v1.5.2 so it's not a matter of familiarity).
We had a vendor who implemented their stack with Flask and Postgres on Debian. Their API is consistently slow (seconds to tens of seconds) to the point that we wrote our own app in Dotnet Core (running atop Postgres and Debian) that queries the available content once a day (500k rows of data) with minor refreshes hourly.

We take tens of milliseconds to query Postgres and generate a rendered HTML page for our clients. Showing this to the vendor's devs we got a very surprised response.

Admittedly, we do not operate at their scale, but I am certain this $5 a month droplet will keep running this app for a long time yet even with many users :)

Edit: I did write an MVP in Python atop SQLAlchemy wrapping Postgres, but the performance was still not ideal when rendering hundreds or thousands of rows of data, and the primary developer was already using Dotnet Core.

Bad code is bad code in any language. Our ecommerce website jacobsparts.com is written in Python on Debian. It hits the db and rerenders with every page load. I dare you to call it slow.
Awesome stuff! What are some optimizations you have used, if you don't mind me asking? What's the underlying framework?
How big are those responses? I’ve seen terrible performance come down to serialization, which can be addressed by swapping in a fast serialization library like orjson (https://github.com/ijl/orjson). Though even then you’d probably have a hard time getting to tens of milliseconds. Other common culprits: poor indexing, n+1 queries.
At my last company, there was an existing product (an auto ml product for business users) built on Flask with a ton of serialization occurring to populate various charts in the GUI.

After I had left that specific team, I came back and swapped out the JSON serializer with orjson. It was like 5 lines of code if I recall. The performance skyrocketed. The GUI was noticeably far more responsive in populating the various charts and plots. By "noticeable" I mean it was loading in less than 1/3 of the previous time. Definitely recommend it. It's written in Rust, and it inspired me to start learning the language.
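For anyone curious what such a swap can look like, here is a minimal sketch (not the code from that product). It assumes orjson is installed and falls back to the standard library otherwise; the `dumps` wrapper and the `chart` payload are made up for illustration. The one thing to watch is that `orjson.dumps` returns `bytes` rather than `str`. How the serializer gets wired into Flask depends on the Flask version and is left out here.

```python
import json

try:
    import orjson  # third-party, Rust-backed; assumed to be installed

    def dumps(obj) -> bytes:
        # orjson.dumps returns compact UTF-8 bytes, not a str.
        return orjson.dumps(obj)
except ImportError:
    def dumps(obj) -> bytes:
        # Fallback so the sketch still runs without orjson.
        return json.dumps(obj).encode("utf-8")


# Hypothetical chart payload, invented for this example.
chart = {"title": "demo", "series": [{"x": i, "y": i * i} for i in range(5)]}
payload = dumps(chart)
```

Because both branches return bytes, the rest of the response code does not need to care which serializer produced them.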

I’m not familiar with dotnet, but I’m not sure Python is the problem here.

A more comparable rewrite would have been FastAPI with an asynchronous library for Postgres (such as SQLAlchemy or Tortoise ORM).

There are probably ways to achieve similar results with Django or Flask, but it’s pretty easy with FastAPI.

To any experienced Python dev, it's obvious from their description what the problem is. And it's understandable that anyone inexperienced with Python would blame Python.

They were returning a large number of rows from Postgres (which, if the DB is properly set up, should take at most tens of ms: of course, depending on the width of the rows too), and most (well, I know of none that don't) Python ORM libraries (SQLAlchemy included) have a huge "serialization" cost (turning raw data from Postgres into objects). I've done a benchmark once, and things like Django-ORM or SQLAlchemy were like 10-50x slower than fetching tuples with psycopg directly. SQLAlchemy-core was fastest when fetching tuples if you wanted to not do raw SQL (IIRC, a performance penalty of at most 100%, translated to a factor, up to 2x slower), but Django's fetch-me-tuples functionality was also a single digit multiple of psycopg.

So, the solution to that problem is to fetch tuples, and then pass them in for rendering the page.
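A minimal sketch of that approach, using the standard library's `sqlite3` as a stand-in for Postgres/psycopg so it runs anywhere. The `items` table, its columns, and the helper names are invented for illustration. The point is only that rows stay plain tuples from the cursor all the way to the rendering step, with no per-row model objects in between.

```python
import sqlite3
from html import escape

# sqlite3 stands in for Postgres/psycopg so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO items (name, price) VALUES (?, ?)",
    [("bolt", 0.25), ("nut <M8>", 0.1), ("washer", 0.05)],
)


def fetch_rows(connection):
    """Return plain tuples; no ORM objects are constructed per row."""
    cur = connection.execute("SELECT id, name, price FROM items ORDER BY id")
    return cur.fetchall()


def render_table(rows):
    """Build the HTML straight from the tuples, escaping text fields."""
    body = "".join(
        f"<tr><td>{id_}</td><td>{escape(name)}</td><td>{price:.2f}</td></tr>"
        for id_, name, price in rows
    )
    return f"<table>{body}</table>"


rows = fetch_rows(conn)
html_page = render_table(rows)
```

With a real Postgres driver the shape is the same: run the query on a cursor, call `fetchall()`, and hand the tuples to the template.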

Of course, this also points at the problem with all the ORM implementations in Python: they are being too "smart" and dynamic for their own good (if all are bad at it, it also means that Python is not doing something good either, so criticism is warranted).

That's not what was said. Python application code is the bottleneck. As evident from the benchmarks for FastWSGI, even Flask the framework is a bottleneck: pure Python vs Flask went from 70k RPS to 9k RPS.

Python does have a huge performance penalty for basic computation, which is why it has a bunch of C-based libraries that provide bulk-operations that avoid it. If properly used (you rarely need to roll your own with a number of compiled libraries present), Python itself is not a bottleneck. One can argue if that's still Python, but at the very least, it's idiomatic Python development: I hope you don't use Python to prove that pure dynamic languages can outperform compiled languages, but to develop and deliver applications faster using Python's expressiveness.

People bring up the GIL as well: it mostly affects your application's startup time and memory usage, since you can trivially avoid it by running multiple Python processes. The performance of the executing code itself will only be minimally affected if you switch to multi-processing (of course, if memory pressure is so high that having all those Python libraries loaded multiple times in memory affects your app, that can hinder performance, but that's going to be pretty rare).

Would love to see benchmarks comparing this to FastAPI?

Thanks for this project - will definitely be keeping my eye on it!

FastAPI is an entire web framework using ASGI, whereas this is just a WSGI server.

You could compare FastAPI vs FastWSGI+Flask or any other WSGI framework.

I doubt that, the project name is FastWSGI, not FastASGI.
FastAPI is a framework, the server underneath FastAPI would be uvicorn, which is also based on libuv
Actually you can use any ASGI-compatible web server[1] with FastAPI, uvicorn is generally preferred though because it is apparently the fastest.

[1] https://asgi.readthedocs.io/en/latest/implementations.html

Tried fastwsgi + falcon cythonized.

  Running 1m test @ http://localhost:5000
    8 threads and 100 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     1.41ms  109.36us   3.85ms   76.27%
      Req/Sec     8.53k   625.45    25.69k    79.77%
    Latency Distribution
      50%    1.43ms
      75%    1.48ms
      90%    1.53ms
      99%    1.62ms
    4072168 requests in 1.00m, 454.37MB read
  Requests/sec:  67816.66
  Transfer/sec:      7.57MB
Pure fastwsgi on same machine

  Running 1m test @ http://localhost:5000
    8 threads and 100 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency   645.52us   30.50us   1.33ms   82.06%
      Req/Sec    18.67k   795.08    36.56k    86.86%
    Latency Distribution
      50%  648.00us
      75%  658.00us
      90%  668.00us
      99%  736.00us
    8919901 requests in 1.00m, 867.68MB read
  Requests/sec: 148419.82
  Transfer/sec:     14.44MB
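A rough back-of-the-envelope reading of the two runs above. It assumes a single server process handling requests one after another, so that 1/RPS approximates the time spent per request; that is a simplification, not something wrk measures directly. The variable names are made up for this sketch.

```python
# Requests/sec as reported in the two wrk runs above (same machine).
rps_falcon = 67816.66   # fastwsgi + falcon (cythonized)
rps_bare = 148419.82    # bare WSGI app on fastwsgi

# Approximate time per request in microseconds (assumes serial handling).
us_per_req_falcon = 1e6 / rps_falcon
us_per_req_bare = 1e6 / rps_bare

# Extra time per request that the framework layer adds on top of the server.
framework_overhead_us = us_per_req_falcon - us_per_req_bare

# Share of the bare throughput that remains once the framework is added.
throughput_retained = rps_falcon / rps_bare

print(f"{us_per_req_falcon:.1f} us vs {us_per_req_bare:.1f} us per request")
print(f"framework overhead: about {framework_overhead_us:.1f} us per request")
print(f"throughput retained: about {throughput_retained:.0%}")
```

Under that assumption the framework layer costs roughly 8 microseconds per request, and a bit under half of the bare throughput is retained.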
Damn nice.
The pure fastwsgi results there are insane! Are you running a single process? That's double the RPS that I've been getting on my machine.
AMD 5900x, yup running a single process
Awesome! That performs really well!
Would love to see uwsgi added to the benchmarks. It used to be much faster than gunicorn.
Thanks for the feedback! Will add it shortly
Thank you for adding it so quick!

Not as big a difference as I'd expected.

Good choice for using llhttp, I was always wondering why everyone in the python world keeps ignoring llhttp since it's so much faster than any others. I hope to see some http client lib using llhttp.
Thanks! Yeah, I stumbled upon it while looking for ways to parse HTTP requests. Turns out it's awesome. I believe NodeJS uses it under the hood as well. Could be mistaken...
Those are some amazing performance metrics! What's the catch? Is there a tradeoff or did you really find a 10x efficiency?
Thanks! I was quite surprised myself when I saw the numbers for the basic WSGI app. I'm still trying to figure out exactly why my server is so much faster than other servers out there...

My guess is that I avoid calling Python as much as possible. Most of the parsing and the `start_response` code is all written in C instead of Python.

Flask applications seem to have some bottleneck. I don't see the same performance numbers as I do with the basic WSGI app.

*A note: Those tests are just "Hello World" tests on a single worker. They should be taken with a grain of salt...

Should explain what WSGI is.
Web Server Gateway Interface. Simply put, it is just a convention for forwarding requests to Python web applications.

More info here:

python.org/dev/peps/pep-0333/
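As a small illustration of that convention, here is a sketch of both sides. The application is a callable taking `environ` and `start_response`. The `call_app` helper is a hypothetical, stripped-down version of what a server does once it has parsed a request; real servers such as FastWSGI do far more (sockets, HTTP parsing, keep-alive). Only the standard library is used.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Application side: read the request from environ, reply via start_response."""
    body = f"You asked for {environ['PATH_INFO']}\n".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]  # an iterable of bytes


def call_app(wsgi_app, path):
    """Server side, reduced to its core: build environ, call the app, collect output."""
    environ = {"PATH_INFO": path, "SCRIPT_NAME": ""}
    setup_testing_defaults(environ)  # fills in the other required WSGI keys
    response = {}

    def start_response(status, headers, exc_info=None):
        response["status"] = status
        response["headers"] = headers

    chunks = wsgi_app(environ, start_response)
    try:
        response["body"] = b"".join(chunks)
    finally:
        # The spec asks servers to call close() on the result if it exists.
        if hasattr(chunks, "close"):
            chunks.close()
    return response


result = call_app(app, "/hello")
```

Any server that performs this handshake can host any application written against it, which is why frameworks and servers can be mixed freely.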