Per-core store bandwidth is at least 14GB/s on Zen3, and 35GB/s for non-temporal stores. JSON can be parsed at 2GB/s or more.
It's very healthy to take maximum bandwidth limits into consideration when reasoning about performance. For instance, for temporal stores, the bottlenecks you see are due to RAM latency and memory parallelism, because of write-allocate. The load/store uarch can actually retire way more data from SIMD registers.
So there's already some headroom for CPU-bound tasks. For instance, 11MB/s is very slow for a baseline JIT compiler. But if your particular problem demands arbitrary random accesses that regularly exceed L3, maybe that speed is justified.
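To put that headroom in numbers, here is a rough back-of-the-envelope sketch using only the figures quoted in this thread. The constant names and the `headroom` helper are made up for illustration, and decimal units (1GB = 1000MB) are assumed.

```python
# Figures quoted in this thread; treat them as rough orders of magnitude.
STORE_BW_GBPS = 14.0      # per-core temporal store bandwidth on Zen3 (lower bound)
NT_STORE_BW_GBPS = 35.0   # per-core non-temporal store bandwidth
JSON_PARSE_GBPS = 2.0     # JSON parsing throughput (lower bound)
OBSERVED_MBPS = 11.0      # the throughput being questioned


def headroom(limit_gbps: float, observed_mbps: float) -> float:
    """Return how many times faster the limit is than the observed rate.

    Assumes decimal units: 1 GB/s = 1000 MB/s.
    """
    return limit_gbps * 1000.0 / observed_mbps


for name, limit in [
    ("temporal stores", STORE_BW_GBPS),
    ("non-temporal stores", NT_STORE_BW_GBPS),
    ("JSON parsing", JSON_PARSE_GBPS),
]:
    print(f"{name}: about {headroom(limit, OBSERVED_MBPS):.0f}x the observed rate")
```

The ratios say nothing about where the time goes. They only show how far the observed rate sits from the hardware limits, which is the gap that random access beyond L3 would have to explain.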
What we do is CPU bound and we are not just parsing JSON here.
The largest work we do is building an inverted index.
Oversimplified, it is equivalent to this:
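  import json
  from collections import defaultdict

  inverted_index = defaultdict(list)
  for (doc_id, doc_json) in enumerate(doc_jsons):  # doc_jsons: iterable of JSON strings
      c = json.loads(doc_json)
      for (field, field_text) in c.items():
          # whitespace split stands in for the real tokenizer
          for (position, token) in enumerate(field_text.split()):
              inverted_index[token].append((doc_id, position))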
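For anyone who wants a number to compare against the bandwidth figures, here is a minimal sketch that times the loop above and reports MB/s of input JSON. It assumes every field value is a string, uses a whitespace split in place of a real tokenizer, and runs on synthetic documents. `build_index` and `measure_mb_per_s` are hypothetical names, not anything from an actual indexer.

```python
import json
import time
from collections import defaultdict


def build_index(doc_jsons):
    """Build token -> [(doc_id, position), ...] from JSON strings."""
    inverted_index = defaultdict(list)
    for doc_id, doc_json in enumerate(doc_jsons):
        c = json.loads(doc_json)
        for field, field_text in c.items():
            # Whitespace split stands in for a real tokenizer (assumption).
            for position, token in enumerate(field_text.split()):
                inverted_index[token].append((doc_id, position))
    return inverted_index


def measure_mb_per_s(doc_jsons):
    """Return (index, throughput in MB/s of input JSON, decimal units)."""
    total_bytes = sum(len(d.encode("utf-8")) for d in doc_jsons)
    start = time.perf_counter()
    index = build_index(doc_jsons)
    elapsed = time.perf_counter() - start
    return index, total_bytes / 1e6 / elapsed


# Synthetic documents, only to exercise the code path.
docs = [
    json.dumps({"title": f"doc {i}", "body": "the quick brown fox " * 50})
    for i in range(1000)
]
index, mbps = measure_mb_per_s(docs)
print(f"{mbps:.1f} MB/s of input JSON")
```

This measures pure Python on a tiny vocabulary, so the posting lists stay cache-friendly. A real corpus with a large vocabulary would stress random access far more.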
I'm curious, what is your frame of reference with regard to the maximum speed of building inverted indices? Like, what is the maximum throughput you'd expect for this type of task, and what is your reasoning for it?