Thanks for the feedback. I have already replied in the other thread about O_DSYNC, which a lot of folks suggested, so I will not repeat it here.
As for the benchmark results, the difference was mainly due to metadata management. We have implemented our own KV store (see the internals here [1]), which is more efficient than ext4 namespace management, even after very aggressive fs tuning for ext4 [2] (plus 65536-way sharding for each directory level).
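
For readers unfamiliar with this kind of layout, hash-based directory sharding with 65536 buckets per level could be sketched roughly as below. This is only an illustration: the hash function (SHA-256), the two-level depth, and the names `shard_path`, `FANOUT`, and `LEVELS` are assumptions, not the actual setup described in [2]. Keys are assumed to contain no `/`.

```python
import hashlib
from pathlib import PurePosixPath

FANOUT = 65536  # shards per directory level, as mentioned above
LEVELS = 2      # assumption: the real number of levels is not stated


def shard_path(root: str, key: str, levels: int = LEVELS) -> PurePosixPath:
    """Map a key to root/<bucket>/<bucket>/.../key (hypothetical layout)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    parts = []
    for i in range(levels):
        # Take two bytes of the digest per level: a value in 0..65535.
        bucket = int.from_bytes(digest[2 * i:2 * i + 2], "big") % FANOUT
        # Four hex digits name the shard directory at this level.
        parts.append(f"{bucket:04x}")
    return PurePosixPath(root, *parts, key)


if __name__ == "__main__":
    print(shard_path("/data", "object-000123"))
```

Each lookup still walks one directory per level, so the per-file metadata cost in ext4 remains even when each directory is kept small.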