| Regarding `cmpb $0, %fs:__tls_guard@tpoff`: the per-function-call overhead comes from the requirement that dynamic initialization happen on first use:

> Block variables with static or thread(since C++11) storage duration are initialized the first time control passes through their declaration (unless their initialization is zero- or constant-initialization, which can be performed before the block is first entered). On all further calls, the declaration is skipped.

--- https://en.cppreference.com/w/cpp/language/storage_duration

From https://maskray.me/blog/2021-02-14-all-about-thread-local-st...

> If you know x does not need dynamic initialization, C++20 constinit can make it as efficient as the plain old `__thread`. [[clang::require_constant_initialization]] can be used with older language standards.

Regarding `data16 lea tls_obj(%rip),%rdi` in the general-dynamic TLS model: yes, the prefixes are there for linker optimization (they pad the sequence so the linker can rewrite it in place into a cheaper TLS model). The local-dynamic TLS model doesn't have data16 or rex prefixes.

Regarding "Why don’t we just use the same code as before - the movl instruction - with the dynamic linker substituting the right value for tls_obj@tpoff?": because -fpic/-fPIC was designed to support dlopen.
The desired efficient GOTTPOFF code sequence is only feasible when the shared object is available at program start, in which case you can guarantee that "you would need the TLS areas of all the shared libraries to be allocated contiguously":
# x86-64
movq ref@GOTTPOFF(%rip), %rax
movl %fs:(%rax), %eax
With dlopen, the dynamic loader needs a different place for the TLS blocks of newly loaded shared libraries, which unfortunately requires one more indirection.

Regarding "... and I don’t say a word about GL_TLS_GENERATION_OFFSET, for example, and I could.": `GL_TLS_GENERATION_OFFSET` in glibc is for the lazy TLS allocation scheme. I don't want to spend my valuable time on its implementation...
It is almost infeasible to fix on the glibc side. |
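To make the `constinit` point from the comment above concrete, here is a minimal sketch, not taken from the post or the linked article. The names `TLS_CONSTINIT`, `g_dynamic_init_runs`, `make_initial_value`, `dynamic_tls`, `constant_tls`, `bump_dynamic` and `bump_constant` are all made up for illustration. It assumes a GCC- or Clang-like compiler, where access to a dynamically initialized `thread_local` typically goes through a guard check, while a constant-initialized one needs none.

```cpp
#include <cassert>

// Use constinit where available (C++20). On older standards, fall back to the
// Clang attribute mentioned in the comment, or to nothing on other compilers.
#if defined(__cpp_constinit)
#  define TLS_CONSTINIT constinit
#elif defined(__clang__)
#  define TLS_CONSTINIT [[clang::require_constant_initialization]]
#else
#  define TLS_CONSTINIT
#endif

// Hypothetical counter recording how often the dynamic initializer ran.
int g_dynamic_init_runs = 0;

// Hypothetical non-constexpr initializer: forces dynamic initialization.
int make_initial_value() {
    ++g_dynamic_init_runs;
    return 42;
}

// Dynamically initialized: the compiler has to make sure the initializer
// has run before the variable is used, typically via a guard check.
thread_local int dynamic_tls = make_initial_value();

// Constant-initialized; where TLS_CONSTINIT is non-empty, the compiler
// rejects the declaration if the initializer is not a constant expression.
TLS_CONSTINIT thread_local int constant_tls = 42;

int bump_dynamic() { return ++dynamic_tls; }
int bump_constant() { return ++constant_tls; }
```

Compiling this with optimization and `-S`, then comparing the code generated for `bump_dynamic` and `bump_constant`, is one way to check whether the guard comparison appears for your compiler and flags.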
Thanks - I didn’t realize this was mandated by the standard as opposed to “permitted” as one possibility (similarly to how, e.g., a constructor of a global variable can be called before main, upon first use, or anywhere in between according to the standard). Updated the post with this point.
> The desired efficient GOTTPOFF code sequence is only feasible when the shared object is available at program start, in which case you can guarantee that “you would need the TLS areas of all the shared libraries to be allocated contiguously”
Indeed, I didn’t mention -ftls-model=initial-exec originally (I have now added it based on reader feedback; it works when it works, which for my use case is a toss-up, I guess…), but my point is that you could allocate the TLS areas contiguously even if dlopen was used, and I describe how you could do it in the post, albeit in a somewhat hand-wavy way. This is totally not how things are done, and I presume one reason is that you don’t carve out chunks of the address space for a use case like this, as my approach would require - I just think it would be nice if things worked that way.
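For readers who want to try the initial-exec route on a single variable rather than a whole translation unit, a minimal sketch follows. It assumes GCC or Clang, which accept the `tls_model` variable attribute; the variable and function names are made up for illustration. In a shared object loaded via dlopen, this model depends on the loader still having room in the static TLS area, which is the toss-up mentioned above.

```cpp
#include <cassert>

// Assumption: GCC/Clang-specific attribute. It requests the initial-exec TLS
// model for this one variable, similar in spirit to -ftls-model=initial-exec.
thread_local int ie_counter __attribute__((tls_model("initial-exec"))) = 0;

// For comparison: the model is chosen by the compiler from its flags
// (e.g. general-dynamic under -fPIC).
thread_local int default_counter = 0;

int bump_initial_exec() { return ++ie_counter; }
int bump_default() { return ++default_counter; }
```

Built with `-fPIC -O2 -S`, the two functions can be compared to see whether the first uses the GOTTPOFF-style sequence while the second goes through a `__tls_get_addr` call. The exact output depends on the compiler and target.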