Hacker News new | ask | show | jobs
by porges 4002 days ago
> It seems the terrible performance of the STL can be explained by std::string: this thing hits the general purpose allocator every time a new string is constructed. In this benchmark, that means every time we insert a string, possibly more. Not good for such an inner loop. There are ways to speed things up, but that would complicate the code, and defeat the purpose of leaning on the standard library.

It's actually reasonably easy to avoid the unnecessary copying.

Something like this would do (use a string as the buffer, pass it by reference, then use `try_emplace`). Also, it should probably be using the same hash function as your C code:

    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <unordered_map>

    class Intern_pool
    {
        struct fvn_hash
        {
            // FVN-1a hash -- http://isthe.com/chongo/tech/comp/fnv/
            std::size_t operator()(const std::string& s) const
            {
                std::size_t hash = 2166136261; // offset basis (32 bits)
                for (auto c : s)
                {
                    hash ^= c;       // xor
                    hash *= 16777619;   // prime (32 bits)
                }
                
                return hash;
            }
        };

        std::unordered_map<std::string, std::uint32_t, fvn_hash> map;
        std::uint32_t next = 0;

    public:
        std::uint32_t add(const std::string& s)
        {
            auto r = map.try_emplace(s, next);
            if (r.second)
            {
                ++next;
            }

            return r.first->second;
        }
    };

    int main(int argc, const char* argv[])
    {
        for (int i = 1; i < argc; ++i)
        {
            std::ifstream file(argv[i]);

            Intern_pool intern_pool;

            std::string line;
            while (std::getline(file, line))
            {
                intern_pool.add(line);
            }
        }

        return 0;
    }