(this post contains over half a decade of increasing frustration about ongoing mediocrity in hashmap implementations. Be advised, there will be cursing)
People keep making elementary bloody mistakes implementing hashmaps. In an effort to get them to cut it out, I herein detail a few basic, simple, reasonably generally applicable techniques to produce performant, functioning hashmaps. Chiefly, these are concerned with reducing memory footprint of the passed-around hashmap struct, reducing cache misses, and producing guaranteed constant-time performance wherever feasible.
Introduction
Like every computer science major, I spent quite a while slacking off and chasing my own special interests at a few points during my studies. Okay, maybe that isn't everyone. Anyway, I dug up a range of information about the hardware our programs usually run on (AMD64, ARM; the documentation is public and available from AMD, ARM, and Intel), as well as about hashmap implementations. At some point in early 2018, I started writing a cacheline-efficient hashmap, which kicked into overdrive once I found a blog post about Fibonacci hashing (probably https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/ ), and I ended up writing and publishing a hashmap implementation (available at https://github.com/sio-k/LibSio/blob/main/HashMap.hpp ) that incorporates a bunch of the techniques detailed in this document. I've since audited an enormous number of hashmap implementations, and have found them to almost universally fail at even the most basic things, so I figured I'd finally write this extremely loose guide on hashmap implementation techniques.
Techniques
Focus on the common case
Altogether too often, standard library programmers appear to forget that atrocious performance under specific circumstances is perfectly acceptable, or that hashmap queries/lookups occur an order of magnitude more often than any other operation on the hashmap. If you want to implement a useful hashmap, find a use case, understand how that use case affects how often the various basic operations occur relative to each other, what a realistic number of elements in the hashmap is, and whether you can benefit from implementing some very specific fused operations on your hashmap. Then implement that specific hashmap. Only then generalize, and only if you find a concrete case where generalizing is necessary.
Keep the passed around struct as small as possible
Very often, when standard library implementers implement containers in a vacuum, they implement them to perform exceedingly well in microbenchmarks or fairly specific usage scenarios. A good example of this is std::string, which is commonly 32 bytes in size so as to accommodate a small-string cache inside the struct. On the other hand, this comes with significant downsides if you use std::string as a member inside a struct, as it occupies half a cacheline by itself (cachelines are typically 64 bytes; there are some specific exceptions to this, such as the Apple M1's Level 2 cacheline size being 128 bytes).
Thus, in order to not hog huge amounts of cacheline space for no good reason, especially when you need not do so, you'll want to reduce the size of your struct. You can do so by (for example):
- reducing the number of bits necessary to store information about the hashmap, e.g. by limiting yourself to only power-of-two sizes, you can encode the hashmap size as a power of two and recover it using e.g. 1 << (MINIMUM_SIZE + size_encoded)
- fixing what value designates an empty key or hash cell, so you don't need to store that data separately and can store it inside the cells directly
- not storing extraneous data in the struct instance
- storing rarely-used data in the same allocation as the table itself
- using only one allocation for the table rather than two
- generating necessary information at point-of-use
- storing things such as e.g. the number of elements in the hashmap in otherwise unused bits of the pointer to the table. Note that C and C++ don't permit bit-fields of pointer type, so the pointer has to be carried as an integer and cast at point of use, e.g.:
struct hashmap { uintptr_t data : 58 /* 26 on 32-bit machines */; uintptr_t size : 6; };
This is applicable e.g. if the pointer to the data can be guaranteed to be 64-byte aligned (and you store log2 of your allocation size, in elements, with 0 meaning a reasonable minimum value such as 8 elements), something that is generally supported by allocators. If not supported by your allocator, you can always ask the OS for some pages, which are usually going to be 4096-byte aligned, giving you even more known-zero bits to use as you see fit:
struct hashmap { uintptr_t data : 52 /* 20 on 32-bit machines */; uintptr_t size : 12; };
The end result of all these is functionally the same: your hashmap struct is now 8 bytes or 4 bytes in size (depending on platform), ensuring you pollute the cache as little as you can.
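A minimal sketch of the pointer-tagging idea, using masking rather than bit-fields (all names here are illustrative, not taken from any particular implementation). It assumes the table allocation is at least 64-byte aligned, so the low 6 bits of the pointer are known to be zero and can carry log2 of the capacity:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// One machine word carries both the table pointer (low 6 bits zeroed by
// alignment) and log2 of the table capacity (stored in those 6 bits).
struct hashmap_handle {
    std::uintptr_t bits;

    void* data() const {
        return reinterpret_cast<void*>(bits & ~std::uintptr_t{63});
    }
    std::size_t capacity() const {
        return std::size_t{1} << (bits & 63);
    }
};

inline hashmap_handle pack(void* table, unsigned log2_capacity) {
    auto raw = reinterpret_cast<std::uintptr_t>(table);
    assert((raw & 63) == 0);    // requires a 64-byte-aligned allocation
    assert(log2_capacity < 64); // must fit in the low 6 bits
    return hashmap_handle{raw | log2_capacity};
}
```

The handle is exactly one pointer in size, so passing it around costs no more than passing the bare table pointer would.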
Choose an actual hashing function
Looking at you, libc++. THE IDENTITY FUNCTION IS NOT A HASHING FUNCTION SUITABLE FOR USE IN A HASHMAP. It produces profoundly terrible behavior, as it will yield an atrocious distribution in the common case of "the key is an integer that we just count up as time goes on".
There's a few useful visualizations of this available at https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/ .
If you must guard against the likes of C++ standard library implementers, use a secondary hash, e.g. fibonacci hashing, to ensure that the distribution is at least adequate in the end.
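Such a secondary hash is tiny. A sketch of Fibonacci hashing as described in the blog post linked above: multiply by 2^64 divided by the golden ratio and keep the top bits, so even sequential integer keys spread out over the table.

```cpp
#include <cstdint>

// Fibonacci hashing as a cheap secondary hash: one multiply, one shift.
// Keeping the *top* log2(table_size) bits is what makes this work; the
// high bits of hash * k are well mixed even when `hash` is 0, 1, 2, ...
inline std::uint64_t fib_index(std::uint64_t hash, unsigned log2_table_size) {
    const std::uint64_t k = 11400714819323198485ull; // 2^64 / golden ratio
    return (hash * k) >> (64 - log2_table_size);
}
```

For a table of 16 slots, the sequential keys 0, 1, 2 land in slots 0, 9, 3 instead of 0, 1, 2, so a counting key no longer marches through consecutive buckets.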
Disallow certain keys
You can save a lot of space by simply not allowing 0 as a key, thus allowing you to use a value of 0 in the key array as a tombstone.
Okay, you can't disallow certain keys: store the hashes
If you can't or don't want to disallow certain keys, you can disallow certain hash values instead, e.g. by hashing into the range [1, 2^64 - 1] rather than [0, 2^64 - 1], thus allowing you to reserve the hash value 0 as a tombstone.
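One simple way to hash into a range that excludes 0 is to remap a hash of 0 to some fixed nonzero constant after the fact. A sketch (the function name and the particular constant are mine; any fixed nonzero value works, at the cost of one extra slot's worth of collisions):

```cpp
#include <cstdint>

// Wraps any 64-bit hash so it never returns 0, freeing up 0 in the
// stored-hash array to mean "empty"/"tombstone".
inline std::uint64_t nonzero_hash(std::uint64_t h) {
    return h == 0 ? 0x9E3779B97F4A7C15ull : h; // arbitrary nonzero constant
}
```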
For users: use integers as keys as much as possible
Traversing strings takes time, hashing them sensibly takes more, and you're going to thrash your cache more on average doing either, so I highly recommend using integers as keys wherever you can. A simple technique for this is to map from strings to integer IDs at an "entry" point (e.g. reading in a file), using a single hashmap, and then henceforth using only the integer ID. If you don't use the identity function as a hash function, you can just count up a counter to assign these IDs, or you could use UUIDs, or whatever you feel like, as long as you have enough of a guarantee that they wind up sufficiently unique.
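The "intern once at the entry point" idea can be sketched in a few lines (this uses std::unordered_map purely for brevity; the struct and method names are illustrative):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Maps each distinct string to a counted-up integer ID, exactly once.
// Everything downstream of the entry point keys off the cheap integer.
struct Interner {
    std::unordered_map<std::string, std::uint32_t> ids;

    std::uint32_t intern(const std::string& s) {
        // try_emplace only inserts if absent; the next ID is the current count
        auto [it, inserted] = ids.try_emplace(s, std::uint32_t(ids.size()));
        return it->second;
    }
};
```

After the entry point, every hashmap in the program can key on the dense 32-bit IDs instead of hashing and comparing full strings on each lookup.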
Keep the hardware your program will be running on in mind
Understand what your program is running on. Your memory is never going to be infinite, and you'll never be running on a truly arbitrary architecture. In fact, the chance your program will be running on anything but x86, AMD64, ARM 32-bit, ARM 64-bit, or RISC-V (32- or 64-bit) is very small - and if it does indeed have to, trust me... you'll know all too well.
Become familiar with the inherent limitations, caveats, and features of the architectures your programs will run on. You'll notice there's a lot of commonality between the architectures currently in widespread use, quite a lot more than the design of C, C++, Rust etc. would suggest there is. This commonality you can - and absolutely should - make use of to improve your programs.
AMD, Intel, and ARM publish programmer's manuals on their respective websites. Find them, reference them, understand the machine you're programming for. If you find yourself able to actually read these books cover to cover, do so. If like me you can't, keep them handy - this is as close to word of god as you're gonna get in this space.
There's multiple sites on the internet collating useful and reliable data about these processors, e.g. WikiChip (reasonably detailed general information) and uops.info (timing information of CPU instructions).
Constant realtime querying: cacheline-aligned buckets
In short: we can in fact guarantee that hashmap queries not only occur in bounded time, but in constant realtime, or very close to it. We exploit the fact that high-performance processors generally fetch a whole cacheline at once, i.e. 64 bytes, and that this cacheline fetch is what takes the most time, while ensuring that we do not pollute the cacheline with things we don't need (i.e. values). We can then simply scan the entire 64 bytes to see if we get a key match; if not, the map doesn't contain anything under that key. When inserting, we try to insert into the bucket, and if there are no free slots, we double the size of the hashmap and try again.
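A sketch of such a bucket, assuming 64-bit keys, values stored in a separate parallel array, and key 0 reserved as the empty marker as described earlier (names are mine, not from any particular implementation):

```cpp
#include <cstdint>

// One bucket = exactly one 64-byte cacheline holding eight keys and
// nothing else; a lookup touches exactly one cacheline of key data.
struct alignas(64) Bucket {
    std::uint64_t keys[8];
};
static_assert(sizeof(Bucket) == 64, "bucket must span exactly one cacheline");

// Returns the slot index holding `key`, or -1 if this bucket doesn't
// contain it (in which case the map doesn't contain it at all).
inline int find_in_bucket(const Bucket& b, std::uint64_t key) {
    for (int i = 0; i < 8; ++i)
        if (b.keys[i] == key)
            return i;
    return -1;
}
```

The fixed-width scan compiles down to a handful of compares (or one vector compare with SIMD), all against data already in the fetched cacheline, which is where the near-constant query time comes from.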
In theory, this should be capable of resulting in situations where the load factor is exceedingly low, but that's apparently not how it shakes out in practice. Once more, theory is beaten by practice and I do not know why. In fact, with a decent choice of hash function, this type of hashmap appears to perform better than any other implementation.
I refuse to believe I'm the first to have come up with this, but I've thus far found no major implementation doing this (except Odin's implementation, which is rather obviously based on mine¹).
For fuck's sake, stop this struct { key, value } shit
That's the most elementary way to make your performance suck on all but the most constrained embedded platforms (where, let's face it, you probably want to conserve memory, and this layout is still crap for that, because alignment padding gets in the way).
What to do instead: either separate parallel arrays for keys and values, or one array that first holds all the keys, then all the values. Sure, that's "harder to grow", but not actually by much. Traversing the entire map still takes a while, but at least you're likely to avoid cache misses on the way.
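A sketch of the single-allocation variant, keys first and values after, with illustrative names and 64-bit keys/values standing in for whatever your map actually stores:

```cpp
#include <cstdint>
#include <cstdlib>

// One allocation: capacity keys, immediately followed by capacity values.
// Scanning keys (the hot path) touches only densely packed key data, and
// growing the map is still a single allocate-copy-free cycle.
struct Table {
    std::uint64_t* keys;
    std::uint64_t* values; // = keys + capacity, inside the same block
    std::size_t capacity;
};

inline Table table_make(std::size_t capacity) {
    auto* block = static_cast<std::uint64_t*>(
        std::calloc(capacity * 2, sizeof(std::uint64_t))); // zeroed = all empty
    return Table{block, block + capacity, capacity};
}

inline void table_free(Table& t) {
    std::free(t.keys); // one allocation, one free
    t = Table{};
}
```

The values pointer is derived, not separately allocated, so freeing and growing stay as simple as with the interleaved layout while keeping key scans cache-friendly.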
Multi-mode containers
If you want to get the best of all worlds, you can make it possible to convert your container between multiple different data representations that are optimized for various different operations, e.g.: optimized for constant-time insertion (e.g. an unsorted array with tombstones, that you just append to), cacheline-aligned buckets for constant-time querying and deletion, etc. There are many criminally underutilized techniques that are optimal for specific uses. Make use of them.
Make your code comprehensible
Too often standard library implementations of hashmaps are extremely indirect and almost completely incomprehensible. This makes auditing the code or understanding the behavior almost impossible.
For some examples, refer to the Rust stdlib, or pretty much any C++ STL implementation.
Selected implementation examples
I'm aware of a few suitable examples:
- the production-suitable Odin map implementation, available at https://github.com/odin-lang/Odin/blob/master/base/runtime/dynamic_map_internal.odin
- my own rough and by now quite old C++ implementation, available at https://github.com/sio-k/LibSio/blob/main/HashMap.hpp
- a Haskell implementation that uses separate key/value arrays, available at https://hackage.haskell.org/package/hashtables-1.4.2/docs/src/Data.HashTable.ST.Basic.html#HashTable
Further Research, or "the part where the author admits they themselves didn't thoroughly test any of this"
I'd greatly appreciate further validation of these techniques. They've already been in use for several years in the standard hashmap implementation in the Odin programming language, and have produced no significant issues there, so I figured I'd detail them further here.
I'm sure other people have good ways of ending these things, but I don't. The ADHD means I'm considerably better at starting things than finishing them.
¹ Yeah, I'm not credited, but I do recognize my own coding style in all that. It's fortunately/unfortunately blindingly obvious, and I do approve of the refinements and testing of the Odin implementation. ↩