Bannalia: trivial notes on themes diverse

Thursday, April 4, 2024

A case in API ergonomics for ordered containers

Suppose we have a std::set<int> and would like to retrieve the elements between values a and b, both inclusive. This task is served by operations std::set::lower_bound and std::set::upper_bound:

std::set<int> x=...;

// elements in [a,b]
auto first = x.lower_bound(a);
auto last  = x.upper_bound(b);
 
while(first != last) std::cout<< *first++ <<" ";

Why do we use lower_bound for the first iterator and upper_bound for the second? The well-known STL convention is that a range of elements is determined by two iterators first and last, where first points to the first element of the range and last points to the position right after the last element. This is done so that empty ranges can be handled without special provisions (first == last).

Now, with this convention in mind and considering that

lower_bound(a) returns an iterator to the first element not less than a,
upper_bound(b) returns an iterator to the first element greater than b,

we can convince ourselves that the code above is indeed correct. The situations where one or both of the interval endpoints are not inclusive can also be handled:

// elements in [a,b)
auto first = x.lower_bound(a);
auto last  = x.lower_bound(b);

// elements in (a,b]
auto first = x.upper_bound(a);
auto last  = x.upper_bound(b);

// elements in (a,b)
auto first = x.upper_bound(a);
auto last  = x.lower_bound(b);

but getting them right requires some thinking.

Boost.MultiIndex introduces the operation range to handle this type of queries:

template<typename LowerBounder,typename UpperBounder>
std::pair<iterator,iterator>
range(LowerBounder lower, UpperBounder upper);

lower and upper are user-provided predicates that determine whether an element is not to the left and not to the right of the considered interval, respectively. The formal specification of LowerBounder and UpperBounder is quite impenetrable, but using this facility, in particular in combination with Boost.Lambda2, is actually straightforward:

// equivalent to std::set<int>
boost::multi_index_container<int> x=...;

using namespace boost::lambda2;

// [a,b]
auto [first, last] = x.range(_1 >= a, _1 <= b);

// [a,b)
auto [first, last] = x.range(_1 >= a, _1 < b);

// (a,b]
auto [first, last] = x.range(_1 > a,  _1 <= b);

// (a,b)
auto [first, last] = x.range(_1 > a,  _1 < b);

The resulting code is much easier to read and to get right in the first place, and is also more efficient than two separate calls to [lower|upper]_bound (because the two internal rb-tree top-to-bottom traversals can be partially joined in the implementation of range). Just as importantly, range handles situations such as this:

int a = 5;
int b = 2; // note a > b

// elements in [a,b]
auto first = x.lower_bound(a);
auto last  = x.upper_bound(b);
 
// undefined behavior
while(first != last) std::cout<< *first++ <<" ";

When a > b, first may be strictly to the right of last, and consequently the while loop will crash or never terminate. range, on the other hand, handles the situation gracefully and returns an empty range.

We have seen an example of how API design can help reduce programming errors and increase efficiency by providing higher-level facilities that model and encapsulate scenarios otherwise served by a combination of lower-level operations. It may be interesting to have range-like operations introduced for standard associative containers.

Friday, October 20, 2023

Bulk visitation in `boost::concurrent_flat_map`

Introduction
Prior art
Bulk visitation design
Performance analysis
Conclusions and next steps

Introduction

boost::concurrent_flat_map and its boost::concurrent_flat_set counterpart are Boost.Unordered's associative containers for high-performance concurrent scenarios. These containers dispense with iterators in favor of a visitation-based interface:

boost::concurrent_flat_map<int, int> m;
...
// find the element with key k and increment its associated value
m.visit(k, [](auto& x) {
  ++x.second;
});

This design choice was made because visitation is not affected by some inherent problems afflicting iterators in multithreaded environments.

Starting in Boost 1.84, code like the following:

std::array<int, N> keys;
...
for(const auto& key: keys) {
  m.visit(key, [](auto& x) { ++x.second; });
}

can be written more succintly via the so-called bulk visitation API:

m.visit(keys.begin(), keys.end(), [](auto& x) { ++x.second; });

As it happens, bulk visitation is not provided merely for syntactic convenience: this operation is internally optimized so that it performs significantly faster than the original for-loop. We discuss here the key ideas behind bulk visitation internal design and analyze its performance.

Prior art

In their paper "DRAMHiT: A Hash Table Architected for the Speed of DRAM", Narayanan et al. explore some optimization techniques from the domain of distributed system as translated to concurrent hash tables running on modern multi-core architectures with hierarchical caches. In particular, they note that cache misses can be avoided by batching requests to the hash table, prefetching the memory positions required by those requests and then completing the operations asynchronously when enough time has passed for the data to be effectively retrieved. Our bulk visitation implementation draws inspiration from this technique, although in our case visitation is fully synchronous and in-order, and it is the responsibility of the user to batch keys before calling the bulk overload of boost::concurrent_flat_map::visit.

Bulk visitation design

As discussed in a previous article, boost::concurrent_flat_map uses an open-addressing data structure comprising:

A bucket array split into 2ⁿ groups of N = 15 slots.
A metadata array associating a 16-byte metadata word to each slot group, used for SIMD-based reduced-hash matching.
An access array with a spinlock (and some additional information) for locked access to each group.

The happy path for successful visitation looks like this:

The hash value for the looked-up key and its mapped group position are calculated.
The metadata for the group is retrieved and matched against the hash value.
If the match is positive (which is the case for the happy path), the group is locked for access and the element indicated by the matching mask is retrieved and compared with the key. Again, in the happy path this comparison is positive (the element is found); in the unhappy path, more elements (within this group or beyond) need be checked.

(Note that happy unsuccessful visitation simply terminates at step 2, so we focus our analysis on successful visitation.) As the diagram shows, the CPU has to wait for memory retrieval between steps 1 and 2 and between steps 2 and 3 (in the latter case, retrievals of mutex and element are parallelized through manual prefetching). A key insight is that, under normal circumstances, these memory accesses will almost always be cache misses: successive visitation operations, unless for the very same key, won't have any cache locality. In bulk visitation, the stages of the algorithm are pipelined as follows (the diagram shows the case of three operations in the bulk batch):

The data required at step N+1 is prefetched at the end of step N. Now, if a sufficiently large number of operations are pipelined, we can effectively eliminate cache miss stalls: all memory addresses will be already cached by the time they are used.

The operation visit(first, last, f) internally splits [first, last) into chunks of bulk_visit_size elements that are then processed as described above. This chunk size has to be sufficiently large to give time for memory to be actually cached at the point of usage. On the upper side, the chunk size is limited by the number of outstanding memory requests that the CPU can handle at a time: in Intel architectures, this is limited by the size of the line fill buffer, typically 10-12. We have empirically confirmed that bulk visitation maxes at around bulk_visit_size = 16, and stabilizes beyond that.

Performance analysis

For our study of bulk visitation performance, we have used a computer with a Skylake-based Intel Core i5-8265U CPU:

	Size/core	Latency [ns]
L1 data cache	32 KB	3.13
L2 cache	256 KB	6.88
L3 cache	6 MB	25.00
DDR4 RAM		77.25

We measure the throughput in Mops/sec of single-threaded lookup (50/50 successful/unsuccessful) for both regular and bulk visitation on a boost::concurrent_flat_map<int, int> with sizes N = 3k, 25k, 600k, and 10M: for the three first values, the container fits entirely into L1, L2 and L3, respectively. The test program has been compiled with clang-cl for Visual Studio 2022 in release mode.

As expected, the relative performance of bulk vs. regular visitation grows as data is fetched from a slower cache (or RAM in the latter case). The theoretical throughput achievable by bulk visitation has been estimated from regular visitation by subtracting memory retrieval times as calculated with the following model:

If the container fits in Ln (L4 = RAM), Ln−1 is entirely occupied by metadata and access objects (and some of this data spills over to Ln).
Mutex and element retrieval times (which only apply to successful visitation) are dominated by the latter.

Actual and theoretical figures match quite well, which sugggests that the algorithmic overhead imposed by bulk visitation is negligible.

We have also run benchmarks under conditions more similar to real-life for boost::concurrent_flat_map, with and without bulk visitation, and other concurrent containers, using different compilers and architectures. As an example, these are the results for a workload of 50M insert/lookup mixed operations distributed across several concurrent threads for different data distributions with Clang 12 on an ARM64 computer:


5M updates, 45M lookups skew=0.01	5M updates, 45M lookups skew=0.5	5M updates, 45M lookups skew=0.99

Again, bulk visitation increases performance noticeably. Please refer to the benchmark site for further information and results.

Conclusions and next steps

Bulk visitation is an addition to the interface of boost::concurrent_flat_map and boost::concurrent_flat_set that improves lookup performance by pipelining the internal visitation operations for chunked groups of keys. The tradeoff for this increased throughput is higher latency, as keys need to be batched by the user code before issuing the visit operation.

The insights we have gained with bulk visitation for concurrent containers can be leveraged for future Boost.Unordered features:

In principle, insertion can also be made to operate in bulk mode, although the resulting pipelined algorithm is likely more complex than in the visitation case, and thus performance increases are expected to be lower.
Bulk visitation (and insertion) is directly applicable to non-concurrent containers such as boost::unordered_flat_map: the main problem for this is one of interface design because we are not using visitation here as the default lookup API (classical iterator-based lookup is provided instead). Some possible options are:
1. Use visitation as in the concurrent case.
2. Use an iterator-based lookup API that outputs the resulting iterators to some user-provided buffer (probably modelled as an output "meta" iterator taking container iterators).

Bulk visitation will be officially shipping in Boost 1.84 (December 2023) but is already available by checking out the Boost.Unordered repo. If you are interested in this feature, please try it and report your local results and suggestions for improvement. Your feedback on our current and future work is much welcome.

Friday, August 18, 2023

User-defined class qualifiers in C++23

It is generally known that type qualifiers (such as const and volatile in C++) can be regarded as a form of subtyping: for instance, const T is a supertype of T because the interface (available operations) of T are strictly wider than those of const T. Foster et al. call a qualifier q positive if q T is a supertype of T, and negative it if is the other way around. Without real loss of generality, in what follows we only consider negative qualifiers, where q T is a subtype of (extends the interface of) T.

C++23 explicit object parameters (coloquially known as "deducing this") allow for a particularly concise and effective realization of user-defined qualifiers for class types beyond what the language provides natively. For instance, this is a syntactically complete implementation of qualifier mut, the dual/inverse of const (not to be confused with mutable):

template<typename T>
struct mut: T
{
  using T::T;
};

template<typename T>
T& as_const(T& x) { return x;}

template<typename T>
T& as_const(mut<T>& x) { return x;}

struct X
{
  void foo() {}
  void bar(this mut<X>&) {}
};

int main()
{
  mut<X> x;
  x.foo();
  x.bar();

  auto& y = as_const(x);
  y.foo();
  y.bar(); // error: cannot convert argument 1 from 'X' to 'mut<X> &'

  X& z = x;
  z.foo();
  z.bar(); // error: cannot convert argument 1 from 'X' to 'mut<X> &'
}

The class X has a regular (generally accessible) member function foo and then bar, which is only accessible to instances of the form mut<X>. Access checking and implicit and explicit conversion between subtype mut<X> and mut<X> work as expected.

With some help fom Boost.Mp11, the idiom can be generalized to the case of several qualifiers:

#include <boost/mp11/algorithm.hpp>
#include <boost/mp11/list.hpp>
#include <type_traits>

template<typename T,typename... Qualifiers>
struct access: T
{
  using qualifier_list=boost::mp11::mp_list<Qualifiers...>;

  using T::T;
};

template<typename T, typename... Qualifiers>
concept qualified =
  (boost::mp11::mp_contains<
    typename std::remove_cvref_t<T>::qualifier_list,
    Qualifiers>::value && ...);

// some qualifiers
struct mut;
struct synchronized;

template<typename T>
concept is_mut =  qualified<T, mut>;

template<typename T>
concept is_synchronized = qualified<T, synchronized>;

struct X
{
  void foo() {}

  template<is_mut Self>
  void bar(this Self&&) {} 

  template<is_synchronized Self>
  void baz(this Self&&) {}

  template<typename Self>
  void qux(this Self&&)
  requires qualified<Self, mut, synchronized>
  {}
};

int main()
{
  access<X, mut> x;

  x.foo();
  x.bar();
  x.baz(); // error: associated constraints are not satisfied
  x.qux(); // error: associated constraints are not satisfied

  X y;
  x.foo();
  y.bar(); // error: associated constraints are not satisfied

  access<X, mut, synchronized> z;
  z.bar();
  z.baz();
  z.qux();
}

One difficulty remains, though:

int main()
{
  access<X, mut, synchronized> z;
  //...
  access<X, mut>& w=z; // error: cannot convert from
                       // 'access<X,mut,synchronized>'
                       // to 'access<X,mut> &'
}

access<T,Qualifiers...>& converts to T&, but not to access<T,Qualifiers2...>& where Qualifiers2 is a subset of Qualifiers (for the mathematically inclined, qualifiers q₁, ... , q_N over a type T induce a lattice of subtypes Q T, Q ⊆ {q₁, ... , q_N}, ordered by qualifier inclusion). Incurring undefined behavior, we could do the following:

template<typename T,typename... Qualifiers>
struct access: T
{
  using qualifier_list=boost::mp11::mp_list<Qualifiers...>;

  using T::T;

  template<typename... Qualifiers2>
  operator access<T, Qualifiers2...>&()
  requires qualified<access, Qualifiers2...>
  {
    return reinterpret_cast<access<T, Qualifiers2...>&>(*this);
  }
};

A more interesting challenge is the following: As laid out, this technique implements syntactic qualifier subtyping, but does not do anything towards enforcing the semantics associated to each qualifier: for instance, synchronized should lock a mutex automatically, and a qualifier associated to some particular invariant should assert it after each invocation to a qualifier-constraied member function. I don't know if this functionality can be more or less easily integrated into the presented framework: feedback on the matter is much welcome.

Friday, July 7, 2023

Lookup
Insertion

Visitation API
Benchmarks
Conclusions and next steps

Introduction

Starting in Boost 1.83, Boost.Unordered provides boost::concurrent_flat_map, an associative container suitable for high-load parallel scenarios. boost::concurrent_flat_map leverages much of the work done for boost::unordered_flat_map, but also introduces innovations, particularly in the areas of low-contention operation and API design, that we find worth discussing.

State of the art

The space of C++ concurrent hashmaps spans a diversity of competing techniques, from traditional ones such as lock-based structures or sharding, to very specialized approaches relying on CAS instructions, hazard pointers, Read-Copy-Update (RCU), etc. We list some prominent examples:

tbb::concurrent_hash_map uses closed addressing combined with bucket-level read-write locking. The bucket array is split in a number of segments to allow for incremental rehashing without locking the entire table. Concurrent insertion, lookup and erasure are supported, but iterators are not thread safe. Locked access to elements is done via so-called accessors.
tbb::concurrent_unordered_map also uses closed addressing, but buckets are organized into lock-free split-ordered lists. Concurrent insertion, lookup, and traversal are supported, whereas erasure is not thread safe. Element access via iterators is not protected against data races.
Sharding consists in dividing the hashmap into a fixed number N of submaps indexed by hash (typically, the element x goes into the submap with index hash(x) mod N). Sharding is extremely easy to implement starting from a non-concurrent hashmap and provides incremental rehashing, but the degree of concurrency is limited by N. As an example, gtl::parallel_flat_hash_map uses sharding with submaps essentially derived from absl::flat_hash_map, and inherits the excellent performance of this base container.
libcuckoo::cuckoohash_map adds efficient thread safety to classical cuckoo hashing by means of a number of carefully engineered techniques including fine-grained locking of slot groups or "strips" (of size 4 by default), optimistic insertion and data prefetching.
Meta's folly::ConcurrentHashMap combines closed addressing, sharding and hazard pointers to elements to achieve lock-free lookup (modifying operations such as insertion and erasure lock the affected shard). Iterators, which internally hold a hazard pointer to the element, can be validly dereferenced even after the element has been erased from the map; access, on the other hand, is constant and elements are basically treated as immutable.
folly::AtomicHashMap is a very specialized hashmap that imposes severe usage restrictions in exchange for very high time and space performance. Keys must be trivially copyable and 32 or 64 bits in size so that they can be handled internally by means of atomic instructions; also, some key values must be reserved to mark empty slots, tombstones and locked elements, so that no extra memory is required for bookkeeping information and locks. The internal data structure is based on open addressing with linear probing. Non-modifying operations are lock-free. Rehashing is not provided: instead, extra bucket arrays are appended when the map becomes full, the expectation being that the user provide the estimated final size at construction time to avoid this rather inefficient growth mechanism. Element access is not protected against data races.
On a more experimental/academic note, we can mention initiatives such as Junction, Folklore and DRAMHiT. In general, these do not provide industry-grade container implementations but explore interesting ideas that could eventually be adopted by mainstream libraries, such as RCU-based data structures, lock-free algorithms relying on CAS and/or transactional memory, parallel rehashing and operation batching.

Design principles

Unlike non-concurrent C++ containers, where the STL acts as a sort of reference interface, concurrent hashmaps in the market differ wildly in terms of requirements, API and provided functionality. When designing boost::concurrent_flat_map, we have aimed for a general-purpose container

with no special restrictions on key and mapped types,
providing full thread safety without external synchronization mechanisms,
and disrupting as little as possible the conceptual and operational model of "traditional" containers.

These principles rule out some scenarios such as requiring that keys be of an integral type or putting an extra burden on the user in terms of access synchronization or active garbage collection. They also inform concrete design decisions:

boost::concurrent_flat_map<Key, T, Hash, Pred, Allocator> must be a valid instantiation in all practical cases where boost::unordered_flat_map<Key, T, Hash, Pred, Allocator> is.
Thread-safe value semantics are provided (including copy construction, assignment, swap, etc.)
All member functions in boost::unordered_flat_map are provided by boost::concurrent_flat_map except if there's a fundamental reason why they can't work safely or efficiently in a concurrent setting.

The last guideline has the most impact on API design. In particular, we have decided not to provide iterators, either blocking or not: if not-blocking, they're unsafe, and if blocking they increase contention when not properly used, and can very easily lead to deadlocks:

// Thread 1
map_type::iterator it1=map.find(x1), it2=map.find(x2);

// Thread 2
map_type::iterator it2=map.find(x2), it1=map.find(x1);

In place of iterators, boost::concurrent_flat_map offers an access API based on internal visitation, as described in a later section.

Data structure

boost::concurrent_flat_map uses the same open-addressing layout as boost::unordered_flat_map, where the bucket array is split into 2ⁿ groups of N = 15 slots and each group has an associated 16-byte metadata word for SIMD-based reduced-hash matching and insertion overflow control.

On top of this layout, two synchronization levels are added:

Container level: A read-write mutex is used to control access from any operation to the container. This access is always requested in read mode (i.e. shared) except for operations that require that the whole bucket array be replaced, like rehashing, swapping, assignment, etc. This means that, in practice, this level of synchronization does not cause any contention at all, even for modifying operations like insertion and erasure. To reduce cache coherence traffic, the mutex is implemented as an array of read-write spinlocks occupying separate cache lines, and each thread is assigned one spinlock in a round-robin fashion at thread_local construction time: read/shared access does only involve the assigned spinlock, whereas write/exclusive access, which is comparatively much rarer, requires that all spinlocks be locked.
Group level: Each group has a dedicated read-write spinlock to control access to its slots, plus an atomic insertion counter used for transactional optimistic insertion as described below.

Algorithms

The core algorithms of boost::concurrent_flat_map are variations of those of boost::unordered_flat_map with minimal changes to prevent data races while keeping group-level contention to a minimum.

In the following diagrams, white boxes represent lock-free steps, while gray boxes are executed within the scope of a group lock. Metadata is handled atomically both in locked and lock-free scenarios.

Lookup

Most steps of the lookup algorithm (hash calculation, probing, element pre-checking via SIMD matching with the value's reduced hash) are lock-free and do not synchronize with any operation on the metadata. When SIMD matching detects a potential candidate, double-checking for slot occupancy and the actual comparison with the element are done within the group lock; note that the occupancy double check is necessary precisely because SIMD matching is lock-free and the status of the identified slot may have changed before group locking.

Insertion

The main challenge of any concurrent insertion algorithm is to prevent an element x from being inserted twice by different threads running at the same time. As open-addressing probing starts at a position p₀ univocally determined by the hash value of x, a naïve (and flawed) approach is to lock p₀ for the entire duration of the insertion procedure: this leads to deadlocking if the probing sequences of two different elements intersect.

We have implemented the following transactional optimistic insertion algorithm: At the beginning of insertion, the value of the insertion counter for the group at position p₀ is saved locally and insertion proceeds normally, first checking that an element equivalent to x does not exist and then looking for available slots starting at p₀ and locking only one group of the probing sequence at a time; when an available slot is found, the associated metadata is updated, the insertion counter at p₀ is incremented, and:

If no other thread got in the way (i.e. if the pre-increment value of the counter coincides with the local value stored at the beginning), then the transaction is successful and insertion can be finished by storing the element into the slot before releasing the group lock.
Otherwise, metadata changes are rolled back and the entire insertion process is started over.

Our measurements indicate that, even under adversarial situations, the ratio of start-overs to successful insertions ranges in the parts per million.

Visitation API

From an operational point of view, container iterators serve two main purposes: combining lookup/insertion with further access to the relevant element:

auto it = m.find(k);
if (it != m.end()) {
  it->second = 0;
}

and container traversal:

// iterators used internally by range-for
for(auto& x: m) {
  x.second = 0;
}

Having decided that boost::concurrent_flat_map not rely on iterators due to their inherent concurrency problems, a design alternative is to move element access into the container operations themselves, where it can be done in a thread-safe manner. This is just a form of the familiar visitation pattern:

m.visit(k, [](auto& x) {
  x.second = 0;
});

m.visit_all([](auto& x) {
  x.second = 0;
});

boost::concurrent_flat_map provides visitation-enabled variations of classical map operations wherever it makes sense:

visit, cvisit (in place of find)
visit_all, cvisit_all (as a substitute of container traversal)
emplace_or_visit, emplace_or_cvisit
insert_or_visit, insert_or_cvisit
try_emplace_or_visit, try_emplace_or_cvisit

cvisit stands for constant visitation, that is, the visitation function is granted read-only access to the element, which has less contention than write access.

Traversal functions [c]visit_all and erase_if have also parallel versions:

m.visit_all(std::execution::par, [](auto& x) {
  x.second = 0;
});

Benchmarks

We've tested boost::concurrent_flat_map against tbb::concurrent_hash_map and gtl::parallel_flat_hash_map for the following synthetic scenario: T threads concurrently perform N operations update, successful lookup and unsuccessful lookup, randomly chosen with probabilities 10%, 45% and 45%, respectively, on a concurrent map of (int, int) pairs. The keys used by all operations are also random, where update and successful lookup follow a Zipf distribution over [1, N/10] with skew exponent s, and unsuccessful lookup follows a Zip distribution with the same skew s over an interval not overlapping with the former.

We provide the full benchmark code and results for different 64- and 32-bit architectures in a dedicated repository; here, we just show as an example the plots for Visual Studio 2022 in x64 mode on an AMD Ryzen 5 3600 6-Core @ 3.60 GHz without hyperthreading and 64 GB of RAM.


500k updates, 4.5M lookups skew=0.01	500k updates, 4.5M lookups skew=0.5	500k updates, 4.5M lookups skew=0.99


5M updates, 45M lookups skew=0.01	5M updates, 45M lookups skew=0.5	5M updates, 45M lookups skew=0.99

Note that, for the scenario with 500k updates, boost::concurrent_flat_map continues to improve after the number of threads exceed the number of cores (6), a phenomenon for which we don't have a readily explanation —we could hypothesize that execution is limited by memory latency, but the behavior does not reproduce in the scenario with 5M updates, where the cache miss ratio is necessarily higher. Note also that gtl::parallel_flat_hash_map performs comparatively worse for high-skew scenarios where the load is concentrated on a very small number of keys: this may be due to gtl::parallel_flat_hash_map having a much coarser lock granularity (256 shards in the configuration used) than the other two containers.

In general, results are very dependent on the particular CPU and memory system used; you are welcome to try out the benchmark in your architecture of interest and report back.

Conclusions and next steps

boost::concurrent_flat_map is a new, general-purpose concurrent hashmap that leverages the very performant open-addressing techniques of boost::unordered_flat_map and provides a fully thread-safe, iterator-free API we hope future users will find flexible and convenient.

We are considering a number of new functionalities for upcoming releases:

As boost::concurrent_flat_map and boost::unordered_flat_map basically share the same data layout, it's possible to efficiently implement move construction from one to another by simply transferring the internal structure. There are scenarios where this feature can lead to more performant execution, like, for instance, multithreaded population of a boost::concurrent_flat_map followed by single- or multithreaded read-only lookup on a boost::unordered_flat_map move-constructed from the former.
DRAMHiT shows that pipelining/batching several map operations on the same thread in combination with heavy memory prefetching can reduce or eliminate waiting CPU cycles. We have conducted some preliminary experiments using this idea for a feature we dubbed bulk lookup (providing an array of keys to look for at once), with promising results.

We're launching this new container with trepidation: we cannot possibly try the vast array of different CPU architectures and scenarios where concurrent hashmaps are used, and we don't have yet field data on the suitability of the novel API we're proposing for boost::concurrent_flat_map. For these reasons, your feedback and proposals for improvement are most welcome.

Friday, November 18, 2022

Inside `boost::unordered_flat_map`

Introduction
The case for open addressing
- SIMD-accelerated lookup
boost::unordered_flat_map data structure
Rehashing
Hash post-mixing
Statistical properties of boost::unordered_flat_map
Benchmarks
- Running-n plots
- Aggregate performance
Deviations from the standard
Conclusions and next steps

Introduction

Starting in Boost 1.81 (December 2022), Boost.Unordered provides, in addition to its previous implementations of C++ unordered associative containers, the new containers boost::unordered_flat_map and boost::unordered_flat_set (for the sake of brevity, we will only refer to the former in the remaining of this article). If boost::unordered_map strictly adheres to the C++ specification for std::unordered_map, boost::unordered_flat_map deviates in a number of ways from the standard to offer dramatic performance improvements in exchange; in fact, boost::unordered_flat_map ranks amongst the fastest hash containers currently available to C++ users.

We describe the internal structure of boost::unordered_flat_map and provide theoretical analyses and benchmarking data to help readers gain insights into the key design elements behind this container's excellent performance. Interface and behavioral differences with the standard are also discussed.

The case for open addressing

We have previously discussed why closed addressing was chosen back in 2003 as the implicit layout for std::unordered_map. 20 years after, open addressing techniques have taken the lead in terms of performance, and the fastest hash containers in the market all rely on some variation of open addressing, even if that means that some deviations have to be introduced from the baseline interface of std::unordered_map.

The defining aspect of open addressing is that elements are stored directly within the bucket array (as opposed to closed addressing, where multiple elements can be held into the same bucket, usually by means of a linked list of nodes). In modern CPU architectures, this layout is extremely cache friendly:

There's no indirection needed to go from the bucket position to the element contained.
Buckets are stored contiguously in memory, which improves cache locality.

The main technical challenge introduced by open addressing is what to do when elements are mapped into the same bucket, i.e. when a collision happens: in fact, all open-addressing variations are basically characterized by their collision management techniques. We can divide these techniques into two broad classes:

Non-relocating: if an element is mapped to an occupied bucket, a probing sequence is started from that position until a vacant bucket is located, and the element is inserted there permanently (except, of course, if the element is deleted or if the bucket array is grown and elements rehashed). Popular probing mechanisms are linear probing (buckets inspected at regular intervals), quadratic probing and double hashing. There is a tradeoff between cache locality, which is better when the buckets probed are close to each other, and average probe length (the expected number of buckets probed until a vacant one is located), which grows larger (worse) precisely when probed buckets are close —elements tend to form clusters instead of spreading uniformly throughout the bucket array.
Relocating: as part of the search process for a vacant bucket, elements can be moved from their position to make room for the new element. This is done in order to improve cache locality by keeping elements close to their "natural" location (that indicated by the hash → bucket mapping). Well known relocating algorithms are cuckoo hashing, hopscotch hashing and Robin Hood hashing.

If we take it as an important consideration to stay reasonably close to the original behavior of std::unordered_map, relocating techniques pose the problem that insert may invalidate iterators to other elements (so, they work more like std::vector::insert).

On the other hand, non-relocating open addressing faces issues on deletion: lookup starts at the original hash → bucket position and then keeps probing till the element is found or probing terminates, which is signalled by the presence of a vacant bucket:

So, erasing an element can't just restore its holding bucket as vacant, since that would preclude lookup from reaching elements further down the probe sequence:

A common techique to deal with this problem is to label buckets previously containing an element with a tombstone marker: tombstones are good for inserting new elements but do not stop probing on lookup:

Note that the introduction of tombstones implies that the average lookup probe length of the container won't decrease on deletion —again, special measures can be taken to counter this.

SIMD-accelerated lookup

SIMD technologies, such as SSE2 and Neon, provide advanced CPU instructions for parallel arithmetic and logical operations on groups of contiguous data values: for instance, SSE2 _mm_cmpeq_epi8 takes two packs of 16 bytes and compares them for equality pointwise, returning the result as another pack of bytes. Although SIMD was originally meant for acceleration of multimedia processing applications, the implementors of some unordered containers, notably Google's Abseil's Swiss tables and Meta's F14, realized they could leverage this technology to improve lookup times in hash tables.

The key idea is to maintain, in addition to the bucket array itself, a separate metadata array holding reduced hash values (usually one byte in size) obtained from the hash values of the elements stored in the corresponding buckets. When looking up for an element, SIMD can be used on a pack of contiguous reduced hash values to quickly discard non-matching buckets and move on to full comparison for matching positions. This technique effectively checks a moderate number of buckets (16 for Abseil, 14 for F14) in constant time. Another beneficial effect of this approach is that special bucket markers (vacant, tombstone, etc.) can be moved to the metadata array —otherwise, these markers would take up extra space in the bucket itself, or else some representation values of the elements would have to be restricted from user code and reserved for marking purposes.

boost::unordered_flat_map data structure

boost::unordered_flat_map's bucket array is logically split into 2ⁿ groups of N = 15 buckets, and has a companion metadata array consisting of 2ⁿ 16-byte words. Hash mapping is done at the group level rather than on individual buckets: so, to insert an element with hash value h, the group at position h / 2^{W − n} is selected and its first available bucket used (W is 64 or 32 depending on whether the CPU architecture is 64- or 32-bit, respectively); if the group is full, further groups are checked using a quadratic probing sequence.

The associated metadata is organized as follows (least significant byte depicted rightmost):

h_i holds information about the i-th bucket of the group:

0 if the bucket is empty,
1 to signal a sentinel (a special value at the end of the bucket array used to finish container iteration).
otherwise, a reduced hash value in the range [2, 255] obtained from the least significant byte of the element's hash value.

When looking up within a group for an element with hash value h, SIMD operations, if available, are used to match the reduced value of h against the pack of values {h₀, h₁, ... , h₁₄}. Locating an empty bucket for insertion is equivalent to matching for 0.

ofw is the so-called overflow byte: when inserting an element with hash value h, if the group is full then the (h mod 8)-th bit of ofw is set to 1 before moving to the next group in the probing sequence. Lookup probing can then terminate when the corresponding overflow bit is 0. Note that this procedure removes the need to use tombstones.

If neither SSE2 nor Neon is available on the target architecture, the logical organization of metadata stays the same, but information is mapped to two physical 64-bit words using bit interleaving as shown in the figure:

Bit interleaving allows for a reasonably fast implementation of matching operations in the absence of SIMD.

Rehashing

The maximum load factor of boost::unordered_flat_map is 0.875 and can't be changed by the user. As discussed previously, non-relocating open addressing has the problem that average probe length doesn't decrease on deletion when the erased elements are in mid-sequence: so, continously inserting and erasing elements without triggering a rehash will slowly degrade the container's performance; we call this phenomenon drifting. boost::unordered_flat_map introduces the following anti-drift mechanism: rehashing is controled by the container's maximum load, initially 0.875 times the size of the bucket array; when erasing an element whose associated overflow bit is not zero, the maximum load is decreased by one. Anti-drift guarantees that rehashing will be eventually triggered in a scenario of repeated insertions and deletions.

Hash post-mixing

It is well known that open-addressing containers require that the hash function be of good quality, in the sense that close input values (for some natural notion of closeness) are mapped to distant hash values. In particular, a hash function is said to have the avalanching property if flipping a bit in the physical representation of the input changes all bits of the output value with probability 50%. Note that avalanching hash functions are extremely well behaved, and less stringent behaviors are generally good enough in most open-addressing scenarios.

Being a general-purpose container, boost::unordered_flat_map does not impose any condition on the user-provided hash function beyond what is required by the C++ standard for unordered associative containers. In order to cope with poor-quality hash functions (such as the identity for integral types), an automatic bit-mixing stage is added to hash values:

64-bit architectures: we use the xmx function defined in Jon Maiga's "The construct of a bit mixer".
32-bit architectures: the chosen mixer has been automatically generated by Hash Function Prospector and selected as the best overall performer in internal benchmarks. Score assigned by Hash Prospector: 333.7934929677524.

There's an opt-out mechanism available to end users so that avalanching hash functions can be marked as such and thus be used without post-mixing. In particular, the specializations of boost::hash for string types are marked as avalanching.

Statistical properties of boost::unordered_flat_map

We have written a simulation program to calculate some statistical properties of boost::unordered_flat_map as compared with Abseil's absl::flat_hash_map, which is generally regarded as one of the fastest hash containers available. For the purposes of this analysis, the main design characteristics of absl::flat_hash_map are:

Bucket array sizes are of the form 2ⁿ, n ≥ 4.
Hash mapping is done at the bucket level (rather than at the group level as in boost::unordered_flat_map).
Metadata consists of one byte per bucket, where the most significant bit is set to 1 if the bucket is empty, deleted (tombstone) or a sentinel. The remaining 7 bits hold the reduced hash value for occupied buckets.
Lookup/insertion uses SIMD to inspect the 16 contiguous buckets beginning at the hash-mapped position, and then continues with further 16-bucket groups using quadratic probing. Probing ends when a non-full group is found. Note that the start positions of these groups are not aligned modulo 16.

The figure shows:

the probability that a randomly selected group is full,
the average number of hops (i.e. the average probe length minus one) for successful and unsuccessful lookup

as functions of the load factor, with perfectly random input and without intervening deletions. Solid line is boost::unordered_flat_map, dashed line is absl::flat_hash_map.

Some observations:

Pr(group is full) is higher for boost::unordered_flat_map. This follows from the fact that free buckets cluster at the end of 15-aligned groups, whereas for absl::flat_hash_map free buckets are uniformly distributed across the array, which increases the probability that a contiguous 16-bucket chunk contains at least one free position. Consequently, E(num hops) for successful lookup is also higher in boost::unordered_flat_map.
By contrast, E(num hops) for unsuccessful lookup is considerably lower in boost::unordered_flat_map: absl::flat_hash_map uses an all-or-nothing condition for probe termination (group is non-full/full), whereas boost::unordered_flat_map uses the 8 bits of information in the overflow byte to allow for more finely-grained termination —effectively, making probe termination ~1.75 times more likely. The overflow byte acts as a sort of Bloom filter to check for probe termination based on reduced hash value.

The next figure shows the average number of actual comparisons (i.e. when the reduced hash value matched) for successful and unsuccessful lookup. Again, solid line is boost::unordered_flat_map and dashed line is absl::flat_hash_map.

E(num cmps) is a function of:

E(num hops) (lower better),
the size of the group (lower better),
the number of bits of the reduced hash value (higher better).

We see then that boost::unordered_flat_map approaches absl::flat_hash_map on E(num cmps) for successful lookup (1% higher or less), despite its poorer E(num hops) figures: this is so because boost::unordered_flat_map uses smaller groups (15 vs. 16) and, most importantly, because its reduced hash values contain log₂(254) = 7.99 bits vs. 7 bits in absl::flat_hash_map, and each additional bit in the hash reduced value decreases the number of negative comparisons roughly by half. In the case of E(num cmps) for unsuccessful lookup, boost::unordered_flat_map figures are up to 3.2 times lower under high-load conditions.

Benchmarks

Running-n plots

We have measured the execution times of boost::unordered_flat_map against absl::flat_hash_map and boost::unordered_map for basic operations (insertion, erasure during iteration, successful lookup, unsuccessful lookup) with container size n ranging from 10,000 to 10M. We provide the full benchmark code and results for different 64- and 32-bit architectures in a dedicated repository; here, we just show the plots for GCC 11 in x64 mode on an AMD EPYC Rome 7302P @ 3.0GHz. Please note that each container uses its own default hash function, so a direct comparison of execution times may be slightly biased.


Running insertion	Running erasure


Successful lookup	Unsuccessful lookup

As predicted by our statistical analysis, boost::unordered_flat_map is considerably faster than absl::flat_hash_map for unsuccessful lookup because the average probe length and number of (negative) comparisons are much lower; this effect translates also to insertion, since insert needs to first check that the element is not present, so it internally performs an unsuccessful lookup. Note how performance is less impacted (stays flatter) when the load factor increases.

As for successful lookup, boost::unordered_flat_map is still faster, which may be due to its better cache locality, particularly for low load factors: in this situation, elements are clustered at the beginning portion of each group, while for absl::flat_hash_map they are uniformly distributed with more empty space in between.

boost::unordered_flat_map is slower than absl::flat_hash_map for runnning erasure (erasure of some elements during container traversal). The actual culprit here is iteration, which is particularly slow; this is a collateral effect of having SIMD operations work only on 16-aligned metadata words, while absl::flat_hash_map iteration looks ahead 16 metadata bytes beyond the current iterator position.

Aggregate performance

Boost.Unordered provides a series of benchmarks emulating real-life scenarios combining several operations for a number of hash containers and key types (std::string, std::string_view, std::uint32_t, std::uint64_t and a UUID class of size 16). The interested reader can build and run the benchmarks on her environment of choice; as an example, these are the results for GCC 11 in x64 mode on an Intel Xeon E5-2683 @ 2.10GHz:

std::string
               std::unordered_map: 38021 ms, 175723032 bytes in 3999509 allocations
             boost::unordered_map: 30785 ms, 149465712 bytes in 3999510 allocations
        boost::unordered_flat_map: 14486 ms, 134217728 bytes in 1 allocations
                  multi_index_map: 30162 ms, 178316048 bytes in 3999510 allocations
              absl::node_hash_map: 15403 ms, 139489608 bytes in 3999509 allocations
              absl::flat_hash_map: 13018 ms, 142606336 bytes in 1 allocations
       std::unordered_map, FNV-1a: 43893 ms, 175723032 bytes in 3999509 allocations
     boost::unordered_map, FNV-1a: 33730 ms, 149465712 bytes in 3999510 allocations
boost::unordered_flat_map, FNV-1a: 15541 ms, 134217728 bytes in 1 allocations
          multi_index_map, FNV-1a: 33915 ms, 178316048 bytes in 3999510 allocations
      absl::node_hash_map, FNV-1a: 20701 ms, 139489608 bytes in 3999509 allocations
      absl::flat_hash_map, FNV-1a: 18234 ms, 142606336 bytes in 1 allocations
std::string_view
               std::unordered_map: 38481 ms, 207719096 bytes in 3999509 allocations
             boost::unordered_map: 26066 ms, 181461776 bytes in 3999510 allocations
        boost::unordered_flat_map: 14923 ms, 197132280 bytes in 1 allocations
                  multi_index_map: 27582 ms, 210312120 bytes in 3999510 allocations
              absl::node_hash_map: 14670 ms, 171485672 bytes in 3999509 allocations
              absl::flat_hash_map: 12966 ms, 209715192 bytes in 1 allocations
       std::unordered_map, FNV-1a: 45070 ms, 207719096 bytes in 3999509 allocations
     boost::unordered_map, FNV-1a: 29148 ms, 181461776 bytes in 3999510 allocations
boost::unordered_flat_map, FNV-1a: 15397 ms, 197132280 bytes in 1 allocations
          multi_index_map, FNV-1a: 30371 ms, 210312120 bytes in 3999510 allocations
      absl::node_hash_map, FNV-1a: 19251 ms, 171485672 bytes in 3999509 allocations
      absl::flat_hash_map, FNV-1a: 17622 ms, 209715192 bytes in 1 allocations
std::uint32_t
       std::unordered_map: 21297 ms, 192888392 bytes in 5996681 allocations
     boost::unordered_map:  9423 ms, 149424400 bytes in 5996682 allocations
boost::unordered_flat_map:  4974 ms,  71303176 bytes in 1 allocations
          multi_index_map: 10543 ms, 194252104 bytes in 5996682 allocations
      absl::node_hash_map: 10653 ms, 123470920 bytes in 5996681 allocations
      absl::flat_hash_map:  6400 ms,  75497480 bytes in 1 allocations
std::uint64_t
       std::unordered_map: 21463 ms, 240941512 bytes in 6000001 allocations
     boost::unordered_map: 10320 ms, 197477520 bytes in 6000002 allocations
boost::unordered_flat_map:  5447 ms, 134217728 bytes in 1 allocations
          multi_index_map: 13267 ms, 242331792 bytes in 6000002 allocations
      absl::node_hash_map: 10260 ms, 171497480 bytes in 6000001 allocations
      absl::flat_hash_map:  6530 ms, 142606336 bytes in 1 allocations
uuid
       std::unordered_map: 37338 ms, 288941512 bytes in 6000001 allocations
     boost::unordered_map: 24638 ms, 245477520 bytes in 6000002 allocations
boost::unordered_flat_map:  9223 ms, 197132280 bytes in 1 allocations
          multi_index_map: 25062 ms, 290331800 bytes in 6000002 allocations
      absl::node_hash_map: 14005 ms, 219497480 bytes in 6000001 allocations
      absl::flat_hash_map: 10559 ms, 209715192 bytes in 1 allocations

Each container uses its own default hash function, except the entries labeled FNV-1a in std::string and std::string_view, which use the same implementation of Fowler–Noll–Vo hash, version 1a, and uuid, where all containers use the same user-provided function based on boost::hash_combine.

Deviations from the standard

The adoption of open addressing imposes a number of deviations from the C++ standard for unordered associative containers. Users should keep them in mind when migrating to boost::unordered_flat_map from boost::unordered_map (or from any other implementation of std::unordered_map):

Both Key and T in boost::unordered_flat_map<Key,T> must be MoveConstructible. This is due to the fact that elements are stored directly into the bucket array and have to be transferred to a new block of memory on rehashing; by contrast, boost::unordered_map is a node-based container and elements are never moved once constructed.
For the same reason, pointers and references to elements become invalid after rehashing (boost::unordered_map only invalidates iterators).
begin() is not constant-time (the bucket array is traversed till the first non-empty bucket is found).
erase(iterator) returns void rather than an iterator to the element after the erased one. This is done to maximize performance, as locating the next element requires traversing the bucket array; if that element is absolutely required, the erase(iterator++) idiom can be used. This performance issue is not exclusive to open addressing, and has been discussed in the context of the C++ standard too.
The maximum load factor can't be changed by the user (max_load_factor(z) is provided for backwards compatibility reasons, but does nothing). Rehashing can occur before the load reaches max_load_factor() * bucket_count() due to the anti-drift mechanism described previously.
There is no bucket API (bucket_size, begin(n), etc.) save bucket_count.
There are no node handling facilities (extract, etc.) Such functionality makes no sense here as open-addressing containers are precisely not node-based. merge is provided, but the implementation relies on element movement rather than node transferring.

Conclusions and next steps

boost::unordered_flat_map and boost::unordered_flat_set are the new open-addressing containers in Boost.Unordered providing top speed in exchange for some interface and behavioral deviations from the standards-compliant boost::unordered_map and boost::unordered_set. We have analyzed their internal data structure and provided some theoretical and practical evidence for their excellent performance. As of this writing, we claim boost::unordered_flat_map/boost::unordered_flat_set to rank among the fastest hash containers available to C++ programmers.

With this work, we have reached an important milestone in the ongoing Development Plan for Boost.Unordered. After Boost 1.81, we will continue improving the functionality and performance of existing containers and will possibly augment the available container catalog to offer greater freedom of choice to Boost users. Your feedback on our current and future work is much welcome.

Subscribe to: Posts ( Atom )