Saturday, December 28, 2013

Objects, slots, functors, duality

C++ smart pointers rely on the ability to overload operator-> (and, less importantly, operator->*), whose invocation precedes any access to the pointee. Unfortunately, there is no such thing as overloading operator. (even though the possibility to do so has been considered by Bjarne Stroustrup and others), which would allow for the creation of "smart reference" classes. Lacking this ability, a wrapper class has to go through the painful and non-generic process of replicating the interface of the wrapped object. We present here an approach to class interface design, let's call it slot-based interfaces, that allows for easy wrapping without losing much expressivity with respect to traditional C++ interfaces. The techniques described are exercised by a program available for download (a fairly C++11-compliant compiler such as GCC 4.8 is needed).
Say we have a container class like the following:
template <class T, class Allocator = std::allocator<T>>
class vector {
  ...
  iterator       begin();
  const_iterator begin()const;
  iterator       end();
  const_iterator end()const;
  ...
  template <class... Args>
  void emplace_back(Args&&... args);
  void push_back(const T& x);
  ...
};
The interface of this vector can be changed to one based in slots by replacing each member function with an equivalent overload of operator():
template <class T, class Allocator = std::allocator<T>>
class vector {
  ...
  iterator       operator()(begin_s);
  const_iterator operator()(begin_s)const;
  iterator       operator()(end_s);
  const_iterator operator()(end_s)const;
  ...
  template <class... Args>
  void operator()(emplace_back_s,Args&&... args);
  void operator()(push_back_s,const T& x);
  ...
};
where begin_s, end_s, emplace_back_s, etc. are arbitrary, different types called slots, of each of which there is a corresponding global instance with name begin, end, emplace_back, etc. Traditional code such as
vector<int> c;
for(int i=0;i<10;++i){
  c.push_back(2*i);
  c.emplace_back(3*i);
}
for(auto it=c.begin(),it_end=c.end();it!=it_end;++it){
  std::cout<<*it<<" ";
}
looks like the following when using a slot-based interface:
vector<int> c;
for(int i=0;i<10;++i){
  c(push_back,2*i);
  c(emplace_back,3*i);
}
for(auto it=c(begin),it_end=c(end);it!=it_end;++it){
  std::cout<<*it<<" ";
}
Legibility is still excelent, if a little odd at the beginning. The big advantage of a slot-based interface is that all of it is mediated through invocations of (overloads of) operator(), so wrapping, with the aid of C++11 perfect forwarding, is an extremely simple business:
template <typename T>
class logger
{
  T t;
public:
  template<typename ...Args>
  logger(Args&&... args):t(std::forward<Args>(args)...){}

  template<typename Q,typename ...Args>
  auto operator()(Q&& q,Args&&... args)
    ->decltype(t(std::forward<Q>(q),std::forward<Args>(args)...))
  {
     std::cout<<typeid(q).name()<<std::endl;
     return t(std::forward<Q>(q),std::forward<Args>(args)...);
  }

  template<typename Q,typename ...Args>
  auto operator()(Q&& q,Args&&... args)const
    ->decltype(t(std::forward<Q>(q),std::forward<Args>(args)...))
  {
     std::cout<<typeid(q).name()<<std::endl;
     return t(std::forward<Q>(q),std::forward<Args>(args)...);
  }
};
logger<T> merely adorns the interface of T (which is, remember, entirely comprised of overloads of operator()) by registering the name of the first argument type, i.e. the slot used. Substituting logger<T> by T in any piece of code works immediately without further modifications, regardless of the interface of T (provided, of course, it is slot-based). The following is probably more interesting:
template <typename T>
class shared_ref
{
  std::shared_ptr<T> p;
public:
  template<typename ...Args>
  shared_ref(Args&&... args):p(std::forward<Args>(args)...){}

  template<typename ...Args>
  auto operator()(Args&&... args)
    ->decltype((*p)(std::forward<Args>(args)...))
  {
     return (*p)(std::forward<Args>(args)...);
  }
};
Much as shared_ptr<T> acts as regular T*, shared_ref<T> acts as a regular T& whose (slot-based) interface can be unencumberedly used. This even works with run-time polymorphism based on virtual overloads of operator().
We can continue expanding on this new interface paradigm. So far slots have been regarded as passive types whose only mission is to adorn interfaces with convenient names, but nothing stops us from making these slots active classes in a curiously useful way:
struct begin_s
{
  template<typename T,typename ...Args>
  auto operator()(T&& t,Args&&... args)const
    ->decltype(t(*this,std::forward<Args>(args)...))
  {
    return t(*this,std::forward<Args>(args)...);
  }
};
extern begin_s begin;
We have defined begin_s as a functor performing the following transformation:
begin(t,...) t(begin,...),
which allows us to write (assuming all the slots are defined in an equivalent way) code like this:
vector<int> c;
for(int i=0;i<10;++i){
  push_back(c,2*i);
  emplace_back(c,3*i);
}
for(auto it=begin(c),it_end=end(c);it!=it_end;++it){
  std::cout<<*it<<" ";
}
So, slots give us for free the additional possibility to use them with global function syntax. In fact, we can take this behavior as the very definition of a slot.
Definition. A basic slot is a functor S resolving each invocation of s(t,...) to an invocation of t(s,...), for any s of type S.
Those readers with a mathematical inclination will have detected a nice duality pattern going on here. On one hand, slot-based classes are functors whose overloads of operator() take slots as their first parameter, and on the other hand slots are functors whose first parameter is assumed to be a slot-based class type.
The diagram shows a number of different classes against several slot names, where succesful combinations are marked by a dot: in a sense it is immaterial which axis is considered to hold classes and which holds slots, as their roles can be reversed from a syntactical point of view. One unexpected implication of this duality is that slot-based class wrappers will also work as slot wrappers. For instance, much as we can write
// register access to the interface of c
logger<vector<int>> c;
...
we can also write
// register invocations to emplace_back
logger<emplace_back_s> logged_emplace_back;
logged_emplace_back(c,100);
...
Or, if we make vector open-ended by accepting any slot, we can use logged_emplace_back with member function syntax:
template <typename T>
class open_ended
{
  T t;
public:
  template<typename ...Args>
  open_ended(Args&&... args):t(std::forward<Args>(args)...){}
      
  template<typename Slot,typename ...Args>
  auto operator()(Slot&& s,Args&&... args)
    ->decltype(s(t,std::forward<Args>(args)...))
  {
     return s(t,std::forward<Args>(args)...);
  }

  template<typename Slot,typename ...Args>
  auto operator()(Slot&& s,Args&&... args)const
    ->decltype(s(t,std::forward<Args>(args)...))
  {
     return s(t,std::forward<Args>(args)...);
  }
};
...
logger<emplace_back_s> logged_emplace_back;
open_ended<vector<int>> c;
c(logged_emplace_back,100);
...
By identifying class access with class invocation and reifying member function names into first-class functors, slot-based interfaces allow for a great deal of static composability and provide unified concepts blurring the distinction between classes and functions in ways that are of practical utility. The approach is not perfect: a non-standard syntax has to be resorted to, public inheritance needs special provisions and operators (as opposed to regular member functions, i.e. functions like operator=, operator!, etc.) are not covered. With these limitations in mind, slot-based interfaces can come handy in specialized applications or frameworks requiring interface decoration, smart references or similar constructs.

Wednesday, December 4, 2013

Lookup cost in hash tables with duplicate elements

As a continuation of our previous entry on the statistical properties of hash tables with duplicate elements, let's now analyze how duplicates affect lookup times. Let N be the number of elements in the table, B the number of buckets and F = N/B the associated load factor. We have already learnt that the size S of a bucket follows (for large N) a Poisson distribution PF(n) with mean F. Beginning with the non-duplicate case, what is the average number of elements checked when searching for a given key k? In such general terms, the question is underspecified because lookup times differ depending on whether k is actually present in the table or not. The analysis can be done, though, considering each case in isolation.
Key is not present. This is easy to calculate: the number of elements checked is simply the average length of a bucket, which we know to be F.
Key is present. We assume that the value of k, regarded as a random variable, is evenly distributed among the values stored in the table, i.e. each element has the same probability of being chosen for lookup. Define
B(n) := Pr(k is in a bucket of size n).
It is easy to see that B(n) can be calculated as
B(n) = nPF(n)B/N = nPF(n)/F,
following from the simple fact that there are PF(n)B buckets of size n, each holding precisely n elements. In such a bucket, the number of elements checked when looking for k goes from 1 (k is the first element) to n (k is the last), and is thus in average (1 + ··· + n)/n = (n+1)/2. From this we can calculate in the general case the average number of elements checked as
Σn≥0 ((n+1)/2)B(n) =
= Σn≥0 ((n+1)/2)nPF(n)/F = (1/2F)Σn≥0 (n2+n)PF(n) =
= (1/2F)(E[S] + E[S2]) = (1/2F)(F + F2 + F) = 1+ F/2,
where we have used the fact that the second moment of a Poisson distribution with mean F is F2 + F.
We are now ready to study the case where the hash table holds M groups of duplicate elements, each group with average size G
Key is not present. There are no differences with respect to the non-duplicate case; an entire bucket, whose average size is again F, is traversed when looking up for the non-present key.
Key is present. In terms of distribution of elements, the table is equivalent to another one without duplicates where each group of equal elements is replaced by just one representative. This theoretical table has a load factor F' = F/G and an associated average lookup time 1 + F'/2. Now, searching in our original table incurs the same number of operations, except that all the individual checks before getting to k now turn into checks of an average of G equivalent elements. So, the total number of steps is
1 + GF'/2 = 1 + F/2,
exactly as in the non-duplicate case. How come we see no degradation despite elements being heavily concentrated in fewer buckets? The reason is that the distribution of groups in the table is governed by an equivalent load factor F' = F/G which is much (as much as G) less than F, and the probability that there are two of more groups of elements in the same bucket is in consequence greatly reduced.   
The hash table structures used by some implementations of C++ unordered associative containers such as Boost.Unordered allow for skipping of groups of equivalent elements. For these special hash tables, the average lookup steps are reduced to F/G, if the key is not present, and 1+F/2G, if it is.

Tuesday, November 26, 2013

Measuring lookup times for C++ unordered associative containers

This is the final installment of our series of entries on the performance of C++ unordered associative containers as implemented by Dinkumware, Boost.Unordered and Boost.MultiIndex. We now analyze lookup according to the following two scenarios:
void successful_lookup(const container& c,unsigned int n)
{
  while(n--)s.find(rnd());
}

void unsuccessful_lookup(const container& c,unsigned int n)
{
  while(n--)s.find(rnd2());
}
where rnd generates the same random sequence of values used to populate c with n elements and rnd2 is a different, statistically independent sequence: so, lookups in the fist scenario always succeed, while those in the second one are very likely to fail. As usual we provide profiling programs for the non-duplicate and duplicate versions of the containers, the build and test environment stays the same as before (Microsoft Visual Studio 2012, default release mode settings, Windows box with an Intel Core i5-2520M CPU @2.50GHz), times are in microseconds/element, n runs from 10,000 to 3 million. The results for the non-duplicate case are depicted in the figure (first successful lookup, then unsuccessful):
The algorithms implemented by the three libraries are entirely similar (invoke hash function, map to bucket, traverse till either a matching element or the end of the bucket is met); the differences we see in performance are explained by these two factors:
  • Dinkumware, as we have discussed in a previous entry, has the fastest hash→bucket mapping.
  • Boost.Unordered and Boost.MultiIndex use the same approach to locate the bucket to search through, namely mapping hash values to bucket entries with an expensive modulo calculation and getting to the bucket via the preceding element (thus worsening locality with respect to Dinkumware's procedure): Boost.Unordered calculates this modulo slightly less inefficiently and, in unsuccessful lookups, often has to do the calculation twice, the first time to locate the bucket, and the second one to determine when that bucket ends (if it is not empty).  
The duplicate case with average group size G = 5 and maximum load factor Fmax = 1 looks like this (first successful lookup, then unsuccessful):
For both scenarios, Dinkumware's performance degrades more than the other two libs with respect to the non-duplicate case: when a collision happens, i.e. if a different group than searched for is found in the bucket, Dinkumware has to check every element of the group, whereas Boost.Unordered and Boost.MultiIndex take advantage of their group-skipping capabilities. Boost.MultiIndex has a poorer group-skipping mechanism as it only applies to groups with size > 2, which is more apparent on unsuccessful lookups (for successful lookups the probability that two groups end up in the same bucket is very small.)
Lastly, this is the situation when we set Fmax = 5 (first successful lookup, then unsuccessful):
Now the disadvantage of Dinkumware is much more clear: with a typical load F = 0.75·Fmax =3.75, the average number of elements checked is 1+ F/2 =  2.875 for successful lookups and F = 3.75 for unsuccessful ones, while for Boost.Unordered the corresponding numbers are 1+ F/2G = 1.375 and F/G = 0.75, respectively, and slightly worse than these for Boost.MultiIndex as this library does not perform group skipping for group sizes ≤ 2.

Monday, November 18, 2013

Measuring erasure times for C++ unordered associative containers

After measuring insertion times for Dinkumware, Boost.Unordered and Boost.MultiIndex implementations of C++ unordered associative containers without and with duplicates, we now turn to profiling erasure. As a model for this operation we choose the following scenario:
void erase(container& c)
{
  for(auto it:rnd_it_range(c))c.erase(it);
}
where rnd_it_range(c) is a random shuffling of [c.begin(), ++c.begin(), ... , c.end()). I've written profiling programs for the non-duplicate and duplicate versions of the containers, respectively, which were built and run on the same environment as in previous entries. The graphics show resulting erasure times in microseconds/element for n = 10,000 to 3 million elements.
These are the results for containers without duplicate elements:
Dinkumware and Boost.MultiIndex have unconditional O(1) erasure since buckets need not be traversed to delete an element, and moreover the locality of their corresponding erasure procedures is very good: all of this shows as superior performance with respect to Boost.Unordered, which needs to do bucket traversal in order to locate the node preceding the one being deleted. The CPU memory cache, which favors algorithm locality, makes the differences in performance increase when n is large.
Containers with duplicate elements are populated with a random sequence of elements with groups of equivalent values having an average size G = 5. We first set the maximum load factor Fmax = 1:
Dinkumware performance does not change, as this library's containers treat equivalent elements in exactly the same manner than in the non-duplicate case, and bucket occupancy, greater in this case, does not affect the erasure procedure. Boost.Unordered performance increases somewhat: when deleting a element in the middle of a group of equivalent values, the library does not need to traverse the bucket because the preceding and following elements are linked in a circular list; the net result is that locality improves for these elements and stays the same for elements at the beginning or the end of a group, thus overall performance is better. Boost.MultiIndex behaves the worst here: even though erasure is still unconditionally constant-time, the complexity of the underlying data structure implies that as many as seven memory locations can be visited to erase an element, 75% more than Dinkumware and at about the same level as Boost.Unordered.
When we set Fmax = 5, the situation does not change dramatically:
Boost.Unordered is the only library that could theoretically be impacted by the higher bucket occupancy, but the effect is nonetheless not noticeable.
It should be noted that Dinkumware (in disagreement with the C++ standard) invokes the hash function of the element being erased, which would negatively affect its performance for elements harder to hash than simple unsigned ints. That said, the special provisions Boost.Unordered and Boost.MultiIndex implement to accelerate the processing of groups of equivalent elements have not yet achieved spectacular performance improvements, or even result in decidedly slower times. It is at lookup that these libraries excel, as we will see in a future entry.

Saturday, November 16, 2013

Measuring insertion times for C++ unordered associative containers with duplicate elements

We continue our measuring plan for C++ unordered associative containers and focus now on insertion in containers with duplicate elements as implemented by Dinkumware, Boost.Unordered and Boost.MultiIndex. As before, we consider non-rehashing and rehashing scenarios
// non-rehashing
void insert(container& c,unsigned int n,float Fmax)
{
  c.max_load_factor(Fmax);
  c.reserve(n);
  while(n--)c.insert(rnd());
}

// rehashing
void insert(container& c,unsigned int n,float Fmax)
{
  c.max_load_factor(Fmax);
  while(n--)c.insert(rnd());
}
with two additional complications:
  • The random source produces duplicate elements where each different value is repeated G = 5 times on average within a run of n invocations.
  • We study the performance with maximum load factors Fmax = 1 (the default specified by the standard) and Fmax = G = 5. The latter case results in an element distribution across buckets equivalent to that of a container without duplicate elements being fed the same random source (i.e. where each group of equivalent elements is reduced to just one element). As studied before, this favors implementations that are able to skip groups of equivalent elements in constant time, which is the case for Boost.Unordered and Boost.MultiIndex. 
The test program was compiled and run on the same environment as in our previous entry. The first figure shows the results for the non-rehashing scenario and Fmax = 1, times in microseconds/element.
The sigmoid shapes due to CPU memory cache effects should be familiar by now. Boost.Unordered and Boost.MultiIndex perform very similarly despite their different data structures and insertion algorithms: the former is slower at determining the end of a bucket whereas the latter does worse for groups of size < 3, and these two effects seem to cancel each other out. Why is Dinkumware as fast or even faster than the other two libraries?
  • As we have proved in another entry, the distribution of elements when duplicates are allowed varies significantly with respect to the non-duplicate situation for the same load factor: here many more buckets are empty and the probability that two groups of elements end up in the same bucket is much lower: consequently, most of the time the insertion procedure needs not traverse the entire bucket and the fact that Dinkumware cannot skip groups in constant time is irrelevant.
  • Furthermore, the locality of Dikumware is better than that of Boost.Unordered and Boost.MultiIndex, which need to access the first and the last element of a group upon each new insertion; this shows up as an additional advantage for Dinkumware when n is large and most of the memory active set lies outside the CPU cache. 
When we set Fmax = 5, results look like this:
Although the level of bucket occupancy is five times higher, Boost.Unordered and Boost.MultiIndex insertion times does not grow much, precisely because they are not affected by the size of groups of equivalent elements. On the other hand, Dinkumware performance, as expected, degrades noticeably: when two groups occupy the same bucket, which is now more likely, reaching for the second group implies traversing all the elements of the first.
The following figures show the the inserton times for the rehashing scenario and Fmax = 1 and 5, respectively.
The results are not so easy to interpret as before. There are several competing forces acting here:
  • Dinkumware rehashing procedure is the slowest from an algorithmical point of view (each element is rehashed separately) but has the best locality, which compensates its initial disadvantage when n grows.
  • Boost.Unordered rehashes fastest of all three libraries because the implementation stores precomputed hash values within element nodes. This is more apparent for large values of n, as evidenced by the height of the graph "steps", and would show even more distinctly if the elements were harder to hash than the plain unsigned ints we have used (for instance, if we had dealt with std::strings).
In future entries we will go on to profiling erasure and lookup times.

Saturday, November 9, 2013

Measuring insertion times for C++ unordered associative containers

Equipped with our theoretical analysis on the complexity of insertion for C++ unordered associative containers without duplicate elements, let us now measure the real thing. I have written a program that profiles running insertion times for Dinkumware, Boost.Unordered and Boost.MultiIndex. Tests were compiled with Microsoft Visual Studio 2012, default release mode settings, and run on a Windows machine with an Intel Core i5-2520M processor running at 2.50GHz.
We begin with the non-rehashing scenario:
void insert(container& c,unsigned int n)
{
  c.reserve(n);
  while(n--)c.insert(rnd());
}
The results for n = 10,000 to 3 million elements are depicted in the figure. Times are microseconds/element.
Well, this is anything but average constant time! What we are seeing is in most likelihood the effect of the CPU memory cache, which in our testing environment has a size of 3MB. This means that the active memory set fits into the CPU cache up to n ≈ 150,000 for Dinkumware and n ≈ 200,000 for Boost.Unordered and Boost.MultiIndex: beyond that point average memory read/write time gradually increases and is expected to stabilize when the size of the memory set is much larger than the CPU cache. Leaving these effects aside, what the figure tells us about the performance of the containers is:
  • Boost.MultiIndex does slightly faster than Boost.Unordered because determining the end of a bucket only involves some pointer checks in the former lib whereas the latter has to use more expensive verifications based on stored hash values.
  • Dinkumware bucket arrays sizes are powers of two whereas Boost.Unordered and Boost.MultiIndex resort to an (approximately) geometric progression of prime numbers: the former scenario allows for a much faster mapping of hash values to bucket entries because n%m, which is in general an expensive operation, can be efficiently implemented as n&(m-1) when m is a power of two. On the other hand, Dinkumware's pointer manipulation is the heaviest of all three libs, and the net result is that its observed insertion performance sits more or less between those of the other two.
Now we consider the rehashing scenario:
void insert(container& c,unsigned int n)
{
  while(n--)c.insert(rnd());
}
The differences with respect to the previous case are due, obviously, to the presence of rehashing:
  • Dinkumware excels when n is large because of the advantage we mentioned about using powers of two for bucket array sizes, more prevalent under rehashing conditions.
  • Unlike Boost.Unordered, Boost.MultiIndex needs to calculate and temporarily store hash values each time a rehashing occurs, which eats up its head start when n grows.
Note that the test uses unsigned ints as elements, whose associated hash function is trivial: if the elements were harder to hash (like in the case of std::strings), Boost.Unordered, which is the only lib that stores hash values, should beat the other two for large numbers of elements.

Sunday, October 27, 2013

Insertion complexity of C++ unordered associative containers

C++ unordered associative containers make the beautiful promise of (average) constant time insertion provided hashing is done properly. Let's see the algorithmic theory hash table insertion relies on. We restrict ourselves to discussing containers not allowing duplicate elements (unordered_set and unordered_map, which, for the purpose of this analysis, are entirely equivalent).
So, we want to calculate the average complexity of this model of running insertion into a container:
void insert(container& c,unsigned int n)
{
  while(n--)c.insert(rnd());
}
where rnd is a random source generating non-repeating integers and c is empty upon invoking insert. Let B0 be the initial number of buckets of an empty container and Fmax its maximum load factor (usually 1): when there are  n-1 < FmaxB0 elements in c, the time taken to insert a new one is
 t(n) = a + b(n-1)/B0,
for some constants a and b: the first addend represents the constant insertion time of the element into the structure and the second one accounts for collisions into the destination bucket: if hashing is distributing elements uniformly, there are on average (n-1)/B0 previous elements in each bucket. So, the time complexity of insert(n)(if n FmaxB0) is
f(n) = Σi=1,..,n t(i) = an + bn(n-1)/2B0.
Now, when the maximum load factor is hit (n = FmaxB0), an additional insertion involves rehashing the table to a new, larger number of buckets B. In this scenario, insert(n) complexity is
f(n) = a(n-m) + b(n(n-1) - m(m-1))/2B + r(m,B) + f(m), 
with
 m := FmaxB0,
r(m,B)
:= a'm + b'm(m-1)/2B
.
This is simpler than it seems: f(m) is the time taken before rehashing, r(m,B) represents rehashing m elements from B0 to B (for the sake of generality we allow for a', b' to be different than a, b because no new nodes are allocated, insertion can take some more time than the regular case due to the creation of the new bucket array,  collisions need not be checked, etc.) and the rest of the expression models the insertion of the remaining n-m elements after rehashing. The formula we have derived applies recursively to any n provided that B is the number of buckets used by the container to hold at least n elements and B0 is the number of buckets after the last rehashing operation (0 if no previous rehashing took place), and defining by convention r(0,0) = f(0) = 0. This is a C++ implementation of f(n) (name running_insertion) and f(n)/n (average_running_insertion):
double rehash(unsigned int m,unsigned int B)
{
  static const double aprime=1.0,bprime=1.0;
  return m==0?0:aprime*m+bprime*m*(m-1)/(2*B);
}

double running_insertion(unsigned int n,double Fmax)
{
  static const double a=1.0,b=1.0;
  static const std::array<unsigned int,18> Bs={
    0, 53, 97, 193, 389, 769,
    1543, 3079, 6151, 12289, 24593,
    49157, 98317, 196613, 393241, 786433,
    1572869, 3145739
  };

  if(n==0)return 0.0;

  auto it=std::lower_bound(
    Bs.begin(),Bs.end(),(unsigned int)std::ceil(n/Fmax));
  if(it==Bs.end())throw std::out_of_range("running_insertion");
  unsigned int B=*it,m=(unsigned int)(Fmax*(*(it-1)));
  
  return a*(n-m)+(b*n*(n-1)-b*m*(m-1))/(2*B)+
         rehash(m,B)+running_insertion(m,Fmax);
}

double average_running_insertion(unsigned int n,double Fmax=1.0)
{
  return running_insertion(n,Fmax)/n;
}
53, 97, 193,... is the sequence of bucket array sizes used by Boost.Unordered and other implementations of unordered associative containers. The graphic shows f(n)/n with Fmax = a = b = a' = b' = 1 for n = 1 to 3·106; the horizontal scale is logarithmic.
Before the first rehashing happens (that is, when n ≤ 53), f(n)/n grows as collisions are more frequent, but the slope of the function is negative after every subsequent rehashing because f(n)/n amortizes such steps. From 1,000 elements onwards, the average value of  f(n)/n stabilizes at around 3.6 and does not grow indefinitely. This can be proved formally:
Proposition. f(n)/n is bounded if bucket array sizes follow a geometric progression.
Proof. By hypothesis, bucket array sizes form a sequence Bi = B0Ki for some K > 1. Let's call Ni := FmaxBi. Now, when n > N1 the recursive calculation of f(n) can be unfolded to
f(n) = a(n-Nk-1) + b(n(n-1) - Nk-1(Nk-1-1))/2Bk +
+ Σi=0,..,k-2(a(Ni+1-Ni) + b(Ni+1(Ni+1-1) - Ni(Ni-1))/2Bi+1) +
+ aN0 + bN0(N0-1)/2B0 +
+ Σi=0,..,k-1(a'Ni + b'Ni(Ni-1)/2Bi+1),
where k ceil(logK n/N0), that is, n/N0 Kk < nK/N0. In a [Ni + 1, Ni+1 + 1] interval, Δ(f(n)/n) is monotonically non-decreasing, i.e. f(n)/n is convex (proof omitted), so its maximum happens at Ni + 1 or Ni+1 + 1 and hence we can restrict our study of boundedness of f(n)/n to the points n = Ni + 1. Approximating x-1 to x for x 1, dropping the terms outside of the summatories (as they are constant or grow slower than n) and taking into account the relationship between bucket array sizes, we have
f(Ni + 1) Σi=0,..,k-2(a(Ni+1-Ni) + bFmax(Ni+1/2 - Ni/2K)) +
+ Σi=0,..,k-1(a'Ni + b'FmaxNi/2K) =
= (a(K-1) + bFmax(K2-1)/2K) Σi=0,..,k-2 Ni +
+ (a' + b'Fmax/2K) Σi=0,..,k-1 Ni =
(a(K-1) + bFmax(K2-1)/2K)N0(Kk-1-1)/(K-1) +
+ (a' + b'
Fmax/2K)N0(Kk-1)/(K-1) <
< n(a(K-1) + bFmax(K2-1)/2K + a'K + b'Fmax/2)/(K-1),
thus f(n)/n is bounded.
The (approximate) bounding formula yields 4.25 for the function depicted above (K 2), in excellent agreement with the graph. 
There is an alternative insertion scenario where the total number of elements is known in advance and the user can reserve space to avoid rehashing:
void insert(container& c,unsigned int n)
{
  c.reserve(n);
  while(n--)c.insert(rnd());
}
The associated complexity of this operation is
f(n) = an + bn(n-1)/2B,
where B is the minimum bucket size such that n FmaxB. f(n)/n is, again, bounded, as shown in the figure.
We omit the proof, which is much simpler than the previous one. In this case, the bound on f(n)/n is a + bFmax/2. Note that this does not depend on K precisely because there is no rehashing involved.
So, the promise holds true: if the container uses bucket array sizes following a geometric progression and hashing behaves well, insertion is O(1) on average (either with or without rehashing). In a later entry we will see whether real implementations of unordered associative containers live up to theory.

Friday, October 25, 2013

Implementation of C++ unordered associative containers with duplicate elements

In a prior entry we reviewed the data structures used by some popular implementations of C++ unordered associative containers without duplicate elements (unordered_set and unordered_map) and described a new approach introduced by Boost.MultiIndex hashed indices. Let's continue with the corresponding containers allowing for duplicate elements, unordered_multiset and unordered_multimap.
Hash tables and duplicate elements do not mix well together, the two basic problems being:
  • Load factor control based on the total number of elements rather than the number of different elements leads to unnecessarily large and sparsely populated bucket arrays;
  • Algorithm complexity is not bound by the load factor but instead depends on the average length of groups of equivalent elements.
Implementations of  unordered_multiset/unordered_multimap can do nothing to alleviate the difficulties with load factor management: the standard codifies its behavior very precisely with little room for reinterpretation or improvement. An observant user can remedy the situation on her own by setting FG as the load factor, where F controls the level of occupancy (i.e. it is the equivalent load factor for the case where no duplicate elements are allowed) and G is the average length of element groups. As for the second problem, an implementation may choose to enrich the internal data structure so that element groups can be skipped in constant time (the standard specifies that equivalent elements must be packed together): in this case, group length is no longer an issue and the container provides the same performance guarantees as the non-duplicate version. What do popular implementations do in this respect?

Dinkumware, libc++, libstdc++-v3

None of these libraries makes any special provision to handle groups of equivalent elements in an efficient manner: they rely on the same data structure used for non-duplicate versions of the containers.

Boost.Unordered

This library does augment its data structure to cope with equivalent elements:
The figure shows a group of five equivalent elements located at bucket b2. An additional back pointer is added to the original structure that reverse-links the elements of the group together in the way a circular doubly linked list does. This back pointer allows groups to be treated as whole units: when traversing a bucket for insertion or erasure, only the first element of each group need be inspected, and the rest can be skipped by following the back pointer to the last element. The penalty for this performance improvement is one extra pointer per node.

Boost.MultiIndex

The same data structure used for containers without duplicates can be further tweaked to efficiently handle groups of equivalent elements at no extra memory overhead:
To interpret the tangle of pointers shown in the figure we use the Xn, Xp terminology introduced in the first article. As there, the first element of each bucket is redirectioned to have Xp point to the associated bucket entry. Now consider a group of three or more equivalent elements with F, S, P, L pointers to the first, second, penultimate and last node, as marked in the figure (S = P when the group has length 3); the following redirections are applied:
  • Sp = L,
  • Pn = F.
It is not hard to prove that:
  • X is the first of its bucket iff Xpnn = X,
  • X is the last of its bucket iff Xnpn = X,
  • X is the first of a group (of length ≥3) iff Xnp X and Xnppn = X,
  • X is the second of a group (of length ≥3) iff Xpn X and Xppnn = X,
  • X is the penultimate of a group (of length ≥3) iff Xnp X and Xnnpp = X,
  • X is the last of a group (of length ≥3) iff Xpn X and Xpnnp = X,
that is, each special position can be univocally determined. Moreover, given X we can apply these properties to locate in constant time the nodes following and preceding X:
  • The element following X is Xnnp if X is the penultimate of a group (of length ≥3), Xn otherwise,
  • the element preceding X is Xpn if X is the first of its bucket, Xppn if X is the second of a group (of length ≥3), Xp otherwise.
So, even with all these redirections we can still treat the structure as a circular doubly linked list with O(1) linking/unlinking, and at the same time take benefit of group identification to speed up hash operations. The procedure for insertion follows these steps:
  1. Use the hash function to locate the corresponding bucket entry constant time.
  2. Traverse the first elements of each group in the bucket looking for a group of equivalent elements  time linear on the number of groups in the bucket.
  3. If a group was found, link into it; otherwise, link at the beginning of the bucket constant time.
  4. Adjust bucket entry pointers as needed constant time.
Group skipping is constant time: if X is the first element of a group of length  ≥3, the last element is reached with Xnp; if the length is 1 or 2, fast skipping is not available, but the number of elements to check against is, again, 1 or 2, i.e. bounded. The scheme for erasure looks like:
  1. Unlink from the (doubly linked) list constant time.
  2. Adjust bucket entry pointers if the element is the first or the last element of its bucket, or both constant time.
The actual implementation is more complicated than this overall description suggests, since linking and unlinking must maintain the special redirections used to identify groups, but the net result from the point of view of performance is: insertion and erasure times depend on the number of groups rather than the total number of elements and have the same complexity as the non-duplicate case, i.e. O(1) for insertion if the hash function works properly, unconditionally O(1) for erasure.