Efficient hash tables for network applications
 Thomas Zink^{1} and
 Marcel Waldvogel^{1}
https://doi.org/10.1186/s400640150958y
© Zink and Waldvogel; licensee Springer. 2015
Received: 13 November 2014
Accepted: 1 April 2015
Published: 15 May 2015
Abstract
Hashing has yet to be widely accepted as a component of hard real-time systems and hardware implementations, due to still existing prejudices concerning the unpredictability of space and time requirements resulting from collisions. While in theory perfect hashing can provide optimal mapping, in practice, finding a perfect hash function is too expensive, especially in the context of high-speed applications.
The introduction of hashing with multiple choices, d-left hashing and probabilistic table summaries, has caused a shift towards deterministic DRAM access. However, high amounts of rare and expensive high-speed SRAM need to be traded off for predictability, which is infeasible for many applications.
In this paper we show that previous suggestions suffer from the false precondition of full generality. Our approach exploits four individual degrees of freedom available in many practical applications, especially hardware and high-speed lookups. This reduces the requirement of on-chip memory by up to an order of magnitude and guarantees constant lookup and update time at the cost of only minute amounts of additional hardware. Our design makes efficient hash table implementations cheaper, more predictable, and more practical.
Introduction
Efficient hashing in network applications is still a challenging task, because tremendously increasing line speeds, the demand for low power consumption and the need for performance predictability place tight constraints on data structures and algorithms. At the same time, memory access speed has stayed almost constant, especially because of the latency and waiting time between sequential repeated accesses. Hashing has yet to be widely accepted as an ingredient in hard real-time systems and hardware implementations, as prejudices concerning the unpredictability of size and time requirements due to collisions still persist.
Modern approaches make use of multiple choices in hashing (Broder and Mitzenmacher 2001; Vöcking 2003) to improve load and the number of memory accesses. Unfortunately, d-ary hashing requires d independent parallel lookups. To mitigate the need for high parallelism, table summaries (Kirsch and Mitzenmacher 2008; Song et al. 2005), based on (counting) Bloom filters (Bloom 1970; Fan et al. 1998) and derivatives, further reduce the number of table accesses to one with high probability (w.h.p.) at the cost of fast but expensive on-chip memory (SRAM). The summaries allow set membership queries with a low false positive rate and some approaches also reveal the correct location of an item if present.
Although these improvements address space and time requirements, they come at a high price. SRAM is extremely expensive and, while external DRAM can be shared, it must be replicated for every network processor. In addition, numerous networking applications compete for their slice of this precious memory. For many of them, such as socket lookups, Layer-2 switching, packet classification and packet forwarding, tables and their summaries tend to grow extremely large, up to the point where providing enough SRAM is not feasible. Perfect hashing, on the other hand, can lead to a near perfect match (Hagerup and Tholey 2001) but only works on static sets, does not allow updates and requires complex computations.
The options for a network application designer are grim. With millions of lookups per second, even the most improbable worst case is likely to happen, slowing down the entire application and leading to packet loss and network congestion. Naive hash tables are too unpredictable and yield too many collisions. d-ary hashing requires high parallelism to minimize sequential lookups. Expensive SRAM-based table summaries optimize the average case performance but still require multiple lookups in the worst case. Perfect hashing can potentially guarantee a perfect match and a constant lookup performance but requires a static set. To be fully accepted in practical network applications, hashing needs to guarantee constant lookup performance, require minimal on-chip memory, and allow regular updates.
We propose mechanisms to construct an improved data structure which we name Efficient Hash Table (EHT), where efficient relates to both on-chip memory (SRAM) usage and lookup performance. The design aggressively reduces the number of bits per item needed for the on-chip summary, guarantees a constant lookup time and still delivers adequate update performance for most applications, except those that require real-time updates. To the best of our knowledge, the EHT is the only data structure offering these characteristics.

The EHT exploits the following four degrees of freedom:
1. The update and lookup engines can be separated. The on-chip summary need not be exact.
2. The summary’s false positive rate can be ignored, since it is irrelevant with respect to lookup performance.
3. The summary can be (de)compressed in real time.
4. The load of a bucket can potentially be larger than one without increasing memory accesses.
In concert, these concepts reduce SRAM memory size by up to an order of magnitude, but they can also be applied and configured individually depending on the target application.
The rest of this paper is organized as follows. Section 2 discusses related work with Section 2.1 reviewing hash table summaries in greater detail. Section 3 introduces the Efficient Hash Table and presents an overview. Section 4 shows how to separate the update and lookup engines. Section 5 discusses the effect of the false positive rate on the EHT. Section 6 presents multiple compression schemes to improve SRAM memory footprint. Section 7 shows how to optimize bucket loads. The results are evaluated and discussed in Section 8. Finally, the paper concludes in Section 9.
Related work
A hash function h maps items of a set S to an array of buckets B. Its natural application is the hash table, or dictionary, which maps keys to values. In theory, a perfect hash function that is injective on S (Hagerup and Tholey 2001) could map n items to n buckets. While perfect hashing for static sets is relatively easy (Fredman et al. 1984), finding a suitable hash function that requires constant space and time to perform the mapping of a dynamic set is infeasible in practice. As a result, hashing has to deal with collisions, where multiple items are hashed into the same bucket. Naive solutions anchor a linked list or an array of items to the overflown bucket or probe multiple buckets according to a predefined scheme. The need for collision resolution led to the persisting myth that hashing has unpredictable space/time requirements.
Dietzfelbinger et al. 1994 extended the scheme of Fredman et al. 1984 to store dynamic sets. Their dynamic perfect hashing resolves collisions by random selection of universal hash functions (Carter and Wegman 1977) for a second-level hash table.
Azar et al. 1994 observed that by allowing more possible destinations for items and choosing the destination with the lowest load, both the average and the upper bound load can be reduced exponentially. This effect became popular as the “power of two choices”, a term coined by Mitzenmacher in (Mitzenmacher 1996). Vöcking 2003 achieved further improvements by introducing the “always-go-left” algorithm, where the items are distributed asymmetrically among the buckets. Broder and Mitzenmacher 2001 suggest using multiple hash functions to improve the performance of hash tables. The n buckets of the table are split into d equal parts imagined to run from left to right. An item is hashed d times to find the d possible locations. It is then placed in the least loaded bucket. Ties are broken by going left (d-left hashing). A lookup requires examining the d locations. Since the d choices are independent, lookups can be performed in parallel or pipelined. A survey of multiple-choice hashing schemes and their applications can be found in (Mitzenmacher 2001a).
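The placement rule of d-left hashing described above can be illustrated with a minimal Python sketch. It is not code from the cited works: the class name `DLeftTable`, the salted SHA-256 hash in `bucket_index`, and the unbounded per-bucket lists are assumptions made for the example.

```python
import hashlib


def bucket_index(key: str, sub: int, buckets_per_sub: int) -> int:
    """Hypothetical hash: one salted SHA-256 digest per sub-table."""
    digest = hashlib.sha256(f"{sub}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % buckets_per_sub


class DLeftTable:
    def __init__(self, d: int, buckets_per_sub: int):
        self.d = d
        self.size = buckets_per_sub
        # d sub-tables, ordered from left (index 0) to right (index d-1)
        self.subs = [[[] for _ in range(buckets_per_sub)] for _ in range(d)]

    def candidates(self, key: str):
        """The d possible locations of a key, one per sub-table."""
        return [(i, bucket_index(key, i, self.size)) for i in range(self.d)]

    def insert(self, key: str, value) -> int:
        # min() returns the first minimum, so ties go to the leftmost sub-table
        i, b = min(self.candidates(key),
                   key=lambda loc: len(self.subs[loc[0]][loc[1]]))
        self.subs[i][b].append((key, value))
        return i

    def lookup(self, key: str):
        # The d probes are independent; hardware could issue them in parallel
        for i, b in self.candidates(key):
            for stored_key, value in self.subs[i][b]:
                if stored_key == key:
                    return value
        return None
```

In an empty table every candidate bucket has load zero, so the first insertion always lands in the leftmost sub-table.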
The major drawback of Bloom filters is that they do not allow deletions. Fan et al. 1998 addressed this issue by introducing a counting Bloom filter (CBF). Instead of a bit array, CBF maintains an array of counters \(C=\{\varsigma_{0},\dots,\varsigma_{m-1}\}\) to represent the number of items that are hashed to its cells. Insertions and deletions can now be handled easily by incrementing and decrementing the corresponding counters. Later, Bonomi et al. presented an improved version of CBF based on d-left hashing (Bonomi et al. 2006).
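As a minimal sketch of the counting Bloom filter idea, the following Python class keeps m counters and k hash functions. The class name, the salted SHA-256 hashing and the unbounded integer counters are assumptions for illustration only; real implementations use small fixed-width counters.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = [0] * m  # one counter per cell instead of one bit

    def _cells(self, key: str):
        """The k cells a key hashes to (hypothetical salted hash)."""
        return [
            int.from_bytes(
                hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big"
            ) % self.m
            for i in range(self.k)
        ]

    def insert(self, key: str) -> None:
        for cell in self._cells(key):
            self.counters[cell] += 1

    def delete(self, key: str) -> None:
        # Only valid for keys that were inserted before
        for cell in self._cells(key):
            self.counters[cell] -= 1

    def __contains__(self, key: str) -> bool:
        # May report false positives, never false negatives
        return all(self.counters[cell] > 0 for cell in self._cells(key))
```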
In (Mitzenmacher 2001b) Mitzenmacher proposes arithmetic coding for Bloom filters used for exchanging messages (web cache information) in distributed systems. Recently, Ficara et al. 2008 introduced a compression scheme for counting Bloom filters based on Huffman coding named Multi-Layer Compressed Counting Bloom Filter (MLCCBF). The compressed counters are stored in multiple layers of bitmaps. Indexing requires perfect hash functions since collisions must be avoided. The structure provides near optimal encoding of the counters but retrieval is extremely expensive. The authors propose splitting the bitmaps into equally sized blocks and using an index structure to lower the cost of a counter lookup.
Bloom filters have since gained a lot of attention, especially in network applications (Broder and Mitzenmacher 2002). Today, Bloom filters can be used as histograms (Cohen and Matias 2003) and can represent arbitrary functions (Chazelle et al. 2004). Song et al. 2005 suggested using Bloom filters as a hash table summary. This idea was later refined in (Kirsch and Mitzenmacher 2005). Bloom filter-based summaries are also used for minimal perfect hashing (Lu et al. 2006).
2.1 Review of hash table summaries
Our work is based on the schemes presented by Song et al. 2005 and Kirsch and Mitzenmacher 2005, which we will now review for completeness.
The authors argue that c=6 buckets per item suffice. Later in (Kirsch and Mitzenmacher 2010) the authors refine the MHT by limiting the number of moves that items are allowed during insertions. In the most aggressive optimization schemes this can reduce the number of buckets per item to c<2 for n=10^{4} at the cost of additional complexity. Note that this does not affect the on-chip requirements of the MHT summaries, since they are deliberately separated from the actual hash table and their size only depends on the number of items. It has, however, an impact on the size of the occupancy (and deletion) bitmap.
A predecessor to the MHT is the Segmented Hash Table (Kumar and Crowley 2005) that also divides the hash table into multiple segments. Unlike the MHT, however, segments are equally sized. Each segment uses a Bloom filter to support membership queries for an item. The false positive probability needs to be extremely low to prevent sequential or parallel probing of multiple segments. A novel selective filter insertion algorithm minimizes the number of nonzero counters by selecting the segment for insertion that leads to the most empty counters. Thus the false positive probability can be reduced. The authors argue that 16 bits per item of on-chip memory and 16 or more segments suffice to provide good performance. To also support deletions, an additional counting Bloom filter must be kept offline.
The authors later refine segmented hashing in (Kumar et al. 2008), which they name peacock hash. As with the MHT, the idea is to have multiple segments that geometrically decrease in size according to a so-called scaling factor. Each table, except the biggest main table, has an on-chip Bloom filter for membership queries. When an item is searched, the filters of the subtables are queried. If the lookup is unsuccessful, the main table is probed. Again, the false positive probability needs to be extremely low to prevent multiple table accesses. With a scaling factor of 10 (each successive table has a size of 10% of the former) and following the observations in (Kumar and Crowley 2005), about 2 bits per item are needed for the on-chip Bloom filters.
The problem of nondeterministic lookup performance is addressed in (Ficara et al. 2009). Here each item is associated with a fingerprint that is cut into chunks and stored in a small discriminator table. This table is used to index the main table and is stored on-chip. Fingerprints must be unique to prevent collisions. A genetic algorithm is suggested to find the perfect mapping. The authors show that a discriminator table with 4 bits per item can be found in a reasonable amount of time. While it is possible to “build a perfect match […] with fewer [2] bits per item […] the effort […] greatly exceeds the advantages.” ((Ficara et al. 2009), p.141.) Also, being a perfect hashing scheme, it works only on static sets and the discriminator table can only be built if the set of items is known a priori.
Recently, the construction of collision-free hash tables has been discussed in (Li and Chen 2013). The authors proposed the addition of an on-chip summary vector between the Bloom filter summary and the hash table. This summary vector allows deterministic lookup at the cost of additional on-chip memory.
Efficient hash tables
We improve upon previously suggested solutions and design an Efficient Hash Table (EHT). The EHT reduces on-chip memory requirements, provides a constant lookup performance and thus predictability, and, unlike comparable perfect hashing schemes, it is still updatable and works with dynamic sets.
This is achieved by exploiting degrees of freedom present in many lookup-intensive applications. Previous work has shown that flexibility must be bought with on-chip memory. By completely separating updates from lookups, the lookup engine can be optimized independently and precious on-chip memory saved. The offline update engine precomputes all changes on the online structures and only writes necessary changes (Section 4). Further, we observe that the summary’s false positive rate is irrelevant with respect to lookup performance. By ignoring the false positive rate, the length of the on-chip summary can be aggressively reduced (Section 5). However, this leads to an increased rate of collisions, so that multiple items compete for the same bucket. In order to prevent multiple lookups, clever fingerprinting and verification can reduce the sizes of items and allow multiple entries per bucket (Section 7). To further reduce the on-chip summary’s memory cost, we suggest a Huffman compression scheme suitable for real-time (de)compression (Section 6).
The following sections explain the different components in detail. We start by separating the update and lookup engines in Section 4. Next, we explore the effect of the false positive rate on expected counter values and number of collisions (bucket load) in Section 5. Then we show how to further reduce on-chip memory cost by using Huffman compressed Bloom filter summaries (Section 6). Finally, Section 7 shows how to achieve a guaranteed constant lookup time through clever hashing and multi-entry buckets.
Table 1 EHT parameters and equations
Symbol  Description  Affects
n  number of items in table  m, k
c  multiplier for number of buckets  m, k
\(m = 2^{\lceil \log c n \rceil}\)  number of buckets  k
\(k = \lceil \frac{m}{n} \ln 2 \rceil\)  number of hash functions/choices  number of expected items per bucket
χ  maximum allowed counter value  compression rate γ, expected number of CAM entries
ω  on-chip memory word size [bits]  compression rate γ
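The two equations in the table can be evaluated directly. The short Python sketch below does so under the assumption that the logarithm in the bucket count is taken to base 2, which matches the power of two; the function name `eht_parameters` is hypothetical. For n = 10^6 and c = 12.8, 6.4, 3.2, 1.6 and 1 it yields k = 12, 6, 3, 2 and 1, in agreement with the values listed for the expected maximum load in Section 5.2.

```python
import math


def eht_parameters(n: int, c: float) -> tuple[int, int]:
    """Return (m, k) for n items and bucket multiplier c.

    ASSUMPTION: the logarithm in m is base 2.
    """
    m = 2 ** math.ceil(math.log2(c * n))     # number of buckets, a power of two
    k = math.ceil(m / n * math.log(2))       # number of hash functions/choices
    return m, k


for c in (12.8, 6.4, 3.2, 1.6, 1):
    m, k = eht_parameters(10**6, c)
    print(f"c={c}: m={m}, k={k}")
```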
Separate update and lookup engines
Previous suggestions have shown that support for updates is accompanied by enormous overhead to the tables and their summaries. The PFHT needs an additional offline BFHT to identify entries that have to be relocated. The MHT requires an occupancy bitmap and the summaries require either a deletion bitmap for lazy deletions or counting filters.
In most real-world applications, especially those that require fast lookups, updates are much rarer than lookups. By completely separating update and lookup engines, on-chip requirements can be reduced. The idea is to keep two separate summaries. One is kept online in on-chip memory and is optimized for lookups. It does not need to be exact and can differ from the update summary, which is kept offline. Keeping only an approximate online summary allows individual optimization and more efficient encoding. The update engine precomputes all changes and sends modifications to the online structures. This architecture limits the applicability of the EHT to applications that are not update-intensive and do not require real-time updates. That is, we buy optimized lookup performance with decreased update flexibility. The same holds for all previously mentioned summary-based hash tables as well as perfect hashing schemes. We will show that the update complexity of the EHT is comparable to that of its predecessors.
4.1 Maximum counter value
A lookup requires retrieving the leftmost smallest counter in the CBF summary. Successful lookup is guaranteed as long as not all counters corresponding to a key are overflown. If all the counters are overflown, it is not possible to identify the correct bucket. The goal is to identify a maximum allowed counter value χ where the probability that all \(k' < k\) chosen counters for an item equal χ is appropriately small. In essence, choosing an appropriate value for χ is a trade-off between storage saved, the number of counter overflows, and the number of expected lookup failures.
To be able to retrieve all entries, the event that all \(k' < k\) chosen counters equal χ must be dealt with. The easiest solution is to move entries that cannot be retrieved by calculating the counters to CAM. A small CAM must already be maintained for overflown buckets. If χ is chosen appropriately large, the overhead is minimal.
Expected number of CAM entries for different c and χ with \(n = 10^{6}\) inserted items
c  χ=5  χ=4  χ=3
12.8  0  0  0
6.4  0  0  17
3.2  0  47  4183
1.6  285  5181  61110
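The lookup rule with a limited counter range can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `locate`, the list-based summary and the string return tags are hypothetical, and the "absent" case follows ordinary Bloom filter semantics (a zero counter means the key was never inserted).

```python
def locate(counters, cells, chi):
    """Decide where to look for a key.

    counters: online summary, values already capped at chi
    cells:    the key's candidate addresses in hash function order (left to right)
    chi:      maximum allowed counter value
    """
    values = [counters[cell] for cell in cells]
    smallest = min(values)
    if smallest == 0:
        return ("absent", None)       # key cannot be in the table
    if smallest >= chi:
        return ("cam", None)          # all counters overflown: entry lives in CAM
    # list.index returns the first match, i.e. the leftmost smallest counter
    return ("table", cells[values.index(smallest)])
```

For example, with `counters = [0, 2, 1, 4, 4, 1]` and `chi = 4`, a key hashing to cells `[1, 2, 5]` is looked up in bucket 2, while a key hashing to cells `[3, 4]` must be fetched from CAM.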
4.2 Encoding
Limiting the counter range allows for a more compact encoding of the summary. We follow a simple and well-known approach that is also used in (Kirsch and Mitzenmacher 2008) to pack a few counters into one byte. The difference is that we extend the scheme to an arbitrary word size to achieve higher compression rates. We argue that SRAM, being implemented on-chip, can potentially have an arbitrary word size. Basically, the wider the memory, the more counters can be packed into one word and the more bits can be saved. In reality, one will not find memory widths greater than 128 bits.
We will introduce a more sophisticated Huffman compressed summary in Section 6.
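A minimal sketch of such word packing, assuming that the counters (values 0 to χ) are treated as the digits of a base-(χ+1) number. Under this assumption a word of ω bits holds the largest γ_p with (χ+1)^γ_p ≤ 2^ω counters, which gives 27 and 24 counters for ω = 64 with χ = 4 and χ = 5, and 55 and 49 counters for ω = 128, the γ_p values listed in the compression rate table of Section 8.2. All function names are hypothetical.

```python
def gamma_packed(word_bits: int, chi: int) -> int:
    """Largest g such that (chi+1)**g fits into word_bits bits (exact integers)."""
    g = 0
    while (chi + 1) ** (g + 1) <= 2 ** word_bits:
        g += 1
    return g


def pack(counters, chi: int) -> int:
    """Pack counters into one word; the first counter is the lowest digit."""
    word = 0
    for value in reversed(counters):
        word = word * (chi + 1) + value
    return word


def unpack(word: int, chi: int, count: int):
    """Recover `count` counters from a packed word."""
    counters = []
    for _ in range(count):
        word, value = divmod(word, chi + 1)
        counters.append(value)
    return counters
```

Retrieving a single counter costs one memory read plus a few divisions by the constant χ+1, which is why the scheme remains easy to index.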
4.3 Updates
In our design we want to completely separate updates from lookups to keep interference with the lookup process as small as possible. When performing updates, the offline table precomputes all changes and applies them to the online CCBF, PFHT and CAM.
There are three types of entries that must be distinguished. Offline entries are kept in the offline BFHT. Due to overflows, each offline entry has a corresponding online entry either in the online PFHT (table entry) or in extra memory (cam entry). The update engine must be able to identify which of the offline entries in affected buckets are table entries, and which are cam entries. Otherwise, it would not be possible to compute relocations without examining all possible locations in the online structure. Since we want to minimize online table access, all offline entries are paired with a locator. In case the corresponding entry is a table entry, the locator is simply the index of the hash function used to store the table entry. If it is a cam entry, the locator is set to ∞. An offline entry of item x thus is defined as \(E_{\text{offline}}(x) \leftarrow (k, v, i)\), where k denotes the key, v the associated value, and i the locator.
1. The entry is moved inside the table. M is updated with an empty entry at the old bucket. If the new bucket has enough space left, M is updated with the new bucket and r, else r must be moved to cam and M is updated with an ∞ bucket (indicating overflow memory) and r.
2. The entry is moved from cam to table. If the new bucket has enough space left, M is updated with {new bucket, r} and {∞, r}. Else r cannot be moved to table and M is not updated.
3. The entry is moved from table to cam. M is updated with {new bucket, 0} and {∞, r}.
In any case, the locator of a relocated offline entry must be updated.
The actual update of the online structure is performed by the procedure “UpdateOnline”. The update map M contains bucket addresses and their associated content. The buckets in M are simply replaced with their new value. A special case is if bucket address is ∞, which indicates overflow memory. In this case the overflow memory is probed for the associated entries. If the entry is present, it is removed, else it is inserted. The list L contains a list of counter addresses that must be incremented.
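The behavior described for “UpdateOnline” can be sketched in Python as follows. The data layout is an assumption made for the example: M is modeled as a list of (address, content) pairs so that several overflow operations can coexist, the CAM is a set of entries, and the counter increment saturates at χ because counters that have already reached χ do not change.

```python
INF = float("inf")  # stands for the address of the overflow memory (CAM)


def update_online(table, cam, summary, M, L, chi):
    """Apply precomputed changes to the online structures (hypothetical layout)."""
    for address, content in M:
        if address == INF:
            # Toggle semantics: remove the entry if present, insert it otherwise
            if content in cam:
                cam.remove(content)
            else:
                cam.add(content)
        else:
            table[address] = content  # the bucket is replaced as a whole
    for cell in L:
        if summary[cell] < chi:       # saturated counters are left untouched
            summary[cell] += 1
```

Only buckets and counters that actually change are written, which keeps interference with the lookup path small.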
The PFHT needs to be accessed only to write changed buckets. Hence, the complexity is optimal and bounded above by the number of changed buckets. With n items stored in m buckets and \(k = \frac {m}{n} \log 2\) choices, the upper bound is \(O(1 + \frac {n}{m}k) = O(1 + \log 2)\). Similarly, the online CCBF needs to be accessed only for counters that actually change, i.e. those that have not yet reached χ.
Deletions work similarly to insertions with minor differences. The deleted entry x is removed from the offline BFHT prior to collecting entries. Then all entries in affected buckets are collected and relocations computed. Afterwards, the bucket from which the item is removed is added to M if not already present. Then the online updates are performed. Deletions have the same complexity as insertions.
Ignore the false positive probability
Bloom filters are usually constructed to optimize the false positive probability. In the case of the MHT summaries, having a negligibly small false positive rate is essential to prevent type failure. In general, applications that require exact knowledge about set membership are dependent on minimizing false positives. This inevitably leads to relatively large filters.
We observe that applications using Bloom filter-based summaries as an index into another data structure, like the FHT, do not suffer from false positives, as long as a successful lookup independent of the false positive probability is guaranteed. The structure must provide a predictable worst-case lookup performance. A false positive returned by the summary leads to a table lookup that returns NULL. The worst-case performance is not affected. In conclusion, Bloom filter-based summaries can potentially be much smaller.
By reducing the address space of the summary while keeping the number of entries n constant, counter values and the load of buckets are expected to increase. There exists a trade-off between reducing on-chip memory requirements and the resulting counter values and bucket loads.
5.1 Counter values
5.2 Bucket load
Expected maximum load E for different c
c  k  E
12.8  12  1
6.4  6  2
3.2  3  2
1.6  2  3
1  1  5
The resulting problem is how to deal with more than one entry per bucket. A naive solution is to use E memory banks, one for each possible entry, and query them in parallel. The additional cost is acceptable compared to the saved SRAM. In Section 7 we will discuss this issue in more detail and present techniques that allow multiple entries per bucket but do not require parallel or sequential memory accesses.
Summary compression
Section 4 introduced a simple word packing scheme for counting Bloom filters where the counters are packed in memory words. Another form of compressed counting Bloom filters has been proposed by Ficara et al. in 2008. Computing counter values in the MLCCBF is expensive due to the fact that all preceding cells must be evaluated and the bitmaps must be accessed using perfect hash functions. These requirements render the MLCCBF inapplicable as a summary for the EHT, since it needs to return multiple counter values on every lookup to determine the correct bucket of an item.
To achieve real-time (de)compression, the counters must be easily addressable. Storing the compressed counters consecutively is not feasible. Without the help of complex indexing structures one could not retrieve a specific value. When compressing the offline CBF we calculate the maximum number of counters γ _{h} that can be compressed in one memory word, such that each word encodes exactly γ _{h} counters. A first approach to compress the counters is shown in Algorithm 2.
The algorithm runs as long as not all counters have been processed. It iteratively tries to fit as many counters into a word ω as allowed by the compression rate γ _{h}, which is initialized to ∞. If the bit length of ω would exceed the word size, everything is reset and restarted with γ _{h} set to the last number of counters in ω. This ensures that every word (except the last) has exactly γ _{h} counters encoded and allows easy indexing.
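The iteration described above can be sketched as follows. The sketch only computes γ_h from the code lengths and does not emit the compressed bits. The code lengths are an assumption: a unary-style Huffman code in which value v takes v+1 bits and the two largest values share the length χ, so that the code for counter value 2 is three bits long. The function names are hypothetical, and the initial value ∞ is replaced by the total number of counters.

```python
def code_length(value: int, chi: int) -> int:
    """ASSUMED Huffman code lengths: 1, 2, 3, ... bits, capped at chi bits."""
    return min(value + 1, chi)


def fit_gamma(counters, word_bits: int, chi: int) -> int:
    """Find the number of counters per word so that every word fits."""
    if word_bits < chi:
        raise ValueError("a single code must fit into one word")
    if not counters:
        return 0
    gamma = len(counters)  # stands in for the initial value of infinity
    restart = True
    while restart:
        restart = False
        for start in range(0, len(counters), gamma):
            used = 0
            for j, value in enumerate(counters[start:start + gamma]):
                used += code_length(value, chi)
                if used > word_bits:
                    gamma = j          # counters that did fit into this word
                    restart = True     # reset and start over with the new gamma
                    break
            if restart:
                break
    return gamma
```

With a 4-bit word, eight zero counters give γ_h = 4, whereas the sequence 0, 0, 0, 0, 2, 2, 0, 0 gives γ_h = 1, since the two adjacent counters of value 2 do not fit into one word together. This illustrates how strongly the result depends on the order of the counter values.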
This algorithm has an obvious flaw. It depends heavily on the sequence of counters, leading to an unpredictable compression rate γ _{h}. In addition, the compression is wasteful in storage. Since γ _{h} depends on the sequence of counter values, it is bounded above by the longest code sequence that can be compressed into one word. Without compression, every counter occupies three bits, which equals the length of the Huffman code for c=2. Thus, if during compression a long sequence of counters ≥2 is found, the compression rate γ _{h} will degenerate.
A better approach is to define γ _{h} in advance such that a desired compression rate is achieved. In general, Huffman compression only achieves an improvement over word-packed compression if γ _{h}>γ _{p}. Thus, γ _{p} can be used as a guideline for choosing γ _{h}. Since we fix γ _{h} in advance, word overflows can occur if the compressed γ _{h} counters do not fit into a word (in the following we will refer to this scheme as hard compression).
Overflows can also occur during insertions. If a counter c<χ−1 is incremented and the compressed word already occupies all the available bits, then incrementing the counter will shift one bit out of the word. As a result the last counter value will not be retrievable.
There are different approaches to addressing word overflows. One is to simply ignore the affected counters and assume they have value χ. As long as these counters are not the smallest for any entry, the lookup process is not affected. If, however, the actual counter value is crucial to the lookup, the correct bucket of an entry cannot be computed.
Alternatively, the longest code in the word could be replaced with a shorter overflow code, indicating that an overflow occurred. However, this would increase the length of nearly all counter codes and in return the probability of word overflows.
Probably the best solution is to keep a small extra memory, CAM or registers, to store the overflown bits. If counters that are completely or partially overflown must be retrieved, the remaining bits are read from the extra memory. We will show in Section 8 that, depending on γ _{h} and χ, the cost of additional memory is reasonably small.
Achieving deterministic lookups
A hash table bucket usually holds a single entry or a reference to a collection of entries. If more than one entry is placed in a bucket, lookup might require multiple memory reads by following pointers. This leads to more sophisticated hash table constructions that try to limit the bucket load to one with high probability.
We argue that by using intelligent hashing and wider memory a bucket can hold more than a single entry without the need for sequential or parallel memory accesses. As a preliminary, we define that a bucket will never hold a reference to a collection of entries with variable size. A bucket is defined as an array of entries of fixed size, where every entry can be directly accessed.
7.1 Multiple entries per word
One solution is to allow more entries per memory word. Let ω _{D} be the word size in bits and e be the size of an entry in bits. If e≪ω _{D}, a bucket can hold up to \(\left \lfloor \frac {\omega _{\mathrm {D}}}{e} \right \rfloor \) entries which can be read in one cycle. This holds for applications, like QoS/CoS classification, flow-based server load balancing or socket lookups, that store only small entries. But many applications require larger entries (e.g. IPv6 lookup). While SRAM width is highly flexible, the word size of DRAM is usually fixed, so wider memory might not be possible.
The verifier and the index are derived by bit extraction. Let \(h_{\{0,\dots,k-1\}}\) be the k digests, then \(V(h_{\{0,\dots,k-1\}})\) produces the verifiers and \(A(h_{\{0,\dots,k-1\}})\) extracts the bucket indexes, or addresses. Instead of the key x only its verifier \(V(h_{i}(x))\) is stored in bucket \(A(h_{i}(x))\). To be able to identify which verifier corresponds to a given key, an identifier must be kept along with the verifier, stating the hash function i that produced the stored verifier \(V(h_{i}(x))\). A table entry then consists of the verifier, its identifier (which is the index of the hash function), and the associated value v. Hence, \(E(x) \leftarrow (V(h_{i}(x)), v, i)\). The total number of bits needed is \(\log k + (|H| - |A|) + |v|\), where \(|y|\) denotes the length of y in bits. Note that the smaller \(|A|\), the larger \(|V|\). Thus the length of the table competes with the size of the entries.
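The bit extraction can be illustrated with a short Python sketch. Which bits of the digest form the address and which the verifier is not fixed by the text; the sketch assumes that the low |A| bits index the bucket and the remaining high bits form the verifier. The function names are hypothetical, and the identifier width is rounded up to whole bits.

```python
import math


def split_digest(digest: int, digest_bits: int, address_bits: int):
    """Split one digest into (verifier, address).

    ASSUMPTION: low bits are the bucket address, high bits the verifier.
    """
    digest &= (1 << digest_bits) - 1              # keep exactly |H| bits
    address = digest & ((1 << address_bits) - 1)  # A(h_i(x)), |A| bits
    verifier = digest >> address_bits             # V(h_i(x)), |H| - |A| bits
    return verifier, address


def entry_bits(k: int, digest_bits: int, address_bits: int, value_bits: int) -> int:
    """Bits per table entry: identifier + verifier + value."""
    identifier_bits = math.ceil(math.log2(k))
    return identifier_bits + (digest_bits - address_bits) + value_bits
```

For instance, with k = 3 hash functions, 32-bit digests, 2^22 buckets and 16-bit values, an entry occupies 2 + 10 + 16 = 28 bits.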
7.2 Multiple words per bucket
An extension to the former scheme is to allow a bucket to span multiple words. For simplicity, we assume the words are consecutive, although this is not a precondition, as long as there is a fixed offset between the words. A bucket can now be seen as a matrix of r entries per word and w words.
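As a small worked example of this layout, the following sketch computes the capacity of a bucket that spans several words and maps a slot number to its position in the r by w matrix. The function names and the row-major slot numbering are assumptions for illustration.

```python
def bucket_layout(word_bits: int, entry_bits: int, words_per_bucket: int):
    """Return (r, capacity): entries per word and entries per bucket."""
    entries_per_word = word_bits // entry_bits          # r = floor(word size / e)
    return entries_per_word, entries_per_word * words_per_bucket


def slot_position(slot: int, entries_per_word: int):
    """Map a slot number to (word offset, position inside the word)."""
    return divmod(slot, entries_per_word)
```

With 128-bit words, 28-bit entries and two words per bucket, each word holds four entries and the bucket holds eight. Slot 5 is then the second entry of the second word.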
Results and discussion
In this section we present and discuss results of a conceptual implementation of the EHT. The implementation is conceptual in the sense that it does not fully resemble the structure of the EHT but simulates its behavior appropriately.
Table 1 shows the parameters and equations that play a crucial role in evaluating the effects of different configurations.
Parameter configurations p of the software simulations
bit position  3  2  1  0
parameter  n  c  χ  ω
bit value 0  10^{5}  1.6  4  64
bit value 1  10^{6}  3.2  5  128
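The result tables below refer to configurations by a hexadecimal digit p whose four bits select the parameter values shown above. A minimal Python sketch of this decoding, with the hypothetical names `PARAMS` and `decode_configuration`:

```python
# (parameter, value for bit 0, value for bit 1), listed from bit 3 down to bit 0
PARAMS = (
    ("n", 10**5, 10**6),
    ("c", 1.6, 3.2),
    ("chi", 4, 5),
    ("omega", 64, 128),
)


def decode_configuration(p: int) -> dict:
    """Translate a configuration number 0x0..0xF into parameter values."""
    if not 0 <= p <= 0xF:
        raise ValueError("p must be a 4-bit value")
    config = {}
    for position, (name, low, high) in zip((3, 2, 1, 0), PARAMS):
        config[name] = high if (p >> position) & 1 else low
    return config
```

For example, `decode_configuration(0xC)` selects n = 10^6, c = 3.2, χ = 4 and ω = 64.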
For each simulation we perform ten trials; that is, we instantiate the EHT and fill it with n random keys and values. No updates are performed, but the EHT is queried for all n keys and an additional 2n random keys to verify that every key can be retrieved and to analyze the false-positive probability. As summary we use the HCCBF. The compression rate γ _{h} is calculated using Algorithm 2. No hard compression is used, since we want to evaluate the quality of the compression algorithm. The cost of using hard compression can be derived by examining the resulting HCCBF and is included in the analysis.
For each trial, we calculate the size of the offline CBF, the size of a CCBF and the size of the online HCCBF. We count the frequency of all counter values in the offline summary and derive the number of overflown counters in the online summary. Every compressed word in the HCCBF is analyzed for the number of bits that are actually used to encode counters, resulting in a histogram of code lengths per word. In addition, the load of all online buckets is calculated and the number of CAM entries counted. Finally, we compare the on-chip requirements of the EHT with the theoretical requirements of the MHT and FHT.
8.1 Constant lookups
We first evaluate the performance of the EHT with respect to lookups. To achieve deterministic lookup performance, it is crucial that counter value distribution and bucket loads behave as expected. Counter distribution affects the maximum allowed counter value, which in turn affects the effectiveness of summary compression and the number of entries that have to be moved to CAM due to counter overflows.
Entry distribution and expected maximum load
p     E    Load 0     Load 1    Load 2   Load 3   Load 4
0-3   3    167662     89728     5327     24       0
4-7   2    424659     99411     369      0        0
8-B   3    1184464    837562    80950    684      1
C-F   2    3204894    980039    10438    1        0
In the worst case there was only a single unexpected bucket overflow, for tables with n=10^{6} and c=1.6. In all other cases no bucket overflow occurred. As long as c>1.6, no overflows are to be expected. Again, the experimental results agree with the theoretical assumptions.
Real and expected number of CAM entries
p     min     max     avg        E
0-1   144     209     177.95     178
2-3   2       11      6.05       6
4-5   0       1       0.15       0
6-7   0       0       0.00       0
8-9   5017    5446    5194.05    5181
A-B   236     287     258.20     265
C-D   40      61      47.00      47
E-F   0       0       0.00       0
Once again, the results closely resemble the expectations.
8.2 Onchip memory
We now evaluate the required on-chip memory for the EHT summary according to different parameter configurations and compare the results to related work. We consider EHT summaries with no compression (γ_{0}), with word-packed encoding (Section 4, γ_{p}) and with Huffman compression (Section 6, γ_{h}).
Compression. To analyze the achieved compression we take the minimum, maximum and average γ_{h} and compare them to γ_{p} and to the number of counters per word if no compression is used (denoted γ_{0}). We also include the maximum number of bits actually used to compress the counters.
Compression rate
n        c     χ   ω     γ_{h} min   γ_{h} max   γ_{h} avg   γ_{p}   γ_{0}   max bits
10^{6}   1.6   4   64    22          24          22.8        27      21.3    63.3
10^{6}   1.6   5   64    21          22          21.5        24      21.3    63.3
10^{6}   1.6   4   128   50          53          51.0        55      42.6    126.4
10^{6}   1.6   5   128   47          51          49.5        49      42.6    125.1
10^{6}   3.2   4   64    23          26          24.6        27      21.3    62.7
10^{6}   3.2   5   64    24          25          24.9        24      21.3    63.2
10^{6}   3.2   4   128   56          59          57.7        55      42.6    126.3
10^{6}   3.2   5   128   55          58          56.9        49      42.6    126.3
10^{5}   1.6   4   64    25          27          26.0        27      21.3    62.6
10^{5}   1.6   5   64    24          26          25.4        24      21.3    62.5
10^{5}   1.6   4   128   57          60          58.8        55      42.6    126.6
10^{5}   1.6   5   128   55          60          57.8        49      42.6    125.7
10^{5}   3.2   4   64    23          26          25.5        27      21.3    63.0
10^{5}   3.2   5   64    23          26          24.6        24      21.3    62.1
10^{5}   3.2   4   128   57          60          58.3        55      42.6    126.9
10^{5}   3.2   5   128   56          59          57.0        49      42.6    125.8
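The γ_{0} and γ_{p} columns can be reproduced with a few lines of arithmetic. The sketch below assumes that an uncompressed counter occupies the smallest fixed number of bits that can hold the values 0..χ, and that word packing stores the counters of one word as digits of a base-(χ+1) number. Both are assumptions made for this illustration; they are chosen because they match the tabulated values, and the function names are not taken from the original implementation.

```python
def bits_per_counter(chi):
    """Fixed-width bits needed for a counter with values 0..chi."""
    return chi.bit_length()          # equals ceil(log2(chi + 1)) for chi >= 1


def gamma_0(omega, chi):
    """Counters per omega-bit word without compression."""
    return omega / bits_per_counter(chi)


def gamma_p(omega, chi):
    """Counters per word when packed as base-(chi + 1) digits.

    Largest g such that (chi + 1)**g distinct combinations fit into
    omega bits, i.e. (chi + 1)**g <= 2**omega. Integer arithmetic avoids
    floating-point rounding near the boundary.
    """
    g = 0
    while (chi + 1) ** (g + 1) <= 2 ** omega:
        g += 1
    return g


if __name__ == "__main__":
    for omega in (64, 128):
        for chi in (4, 5):
            print(omega, chi, gamma_0(omega, chi), gamma_p(omega, chi))
```

Under these assumptions γ_{p} evaluates to 27, 24, 55 and 49 for the four (χ, ω) combinations, as in the table. γ_{0} evaluates to 64/3 and 128/3; the table lists the latter as 42.6, which suggests truncation rather than rounding.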
Comparison of on-chip requirements of different Bloom filter-based summaries for n=10^{6}
Peacock hash and the discriminator table clearly require the fewest bits per item of on-chip memory. However, Peacock hashing requires a significant amount of hashing and is not deterministic. It requires multiple sequential or parallel lookups in the worst case, which might not be acceptable depending on the application. The discriminator table is a perfect-hashing type of table that only works with static sets and is not updatable. The Segmented hash table outperforms all configurations of the MHT summaries as well as the Fast Hash Table. Of the MHT summaries, the MBF summaries are preferable to their corresponding SF summaries. The FHT lies between the SF lazy and MBF counting schemes, with lookup and update performance comparable to the MHT counting summaries. In fact, if update (especially deletion) support is important, one should use the FHT rather than the MHT.
On-chip requirements of EHT configurations with n=10^{6}
Configuration p   Uncompressed KiB   bpi     Packed KiB   bpi     Huffman KiB   bpi
8                 787                6.29    622          4.97    737           5.89
9                 787                6.29    611          4.88    659           5.26
A                 787                6.29    700          5.59    781           6.24
B                 787                6.29    685          5.48    679           5.42
C                 1573               12.58   1243         9.94    1367          10.93
D                 1573               12.58   1221         9.76    1164          9.31
E                 1573               12.58   1399         11.18   1348          10.78
F                 1573               12.58   1370         10.96   1180          9.44
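The packed and Huffman sizes are related to the uncompressed size through the number of counters per word. As a rough check, and assuming three bits per uncompressed counter (which holds for χ=4 and χ=5), a summary that fits γ counters into an ω-bit word should need about γ_{0}/γ times the uncompressed memory, with γ_{0}=ω/3. This relation is inferred from the tabulated values rather than stated explicitly, and the function name is illustrative.

```python
def scaled_size(uncompressed_kib, omega, gamma, counter_bits=3):
    """Estimate summary size when gamma counters fit into one omega-bit word.

    uncompressed_kib: size with fixed-width counters of counter_bits bits.
    gamma: counters per word for the encoding under consideration.
    """
    gamma_0 = omega / counter_bits       # counters per word, uncompressed
    return uncompressed_kib * gamma_0 / gamma


if __name__ == "__main__":
    # Configuration 8 (chi = 4, omega = 64): packed gamma = 27,
    # average Huffman gamma = 22.8 according to the compression table.
    print(scaled_size(787, 64, 27))
    print(scaled_size(787, 64, 22.8))
```

With the γ values of the compression table this estimate agrees with the packed column to within about one KiB. For the Huffman column the agreement is to within a few KiB, since the average γ_{h} is itself a rounded value.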
Summary. The results fully meet the expectations and back up our theoretical analysis. We have shown that our initial assumptions allow fundamental improvements over previous suggestions. Experiments have shown that the EHT performs as theoretically expected. This makes the EHT highly predictable and allows easy configuration for target applications. The effects of parameters on counter distribution, bucket load, and counter and bucket overflows can easily be predicted. The evaluation shows which hardware configurations are required for specific parameter sets. The effect of Huffman compression is much harder to predict, since all possible combinations of counter values per word would need to be predicted, which is impractical. However, the evaluation shows the effect of Huffman compression compared to no compression and simple word-based encoding.

Reducing the length m is achieved by ignoring the false-positive probability. As a result, bucket loads will increase, which can be compensated for by parallel banks, by increasing the off-chip memory width or by better hashing. Analysis has shown that the expected maximum load will not exceed 3 as long as \(\frac {m}{n}>2\). Bucket overflows are extremely rare, even for a large set of items, so only a very small extra overflow memory is needed.

By separating updates from lookups, the lookup summary can be optimized for smaller size and better performance. The lookup summary is not exact, and its counter range is limited to [χ]. The choice of χ depends on the fraction \(\frac {m}{n}\). Starting with χ=5 for \(2 < \frac {m}{n} < 2.5\), χ can be decremented by one each time \(\frac {m}{n}\) is doubled, for a small overhead in terms of CAM. Performance will degrade when \(\frac {m}{n} \rightarrow 2\).
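The rule of thumb for choosing χ can be written as a small helper. The sketch below is one possible reading of that rule and not a definitive procedure: it counts how often m/n has doubled relative to the lower bound of 2, keeps the starting value for ratios between 2.5 and 4 (a conservative assumption, since the text does not cover that range), and never goes below a hypothetical minimum of χ=1.

```python
import math


def choose_chi(m_over_n, chi_start=5, chi_min=1):
    """Pick a counter limit chi from the ratio m/n.

    chi_start applies just above m/n = 2; chi drops by one for every
    doubling of m/n. chi_min is an assumed floor, not from the text.
    """
    if m_over_n <= 2:
        raise ValueError("the rule assumes m/n > 2")
    doublings = math.floor(math.log2(m_over_n / 2))
    return max(chi_min, chi_start - doublings)


if __name__ == "__main__":
    for ratio in (2.1, 4.2, 8.4):
        print(ratio, choose_chi(ratio))
```

Any concrete choice should still be checked against the expected number of CAM entries for the target configuration.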

The effect of Huffman compression depends on the word size ω and the counter limit χ. Word packing is preferable to Huffman compression in both complexity and resulting size, unless ω and χ are large. At the cost of a few additional CAM cells, the performance of Huffman compression can be improved by reducing ω while keeping χ constant.
Depending on the costs of the components, the parameters for the EHT can be chosen such that the total cost is minimized.
Conclusion
We have shown that, through relaxation of requirements and exploitation of degrees of freedom, on-chip memory requirements can be significantly reduced and lookup performance improved at the cost of minimal additional hardware. Based on four key ideas we have introduced new techniques to design an Efficient Hash Table. Our suggested improvements can be applied individually or in concert and are fully customizable to meet the requirements of the target application and hardware. The cost of each component is analyzed and evaluated, and a cost function is provided that allows calculating the overall cost. The simulation results fully meet the expectations, back up our theoretical analysis and allow accurate predictions. Furthermore, we presented a thorough evaluation of the space requirements not only of multiple EHT configurations but also of its predecessors, the FHT and MHT, as well as of segmented and peacock hash and the discriminator table, in the presence of a million entries.
High amounts of on-chip memory can be traded in for comparatively small amounts of off-chip memory, additional CAM, and some computational overhead. Cleverly chosen hash functions allow the reduction of off-chip memory size. Offloading update overhead to offline structures leads to a more optimized lookup engine and allows improved encoding. We proposed two compression schemes for the summary that provide real-time performance and are easy to implement. Combined, the presented design achieves an improvement over previous solutions of up to an order of magnitude, guarantees constant O(1) lookup time, and supports near real-time updates while requiring only a few bits per item of on-chip memory.
References
 Azar, Y, Broder AZ, Karlin AR, Upfal E (1994) Balanced allocations In: SIAM Journal on Computing, 593–602.. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.Google Scholar
 Bloom, BH (1970) Space/time tradeoffs in hash coding with allowable errors. Commun ACM 13(7): 422–426. doi:10.1145/362686.362692.View ArticleGoogle Scholar
 Bonomi, F, Mitzenmacher M, Panigrahy R, Singh S, Varghese G (2006) An improved construction for counting bloom filters In: Proceedings of the 14th Conference on Annual European Symposium Volume 14.. Springer. doi:10.1007/11841036_61.Google Scholar
 Broder, AZ, Karlin AR (1990) Multilevel adaptive hashing In: SODA ’90: Proceedings of the First Annual ACMSIAM Symposium on Discrete Algorithms, 43–53.. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.Google Scholar
 Broder, A, Mitzenmacher M (2001) Using multiple hash functions to improve ip lookups In: INFOCOM 2001. Twentieth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE, 1454–14633. doi:10.1109/INFCOM.2001.916641.Google Scholar
 Broder, A, Mitzenmacher M (2002) Network applications of bloom filters: A survey In: Internet Mathematics, 636–646. http://www.tandfonline.com/doi/abs/10.1080/15427951.2004.10129096
 Carter, LJ, Wegman MN (1977) Universal classes of hash functions In: Proceedings of the Ninth Annual ACM Symposium on Theory of Computing.. ACM, New York, NY, USA.Google Scholar
 Chazelle, B, Kilian J, Rubinfeld R, Tal A (2004) The bloomier filter: an efficient data structure for static support lookup tables In: SODA ’04: Proceedings of the Fifteenth Annual ACMSIAM Symposium on Discrete Algorithms, 30–39.. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA. http://portal.acm.org/citation.cfm?id=982792.982797 Google Scholar
 Cohen, S, Matias Y (2003) Spectral bloom filters In: SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, 241–252.. ACM, New York, NY, USA. doi:10.1145/872757.872787.View ArticleGoogle Scholar
 Dietzfelbinger, M, Mehlhorn K, Rohnert H, Karlin A, Meyer auf der Heide F, Tarjan RE (1994) Dynamic perfect hashing: Upper and lower bounds. SIAM J Comput 23(4): 748–761.View ArticleGoogle Scholar
 Fan, L, Cao P, Almeida J, Broder A (1998) Summary cache: A scalable widearea web cache sharing protocol In: Proceedings of ACM SIGCOMM, 254–265.. ACM, New York, NY, USA.Google Scholar
 Ficara, D, Giordano S, Procissi G, Vitucci F (2008) Multilayer compressed counting bloom filters. INFOCOM 2008. The 27th Conference on Computer Communications. IEEE: 311–315. doi:10.1109/INFOCOM.2008.71.Google Scholar
 Ficara, D, Giordano S, Kumar S, Lynch B (2009) Divide and discriminate: algorithm for deterministic and fast hash lookups In: Proceedings of the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems. ANCS ’09, 133–142.. ACM, New York, NY, USA. doi:10.1145/1882486.1882519. http://doi.acm.org/10.1145/1882486.1882519.View ArticleGoogle Scholar
 Fredman, ML, Komlós J, Szemerédi E (1984) Storing a sparse table with O(1) worst case access time. J ACM 31(3): 538–544. doi:10.1145/828.1884.View ArticleGoogle Scholar
 Hagerup, T, Tholey T (2001) Efficient minimal perfect hashing in nearly minimal space. In: Ferreira A Reichel H (eds)STACS 2001. Lecture Notes in Computer Science, vol 2010, 317–326.. Springer. 10.1007/3540446931_28. http://dx.doi.org/10.1007/3540446931_28
 Kirsch, A, Mitzenmacher M (2005) Simple summaries for hashing with multiple choices. In: 43rd Annual Allerton Conference on Communication, Control and Computing. University of Illinois, Urbana, IL, USA.Google Scholar
 Kirsch, A, Mitzenmacher M (2008) Simple summaries for hashing with choices. IEEE/ACM Trans Netw 16(1): 218–231. doi:10.1109/TNET.2007.899058.View ArticleGoogle Scholar
 Kirsch, A, Mitzenmacher M (2010) The power of one move: hashing schemes for hardware. IEEE/ACM Trans Netw 18(6): 1752–1765. doi:10.1109/TNET.2010.2047868.View ArticleGoogle Scholar
 Kumar, S, Turner J, Crowley P (2008) Peacock hashing: Deterministic and updatable hashing for high performance networking In: INFOCOM 2008. The 27th Conference on Computer Communications, 101–105.. IEEE. doi:10.1109/INFOCOM.2008.29.Google Scholar
 Kumar, S, Crowley P (2005) Segmented hash: an efficient hash table implementation for high performance networking subsystems In: Proceedings of the 2005 ACM Symposium on Architecture for Networking and Communications Systems. ANCS ’05, 91–103.. ACM, New York, NY, USA. doi: 10.1145/1095890.1095904. http://doi.acm.org/10.1145/1095890.1095904 View ArticleGoogle Scholar
 Li, D, Chen P (2013) Summaryaided bloom filter for highspeed named data forwarding In: High Performance Switching and Routing (HPSR), 2013 IEEE 14th International Conference On, 225–226. doi:10.1109/HPSR.2013.6602321.Google Scholar
 Lu, Y, Prabhakar B, Bonomi F (2006) Perfect hashing for network applications In: 2006 IEEE International Symposium on Information Theory, 2774–2778.. IEEE press. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4036478
 Mitzenmacher, MD (1996) The power of two choices in randomized load balancing. PhD thesis. Harvard University, Cambridge, MA, USA.Google Scholar
 Mitzenmacher, M (2001a) The power of two choices in randomized load balancing. IEEE Trans Parallel Distributed Sys 12(10): 1094–1104.View ArticleGoogle Scholar
 Mitzenmacher, M (2001b) Compressed bloom filters In: Proc. of the 20th Annual ACM Symposium on Principles of Distributed Computing. IEEE/ACM Trans. on Networking, 144–150. https://dl.acm.org/citation.cfm?id=581878
 Song, H, Dharmapurikar S, Turner J, Lockwood J (2005) Fast hash table lookup using extended Bloom filter: An aid to network processing In: SIGCOMM ’05, 181–192.. ACM Press, New York, NY, USA. doi:10.1145/1080091.1080114.View ArticleGoogle Scholar
 Vöcking, B (2003) How asymmetry helps load balancing. J ACM 50(4): 568–589. doi:10.1145/792538.792546.View ArticleGoogle Scholar
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.