Wednesday, May 16, 2012

To hash, or to map?

As the question in the title suggests, in this post I will compare two C++ standard library containers, unordered_map (formerly known as hash_map in pre-standard implementations) and map. The focal point of the comparison is memory -- which container uses less of it? For this purpose, I created a simple program which inserts N random keys into the data structure. Then I needed to obtain memory-usage information. At first glance, it seemed that

mallinfo().uordblks

will be sufficient. However, after a painful exercise with std::vector, it seems that vector (and therefore probably also unordered_map) gets its memory by a slightly different route: vector still allocates through malloc(), but glibc's malloc() serves large requests with mmap() instead of the ordinary heap, and mmap()-ed blocks are reported in hblkhd rather than uordblks. Thus, the real memory usage is more like

int mem = mallinfo().hblkhd + mallinfo().uordblks;
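
To see the split for yourself, here is a minimal sketch (mine, not part of the original test) that allocates one small and one large block. With glibc, allocations above the mmap threshold (128 KiB by default, tunable via mallopt) are served by mmap() and show up in hblkhd instead of uordblks:

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>

static void report(const char* label) {
  struct mallinfo mi = mallinfo();
  printf("%-14s uordblks=%d hblkhd=%d\n", label, mi.uordblks, mi.hblkhd);
}

int main() {
  report("start:");
  void* small_block = malloc(100);            // main arena -> counted in uordblks
  report("after 100 B:");
  void* big_block = malloc(8 * 1024 * 1024);  // above mmap threshold -> counted in hblkhd
  report("after 8 MiB:");
  free(small_block);
  free(big_block);
  report("after free:");
  return 0;
}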

Anyway, here is the program (flip the HASH define to 1 to switch from map to the hash table; the unordered_map variant needs a C++11-capable compiler, e.g. g++ -std=c++0x):

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>

#define HASH 0   // set to 1 to test unordered_map instead of map

#if HASH
  #include <unordered_map>
  typedef std::unordered_map<int, int> mymap;
  const char* text = "hash";
#else
  #include <map>
  typedef std::map<int, int> mymap;
  const char* text = "map";
#endif

const int Ki = 1000;   // decimal kilo/mega, to match the result tables below
const int Mi = 1000 * Ki;

const int TESTS = 5;
int test_sizes[TESTS] = {Ki, 10 * Ki, 100 * Ki, Mi, 10 * Mi};

int main() {
  mymap mapa;
  for (int t = 0; t < TESTS; t++) {
    // Keep inserting random keys until the map reaches the target size;
    // duplicate keys just increment an existing entry, so we loop on size().
    while (mapa.size() < (size_t)test_sizes[t]) {
      int k = rand();
      mapa[k]++;
    }
    // total memory: mmap()-ed blocks + ordinary malloc()-ed blocks
    struct mallinfo mi = mallinfo();
    int mem = mi.hblkhd + mi.uordblks;
    printf("%s: Items: %d, Mem: %d, per-entry: %.1f\n", text,
           test_sizes[t], mem, mem * 1.0 / test_sizes[t]);
  }
  return 0;
}

And the results:
(Note that the two integers, key and value, themselves consume only 8 bytes per entry.)

map: Items: 1000, Mem: 48000, per-entry: 48.0
map: Items: 10000, Mem: 480000, per-entry: 48.0
map: Items: 100000, Mem: 4800000, per-entry: 48.0
map: Items: 1000000, Mem: 48000000, per-entry: 48.0
map: Items: 10000000, Mem: 480000000, per-entry: 48.0

hash: Items: 1000, Mem: 46016, per-entry: 46.0
hash: Items: 10000, Mem: 441504, per-entry: 44.2
hash: Items: 100000, Mem: 4211808, per-entry: 42.1
hash: Items: 1000000, Mem: 40454240, per-entry: 40.5
hash: Items: 10000000, Mem: 463691872, per-entry: 46.4

The conclusion? Both the hash table and the binary tree use roughly the same amount of memory (the hash table a bit less, though its usage fluctuates as the table is resized). And the overhead is quite big -- the 8-byte integer key-value pair ends up costing 5 to 6 times that much memory.
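
Where does the 48-bytes-per-entry figure come from? A rough accounting, assuming a libstdc++-style node layout and glibc's malloc on a 64-bit machine (my assumption, not something the test verifies): a red-black tree node carries a color field and three pointers next to the key/value pair, which gives a 40-byte node, and malloc rounds each such allocation up to a 48-byte chunk. A hash node only needs the chain pointer next to the pair, but the bucket array adds back roughly a pointer per entry. A mock layout:

#include <stdio.h>
#include <utility>

// Hypothetical mock of a libstdc++-style red-black tree node (64-bit).
struct MockTreeNode {
  int color;                     // red/black flag, padded out to 8 bytes
  MockTreeNode* parent;
  MockTreeNode* left;
  MockTreeNode* right;
  std::pair<const int, int> kv;  // the payload: 8 bytes
};

// Hypothetical mock of a libstdc++-style hash node (the hash of an int is not cached).
struct MockHashNode {
  MockHashNode* next;            // singly linked bucket chain
  std::pair<const int, int> kv;
};

int main() {
  printf("tree node: %zu bytes\n", sizeof(MockTreeNode));  // 40 -> a 48-byte malloc chunk
  printf("hash node: %zu bytes\n", sizeof(MockHashNode));  // 16 -> a 32-byte malloc chunk
  return 0;
}

With about 32 bytes per node chunk plus roughly 8 bytes of bucket array per entry at a load factor near 1, the hash side would land in the observed 40-46 byte range, spiking whenever the table doubles its buckets.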
