Lockless Inc

Benchmarks

To show how fast the Lockless memory allocator is compared to others, we use the t-test1 benchmark. This benchmark is given a total amount of memory to use, and a maximum size of allocation to make. The benchmark chooses random sizes of memory below this maximum to call malloc(), calloc(), free(), realloc() and memalign(). The memory space is split into bins, and a specified number of parallel threads are used to access each bin. When a thread is done with a bin, it releases it, and acquires another. This causes the benchmark to test memory allocated on one thread, and freed in another.

Linux

We test how the memory allocation libraries vary with average allocation size, and total number of threads. We use an Opteron 280 system with a total of 4 processor cores to run this test. The test is run with 1,2,3,4,5,8,12,16, and 20 threads. Note that above 4 threads, the performance will be limited by the lack of processors. We vary the maximum allocation in the benchmark by powers of 2 from 32bytes up to 128KiB.

The resulting data is plotted below by number of threads, and maximum allocation size running from smallest to largest for each thread. We compare the Lockless memory allocator to the glibc 2.9 allocator (a variant of ptmalloc), jemalloc 1.0.0, hoard 371, and tcmalloc 0.8 The total time taken for each benchmark run is shown.

t-test1.c plot for all allocators

As can be seen, some of the allocators have extreme slowdowns in some cases. The hoard allocator becomes very slow if the average allocation size exceeds 64KiB. This appears to be due to thread contention in accessing a process-global pool. Conversely, the glibc allocator is slow for small allocations. This may be due to it not using a slab-based algorithm for small sizes. Finally, note that all allocators take more time in the small-size single threaded case. This is due to the design of the benchmark, where the total number of allocations performed serially is inversely proportional to their maximum size and total number of threads to keep the total amount of memory allocated bounded.

Glibc allocator

The standard allocator in Linux systems is that in the glibc library. This allocator is based on the ptmalloc2 library, with tweaks added over time. The allocator uses mutually exclusive "arenas". A thread will try to lock an arena to do the work it needs. If another thread is using that arena, then the first thread either waits, or creates a new arena. This contention between threads reduces performance in multithreaded applications. In addition, if one thread allocates memory, and a second frees it in a producer-consumer pattern, then large amounts of memory may be wasted. As can be seen below, the Lockless memory allocator performs about 2 times faster, with the performance gap dropping once the number of threads exceeds the number of cpus.

t-test1.c plot for lockless and glibc allocators

Jemalloc allocator

The standard allocator on *BSD systems is the jemalloc allocator. This allocator is also used in the Mozilla browser due to its lower heap usage, lowering cache misses and swapping. Like the glibc allocator, this allocator uses mutually exclusive "arenas". It has similar performance characteristics to the glibc allocator, but at least on this machine, is slower than it in all tested situations. The Lockless memory allocator performs about 2 times faster, with the performance gap dropping once the number of threads exceeds the number of cpus.

t-test1.c plot for lockless and jemalloc allocators

Tcmalloc allocator

The tcmalloc allocator is designed with threading in mind, and as a result gives consistently good performance. Its one weakness is that larger allocations take much longer. As can be seen below, the Lockless memory allocator is everywhere faster than this allocator, averaging about twice as fast for small allocations, and three to four times as fast for larger allocations.

t-test1.c plot for lockless and tcmalloc allocators

Hoard allocator

The hoard allocator performs similarly to tcmalloc, except that the rate of speed reduction for large allocations is higher, and the point where the slowdown occurs is different. Also note the "noise" at larger thread numbers. Once the number of threads exceeds the number of processors it becomes more likely that a thread may be waiting for a shared resource owned by another thread that isn't running on any cpu. The result is that the sleeping thread needs to be migrated to the cpu of the newly waiting thread to gain processor cycles. This causes extra cache-line transfers as the threads are moved randomly about instead of staying on their "home" processor. The numerical noise in the results is due to exactly how much the contention slows things down. Again, it can be seen that the Lockless memory allocator is everywhere faster than this allocator.

t-test1.c plot for lockless and hoard allocators

Server Performance

Another issue is how fast an allocator handles intra-thread and inter-thread allocations and frees. A server process may allocate a work unit in a main thread, and then use a different worker thread to process it. In addition, a well designed server may instead keep allocations and frees thread-local attempting to optimize cache locality. The test_parallel.c measures allocator server performance. It runs three separate tests, and records the total time taken to run all three.

The first test allocates in every thread, and uses an all-to-all communication pattern to transmit the buffers to other threads. The buffers are then freed. Thus in this test, every thread is both a consumer and producer. Fine-grain locking is used to synchronize the communication. If the allocation library uses internal locking for intra-thread frees then it may be a bottleneck in this test, and result in a slow time.

The second test allocates memory in each thread, and then frees that memory in the same thread. This test checks to see the speed impact of the allocation library requiring the use of internal synchronization. An efficient library may be much faster at local frees than remote frees.

The final test constructs a tree of threads. Each thread creates daughter threads, and waits for them to exit. The child threads in turn wait for their children. In addition, each thread allocates some memory. On thread exit, the thread passes a pointer to the array of allocated buffers to its parent. The parent then frees the child's memory after the child has exited. This test stresses the thread creation and exit paths within the allocator. Some allocators may have long thread initialization times. Others may be slow to handle dead threads. Since a server may fork off new threads to handle incoming requests, performance here is important.

The total times for the test_parallel.c benchmark for the glibc, jemalloc, tcmalloc, hoard, and Lockless Inc allocators are summarized in the graph below.

test_parallel.c results for all allocators

The Lockless memory allocator takes 0.60 seconds to complete the parallel server benchmark (test_parallel.c), whereas other allocators take 2.1 seconds, or more. Note that this benchmark is extremely sensitive to the total amount of memory available, so results may vary if this changes. Since it measures server performance, it also is strongly affected by the total number of available processor cores.

Microsoft Windows

We test how the Lockless memory allocator compares to the in-build memory allocator in Microsoft Windows. The problem here is that Microsoft is continually changing how it implements its allocator. To make things simple, we choose the Windows Vista allocator as a baseline. That allocator is much better than the Windows XP one and optimized for multithreaded applications, thus making a fairer comparison.

We test how the total time taken varies with average allocation size, and total number of threads. We use an AMD Phenom system with a total of 4 processor cores to run this test. (Note that this is a different machine to the Linux tests above, so results between Linux and Windows should not be directly compared.) The test is run with 1,2,3,4,5,8,12,16, and 20 threads. Note that above 4 threads, the performance will be limited by the lack of processors. We vary the maximum allocation in the benchmark by powers of 2 from 32bytes up to 128KiB.

t-test1.c plot for lockless and windows allocators

As can be seen, the Windows Vista allocator has a similar performance profile to the Hoard allocator. It, however, is slighly faster than Hoard and comparable in speed to the Lockless memory allocator for mid-range sized allocations. For small allocations, the Lockless memory allocator is up to twice as fast. For large allocations, the Vista allocator suffers from catastrophic performance loss. It appears that it either calls into the kernel far too much, or uses a very heavy-weight synchronization method.

Thus whether your applications benefit from the Lockless memory allocator on the Microsoft Windows platform depends on their allocation size distribution. Most applications have many small allocations, and fewer medium and large ones. However, as memory sizes increase with time, the average application tends to have a larger memory footprint. Thus the effect of large (multi-kilobyte) allocations grows.

The test_parallel.c benchmark has a variety of allocation sizes, stressing many parts of a memory allocator. It can be a reliable indicator of server performance if the server is not IO or compute bound. This benchmark takes 14.9 seconds to run on Windows Vista. Using the Lockless memory allocator on the same machine causes the benchmark to take 1.38 seconds. This is an order of magnitude speedup.

MySQL

An important load for many businesses is that of their database servers. The speed of memory allocators may greatly change your server's performance. The following graph describes read-only performance on a 8 core Xeon machine when different allocators are used.

MySQL Performance with different memory allocators

Write latency is also affected by the allocator choice. By switching to a different allocator, a large decrease in cpu usage and request latency may be obtained. More on this is at the MySQL Performance article.

Summary

The Lockless allocator is on average 2-3 times faster than the best memory allocators developed to date, in both a single and multithreaded context. The Lockless allocator can be an order of magnitude faster in some situations, for example server loads. However, note that the amount of performance gain will be application-specific. Modern applications written in high level languages like C++ and Python tend to spend a significant amount of their time dealing with memory. It is expected that the performance increase for mission-critical applications may be as much as a couple of cpu speed grades. By purchasing the Lockless memory allocator, you may be able to save much more than what would be needed to be spent for extra hardware.

About Us Returns Policy Privacy Policy Send us Feedback
Company Info | Product Index | Category Index | Help | Terms of Use
Copyright © Lockless Inc All Rights Reserved.
My Account View Cart