Lockless Inc

Creating a Fast Hash Function

Hash functions convert a stream of arbitrary data bytes into a single number. By the pigeon-hole principle, many possible inputs will map to the same output. However, if a hash function is chosen well, then it is difficult to find two keys that will hash to the same value. A very good hash function will have an output indistinguishable from a random number generator.

However, usually we do not wish to have output quite that uniform. Small correlations between the input and output bits are typically not a problem provided they are not large. The larger the correlations we allow, the faster a function we can create. For example an extremely poor hash function is:


typedef unsigned long long u64b;
u64b hash(char *key, u64b len)
{
	u64b result = 0;
	while (len)
	{
		result += *key;
		key++;
		len--;
	}
	
	return result;
}

This function simply sums the input bytes. If the input keys are random enough, then this may be a perfect solution. However, in the real world, the keys we wish to hash are decidedly non-random. We thus would rather have something that would produce a less correlated output.

To quantify whether or not a hash function is any good, we will need some tests. The simplest test is to see how a stream of NUL bytes is handled. A good hash will give a different answer for each length.


/* check for problems with nulls and odd-sized hashes */
static void simple_string_test(void)
{
	byte buf[8];
	u64b i;

	memset(buf, 0, 8);

	printf("These should all be different\n");
	for (i=0; i<8; i++)
	{
		printf("%lld  0-byte strings, hash is %llx\n", i, hash(buf, i));
	}
	
	memset(buf, 42, 8);
	printf("These should also all be different\n");
	for (i=1; i<8; i++)
	{
		printf("%lld  42-byte strings, hash is %llx\n", i, hash(buf, i));
	}
	
	printf("These should also all be different\n");
	for (i=1; i<8; i++)
	{
		buf[i] += i;
	}
	
	for (i=1; i<8; i++)
	{
		printf("%lld  42+-byte strings, hash is %llx\n", i, hash(buf, i));
	}
}

The bad add-the-input-bytes hash described above will fail this test. No matter how long the stream of zeros, it will output the same value: zero. This test also makes sure that odd-sized inputs are handled well. More advanced hash functions tend to input their data in large chunks. The handling of the extra bytes in the partially-full last chunk can be poor.

Another useful thing is to test for avalanche, where altering a single bit in the key changes the hash. For a good hash, every key bit should statistically alter every output bit about 50% of the time. Bob Jenkins has some nice code that tests for this. A simple version tests that each key bit affects each hash bit:


/* check that every input bit changes every output bit half the time */
#define MAXPAIR 80
#define MAXLEN 100
static void avalanche(void)
{
	byte qa[MAXLEN+1], qb[MAXLEN+2], *a = &qa[0], *b = &qb[1];
	u64b c, d, i, j = 0, k, l, z;
	u64b e, f, g, h;
	u64b x, y;
	u64b hlen;

	printf("No more than %d trials should ever be needed \n",MAXPAIR/2);
	for (hlen = 0; hlen < MAXLEN; ++hlen)
	{
		z = 0;
		
		/* For each input byte, */
		for (i = 0; i < hlen; ++i)
		{
			/* for each input bit, */
			for (j = 0; j < 8; ++j)
			{
				e=f=g=h=x=y=~(u64b)0;

				/* check that every output bit is affected by that input bit */
				for (k = 0; k < MAXPAIR; k += 2)
				{
					/* keys have one bit different */
					for (l=0; l<hlen+1; ++l)
					{
						a[l] = b[l] = 0;
					}

					/* have a and b be two keys differing in only one bit */
					a[i] ^= (k<<j);
					a[i] ^= (k>>(8-j));
					c = hash(a, hlen);
					b[i] ^= ((k+1)<<j);
					b[i] ^= ((k+1)>>(8-j));
					d = hash(b, hlen);

					e &= c^d;
					f &= ~(c^d);
					g &= c;
					h &= ~c;
					x &= d;
					y &= ~d;
					
					if (!(e|f|g|h|x|y)) break;
				}
				if (k > z) z = k;
				if (k == MAXPAIR)
				{
					printf("Some bit didn't change: ");
					printf("%.8llx %.8llx %.8llx %.8llx %.8llx %.8llx  ", e,f,g,h,x,y);
					printf("i %lld j %lld len %lld\n",i,j,hlen);
				}
				if (z == MAXPAIR) goto done;
			}
		}
		
done:
		if (z < MAXPAIR)
		{
			printf("Mix success  %2lld bytes ",i);
			printf("required  %lld  trials\n",z/2);
		}
	}
	printf("\n");
}

(The above has been cleaned up to support 64bit hashes and the coding style altered to match that in the rest of this website.) If a hash function doesn't pass the above test, then there is something seriously wrong with it.

Next, we should test for correlation between bits. To do this, a helper function for bit-string manipulation is useful:



/* Modify the k-th bit */
static void modify_bit(byte *a, unsigned k)
{
	/* XOR the appropriate byte in the bitstring with the right bit */
	a[k/8] ^= 1 << (k&7);
}

We would like to test how often an output bit is changed by input bits, and would like to see a 50% correlation between every bit-pair. Unfortunately, for any given number of trials, there is statistical noise. Thus we can only say there is a problem when the correlation differs from 50% by more than an amount that varies as the square root of the total number of trials. Then there is the fact that there are multiple bit-pairs we are checking simultaneously. Simply by random chance, this increases the probability that any given bit-pair will signal a false over or under-correlation. In order to get good statistical results, about a million trials are required.


static void corr_1st_order(u64b trials, int size)
{
	int i, j, k;

	byte *state = calloc(size, 1);
	byte *save = calloc(size, 1);
	
	u64b inb, outb;

	int *t = calloc(sizeof(int) * 64 * size * 8, 1);
	
	double x, y, ssq = 0;
	double maxt = 50.0, mint = 50.0;
	double sfactor = 4.0*64/sqrt(trials);
	
	srandom(getpid() * time(NULL));

	for (i = 0; i < trials; i++)
	{
		for (j = 0; j < size; j++)
		{
			state[j] = random();
		}
	
		memcpy(save, state, size);
	
		inb = hash(state, size);
	
		for (k = 0; k < 8 * size; k++)
		{
			memcpy(state, save, size);
			
			modify_bit(state, k);
			
		    outb = hash(state, size) ^ inb;
			
		    for (j = 0; j < 64; j++)
		    {
		        t[k * 64 + j] += outb & 1;
				outb /= 2;
		    }
		}
	}

	for (i = 0; i < 8 * size; i++)
	{
		for (j = 0; j < 64; j++)
		{
			x = t[i * 64 + j] * 100.0 / trials;
			if (x > maxt) maxt = x;
			if (x < mint) mint = x;
			
			y = fabs(x - 50.0);
			if (y > sfactor) printf("Bad value %f (%d %d)\n", x, i, j);
			
			ssq += (x-50)*(x-50);
		}
	}
	
	ssq /= (64 * 8 * size);
	
	printf("Max %f Min %f Variance %f sfactor %f\n", maxt, mint, ssq, sfactor);
	
	free(t);
	free(state);
	free(save);
}

An even more complex test compares pairs of output bits with each input bit. This is more sensitive to subtle problems with a hash function, and can indicate when the output isn't quite mixed enough.


static void corr_2nd_order(u64b trials, int size)
{
	int i, j, k, l;

	byte *state = calloc(size, 1);
	byte *save = calloc(size, 1);
	
	u64b inb, outb, o1, o2;

	int *t = calloc(sizeof(int) * 64 * 63 * size * 8, 1);
	
	double x, y, ssq = 0;
	double maxt = 50.0, mint = 50.0;
	double sfactor = 3.0*64/sqrt(trials);
	
	srandom(getpid() * time(NULL));

	for (i = 0; i < trials; i++)
	{
		for (j = 0; j < size; j++)
		{
			state[j] = random();
		}
	
		memcpy(save, state, size);
	
		inb = hash(state, size);
	
		for (k = 0; k < 8 * size; k++)
		{
			memcpy(state, save, size);
			
			modify_bit(state, k);
			
		    outb = hash(state, size) ^ inb;
			
			o1 = outb;
			
		    for (j = 0; j < 64; j++)
		    {
				o2 = o1;
				for (l = j + 1; l < 64; l++)
				{
					o2 /= 2;
					t[k * 64 * 63 + j * 63 + l - 1] += (o1 ^ o2) & 1;
				}
				
				o1 /= 2;
		    }
		}
	}

	for (i = 0; i < 8 * size; i++)
	{
		for (j = 0; j < 64; j++)
		{
			for (k = j + 1; k < 64; k++)
			{
				x = t[i * 64 * 63 + j * 63 + k - 1] * 100.0 / trials;
				if (x > maxt) maxt = x;
				if (x < mint) mint = x;

				y = fabs(x - 50.0);
				if (y > sfactor) printf("Bad value %f (%d %d %d)\n", x, i, j, k);

				ssq += (x-50)*(x-50);
			}
		}
	}
	
	ssq /= (64 * 63 * 8 * size);
	
	printf("Max %f Min %f Variance %f sfactor %f\n", maxt, mint, ssq, sfactor);
	
	free(t);
	free(state);
	free(save);
}

Using about a million trials, the above test will say if your hash function differs from a random oracle at the 1% level. This is good enough for most (non-cryptographic) purposes.

Other tests for hash-functions exist. However, I have found that the above ones tend to do a good job. The Maurer test passes some quite poor hash functions. The second and third-order ANFS tests are somewhat discerning. However, the simpler bit correlation tests are quicker to run, and give more useful information on failure.

The task for today is to construct the fastest possible hash function that passes all of these tests. (Even faster hash functions are possible, like variants of the add-all-bytes one, but may lead to statistical problems for non-random input data.) For ultimate speed, we will use assembly language for the hash function descriptions. This can give up to a 2× speedup in some cases.

The first hash function uses a similar algorithm to the parallel random number generator described in another article. The hash state is repeatedly multiplied by a large odd number. It then is "folded" with a half-shift + xor operation. These operations are chosen because they are invertible. This means that the available phase-space is conserved, and thus all possible hash values can be obtained.

The multiplication by an odd number (mod a power of two) can be inverted by finding the modular inverse. The xorshift operation can be inverted by repeating itself. The multiplication mixes lower bits into higher bits. The xorshift does the reverse, mixing information downwards to the least significant bits. The combination of the two spreads information throughout the hash state. We need two multiplication + xorshifts per mix-in of key state in this algorithm. If only one is done, then the tests do not pass.

The second part of the algorithm involves dealing with the fraction of an 8-byte chunk remaining. Since we don't want to read beyond the end of the key buffer, we mix in the size-four remaining chunk if it exists, then the size two, and finally the size one. This is faster than using a jump-table for a calculated goto.

Finally two more multiplications, and one xorshift is done to mix in the new bits. The result is an algorithm which passes the above tests, and is quite fast. The inner loop is quite small, and it processes eight bytes per iteration.


# %rsi = buffer pointer, %rdi = buffer length
.globl hash_asm
.type   hash_asm,@function
hash_asm:
	# The large odd multiplication constant
	movabs  $0x6595a395a1ec531b, %r8
	mov     %rsi, %rdx
	xor     %r8, %rdx
	
	# Do we only have a partial chunk to do?
	sub     $0x8, %rsi
	jb      2f
	
	# The main loop
.align 8, 0x90
1:	xor     (%rdi), %rdx
	imul    %r8, %rdx
	mov     %rdx, %rax
	shr     $0x20, %rdx
	xor     %rdx, %rax
	add     $0x8, %rdi
	imul    %r8, %rax
	mov     %rax, %rdx
	shr     $0x20, %rax
	xor     %rax, %rdx
	sub     $0x8, %rsi
	jae     1b

	# Is there a partial chunk remaining?
2:	cmp     $-0x8, %rsi
	jne     4f
	
	# Handle the final mixing
3:	imul    %r8, %rdx
	mov     %rdx, %rax
	shr     $0x20, %rdx
	xor     %rdx, %rax
	imul    %r8, %rax
	retq
		
	# Handle a partial chunk
4:	xor     %rax, %rax
	test    $0x4, %rsi
	je      4f
	mov     (%rdi), %eax
	add     $0x4, %rdi
4:	test    $0x2, %rsi
	je      4f
	movzwl  (%rdi), %ecx
	shl     $0x10, %rax
	add     %rcx, %rax
	add     $0x2, %rdi
4:	test    $0x1, %rsi
	je      4f
	movzbl  (%rdi), %ecx
	shl     $0x8, %rax
	add     %rcx, %rax
	
	# Mix in the partial chunk
4:	xor     %rax, %rdx
	imul    %r8, %rdx
	mov     %rdx, %rax
	shr     $0x20, %rdx
	xor     %rdx, %rax
	imul    %r8, %rax
	mov     %rax, %rdx
	shr     $0x20, %rax
	xor     %rax, %rdx
	jmp     3b
.size	hash_asm, .-hash_asm

So how should we benchmark the above? The problem is that its speed depends on the length of the key buffer. To be fair, we will need to include a few sizes in the benchmark, and weight each equally. One which does this is:


static void benchmark(void)
{
	int size = 1 << 28;
	byte *buf = calloc(size, 1);
	u64b result = 0;
	
	int i;
	
	for (i = 0; i < size / 8; i++)
	{
		result += hash(buf, 8);
	}
	
	for (i = 0; i < size / 32; i++)
	{
		result += hash(buf, 32);
	}
	
	for (i = 0; i < size / 1024; i++)
	{
		result += hash(buf, 1024);
	}
	
	for (i = 0; i < size / (1 << 16); i++)
	{
		result += hash(buf, 1 << 16);
	}
	
	for (i = 0; i < size / (1 << 22); i++)
	{
		result += hash(buf, 1 << 22);
	}
	
	printf("Result : %llu\n", result);
}

This benchmark completes in 1.24s for the above hash function on this machine. Can we do better? In fact we can. Instead of using an xorshift to shift information downwards, we can use other instructions. (Using a multiplication to shift it upwards is fairly optimal.) One that comes to mind is the rotate instruction. By rotating by 32 bits we can swap the upper and lower parts of our 64 bit state. A hash function which does this is:


# rsi = buffer pointer, %rdi = buffer length
.globl hash_rol
.type   hash_rol,@function
hash_rol:
	movabs  $0xac95a395a9ec535b, %r8
	mov     %rsi, %rax
	xor     %r8, %rax
	sub     $0x8, %rsi
	jb      2f
	
	# Use a rotate instruction to mix downwards
.align 8, 0x90
1:	xor     (%rdi), %rax
	imul    %r8, %rax
	rol     $32, %rax
	xor     %r8, %rax
	add     $0x8, %rdi
	imul    %r8, %rax
	rol     $32, %rax
	sub     $0x8, %rsi
	jae     1b
2:	cmp     $-0x8, %rsi
	jne     4f
	
	# Use rotates + multiplications for final mixing
3:	imul    %r8, %rax
	rol     $32, %rax
	xor     %r8, %rax
	imul    %r8, %rax
	rol     $32, %rax
	imul    %r8, %rax
	rol     $32, %rax
	imul    %r8, %rax
	retq
4:	xor     %rdx, %rdx
	test    $0x4, %rsi
	je      4f
	mov     (%rdi), %edx
	add     $0x4, %rdi
4:	test    $0x2, %rsi
	je      4f
	movzwl  (%rdi), %ecx
	shl     $0x10, %rdx
	add     %rcx, %rdx
	add     $0x2, %rdi
4:	test    $0x1, %rsi
	je      4f
	movzbl  (%rdi), %ecx
	shl     $0x8, %rdx
	add     %rcx, %rdx
	
	# Mix in partial chunk using rotates
4:	xor     %rdx, %rax
	imul    %r8, %rax
	rol     $32, %rax
	imul    %r8, %rax
	rol     $32, %rax
	jmp     3b
.size	hash_rol, .-hash_rol

Since a rotate doesn't do as much information mixing as an xorshift does, we need more of them. However, since many less instructions are required per iterations, we have a net win. This hash function successfully passes the tests, and completes the benchmark in 1.13s, which is a fair speedup.

It turns out that there is an even better downward-mixing instruction than rotate. The bswap instruction reverses the bytes in a register. This is perfect for our purposes. Replacing the rotates with this gives:


# rsi = buffer pointer, %rdi = buffer length
.globl hash_bswap
.type   hash_bswap,@function
hash_bswap:
	movabs  $0xac95a395a9ec535b, %r8
	mov     %rsi, %rax
	xor     %r8, %rax
	sub     $0x8, %rsi
	jb      2f
	
	# Use a bswap instruction to mix downwards
.align 8, 0x90
1:	xor     (%rdi), %rax
	imul    %r8, %rax
	bswap   %rax
	add     $0x8, %rdi
	imul    %r8, %rax
	bswap    %rax
	sub     $0x8, %rsi
	jae     1b
2:	cmp     $-0x8, %rsi
	jne     4f
	
	# Use bswaps + multiplications for final mixing
3:	imul    %r8, %rax
	bswap   %rax
	xor     %r8, %rax
	imul    %r8, %rax
	bswap   %rax
	imul    %r8, %rax
	bswap   %rax
	imul    %r8, %rax
	retq
4:	xor     %rdx, %rdx
	test    $0x4, %rsi
	je      4f
	mov     (%rdi), %edx
	add     $0x4, %rdi
4:	test    $0x2, %rsi
	je      4f
	movzwl  (%rdi), %ecx
	shl     $0x10, %rdx
	add     %rcx, %rdx
	add     $0x2, %rdi
4:	test    $0x1, %rsi
	je      4f
	movzbl  (%rdi), %ecx
	shl     $0x8, %rdx
	add     %rcx, %rdx
	
	# Mix in partial chunk using bswaps
4:	xor     %rdx, %rax
	imul    %r8, %rax
	bswap   %rax
	imul    %r8, %rax
	bswap   %rax
	jmp     3b
.size	hash_bswap, .-hash_bswap

This is a little faster, at 1.07s This is because we don't need the extra xor by the constant in the inner loop. (Without it, the rotate version doesn't pass the correlation tests.)

Is this the end? Not quite, we can do a bit better by expanding the amount of hash state in the inner loop. By using 128 bits of state, we can use 64×64 →128 bit multiplications. This is a more efficient use of processor resources. In addition, the fact that the high and low results of the multiplication are in separate registers allows us to perform the xorshift operation in a single instruction. The resulting code looks like:


# rsi = buffer pointer, %rdi = buffer length
.globl hash_128
.type   hash_128,@function
hash_128:
	movabs  $0x1591aefa5e7e5a17, %r8
	mov     %rsi, %rax
	xor     %r8, %rax
	movabs  $0x2bb6863566c4e761, %r9
	mov     %r9, %rcx
	sub     $0x10, %rsi
	jb      2f
	
# Process 128bits = 16 bytes at a time.
.align 8, 0x90
1:	xor     (%rdi), %rax
	xor     8(%rdi), %rcx

# Only one mix operation needed per iteration.
	mul     %r8
	add     $0x10, %rdi
	imul    %r9, %rcx
	add     %rdx, %rcx
	xor     %rcx, %rax
	sub     $0x10, %rsi
	jae     1b
2:	cmp     $-0x10, %rsi
	jne     4f

# Final mix using 128bit mul and xorshift.
3:	mul     %r8
	imul    %r9, %rcx
	add     %rdx, %rcx
	xor     %rcx, %rax
	mul     %r8
	imul    %r9, %rcx
	add     %rdx, %rcx
	xor     %rcx, %rax
	retq

# Need to mix in up to 15 bytes.
4:	xor     %rdx, %rdx
	test    $0x8, %rsi
	je      4f
	xor     (%rdi), %rax
	add     $0x8, %rdi
4:	test    $0x4, %rsi
	je      4f
	mov     (%rdi), %edx
	add     $0x4, %rdi
4:	test    $0x2, %rsi
	je      4f
	movzwl  (%rdi), %r10d
	shl     $0x10, %rdx
	add     %r10, %rdx
	add     $0x2, %rdi
4:	test    $0x1, %rsi
	je      4f
	movzbl  (%rdi), %edi
	shl     $0x8, %rdx
	add     %rdi, %rdx
4:	xor     %rdx, %rcx
	jmp    3b
.size	hash_128, .-hash_128

Using the above algorithm results in superior mixing. This means that only one 128 bit + xor operation is needed per 128 bits of key state. It also means that less mixes are required after the final merge. Even though the double-word multiplication instruction is slow, the fact that we are now processing twice the data per iteration more than makes up for it. The benchmark now completes in 0.84s, which well and truly breaks the one-second barrier.

Of course, this isn't the quite best code. Now that we are processing 128 bits at a time, how about trying the bswap algorithm within the inner loop, and then mixing the final results together with 128 bit multiplies and xorshifts? This removes dependencies in the inner loop, and allows the processor to extract extra parallelism it couldn't previously.


# rsi = buffer pointer, %rdi = buffer length
.globl hash_128_swap
.type   hash_128_swap,@function
hash_128_swap:
	movabs  $0x1591aefa5e7e5a17, %r8
	mov     %rsi, %rax
	xor     %r8, %rax
	movabs  $0x2bb6863566c4e761, %r9
	mov     %r9, %rcx
	sub     $0x10, %rsi
	jb      2f
	
# Use two multiply + bswap algorithms in parallel
.align 8, 0x90
1:	xor     (%rdi), %rax
	xor     8(%rdi), %rcx
	add     $0x10, %rdi
	imul    %r8, %rax
	imul    %r9, %rcx
	bswap   %rax
	bswap   %rcx
	sub     $0x10, %rsi
	jae      1b
2:	cmp     $-0x10, %rsi
	jne     4f

# Finish off with 128bit multiples to mix.
3:	mul     %r8
	imul    %r9, %rcx
	add     %rdx, %rcx
	xor     %rcx, %rax
	mul     %r8
	imul    %r9, %rcx
	add     %rdx, %rcx
	xor     %rcx, %rax
	retq
4:	xor     %rdx, %rdx
	test    $0x8, %rsi
	je      4f
	xor     (%rdi), %rax
	add     $0x8, %rdi
4:	test    $0x4, %rsi
	je      4f
	mov     (%rdi), %edx
	add     $0x4, %rdi
4:	test    $0x2, %rsi
	je      4f
	movzwl  (%rdi), %r10d
	shl     $0x10, %rdx
	add     %r10, %rdx
	add     $0x2, %rdi
4:	test    $0x1, %rsi
	je      4f
	movzbl  (%rdi), %edi
	shl     $0x8, %rdx
	add     %rdi, %rdx
4:	xor     %rdx, %rcx
	mul     %r8
	imul	%r9, %rcx
	add     %rdx, %rcx
	xor     %rcx, %rax
	jmp     3b
.size	hash_128_swap, .-hash_128_swap

The extra parallelism increases speed a little bit. Note how the superior mixing means that we only need one bswap per 64bit half-state in the above. (Instead of the two bswaps for full state in the earlier algorithm.) This code takes 0.81s in the benchmark, and as a result the above algorithm is highly recommended.

Of course we still aren't quite done. There is one other set of instructions we haven't touched yet. The streaming SIMD instructions (SSE) allow computation on a large amount of data in a short amount of time. If we can construct a hash function which uses them efficiently, we may be able to beat the above record.

There are several problems with the SSE instructions. The first is that they require 16-byte alignment of memory. Normally, this isn't a problem, as the user of a SSE function can enforce alignment, However, a hash function cannot be so choosey. Therefor, we will need to check the alignment of the input buffer, and fall back on a slightly slower inner loop in the unaligned case.

Another problem with the SSE instructions is that they are not very orthogonal. There are a wide variety of instructions, but there doesn't seem to have been much of a plan about what was implemented, and what not. The instructions chosen were selected on their use in multimedia, not the integer arithmetic within hash functions. Thus we will need to think laterally in order to find ones useful for our task.

We require two types of instructions for the inner loop. Ones which mix bits "upwards", and ones which mix them "downwards". By using both sets, we can mix all the bits, and create a nice function. The problem is that SSE instructions are designed to work on multiple bits of data in parallel. Having information cross the boundaries between those chunks of data isn't easy.

The method we used to move information upwards previously was via multiplication by large odd constants. The problem here is that we are limited in the SSE2 instruction set about which multiplications are allowed. We would like to use as large a multiplication as possible. However, the largest one which computes two 32×32→64 bit multiplies only uses half of the data within a SSE register. Doing the same on the other half requires other instructions to extract the information, and perhaps yet more instructions to recombine everything.

It seems the best multiply instruction available is pmullw, which does a 16×16→16 multiply. This doesn't do very much mixing, which may be a problem. If only a 32 bit or 64 bit version of the above existed in SSE2. (The higher SSE implementations cannot be guaranteed to exist on a 64bit machine.)

Mixing information downwards presents a new problem. There are no bswap or rotate instructions in SSE2. There are 128 bit shifts, but implementing a rotate out of them is slow. Another possibility is to use a shift + xor. However, since the 128 bit shifts are limited to be in units of whole bytes, there aren't many good possibilities. The amount of mixing available with this technique is limited.

What saves us are the unpack instructions. We can use punpckhbw and punpcklbw to interleave the bytes from two SSE registers. This, when combined with the 16 bit multiply yields the following algorithm:


# rsi = buffer pointer, %rdi = buffer length
.align 16
hash_sse_c:
.quad	0x5a379ab38dc5a46b
.quad	0xd6c573e9c613993d 
.globl hash_sse
.type   hash_sse,@function
hash_sse:
	movabs  $0x1591aefa5e7e5a17, %r8
	movq    %rsi, %xmm0
	movaps  hash_sse_c(%rip), %xmm2
	movabs  $0x2bb6863566c4e761, %r9
	movdqa  %xmm2, %xmm1
	pxor    %xmm2, %xmm0
	sub     $0x20, %rsi
	jb      2f
	test    $0x0f, %rdi
	jne     5f

# Inner loop - use 16 bit mul and unpack to mix
.align 8, 0x90
1:	paddq   (%rdi), %xmm0
	paddq   0x10(%rdi), %xmm1
	pmullw  %xmm2, %xmm0
	pmullw  %xmm2, %xmm1
	add     $0x20, %rdi
	movdqa  %xmm0, %xmm3
	punpckhbw %xmm1, %xmm0
	punpcklbw %xmm3, %xmm1
	sub     $0x20, %rsi
	jae     1b
	pxor    %xmm1, %xmm0
2:	cmp     $-0x20, %rsi
	jne     4f

#	Extract 128 bits of state from %xmm0
	movhlps %xmm0, %xmm1
	movq    %xmm0, %rax
	movq    %xmm1, %rcx

# Use 128bit multiple and xorshift for final mix
3:	mul     %r8
	imul    %r9, %rcx
	add     %rdx, %rcx
	xor     %rcx, %rax
	mul     %r8
	imul    %r9, %rcx
	add     %rdx, %rcx
	xor     %rcx, %rax
	mul     %r8
	imul    %r9, %rcx
	add     %rdx, %rcx
	xor     %rcx, %rax
	retq

# The final chunk can be up to 31 bytes in size
4:	test    $0x10, %rsi
	je      4f
	movups  (%rdi), %xmm3
	pxor    %xmm3, %xmm0
	add     $0x10, %rdi
	pmullw  %xmm2, %xmm0
4:	xor     %rdx, %rdx
	movhlps %xmm0, %xmm1
	movq    %xmm0, %rax
	movq    %xmm1, %rcx
	test    $0x8, %rsi
	je      4f
	xor     (%rdi), %rax
	add     $0x8, %rdi
4:	test    $0x4, %rsi
	je      4f
	mov     (%rdi), %edx
	add     $0x4, %rdi
4:	test    $0x2, %rsi
	je      4f
	movzwl  (%rdi), %r10d
	shl     $0x10, %rdx
	add     %r10, %rdx
	add     $0x2, %rdi
4:	test    $0x1, %rsi
	je      4f
	movzbl  (%rdi), %edi
	shl     $0x8, %rdx
	add     %rdi, %rdx
4:	xor     %rdx, %rcx
	jmp 3b
	
# miss-aligned source inner loop
.align	16
5:	movups  (%rdi), %xmm3
	movups  0x10(%rdi), %xmm4
	paddq   %xmm3, %xmm0
	paddq   %xmm4, %xmm1
	pmullw  %xmm2, %xmm0
	pmullw  %xmm2, %xmm1
	add     $0x20, %rdi
	movdqa  %xmm0, %xmm3
	punpckhbw %xmm1, %xmm0
	punpcklbw %xmm3, %xmm1
	sub     $0x20, %rsi
	jae     5b
	pxor   %xmm1, %xmm0
	jmp     2b
.size	hash_sse, .-hash_sse

The above code is quite complicated. However, it is extremely fast in the benchmark, taking 0.785s. It manages to be so fast because it processes 32 bytes for every iteration of the inner loop. The relatively poor mixing of the SSE instructions is made up for by an extra 128bit multiply and xorshift. If you are hashing keys which are large buffers, the above is a very good function to use. For smaller sizes, the non-sse functions may be faster.

Comments

Dmitry V'jukov said...
Thanks! I really like your blog. You know, it's difficult to find anything deep for developers those days.
sfuerst said...
Thanks! I try to come up with something new and interesting every two weeks or so to write about.
Chris Anderson said...
In most of the inner loops, there's the sequence that tests for branching:

sub $0x10, %rsi ; subtract 16 from rsi (buffer length)
ja 1b ; continue if rsi > 0

or the C equiv:

 do { } while ( (rsi -= 16) > 0 );

Shouldn't this be:

 do { } while ( (rsi -= 16) > 15 );

instead? Since the last 15 bytes are handled by the final chunk parts.
sfuerst said...
Oops, you are right. How embarrassing! Thanks for letting us know so it can be fixed.

Fortunately, the code can be fixed without affecting the inner loop by offsetting the size by the chunk length. We can do this by changing the initial "cmp" into a "sub", and then having a "jae" instead of a "ja" as the loop condition.

I've replaced the code in the article with the new version. The only other change is that hash_128_swap() needs to do a little more mixing in the odd-sized case. It doesn't passing the 31-byte mix-test otherwise.
Chris Anderson said...
I converted one of the versions to C, just to see how close gcc can come to hand coded asm, and the results are pretty good using the benchmark() function above:

$ gcc -O2 h.c -lrt
$ a.out
Asm 551332706 nanosecs (ba560400)
C 414688636 nanosecs (ba560400)
Asm 452776232 nanosecs (ba560400)
C 414525092 nanosecs (ba560400)
NH 1778447535 nanosecs (b0351680)
Asm 452955337 nanosecs (ba560400)
NH 1778373149 nanosecs (b0351680)

code, full source here: http://www.eetbeetee.org/h.c

unsigned int
hash_128_swapc( const void *p, unsigned int len )
{
  register unsigned long long r8 = 0x1591aefa5e7e5a17ULL,
                              r9 = 0x2bb6863566c4e761ULL,
                              rax = len ^ r8,
                              rcx = r9,
                              rdx;
#define bswap( r ) \
  __asm__ __volatile__ ( "bswapq %0" : "+r" (r) : : )
#define mul128( a, d, r ) \
  __asm__ __volatile__ ( "mulq %2" : "+a" (a), "=d" (d) : "r" (r) : )

  while ( len >= 16 ) {
    rax = ( rax ^ ((unsigned long long *) p)[ 0 ] ) * r8;
    rcx = ( rcx ^ ((unsigned long long *) p)[ 1 ] ) * r9;
    bswap( rax );
    bswap( rcx );
    p = &((unsigned long long *) p)[ 2 ];
    len -= 16;
  }
  if ( len != 0 ) {
    if ( ( len & 8 ) != 0 ) {
      rdx = 0;
      rax ^= ((unsigned long long *) p)[ 0 ];
      p = &((unsigned long long *) p)[ 1 ];
    }
    if ( ( len & 4 ) != 0 ) {
      rdx = ((unsigned int *) p)[ 0 ];
      p = &((unsigned int *) p)[ 1 ];
    }
    if ( ( len & 2 ) != 0 ) {
      rdx = ( rdx << 16 ) | ((unsigned short *) p)[ 0 ];
      p = &((unsigned short *) p)[ 1 ];
    }
    if ( ( len & 1 ) != 0 ) {
      rdx = ( rdx << 8 ) | ((unsigned char *) p)[ 0 ];
    }
    rcx ^= rdx;
  }
  mul128( rax, rdx, r8 );
  rcx = ( rcx * r9 ) + rdx;
  rax ^= rcx;
  mul128( rax, rdx, r8 );
  rcx = ( rcx * r9 ) + rdx;
  rax ^= rcx;

  return ( rax >> 32 ) ^ rax;
}


sfuerst said...
Interesting benchmark results. It looks like the bswap-based hash is quite a bit faster than "newhash".

However, the comparison between C and asm isn't quite right. GCC is smart, and will inline the C version of the hash - removing some of the size-checks in the benchmark. To prevent that, you'll need to add __attribute__((noclone, noinline)) to get the speed that a user with a non-compiler-optimizable buffer length will see.
Chris Anderson said...
Yeah, newhash is a derivative of Bob Jenkins' newhash64(): http://www.burtleburtle.net/bob/c/lookup8.c

I counted the instructions in the newhash inner loop vs bswap, bswap: 11 / 16 bytes = .6875 inst per byte, newhash64: 73 / 24 bytes = 3.042. With that, bswap should be 4.42 times as fast, and it is fairly close to the actual multiple: 4.29.
sfuerst said...
Yeah, this article isn't called "Creating a Fast Hash Function" for nothing. ;-)

Although, I have to admit, the final few functions do cheat a bit. There is an increased correlation between bits 23 bytes apart. This is a bit larger than the 8 bytes tested by the correlation checking tests, so the flaw isn't picked up. Fortunately, for most uses, the statistical weakness can be ignored.
Timo Ewalds said...
Very interesting article!

I noticed your very first hash has a bug. It never moves the key pointer forward, so you simply add the first byte to itself len times, which is an even worse hashing function than you intended.

How do your hash functions compare to MurmurHash3? https://code.google.com/p/smhasher/
sfuerst said...
Haha, thanks for noticing that. Yes, it was definitely a really bad hash. :-)

I've just updated the article with the slightly more sensible version that updates the key buffer pointer.

I haven't really tested the newer versions of MurmurHash. You are right, it'll probably be quite interesting to see how they compare with those here.
sfuerst said...
It looks like there was a bug in the non-aligned version of hash_sse. The final mixing was done with a padd instead of a pxor. The version in the article has been fixed. Thanks go out to Samy Bahra for spotting this.

I've also just done some benchmarking with Murmurhash3. It takes 1.5s to run the benchmark on this machine. hash_sse() is about twice as fast, which is pretty amazing...
Georgi 'Sanmayce' said...
Well-said:
"Lockless Inc. seeks to make the highest performance software available. The goal is to optimize for modern machines, and take advantage of the changes in technology that have occurred in the past few years. We seek to extract the power of multiple cores, vectorization, and 64 bit technology to its fullest."

Hi to all,

a few months ago a new superfast hash function was revealed (by chance) based on FNV1A named FNV1A_Jesteress, I should like you to test it along with your implementations and if it is possible with FNV1A_Mantis. It screams on new architectures like Core i3+, for reference: http://www.strchr.com/hash_functions

It would be very useful to use your experience in sake of more clarity on this subject, don't you think?

Regards
sfuerst said...
I've now timed your FNV variants. It takes 0.833s for FNV1A_Jesteress to complete the benchmark, and 0.875s for FNV1A_Mantis. Note that these hashes output 32 bits rather than 64, so it is a little unfair to compare them to the other hashes here which do twice the work...

This machine is a rather old AMD Opteron, so may not have quite the same hash performance characteristics as machines based on Intel's Core architecture.
Georgi 'Sanmayce' said...
Thanks sfuerst,
it is interesting, sadly I have no access to AMDs (even Opteron).
Looking at code (in my view very complicated) presented on your very informative web-page I suggest (and would be very glad) you or someone else to rewrite my FNV variants, to make them 64bit, it is obvious they are with great potential, or maybe you are not so much impressed as I am?

Thanks again and good luck.
sfuerst said...
If you expand your FNV1A variants to 64 bit, then you get something that looks very much like the hash_rol function on this page. Using bswap instead of the rol instruction seems to be better due to the increased amount of downwards diffusion.

Enter the 10 characters above here


Name
About Us Returns Policy Privacy Policy Send us Feedback
Company Info | Product Index | Category Index | Help | Terms of Use
Copyright © Lockless Inc All Rights Reserved.
My Account View Cart