
Compressed Ordinary Object Pointers in the Java 64 bit JVM

More small bonsai trees fit in the same amount of space as one giant sequoia tree.

When was Compressed OOP created in Java?

When the Java virtual machine was ported to 64 bit machines, the pointers, just like in any other software, doubled in size. In other words, the good old 32 bit pointers, which used 4 bytes, became 64 bit pointers, which use 8 bytes. So each reference to any object now uses twice the space.

The JVM engineers therefore implemented a compressed version of the pointers, which was turned on by default starting with Java SE 6u23 and in all versions since (Java 7 and up). You can toggle it explicitly with the -XX:+UseCompressedOops and -XX:-UseCompressedOops command line options.

The advantage of 64 bit computers is that you can have a very large amount of memory. For a long time (in computing terms) we had a limit of about 8GB, then 16GB, then 32GB, so this optimization made a lot of sense. Now that we see more and more computers with over 32GB of RAM, it may become less useful, although if your application does not need that much memory, it can still be very useful in your specific case.

In terms of speed, it probably does not help much, although it might, because you can cram twice the number of pointers in the same CPU data cache. In terms of memory space, you save 4 bytes per reference, so if your application deals with a large number of small objects, that can mean millions of references and thus many MB of data saved by this simple compression.
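
To get a feel for the numbers, here is a quick sketch in Java (the class name and the 10 million figure are made up for the example; 4 and 8 bytes are the usual reference sizes with and without compression, the exact layout depends on your JVM):

    // Estimate the space used by 10 million object references, with
    // compressed (32 bit) and uncompressed (64 bit) pointers.
    public class ReferenceOverhead {
        public static void main(String[] args) {
            long references = 10_000_000L;
            System.out.println("Compressed:   " + (references * 4) / 1_000_000 + " MB");
            System.out.println("Uncompressed: " + (references * 8) / 1_000_000 + " MB");
        }
    }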

How does the compression work?

All pointers to objects are aligned. Here are several reasons for the alignment:

  • The alignment is generally chosen to be the same as the cache lines. That way the buffer is aligned with the CPU cache and accesses are faster.
  • The alignment is required on certain CPUs. Again, this is to make things faster. If the CPU does not need to send the lowest bits of the address for each memory access, you save a few lines between the CPU and memory. This matters much less today since many of those connections are serialized; in the old days, memory access was done in parallel.1
  • The low level memory managers (often referred to as malloc) link free and allocated memory chunks together using one pointer and a size. The pointer needs to be 64 bits on a 64 bit machine, and the size is generally the same: 64 bits. With that in mind, you get a block of 16 bytes per memory chunk (by default malloc returns a block of memory without a header, so you do not lose those 16 bytes once the memory is allocated.) This also means that the smallest unallocated (and thus allocated) memory block is 16 bytes (8 bytes on a 32 bit machine.) Java uses a garbage collection mechanism, which means allocated buffers do use a header.

The result of this alignment constraint is that every pointer that points to the start of such an object is aligned. Java only deals with objects, and thus all the pointers it deals with point to an aligned object. In 64 bits, this means that the last 3 bits of the pointer are always zero:

... XXXXXXXX : XXXXXXXX XXXXXXXX XXXXXXXX XXXXX000

(I removed the first 3 bytes so the pointer is more likely to fit on your screen.)

The fact is that with 32 bit pointers you could already reference 4GB of data. But with a 64 bit memory model, we can access a lot more. One idea is to save the next 3 bits (bits 32 to 34) in those last 3 bits, since we know they are zeroes otherwise:

... 00000ABC : XXXXXXXX XXXXXXXX XXXXXXXX XXXXX000

... 00000000 : XXXXXXXX XXXXXXXX XXXXXXXX XXXXXABC

This gives us 4 bytes + 3 bits, or 35 bits of addressable memory. That means 32GB of RAM (2^35 bytes, or 8 x 4GB).

Now, in assembly language, copying 3 bits the way I just showed would be slow. You need several instructions to accomplish the feat, something like this:
 

Compression:
    MOV %RAX, %RBX      ; copy the 64 bit pointer (source, destination order)
    SHR $32, %RBX       ; bring bits 32 to 34 down to bits 0 to 2
    OR  %RBX, %RAX      ; merge them in; the result now fits in 32 bits

The code assumes that bits 35 to 63 are all zeroes, since we only accept pointers between 0 and 32GB - 1 byte.

Decompression:
    MOV %RAX, %RBX      ; copy the compressed pointer
    AND $7, %RBX        ; keep only the 3 saved bits (ABC)
    SHL $32, %RBX       ; move them back up to bits 32 to 34
    AND $-8, %RAX       ; clear the low 3 bits; the object is aligned so they must be zero
    OR  %RBX, %RAX      ; rebuild the full 35 bit pointer

The decompression assumes that %RAX holds the compressed pointer with bits 32 to 63 set to zero.
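
If it is easier to read in a higher level language, here is the same naive bit twiddling as a small Java sketch (this is only an illustration of the arithmetic, not the actual HotSpot code; the class and method names are made up):

    // Naive scheme: fold bits 32 to 34 into the 3 zero alignment bits.
    // Assumes the address is below 32GB (bits 35 to 63 are zero) and
    // 8 byte aligned (bits 0 to 2 are zero).
    public class NaiveCompressedOop {
        static int compress(long address) {
            return (int) (address | (address >>> 32)); // ABC lands in bits 0 to 2
        }

        static long decompress(int compressed) {
            long abc = compressed & 7;             // recover the 3 saved bits
            long low = compressed & 0xFFFFFFF8L;   // bits 3 to 31, low 3 bits cleared
            return (abc << 32) | low;              // put ABC back at bits 32 to 34
        }

        public static void main(String[] args) {
            long address = 0x5_ABCD_1238L;         // a 35 bit, 8 byte aligned address
            System.out.println(decompress(compress(address)) == address); // prints true
        }
    }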

This would work, but as I just said, it takes several instructions and it requires two registers. There is a better solution, which is to simply shift the whole pointer by 3 bits:

Compression:
    SHR $3, %RAX        ; drop the 3 zero alignment bits

Decompression:
    SHL $3, %RAX        ; restore the 3 alignment bits

Here we solve both problems: we need only one register and a single instruction.

The resulting bits look like this:

... 00000ABC : XXXXXXXX XXXXXXXX XXXXXXXX XXDEF000

... 00000000 : ABCXXXXX XXXXXXXX XXXXXXXX XXXXXDEF

Again, this supposes that bits 35 to 63 are all zeroes.
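
The shift version is even simpler in Java form (again just a sketch of the arithmetic with made up names, not HotSpot code):

    // Shift based scheme: the compressed pointer is simply the address
    // divided by 8; decompression multiplies by 8 again.
    public class ShiftCompressedOop {
        static int compress(long address) {
            return (int) (address >>> 3);           // drop the 3 zero alignment bits
        }

        static long decompress(int compressed) {
            return (compressed & 0xFFFFFFFFL) << 3; // zero extend, then restore the bits
        }

        public static void main(String[] args) {
            long address = 0x7_FFFF_FFF8L;          // highest aligned 35 bit address
            System.out.println(decompress(compress(address)) == address); // prints true
        }
    }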

Wait, but how can you guarantee that bits 35 to 63 are all zeroes?

On computers with an MMU (all x86 processors since the 80386 have one built in; with the 80186 and 80286 you could purchase a separate MMU chip...), you can actually create a virtual memory model whenever you create a process.

Java uses that to its advantage. All the heap memory uses addresses that fit between a start point defined by your OS and the maximum amount of memory you allow your JVM to use. On most Unices this is done using sbrk() and the memory addresses will be contiguous. Easy. (Under MS-Windows, however, the memory model is completely different and blocks of memory may be allocated anywhere... What the JVM does there is allocate a virtual memory buffer with one call; the address starts at a different point, but the principle is otherwise exactly the same.)

This is why the compressed OOP scheme breaks a little before 32GB of memory: a certain amount of memory at the start of a process is used by the text (i.e. the compiled machine code that makes Java work), some internal data, and so on.

Now, we can do the following pointer math to get a compressed pointer that fits in 32 bits:

compressed_pointer = (pointer - start_of_data) >> 3

The start_of_data can be calculated or set to a specific value depending on the JVM implementation. When the virtual memory buffer is allocated under MS-Windows, it can be the pointer that the allocation call returns.
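
Continuing the Java sketch from above, the base simply gets subtracted before the shift and added back after it (heapBase is a made up stand-in for start_of_data; the real JVM knows it from reserving the heap):

    public class BasedCompressedOop {
        static final long heapBase = 0x1_0000_0000L;  // hypothetical start_of_data

        static int compress(long address) {
            return (int) ((address - heapBase) >>> 3);            // rebase, drop alignment bits
        }

        static long decompress(int compressed) {
            return heapBase + ((compressed & 0xFFFFFFFFL) << 3);  // restore bits, add base back
        }
    }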

Other Potential Assembly Optimizations

I have no clue whether the JVM uses this, but with a start_of_data pointer, a shift of three, and a displacement, many processors have the ability to access the memory without an explicit shift instruction.

With the load and store instructions (on x86, a MOV from or to memory), you can write this:

MOV [%RBP+%RAX*8+24], %RBX

This loads a 64 bit value at offset 24 from the object whose compressed pointer is in %RAX; the *8 scale factor performs the decompression shift for free. %RBP holds the start_of_data pointer, which gets added in as well. Here is the pointer math:

      %RBP
+ 8 * %RAX
+       24

This addressing mode was considered slow on older processors because they would redo the math each time. I'd bet that newer processors have no slowdown other than reading a few more bytes from the instruction cache.

That being said, if you need to access a certain variable many times, calculating its final address once and reusing it is probably always going to be faster, and that can be done with a single instruction on x86:

LEA [%RBP+%RAX*8+24], %RAX

Now we see why such a pointer compression can be very beneficial on a processor that supports this kind of addressing. As far as I know, the PPC and 68k supported something similar too. RISC-like processors, however, are likely to need an extra register and to do the math shown earlier.

What kind of savings are we talking about?

Someone ran a test creating many objects to see the difference between using a 31GB and a 32GB heap. As I mentioned earlier, because of various limitations, the JVM sees 32GB as too large a heap to use compressed pointers, so just that extra gigabyte will show you the difference.

The results of his test clearly show that with 31GB he could create around 580,000,000 objects. With 32GB of memory, he could only create 380,000,000. That's 200,000,000 fewer objects with 1GB more memory!

Remember, 200 million object references at 4 bytes each already represent 800MB of memory.

Note, however, that if you go much larger, you can still create more objects. In his test, he had to go as high as 48GB to get the same number of objects as with 31GB; anything over 48GB was a bonus in memory. This also suggests the CPU will run a little faster with the 31GB heap, since it touches some 16GB less memory than with 48GB.
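
If you want to reproduce that kind of measurement yourself, a rough sketch of such a test could look like this (OopCountTest and Node are made up names; run it twice, for instance with -Xmx31g and then -Xmx32g, and compare the counts):

    import java.util.ArrayList;
    import java.util.List;

    // Allocate tiny objects until the heap is exhausted and report how
    // many fit.  The exact numbers depend on the GC and JVM version;
    // this only shows the shape of the experiment.
    public class OopCountTest {
        static class Node {
            Object payload;   // one reference field per object
        }

        public static void main(String[] args) {
            List<Node> nodes = new ArrayList<>();
            int count = 0;
            try {
                while (true) {
                    nodes.add(new Node());
                    count++;
                }
            } catch (OutOfMemoryError e) {
                nodes = null;   // free the heap so the println below can run
                System.out.println("Allocated roughly " + count + " objects");
            }
        }
    }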

Remember that even though the JVM uses a shift, it is still very fast, because the main bottleneck with CPUs is memory. A shift is close to transparent when a memory access happens just before or just after it.

  • 1. Serialized transmission between memory and the CPU is an improvement because it can go faster: it takes an awfully long time to get parallel lines properly synchronized. Remember that in electronics, a few microseconds is a long time.