The Linux Page

So many nopw and nopl in amd64 code?!

I have been wondering why gcc adds so many nop instructions to the binary code of my 64-bit programs.

The fact is that code is expected to run faster if properly aligned.

How's that?

A nop instruction does nothing, by definition: No OPeration.

On amd64, instructions are fetched from the L1 instruction cache in 16-byte chunks, which are then fed to the decoder. So if you can align your code to a 16-byte boundary, all the better: the instructions within those 16 bytes can be decoded together. When you jump into the middle of such a chunk, the processor can only use what is left of it (say you jump to an address ending in 0x1D; then only bytes 0x1D, 0x1E, and 0x1F of that fetch can hold useful instructions, and bytes 0x10 to 0x1C are wasted.)

Okay, this is good, but when I look at my code, gcc doesn't just align jump targets. It aligns everything. That is, every few instructions, I get a nop instruction that aligns what comes next to the next 16-byte boundary, whether or not anything jumps to that instruction.

From what I understand, there is really no reason to do that. However, there is another optimization in amd64 processors: DirectPath decoding.

This determines how instructions decode relative to one another. An instruction that supports DirectPath decoding can be decoded in parallel with all the other DirectPath instructions around it (how many are handled in parallel depends on the processor; in many cases, up to 3 per cycle these days.)

So I thought the conditional jumps (Jcc) would be affected. Not the case: all the conditional jumps are near jumps, which are DirectPath instructions. So I still don't see why gcc would do this type of optimization...

The reason is that the newer family of amd64 processors does two new things:

1) it fetches 32 bytes at once (thus gcc doesn't really do it right... since it aligns everything to 16 bytes)

2) it can execute 2 integer instructions, 1 floating-point instruction, and 1 jump all in one go (the jump slot is really for the prediction logic, not the jump itself.)

This means you should not have more than one jump every 32 bytes. This explains the additional nopw and nopl all over the place: these nop instructions push the jump instructions out of the previous 32-byte block (even though the current implementation only aligns to 16 bytes...)

Note: the 16- and 32-byte boundaries have one other limitation: if a branch instruction spans such a boundary, the processor cannot do the jump prediction, so you have to wait for the next 16 or 32 bytes to be fetched before the jump can be fully decoded. That's a lot of waiting, although putting the jump entirely in the next 16 or 32 bytes will not really help (if you ask me), since either way those bytes have to be loaded.

In other words, if you do not yet have a jump in the current 16/32 bytes, it serves no purpose to push the jump to the next block. If you already have one (or more...) jumps in the current 16/32 bytes, pushing the last one into the next block is a good idea, to avoid conflicts with the existing jump. (From what I understand, only one prediction is stored per 16 or 32 bytes; with more than that, you lose decoding and execution time...)

Oh! And the newest processors are finally capable of fusing a compare instruction with the conditional jump that immediately follows it. So I hope gcc will stop interleaving other instructions between the two, which it has done before and still does today...

For additional information about assembly optimization, you want to read the attached document: Software Optimization Guide for AMD Family 15h Processors

(Documentation source: Developer Guides & Manuals)

You may want to search for the newest version, since Family 15h is from 2011 (i.e. the time of writing; by the time you read this, it could be quite a bit out of date.)

The branching information in this document is specifically found in Chapter 7. Every single chapter and annex is about optimization, so I strongly suggest you read the entire book if you have time and don't know much about modern assembly language optimizations.

Optimize for size...

Assuming that your application doesn't need to be that fast, you can skip those alignment optimizations using the -Os option of gcc. This option tells gcc to optimize for size (-Os[ize]!) Note that when several -O options appear on the command line, the last one wins, so in the following the -Os overrides the -O3:

gcc -O3 -Os ...

In my case, it reduces the size by 700 bytes on a 4 KB object file, which I think is quite a lot (about 17% of the file, although you have to take into account that the object file also includes constant data tables and headers... so the code itself may have shrunk by as much as 20% here.)

Attachment: amd64-family15.pdf (1.79 MB)