How to Guide the Compiler to Speed up Your Code | by Minhaz | Apr, 2022


Modern compilers don’t just translate our code as-is from a high-level language to assembly (or machine-readable instructions). They spend a great deal of time and effort optimizing our code to achieve the best performance.

This is, of course, enabled when the right flags are provided to the compiler. You can also instruct the compiler to optimize for binary size or compilation time instead.

This article will be focused on optimizing runtime performance.

Disclaimers:

Most of the examples in this article use C++, but I believe the content will be useful for everyone.

The content of this article is my own and not a reflection of the organisation I work for.


Let’s talk a little about modern CPUs. This is often taught at school but we tend to forget it.

SIMD vs SISD

SISD stands for Single Instruction Stream, Single Data Stream. Typically, a program’s instructions are executed in sequence, i.e. one after another. Let’s say we have two arrays, a and b, and we want to write a program that updates each element of a with the following operation:

a[i] = a[i] + b[i];

For each index i in the arrays.

Visualisation of how the swap algorithm would work on a SISD CPU. Source: Open source repository on Github — Wunkolo/qreverse (MIT license).

This is how we often visualize our code being executed on the CPU. And we tend to optimize the asymptotic (big-O) complexity; yes, that is the right practice.

But modern CPUs can do better!

Modern CPUs have support for SIMD, which stands for Single Instruction, Multiple Data. Such machines can exhibit data-level parallelism (which is different from concurrency). They can perform the same instruction on multiple data elements at once.

For the above example, the SIMD CPUs could group and execute the operations in one batch as:

a[0] = a[0] + b[0];
a[1] = a[1] + b[1];
a[2] = a[2] + b[2];
a[3] = a[3] + b[3];

The SIMD instruction for + is called addps in SSE or vaddps in AVX; they operate on groups of 4 and 8 single-precision float elements respectively.

Visualisation of how the swap algorithm would work on a SIMD CPU. Source: Open source repository on Github — Wunkolo/qreverse (MIT license).

Do you see how your code could run roughly K times faster (where K is the number of elements processed per instruction) if you could tell the CPU to run the algorithm this way?

Worry not! You can — there are intrinsics that allow you to tell the machine to perform vector (SIMD) operations instead of scalar ones.

Here is a code example:

Code example to compute sum of two arrays into `target` array using Neon Intrinsics.
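The embedded snippet is not reproduced here, so below is a minimal sketch of what such code looks like. The function name AddArrays and the int32 element type are illustrative choices of mine; the intrinsics themselves (vld1q_s32, vaddq_s32, vst1q_s32) come from <arm_neon.h>.

#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Computes target[i] = a[i] + b[i], processing 4 int32 lanes per iteration.
void AddArrays(const int32_t* a, const int32_t* b, int32_t* target, size_t size) {
  size_t i = 0;
  // Vectorized loop: 4 elements at a time in a 128-bit Neon register.
  for (; i + 4 <= size; i += 4) {
    int32x4_t va = vld1q_s32(a + i);            // load 4 lanes from a
    int32x4_t vb = vld1q_s32(b + i);            // load 4 lanes from b
    vst1q_s32(target + i, vaddq_s32(va, vb));   // lane-wise add, then store
  }
  // Scalar tail for the remaining (size % 4) elements.
  for (; i < size; ++i) {
    target[i] = a[i] + b[i];
  }
}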

Tell me who doesn’t want to write code this way — looks absolutely lovely!

Ok, I don’t!

This requires acquiring a bit more domain knowledge and is often not easily maintainable. There is a way to write both maintainable as well as high-performance code — but that is a topic for another evening (will update here)!

The good thing is, often you don’t have to write code with vector intrinsics directly to achieve the performance targets. This takes me to the next topic: "vectorization".

Vectorization

SIMD instructions operate on vector data types. In the above example, a group of array elements like a[0...3] or b[4...7] can be called a vector.

Vectorization is the use of vector instructions to speed up program execution.

Vectorization can be done by programmers, by explicitly writing vector instructions, or by the compiler automatically. The latter is called auto-vectorization.

Auto-vectorization can be done by Ahead-of-Time (AOT) compilers at compile time or by Just-in-Time (JIT) compilers at execution time.

Loop unrolling

Loop unrolling or loop unwrapping is a loop transformation technique that attempts to optimize the program’s execution speed at the expense of its binary size.

The goal of loop unrolling is to increase a program’s speed by reducing or eliminating instructions that control the loop, such as pointer arithmetic and the end-of-loop test on each iteration.

A simple example of loop unrolling would be:

Loop unrolling — unrolled loops look bad, right? They are often better for performance!
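The embedded snippet is not shown here; the sketch below illustrates the idea, reusing the array-sum loop from earlier (the unroll factor of 4 and the assumption that size is a multiple of 4 are mine).

// Original scalar loop over int arrays a, b, target of length size:
for (int i = 0; i < size; ++i) {
  target[i] = a[i] + b[i];
}

// The same loop unrolled by a factor of 4 (assuming size is a multiple of 4).
// The loop-control work (increment and comparison) now runs once per
// 4 element operations instead of once per element.
for (int i = 0; i < size; i += 4) {
  target[i]     = a[i]     + b[i];
  target[i + 1] = a[i + 1] + b[i + 1];
  target[i + 2] = a[i + 2] + b[i + 2];
  target[i + 3] = a[i + 3] + b[i + 3];
}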

In the later part of this article, I’ll share some performance numbers around these two.

gcc on Mac — Image by the Author.

Modern C++ compilers have code optimization passes like the loop vectorizer, which allows them to generate vector instructions for code written in scalar form.

So our scalar code is often already being run as vector instructions on the CPU! Phewww!

However, it depends on how the code is written: the compiler has to be able to tell whether it can auto-vectorize the code or not. Sometimes, the compiler cannot determine whether it is safe to vectorize a certain loop. The same goes for loop unrolling.

But worry not! There are ways to tell your compiler that it is safe to apply these optimizations.

The techniques below should work for compilers that can generate Neon code (read more below), such as GCC, LLVM Clang (used in the Android NDK), and the Arm C/C++ Compiler.

For demonstration, I’ll use the problem of converting an image from the YUV image format to the ARGB image format. For evaluating performance, I’ll use a Pixel 4a (an Android device).

Code example:

Code example to convert a YUV image to ARGB. The algorithm by default iterates over every pixel of the image in row-major order.
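The embedded gist is not reproduced here; the sketch below shows the general shape of such a scalar conversion. The function name, the assumed 4:2:0 planar layout, and the integer-approximated BT.601 coefficients are illustrative assumptions of mine, not the exact code benchmarked below.

#include <cstdint>

// Scalar YUV (4:2:0 planar) to ARGB conversion, iterating row-major over every pixel.
void ConvertYuvToArgb(const uint8_t* y_plane, const uint8_t* u_plane,
                      const uint8_t* v_plane, int width, int height,
                      uint32_t* argb_out) {
  for (int row = 0; row < height; ++row) {
    for (int col = 0; col < width; ++col) {
      // Chroma planes are subsampled 2x2, so four Y values share one U and one V.
      const int y = y_plane[row * width + col];
      const int u = u_plane[(row / 2) * (width / 2) + (col / 2)] - 128;
      const int v = v_plane[(row / 2) * (width / 2) + (col / 2)] - 128;

      // Integer-approximated BT.601 conversion, clamped to [0, 255].
      int r = y + ((359 * v) >> 8);
      int g = y - ((88 * u + 183 * v) >> 8);
      int b = y + ((454 * u) >> 8);
      r = r < 0 ? 0 : (r > 255 ? 255 : r);
      g = g < 0 ? 0 : (g > 255 ? 255 : g);
      b = b < 0 ? 0 : (b > 255 ? 255 : b);

      argb_out[row * width + col] =
          (0xFFu << 24) | (uint32_t(r) << 16) | (uint32_t(g) << 8) | uint32_t(b);
    }
  }
}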

On the Pixel 4a device, for an 8 MP image (3264x2448 = ~8 million pixels), running the code above gave the following average runtime latency.

Average latency of running the code above on Pixel 4A for 8MP image.

It’s useful to note that the compiler is already trying to optimize the code. It was compiled with the -O3 flag (i.e. optimized for speed over binary size or compilation time). Auto-vectorization is enabled with this flag.

pragma declaration — loop vectorize

Using the following #pragma declaration just before the for loop indicates to the compiler that the loop contains no data dependencies that would prevent auto-vectorization:

#pragma clang loop vectorize(assume_safety)

Important note: ensure that you only use this pragma when it is safe to do so; otherwise it could lead to race conditions.

So if we fit it into the example above like this:

Example of how to add #pragma directive on top of your loop
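For instance, on the (illustrative) conversion loop sketched earlier, the directive would sit right above the inner, per-pixel loop:

for (int row = 0; row < height; ++row) {
  // Tell the compiler the iterations of the inner loop are independent,
  // so it can assume vectorization is safe.
  #pragma clang loop vectorize(assume_safety)
  for (int col = 0; col < width; ++col) {
    // ... per-pixel YUV -> ARGB math as before ...
  }
}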

We get the following average latency numbers:

11.4% speedup by adding this single-line directive on top of the loop.

pragma declaration — loop unroll

Similarly, we can instruct the compiler to unroll loops when compiling with the following statement:

#pragma clang loop unroll_count(2)

So if we fit it into the example above like this:

Example of how to add #pragma directives at different stages of the for loop.
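Again using the illustrative loop from before, one possible placement is the unroll hint on the outer, per-row loop and the vectorization hint on the inner, per-pixel loop (the exact placement in the original code may differ):

// Unroll the outer loop by a factor of 2.
#pragma clang loop unroll_count(2)
for (int row = 0; row < height; ++row) {
  // Assume the inner loop's iterations are independent and safe to vectorize.
  #pragma clang loop vectorize(assume_safety)
  for (int col = 0; col < width; ++col) {
    // ... per-pixel YUV -> ARGB math as before ...
  }
}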

We get the following average latency numbers:

18.5% speedup by adding both the loop vectorize and loop unroll directives (2 additional lines of code).

The integer in unroll_count(N) guides the compiler on how much to unroll the loop; you can benchmark with different values to figure out the best one.

Overall, an 18+% speedup for this example with 2 lines of code and less than an hour of effort! The resulting code is easy to read and maintain.

Some other tips:

  1. Prefer using < to construct loops as compared to <= or !=.
  2. Using the -ffast-math option can significantly improve the performance of the generated code, as long as the algorithm is tolerant of potential inaccuracies, since it breaks compliance with IEEE and ISO standards for math operations.

TL;DR

You can get a meaningful speedup in your code by using these directives, which guide the compiler to optimize the code better.

In the example used, I was able to get roughly an 18%+ speedup with 2 additional lines of code, and the code is much easier to read and maintain than one written with vector intrinsics directly.

The final speedup with this approach depends on how independent the data is with respect to the algorithm.

The best way is always to try it and benchmark. The performance boost, if any, is usually well worth the engineering time spent.

I plan to write more about ways to achieve improved performance without sacrificing the maintainability of the code. A lot of these are derived from things I learned at Google and my hobby projects.

In this section, I am going to describe some of the concepts mentioned above in a little more detail.

Neon

Not our noble gas buddy! Neon is the implementation of Arm’s Advanced SIMD architecture. The purpose of Neon is to accelerate data manipulation by providing:

  1. Thirty-two 128-bit vector registers, each capable of containing multiple lanes of data.
  2. SIMD instructions to operate simultaneously on those multiple lanes of data.

Source: developer.arm.com

Per documentation there are multiple ways to make use of the technology:

  1. Using Neon-enabled open source libraries. We all love this!
  2. Auto-vectorization feature in the compilers that can take advantage of Neon.
  3. Using Neon intrinsics — the compiler will replace them with appropriate Neon instructions. This gives us direct low-level access to the exact Neon instruction we want from C/C++ code.
  4. Writing assembly code with Neon instructions directly; this only works well for really experienced programmers.
Read more:

  1. What is Neon?
  2. Coding best practices for auto-vectorization
