C++17 and False Sharing

Few days ago I was listening to a talk from Michael Wong at ACCU 2019, when he mentioned that C++17 has false sharing support, at around 01:05:00.

Some of the comments pointed out a new nifty feature of the C++17 standard: https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size

The last time I dealt with false sharing I was using C, OpenMP and some hard coded values. Let’s see if C++17 makes our life easier.

False Sharing

A bit of background first, if you know what false sharing is, skip to the next section.

Let’s first discuss the Sharing part of False Sharing: when reading data from the RAM, the CPU loads a block of memory (cache line) in its own cache (actually there are more cache levels, but let’s ignore this fact). This is done to leverage data locality: if you access a piece of data, most likely you will access other pieces of data that are close to it, so it makes sense to load the whole cache line into the cache for faster retrieval. What happens when one CPU modify the content of a cache line currently shared with other CPUs? The caches of the other CPUs must be reloaded before the processes running on those CPUs can continue, stalling them. This mechanism allows the various threads of execution in a process to have a coherent view of the memory. Intel has a nice article about the issue where you can get more details.

Now let’s discuss the False bit of False Sharing: this happens when two or more processes read and modify independent data on the same cache line. In this case the cache coherence mechanism we saw before becomes an issue: a CPU stalls waiting for an update that is completely useless. See the example below.

False sharing can be fixed, usually, in two different ways:

Make sure that unrelated data are stored in different cache lines
Use local data for intermediate calculation and then access shared memory only at the end.

Usually, my favorite solution is the second, but what happens when only the first can be pursued? C++17 to the rescue!

False Sharing and C++17

The header <new> in C++17 comes with a new constant:

#include <new>

inline constexpr std::size_t 
                  hardware_destructive_interference_size

You can use this constant to make sure that independent data belongs to different cache lines because the variable allow us to access the size of a cache line in a portable way. Consider

struct resultType {
	int val;
};

An array of resultType might have false sharing: for example, if int is 4 bytes and a cache line is 64 bytes, up to 16 consecutive cells of the array will fall in the same cache line. We can change the struct in this way:

struct resultType {
        alignas(hardware_destructive_interference_size) 
        int val;
};

Now every struct will add enough padding to make sure that every val element has its own cache line. This, in theory, will avoid false sharing but we have to pay a cost: because of the alignment requirement the struct will increase its size.

Measure, measure, measure

Any optimization must be tested. At the time of writing only Microsoft Visual Studio 2019 supports this feature. I made some evaluations using the following test environment:

Microsoft Visual Studio 2019, Community, version 16.1.1
Optimization level: O2
Build: Release / x64
Standard: C++17
Processor: Intel i7-4800MQ
RAM: 16 GB

Below the program I used in my experiments. Please note that if you find yourself dealing with a similar problem, using raw threads as I did is a very bad idea, please consider using std::future or at least local variables (more on this below).

#include <algorithm>
#include <iostream>
#include <numeric>
#include <chrono>
#include <thread>
#include <array>
#include <new>

#define FALSE_SHARING() false

int main() {

	constexpr std::size_t numProcessors = 2;
	constexpr std::size_t numIter = 40'000'000;

#if FALSE_SHARING()
		std::cout << "With false sharing \n";
		struct resultType {
			int val;
		};
#else
		std::cout << "Without false sharing\n";
		struct resultType {
			alignas(std::hardware_destructive_interference_size) int val;
		};
#endif
	std::array<resultType, numProcessors> results{ 0 };
	std::array<std::thread, numProcessors> threads;

	auto start = std::chrono::high_resolution_clock::now();

	for (std::size_t i = 0; i < numProcessors; ++i) {
		auto& result = results[i];
		threads[i] = std::thread{ [&result, numIter]() {
			for (std::size_t j = 0; j < numIter; ++j)
				result.val = (result.val + std::rand() % 10) % 50;
		} };
	}

	std::for_each(begin(threads), end(threads), [](std::thread & t) { t.join(); });

	auto stop = std::chrono::high_resolution_clock::now();

	std::cout << "Duration "<<std::chrono::duration_cast<std::chrono::milliseconds>(stop -
		start).count() << '\n';
	const auto res = std::accumulate(cbegin(results), cend(results), resultType{ 0 },
		[](const resultType a, const resultType b) {
			auto s = a.val + b.val;
			return resultType{ s };
		});
	std::cout << "result " << res.val << '\n';
}

The results for 2 threads of execution are as follows:

No False Sharing (ms)	False Sharing (ms)
880	1853
768	1648
764	1642
761	1743
754	1576
772	1591
Average: 783	Average: 1675

So, now, thanks to C++17 we can avoid false sharing without having to hardcode values or use platform-specific macros.

In this particular case, I would advise against using the alignment trick, I would suggest using local variables as much as possible. If we substitute the central loop with the following construction, where all the intermediate calculation are done on the local variable localRes:

for (std::size_t i = 0; i < numProcessors; ++i) {
	auto& result = results[i];
	threads[i] = std::thread{ [&result, numIter]() {
		resultType localRes{0};
		for (std::size_t j = 0; j < numIter; ++j)
			localRes.val = (localRes.val + std::rand() % 10) % 50;
		result = localRes;
	}
	};
}

we minimize the false sharing to the final assignment and the two version of the code yield basically the same results.

Closing Notes

False sharing can be an insidious problem that can hinder the performance of multi-threaded code. In the past the only weapons at our disposal were the use of local data or hardcoding the cache line size, making the code less portable. Now thanks to C++17 we can obtain information about our cache in a portable way.

C++17 and False Sharing

False Sharing

False Sharing and C++17

Measure, measure, measure

Closing Notes

Published by Admin

One thought on “C++17 and False Sharing”

Leave a comment Cancel reply

False Sharing

False Sharing and C++17

Measure, measure, measure

Closing Notes

Share this:

Related

Published by Admin

One thought on “C++17 and False Sharing”

Leave a comment Cancel reply