string_view: the good and the bad

C++17 introduced std::string_view to allow manipulating strings without the heap-allocation overhead that comes with std::string. In this article we will look at a couple of use cases.

First and foremost, what is string_view? It is a non-owning view over a contiguous sequence of chars. Let’s unpack the definition:

  • View: you have read-only access and you might be restricted to a sub-sequence of the original sequence;
  • Non-owning: something else owns the memory you can read from.

Return a string_view

When should we return string_view instead of string? Let’s see some examples:

class Employee {
    std::string id;
public:
    const std::string& getId() const { return id; }
};

Should we switch the return type of getId from string to string_view? No. We already have a string, so we can simply return a reference to it. The user of the class will see the return type and should know that they must ensure that the Employee instance outlives the reference.

Let’s now add a fallback value to the ID:

const std::string& getId() const {
    if (condition)
        return id;
    return "ID NOT AVAILABLE";
}

This is BAD! If you return the string literal "ID NOT AVAILABLE", the returned reference binds to a temporary string local to getId. The caller will therefore obtain a dangling reference. Welcome to the beautiful world of Undefined Behavior.

Before C++17 we had various solutions to the issue:

  1. Always return string instead of a reference and incur the memory-allocation penalty;
  2. Create a static string instance containing the string literal and return a reference to it in the second return;
  3. Return const char *, but this works only if you can assume that id contains no embedded null characters (see the notes at https://en.cppreference.com/w/cpp/string/basic_string/c_str ).

In C++17 we can fix the situation in a very elegant way: we can notice that both possible return values survive past the end of the call, so we can rewrite the function as

std::string_view getId() const { 
         if(condition)
             return id; 
         return "ID NOT AVAILABLE";
}

We used an if statement; let's use a conditional expression instead to get more compact code.

std::string_view getId() const { 
         return condition ? id : "ID NOT AVAILABLE";
}

And this is also BAD, as the author of this question on Stack Overflow discovered. Why? I will shamelessly summarize the accepted answer. What is the type of the conditional expression? To determine it, the compiler tries to convert one of the two operand types to the other (here, the string literal to std::string), and this conversion succeeds. Basically you get something like:

std::string_view getId() const { 
         std::string temp = condition ? id : "ID NOT AVAILABLE";
         return temp;
}

Which is bad because you will be returning a string_view pointing to a temporary that is destroyed when the function returns. The solution? We need the common type of the expression to be string_view. A convenient conversion from string to string_view exists, so let's nudge the expression in the right direction by turning the string literal into a string_view:

std::string_view getId() const { 
         return condition ? id : std::string_view{"ID NOT AVAILABLE"};
}

Another example of a string_view return value is getting a substring; consider the following member function of the class Employee:

std::string getPartOfId(const std::size_t n) const{
     return id.substr(0, n);
}

This is wasteful: id will survive the end of the call, yet we are creating a new string for no good reason.

In C++17 we could write this:

std::string_view getPartOfId(std::size_t len) const {
	std::string_view view{ id };
	return view.substr(0, len);
}

If the end user needs a string, they can still convert the string_view to a string, but the function is not doing any potentially useless allocations.

But is all of this worth it? Yes. We ran some tests, using relatively long substrings (40 chars, to avoid the small string optimization) with both versions of getPartOfId. The string_view version was up to 6 times faster on MSVC 2019 at optimization level O2 than the version returning a string.

So, let's summarize: when is it appropriate to use string_view as a return value?

  1. When we would otherwise need to build a new string;
  2. When we can be sure that the string_view will not outlive the memory it refers to.

Use string_view as input parameter

Let’s add yet another member function to our class

void appendToId(const std::string& newPart) {
	id += newPart;
}

If we pass a string literal to appendToId, a temporary string will be created, passed to the function, and then destroyed. This is wasteful, in particular if the string literal is so long that the small string optimization cannot be used. But with a small change we can do the following:

void appendToId(std::string_view newPart) {
	id += newPart;
}

And we can avoid a wasteful heap allocation.

So string_view is a sure winner as an input parameter, no? Well, not really: string_view can be a barrier to optimization. Consider this last member function of Employee:

void setId(std::string id){
    this->id = std::move(id);
}

Passing by value has advantages: if the caller no longer needs the argument, they can simply invoke the function as:

employee.setId(std::move(newID));

giving up the buffer that the string owns. This also works if you pass a string literal, because a new string will be created on the fly and then, inside the function, its underlying buffer can be moved into id. For the pros, cons, and alternatives of this approach, you can refer to The Nightmare of Move Semantics for Trivial Classes, a CppCon 2017 presentation by Nicolai Josuttis.

Null termination and C-style functions

Assume we have a string_view and a C-style function accepting a null-terminated character array, something like:

std::size_t strlen(const char* str);

Is it safe to invoke the function in this way?

std::string_view s = .....
....
strlen(s.data())

No! A string_view might not be null-terminated: see, for example, the case where s is the result of invoking substr on a string_view.

Conclusion

string_view allows for expressive and performant code, but at a risk: the user of the class must be aware of the lifetime implications, and we must be mindful of the fact that a string_view is not guaranteed to be null-terminated. Secondly, its use as an input parameter makes sense when the function would not benefit from taking the parameter by value.

C++17 and True Sharing

In a previous post we saw how to avoid false sharing using a new feature of C++17: the constant std::hardware_destructive_interference_size. The same page on cppreference describes another constant, also available in the header <new>:

inline constexpr std::size_t
    hardware_constructive_interference_size 

This constant can be used to make sure that data that will be accessed together is on the same cache line.

We would expect the two constants to have equal value, and usually they do, but having two distinct constants has advantages:

  1. Code expressiveness: for example you can clearly state why a particular alignment is required;
  2. More flexibility.

True Sharing

The idea of true sharing is pretty simple: assume we have a struct like the following one:

struct bigStruct{
  int element1;
  ... other stuff....
  int element2;
};

If element1 and element2 are frequently accessed together, they should be in the same cache line. Consider the following example:

bigStruct a;
.....
a.element1 += a.element2;

If element1 and element2 belong to two different cache lines, we will need to load both lines, possibly impacting performance. So we should rewrite our struct in this way:

struct alignas(std::hardware_constructive_interference_size) shouldStayTogether{
   int element1;
   int element2;
};

struct bigStruct{
.....
  shouldStayTogether elements;
......
};

How useful is this alignment, and what kind of performance improvement can we get?

The experiment

I will use the setup of the previous post:

  • Microsoft Visual Studio 2019, Community, version 16.1.1
  • Optimization level: O2
  • Build: Release / x64
  • Standard: C++17
  • Processor: Intel i7-4800MQ
  • RAM: 16 GB

Here is the test program:

#include <algorithm>
#include <iostream>
#include <numeric>
#include <cstdlib>
#include <chrono>
#include <atomic>
#include <thread>
#include <vector>
#include <array>
#include <new>

//Function that measures the performance
template<typename countersType>
void measure() {

	constexpr std::size_t numIter = 40'000'000;
	constexpr std::size_t numExpr = 20;

	struct alignas(64) resultType {
		std::array<std::uint8_t, 64 - sizeof(int)> padding{ 0 };
		countersType cnt;
	};


	int totRes = 0;
	double totalD = 0;

	for (std::size_t q = 0; q < numExpr; ++q) {

		std::vector<resultType> data(numIter);

		const auto start = std::chrono::high_resolution_clock::now();
                //The important part
		for (std::size_t j = 0; j < data.size(); ++j) {
			data[j].cnt.val1 += data[j].cnt.val2;
		}
		const auto stop = std::chrono::high_resolution_clock::now();
                //Performance measurement
		const double duration =
			static_cast<double>(std::chrono::duration_cast<std::chrono::milliseconds>(stop
				- start).count());
                /*Use the results to avoid the loop being optimized out*/
		auto fromResTypeToInt = [](const resultType a) {return (a.cnt.val1 + a.cnt.val2) % 10; };
		auto sumAbs10 = [](const int a, const int b) {return (a + b) % 10; };
		auto res = std::transform_reduce(cbegin(data), cend(data), 0, sumAbs10, fromResTypeToInt);

		totRes += res;
		totalD += duration;
	}
	std::cout << "Total duration " << totalD << '\n';
	std::cout << "Result to avoid loop being optimized out " << totRes << '\n';
}

int main() {
        //Force val1 and val2 to be on the same line
	std::cout << "Cache Aligned\n";
	struct alignas(std::hardware_constructive_interference_size) countersAligned {
		int val1{ rand() % 10 };
		int val2{ rand() % 10 };
	};
	measure<countersAligned>();

	std::cout << "Cache NOT aligned\n";
	struct countersNotAligned {
		int val1{ rand() % 10 };
		int val2{ rand() % 10 };
	};
	measure<countersNotAligned>();
}

In order to obtain cache-miss numbers I also compiled the code under Linux, so that I could use the perf tool. hardware_constructive_interference_size was substituted with 64; then I used:

  • GCC 7.4
  • -O2
  • –std=c++17

The idea is simple: we first create a type called resultType. resultType is a template containing some padding and an instance of countersType. resultType is aligned to the size of a cache line, therefore any instance of the type will start at the beginning of a cache line. In my case the cache line size is 64 bytes. resultType contains enough padding that, with the default alignment, countersType straddles two cache lines. We create a long vector of resultType instances and sum the two integers (val1 and val2) contained in countersType.

template<typename countersType>
struct alignas(64) resultType {
    std::array<std::uint8_t, 64 - sizeof(int)> padding{ 0 };
    countersType cnt;
};

Thanks to hardware_constructive_interference_size, countersType can be aligned to the start of the next cache line, or left unaligned. If countersType is not aligned, its first half belongs to one cache line and its second half to the next one. Here are the results; the number of samples and iterations are the same as in the code above.


countersType     Windows Time [ms]    Linux Time [ms]    Linux Cache Misses
Aligned          7848                 6552               775,408,558
Not aligned      8715                 8460               882,116,005

So keeping on the same cache line elements that are used together improves performance, and now C++17 provides a portable way to do so.

Obviously, the example I provided is contrived: I had to align every struct to a cache line to obtain meaningful results; without this precaution the performance is the same. Regardless, should you find yourself needing to ensure that two or more variables reside on the same cache line, you can consider the above-mentioned technique.

C++11/14 and alignment

When I played around with the Linux version I also tried compiling it as C++14, and in that case I did not see any performance difference between the two versions.

I examined the assembly code on godbolt to see what was going on and I found something interesting. In C++17 the vector is allocated with:

operator new(unsigned long, std::align_val_t)

while in C++14 we use

operator new(unsigned long)

This is another new feature of C++17: operator new (in this case used by std::vector) also receives the alignment requirement. In C++14 and earlier versions the alignment was, for this particular case on my machine, 16 bytes, making the memory layout of the array unsuitable for demonstrating the effects of a struct straddling two cache lines.

Conclusion

As we have seen C++17 supports a portable way to ensure that related elements reside on the same cache line and this can, in particular cases, yield performance gains. As always we should be careful, check our assumptions and measure our results.

C++17 and False Sharing

A few days ago I was listening to a talk by Michael Wong at ACCU 2019, in which he mentioned, at around 01:05:00, that C++17 has support for dealing with false sharing.

Some of the comments pointed out a new nifty feature of the C++17 standard: https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size

The last time I dealt with false sharing I was using C, OpenMP and some hard coded values. Let’s see if C++17 makes our life easier.

False Sharing

A bit of background first, if you know what false sharing is, skip to the next section.

Let’s first discuss the Sharing part of False Sharing: when reading data from RAM, the CPU loads a whole block of memory (a cache line) into its own cache (there are actually several cache levels, but let’s ignore that). This leverages data locality: if you access a piece of data, you will most likely access other data close to it, so it makes sense to load the whole cache line for faster retrieval. What happens when one CPU modifies the content of a cache line currently shared with other CPUs? The caches of the other CPUs must be reloaded before the threads running on those CPUs can continue, stalling them. This mechanism gives the various threads of execution in a process a coherent view of memory. Intel has a nice article about the issue where you can find more details.

Now let’s discuss the False bit of False Sharing: this happens when two or more threads read and modify independent data that happen to sit on the same cache line. In this case the cache-coherence mechanism we saw before becomes an issue: a CPU stalls waiting for an update that is completely useless to it. See the example below.

False sharing can be fixed, usually, in two different ways:

  1. Make sure that unrelated data are stored in different cache lines
  2. Use local data for intermediate calculation and then access shared memory only at the end.

Usually, my favorite solution is the second, but what happens when only the first can be pursued? C++17 to the rescue!

False Sharing and C++17

The header <new> in C++17 comes with a new constant:

#include <new>

inline constexpr std::size_t 
                  hardware_destructive_interference_size

You can use this constant to make sure that independent data belong to different cache lines, because it gives us the size of a cache line in a portable way. Consider:

struct resultType {
	int val;
};

An array of resultType might suffer from false sharing: for example, if int is 4 bytes and a cache line is 64 bytes, up to 16 consecutive elements of the array will fall in the same cache line. We can change the struct in this way:

struct resultType {
        alignas(hardware_destructive_interference_size) 
        int val;
};

Now every struct carries enough padding to make sure that every val element has its own cache line. This, in theory, avoids false sharing, but we have to pay a cost: because of the alignment requirement, the struct grows in size.

Measure, measure, measure

Any optimization must be tested. At the time of writing only Microsoft Visual Studio 2019 supports this feature. I made some evaluations using the following test environment:

  • Microsoft Visual Studio 2019, Community, version 16.1.1
  • Optimization level: O2
  • Build: Release / x64
  • Standard: C++17
  • Processor: Intel i7-4800MQ
  • RAM: 16 GB

Below is the program I used in my experiments. Please note that if you find yourself dealing with a similar problem, using raw threads as I did is a very bad idea; consider using std::future or at least local variables (more on this below).

#include <algorithm>
#include <iostream>
#include <numeric>
#include <cstdlib>
#include <chrono>
#include <thread>
#include <array>
#include <new>

#define FALSE_SHARING() false

int main() {

	constexpr std::size_t numProcessors = 2;
	constexpr std::size_t numIter = 40'000'000;

#if FALSE_SHARING()
		std::cout << "With false sharing \n";
		struct resultType {
			int val;
		};
#else
		std::cout << "Without false sharing\n";
		struct resultType {
			alignas(std::hardware_destructive_interference_size) int val;
		};
#endif
	std::array<resultType, numProcessors> results{ 0 };
	std::array<std::thread, numProcessors> threads;

	auto start = std::chrono::high_resolution_clock::now();

	for (std::size_t i = 0; i < numProcessors; ++i) {
		auto& result = results[i];
		threads[i] = std::thread{ [&result, numIter]() {
			for (std::size_t j = 0; j < numIter; ++j)
				result.val = (result.val + std::rand() % 10) % 50;
		} };
	}

	std::for_each(begin(threads), end(threads), [](std::thread & t) { t.join(); });

	auto stop = std::chrono::high_resolution_clock::now();

	std::cout << "Duration "<<std::chrono::duration_cast<std::chrono::milliseconds>(stop -
		start).count() << '\n';
	const auto res = std::accumulate(cbegin(results), cend(results), resultType{ 0 },
		[](const resultType a, const resultType b) {
			auto s = a.val + b.val;
			return resultType{ s };
		});
	std::cout << "result " << res.val << '\n';
}

The results for 2 threads of execution are as follows:

No False Sharing (ms)    False Sharing (ms)
880                      1853
768                      1648
764                      1642
761                      1743
754                      1576
772                      1591
Average: 783             Average: 1675

So, now, thanks to C++17 we can avoid false sharing without having to hardcode values or use platform-specific macros.

In this particular case, I would actually advise against the alignment trick and suggest using local variables as much as possible. If we substitute the central loop with the following construction, where all the intermediate calculations are done on the local variable localRes:

for (std::size_t i = 0; i < numProcessors; ++i) {
	auto& result = results[i];
	threads[i] = std::thread{ [&result, numIter]() {
		resultType localRes{0};
		for (std::size_t j = 0; j < numIter; ++j)
			localRes.val = (localRes.val + std::rand() % 10) % 50;
		result = localRes;
	}
	};
}

we confine the false sharing to the final assignment, and the two versions of the code yield basically the same results.

Closing Notes

False sharing is an insidious problem that can hinder the performance of multi-threaded code. In the past the only weapons at our disposal were the use of local data or hardcoding the cache line size, making the code less portable. Now, thanks to C++17, we can obtain the cache-line size in a portable way.