Introduction To valarray

valarray is std:: vector’s  unpopular nerdy little brother.  It was designed for multi-dimensional numeric calculations.  While it may sound like a niche tool for scientific applications, It actually can be useful in other kinds of software.

Consider a vector representing a shopping list items prices,  and another vector representing the discount for each item:

#include <iostream>
#include <vector>
#include <string>
#include <sstream>

template <typename C>
std::string  seqTostring(C c)
{
    std::stringstream ss;
    ss << '(';
    for (auto it = std::begin(c); it != std::end(c); it++)
    {
        ss << *it;
        if (std::next(it) != end(c)) ss << ",";
    }
    ss << ')';
    return ss.str();
}

void calculatePrice()
{
    std::vector<double> itemPrice{ 11 , 44, 5.2, 1.99 };
    std::vector<double> itemDiscount{0 , 0.2, 0.4, 0 };
    std::vector<double> finalPrice(itemPrice.size());

    for (size_t i = 0; i < itemPrice.size(); i++)
    {
        finalPrice[i] = itemPrice[i] * (1.0 - itemDiscount[i]);
    }
    std::cout << "List of final prices: " << seqTostring(finalPrice) << '\n';
}

int main()
{
    calculatePrice();
    return 0;
}

List of final prices: (11,35.2,3.12,1.99)

Doing the same calculation with  valarray  is as straightforward as can be:


#include <valarray>
void calculatePrice()
{
   std::valarray<double> itemPrice{ 11 , 44, 5.2, 1.99 };
   std::valarray<double> itemDiscount{0 , 0.2, 0.4, 0 };
   std::valarray<double> finalPrice = itemPrice * (1.0 - itemDiscount);
   std::cout << "List of final prices: " << seqTostring(finalPrice) << '\n';
}

List of final prices: (11,35.2,3.12,1.99}

Notice how each of the elements in the itemDiscount array is individually subtracted from 1.0 before being multiplied by the corresponding element in the itemPrice array. In addition to producing a loop-free code, valarray numeric specialization allows compilers to emit highly optimized machine code under the assumption the elements in valarray are usually processed as numerical values.

For primitive numerical values such as int and double,  valarray supports, out of the box, any  type of arithmetic and logical operation:

int main()
{
 	std::valarray<unsigned int> values{ 1 /*1000*/, 13 /*1101*/ , 8 /*0001 */ };
	std::valarray<unsigned int> andResult = 8U & values;
	std::cout << "Logical AND of " << seqTostring(values) << " with 0001b : "
		<< seqTostring(andResult) << '\n';
	return 0;
}

Logical AND of (1,13,8) with 0001b : (0,8,8)

In addition, some mathematical functions, such as tan and sqrt,  are overloaded to accept a valarry:


int main()
{
	std::valarray<int> vals{ 144 , 16, 4, 81 };
	std::cout << "sqrt of  " << seqTostring(vals) << " = " << seqTostring(sqrt(vals)) << '\n';
	return 0;
}

sqrt of  (144,16,4,81) = (12,4,2,9)

However, if the function you need is not included in the standard library   you can define your own operation with the apply() method:

int main()
{
	std::valarray<int> vals{ 97,54,84,76 };
	std::valarray<int> applyResult = vals.apply([](int val)	{
		auto s = std::to_string(val);
		return std::stoi(s + s);});
	std::cout << seqTostring(vals) << " converted to " << seqTostring(applyResult) << '\n';
}

Contrary to what you would expect, apply accepts a lambda but will not accept a function object! this is the first hint of the somewhat skewed way valarray integrates with the rest of the C++ standard library.  It appears as if valarry started on a different evolutionary branch, apart from the rest of the library.  Methods such as begin(), end(),  clear()    we have gotten used to finding in a container class are missing.   If you wish to add more elements than the array capacity you must call valarray‘s  resize()  member manually with the new size  (invalidating all contained elements as a side effect )  than re-insert all old elements and add new elements  (a sort of manual reallocation).  It also means calling resize(0) effectively clear the array.

An empty valarry cause undefined behavior in simple workflows such as trying to call sum().  So you should be extra careful when working with empty arrays.
Another peculiarity of valarray implementation is an occasional compiler’s  failure to cast correctly returned types when valarrys are involved.  rewriting the previous apply() example as below compile’s fine in Visual Studio 2015, but not in GCC 4.9.2 :

int main()
{
	std::valarray<int> vals{ 97,54,84,76 };
	std::cout << seqTostring(vals) << " converted to " <<
		seqTostring(vals.apply(doubleTheNumbers)) << '\n';
}

Getting this and other instances of code involving valarrays require liberal use of static_cast<varlarry>  in order to point the compiler in the right direction.

C++ 11  took a few steps to bridge the gap,  among several updates: valarray got its own overload for the begin()/end() functions allowing it play better with the STL containers and algorithms:

std::valarray<int> vals = { 54,94,82,52,3,91,57,37,77,38 };
	std::sort(begin(vals), end(vals));
	std::cout << " sorted array: " << seqTostring(vals) << '\n';

(97,54,84,76) converted to (9797,5454,8484,7676)

Instantiating valarray with a user types  T  works in a similar way to other containers. The standart requires that constructing T with a copy constructor must be the same as instantiating T with a default constructor and using the assignment operator to assign it.  For example,  given:


struct valType
{
	int i;
	int j;
};

than

struct valType
{
	int i;
	int j;
};

valType val1{ I1,I2 }; 
valType  val2;
val2 = val1;

valType val3(val1);
assert (val2 == val3); // never fails for any I1, I2

For the full list of requirements of T see  C++ 11 standard, ISO/IEC 14882:2011(E)  section 26.2

Once you do come up with the proper type,  you will have to overload any functions you might need:

std::ostream& operator<<(std::ostream& os, const valType& vt)
{
	os << '{' << vt.i << ',' << vt.j << '}';
	return os;
}

valType sqrt(valType vt)
{
	valType res;
	res.i = sqrt(vt.i);
	res.j = sqrt(vt.j);
	return res;
}

int main()
{
	std::valarray<valType> vals{ { 4,4 },{ 16,9 },{ 256,1 },{ 1,64 } };
	std::valarray<valType> vals_sqrt = sqrt(vals);
	std::cout << seqTostring(vals) << " sqrt: " << seqTostring(vals_sqrt) << '\n';
	std::cout << seqTostring(vals) << " sum:  " << vals.sum() << '\n';
}

({4,4},{16,9},{256,1},{1,64}) sqrt: ({2,2},{4,3},{16,1},{1,8})
({4,4},{16,9},{256,1},{1,64}) sum: {277,78}

In spite of valarray’s quirks,  it can be a powerful tool when  applied with care.

In the next post, I’ll discuss the valarray [] operator, it is overloaded with som much functionality it warrants its own post.

for further reading see cppreference.com   and chapter 26.6 of the C++ 11 standard = ISO/IEC 14882:2011(E)

 

 

C++’s Simple Singletons

The singleton is one of the most controversial design patterns. Yet,  it is omnipresent in many real-world code bases. With C++ 11,   creating a thread safe, lazy initialized singleton  is  cheap, painless and straight forward process, by either using local statics or std::call_once:

Local statics

Creating a singleton with a local statics boils down to:

Singleton& Singleton::Instance()
{
	static Singleton instance;
	return instance;
}

The standard guarantees the Singleton instance  is created once, even if more than one thread attempts to call Instance at the same time (ISO/IEC 14882:2011 6.7 footnote #4)

call_once

std::call_once wraps a callable object and ensure it is called only once. Even if multiple threads try to call it at the same time.

The wrapper is using an object of type std::once_flag to keep track whether the code was already called:

#include <iostream>
#include <mutex>

class Singleton
{
private:
	Singleton(const Singleton&) = delete;
	Singleton & operator=(const Singleton&) = delete;
	

	static std::unique_ptr<Singleton> instance;
	static std::once_flag onceFlag;
public:
	Singleton() = default;

	static void NofityInit()
	{
		std::cout << "Initializing Singleton" << '\n';
	}
	static Singleton& Singleton::Instance()
	{
		std::call_once(Singleton::onceFlag,[] (){
			NofityInit();
			instance.reset(new Singleton); 
		});

		std::cout << "Getting  Singleton instance" << '\n';
		return *(instance.get());
	}
};

std::unique_ptr<Singleton> Singleton::instance;
std::once_flag Singleton::onceFlag;

int main()
{
	Singleton& s1 = Singleton::Instance();
	Singleton& s2 = Singleton::Instance();

	return 0;
}

The output for this program:


Initializing Singleton
Getting Singleton instance
Getting Singleton instance

For more details, including performance study, check out Rainer Grimm’s in-depth post.

The Initialization List’s Illusionist Show

Consider the following  source:

 #include "b.h"
    class A
    {
    public:
        A();
        int get_i();

    private:
        B b;
        int i;
    };
a.h

 

#include "a.h" 
#include < iostream >

 A::A() : i(100), b(this){} 
 int A::get_i(){ return i; }
a.cpp
class A;
class B
{
public:
    B(A* _a);

private:
    A* a;
};
b.h
#include "a.h"
#include "b.h"
#include <iostream>

B::B( A* _a) : a(_a) 
{
  std::cout << "running B constructor "  << '\n';
  std::cout << "A::i value : " << a->get_i() << '\n';
} 
b.cpp
#include "a.h"
int main(void)
{
    A a;
    return 0;
}
main.cpp

Looking at the A class constructor’s  code you might expect the output to be:


Running B constructor
A::i value : 100

However,  after running the code on my  machine I got:

Running B constructor
A::i value : 1866555446

b  is the last member in the initialization list.  More to the point, it is located after the i member variables.  Yet, It appears as if i contains junk value when the B::B() constructor retrieves its value through the a this pointer.

So, what’s going on?

As a general principal, Having this in an initialization list should be done with some care.   However,  in this specific example,  the root cause is C++’s take on an illusionist act:  While the developer is busy carefully positioning variables in the initialization list, he is unaware the initialization order was already set without him noticing!

Variables initialization order is set by the order the variables are declared in the class,  not by the order they appear in the initialization list.  b is declared in  A  before i.  So,  when an A  object is constructed, The b member (declared first A class deceleration) is initialized first with the value supplied by the initialization list. The supplied value happen’s to be a pointer to a  half-baked A object with an uninitialised i member producing the junk value.

Conclusion

Declaring class variables doesn’t seem significate beyond the simple act of specifying data members needed to define the object’s state.  In effect, the class member declaration is important in subtle ways,  from object initialization to the way it is layout in memory.

Ordered Vs. Unordered Sets

In mathematics, a set has no order. The sets  {1,2,3}  and {3,2,1}   are equivalent.  STL’s std::set diverges from the mathematical definition and imposes an order on its elements. Typically the set is implemented internally as a red-black tree.  a compare predicate (by default the std::less <T>  operator) determines the position of a new element in the tree.   It allows for efficient iteration over the elements for a price:   insertion, removal and search of an element cost an average of  O(log n).  This is not too bad for small sets, however, it can become an issue for larger sets or sets accessed from performance critical code. 

If all you need is a container that will keep track weather and object is contained in a set or not.   You are better off using  unordered_set  (introduced in C++ 11) its functionality is similar to the pre-C++ 11  STL set,  Only it is implemented as a hash table with an average, insert, delete and search  times of O(1)    with large sets the difference is noticeable:

#include <iostream>
#include <chrono>
#include <unordered_set>
#include <set>
#include <string>

template <typename TimeT = std::chrono::milliseconds> struct measure {
    template <typename F, typename... Args> static typename TimeT::rep execution(F func, Args&&... args)
    {
        auto start = std::chrono::system_clock::now();
        // Now call the function with all the parameters you need.
        func(std::forward<Args>(args)...);
        auto duration = std::chrono::duration_cast<TimeT>(std::chrono::system_clock::now() - start);
        return duration.count();
    }
};

std::string baseString = "abcdefghi";

class SetContainer
{
public:
    void populate_unordered_set()
    {
        std::string s = baseString;
        while(std::next_permutation(begin(s), end(s))) {
            unorderedSet.insert(s);
        }
    }

    void populate_ordered_set()
    {
        std::string s = baseString;
        while(std::next_permutation(begin(s), end(s))) {
            orderedSet.insert(s);
        }
    }

    bool ordered_set_has(std::string val)
    {
        return (orderedSet.find(val) != orderedSet.end());
    }

    bool unordered_set_has(std::string val)
    {
        return (unorderedSet.find(val) != unorderedSet.end());
    }

    long iterate_ordered_set()
    {
        long count = 0;

        for(auto& s : orderedSet) {
            count++;
        }
        return count;
    }

    bool iterate_unordered_set()
    {
        long count = 0;
        for(auto& s : unorderedSet) {
            count++;
        }
        return count;
    }

private:
    std::unordered_set<std::string> unorderedSet;
    std::set<std::string> orderedSet;
};

using nsMeasure = measure<std::chrono::nanoseconds>;

int main(int argc, char** argv)
{
    SetContainer setContainer;
    std::cout << "Unordered set population time (ns) :"
              << nsMeasure::execution(std::bind(&SetContainer::populate_unordered_set, &setContainer)) << '\n';
    std::cout << "Ordered set population time (ns) :"
              << nsMeasure::execution(std::bind(&SetContainer::populate_ordered_set, &setContainer)) << "\n\n";

    std::string searchVal = "abcdefgih";
    std::cout << "searching for " << searchVal << " :\n";

    std::cout << "Unordered Set (ns) :"
              << nsMeasure::execution(std::bind(&SetContainer::unordered_set_has, &setContainer, searchVal)) << '\n';
    std::cout << "Ordered Set (ns) :"
              << nsMeasure::execution(std::bind(&SetContainer::ordered_set_has, &setContainer, searchVal)) << "\n\n";

    std::cout << "Unordered set's elements iteration (ns) :"
              << nsMeasure::execution(std::bind(&SetContainer::iterate_unordered_set, &setContainer)) << '\n';
    std::cout << "Ordered set's elements iteration (ns) :"
              << nsMeasure::execution(std::bind(&SetContainer::iterate_ordered_set, &setContainer)) << "\n\n"   ;

    return 0;
}


Unordered set population time (ns) :225967000
Ordered set population time (ns) :277887000

searching for abcdefgih :

Unordered Set (ns) :2000
Ordered Set (ns) :6000

Unordered set’s elements iteration (ns) :29006000
Ordered set’s elements iteration (ns) :17211000

(The handy measure template is from Nikos Athanasiou’s answer on Stackoverflow.com)

The program above was compiled with clang in release mode and -O2 flag.

As you can see from the output,  inserting element in the unordered  set, took nearly half of the time it took to insert an element in the ordered  set.  The unordered container search time  is third of the ordered set’s  search time.  On the other hand iterating over the unordered set  takes twice as much as the unordered set .

Conclusion

Like a cable subscription: there is really no reason to pay for the all inclusive plan when you are only going to watch the sci-fi channel.  If all you want is to keep track of a bunch of objects and you rarely need to iterate over the elements in the set, consider using the unordered set for better performance. 

Atomic Basics

Last week I’ve  encountered  some confusion around the way atomic work in C++.  Some  developers believe  wrapping any data structure with atomic will  make it,  by some magic,  thread safe.  They are then perplexed by the source not compiling or even worse compiling but not behaving as expected.

Reading the fine print in the C++ standard  [ISO/IEC 14882:2014(E)  Ch. 25.5 ]   reveals that the  T argument in the atomic template is required to be trivially copyable .    What does it mean?   a  type is trivially copyable  if its creation  or destruction doesn’t involve more than allocation or releasing memory –  similar to C’s  old style malloc.  The standard of course is more specific  you can find the full list of requirements in  [ISO/IEC 14882:2014(E)  Ch. 9 ]

So all primitives such as intchar  array  types are trivially copyable.  Since their creation requires no  more than allocation space for them in memory.  These types can be copied  safely using memcpy  . The standard library provides a handy method  called  std::is_trivially_copyable  to determine if a type is  trivially copyable:

int main()
{

    std::cout << "int  trivially copyable ? " << std::is_trivially_copyable<int>::value << " \n";
    std::cout << "std::array  trivially copyable ? " << std::is_trivially_copyable<std::array<char, 16> >::value << " \n";
    std::cout << "std::string  trivially copyable ? " << std::is_trivially_copyable<std::string>::value << " \n";

    return 0;
}

int trivially copyable ? 1
std::array trivially copyable ? 1
std::string trivially copyable ? 0

Coping memory in a thread safe way is what lays at the core of atomic thread safety.  This explains the  requirement for  trivially copyable types.  In most implementations,  atomic containing  primitive types maintain thread safety  without using a synchronization object (they are lock free).  However, with none-primitive types,  the compiler  will employ  a  lock to ensure the  modifying thread has exclusive access the memory used by the atomic variable. The method std::atomic::is_lock_free can determine which strategy the compiler selected for a type.

Atomic also control the type of order  threads can “observe”  changes to memory.  The different  options are enumerated by std::memory_order.  This an advanced  topic. In most cases, the default used the by the standard   library  std::memory_order_seq_cst  is the safest way to go.

Let’s talk about  what std::memory_order_seq_cst   means:

In modern PCs, there are often more than one CPU (cores)  each one of the cores had its own  memory cache.  Values taken from memory can be  stored temporarily  in the cache memory for quick retrieval.   So different  threads running on different cores  might not observe changes to memory at the same time. It could be that a memory location had changed but  a thread still  has an older cached value. It is, therefore, not guaranteed that all threads  have the same view of memory.  This is where  std::memory_order_seq_cst  steps in.  Atomic load/store  operation executing with this option ensures that all threads will have the same view of memory, always and for all atomic using this option. So if a store/load operation happened first for one thread, this operation will also happen first from the point of view of other threads.
As always nothing comes for  free, in this case, the cost are lost optimization opportunities for the compiler and CPU.

Consider this example (taken from here)

	#include <thread>
	#include <atomic>
	#include <cassert>

	std::atomic<bool> x = { false };
	std::atomic<bool> y = { false };
	std::atomic<int> z = { 0 };

	void write_x()
	{
		x.store(true, std::memory_order_seq_cst);
	}

	void write_y()
	{
		y.store(true, std::memory_order_seq_cst);
	}

	void read_x_then_y()
	{
		while (!x.load(std::memory_order_seq_cst))
			;
		if (y.load(std::memory_order_seq_cst)) {
			++z;
		}
	}

	void read_y_then_x()
	{
		while (!y.load(std::memory_order_seq_cst))
			;
		if (x.load(std::memory_order_seq_cst)) {
			++z;
		}
	}

	int main()
	{
		std::thread a(write_x);
		std::thread b(write_y);
		std::thread c(read_x_then_y);
		std::thread d(read_y_then_x);
		a.join(); b.join(); c.join(); d.join();
		assert(z.load() != 0);  // will never happen
	}

The claim is that the assert in line 45 will always  pass. Let’s follow this through :
Either of the 4 threads can execute first.  The while loops at lines 22,31   will cause the threads running the  read_x_then_y()  and  read_y_then_x()  methods to wait until  either  x or y  are assigned a value.   Let’s assume x is updated first:

the read_x_then_y  thread  exits the while loop and  moves to check the value of y.
the read_y_then_x thread is still looping,  waiting for y to get updated.   By the order imposed by std::memory_order_seq_cst, this thread has the same view of memory as the other threads. So if the read_x_then_y thread does not detect a change in y   neither will read_y_then_x .  So,  y value in line 24  evaluate to false , the z variable is not increased and the thread is done.

Eventually,  y  get updated by the write_y  thread.
read_y_then_x thread exits the loop at line 31 and moves to the next statement where it examines the  value of x.  As before  the std::memory_order_seq_cst total (across threads) order guarantees that the read_y_then_x  thread  will have the same view of memory as read_x_then_y,   Since that thread already seen an updated x.   read_y_then_x  will also see an x  with a true value. So the if statement on line 33 evaluate to true and z  value increases  satisfying the assert on line 45.

The same chain of arguments works the same if you swap x and y.  So in all cases, the assert will not trigger an error.

 

Note: We could drop the  std::memory_order_seq_cst argument in the load and store methods as  they are defined as:

T load( std::memory_order order = std::memory_order_seq_cst )
void store( T desired, std::memory_order order = std::memory_order_seq_cst );
Conclusion

The fundamental principals of atomics  are not that different from using synchronization objects to manage shared memory access from multiple threads.    Having said that, using atomics let you take advantage of the highly optimized  expertly crafted code written for the standard library.  In addition, it makes the code more readable by hiding most of the thread safe code inside the atomic object.

If you  like to know about the other memory order models check these pages:

GCC Wiki – Memory model synchronization modes:  a well written friendly explanation with simple examples.
cppreference.com – std::memory_order : an in-depth explanation of the different memory access options with full examples.