Multi-threading with OpenMP

The following is an outline of how to make C++ standalone templates compatible with OpenMP, and therefore make them work in a multi-threaded environment. It should be considered as an extension to Code generation, which should be read first. The C++ standalone mode of Brian is compatible with OpenMP, so simulations can be launched with one or with multiple threads. When adding new templates, developers therefore need to make sure that those templates handle the situation properly when launched with OpenMP.

Key concepts

All simulations performed with the C++ standalone mode can be launched with multi-threading and make use of multiple cores on the same machine. All Brian operations that can easily be performed in parallel, such as computing the equations for NeuronGroup, Synapses, and so on, can and should be split among several threads. The network construction, however, is still performed by one single thread, and all created objects are shared by all the threads.

Use of #pragma flags

In OpenMP, all the parallelism is handled through compiler directives, added to the main C++ code, of the form:

#pragma omp ...

To avoid any dependency on OpenMP in the code generated by Brian when OpenMP is not activated, we use functions that insert those directives during code generation only when the multi-threading mode is turned on. By default, nothing is inserted, and the generated code compiles as ordinary serial C++.

Translations of the #pragma commands

All the translations of openmp_pragma() calls in the C++ templates are handled in the file devices/cpp_standalone/codeobject.py. In the openmp_pragma() function defined there, you can see that calls with various string inputs generate the #pragma statements inserted into the C++ templates during code generation. For example:

{{ openmp_pragma('static') }}

will be transformed, during code generation, into:

#pragma omp for schedule(static)

You can find the list of all the translations in the body of the openmp_pragma() function; if extra translations are needed, they should be added there.
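As an illustration, combining the translation shown above with a parallel block (and assuming, as its name suggests, that 'parallel' maps to #pragma omp parallel), a template fragment such as:

{{ openmp_pragma('parallel') }}
{
    {{ openmp_pragma('static') }}
    for(int _idx=0; _idx<N; _idx++)
    {
        ...
    }
}

is generated, when OpenMP is on, as:

#pragma omp parallel
{
    #pragma omp for schedule(static)
    for(int _idx=0; _idx<N; _idx++)
    {
        ...
    }
}

and, when OpenMP is off, as the same code with the two pragma lines simply left empty.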

Execution of the OpenMP code

In this section, we explain the main ideas behind the OpenMP mode of Brian, and how the simulation is executed in such a parallel context. As can be seen in devices/cpp_standalone/templates/main.cpp, the number of threads chosen by the user is fixed at the beginning of the main function in the C++ code with:

{{ openmp_pragma('set_num_threads') }}

which is equivalent (thanks to the openmp_pragma() function described above) to nothing if OpenMP is turned off (the default), and to:

omp_set_dynamic(0);
omp_set_num_threads(nb_threads);

otherwise. When OpenMP creates a parallel context, this is the number of threads that will be used. As mentioned above, network creation is performed without any calls to OpenMP, on one single thread. Each template that wants to use parallelism has to add {{ openmp_pragma('parallel') }} to create a general block that will be executed in parallel, or {{ openmp_pragma('parallel-static') }} to execute a single loop in parallel.
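A sketch of these two usage patterns (assuming, as the names suggest, that 'parallel-static' combines the parallel block and the static loop scheduling into a single directive):

{{ openmp_pragma('parallel') }}
{
    // general block: every thread executes this code once
}

{{ openmp_pragma('parallel-static') }}
for(int _idx=0; _idx<N; _idx++)
{
    // parallel loop: the iterations are divided among the threads
}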

How to make your template use OpenMP parallelism

To design a parallel template, such as devices/cpp_standalone/templates/common_group.cpp, as soon as you have loops that can safely be split across threads, you just need to add an openmp_pragma() call in front of those loops:

{{openmp_pragma('parallel-static')}}
for(int _idx=0; _idx<N; _idx++)
{
    ...
}

By doing so, OpenMP will take care of splitting the indices, and each thread will loop only over a subset of them, sharing the load. By default, the scheduling used for splitting the indices is static, meaning that each thread gets the same number of indices: this is the fastest scheduling in OpenMP, and it makes sense for NeuronGroup or Synapses because the operations are the same for all indices.

By having a look at examples of templates such as devices/cpp_standalone/templates/statemonitor.cpp, you can see that you can mix portions of code executed by only one thread with portions executed in parallel. In this template, for example, only one thread records the time and extends the size of the arrays that store the recorded values:

{{_dynamic_t}}.push_back(_clock_t);

// Resize the dynamic arrays
{{_recorded}}.resize(_new_size, _num_indices);

But then, values are written in the arrays by all the nodes:

{{ openmp_pragma('parallel-static') }}
for (int _i = 0; _i < _num_indices; _i++)
{
    ...
}

In general, operations that manipulate global data structures, e.g. calling push_back on a std::vector, should only be executed by a single thread.
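Putting the two fragments above together, the overall shape of such a template is the following sketch (the real statemonitor template contains more bookkeeping):

// Serial part: a single thread grows the dynamic arrays, since
// push_back and resize are not safe under concurrent access.
{{_dynamic_t}}.push_back(_clock_t);
{{_recorded}}.resize(_new_size, _num_indices);

// Parallel part: all threads fill the freshly resized storage,
// each one writing a disjoint subset of the indices.
{{ openmp_pragma('parallel-static') }}
for (int _i = 0; _i < _num_indices; _i++)
{
    // write the recorded value for index _i
}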

Synaptic propagation in parallel

General ideas

With OpenMP, synaptic propagation is also multi-threaded. Therefore, we have to modify the SynapticPathway objects, which handle spike propagation. As can be seen in devices/cpp_standalone/templates/synapses_classes.cpp, such an object, created at run time, obtains the number of threads chosen by the user:

_nb_threads = {{ openmp_pragma('get_num_threads') }};

By doing so, a SynapticPathway, instead of handling only one SpikeQueue, is divided into _nb_threads SpikeQueues, each of them handling a subset of the total number of connections. Since all calls to the SynapticPathway object are performed from within parallel blocks in the synapses and synapses_push_spikes templates, we have to take this parallel context into account. This is why all the methods of the SynapticPathway object take the thread number into account:

void push(int *spikes, unsigned int nspikes)
{
    queue[{{ openmp_pragma('get_thread_num') }}]->push(spikes, nspikes);
}

Such a method ensures that when spikes are propagated, every thread propagates them to its own connections. Again, if OpenMP is turned off, the queue vector has size 1.
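A minimal sketch of the corresponding data layout (the class and member names follow the description above but are illustrative, and a CSpikeQueue class as in Brian's C++ standalone code is assumed; the real class is in devices/cpp_standalone/templates/synapses_classes.cpp):

#include <vector>

class SynapticPathway
{
public:
    int _nb_threads;
    std::vector<CSpikeQueue*> queue;   // one queue per thread; size 1 without OpenMP

    SynapticPathway(/* ... */)
    {
        // one SpikeQueue per thread, each handling a subset of the connections
        _nb_threads = {{ openmp_pragma('get_num_threads') }};
        for (int i = 0; i < _nb_threads; i++)
            queue.push_back(new CSpikeQueue(/* ... */));
    }

    void push(int *spikes, unsigned int nspikes)
    {
        // each thread only ever touches its own queue
        queue[{{ openmp_pragma('get_thread_num') }}]->push(spikes, nspikes);
    }
};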

Preparation of the SynapticPathway

Here we explain the implementation of the prepare() method of SynapticPathway:

{{ openmp_pragma('parallel') }}
{
    unsigned int length;
    if ({{ openmp_pragma('get_thread_num') }} == _nb_threads - 1)
        length = n_synapses - (unsigned int) {{ openmp_pragma('get_thread_num') }}*(n_synapses/_nb_threads);
    else
        length = (unsigned int) n_synapses/_nb_threads;

    unsigned int padding  = {{ openmp_pragma('get_thread_num') }}*(n_synapses/_nb_threads);

    queue[{{ openmp_pragma('get_thread_num') }}]->openmp_padding = padding;
    queue[{{ openmp_pragma('get_thread_num') }}]->prepare(&real_delays[padding], &sources[padding], length, _dt);
}

Basically, each thread gets an equal share of the synapses (except the last one, which also gets the remainder if the number is not a multiple of _nb_threads), and each queue receives a padding integer telling it which part of the synapses belongs to it. For example, with n_synapses = 10 and _nb_threads = 4, threads 0, 1 and 2 each handle 10/4 = 2 synapses (paddings 0, 2 and 4), while the last thread handles the remaining 10 - 6 = 4 synapses (padding 6). After that, the parallel context is destroyed, and network creation can continue. Note that this could have been done without a parallel context, in a sequential manner; doing it in parallel simply speeds things up.

Selection of the spikes

Here we explain the implementation of the peek() method of SynapticPathway. This is an example of handling concurrent access to data structures, such as std::vector, that cannot safely be modified in parallel. When peek() is called, we need to return a vector of all the neurons spiking at that particular time. Therefore, we need to ask every queue of the SynapticPathway for the ids of its spiking neurons, and concatenate them. Because those ids are stored in vectors of various sizes, we need to loop over the threads to perform this concatenation in a sequential manner:

{{ openmp_pragma('static-ordered') }}
for(int _thread=0; _thread < {{ openmp_pragma('get_num_threads') }}; _thread++)
{
    {{ openmp_pragma('ordered') }}
    {
        if (_thread == 0)
            all_peek.clear();
        all_peek.insert(all_peek.end(), queue[_thread]->peek()->begin(), queue[_thread]->peek()->end());
    }
}

The loop, with the keyword ‘static-ordered’, is therefore executed such that thread 0 enters it first, then thread 1, and so on; only one thread at a time executes the block statement. This is needed because vector manipulations cannot be performed in a multi-threaded manner. At the end of the loop, all_peek is a vector into which all sub-queues have written the ids of the spiking cells, and therefore it is the list of all spiking cells within the SynapticPathway.
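Assuming ‘static-ordered’ translates to a statically scheduled loop with OpenMP's ordered clause, and ‘ordered’ to the matching ordered block (the exact strings live in openmp_pragma()), the generated loop, executed inside the enclosing parallel region, would look like:

#pragma omp for schedule(static) ordered
for(int _thread=0; _thread < _nb_threads; _thread++)  // the loop bound expands to the thread count
{
    #pragma omp ordered
    {
        if (_thread == 0)
            all_peek.clear();
        all_peek.insert(all_peek.end(), queue[_thread]->peek()->begin(), queue[_thread]->peek()->end());
    }
}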

Compilation of the code

One extra file needs to be modified for the OpenMP implementation to work: the makefile devices/cpp_standalone/templates/makefile. The compiler flags are modified dynamically during code generation thanks to:

{{ openmp_pragma('compilation') }}

If OpenMP is activated, this will add the flag:

-fopenmp

so that if OpenMP is turned off, nothing in the generated code depends on it.
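In the makefile template, this boils down to lines of roughly the following shape (a sketch; the variable names and the other flags are illustrative, and -fopenmp is needed both when compiling and when linking):

# {{ openmp_pragma('compilation') }} renders as '-fopenmp' when OpenMP is on,
# and as an empty string otherwise
CFLAGS = -c -O3 {{ openmp_pragma('compilation') }}
LFLAGS = {{ openmp_pragma('compilation') }}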