Multi-threading with OpenMP

The following is an outline of how to make C++ standalone templates compatible with OpenMP, and therefore make them work in a multi-threaded environment. This should be considered as an extension to Code generation, that has to be read first. The C++ standalone mode of Brian is compatible with OpenMP, and therefore simulations can be launched by users with one or with multiple threads. Therefore, when adding new templates, the developers need to make sure that those templates are properly handling the situation if launched with OpenMP.

Key concepts

All the simulations performed with the C++ standalone mode can be launched with multi-threading, and make use of multiple cores on the same machine. Basically, all the Brian operations that can easily be performed in parallel, such as computing the equations for NeuronGroup, Synapses, and so on can and should be split among several threads. The network construction, so far, is still performed only by one single thread, and all created objects are shared by all the threads. This is only during the run() function that all the objects are executed within a global parallel environment, meaning that all operations will be performed on all the threads, if multi-threading is activated.

Use of #pragma flags

In OpenMP, all the parallelism is handled thanks to extra comments, added in the main C++ code, under the form:

#pragma omp ...

But to avoid any dependencies in the code that is generated by Brian when OpenMP is not activated, we are using functions that will only add those comments, during code generation, when such a multi-threading mode is turned on. By default, nothing will be inserted.

Translations of the #pragma commands

All the translations from openmp_pragma() calls in the C++ templates are handled in the file devices/cpp_standalone/codeobject.py In this function, you can see that all calls with various string inputs will generate #pragma statements inserted into the C++ templates during code generation. For example:

{{ openmp_pragma('static') }}

will be transformed, during code generation, into:

#pragma omp for schedule(static)

You can find the list of all the translations in the core of the openmp_pragma() function, and if some extra translations are needed, they should be added here.

Execution of the OpenMP code

In this section, we are explaining the main ideas behind the OpenMP mode of Brian, and how the simulation is executed in such a parallel context. As can be seen in devices/cpp_standalone/templates/main.cpp, the appropriate number of threads, defined by the user, is fixed at the beginning of the main function in the C++ code with:

{{ openmp_pragma('set_num_threads') }}

equivalent to (thanks to the openmp_pragam() function defined above): nothing if OpenMP is turned off (default), and to:

omp_set_dynamic(0);
omp_set_num_threads(nb_threads);

otherwise. When OpenMP creates a parallel context, this is the number of threads that will be used. As said, network creation is performed without any calls to OpenMP, on one single thread. The parallelism is only starting in the file devices/cpp_standalone/templates/network.cpp with the creation of a parallel context in the run() function, as can be seen with:

{{ openmp_pragma('parallel') }}
{
    while(clock->running())
    {
      .....
    }
}

just before the loop over each time steps of the simulation. Therefore, from now on and until the end of the run() call, all operations are handled in parallel, and will, by default, be executed on all the nodes. This is more efficient, creating the parallel context only once, reducing the overhead. Interestingly, as you can see in this run() function, most of the clock operation are only handled by one node, thanks to:

{{ openmp_pragma('single') }}
{
    ....
}

This command is adding a #pragma flag such that only one node will update the clocks. Remembering that objects are shared by all threads, so we should not have multiple access. This flag is adding an implicit synchronization barrier such that all threads will be sync after such a block of operation. If such a synchronization is not needed, then you can use to single-nowait flag.

How to make your template compatible with OpenMP

Use of IS_OPENMP_COMPATIBLE flag

If an existing template is not compatible with OpenMP....

Design of a non parallel template

If the template added can not be parallelized, such as for example a SpikeMonitor object, then most of its operations should be encapsulated within a single flag, such as:

{{ openmp_pragma('single-nowait') }}
{
    ....
}

By doing so, this will make sure that even when OpenMP is turned on, operations will be safely performed only by one node. This is the safest option, and this should be present each time you have a node that will manipulate data structures such as vector, performing operations such as push_back(), affecting the data structure. Those operations should not be performed in parallel, leading to inconsistencies or segmentation faults.

Design of a parallel template

To design a parallel template, such as for example devices/cpp_standalone/templates/common_group.cpp, you can see that as soon as you have loops that can safely be split across nodes, you just need to add an openmp command in front of those loops:

{{openmp_pragma('static')}}
for(int _idx=0; _idx<N; _idx++)
{
    ...
}

By doing so, OpenMP will take care of splitting the indices and each thread will loop only on a subset of indices, sharing the load. By default, the scheduling use for splitting the indices is static, meaning that each node will get the same number of indices: this is the faster scheduling in OpenMP, and it makes sense for NeuronGroup or Synapses because operations are the same for all indices. By having a look at examples of templates such as devices/cpp_standalone/templates/statemonitor.cpp, you can see that you can merge portions of code executed by only one node and portions executed in parallel. In this template, for example, only one node is recording the time and extending the size of the arrays to store the recorded values:

{{ openmp_pragma('single') }}
{{_dynamic_t}}.push_back(_clock_t);

// Resize the dynamic arrays
{{ openmp_pragma('single') }}
{{_recorded}}.resize(_new_size, _num_indices);

But then, values are written in the arrays by all the nodes:

{{ openmp_pragma('static') }}
for (int _i = 0; _i < _num_indices; _i++)
{
    ....
}

Initialization of the arrays

Even if we said that all network creation was performed outside of the main parallel context, created in the run() loop of network.cpp, there are still some other parallel contexts that are created and destroyed while initializing the arrays. This can be seen in devices/cpp_standalone/templates/objects.cpp, especially in the function _init_arrays(). Because those calls are made outside a parallel context, we need to create one, and that’s why there is a call to:

{{ openmp_pragma('parallel-static') }}

that is transformed into:

#pragma omp parallel for schedule(static)

This comment will create on the fly an OpenMP parallel context and destroy it just after the loop. This adds a little overhead, but those init calls are not numerous compared to the simulation.

A similar idea can be found in devices/cpp_standalone/templates/group_variable_set.cpp: because this template is called outside the main run() loop, during network creation, we need to create a parallel context to perform OpenMP operations. This is why we are using there:

{{ openmp_pragma('parallel-static') }}

instead of simply:

{{ openmp_pragma('static') }}

Synaptic propagation in parallel

General ideas

With OpenMP, synaptic propagation is also multi-threaded. Therefore, we have to modify the SynapticPathway objects, handling spike propagation. As can be seen in devices/cpp_standalone/templates/synapses_classes.cpp, such an object, created during run time, will be able to get the number of threads decided by the user:

_nb_threads = {{ openmp_pragma('get_num_threads') }};

By doing so, a SynapticPathway, instead of handling only one SpikeQueue, will be divided into _nb_threads SpikeQueues, each of them handling a subset of the total number of connections. Because all the calls to SynapticPathway object are performed within the run() loop of devices/cpp_standalone/templates/network.cpp, we have to assume that they are performed in a parallel context. This is why all the function of the SynapticPathway object are taking care of the node number:

void push(int *spikes, unsigned int nspikes)
{
    queue[{{ openmp_pragma('get_thread_num') }}]->push(spikes, nspikes);
}

Such a method for the SynapticPathway will make sure that when spikes are propagated, all the threads will propagate them to their connections. By default, again, if OpenMP is turned off, the queue vector has size 1.

Preparation of the SynapticPathway

Here we are explaining the implementation of the prepare() method for SynapticPathway. The preparation of the SynapticPathway is performed outside the run() loop, therefore outside of a parallel context. If we want each thread to prepare its own subset of connections, we need to create a temporary parallel context:

{{ openmp_pragma('parallel') }}
{
    unsigned int length;
    if ({{ openmp_pragma('get_thread_num') }} == _nb_threads - 1)
        length = n_synapses - (unsigned int) {{ openmp_pragma('get_thread_num') }}*n_synapses/_nb_threads;
    else
        length = (unsigned int) n_synapses/_nb_threads;

    unsigned int padding  = {{ openmp_pragma('get_thread_num') }}*(n_synapses/_nb_threads);

    queue[{{ openmp_pragma('get_thread_num') }}]->openmp_padding = padding;
    queue[{{ openmp_pragma('get_thread_num') }}]->prepare(&real_delays[padding], &sources[padding], length, _dt);
}

Then, basically, each threads is getting an equal number of synapses (except the last one, that will get the remaining ones, if the number is not a multiple of n_threads), and the queues are receiving a padding integer telling them what part of the synapses belongs to each queue. After that, the parallel context is destroyed, and network creation can continue. Note that this could have been done without a parallel context, in a sequential manner, but this is just speeding up everything.

Selection of the spikes

Here we are explaining the implementation of the peek() method for SynapticPathway. This is an example of concurrent access to data structures that are not well handled in parallel, such as std::vector. When peek() is called, we need to return a vector of all the neuron spiking at that particular time. Therefore, we need to ask every queue of the SynapticPathway what are the id of the spiking neurons, and concatenate them. Because those ids are stored in vectors with various shapes, we need to loop over nodes to perform this concatenate, in a sequential manner:

{{ openmp_pragma('static-ordered') }}
for(int _thread=0; _thread < {{ openmp_pragma('get_num_threads') }}; _thread++)
{
    {{ openmp_pragma('ordered') }}
    {
        if (_thread == 0)
            all_peek.clear();
        all_peek.insert(all_peek.end(), queue[_thread]->peek()->begin(), queue[_thread]->peek()->end());
    }
}

The loop, with the keyword ‘static-ordered’, is therefore performed such that node 0 enters it first, then node 1, and so on. Only one node at a time is executing the block statement. This is needed because vector manipulations can not be performed in a multi-threaded manner. At the end of the loop, all_peek is now a vector where all sub queues have written the id of spiking cells, and therefore this is the list of all spiking cells within the SynapticPathway.

Compilation of the code

One extra file needs to be modified, in order for OpenMP implementation to work. This is the makefile devices/cpp_standalone/templates/makefile. As one can simply see, the CFLAGS are dynamically modified during code generation thanks to:

{{ openmp_pragma('compilation') }}

If OpenMP is activated, this will add the following dependencies:

-fopenmp

such that if OpenMP is turned off, nothing, in the generated code, does depend on it.