Multi-threading with OpenMP
The following is an outline of how to make C++ standalone templates compatible with OpenMP, so that they work in a multi-threaded environment. It should be considered an extension of Code generation, which has to be read first. The C++ standalone mode of Brian is compatible with OpenMP, so simulations can be launched with one or with several threads. When adding new templates, developers therefore need to make sure that those templates handle the situation properly when launched with OpenMP.
Key concepts
All the simulations performed with the C++ standalone mode can be launched with
multi-threading, making use of multiple cores on the same machine. All the
Brian operations that can easily be performed in parallel, such as
computing the equations for NeuronGroup, Synapses, and so on, can and should
be split among several threads. The network construction, so far, is still
performed by one single thread, and all created objects are shared by all
the threads.
Use of #pragma flags
In OpenMP, all the parallelism is handled via extra comments added to the main C++ code, of the form:
#pragma omp ...
But to avoid any dependency on OpenMP in the code that Brian generates when OpenMP is not activated, we use functions that insert those comments, during code generation, only when the multi-threading mode is turned on. By default, nothing is inserted.
Translations of the #pragma commands
All the translations of openmp_pragma() calls in the C++ templates are
handled in the file devices/cpp_standalone/codeobject.py. In this function,
you can see that calls with various string inputs generate #pragma statements
that are inserted into the C++ templates during code generation. For example:
{{ openmp_pragma('static') }}
will be transformed, during code generation, into:
#pragma omp for schedule(static)
You can find the list of all the translations in the body of the
openmp_pragma() function; if extra translations are needed, they
should be added there.
Execution of the OpenMP code
In this section, we explain the main ideas behind the OpenMP mode of
Brian, and how the simulation is executed in such a parallel context.
As can be seen in devices/cpp_standalone/templates/main.cpp
, the appropriate
number of threads, defined by the user, is fixed at the beginning
of the main function in the C++ code with:
{{ openmp_pragma('set_num_threads') }}
which, thanks to the openmp_pragma() function described above, is equivalent to
nothing if OpenMP is turned off (the default), and to:
omp_set_dynamic(0);
omp_set_num_threads(nb_threads);
otherwise. When OpenMP creates a parallel context, this is the number of
threads that will be used. As said, network creation is performed on one
single thread, without any calls to OpenMP. Each template that wants to use
parallelism has to add {{ openmp_pragma('parallel') }} to create a general
block that will be executed in parallel, or
{{ openmp_pragma('parallel-static') }}
to execute a single loop in parallel.
How to make your template use OpenMP parallelism
To see how to design a parallel template, have a look for example at
devices/cpp_standalone/templates/common_group.cpp: as soon as you have
loops that can safely be split across nodes, you just need to add
an OpenMP command in front of those loops:
{{openmp_pragma('parallel-static')}}
for(int _idx=0; _idx<N; _idx++)
{
    ...
}
By doing so, OpenMP will take care of splitting the indices, and each thread
will loop only over a subset of the indices, sharing the load. By default, the
scheduling used for splitting the indices is static, meaning that each node will
get the same number of indices: this is the fastest scheduling in OpenMP, and it
makes sense for NeuronGroup or Synapses because the operations are the same for
all indices. By having a look at examples of templates such as
devices/cpp_standalone/templates/statemonitor.cpp
, you can see that you can
merge portions of code executed by only one node and portions executed in
parallel. In this template, for example, only one node is recording the time and
extending the size of the arrays to store the recorded values:
{{_dynamic_t}}.push_back(_clock_t);
// Resize the dynamic arrays
{{_recorded}}.resize(_new_size, _num_indices);
But then, values are written in the arrays by all the nodes:
{{ openmp_pragma('parallel-static') }}
for (int _i = 0; _i < _num_indices; _i++)
{
    ...
}
In general, operations that manipulate global data structures, e.g. that use
push_back
for a std::vector
, should only be executed by a single thread.
Synaptic propagation in parallel
General ideas
With OpenMP, synaptic propagation is also multi-threaded. Therefore, we have to
modify the SynapticPathway objects that handle spike propagation. As can be seen
in devices/cpp_standalone/templates/synapses_classes.cpp, such an object,
created at run time, will be able to get the number of threads chosen by
the user:
_nb_threads = {{ openmp_pragma('get_num_threads') }};
By doing so, a SynapticPathway
, instead of handling only one SpikeQueue
,
will be divided into _nb_threads
SpikeQueue
s, each of them handling a
subset of the total number of connections. Since all the calls to the
SynapticPathway object are performed from within parallel blocks in the
synapses and synapses_push_spikes templates, we have to take this
parallel context into account. This is why all the functions of the
SynapticPathway object take care of the node number:
void push(int *spikes, unsigned int nspikes)
{
    queue[{{ openmp_pragma('get_thread_num') }}]->push(spikes, nspikes);
}
Such a method for the SynapticPathway
will make sure that when spikes are
propagated, all the threads will propagate them to their connections. By
default, again, if OpenMP is turned off, the queue vector has size 1.
Preparation of the SynapticPathway
Here we explain the implementation of the prepare() method for
SynapticPathway:
{{ openmp_pragma('parallel') }}
{
    unsigned int length;
    if ({{ openmp_pragma('get_thread_num') }} == _nb_threads - 1)
        length = n_synapses - (unsigned int) {{ openmp_pragma('get_thread_num') }}*n_synapses/_nb_threads;
    else
        length = (unsigned int) n_synapses/_nb_threads;
    unsigned int padding = {{ openmp_pragma('get_thread_num') }}*(n_synapses/_nb_threads);
    queue[{{ openmp_pragma('get_thread_num') }}]->openmp_padding = padding;
    queue[{{ openmp_pragma('get_thread_num') }}]->prepare(&real_delays[padding], &sources[padding], length, _dt);
}
Basically, each thread gets an equal number of synapses (except the
last one, which takes the remaining ones if the number is not a multiple of
_nb_threads), and the queues receive a padding integer telling them which
part of the synapses belongs to each queue. After that, the parallel context is
destroyed, and network creation can continue. Note that this could have been
done without a parallel context, in a sequential manner, but doing it in
parallel speeds everything up.
Selection of the spikes
Here we explain the implementation of the peek() method for
SynapticPathway. This is an example of concurrent access to data structures
that are not handled well in parallel, such as std::vector. When peek() is
called, we need to return a vector of all the neurons spiking at that particular
time. Therefore, we need to ask every queue of the SynapticPathway for the
ids of the spiking neurons, and concatenate them. Because those ids are stored
in vectors of various sizes, we need to loop over the nodes to perform this
concatenation in a sequential manner:
{{ openmp_pragma('static-ordered') }}
for(int _thread=0; _thread < {{ openmp_pragma('get_num_threads') }}; _thread++)
{
{{ openmp_pragma('ordered') }}
{
if (_thread == 0)
all_peek.clear();
all_peek.insert(all_peek.end(), queue[_thread]->peek()->begin(), queue[_thread]->peek()->end());
}
}
The loop, with the keyword 'static-ordered', is therefore performed such that
node 0 enters it first, then node 1, and so on. Only one node at a time
executes the block statement. This is needed because vector manipulations
cannot be performed safely by several threads at once. At the end of the loop,
all_peek is a vector into which all the sub-queues have written the ids of the
spiking cells, and it is therefore the list of all the spiking cells within the
SynapticPathway.
Compilation of the code
One extra file needs to be modified in order for the OpenMP implementation to
work: the makefile devices/cpp_standalone/templates/makefile. As one can
see, the CFLAGS are dynamically modified during code generation thanks
to:
{{ openmp_pragma('compilation') }}
If OpenMP is activated, this will add the following flag:
-fopenmp
such that if OpenMP is turned off, nothing in the generated code depends on it.