Multi-threading with OpenMP
~~~~~~~~~~~~~~~~~~~~~~~~~~~

This page outlines how to make C++ standalone templates compatible with OpenMP,
so that they work in a multi-threaded environment. It extends :doc:`codegen`,
which should be read first. The C++ standalone mode of Brian is compatible with
OpenMP, so users can launch simulations with one or with several threads.
Developers adding new templates therefore need to make sure that these
templates behave correctly when run with OpenMP.

Key concepts
============

All simulations performed with the C++ standalone mode can be launched with
multi-threading and make use of multiple cores on the same machine. All the
Brian operations that can easily be performed in parallel, such as computing
the equations for `NeuronGroup`, `Synapses`, and so on, can and should be split
among several threads. The network construction is, so far, still performed by
a single thread, and all created objects are shared by all the threads.

Use of ``#pragma`` flags
========================

In OpenMP, parallelism is expressed through compiler directives, added to the
main C++ code, of the form::

    #pragma omp ...
        
To keep the code generated by Brian free of any OpenMP dependency when OpenMP
is not activated, the templates call functions that insert these directives
during code generation only when multi-threading is turned on. By default,
nothing is inserted.

Translations of the ``#pragma`` commands
----------------------------------------

All the translations of ``openmp_pragma()`` calls in the C++ templates are
handled by the ``openmp_pragma()`` function in the file
``devices/cpp_standalone/codeobject.py``. In this function, you can see that
calls with various string arguments generate the ``#pragma`` statements that
are inserted into the C++ templates during code generation. For example::

    {{ openmp_pragma('static') }}

will be transformed, during code generation, into::

    #pragma omp for schedule(static)

You can find the list of all the translations in the body of the
``openmp_pragma()`` function; if extra translations are needed, they should be
added there.

Execution of the OpenMP code
----------------------------

This section explains the main ideas behind the OpenMP mode of Brian and how
the simulation is executed in such a parallel context.
As can be seen in ``devices/cpp_standalone/templates/main.cpp``, the number of
threads chosen by the user is fixed at the beginning of the main function in
the C++ code with::

    {{ openmp_pragma('set_num_threads') }}

This is equivalent (thanks to the ``openmp_pragma()`` function described above)
to nothing if OpenMP is turned off (the default), and to::

    omp_set_dynamic(0);
    omp_set_num_threads(nb_threads);

otherwise. When OpenMP creates a parallel context, this is the number of
threads that will be used. As said, network creation is performed without
any calls to OpenMP, on a single thread. Each template that wants to use
parallelism has to add ``{{ openmp_pragma('parallel') }}`` to create a general
block that will be executed in parallel, or
``{{ openmp_pragma('parallel-static') }}`` to execute a single loop in parallel.
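
As a rough sketch of what the generated code does when OpenMP is enabled, the
following standalone C++ fixes the thread count and then opens a parallel
region. The function name ``count_threads_in_region`` and the ``_OPENMP``
preprocessor guards are assumptions made for this illustration; Brian instead
leaves the calls out of the generated code when OpenMP is off.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper (not part of Brian): fix the team size, open a
// parallel region, and report how many threads actually executed it.
int count_threads_in_region(int nb_threads)
{
    int observed = 1;  // value returned when compiled without OpenMP
#ifdef _OPENMP
    omp_set_dynamic(0);               // disable dynamic adjustment of the team size
    omp_set_num_threads(nb_threads);  // request nb_threads for later parallel regions
    #pragma omp parallel
    {
        // Only one thread writes the shared variable.
        #pragma omp single
        observed = omp_get_num_threads();
    }
#else
    (void) nb_threads;  // unused in a serial build
#endif
    return observed;
}
```

Compiled without OpenMP, the function returns 1; with OpenMP it normally
returns the requested count, although the runtime may provide fewer threads.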

How to make your template use OpenMP parallelism
================================================

To design a parallel template, take for example
``devices/cpp_standalone/templates/common_group.cpp``: as soon as you have
loops that can safely be split across threads, you just need to add an OpenMP
command in front of those loops::

    {{openmp_pragma('parallel-static')}}
    for(int _idx=0; _idx<N; _idx++)
    {
        ...
    }

By doing so, OpenMP takes care of splitting the indices, and each thread loops
only over a subset of them, sharing the load. By default, the scheduling used
for splitting the indices is static, meaning that each thread gets
(approximately) the same number of indices: this is the fastest scheduling in
OpenMP, and it makes sense for `NeuronGroup` or `Synapses` because the
operations are the same for all indices.

Templates such as ``devices/cpp_standalone/templates/statemonitor.cpp`` show
that you can mix portions of code executed by only one thread with portions
executed in parallel. In this template, for example, only one thread records
the time and extends the arrays that store the recorded values::

    {{_dynamic_t}}.push_back(_clock_t);

    // Resize the dynamic arrays
    {{_recorded}}.resize(_new_size, _num_indices);

But then, values are written to the arrays by all the threads::

    {{ openmp_pragma('parallel-static') }}
    for (int _i = 0; _i < _num_indices; _i++)
    {
        ....
    }

In general, operations that manipulate global data structures, e.g. that use
``push_back`` for a ``std::vector``, should only be executed by a single thread.
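
The pattern above (grow the containers on one thread, then fill them in
parallel) can be sketched in plain C++ as follows. The ``Recorder`` type and
its members are hypothetical and much simpler than the real monitor template;
the combined ``parallel for`` directive is used here because the sketch is not
already inside a parallel region.

```cpp
#include <vector>

// Hypothetical recorder (not Brian's StateMonitor): one row per time step.
struct Recorder
{
    std::vector<double> times;
    std::vector<std::vector<double>> values;

    void record(double t, const std::vector<double>& state)
    {
        // Serial part: growing the containers is done outside any
        // parallel loop, by a single thread.
        times.push_back(t);
        values.push_back(std::vector<double>(state.size()));

        double* row = values.back().data();
        const double* src = state.data();
        const int n = static_cast<int>(state.size());

        // Parallel part: each index is written by exactly one thread, so
        // no synchronisation is needed. A compiler without OpenMP enabled
        // ignores the pragma and runs the loop serially.
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++)
        {
            row[i] = src[i];
        }
    }
};
```

Calling ``record()`` twice leaves two entries in ``times`` and two rows in
``values``, whatever the number of threads.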

Synaptic propagation in parallel
================================

General ideas
-------------

With OpenMP, synaptic propagation is also multi-threaded. Therefore, we have to
modify the `SynapticPathway` objects, which handle spike propagation. As can be
seen in ``devices/cpp_standalone/templates/synapses_classes.cpp``, such an
object, created at run time, is able to get the number of threads chosen by
the user::

    _nb_threads = {{ openmp_pragma('get_num_threads') }};

By doing so, a `SynapticPathway`, instead of handling only one `SpikeQueue`,
is divided into ``_nb_threads`` `SpikeQueue`\ s, each of them handling a
subset of the total number of connections. Because all the calls to the
`SynapticPathway` object are performed from within ``parallel`` blocks in the
``synapses`` and ``synapses_push_spikes`` templates, we have to take this
parallel context into account. This is why all the functions of the
`SynapticPathway` object take the thread number into account::

    void push(int *spikes, unsigned int nspikes)
    {
        queue[{{ openmp_pragma('get_thread_num') }}]->push(spikes, nspikes);
    }

Such a method for the `SynapticPathway` makes sure that when spikes are
propagated, every thread propagates them to its own connections. Again, if
OpenMP is turned off (the default), the queue vector has size 1.
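
To make the dispatch by thread number concrete, here is a minimal sketch with
toy types. ``ToyQueue``, ``ToyPathway`` and ``push_from_all_threads`` are
hypothetical names that only mimic the shape of the real classes, and the
``_OPENMP`` guard stands in for what Brian decides at code-generation time.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for a SpikeQueue: it only stores what it receives.
struct ToyQueue
{
    std::vector<int> received;

    void push(const int* spikes, unsigned int nspikes)
    {
        received.insert(received.end(), spikes, spikes + nspikes);
    }
};

// Hypothetical stand-in for a SynapticPathway owning one queue per thread.
struct ToyPathway
{
    std::vector<ToyQueue> queue;

    explicit ToyPathway(int nb_threads) : queue(nb_threads) {}

    // Index of the calling thread, or 0 in a serial build.
    static int thread_num()
    {
#ifdef _OPENMP
        return omp_get_thread_num();
#else
        return 0;
#endif
    }

    // Meant to be called inside a parallel region: each thread only
    // touches its own queue, so no locking is required.
    void push(const int* spikes, unsigned int nspikes)
    {
        queue[thread_num()].push(spikes, nspikes);
    }
};

// Every thread of the parallel region pushes the same spikes.
void push_from_all_threads(ToyPathway& pathway, const std::vector<int>& spikes)
{
    // The team is capped at the number of queues so that every thread
    // number is a valid index into pathway.queue.
    #pragma omp parallel num_threads(static_cast<int>(pathway.queue.size()))
    {
        pathway.push(spikes.data(), static_cast<unsigned int>(spikes.size()));
    }
}
```

In a serial build only ``queue[0]`` receives the spikes, which corresponds to
the single-queue case described above.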

Preparation of the `SynapticPathway`
------------------------------------

This section explains the implementation of the ``prepare()`` method of
`SynapticPathway`::

        {{ openmp_pragma('parallel') }}
        {
            unsigned int length;
            if ({{ openmp_pragma('get_thread_num') }} == _nb_threads - 1) 
                length = n_synapses - (unsigned int) {{ openmp_pragma('get_thread_num') }}*(n_synapses/_nb_threads);
            else
                length = (unsigned int) n_synapses/_nb_threads;
            
            unsigned int padding  = {{ openmp_pragma('get_thread_num') }}*(n_synapses/_nb_threads);

            queue[{{ openmp_pragma('get_thread_num') }}]->openmp_padding = padding;
            queue[{{ openmp_pragma('get_thread_num') }}]->prepare(&real_delays[padding], &sources[padding], length, _dt);
        }

Each thread gets an equal number of synapses (except the last one, which also
gets the remaining ones if the number is not a multiple of ``_nb_threads``),
and each queue receives a padding integer telling it which part of the
synapses belongs to it. After that, the parallel context is destroyed, and
network creation can continue. Note that this could have been done
sequentially, without a parallel context; the parallel version simply speeds
things up.
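
The arithmetic of this split can be checked in isolation. The sketch below
computes the padding and length of every slice sequentially; the ``Slice``
struct and the ``partition_synapses`` function are hypothetical helpers
written for this illustration, not part of the template.

```cpp
#include <vector>

// Hypothetical description of the slice of synapses owned by one thread.
struct Slice
{
    unsigned int padding;  // index of the first synapse in the slice
    unsigned int length;   // number of synapses in the slice
};

// Split n_synapses among nb_threads (assumed to be at least 1); the last
// thread also takes the remainder of the integer division.
std::vector<Slice> partition_synapses(unsigned int n_synapses, unsigned int nb_threads)
{
    std::vector<Slice> slices(nb_threads);
    const unsigned int chunk = n_synapses / nb_threads;
    for (unsigned int t = 0; t < nb_threads; t++)
    {
        slices[t].padding = t * chunk;
        slices[t].length = (t == nb_threads - 1) ? n_synapses - t * chunk : chunk;
    }
    return slices;
}
```

With 10 synapses and 4 threads, the slices start at 0, 2, 4 and 6 and have
lengths 2, 2, 2 and 4, so every synapse belongs to exactly one queue.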

Selection of the spikes
-----------------------

This section explains the implementation of the ``peek()`` method of
`SynapticPathway`. It is an example of concurrent access to data structures
that are not well handled in parallel, such as ``std::vector``. When ``peek()``
is called, we need to return a vector of all the neurons spiking at that
particular time. Therefore, we need to ask every queue of the
`SynapticPathway` for the ids of its spiking neurons, and concatenate them.
Because those ids are stored in vectors of various sizes, we need to loop over
the threads to perform this concatenation sequentially::

    {{ openmp_pragma('static-ordered') }}
    for(int _thread=0; _thread < {{ openmp_pragma('get_num_threads') }}; _thread++)
    {
        {{ openmp_pragma('ordered') }}
        {
            if (_thread == 0)
                all_peek.clear();
            all_peek.insert(all_peek.end(), queue[_thread]->peek()->begin(), queue[_thread]->peek()->end());
        }
    }
   
The loop, with the keyword ``'static-ordered'``, is therefore performed such
that thread 0 enters it first, then thread 1, and so on. Only one thread at a
time executes the block statement. This is needed because these vector
manipulations cannot be performed concurrently. At the end of the loop,
``all_peek`` is a vector into which all sub-queues have written the ids of
their spiking cells, and it is therefore the list of all spiking cells within
the `SynapticPathway`.
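
The ordered concatenation can be reproduced outside Brian with a small
self-contained function. The name ``concatenate_ordered`` and the ``parts``
argument (standing in for the per-queue spike lists) are assumptions of this
sketch, which uses the combined ``parallel for`` directive because it is not
called from an existing parallel region.

```cpp
#include <vector>

// Hypothetical helper: concatenate the per-thread vectors in index order.
std::vector<int> concatenate_ordered(const std::vector<std::vector<int>>& parts)
{
    std::vector<int> all;
    const int n = static_cast<int>(parts.size());

    // Iterations are distributed statically, but the ordered block is
    // entered in iteration order and by one thread at a time. Without
    // OpenMP the pragmas are ignored and the loop is simply sequential.
    #pragma omp parallel for schedule(static) ordered
    for (int t = 0; t < n; t++)
    {
        #pragma omp ordered
        {
            if (t == 0)
                all.clear();
            all.insert(all.end(), parts[t].begin(), parts[t].end());
        }
    }
    return all;
}
```

Because the shared vector is only modified inside the ordered block, the
result is the same as a sequential concatenation, independent of the number
of threads.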

Compilation of the code
=======================

One extra file needs to be modified for the OpenMP implementation to work: the
makefile ``devices/cpp_standalone/templates/makefile``. There, the CFLAGS are
modified dynamically during code generation thanks to::
    
    {{ openmp_pragma('compilation') }}

If OpenMP is activated, this will add the following compiler flag::

    -fopenmp

If OpenMP is turned off, nothing is added, so nothing in the generated code
depends on it.
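
One way to check whether the flag took effect in a given build is the
``_OPENMP`` macro, which OpenMP-conforming compilers define only when OpenMP
support is enabled (for GCC and Clang, when ``-fopenmp`` is passed). The
function ``openmp_status`` below is a hypothetical helper for illustration and
is not part of the generated code.

```cpp
#include <string>

// Hypothetical helper: report whether this translation unit was compiled
// with OpenMP enabled, based on the standard _OPENMP macro.
std::string openmp_status()
{
#ifdef _OPENMP
    // _OPENMP expands to an integer of the form yyyymm (the date of the
    // supported OpenMP specification).
    return "OpenMP enabled, version macro " + std::to_string(_OPENMP);
#else
    return "OpenMP disabled";
#endif
}
```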