Multi-threading with OpenMP ~~~~~~~~~~~~~~~~~~~~~~~~~~~ The following is an outline of how to make C++ standalone templates compatible with OpenMP, and therefore make them work in a multi-threaded environment. This should be considered as an extension to :doc:`codegen`, that has to be read first. The C++ standalone mode of Brian is compatible with OpenMP, and therefore simulations can be launched by users with one or with multiple threads. Therefore, when adding new templates, the developers need to make sure that those templates are properly handling the situation if launched with OpenMP. Key concepts ============ All the simulations performed with the C++ standalone mode can be launched with multi-threading, and make use of multiple cores on the same machine. Basically, all the Brian operations that can easily be performed in parallel, such as computing the equations for `NeuronGroup`, `Synapses`, and so on can and should be split among several threads. The network construction, so far, is still performed only by one single thread, and all created objects are shared by all the threads. Use of ``#pragma`` flags ======================== In OpenMP, all the parallelism is handled thanks to extra comments, added in the main C++ code, under the form:: #pragma omp ... But to avoid any dependencies in the code that is generated by Brian when OpenMP is not activated, we are using functions that will only add those comments, during code generation, when such a multi-threading mode is turned on. By default, nothing will be inserted. Translations of the ``#pragma`` commands ---------------------------------------- All the translations from ``openmp_pragma()`` calls in the C++ templates are handled in the file ``devices/cpp_standalone/codeobject.py`` In this function, you can see that all calls with various string inputs will generate #pragma statements inserted into the C++ templates during code generation. For example:: {{ openmp_pragma('static') }} will be transformed, during code generation, into:: #pragma omp for schedule(static) You can find the list of all the translations in the core of the ``openmp_pragma()`` function, and if some extra translations are needed, they should be added here. Execution of the OpenMP code ---------------------------- In this section, we are explaining the main ideas behind the OpenMP mode of Brian, and how the simulation is executed in such a parallel context. As can be seen in ``devices/cpp_standalone/templates/main.cpp``, the appropriate number of threads, defined by the user, is fixed at the beginning of the main function in the C++ code with:: {{ openmp_pragma('set_num_threads') }} equivalent to (thanks to the ``openmp_pragam()`` function defined above): nothing if OpenMP is turned off (default), and to:: omp_set_dynamic(0); omp_set_num_threads(nb_threads); otherwise. When OpenMP creates a parallel context, this is the number of threads that will be used. As said, network creation is performed without any calls to OpenMP, on one single thread. Each template that wants to use parallelism has to add ``{{ openmp_pragma{('parallel')}}`` to create a general block that will be executed in parallel or ``{{ openmp_pragma{('parallel-static')}}`` to execute a single loop in parallel. How to make your template use OpenMP parallelism ================================================ To design a parallel template, such as for example ``devices/cpp_standalone/templates/common_group.cpp``, you can see that as soon as you have loops that can safely be split across nodes, you just need to add an openmp command in front of those loops:: {{openmp_pragma('parallel-static')}} for(int _idx=0; _idxpush(spikes, nspikes); } Such a method for the `SynapticPathway` will make sure that when spikes are propagated, all the threads will propagate them to their connections. By default, again, if OpenMP is turned off, the queue vector has size 1. Preparation of the `SynapticPathway` ------------------------------------ Here we are explaining the implementation of the ``prepare()`` method for `SynapticPathway`:: {{ openmp_pragma('parallel') }} { unsigned int length; if ({{ openmp_pragma('get_thread_num') }} == _nb_threads - 1) length = n_synapses - (unsigned int) {{ openmp_pragma('get_thread_num') }}*n_synapses/_nb_threads; else length = (unsigned int) n_synapses/_nb_threads; unsigned int padding = {{ openmp_pragma('get_thread_num') }}*(n_synapses/_nb_threads); queue[{{ openmp_pragma('get_thread_num') }}]->openmp_padding = padding; queue[{{ openmp_pragma('get_thread_num') }}]->prepare(&real_delays[padding], &sources[padding], length, _dt); } Basically, each threads is getting an equal number of synapses (except the last one, that will get the remaining ones, if the number is not a multiple of ``n_threads``), and the queues are receiving a padding integer telling them what part of the synapses belongs to each queue. After that, the parallel context is destroyed, and network creation can continue. Note that this could have been done without a parallel context, in a sequential manner, but this is just speeding up everything. Selection of the spikes ----------------------- Here we are explaining the implementation of the ``peek()`` method for `SynapticPathway`. This is an example of concurrent access to data structures that are not well handled in parallel, such as ``std::vector``. When ``peek()`` is called, we need to return a vector of all the neuron spiking at that particular time. Therefore, we need to ask every queue of the `SynapticPathway` what are the id of the spiking neurons, and concatenate them. Because those ids are stored in vectors with various shapes, we need to loop over nodes to perform this concatenate, in a sequential manner:: {{ openmp_pragma('static-ordered') }} for(int _thread=0; _thread < {{ openmp_pragma('get_num_threads') }}; _thread++) { {{ openmp_pragma('ordered') }} { if (_thread == 0) all_peek.clear(); all_peek.insert(all_peek.end(), queue[_thread]->peek()->begin(), queue[_thread]->peek()->end()); } } The loop, with the keyword 'static-ordered', is therefore performed such that node 0 enters it first, then node 1, and so on. Only one node at a time is executing the block statement. This is needed because vector manipulations can not be performed in a multi-threaded manner. At the end of the loop, ``all_peek`` is now a vector where all sub queues have written the id of spiking cells, and therefore this is the list of all spiking cells within the `SynapticPathway`. Compilation of the code ======================= One extra file needs to be modified, in order for OpenMP implementation to work. This is the makefile ``devices/cpp_standalone/templates/makefile``. As one can simply see, the CFLAGS are dynamically modified during code generation thanks to:: {{ openmp_pragma('compilation') }} If OpenMP is activated, this will add the following dependencies:: -fopenmp such that if OpenMP is turned off, nothing, in the generated code, does depend on it.