A C++17 Thread Pool for High-Performance Scientific Computing

We present a modern C++17-compatible thread pool implementation, built from scratch with high-performance scientific computing in mind. The thread pool is implemented as a single lightweight and self-contained class, and does not have any dependencies other than the C++17 standard library, thus allowing a great degree of portability. In particular, our implementation does not utilize OpenMP or any other high-level multithreading APIs, and thus gives the programmer precise low-level control over the details of the parallelization, which permits more robust optimizations. The thread pool was extensively tested on both AMD and Intel CPUs with up to 40 cores and 80 threads. This paper provides motivation, detailed usage instructions, and performance tests.


Introduction

Motivation
Multithreading [1] is essential for modern high-performance computing. Since C++11, the C++ [2][3][4] standard library has included built-in low-level multithreading support using constructs such as std::thread. However, std::thread creates a new thread each time it is called, which can have a significant performance overhead. Furthermore, it is possible to create more threads than the hardware can handle simultaneously, potentially resulting in a substantial slowdown.
The library presented here contains a C++ thread pool class, BS::thread_pool, which avoids these issues by creating a fixed pool of threads once and for all, and then continuously reusing the same threads to perform different tasks throughout the lifetime of the program. By default, the number of threads in the pool is equal to the maximum number of threads that the hardware can run in parallel.
The user submits tasks to be executed into a queue. Whenever a thread becomes available, it retrieves the next task from the queue and executes it. The pool automatically produces an std::future for each task, which allows the user to wait for the task to finish executing and/or obtain its eventual return value, if applicable. Threads and tasks are autonomously managed by the pool in the background, without requiring any input from the user aside from submitting the desired tasks.
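As a rough illustration of the mechanism just described (a fixed set of threads, a shared task queue, and one future per task), here is a minimal sketch that uses only the standard library. The class name mini_pool and its member function submit() are hypothetical names chosen for this illustration. This is not the implementation of BS::thread_pool, which provides many more features and optimizations.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical minimal pool, for illustration only (not BS::thread_pool).
class mini_pool
{
public:
    // Create the worker threads once; they are reused for every task.
    // Note: with num_threads == 0 no task would ever run.
    explicit mini_pool(const std::size_t num_threads)
    {
        for (std::size_t i = 0; i < num_threads; ++i)
            workers_.emplace_back([this] { worker_loop(); });
    }

    mini_pool(const mini_pool&) = delete;
    mini_pool& operator=(const mini_pool&) = delete;

    // Finish the tasks still in the queue, then join all workers.
    ~mini_pool()
    {
        {
            const std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        task_available_.notify_all();
        for (std::thread& worker : workers_)
            worker.join();
    }

    // Queue a zero-argument task and return a future for its result.
    template <typename F, typename R = std::invoke_result_t<std::decay_t<F>>>
    std::future<R> submit(F&& task)
    {
        auto packaged = std::make_shared<std::packaged_task<R()>>(std::forward<F>(task));
        std::future<R> future = packaged->get_future();
        {
            const std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push([packaged] { (*packaged)(); });
        }
        task_available_.notify_one();
        return future;
    }

private:
    // Each worker repeatedly takes the next task from the queue (first-in, first-out).
    void worker_loop()
    {
        while (true)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                task_available_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty())
                    return; // stopping, and nothing left to run
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task(); // run outside the lock so other workers can proceed
        }
    }

    std::mutex mutex_;
    std::condition_variable task_available_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

With such a class, a call like pool.submit([] { return 42; }) would return an std::future<int> whose get() yields 42 once a worker has run the task.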
The design of this library is guided by four important principles. First, compactness: the entire library consists of just one self-contained header file, with no other components or dependencies, aside from a small self-contained header file with optional utilities. Second, portability: the library only utilizes the C++17 standard library [5], without relying on any compiler extensions or 3rd-party libraries, and is therefore compatible with any modern standards-conforming C++17 compiler on any platform. Third, ease of use: the library is extensively documented, and programmers of any level should be able to use it right out of the box.
The fourth and final guiding principle is performance: each and every line of code in this library was carefully designed with maximum performance in mind, and performance was tested and verified on a variety of compilers and platforms. Indeed, the library was originally designed for use in the author's own computationally-intensive scientific computing projects, running both on high-end desktop/laptop computers and high-performance computing nodes.
Other, more advanced multithreading libraries may offer more features and/or higher performance. However, they typically consist of a vast codebase with multiple components and dependencies, and involve complex APIs that require a substantial time investment to learn. This library is not intended to replace these more advanced libraries; instead, it was designed for users who don't require very advanced features, and prefer a simple and lightweight library that is easy to learn and use and can be readily incorporated into existing or new projects.

Overview of features
• Fast:
  - Built from scratch with maximum performance in mind.
  - Suitable for use in high-performance computing nodes with a very large number of CPU cores.
  - Compact code, to reduce both compilation time and binary size.
  - Reusing threads avoids the overhead of creating and destroying them for individual tasks.
  - A task queue ensures that there are never more threads running in parallel than allowed by the hardware.
• Lightweight:
  - Single header file: simply #include "BS_thread_pool.hpp" and you're all set!
  - Header-only: no need to install or build the library.
  - Self-contained: no external requirements or dependencies.
  - Portable: uses only the C++ standard library, and works with any C++17-compliant compiler.
  - Only 304 lines of code, excluding comments, blank lines, and lines containing only a single brace, with all optional features disabled.
  - Only 396 lines of code across both header files with all optional features enabled and including all optional utilities.
• Easy to use:
  - Very simple operation, using only a handful of member functions, with additional member functions for more advanced use.
  - Every task submitted to the queue using the submit_task() member function automatically generates an std::future, which can be used to wait for the task to finish executing and/or obtain its eventual return value.
  - Loops can be automatically parallelized into any number of tasks using the submit_loop() member function, which returns a BS::multi_future that can be used to track the execution of all parallel tasks at once.
  - If futures are not needed, tasks may be submitted using detach_task(), and loops can be parallelized using detach_loop(), sacrificing convenience for even greater performance. In that case, wait(), wait_for(), and wait_until() can be used to wait for all the tasks in the queue to complete.
  - The code is thoroughly documented using Doxygen comments: not only the interface, but also the implementation, in case the user would like to make modifications.
  - The included test program BS_thread_pool_test.cpp can be used to perform exhaustive automated tests and benchmarks, and also serves as a comprehensive example of how to properly use the library. The included PowerShell script BS_thread_pool_test.ps1 provides a portable way to run the tests with multiple compilers.
• Utility classes:
  - The optional header file BS_thread_pool_utils.hpp contains several useful utility classes.
  - Send simple signals between threads using the BS::signaller utility class.
  - Synchronize output to a stream from multiple threads in parallel using the BS::synced_stream utility class.
  - Easily measure execution time for benchmarking purposes using the BS::timer utility class.
• Additional features:
  - Assign a priority to each task using the optional task priority feature. Tasks with higher priorities will be executed first.
  - Submit a sequence of tasks enumerated by indices to the queue using detach_sequence() and submit_sequence().
  - Change the number of threads in the pool safely and on-the-fly as needed using the reset() member function.
  - Monitor the number of queued and/or running tasks using the get_tasks_queued(), get_tasks_running(), and get_tasks_total() member functions.
  - Get the current thread count of the pool using get_thread_count().
  - Freely pause and resume the pool using the pause(), unpause(), and is_paused() member functions; when paused, threads do not retrieve new tasks out of the queue.
  - Purge all tasks currently waiting in the queue with the purge() member function.
  - Catch exceptions thrown by tasks submitted using submit_task() or submit_loop() from the main thread through their futures.
  - Run an initialization function in each thread before it starts to execute any submitted tasks.
  - Get the pool index of the current thread using BS::this_thread::get_index() and a pointer to the pool that owns the thread using BS::this_thread::get_pool().
  - Get the unique thread IDs for all threads in the pool using get_thread_ids() or the implementation-defined thread handles using the optional get_native_handles() member function.
  - Submit class member functions to the pool, either applied to a specific object or from within the object itself.
  - Pass arguments to tasks by value, reference, or constant reference.
  - Under continuous and active development. Bug reports and feature requests are welcome, and should be made via GitHub issues.

Compiling and compatibility
This library should successfully compile on any C++17 standard-compliant compiler, on all operating systems and architectures for which such a compiler is available. Compatibility was verified with a 24-core (8P+16E) / 32-thread Intel i9-13900K CPU using the following compilers and platforms:
As this library requires C++17 features, the code must be compiled with C++17 support:
• For Clang or GCC, use the -std=c++17 flag. On Linux, you will also need to use the -pthread flag to enable the POSIX threads library.
• For MSVC, use /std:c++17, and also /permissive- to ensure standards conformance.
For maximum performance, it is recommended to compile with all available compiler optimizations:
• For Clang or GCC, use the -O3 flag.
As an example, to compile the test program BS_thread_pool_test.cpp with warnings and optimizations, it is recommended to use the following commands:
• On Linux with GCC:
    g++ BS_thread_pool_test.cpp -std=c++17 -O3 -Wall -Wextra -Wconversion -Wsign-conversion -Wpedantic -Weffc++ -Wshadow -pthread -o BS_thread_pool_test
    #include "BS_thread_pool.hpp"
The thread pool will now be accessible via the BS::thread_pool class. For an even quicker installation, you can download the header file itself directly at this URL.
This library also comes with an independent utilities header file BS_thread_pool_utils.hpp, which is not required to use the thread pool, but provides some utility classes that may be helpful for multithreading. This header file also resides in the include folder. It can be downloaded directly at this URL.
This library is also available on various package managers and build systems, including vcpkg, Conan, Meson, and CMake with CPM. Please see below for more details.

Constructors
The default constructor creates a thread pool with as many threads as the hardware can handle concurrently, as reported by the implementation via std::thread::hardware_concurrency(). This is usually determined by the number of cores in the CPU. If a core is hyperthreaded, it will count as two threads. For example:
    // Constructs a thread pool with as many threads as available in the hardware.
    BS::thread_pool pool;
Optionally, a number of threads different from the hardware concurrency can be specified as an argument to the constructor. However, note that adding more threads than the hardware can handle will not improve performance, and in fact will most likely hinder it. This option exists in order to allow using fewer threads than the hardware concurrency, in cases where you wish to leave some threads available for other processes. For example:
    // Constructs a thread pool with only 12 threads.
    BS::thread_pool pool(12);
Usually, when the thread pool is used, a program's main thread should only submit tasks to the thread pool and wait for them to finish, and should not perform any computationally intensive tasks on its own. In that case, it is recommended to use the default value for the number of threads. This ensures that all of the threads available in the hardware will be put to work while the main thread waits.
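The choice between the hardware concurrency and a user-specified thread count can be sketched as follows. The helper name choose_thread_count() is hypothetical and not part of the library's interface. The fallback to a single thread is an assumption made here because the standard allows std::thread::hardware_concurrency() to return 0 when the value cannot be determined.

```cpp
#include <cstddef>
#include <thread>

// Hypothetical helper, for illustration only: returns the requested number of
// threads, or the hardware concurrency if no positive number was requested.
inline std::size_t choose_thread_count(const std::size_t requested = 0)
{
    if (requested > 0)
        return requested;
    // hardware_concurrency() is only a hint and may be 0 if it is not computable.
    const unsigned int hardware = std::thread::hardware_concurrency();
    return hardware > 0 ? hardware : 1; // assumed fallback: at least one thread
}
```

For instance, choose_thread_count(12) returns 12, while choose_thread_count() returns whatever the implementation reports for the current machine.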

Getting and resetting the number of threads in the pool
The member function get_thread_count() returns the number of threads in the pool. This will be equal to std::thread::hardware_concurrency() if the default constructor was used.
It is generally unnecessary to change the number of threads in the pool after it has been created, since the whole point of a thread pool is that you only create the threads once. However, if needed, this can be done, safely and on-the-fly, using the reset() member function.
reset() will wait for all currently running tasks to be completed, but will leave the rest of the tasks in the queue. Then it will destroy the thread pool and create a new one with the desired new number of threads, as specified in the function's argument (or the hardware concurrency if no argument is given). The new thread pool will then resume executing the tasks that remained in the queue and any new submitted tasks.

Finding the version of the library
If desired, the version of this library may be read during compilation time from the macro BS_THREAD_POOL_VERSION. The value will be a string containing the version number and release date. For example:
    std::cout << "Thread pool library version is " << BS_THREAD_POOL_VERSION << ".\n";

Submitting tasks to the queue

Submitting tasks with no arguments and receiving a future
In this section we will learn how to submit a task with no arguments, but potentially with a return value, to the queue. Once a task has been submitted, it will be executed as soon as a thread becomes available. Tasks are executed in the order that they were submitted (first-in, first-out), unless task priority is enabled (see below).
For example, if the pool has 8 threads and an empty queue, and we submitted 16 tasks, then we should expect the first 8 tasks to be executed in parallel, with the remaining tasks being picked up by the threads one by one as each thread finishes executing its first task, until no tasks are left in the queue.
The member function submit_task() is used to submit tasks to the queue. It takes exactly one input, the task to submit. This task must be a function with no arguments, but it can have a return value. The return value is an std::future associated to the task.
If the submitted function has a return value of type T, then the future will be of type std::future<T>, and will be set to the return value when the function finishes its execution. If the submitted function does not have a return value, then the future will be an std::future<void>, which will not return any value but may still be used to wait for the function to finish.
Using auto for the return value of submit_task() means the compiler will automatically detect which instance of the template std::future to use. However, specifying the particular type std::future<T>, as in the examples below, is recommended for increased readability.
To wait until the task finishes, use the member function wait() of the future. To obtain the return value, use the member function get(), which will also automatically wait for the task to finish if it hasn't yet. Here is a simple example:
In this example we submitted the function the_answer(), which returns an int. The member function submit_task() of the pool therefore returned an std::future<int>. We then used the get() member function of the future to get the return value, and printed it out.
In addition to submitting a pre-defined function, we can also use a lambda expression to quickly define the task on-the-fly. Rewriting the previous example in terms of a lambda expression, we get:
It is generally simpler and faster to submit lambda expressions rather than pre-defined functions, especially due to the ability to capture local variables, which we will discuss in the next section.
Of course, tasks do not have to return values. In the following example, we submit a function with no return value and then use the future to wait for it to finish executing:
Here we split the lambda into multiple lines to make it more readable. The command std::this_thread::sleep_for(std::chrono::milliseconds(500)) instructs the task to simply sleep for 500 milliseconds, simulating a computationally-intensive task.
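The future semantics described in this section can be tried out with the standard library alone. In the sketch below, the helper run_task() is a hypothetical stand-in for submit_task(): it uses std::async to start a new thread per task instead of a pool, so it only illustrates the relationship between the task's return type and the type of the future. The function names other than the_answer() are likewise chosen for this illustration.

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for submit_task(): runs a zero-argument callable on
// another thread and returns a future for its result. Unlike a thread pool,
// this starts a new thread for every task.
template <typename F>
std::future<std::invoke_result_t<std::decay_t<F>>> run_task(F&& task)
{
    return std::async(std::launch::async, std::forward<F>(task));
}

int the_answer()
{
    return 42;
}

// A pre-defined function returning int gives an std::future<int>.
int answer_from_function()
{
    std::future<int> my_future = run_task(the_answer);
    return my_future.get(); // get() waits for the task if it is not done yet
}

// The same task written as a lambda expression.
int answer_from_lambda()
{
    std::future<int> my_future = run_task([] { return 42; });
    return my_future.get();
}

// A task with no return value gives an std::future<void>, which can still be
// used to wait for the task to finish.
bool wait_for_void_task()
{
    std::future<void> my_future = run_task(
        []
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(500));
        });
    my_future.wait();
    return my_future.valid(); // wait() does not consume the future
}
```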

Submitting tasks with arguments and receiving a future
As stated in the previous section, tasks submitted using submit_task() cannot have any arguments. However, it is easy to submit tasks with arguments, either by wrapping the function in a lambda or by using lambda captures directly.
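Both approaches can be sketched with the standard library. Here std::async stands in for the pool, and the function multiply() and the two wrapper functions are hypothetical names used only for this illustration. In each case the callable handed over takes no arguments, which is the constraint described above.

```cpp
#include <future>

double multiply(const double lhs, const double rhs)
{
    return lhs * rhs;
}

// Option 1: wrap the call to an existing function in a zero-argument lambda.
double product_via_wrapper()
{
    std::future<double> my_future = std::async(std::launch::async, [] { return multiply(6, 7); });
    return my_future.get();
}

// Option 2: capture local variables and do the work in the lambda itself.
double product_via_capture()
{
    const double lhs = 6;
    const double rhs = 7;
    // Captured by value: the task gets its own copies of lhs and rhs.
    std::future<double> my_future = std::async(std::launch::async, [lhs, rhs] { return lhs * rhs; });
    return my_future.get();
}
```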

Detaching and waiting for tasks
Usually, it is best to submit a task to the queue using submit_task(). This allows you to wait for the task to finish and/or get its return value later. However, sometimes a future is not needed, for example when you just want to "set and forget" a certain task, or if the task already communicates with the main thread or with other tasks without using futures, such as via condition variables.
In such cases, you may wish to avoid the overhead involved in assigning a future to the task in order to increase performance. This is called "detaching" the task, as the task detaches from the main thread and runs independently.
Detaching tasks is done using the detach_task() member function, which allows you to detach a task to the queue without generating a future for it. The task can have any number of arguments, but it cannot have a return value, as there would be no way for the main thread to retrieve that value.
Since detach_task() does not return a future, there is no built-in way for the user to know when the task finishes executing. You must manually ensure that the task finishes executing before trying to use anything that depends on its output. Otherwise, bad things will happen! BS::thread_pool provides the member function wait() to facilitate waiting for all of the tasks in the queue to complete, whether they were detached or submitted with a future. The wait() member function works similarly to the wait() member function of std::future. Consider, for example, the following code:
This program first defines a local variable named result and initializes it to 0. It then detaches a task in the form of a lambda expression. Note that the lambda captures result by reference, as indicated by the & in front of it. This means that the task can modify result, and any such modification will be reflected in the main thread. The task changes result to 42, but it first sleeps for 100 milliseconds. When the main thread prints out the value of result, the task has not yet had time to modify its value, since it is still sleeping. Therefore, the program will print out the initial value 0.
To wait for the task to complete, we must use the wait() member function after detaching it:
Now the program will print out the value 42, as expected. Note, however, that wait() will wait for all the tasks in the queue, including any other tasks that were potentially submitted before or after the one we care about. If we want to wait for just one task, submit_task() would be a better choice.
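The idea of waiting for detached work without futures can be sketched with a counter and a condition variable. The class task_tracker below is hypothetical: it starts one thread per task rather than using a pool, and its detach() is assumed to be called from a single thread only. It is meant to show why the wait is needed, not how BS::thread_pool implements wait().

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper, for illustration only: runs tasks without futures and
// lets the caller wait until all of them have finished.
class task_tracker
{
public:
    task_tracker() = default;
    task_tracker(const task_tracker&) = delete;
    task_tracker& operator=(const task_tracker&) = delete;

    ~task_tracker()
    {
        for (std::thread& thread : threads_)
            thread.join();
    }

    // Start a task with no future attached (call from one thread only).
    template <typename F>
    void detach(F&& task)
    {
        {
            const std::lock_guard<std::mutex> lock(mutex_);
            ++pending_;
        }
        threads_.emplace_back(
            [this, task = std::forward<F>(task)]() mutable
            {
                task();
                {
                    const std::lock_guard<std::mutex> lock(mutex_);
                    --pending_;
                }
                all_done_.notify_all();
            });
    }

    // Block until every detached task has finished.
    void wait()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        all_done_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    std::mutex mutex_;
    std::condition_variable all_done_;
    std::size_t pending_ = 0;
    std::vector<std::thread> threads_;
};

int run_detached_demo()
{
    int result = 0;
    task_tracker tracker;
    tracker.detach(
        [&result]
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            result = 42;
        });
    // Reading result here, before wait(), would race with the task.
    tracker.wait();
    return result; // the task has finished, so its write is visible
}
```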

Waiting for submitted or detached tasks with a timeout
Sometimes you may wish to wait for the tasks to complete, but only for a certain amount of time, or until a specific point in time. For example, if the tasks have not yet completed after some time, you may wish to let the user know that there is a delay.
For tasks submitted with futures using submit_task(), this can be achieved using two member functions of std::future:
• wait_for() waits for the task to be completed, but stops waiting after the specified duration, given as an argument of type std::chrono::duration, has passed.
• wait_until() waits for the task to be completed, but stops waiting after the specified time point, given as an argument of type std::chrono::time_point, has been reached.
In both cases, the functions will return std::future_status::ready if the future is ready, meaning the task is finished and its return value, if any, has been obtained. However, they will return std::future_status::timeout if the future is not yet ready by the time the timeout has expired.
Here the output should look similar to this:
    Sorry, the task is not done yet.
    Sorry, the task is not done yet.
    Sorry, the task is not done yet.
    Sorry, the task is not done yet.
    Task done!
For detached tasks, since we do not have a future for them, we cannot use this method. However, BS::thread_pool has two member functions, also named wait_for() and wait_until(), which similarly wait for a specified duration or until a specified time point, but do so for all tasks (whether submitted or detached). Instead of an std::future_status, the thread pool's wait functions return true if all tasks finished running, or false if the duration expired or the time point was reached but some tasks are still running.
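The polling pattern for a single future can be sketched with the standard library as follows. The helper count_timeouts() and the function poll_demo() are hypothetical names, and std::async is used in place of the pool. The durations (a 300 millisecond task polled every 50 milliseconds) are arbitrary choices for this illustration.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Hypothetical helper: waits for the future in steps of the given interval and
// returns how many times the wait timed out before the future became ready.
// A real program could print a "not done yet" message on each timeout.
template <typename T>
int count_timeouts(std::future<T>& future, const std::chrono::milliseconds interval)
{
    int timeouts = 0;
    while (true)
    {
        const std::future_status status = future.wait_for(interval);
        if (status == std::future_status::ready)
            return timeouts;
        if (status == std::future_status::deferred)
        {
            // A deferred task only runs when explicitly waited for.
            future.wait();
            return timeouts;
        }
        ++timeouts; // std::future_status::timeout
    }
}

int poll_demo()
{
    std::future<void> my_future = std::async(std::launch::async,
        []
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(300));
        });
    return count_timeouts(my_future, std::chrono::milliseconds(50));
}
```

The exact number of timeouts depends on scheduling, so only a lower bound can be relied upon.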
Here is the same example as above, using detach_task() and pool.wait_for():
Of course, this will also work with detach_task(), if we call wait() on the pool itself instead of on the returned future. Note that in this example, instead of getting a future from submit_task() and then waiting for that future, we simply called wait() on that future straight away. This is a common way of waiting for a task to complete if we have nothing else to do in the meantime. Note also that we passed flag_object by reference to the lambda, since we want to set the flag on that same object, not a copy of it (passing by value wouldn't have worked anyway, since variables captured by value are implicitly const).
Another thing you might want to do is call a member function from within the object itself, that is, from another member function. This follows a similar syntax, except that you must also capture this (i.e. a pointer to the current object) in the lambda.
One of the most common and effective methods of parallelization is splitting a loop into smaller loops and running them in parallel. It is most effective in "embarrassingly parallel" computations, such as vector or matrix operations, where each iteration of the loop is completely independent of every other iteration.
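The point about passing the object by reference can be illustrated with a small standard-library sketch. The class flag_class and its members are hypothetical, modeled on the name flag_object used in the text, and std::async stands in for the pool.

```cpp
#include <future>

// Hypothetical class with a flag that a task will set.
class flag_class
{
public:
    bool get_flag() const
    {
        return flag;
    }

    void set_flag(const bool arg)
    {
        flag = arg;
    }

private:
    bool flag = false;
};

bool set_flag_on_other_thread()
{
    flag_class flag_object;
    // Capture by reference so the task modifies this object, not a copy.
    // Capturing by value ([flag_object]) would not compile here, since the
    // copy inside a non-mutable lambda is const and set_flag() is non-const.
    std::async(std::launch::async, [&flag_object] { flag_object.set_flag(true); }).wait();
    return flag_object.get_flag();
}
```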
For example, if we are summing up two vectors of 1000 elements each, and we have 10 threads, we could split the summation into 10 blocks of 100 elements each, and run all the blocks in parallel, potentially increasing performance by up to a factor of 10.
BS::thread_pool can automatically parallelize loops. To see how this works, consider the following generic loop:
    for (T i = start; i < end; ++i)
        loop(i);
where:
• T is any signed or unsigned integer type.
• The loop is over the range [start, end), i.e. inclusive of start but exclusive of end.
• loop() is an operation performed for each loop index i, such as modifying an array with end - start elements.
This loop may be automatically parallelized and submitted to the thread pool's queue using the member function submit_loop(), which has the following syntax:
    pool.submit_loop(start, end, loop, num_blocks);
where:
• start is the first index in the range.
• end is the index after the last index in the range, such that the full range is [start, end). In other words, the loop will be equivalent to the one above if start and end are the same.
  - start and end must both be of the same integer type T. See below for examples of what to do when they are not of the same type.
  - Note that if end <= start, nothing will happen.
• loop() is the function that should run in every iteration of the loop, and takes one argument, the loop index.
• num_blocks is the number of blocks of the form [a, b) to split the loop into. For example, if the range is [0, 9) and there are 3 blocks, then the blocks will be the ranges [0, 3), [3, 6), and [6, 9).
  - The internal algorithm ensures that each of the blocks has one of two sizes, differing by 1, with the larger blocks always first, so that the tasks are as evenly distributed as possible. For example, if the range [0, 100) is split into 15 blocks, the result will be 10 blocks of size 7, which will be executed first, and 5 blocks of size 6.
  - This argument can be omitted, in which case the number of blocks will be the number of threads in the pool.
Each block will be submitted to the thread pool's queue as a separate task. Therefore, a loop that is split into 3 blocks will be split into 3 individual tasks, which may run in parallel. If there is only one block, then the entire loop will run as one task, and no parallelization will take place.
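The block sizes described above can be reproduced with a few lines of integer arithmetic. The function split_into_blocks() below is a hypothetical helper written for this illustration, not the library's internal code. Capping the number of blocks at the number of indices is an assumption made here to avoid empty blocks.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: splits [start, end) into half-open blocks [a, b) whose
// sizes differ by at most 1, with the larger blocks first.
template <typename T>
std::vector<std::pair<T, T>> split_into_blocks(const T start, const T end, std::size_t num_blocks)
{
    std::vector<std::pair<T, T>> blocks;
    if (end <= start || num_blocks == 0)
        return blocks; // empty range: nothing to do
    const std::size_t total = static_cast<std::size_t>(end - start);
    if (num_blocks > total)
        num_blocks = total; // assumption: never create empty blocks
    const std::size_t small_size = total / num_blocks;
    const std::size_t num_large = total % num_blocks; // blocks with one extra index
    T block_start = start;
    for (std::size_t b = 0; b < num_blocks; ++b)
    {
        const std::size_t size = small_size + (b < num_large ? 1 : 0);
        const T block_end = static_cast<T>(block_start + static_cast<T>(size));
        blocks.emplace_back(block_start, block_end);
        block_start = block_end;
    }
    return blocks;
}
```

For the range [0, 100) and 15 blocks, total / num_blocks is 6 and total % num_blocks is 10, so the first 10 blocks have size 7 and the remaining 5 have size 6, matching the example in the text.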
To parallelize the generic loop above, we use the following commands:
    BS::multi_future<void> loop_future = pool.submit_loop(start, end, loop, num_blocks);
    loop_future.wait();
submit_loop() returns an object of the helper class template BS::multi_future. This is essentially a specialization of std::vector<std::future<T>> with additional member functions. Each of the num_blocks blocks will have an std::future assigned to it, and all these futures will be stored inside the returned BS::multi_future. When loop_future.wait() is called, the main thread will wait until all tasks generated by submit_loop() finish executing, and only those tasks, not any other tasks that also happen to be in the queue. This is essentially the role of the BS::multi_future class: to wait for a specific group of tasks, in this case the tasks running the loop blocks.
What value should you use for num_blocks? Omitting this argument, so that the number of blocks will be equal to the number of threads in the pool, is typically a good choice. For best performance, it is recommended to do your own benchmarks to find the optimal number of blocks for each loop (you can use the BS::timer utility class). Using fewer tasks than there are threads may be preferred if you are also running other tasks in parallel. Using more tasks than there are threads may improve performance in some cases, but parallelization with too many tasks will suffer from diminishing returns.
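To make the "one future per block" idea concrete, here is a sketch that splits a loop into blocks, runs each block with std::async, and waits for exactly that group of futures. The function launch_loop() is a hypothetical stand-in for submit_loop(): it starts one thread per block instead of queuing tasks in a pool, and it returns a plain std::vector of futures rather than a BS::multi_future.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-in for submit_loop(): runs loop(i) for every i in
// [start, end), split into num_blocks blocks, one asynchronous task per block.
template <typename T, typename F>
std::vector<std::future<void>> launch_loop(const T start, const T end, F loop, std::size_t num_blocks)
{
    std::vector<std::future<void>> futures;
    if (end <= start || num_blocks == 0)
        return futures;
    const std::size_t total = static_cast<std::size_t>(end - start);
    if (num_blocks > total)
        num_blocks = total;
    const std::size_t small_size = total / num_blocks;
    const std::size_t num_large = total % num_blocks;
    T block_start = start;
    for (std::size_t b = 0; b < num_blocks; ++b)
    {
        const std::size_t size = small_size + (b < num_large ? 1 : 0);
        const T block_end = static_cast<T>(block_start + static_cast<T>(size));
        // Each block gets its own copy of the loop function and its own future.
        futures.push_back(std::async(std::launch::async,
            [loop, block_start, block_end]() mutable
            {
                for (T i = block_start; i < block_end; ++i)
                    loop(i);
            }));
        block_start = block_end;
    }
    return futures;
}

std::vector<unsigned int> compute_squares(const unsigned int count)
{
    std::vector<unsigned int> squares(count);
    // Every index writes to a different element, so the blocks are independent.
    std::vector<std::future<void>> loop_future = launch_loop<unsigned int>(
        0, count, [&squares](const unsigned int i) { squares[i] = i * i; }, 10);
    for (std::future<void>& future : loop_future)
        future.wait(); // wait for this group of tasks only
    return squares;
}
```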
Note that submit_loop() was executed with the explicit template parameter <unsigned int>. The reason is that the two loop indices must be of the same type. However, here max is an unsigned int, while 0 is a (signed) int, so the types do not match, and the code will not compile unless we force the 0 to be of the right type. This can be done most elegantly by specifying the type of the indices explicitly using the template parameter.
The reason this is not done automatically (e.g. using std::common_type) is that it may result in accidentally casting negative indices to an unsigned type, or integer indices to a too narrow integer type, which may lead to an incorrect loop range.
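The deduction issue can be reproduced with any function template that takes two indices of the same type T. In the sketch below, for_range() is a hypothetical sequential helper, not a library function; it only demonstrates the type mismatch and the pitfall of std::common_type mentioned above.

```cpp
#include <limits>
#include <type_traits>

// Hypothetical helper: both indices must have the same type T.
template <typename T, typename F>
void for_range(const T start, const T end, F&& body)
{
    for (T i = start; i < end; ++i)
        body(i);
}

unsigned int sum_below(const unsigned int max)
{
    unsigned int sum = 0;
    // for_range(0, max, ...) would not compile: T would be deduced as int from
    // the literal 0 and as unsigned int from max, which is a conflict.
    // Specifying T explicitly converts the 0 to unsigned int.
    for_range<unsigned int>(0, max, [&sum](const unsigned int i) { sum += i; });
    return sum;
}

// Why the common type is not used automatically: for int and unsigned int it
// is unsigned int, so a negative start index would silently become huge.
static_assert(std::is_same_v<std::common_type_t<int, unsigned int>, unsigned int>,
              "the common type of int and unsigned int is unsigned int");
static_assert(static_cast<std::common_type_t<int, unsigned int>>(-1) == std::numeric_limits<unsigned int>::max(),
              "-1 converted to the common type wraps around to the maximum value");
```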
We could also cast the 0 explicitly to unsigned int, but that doesn't look as nice:
As a side note, notice that here we parallelized the calculation of the squares, but we did not parallelize printing the results. This is for two reasons:
1. We want to print out the squares in ascending order, and we have no guarantee that the blocks will be executed in the correct order. This is very important; you must never expect that the parallelized loop will execute in the same order as the non-parallelized loop.
2. If we did print out the squares from within the parallel tasks, we would get a huge mess, since all 10 blocks would print to the standard output at once. Later we will see how to synchronize printing to a stream from multiple tasks at the same time.

Parallelizing loops without futures
Just as in the case of detach_task() vs. submit_task(), sometimes you may want to parallelize a loop, but you don't need it to return a BS::multi_future. In this case, you can save the overhead of generating the futures (which can be significant, depending on the number of blocks) by using detach_loop() instead of submit_loop(), with the same arguments.

Parallelizing individual indices vs. blocks
We have seen that detach_loop() and submit_loop() execute the function loop(i) for each index i in the loop. However, behind the scenes, the loop is split into blocks, and each block executes the loop() function multiple times. Each block has an internal loop of the form (where T is the type of the indices):
    for (T i = start; i < end; ++i)
        loop(i);
The start and end indices of each block are determined automatically by the pool. For example, in the previous section, the loop from 0 to 100 was split into 10 blocks of 10 indices each: start = 0 to end = 10, start = 10 to end = 20, and so on; the blocks are not inclusive of the last index, since the for loop has the condition i < end and not i <= end.
However, this also means that the loop() function is executed multiple times per block. This generates additional overhead due to the multiple function calls. For short loops, this should not affect performance. However, for very long loops, with millions of indices, the performance cost may be significant.
For this reason, the thread pool library provides two additional member functions for parallelizing loops: detach_blocks() and submit_blocks(). While detach_loop() and submit_loop() execute a function loop(i) once per index but multiple times per block, detach_blocks() and submit_blocks() execute a function block(start, end) once per block.
The main advantage of this method is increased performance, but the main disadvantage is slightly more complicated code. In particular, the user must define the loop from start to end manually within each block. Here is the previous example using detach_blocks():
Note how the block function takes two arguments, and includes the internal loop.
Generally, compiler optimizations should be able to make detach_loop() and submit_loop() perform roughly the same as detach_blocks() and submit_blocks(). However, you should perform your own benchmarks to see which option works best for your particular use case.
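The difference between the two styles comes down to how often the user's function is called. The sketch below runs both styles sequentially (no threads) and counts the calls. The functions run_per_index(), run_per_block(), and compare_styles() are hypothetical names, and the block boundaries use a simple proportional formula rather than the library's own partitioning.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Style 1: the user function handles one index, so it is called once per
// index. Returns the number of calls. Requires start <= end, num_blocks > 0.
std::size_t run_per_index(const std::size_t start, const std::size_t end, const std::size_t num_blocks,
                          const std::function<void(std::size_t)>& loop)
{
    std::size_t calls = 0;
    const std::size_t total = end - start;
    for (std::size_t b = 0; b < num_blocks; ++b)
    {
        const std::size_t block_start = start + b * total / num_blocks;
        const std::size_t block_end = start + (b + 1) * total / num_blocks;
        for (std::size_t i = block_start; i < block_end; ++i)
        {
            loop(i);
            ++calls;
        }
    }
    return calls;
}

// Style 2: the user function handles a whole block [block_start, block_end)
// and contains the internal loop itself, so it is called once per block.
std::size_t run_per_block(const std::size_t start, const std::size_t end, const std::size_t num_blocks,
                          const std::function<void(std::size_t, std::size_t)>& block)
{
    std::size_t calls = 0;
    const std::size_t total = end - start;
    for (std::size_t b = 0; b < num_blocks; ++b)
    {
        block(start + b * total / num_blocks, start + (b + 1) * total / num_blocks);
        ++calls;
    }
    return calls;
}

struct style_comparison
{
    std::size_t index_calls;
    std::size_t block_calls;
    bool same_result;
};

// Compute 100 squares in 10 blocks with both styles and compare.
style_comparison compare_styles()
{
    std::vector<std::size_t> by_index(100);
    std::vector<std::size_t> by_block(100);
    const std::size_t index_calls = run_per_index(0, 100, 10,
        [&by_index](const std::size_t i) { by_index[i] = i * i; });
    const std::size_t block_calls = run_per_block(0, 100, 10,
        [&by_block](const std::size_t block_start, const std::size_t block_end)
        {
            for (std::size_t i = block_start; i < block_end; ++i)
                by_block[i] = i * i;
        });
    return {index_calls, block_calls, by_index == by_block};
}
```

Both styles produce the same result, but the per-index style makes 100 calls to the user function while the per-block style makes only 10.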

Loops with return values
Unlike submit_task(), the member function submit_loop() only takes loop functions with no return values. The reason is that it wouldn't make sense to return a future for every single index of the loop. However, submit_blocks() does allow the block function to have a return value, as the number of blocks will generally not be too large, unlike the number of indices.
The block function will be executed once for each block, but the blocks are managed by the thread pool, with the user only able to select the number of blocks, but not the range of each block. Therefore, there is limited usability in returning one value per block. However, for cases where this is desired, such as for summation or some sorting algorithms, submit_blocks() does accept functions with return values, in which case it returns a BS::multi_future<T> object where T is the type of the return values. Here's an example of a function template summing all elements of type T in a given range:
Here we used the fact that BS::multi_future<T> is a specialization of std::vector<std::future<T>>, so we can use a range-based for loop to iterate over the futures, and use the get() member function of each future to get its value. The values of the futures will be the partial sums from each block, so when we add them up, we will get the total sum. Note that we divided the loop into 100 blocks, so there will be 100 futures in total, each with the partial sum of 10,000 numbers.
The range-based for loop will likely start before the loop has finished executing, and each time it calls a future, it will get the value of that future if it is ready, or it will wait until the future is ready and then get the value. This increases performance, since we can start summing the results without waiting for the entire loop to finish executing first; we only need to wait for individual blocks.
If we did want to wait until the entire loop finishes before summing the results, we could have used the get() member function of the BS::multi_future<T> object itself, which returns an std::vector<T> with the values obtained from each future. In that case, the sum could be obtained after calling submit_blocks() as follows:
    std::vector<T> partial_sums = loop_future.get();
    T result = std::reduce(partial_sums.begin(), partial_sums.end());
    return result;
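The partial-sum pattern can be sketched with the standard library as follows. The function template sum_range() is hypothetical, and it uses std::async (one new thread per block) in place of submit_blocks(), so it illustrates the pattern of one future per block rather than the library's behavior or performance.

```cpp
#include <cstddef>
#include <cstdint>
#include <future>
#include <vector>

// Hypothetical sketch: sums all integers in [start, end) by computing one
// partial sum per block, each in its own asynchronous task.
template <typename T>
T sum_range(const T start, const T end, std::size_t num_blocks)
{
    if (end <= start || num_blocks == 0)
        return T{0};
    const std::size_t total = static_cast<std::size_t>(end - start);
    if (num_blocks > total)
        num_blocks = total;
    const std::size_t small_size = total / num_blocks;
    const std::size_t num_large = total % num_blocks;

    std::vector<std::future<T>> partial_sums;
    T block_start = start;
    for (std::size_t b = 0; b < num_blocks; ++b)
    {
        const std::size_t size = small_size + (b < num_large ? 1 : 0);
        const T block_end = static_cast<T>(block_start + static_cast<T>(size));
        // The block function returns a value, so each future holds a partial sum.
        partial_sums.push_back(std::async(std::launch::async,
            [block_start, block_end]
            {
                T block_total = 0;
                for (T i = block_start; i < block_end; ++i)
                    block_total += i;
                return block_total;
            }));
        block_start = block_end;
    }

    // get() waits for each block individually, so summation can begin before
    // the later blocks have finished.
    T result = 0;
    for (std::future<T>& future : partial_sums)
        result += future.get();
    return result;
}
```

For example, sum_range<std::uint64_t>(1, 1000001, 100) adds up the integers from 1 to 1,000,000 using 100 blocks of 10,000 numbers each.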

Parallelizing sequences
The member functions detach_loop(), submit_loop(), detach_blocks(), and submit_blocks() parallelize a loop by splitting it into blocks, and submitting each block as an individual task to the queue, with each such task iterating over all the indices in the corresponding block's range, which can be numerous. However, sometimes we have loops with few indices, or more generally, a sequence of tasks enumerated by some index. In such cases, we can avoid the overhead of splitting into blocks and simply submit each individual index as its own independent task to the pool's queue.
This can be done with detach_sequence() and submit_sequence(). The syntax of these functions is similar to detach_loop() and submit_loop(), except that they don't have the num_blocks argument at the end. The sequence function must take only one argument, the index. As usual, detach_sequence() detaches the tasks and does not return a future, while submit_sequence() returns a BS::multi_future. If the tasks in the sequence return values, then the futures will contain those values, otherwise they will be void futures.
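The "one task per index" idea can be sketched as follows. The function launch_sequence() is a hypothetical stand-in for submit_sequence(): it uses std::async instead of a pool and returns a plain vector of futures. It assumes the sequence function returns its result by value.

```cpp
#include <future>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for submit_sequence(): starts one asynchronous task
// per index in [first, last) and returns one future per index.
template <typename T, typename F>
std::vector<std::future<std::invoke_result_t<F&, T>>> launch_sequence(const T first, const T last, F sequence)
{
    using R = std::invoke_result_t<F&, T>;
    std::vector<std::future<R>> futures;
    for (T i = first; i < last; ++i)
    {
        // No blocks here: each index becomes its own independent task.
        futures.push_back(std::async(std::launch::async,
            [sequence, i]() mutable
            {
                return sequence(i);
            }));
    }
    return futures;
}

std::vector<int> squares_via_sequence(const int count)
{
    std::vector<std::future<int>> futures = launch_sequence(0, count, [](const int i) { return i * i; });
    std::vector<int> results;
    for (std::future<int>& future : futures)
        results.push_back(future.get()); // results are collected in index order
    return results;
}
```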

More about BS::multi_future<T>
The helper class template BS::multi_future<T>, which we have been using throughout this section, provides a convenient way to collect and access groups of futures. This class is a specialization of std::vector<std::future<T>>, so it should be used in a similar way:
• When you create a new object, either use the default constructor to create an empty object and add futures to it later, or pass the desired number of futures to the constructor in advance.
• Use the [] operator to access the future at a specific index, or the push_back() member function to append a new future to the list.
• The size() member function tells you how many futures are currently stored in the object.
However, BS::multi_future<T> also has additional member functions that are aimed specifically at handling futures:
• Once all the futures are stored, you can use wait() to wait for all of them at once or get() to get an std::vector<T> with the results from all of them.
• You can check how many futures are ready using ready_count().
• You can check if all the stored futures are valid using valid().
• You can wait for all the stored futures for a specific duration with wait_for() or wait until a specific time with wait_until(). These functions return true if all futures have been waited for before the duration expired or the time point was reached, and false otherwise.
Aside from using BS::multi_future<T> to track the execution of parallelized loops, it can also be used, for example, whenever you have several different groups of tasks and you want to track the execution of each group individually.
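To make the behavior of ready_count() and get() concrete, here is a minimal standard-library sketch of how such operations could be written for a plain std::vector<std::future<T>>. The free functions count_ready() and get_all() are hypothetical names introduced for this illustration; they are not part of the library, and the library's own implementation may differ.

```cpp
#include <chrono>   // std::chrono::seconds
#include <cstddef>  // std::size_t
#include <future>   // std::future, std::future_status
#include <vector>   // std::vector

// Hypothetical analogue of ready_count(): count the futures whose results are
// already available, without blocking.
template <typename T>
std::size_t count_ready(const std::vector<std::future<T>>& futures)
{
    std::size_t count = 0;
    for (const std::future<T>& future : futures)
    {
        // A zero-length wait returns immediately and reports the current status.
        if (future.valid() && future.wait_for(std::chrono::seconds(0)) == std::future_status::ready)
            ++count;
    }
    return count;
}

// Hypothetical analogue of get(): collect the results of all futures, in order.
template <typename T>
std::vector<T> get_all(std::vector<std::future<T>>& futures)
{
    std::vector<T> results;
    results.reserve(futures.size());
    for (std::future<T>& future : futures)
        results.push_back(future.get()); // blocks until this future is ready
    return results;
}
```

Note that get_all() consumes the futures: after it returns, each future is no longer valid, just as with a single call to std::future::get().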

Utility classes
The optional header file BS_thread_pool_utils.hpp contains several useful utility classes. These are not necessary for using the thread pool itself; BS_thread_pool.hpp is the only header file required. However, the utility classes can make writing multithreading code more convenient. The version of the utilities header file can be found by checking the macro BS_THREAD_POOL_UTILS_VERSION.

Synchronizing printing to a stream with BS::synced_stream
When printing to an output stream from multiple threads in parallel, the output may become garbled. For example, consider this code:
Assuming you have at least 4 hardware threads (so that 4 tasks can run concurrently), the output should be similar to:

    12 tasks total, 0 tasks running, 12 tasks queued.
    Task 0 done.
    Task 1 done.
    Task 2 done.
    Task 3 done.
    8 tasks total, 4 tasks running, 4 tasks queued.
    Task 4 done.
    Task 5 done.
    Task 6 done.
    Task 7 done.
    4 tasks total, 4 tasks running, 0 tasks queued.
    Task 8 done.
    Task 9 done.
    Task 10 done.
    Task 11 done.
    0 tasks total, 0 tasks running, 0 tasks queued.

The reason we called pool.wait() in the beginning is that when the thread pool is created, an initialization task runs in each thread, so if we don't wait, the first line will say there are 16 tasks in total, including the 4 initialization tasks. See below for more details.

Purging tasks
Consider a situation where the user cancels a multithreaded operation while it is still ongoing. Perhaps the operation was split into multiple tasks, and half of the tasks are currently being executed by the pool's threads, but the other half are still waiting in the queue.
The thread pool cannot terminate the tasks that are already running, as the C++17 standard does not provide that functionality (and in any case, abruptly terminating a task while it's running could have extremely bad consequences, such as memory leaks and data corruption). However, the tasks that are still waiting in the queue can be purged using the purge() member function.
Once purge() is called, any tasks still waiting in the queue will be discarded, and will never be executed by the threads. Please note that there is no way to restore the purged tasks; they are gone forever.
The program submits 8 tasks to the queue. Each task waits 100 milliseconds and then prints a message. The thread pool has 4 threads, so it will execute the first 4 tasks in parallel, and then the remaining 4. We wait 50 milliseconds, to ensure that the first 4 tasks have all started running. Then we call purge() to purge the remaining 4 tasks. As a result, these tasks never get executed. However, since the first 4 tasks are still running when purge() is called, they will finish uninterrupted; purge() only discards tasks that have not yet started running. The output of the program therefore only contains the messages from the first 4 tasks:

    Task 0 done.
    Task 1 done.
    Task 2 done.
    Task 3 done.

Exception handling
submit_task() catches any exceptions thrown by the submitted task and forwards them to the corresponding future. They can then be caught when invoking the get() member function of the future. For example:

    #include "BS_thread_pool.hpp"       // BS::thread_pool
    #include "BS_thread_pool_utils.hpp" // BS::synced_stream
    #include <stdexcept>                // std::runtime_error

    BS::synced_stream sync_out;
    BS::thread_pool pool;

    double inverse(const double x)
    {
        if (x == 0)
            throw std::runtime_error("Division by zero!");
        else
            return 1 / x;
    }

    int main()
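The way an exception travels from a task to the code that calls get() can be illustrated with the standard library alone. In the sketch below, std::async stands in for submit_task() (both return an std::future), and the helper try_inverse() is a hypothetical name introduced here; this is an illustration of the mechanism, not a reproduction of the library's example.

```cpp
#include <exception>  // std::exception
#include <future>     // std::async, std::future, std::launch
#include <stdexcept>  // std::runtime_error
#include <string>     // std::string, std::to_string

double inverse(const double x)
{
    if (x == 0)
        throw std::runtime_error("Division by zero!");
    return 1 / x;
}

// Hypothetical helper: run inverse() as an asynchronous task and report either
// its result or the message of the exception it threw.
std::string try_inverse(const double x)
{
    // std::async stands in for submit_task() here.
    std::future<double> my_future = std::async(std::launch::async, inverse, x);
    try
    {
        // An exception thrown inside the task is stored in the future and
        // rethrown by get() in the calling thread.
        return "Result: " + std::to_string(my_future.get());
    }
    catch (const std::exception& e)
    {
        return std::string("Caught exception: ") + e.what();
    }
}
```

Calling try_inverse(0) yields the string "Caught exception: Division by zero!", showing that the exception surfaces where get() is called rather than in the thread that ran the task.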

Getting information about the threads
BS::thread_pool comes with a variety of methods to obtain information about the threads in the pool:
1. The namespace BS::this_thread provides functionality similar to std::this_thread. If the current thread belongs to a BS::thread_pool object, then BS::this_thread::get_index() can be used to get the index of the current thread, and BS::this_thread::get_pool() can be used to get the pointer to the thread pool that owns the current thread. Please see the reference below for more details.
2. The member function get_thread_ids() returns a vector containing the unique identifiers for each of the pool's threads, as obtained by std::thread::get_id(). These values are not so useful on their own, but can be used for whatever the user wants to use them for.
3. The optional member function get_native_handles(), if enabled, returns a vector containing the underlying implementation-defined thread handles for each of the pool's threads, as obtained by std::thread::native_handle(). For more information, see the relevant section below.

Thread pool initialization functions
Sometimes, it is necessary to initialize the threads before they run any tasks. This can be done by submitting a proper initialization function to the constructor or to reset(), either as the only argument or as the second argument after the desired number of threads. The thread initialization function must take no arguments and have no return value. However, if needed, the function can use BS::this_thread::get_index() and BS::this_thread::get_pool() to figure out which thread and pool it belongs to.
The thread initialization function is submitted as a set of special tasks, one per thread, which bypass the queue, but still count towards the number of running tasks, which means get_tasks_total() and get_tasks_running() will report that these tasks are running if they are checked immediately after the pool is initialized. This is done so that the user has the option to either wait for the initialization tasks to finish, by calling wait() on the pool, or just keep going. In either case, the initialization tasks will always finish executing before any tasks are picked out of the queue, so there is no reason to wait for them to finish unless they have some side effects that affect the main thread.
Warning: If the thread pool is destroyed while paused, any tasks still in the queue will never be executed!

Avoiding wait deadlocks
Consider the following program:

    #include "BS_thread_pool.hpp" // BS::thread_pool
    #include <iostream>           // std::cout

    int main()
    {
        BS::thread_pool pool;
        pool.detach_task(
            [&pool]
            {
                pool.wait();
                std::cout << "Done waiting.\n";
            });
    }

This program creates a thread pool, and then detaches a task that waits for tasks in the same thread pool to complete. If you run this program, it will never print the message "Done waiting", because the task will wait for itself to complete. This causes a deadlock, and the program will wait forever.
Usually, in simple programs, this will never happen. However, in more complicated programs, perhaps ones running multiple thread pools in parallel, wait deadlocks could potentially occur. In such cases, the macro BS_THREAD_POOL_ENABLE_WAIT_DEADLOCK_CHECK can be defined before including BS_thread_pool.hpp. wait() will then check whether the user tried to call it from within a thread of the same pool, and if so, it will throw the exception BS::thread_pool::wait_deadlock instead of waiting. This check is disabled by default because wait deadlocks are not something that happens often, and the check adds a small but non-zero overhead every time wait() is called.

Setting task priority
Defining the macro BS_THREAD_POOL_ENABLE_PRIORITY before including BS_thread_pool.hpp enables task priority. The priority of a task or group of tasks may then be specified as an additional argument (at the end of the argument list) to detach_task(), submit_task(), detach_blocks(), submit_blocks(), detach_loop(), submit_loop(), detach_sequence(), and submit_sequence(). If the priority is not specified, the default value will be 0.
The priority is a number of type BS::priority_t, which is a signed 16-bit integer, so it can have any value between -32,768 and 32,767. The tasks will be executed in priority order from highest to lowest. If priority is assigned to the block/loop/sequence parallelization functions, which submit multiple tasks, then all of these tasks will have the same priority.
The namespace BS::pr contains some pre-defined priorities for users who wish to avoid magic numbers and enjoy better future-proofing. In order of decreasing priority, the pre-defined priorities are: BS::pr::highest, BS::pr::high, BS::pr::normal, BS::pr::low, and BS::pr::lowest.

Here is a simple example:

    #define BS_THREAD_POOL_ENABLE_PRIORITY
    #include "BS_thread_pool.hpp"       // BS::thread_pool
    #include "BS_thread_pool_utils.hpp" // BS::synced_stream

This program will print out the tasks in the correct priority order. Note that for simplicity, we used a pool with just one thread, so the tasks will run one at a time. In a pool with 5 or more threads, all 5 tasks will actually run more or less at the same time, because, for example, the task with the second-highest priority will be picked up by another thread while the task with the highest priority is still running.
Of course, this is just a pedagogical example. In a realistic use case we may want, for example, to submit tasks that must be completed immediately with high priority so they skip over other tasks already in the queue, or background non-urgent tasks with low priority so they evaluate only after higher-priority tasks are done.
Here are some subtleties to note when using task priority:
• Task priority is facilitated using std::priority_queue, which has O(log n) complexity for storing new tasks, and O(1) complexity for accessing the next (i.e. highest-priority) task with top(), although removing that task from the queue with pop() is again O(log n). This is in contrast with std::queue, used if priority is disabled, which both stores and retrieves with O(1) complexity.
• Due to this, enabling the priority queue can incur a very slight decrease in performance, depending on the specific use case, which is why this feature is disabled by default. As usual, there is a trade-off here, where you get functionality in exchange for performance. However, the difference in performance is never substantial, and compiler optimizations can often reduce it to a negligible amount.
• When using the priority queue, tasks will not necessarily be executed in the same order they were submitted, even if they all have the same priority. This is due to the implementation of std::priority_queue as a binary heap, which means tasks are stored as a binary tree instead of sequentially. To execute tasks in submission order, give them monotonically decreasing priorities.
• Technically, BS::priority_t is defined to be std::int_least16_t, since this type is guaranteed to be present on all systems, rather than std::int16_t, which is optional in the C++ standard. This means that on some exotic systems BS::priority_t may actually have more than 16 bits. However, the pre-defined priorities are 100% portable, and will always have the same values (e.g. BS::pr::highest = 32767) regardless of the actual bit width.
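The ordering behavior described in the list above can be explored with std::priority_queue directly. The following sketch is an assumption-laden illustration rather than the library's internals: labeled_task, by_priority, and execution_order() are hypothetical names, tasks are reduced to a label plus a priority, and the numeric priorities used in the usage note are arbitrary apart from the 16-bit limits.

```cpp
#include <cstdint>  // std::int_least16_t
#include <queue>    // std::priority_queue
#include <string>   // std::string
#include <vector>   // std::vector

using priority_t = std::int_least16_t; // same underlying type as described for BS::priority_t

// Hypothetical stand-in for a queued task: just a label and a priority.
struct labeled_task
{
    std::string label;
    priority_t priority;
};

// Comparator that places the highest-priority task on top of the queue.
struct by_priority
{
    bool operator()(const labeled_task& lhs, const labeled_task& rhs) const
    {
        return lhs.priority < rhs.priority;
    }
};

// Push all tasks, then pop them one by one, returning the labels in the order
// in which a single worker thread would retrieve them.
std::vector<std::string> execution_order(const std::vector<labeled_task>& tasks)
{
    std::priority_queue<labeled_task, std::vector<labeled_task>, by_priority> queue;
    for (const labeled_task& task : tasks)
        queue.push(task); // O(log n)
    std::vector<std::string> order;
    while (!queue.empty())
    {
        order.push_back(queue.top().label); // top() is O(1)
        queue.pop();                        // pop() is O(log n)
    }
    return order;
}
```

Submitting tasks with the distinct priorities -100, 32767, 0, 100, and -32768 retrieves them from highest to lowest regardless of submission order, while submitting tasks "a", "b", "c" with priorities 3, 2, 1 retrieves them in submission order, which is the monotonically decreasing trick mentioned above.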
    cmake -S . -B build
    cmake --build build

10 Complete library reference
This section provides a complete reference to classes, member functions, objects, and macros available in this library, along with other important information. Member functions are given here with simplified prototypes (e.g. removing const) for ease of reading.
More information can be found in the provided Doxygen comments. Any modern IDE, such as Visual Studio Code, can use the Doxygen comments to provide automatic documentation for any class and member function in this library when hovering over code with the mouse or using auto-complete.
The class BS::thread_pool is the main thread pool class. It can be used to create a pool of threads and submit tasks to a queue. When a thread becomes available, it takes a task from the queue and executes it.
The member functions that are available by default, when no macros are defined, are:
• Constructors:
  - thread_pool(): Construct a new thread pool. The number of threads will be the total number of hardware threads available, as reported by the implementation. This is usually determined by the number of cores in the CPU. If a core is hyperthreaded, it will count as two threads.
  - Construct a new thread pool with the specified number of threads and initialization function.

• Resetters:
  - void reset(): Reset the pool with the total number of hardware threads available, as reported by the implementation. Waits for all currently running tasks to be completed, then destroys all threads in the pool and creates a new thread pool with the new number of threads. Any tasks that were waiting in the queue before the pool was reset will then be executed by the new threads. If the pool was paused before resetting it, the new pool will be paused as well.
  - void reset(BS::concurrency_t num_threads): Reset the pool with a new number of threads.
  - void reset(std::function<void()>& init_task): Reset the pool with the total number of hardware threads available, as reported by the implementation, and a new initialization function.
  - void reset(BS::concurrency_t num_threads, std::function<void()>& init_task): Reset the pool with a new number of threads and a new initialization function.
• Getters:
  - size_t get_tasks_queued(): Get the number of tasks currently waiting in the queue to be executed by the threads.
  - size_t get_tasks_running(): Get the number of tasks currently being executed by the threads.
  - size_t get_tasks_total(): Get the total number of unfinished tasks: either still waiting in the queue, or running in a thread. Note that get_tasks_total() == get_tasks_queued() + get_tasks_running().
  - BS::concurrency_t get_thread_count(): Get the number of threads in the pool.
• Task submission without futures (T and F are template parameters):
  - void detach_task(F&& task): Submit a function with no arguments and no return value into the task queue. To submit a function with arguments, enclose it in a lambda expression. Does not return a future, so the user must use wait() or some other method to ensure that the task finishes executing, otherwise bad things will happen.
  - void detach_blocks(T first_index, T index_after_last, F&& block, size_t num_blocks = 0): Parallelize a loop by automatically splitting it into blocks and submitting each block separately to the queue. The block function takes two arguments, the start and end of the block, so that it is called only once per block, but it is up to the user to make sure the block function correctly deals with all the indices in each block. Does not return a BS::multi_future, so the user must use wait() or some other method to ensure that the loop finishes executing, otherwise bad things will happen.
  - void detach_loop(T first_index, T index_after_last, F&& loop, size_t num_blocks = 0): Parallelize a loop by automatically splitting it into blocks and submitting each block separately to the queue. The loop function takes one argument, the loop index, so that it is called many times per block. Does not return a BS::multi_future, so the user must use wait() or some other method to ensure that the loop finishes executing, otherwise bad things will happen.
  - void detach_sequence(T first_index, T index_after_last, F&& sequence): Submit a sequence of tasks enumerated by indices to the queue. Does not return a BS::multi_future, so the user must use wait() or some other method to ensure that the sequence finishes executing, otherwise bad things will happen.
• Task submission with futures (T, F, and R are template parameters):
  - std::future<R> submit_task(F&& task): Submit a function with no arguments into the task queue. To submit a function with arguments, enclose it in a lambda expression. If the function has a return value, get a future for the eventual returned value. If the function has no return value, get an std::future<void> which can be used to wait until the task finishes.
  - BS::multi_future<R> submit_blocks(T first_index, T index_after_last, F&& block, size_t num_blocks = 0): Parallelize a loop by automatically splitting it into blocks and submitting each block separately to the queue. The block function takes two arguments, the start and end of the block, so that it is called only once per block, but it is up to the user to make sure the block function correctly deals with all the indices in each block. Returns a BS::multi_future that contains the futures for all of the blocks.
  - BS::multi_future<void> submit_loop(T first_index, T index_after_last, F&& loop, size_t num_blocks = 0): Parallelize a loop by automatically splitting it into blocks and submitting each block separately to the queue. The loop function takes one argument, the loop index, so that it is called many times per block. It must have no return value. Returns a BS::multi_future that contains the futures for all of the blocks.
  - BS::multi_future<R> submit_sequence(T first_index, T index_after_last, F&& sequence): Submit a sequence of tasks enumerated by indices to the queue. Returns a BS::multi_future that contains the futures for all of the tasks.
• Task management:
  - void purge(): Purge all the tasks waiting in the queue. Tasks that are currently running will not be affected, but any tasks still waiting in the queue will be discarded, and will never be executed by the threads. Please note that there is no way to restore the purged tasks.
• Waiting for tasks (R, P, C, and D are template parameters):
  - void wait(): Wait for tasks to be completed. Normally, this function waits for all tasks, both those that are currently running in the threads and those that are still waiting in the queue. However, if the pool is paused, this function only waits for the currently running tasks (otherwise it would wait forever). Note: To wait for just one specific task, use submit_task() instead, and call the wait() member function of the generated future.
  - bool wait_for(std::chrono::duration<R, P>& duration): Wait for tasks to be completed, but stop waiting after the specified duration has passed. Returns true if all tasks finished running, false if the duration expired but some tasks are still running.
  - bool wait_until(std::chrono::time_point<C, D>& timeout_time): Wait for tasks to be completed, but stop waiting after the specified time point has been reached. Returns true if all tasks finished running, false if the time point was reached but some tasks are still running.

Optional features for the BS::thread_pool class
The thread pool has several optional features that must be explicitly enabled using macros.
• Task priority: Enabled by defining the macro BS_THREAD_POOL_ENABLE_PRIORITY.
  - If the priority is not specified, the default value will be 0.
  - The priority is a number of type BS::priority_t, which is a signed 16-bit integer, so it can have any value between -32,768 and 32,767. The tasks will be executed in priority order from highest to lowest.
  - The namespace BS::pr contains some pre-defined priorities: BS::pr::highest, BS::pr::high, BS::pr::normal, BS::pr::low, and BS::pr::lowest.
• Pausing: Enabled by defining the macro BS_THREAD_POOL_ENABLE_PAUSE. Adds the following member functions:
  - void pause(): Pause the pool. The workers will temporarily stop retrieving new tasks out of the queue, although any tasks already executing will keep running until they are finished.
  - void unpause(): Unpause the pool. The workers will resume retrieving new tasks out of the queue.
  - bool is_paused(): Check whether the pool is currently paused.
• Getting the native handles of the threads: Enabled by defining the macro BS_THREAD_POOL_ENABLE_NATIVE_HANDLES. Adds the following member function:
  - std::vector<std::thread::native_handle_type> get_native_handles(): Get a vector containing the underlying implementation-defined thread handles for each of the pool's threads.
• Wait deadlock checks: Enabled by defining the macro BS_THREAD_POOL_ENABLE_WAIT_DEADLOCK_CHECK.
  - When enabled, wait(), wait_for(), and wait_until() will check whether the user tried to call them from within a thread of the same pool, which would result in a deadlock. If so, they will throw the exception BS::thread_pool::wait_deadlock instead of waiting.
10.1.3 The BS::this_thread namespace
The namespace BS::this_thread provides functionality similar to std::this_thread. It contains the following function objects:
• BS::this_thread::get_index() can be used to get the index of the current thread. If this thread belongs to a BS::thread_pool object, it will have an index from 0 to BS::thread_pool::get_thread_count() - 1. Otherwise, for example if this thread is the main thread or an independent std::thread, std::nullopt will be returned.
• BS::this_thread::get_pool() can be used to get the pointer to the thread pool that owns the current thread. If this thread belongs to a BS::thread_pool object, a pointer to that object will be returned. Otherwise, std::nullopt will be returned.
• In both cases, an std::optional object will be returned, of type BS::this_thread::optional_index or BS::this_thread::optional_pool respectively. Unless you are 100% sure this thread is in a pool, first use std::optional::has_value() to check if it contains a value, and if so, use std::optional::value() to obtain that value.
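The has_value() check recommended above can be demonstrated with a small standard-library sketch. It assumes, purely for illustration, that a per-thread index is kept in a thread_local std::optional; the names my_index, get_index(), and index_seen_by_worker() are hypothetical and do not describe how the library stores this information.

```cpp
#include <cstddef>   // std::size_t
#include <optional>  // std::optional, std::nullopt
#include <thread>    // std::thread

// Hypothetical per-thread storage: empty unless a "pool" has assigned an index.
thread_local std::optional<std::size_t> my_index = std::nullopt;

// Analogue of BS::this_thread::get_index(): returns std::nullopt when called
// from a thread that was never assigned an index, such as the main thread.
std::optional<std::size_t> get_index()
{
    return my_index;
}

// Start a worker that records its index, then report what the worker observed.
std::optional<std::size_t> index_seen_by_worker(const std::size_t index)
{
    std::optional<std::size_t> seen;
    std::thread worker(
        [index, &seen]
        {
            my_index = index; // a real pool would do this when creating the thread
            seen = get_index();
        });
    worker.join();
    return seen;
}
```

Calling get_index() from the main thread returns an empty optional, so has_value() is false, whereas index_seen_by_worker(3) returns an optional holding 3. Calling value() on the empty optional would throw std::bad_optional_access, which is why the check comes first.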
10.1.4 The BS::multi_future<T> class
BS::multi_future<T> is a helper class used to facilitate waiting for and/or getting the results of multiple futures at once. It is defined as a specialization of std::vector<std::future<T>>. This means that all of the member functions that can be used on an std::vector can also be used on a BS::multi_future. For example, you may use a range-based for loop with a BS::multi_future, since it has iterators.
In addition to inherited member functions, BS::multi_future has the following specialized member functions (R, P, C, and D are template parameters):
• [void or std::vector<T>] get(): Get the results from all the futures stored in this BS::multi_future, rethrowing any stored exceptions. If the futures return void, this function returns void as well. If the futures return a type T, this function returns a vector containing the results.
• size_t ready_count(): Check how many of the futures stored in this BS::multi_future are ready.
• bool valid(): Check if all the futures stored in this BS::multi_future are valid.
• void wait(): Wait for all the futures stored in this BS::multi_future.
• bool wait_for(std::chrono::duration<R, P>& duration): Wait for all the futures stored in this BS::multi_future, but stop waiting after the specified duration has passed. Returns true if all futures have been waited for before the duration expired, false otherwise.
• bool wait_until(std::chrono::time_point<C, D>& timeout_time): Wait for all the futures stored in this BS::multi_future, but stop waiting after the specified time point has been reached. Returns true if all futures have been waited for before the time point was reached, false otherwise.
10.2 Utility header file (BS_thread_pool_utils.hpp)
10.2.1 The BS::signaller class
BS::signaller is a utility class which can be used to allow simple signalling between threads. This class is really just a convenient wrapper around std::promise, which contains both the promise and its future. It has the following member functions:
• signaller(): Construct a new signaller.
• void wait(): Wait until the signaller is ready.
• void ready(): Inform any waiting threads that the signaller is ready.
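The idea of wrapping a promise together with its future can be sketched in a few lines of standard C++. The class simple_signaller and the helper worker_ran_after_signal() below are hypothetical names for this illustration; the sketch mirrors the three member functions listed above but makes no claim about the library's actual implementation.

```cpp
#include <future>  // std::future, std::promise
#include <thread>  // std::thread

// Hypothetical minimal signaller: holds a promise together with its future.
class simple_signaller
{
public:
    simple_signaller() : future(promise.get_future()) {}

    // Block the calling thread until ready() has been called.
    void wait()
    {
        future.wait();
    }

    // Release the waiting thread. May be called only once.
    void ready()
    {
        promise.set_value();
    }

private:
    std::promise<void> promise; // declared first, so it is constructed first
    std::future<void> future;
};

// Usage: the worker does not proceed past wait() until the caller signals.
bool worker_ran_after_signal()
{
    simple_signaller signal;
    bool flag = false;
    bool flag_at_wake = false;
    std::thread worker(
        [&]
        {
            signal.wait();
            flag_at_wake = flag; // safe: ready() synchronizes with wait()
        });
    flag = true; // written before ready(), so the worker must observe it
    signal.ready();
    worker.join();
    return flag_at_wake;
}
```

Because setting the promise's value synchronizes with the return of wait(), worker_ran_after_signal() returns true: the worker sees every write the signalling thread made before calling ready().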

10.2.2 The BS::synced_stream class
BS::synced_stream is a utility class which can be used to synchronize printing to an output stream by different threads. It has the following member functions (T is a template parameter pack):
• synced_stream(std::ostream& stream = std::cout): Construct a new synced stream which prints to the given output stream.
• void print(T&&... items): Print any number of items into the output stream. Ensures that no other threads print to this stream simultaneously, as long as they all exclusively use the same synced_stream object to print.
• void println(T&&... items): Print any number of items into the output stream, followed by a newline character.
In addition, the class comes with two stream manipulators, which are meant to help the compiler figure out which template specializations to use with the class:
• BS::synced_stream::endl: An explicit cast of std::endl. Prints a newline character to the stream, and then flushes it. Should only be used if flushing is desired, otherwise a newline character should be used instead.
• BS::synced_stream::flush: An explicit cast of std::flush. Used to flush the stream.
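To show why a single shared object prevents garbled output, here is a minimal sketch of a mutex-guarded stream wrapper written with the standard library only. The class simple_synced_stream is a hypothetical stand-in with the same print() and println() interface as listed above; it omits the stream manipulators and is not the library's code.

```cpp
#include <iostream>  // std::cout, std::ostream
#include <mutex>     // std::mutex, std::scoped_lock
#include <sstream>   // std::ostringstream (used in the usage note)
#include <utility>   // std::forward

// Hypothetical minimal synced stream: one mutex guards every print call.
class simple_synced_stream
{
public:
    explicit simple_synced_stream(std::ostream& stream = std::cout) : out_stream(stream) {}

    // Print any number of items as one uninterrupted unit.
    template <typename... T>
    void print(T&&... items)
    {
        const std::scoped_lock lock(stream_mutex);
        (out_stream << ... << std::forward<T>(items)); // C++17 fold expression
    }

    // Same as print(), followed by a newline character.
    template <typename... T>
    void println(T&&... items)
    {
        print(std::forward<T>(items)..., '\n');
    }

private:
    std::ostream& out_stream;
    std::mutex stream_mutex;
};
```

As a usage example, constructing simple_synced_stream sync_out(buffer) with an std::ostringstream buffer and calling sync_out.println("Task ", 1, " done.") appends one complete line to the buffer. The lock is held for the whole call, so items from different threads cannot interleave within a line, provided all threads print through the same object.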

10.2.3 The BS::timer class
BS::timer is a utility class which can be used to measure execution time for benchmarking purposes. It has the following member functions:
• timer(): Construct a new timer and immediately start measuring time.
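A timer that starts measuring upon construction can be sketched with std::chrono as follows. Only the constructor's behavior is taken from the description above; the class name simple_timer, the member function elapsed_ms(), and the helper time_ms() are assumptions made for this illustration and should not be read as the interface of BS::timer.

```cpp
#include <chrono>  // std::chrono::steady_clock, std::chrono::duration
#include <ratio>   // std::milli
#include <thread>  // std::this_thread::sleep_for (used in the usage note)

// Hypothetical minimal timer; member names other than the constructor's
// behavior are assumptions made for this sketch.
class simple_timer
{
public:
    // Start measuring time upon construction.
    simple_timer() : start_time(std::chrono::steady_clock::now()) {}

    // Milliseconds elapsed since construction, as a floating-point number.
    double elapsed_ms() const
    {
        const std::chrono::duration<double, std::milli> elapsed = std::chrono::steady_clock::now() - start_time;
        return elapsed.count();
    }

private:
    std::chrono::steady_clock::time_point start_time;
};

// Usage: measure how long a callable takes to run.
template <typename F>
double time_ms(F&& task)
{
    const simple_timer timer;
    task();
    return timer.elapsed_ms();
}
```

For example, time_ms([] { std::this_thread::sleep_for(std::chrono::milliseconds(20)); }) returns a value of roughly 20 or slightly more. The sketch uses std::chrono::steady_clock because it is monotonic, which makes it suitable for measuring intervals.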
• On Linux with Clang: replace g++ with clang++.
• On Windows with GCC or Clang: replace -o BS_thread_pool_test with -o BS_thread_pool_test.exe and remove -pthread.
To install BS::thread_pool, simply download the latest release from the GitHub repository, place the header file BS_thread_pool.hpp from the include folder in the desired folder, and include it in your program:
Here are two examples.
We could even get rid of the multiply function entirely and put everything inside a lambda, if desired:

    std::future<double> my_future = pool.submit_task([first, second] { return first * second; });
    std::cout << my_future.get() << '\n';
    }

This program creates a new object flag_object of the class flag_class, sets the flag to true using the setter member function set_flag(), and then prints out the flag's value using the getter member function get_flag(). What if we want to submit the member function set_flag() as a task to the thread pool? We simply wrap the entire statement flag_object.set_flag(true); in a lambda, and pass flag_object to the lambda by reference, as in this example:

    << std::boolalpha << flag_object.get_flag() << '\n'; }
    std::cout << std::boolalpha << flag_object.get_flag() << '\n'; }

Here is an example:
Note that in this example we defined the thread pool as a global object, so that it is accessible outside the main() function.