Introduction
There are many ways to create code that runs in parallel. In Swift in the Apple world it was common to use functions from Grand Central Dispatch (GCD), but in Swift 5.5 async
and await
became first-class citizens of the language. That means there’s now a more Swift-like way to handle concurrent code.
I personally get confused with the meaning of some of the keywords across languages. For instance, the async
keyword in Swift is not the same as the async
function in C++ STD library. That’s why I wanted to compare at least Swift to C++.
This article shows different ways of running a loop in parallel in both Swift and C++, and I also compare the running times using a problem from Advent of Code as benchmark.
Advent of Code “Jengatris” as benchmark
I’ve taken the problem from day 22 of Advent of Code 2023 as my benchmark. It’s a problem I like because it’s easy to visualize. In fact, I first solved it using the Godot Engine, just so I could see it in 3D from the start.
Please read the full description of the problem in the AoC website for details. But the blunt summary is that you have to solve a game that it’s like a mixture of Tetris and Jenga, so I refer to it as “Jengatris”. 3-dimensional bricks or “pieces” fall from above until they stack. Then, in the first part of the problem, you have to find out which pieces are “essential”, that is, if such a piece gets removed other above it will fall. If more than one piece sustains another from above, then they are not “essential” (according to my definition of “essential”; of course if you remove both of them, the one above will fall).
In the second part of the problem, which it’s what we are interested in for the benchmark, you have to count how many pieces will fall if you were to remove one of the essential pieces. The answer to the problem is the sum of all the pieces that will fall for all the essential pieces.
Here’s a video made in Godot showing the result:
In the first part of the video I let the pieces fall to find the solution of the first part. Then the screen flashes a bit because there are 974 essential pieces for my input, so I have to quickly simulate 974 scenarios to get the sum. Finally you can see I remove all the essentials, just to see more pieces fall (this is not part of the problem, but just for fun).
Copyable game state
Because we have to simulate each possibility for all the essential pieces, it is very convenient if we can easily create copies of the game state. This will also be very helpful when computing several solutions in parallel, because we want to avoid sharing any data to avoid race conditions.
In Swift we can simply use a struct
, because structs in Swift are passed by value. In contrast, classes are passed by reference, so no data gets duplicated. So to be able to copy the whole game state, I simply declared it like this:
struct GameState { var pieces: [AABB<Int>] var volume: VoxelVolume<Int> }
A game state contains a list of pieces described as integer Axis-Aligned Bounding Boxes (AABBs), and what I called a Voxel Volume, which it’s a 3-dimensional integer matrix that stores the ID of each piece, to know whether that integer coordinate is occupied or not.
Note that in C++ a struct
behaves very differently. In C++ a struct is basically the same as a class, except that all its members are public by default. So to be really explicit that I want to duplicate the game state, I added a ‘copy’
function, and explicitly deleted the copy constructor and the copy-assignment operator:
struct GameState { std::vector< AABB<int> > pieces; std::shared_ptr<VoxelVolume<int> > volume; GameState(IntAABBList pieces, std::shared_ptr<VoxelVolume<int> > volume); std::shared_ptr<GameState> copy() const; GameState(const GameState&) = delete; GameState& operator=(const GameState&) = delete; };
The copy function looks like this:
std::shared_ptr<Jengatris::GameState> Jengatris::GameState::copy() const { std::vector<AABB<int> > p(pieces.begin(), pieces.end()); return std::make_shared<Jengatris::GameState>(p, volume->copy()); }
Note that the Voxel Volume has a copy function as well. You can see the whole code in Github: algoDeSwift/AoC2023.
We have done the most important thing for concurrency. Now let’s see how to run the simulations in parallel.
Parallel loop in Swift with GCD
The sequential solution in Swift can be written functionally with the ‘reduce’
function (remember that the answer to part 2 of the problem is the sum of all possible simulated scenarios):
static func countFalls(state: GameState, ids: Set<Int>) -> Int { return ids.reduce(0) { sum, i in let (_, n) = Jengatris.simulate(start: state, without: i) return sum + n } }
It is quite straightforward to rewrite a loop as a parallel loop with GCD:
static func concurrentCountFalls(state: GameState, ids: Set<Int>) -> Int { let indexArray: [Int] = Array(ids) var counts = [Int].init(repeating: 0, count: indexArray.count) DispatchQueue.concurrentPerform(iterations: indexArray.count) { iteration in let id = indexArray[iteration] let (_, n) = Jengatris.simulate(start: state, without: id) counts[iteration] = n } return counts.reduce(0, +) }
Notice that we created a shared resource, ‘counts’
, where we save all the intermediate results. But we don’t need any mutex for this because we aren’t resizing the array and each thread will only write to its unique position, given by ‘iteration’
. The number of threads created will be automatically decided by GCD depending on the number of cores of the CPU and the capabilities of the hardware. In my Mac mini M1 it creates 8 threads.
Parallel loop in Swift with async/await
As I mentioned in the introduction, Swift 5.5 introduced some keywords for asynchronous code: ‘async’
and ‘await’
. Note that because the code is asynchronous it doesn’t necessarily mean it runs in parallel. For instance, Javascript is very asynchronous, but it’s mostly single-threaded (unless you are using Workers). During each “tick” of the event loop the different callbacks get updated in Javascript.
But concurrent code is asynchronous by nature, so it is useful to have first-class citizens asynchronous keywords in the language to write concurrent code. You just need to flag a function with ‘async’
to mark it asynchronous. And then you can use the ‘await’
to wait for something to finish before continuing with the rest of the code, without actually blocking the execution of the program.
There’s a special function called ‘withTaskGroup’
that it’s very helpful for our example. It creates a Task Group where you can keep adding Tasks. In our case each task will be one of the simulations. Then, we simply wait for all the results to come back. Here’s the code:
static func countFallsAsync(state: GameState, ids: Set<Int>) async -> Int { var sum = 0 await withTaskGroup(of: Int.self) { group in for id in ids { group.addTask { let (_, n) = Jengatris.simulate(start: state, without: id) return n } } for await n in group { sum += n } } return sum }
Here we don’t have to worry about the number of threads created either. The system will choose for us. The performance should be the same as with GCD.
Parallel loop in C++ with threads
Let’s see how the C++ code compares to Swift. Let’s start for writing down the sequential version of it. Instead of using the ‘reduce’
function, I used a for-loop
for this one because I think it’s easier to read:
size_t Jengatris::countFalls(const GameState& state, const std::unordered_set<int>& ids) { size_t sum = 0; for (const auto& id : ids) { std::unordered_set<int> moved; auto s = state.copy(); auto _ = Jengatris::simulate(*s, id, &moved); sum += moved.size(); } return sum; }
A simple solution that works is to create a thread for each element in the input. That creates lots of threads, though. My input has 974 entries, so that’s 974 threads. The code below creates a lambda function with the work each thread needs to do, creates all the threads, and then waits for all of them to finish with ‘join’
:
size_t Jengatris::countFallsThreaded(const GameState &state, const std::unordered_set<int> &ids) { std::vector<int> idArray(ids.begin(), ids.end()); std::vector<size_t> counts(ids.size()); std::vector<std::thread> threads; auto parallelWork = [&state, &idArray, &counts](int iteration) { int id = idArray[iteration]; std::unordered_set<int> moved; auto s = state.copy(); auto _ = Jengatris::simulate(*s, id, &moved); counts[iteration] = moved.size(); }; // this will start MANY threads!! (974 threads for my input) for (size_t i = 0; i < idArray.size(); i++) { threads.emplace_back(parallelWork, i); } // Wait for threads to finish for (auto& thread : threads) { thread.join(); } return std::accumulate(counts.begin(), counts.end(), 0); }
Parallel loop in C++ with async
In C++ ‘std::async’ is a function template used to run a function asynchronously, potentially in a separate thread which might be part of a thread pool. This is different from flagging a function asynchronous with ‘async’ in Swift.
This can be used in combination with ‘futures’
to wait for results. Javascript programmers may be used to Futures when using Promises. These are the same. In fact, an std::future can also be paired with an std::promise. But here we are interested in the asynchronous function. My code looks like this now:
size_t Jengatris::countFallsAsync(const GameState &state, const std::unordered_set<int> &ids) { std::vector<int> idArray(ids.begin(), ids.end()); std::vector<std::future<size_t>> futures; auto parallelWork = [&state, &idArray](int iteration) { int id = idArray[iteration]; std::unordered_set<int> moved; auto s = state.copy(); auto _ = Jengatris::simulate(*s, id, &moved); return moved.size(); }; // Start asynchronous tasks for (size_t i = 0; i < idArray.size(); ++i) { futures.push_back( std::async(std::launch::async, parallelWork, i)); } // Wait for tasks to finish and accumulate the results // When I put a breakpoint here, Xcode says there are 372 threads. // Still lots, but less than 974... size_t total = 0; for (auto& future : futures) { total += future.get(); } return total; }
The code is a bit ugly, though. We can do better with Threading Building Blocks (TBB).
Parallel loop in C++ with C++17 and TBB
C++17 does include parallel algorithms, such as transform_reduce. You can pass an execution policy, and if you specify ‘parallel’
, it should run in parallel. So my parallel loop can be cleanly written like this:
size_t Jengatris::countFallsParallel(const GameState &state, const std::unordered_set<int> &ids) { return std::transform_reduce( std::execution::par, ids.begin(), ids.end(), size_t(0), std::plus<>(), [&state](int id) { std::unordered_set<int> moved; auto s = state.copy(); auto _ = Jengatris::simulate(*s, id, &moved); return moved.size(); } ); }
However, that code does not compile with Apple Clang. If you check the compiler support page, parallel algorithms are not supported in Apple Clang. To compile that in macOS, I installed GCC with Homebrew, and TBB (see below). The code compiles and runs. However, it doesn’t seem to run in parallel. The performance is the same as the sequential version. So I rewrote it with Intel TBB.
TBB is a C++ template library developed by Intel for parallel programming. The code got slightly uglier, but here’s the same parallel loop:
size_t Jengatris::countFallsTBB(const GameState &state, const std::unordered_set<int> &ids) { std::vector<int> idArray(ids.begin(), ids.end()); size_t sum = tbb::parallel_reduce( tbb::blocked_range<size_t>(0, idArray.size()), size_t(0), [&](const tbb::blocked_range<size_t>& range, size_t localSum) { for (size_t i = range.begin(); i != range.end(); ++i) { int id = idArray[i]; std::unordered_set<int> moved; auto s = state.copy(); auto _ = Jengatris::simulate(*s, id, &moved); localSum += moved.size(); } return localSum; }, std::plus<>() ); return sum; }
For some reason, Homebrew doesn’t automatically find the headers and libraries, so I had to point at them manually. For reference, my compiler command is:
g++-13 -std=c++17 -O3 -Wall -Wextra -pedantic -o advent advent2023-cpp/*.cpp -ltbb -I/opt/homebrew/Cellar/tbb/2021.11.0/include/ -L/opt/homebrew/Cellar/tbb/2021.11.0/lib
Benchmark
I’ve collected some numbers, mostly to verify that indeed the parallel code runs faster. I’ve compiled both Swift and C++ versions in Release mode optimizing for speed, not size, and averaged the values of a few runs. See the graph below.
A few observations:
- Any version runs much faster than GDScript, which takes 24 seconds.
- The C++ code runs a bit faster than Swift.
- Clang performs slightly faster than GCC.
- All the parallel implementations perform similarly to each other, in their respective languages/compilers, except for the “GCC parallel” which doesn’t seem to be working as expected.
- The C++ solution with many threads seems slightly slower than the other C++ solutions in GCC. Presumably because of the overhead of creating that many threads, although it doesn’t seem to make a difference in Clang.
Summary
There are multiple ways to do computations in parallel in both Swift and C++, some approaches more modern than others. However, always prepare some benchmark to actually check that the solution works. You may get a surprise like the one I got with C++17 (if someone knows why the parallel execution is being ignored, please let me know).
In the C++ world, go to the Compiler Support to find out which features of the standard are actually implemented in the compiler you are using. If you are targetting multiple platforms, you may not want to use the parallel algorithms from C++17.
C++ sounds a bit scarier than Swift, but I hope with these comparisons you see C++ doesn’t need to be that much verbose.
Thanks to Eric Wastl for Advent of Code.
All the code can be found in Github: algoDeSwift/AoC2023.
Tweet