# Improving Parallelism with Multithreading
## The Cost of Synchronization
We study the problem of summing the sequence of integers $0, 1, \ldots, n - 1$ in parallel. We partition the sequence into $t$ disjoint regions and assign each region to one of $t$ threads. In this first version, every thread adds its elements directly to a single shared sum, which is protected by a mutex.
```rust
use std::{
    sync::{Arc, Mutex},
    thread,
    time::Instant,
};

fn main() {
    let args = std::env::args().collect::<Vec<String>>();
    if args.len() != 3 {
        panic!("Usage: {} <nthreads> <log_nelems>", args[0]);
    }
    let nthreads: usize = args[1].parse().unwrap();
    let log_nelems: usize = args[2].parse().unwrap();
    let nelems = 1_usize << log_nelems;
    let nelems_per_thread = nelems / nthreads;

    // The global sum, shared by all threads and protected by a mutex.
    let gsum = Arc::new(Mutex::new(0));
    let now = Instant::now();
    let mut handlers = vec![];
    for i in 0..nthreads {
        let gsum = Arc::clone(&gsum);
        let handler = thread::spawn(move || {
            // Each thread sums its own disjoint region [start, end).
            let start = i * nelems_per_thread;
            let end = start + nelems_per_thread;
            for j in start..end {
                // The mutex is locked and unlocked on every single iteration.
                let mut sum = gsum.lock().unwrap();
                *sum += j;
            }
        });
        handlers.push(handler);
    }
    for handle in handlers {
        handle.join().unwrap();
    }
    assert_eq!(nelems * (nelems - 1) / 2, *gsum.lock().unwrap());
    println!("Running took {} ms.", now.elapsed().as_millis());
}
```
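The program takes the number of threads and the base-2 logarithm of the number of elements on the command line; for example, an invocation along the lines of `cargo run --release -- 4 20` (the exact command depends on how the crate is set up) sums $n = 2^{20}$ integers using four threads.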
We tested on a quad-core system with a sequence of size $n=2^{20}$, and the running time is measured in milliseconds. The results are as follows:
| Number of Threads | 1 | 2 | 4 | 8 | 16 | 32 |
| -- | -- | -- | -- | -- | -- | -- |
| Time (ms) | 14932 | 72823 | 76478 | 118155 | 108878 | 101532 |
At first the results are difficult to understand: performance gets worse as the number of threads, and hence the number of cores in use, increases. The reason for the poor performance is the cost of the synchronization operations, the mutex lock and unlock (the P and V operations), relative to the cost of the single memory update they protect. This highlights an important lesson about parallel programming: synchronization is expensive and should be avoided whenever possible. If it cannot be avoided, as much useful computation as possible should be performed between synchronization operations to offset the cost.
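One way to act on this lesson, short of removing the shared variable altogether, is to make each synchronization operation pay for much more work: each thread sums its region into a local variable and touches the mutex only once. The following sketch is not part of the original benchmark; it is a minimal variant of the program above with the command-line handling replaced by fixed, hypothetical parameters.

```rust
use std::{
    sync::{Arc, Mutex},
    thread,
};

fn main() {
    // Hypothetical fixed parameters standing in for the command-line arguments.
    let nthreads = 4;
    let nelems = 1_usize << 20;
    let nelems_per_thread = nelems / nthreads;

    let gsum = Arc::new(Mutex::new(0_usize));
    let mut handlers = vec![];
    for i in 0..nthreads {
        let gsum = Arc::clone(&gsum);
        handlers.push(thread::spawn(move || {
            let start = i * nelems_per_thread;
            let end = start + nelems_per_thread;
            // Accumulate locally: no locking inside the hot loop.
            let local_sum: usize = (start..end).sum();
            // Lock the mutex once per thread instead of once per element.
            *gsum.lock().unwrap() += local_sum;
        }));
    }
    for handle in handlers {
        handle.join().unwrap();
    }
    assert_eq!(nelems * (nelems - 1) / 2, *gsum.lock().unwrap());
}
```

With one lock operation per thread, the synchronization cost is amortized over a large amount of useful work. The next section goes one step further and removes the shared variable from the loop entirely.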
## Optimizing with Local Variables to Eliminate Synchronization
We therefore use local variables to eliminate the unnecessary synchronized memory references: each thread accumulates its partial sum in a local variable instead of the shared global variable, and the main thread combines the partial sums once the workers have finished.
```rust
use std::{thread, time::Instant};

fn main() {
    let args = std::env::args().collect::<Vec<String>>();
    if args.len() != 3 {
        panic!("Usage: {} <nthreads> <log_nelems>", args[0]);
    }
    let nthreads: usize = args[1].parse().unwrap();
    let log_nelems: usize = args[2].parse().unwrap();
    let nelems = 1_usize << log_nelems;
    let nelems_per_thread = nelems / nthreads;

    let now = Instant::now();
    let mut handlers = vec![];
    for i in 0..nthreads {
        let handler = thread::spawn(move || {
            // Accumulate the partial sum in a thread-local variable:
            // no synchronization is needed inside the loop.
            let mut sum = 0;
            let start = i * nelems_per_thread;
            let end = start + nelems_per_thread;
            for j in start..end {
                sum += j;
            }
            // Return the partial sum to the main thread via join().
            sum
        });
        handlers.push(handler);
    }
    // Combine the partial sums after all worker threads have finished.
    let mut gsum = 0;
    for handle in handlers {
        gsum += handle.join().unwrap();
    }
    assert_eq!(nelems * (nelems - 1) / 2, gsum);
    println!("Running took {} s.", now.elapsed().as_secs_f64());
}
```
We tested on a quad-core system with a sequence of size $n=2^{31}$, and the running time is measured in seconds. The results are as follows:
| Number of Threads | 1 | 2 | 4 | 8 | 16 | 32 |
| -- | -- | -- | -- | -- | -- | -- |
| Time (s) | 22.413 | 11.430 | 6.087 | 7.414 | 7.428 | 7.411 |
This time the results are in line with our expectations: the running time decreases as threads are added, at least up to the number of available cores.
## Characterizing the Performance of Parallel Programs
Looking at the running times in the table above, we see that as the number of threads increases, the running time decreases, until we reach four threads, at which point it levels off and even starts to rise slightly. In the ideal case we would expect the running time to decrease linearly with the number of cores, that is, to halve each time the number of threads doubles. This is indeed what happens for $t \le 4$. Once $t > 4$, each of the four cores is already busy running at least one thread, and adding more threads only adds the overhead of context switching between multiple threads on the same core, which is why the running time increases slightly. For this reason, parallel programs are often written so that they run only one thread per core.
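Following that guideline means the thread count should come from the machine rather than being hard-coded. A minimal sketch, assuming a reasonably recent stable Rust toolchain that provides `std::thread::available_parallelism`, might look like this:

```rust
use std::thread;

fn main() {
    // Ask the standard library how many threads the system can usefully run
    // in parallel (typically the number of logical cores); fall back to 1.
    let nthreads = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    println!("Spawning {nthreads} worker threads, one per core.");
}
```

Note that this reports logical CPUs, which may exceed the number of physical cores on machines with hardware threading.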
While absolute running time is the ultimate measure of program performance, there are some useful relative measures that indicate how well a parallel program exploits potential parallelism. The speedup of a parallel program is usually defined as

$$S_p = \frac{T_1}{T_p}$$

where $p$ is the number of processor cores and $T_k$ is the running time on $k$ cores. This formulation is sometimes referred to as strong scaling. When $T_1$ is the running time of a sequential version of the program, $S_p$ is called the absolute speedup; when $T_1$ is the running time of the parallel version running on a single core, it is called the relative speedup. Absolute speedup gives a more realistic picture of the benefits of parallelism than relative speedup: even when a parallel program runs on a single processor, it still pays synchronization overhead, which artificially inflates the relative speedup by increasing the numerator $T_1$. On the other hand, absolute speedup is harder to measure because it requires two different versions of the program, and for complex parallel code, creating a separate sequential version may not be practical, either because the code is too complex or because the source code is not available.
A related measure is efficiency, defined as

$$E_p = \frac{S_p}{p} = \frac{T_1}{p\,T_p}$$

and usually expressed as a percentage in the range $(0, 100]$. Efficiency measures the overhead due to parallelization: a program with high efficiency spends more time doing useful work and less time synchronizing and communicating.
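As a quick illustration of these two definitions (not part of the original measurements), the following sketch computes $S_p$ and $E_p$ from a baseline time $T_1$ and a parallel running time $T_p$:

```rust
// Speedup S_p = T_1 / T_p and efficiency E_p = S_p / p, where t1 and tp are
// running times in seconds and p is the number of processor cores.
fn speedup_and_efficiency(t1: f64, tp: f64, p: usize) -> (f64, f64) {
    let s = t1 / tp;
    (s, s / p as f64)
}

fn main() {
    // Values taken from the measurement table below: T_1 = 22.413 s, T_4 = 6.087 s.
    let (s4, e4) = speedup_and_efficiency(22.413, 6.087, 4);
    println!("S_4 = {:.2}, E_4 = {:.0}%", s4, e4 * 100.0);
}
```

Plugging in the measured times reproduces the 3.68 speedup and roughly 92% efficiency reported for four threads in the table below.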
The table below shows the speedup and efficiency measurements for our parallel sum program. Efficiency above $90\%$ is very good, but don't be misled: it is this high only because our problem is almost trivially easy to parallelize, which is rarely the case in practice. Parallel programming has been an active research area for decades, and with the advent of commercial multicore machines whose core counts double every few years, it will remain a deep, challenging, and active field of research.
| Threads ($t$) | 1 | 2 | 4 | 8 | 16 | 32 |
| -- | -- | -- | -- | -- | -- | -- |
| Cores ($p$) | 1 | 2 | 4 | 4 | 4 | 4 |
| Running Time ($T_p$, s) | 22.413 | 11.430 | 6.087 | 7.414 | 7.428 | 7.411 |
| Speedup ($S_p$) | 1 | 1.96 | 3.68 | 3.023 | 3.017 | 3.024 |
| Efficiency ($E_p$) | $100\%$ | $98\%$ | $92\%$ | $76\%$ | $75\%$ | $76\%$ |