Modern scientific research, from simulating black holes to modeling climate systems, requires computational resources that far exceed what a single processor can provide. Problems involving massive datasets or computationally expensive algorithms (e.g., Monte Carlo simulations, numerical PDE solvers, machine learning training) demand performance beyond sequential execution.
Parallel computing addresses this by breaking a problem into smaller tasks that can be solved simultaneously on multiple processing elements. With the rise of multicore CPUs, distributed systems, GPUs, and specialized accelerators, parallel computing has become central to high-performance computing (HPC). This lab will introduce you to the theory, programming models, and practical execution of parallel codes, with examples in Python, C, and MPI. You will also gain experience running jobs on a modern HPC cluster with a workload manager like Slurm.
Theoretical Foundations¶
Before we dive into implementation, we review key concepts that define the limits and opportunities of parallelism.
Amdahl’s Law (Strong Scaling)¶
If a fraction of a program is inherently sequential, the maximum speedup with processors is:
Note that, as , .
Implication: Even a small sequential portion limits total speedup.
Example: If 5% of your code is sequential, the maximum speedup is 20x, no matter how many processors you add.
This highlights why HPC algorithms for systems like Frontier (the DOE exascale machine) must minimize sequential bottlenecks.
This corresponds to strong scaling tests, where the problem size is fixed and we ask how performance improves as resources increase.
Gustafson’s Law (Weak Scaling)¶
A more optimistic view: as we increase , we also increase the problem size to fully utilize resources:
Implication: In scientific computing, we often want higher resolution or larger domains, so performance scales with problem size.
This corresponds to weak scaling tests, which measure how performance changes when the workload grows proportionally with resources.
Flynn’s Taxonomy¶
To better understand computing architectures, Flynn (1972) classified them into four categories:
SISD: Single Instruction, Single Data (traditional CPU execution)
SIMD: Single Instruction, Multiple Data (vector units, GPUs, NumPy and JAX vectorization if hardware supported)
MISD: Multiple Instruction, Single Data (rare, mostly theoretical)
MIMD: Multiple Instruction, Multiple Data (clusters, multicore CPUs, distributed MPI systems)
Programming models map naturally onto these:
Additional Resources¶
HPC Carpentry lessons: https://
hpc -carpentry .github .io MPI Tutorial: https://
mpitutorial .com/ Slurm quick-start guide: https://
slurm .schedmd .com /quickstart .html
Monte Carlo Computation of ¶
We will parallelize a simple algorithm using different techniques. The algorithm is monte carlo computation of . This is an embarrassingly parallel problem. so not much actual algorithm consideration is needed. We mainly use it to get ourselve familiar with different tools.
Python Series Code¶
Here is the algorithm in native python:
import random
def mcpi_loop(n_total):
n_inside = 0
for _ in range(n_total):
x, y = random.random(), random.random()
if x*x + y*y < 1.0:
n_inside += 1
return 4 * n_inside / n_totalpi = mcpi_loop(1000_000)
print(pi)%timeit mcpi_loop(1000_000)On my laptop it takes about 80ms to perform 1M samples. The number of significant digits is .
Embarrassingly Parallel Computing¶
Since this algorithm is embarrassingly parallelizable, we can simply run it multiple times and compute the mean. Let’s do this as a class exercise using this Google Sheet.
Effectively, we just did a weak scaling test!
Numpy Parallel Code¶
When compiled with BLAS backend, Numpy automatically distribute
compute across multiple cores.
import numpy as np
np.__config__.show()import os
print(os.environ.get('OPENBLAS_NUM_THREADS', 0))
print(os.environ.get('MKL_NUM_THREADS', 0))def mcpi_numpy(n_total):
x = np.random.rand(n_total)
y = np.random.rand(n_total)
return 4 * np.mean(x*x + y*y < 1.0)pi = 4 * mcpi_numpy(1000_000) / 1000_000
print(pi)%timeit mcpi_numpy(1000_000)os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
%timeit mcpi_numpy(1000_000)os.environ['MKL_NUM_THREADS'] = '4'
os.environ['OPENBLAS_NUM_THREADS'] = '4'
%timeit mcpi_numpy(1000_000)JAX Parallel Code¶
Many operations in JAX, especially linear algebra related,
automatically use multiple cores.
from jax import numpy as jnp
from jax import random as jrd
from jax import jit
def mcpi_jax(n_total):
key = jrd.key(0)
key1, key2 = jrd.split(key)
x = jrd.uniform(key1, n_total)
y = jrd.uniform(key2, n_total)
return 4 * jnp.mean(x*x + y*y < 1.0)pi = mcpi_jax(1000_000)
print(pi)%timeit mcpi_jax(1000_000)C Series Code¶
Our original python code can be easily translate to C.
Please put the following code into a new file mcpi_loop.c:
#include <stdlib.h>
int mcpi_loop(int n_total)
{
int n_inside = 0;
for(int i = 0; i < n_total; ++i) {
double x = (double)rand() / RAND_MAX;
double y = (double)rand() / RAND_MAX;
if(x * x + y * y < 1.0)
++n_inside;
}
return (4.0 * n_inside) / n_total;
}Although the actual C code is pretty simple, unfortunately, there is
not really a portable, high-resolution time functionon in bash.
Hence, we need to do timing ourselves in the C code:
#include <time.h>
static struct timespec t0;
void tik()
{
clock_gettime(CLOCK_REALTIME, &t0);
}
double tok()
{
struct timespec t1, dt;
clock_gettime(CLOCK_REALTIME, &t1);
dt.tv_nsec = t1.tv_nsec - t0.tv_nsec;
dt.tv_sec = t1.tv_sec - t0.tv_sec;
if(dt.tv_nsec < 0) {
dt.tv_nsec += 1000000000;
dt.tv_sec--;
}
int ms = dt.tv_nsec / 1000000 + dt.tv_sec * 1000;
int ns = dt.tv_nsec % 1000000;
return ms + 1e-6 * ns;
}Then we can put it all together and create the main function:
#include <stdio.h>
int main()
{
tik();
double pi = mcpi_loop(1000000);
double ms = tok();
printf("pi = %f\n", pi);
printf("dt = %f ms\n", ms);
return 0;
}We can put all these functions into a single “mcpi_loop.c” file and then compile with:
gcc mcpi_loop.c -O3 -o mcpi_loop
./mcpi_loopOn my laptop, the run takes ~ 36 ms.
Shared Memory: OpenMP C Code¶
OpenMP is a widely used API for writing multithreaded applications in C, C++, and Fortran. It allows you to add simple compiler directives (pragmas) to enable parallel execution on shared-memory systems.
Its key features include:
Easy to parallelize loops using
#pragma omp parallel forThreads share memory, so synchronization and data races must be handled carefully
OpenMP includes mechanisms for reductions, barriers, critical sections, and more
To use OpenMP:
Include the header #include <omp.h>
Compile with
gcc -fopenmp
Here is our updated mcpi_omp() function:
#include <omp.h>
#include <stdlib.h>
int mcpi_omp(int n_total)
{
int n_inside = 0;
#pragma omp parallel for reduction(+:n_inside)
for(int i = 0; i < n_total; ++i) {
double x = (double)rand() / RAND_MAX;
double y = (double)rand() / RAND_MAX;
if(x * x + y * y < 1.0)
++n_inside;
}
return n_inside;
}Copying it with tik(), tok(), and main() to “mcpi_omp.c”, we can
now compile our OpenMP version by
gcc mcpi_omp.c -fopenmp -O3 -o mcpi_omp
./mcpi_ompOn my laptop, the run takes ~ 8 ms, matching the JAX
implementation.
Distributed Memory: MPI¶
MPI (Message Passing Interface) is the de facto standard for distributed-memory parallel programming. Unlike OpenMP, where threads share memory, MPI processes have separate memory spaces and must explicitly communicate using messages. It is well-suited for large-scale clusters and supercomputers, where each node has its own memory and processors.
Common MPI functions include:
MPI_Init()andMPI_Finalize()for starting and ending MPI programsMPI_Comm_rank()andMPI_Comm_size()to identify each process and the total number of processesMPI_Send(),MPI_Recv(), andMPI_Reduce()for data exchange and aggregation
Note: One of the most commonly used implementations of MPI is OpenMPI. Despite the similar name, OpenMPI is unrelated to OpenMP! The name similarity is unfortunately confusing.
OpenMP is for multithreading on shared-memory systems
OpenMPI is a software implementation of the MPI standard for distributed-memory systems
To compile and run an MPI program, use mpicc to compile the program
and use mpirun -np N ./program to run it with N processes.
Here is an example implementation:
#include <mpi.h>
int main(int argc, char** argv)
{
tik();
MPI_Init(&argc, &argv);
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int n_total = 1000000;
int l_total = n_total / size;
srand(time(NULL)+rank); // ensure different seed per process
int l_inside = mcpi_loop(l_total);
int n_inside = 0;
MPI_Reduce(&l_inside, &n_inside, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if(rank == 0) {
double pi = 4.0 * n_inside / n_total;
printf("Estimated value of pi: %f\n", pi);
}
MPI_Finalize();
double ms = tok();
if(rank == 0)
printf("dt = %f ms\n", ms);
return 0;
}On my laptop, when I run mpirun -np 4 ./mcpi_mpi, it took ~ 25 ms.
This is significantly higher than the OpenMP version because of the
all the startup overhead.
Job Submission to HPC with Slurm¶
On many HPC clusters, job execution is managed by a workload manager. One of the most widely used systems is Slurm (Simple Linux Utility for Resource Management).
Slurm schedules jobs across compute nodes and handles resource allocation, job queues, and execution environments. Instead of running MPI jobs interactively with mpirun, users typically submit jobs using a Slurm script.
The UA HPC website has some documentation of using SLURM.
And here is a sample submission script “submit.sh” that we may use
#!/bin/bash
#SBATCH --job-name=mcpi_mpi
#SBATCH --ntasks=4
#SBATCH --nodes=1
#SBATCH --mem=1gb
#SBATCH --time=00:05:00
#SBATCH --partition=standard
#SBATCH --account=astr501-513
# SLURM Inherits your environment. cd $SLURM_SUBMIT_DIR not needed
pwd; hostname; date
module load openmpi3
/usr/bin/time -o mcpi_mpi.time mpirun -np 4 mcpi_mpiTo submit the job, simply login to ocelote and run
sbatch submit.shTo check your job, run
squeue -u USRENAME