Parallel Computing, HPC, and Slurm

Modern scientific research, from simulating black holes to modeling climate systems, requires computational resources that far exceed what a single processor can provide. Problems involving massive datasets or computationally expensive algorithms (e.g., Monte Carlo simulations, numerical PDE solvers, machine learning training) demand performance beyond sequential execution.

Parallel computing addresses this by breaking a problem into smaller tasks that can be solved simultaneously on multiple processing elements. With the rise of multicore CPUs, distributed systems, GPUs, and specialized accelerators, parallel computing has become central to high-performance computing (HPC). This lab will introduce you to the theory, programming models, and practical execution of parallel codes, with examples in Python, C, and MPI. You will also gain experience running jobs on a modern HPC cluster with a workload manager like Slurm.

Theoretical Foundations¶

Before we dive into implementation, we review key concepts that define the limits and opportunities of parallelism.

Amdahl’s Law (Strong Scaling)¶

If a fraction $f$ of a program is inherently sequential, the maximum speedup $P$ with $P$ processors is:

\begin{align} S(P) = \frac{1}{f + (1-f)/P}. \end{align}

(1)

Note that, as $P \to \infty$ , $S \to 1/f$ .

Implication: Even a small sequential portion limits total speedup.

Example: If 5% of your code is sequential, the maximum speedup is 20x, no matter how many processors you add.
This highlights why HPC algorithms for systems like Frontier (the DOE exascale machine) must minimize sequential bottlenecks.

This corresponds to strong scaling tests, where the problem size is fixed and we ask how performance improves as resources increase.

Gustafson’s Law (Weak Scaling)¶

A more optimistic view: as we increase $P$ , we also increase the problem size to fully utilize resources:

\begin{align} S(P) = f + (1-f)P. \end{align}

(2)

Implication: In scientific computing, we often want higher resolution or larger domains, so performance scales with problem size.

This corresponds to weak scaling tests, which measure how performance changes when the workload grows proportionally with resources.

Flynn’s Taxonomy¶

To better understand computing architectures, Flynn (1972) classified them into four categories:

SISD: Single Instruction, Single Data (traditional CPU execution)
SIMD: Single Instruction, Multiple Data (vector units, GPUs, NumPy and JAX vectorization if hardware supported)
MISD: Multiple Instruction, Single Data (rare, mostly theoretical)
MIMD: Multiple Instruction, Multiple Data (clusters, multicore CPUs, distributed MPI systems)

Programming models map naturally onto these:

OpenMP: shared-memory SIMD/MIMD
CUDA/OpenCL: SIMD execution on GPUs
MPI: distributed-memory MIMD

Additional Resources¶

HPC Carpentry lessons: https://hpc-carpentry.github.io
MPI Tutorial: https://mpitutorial.com/
Slurm quick-start guide: https://slurm.schedmd.com/quickstart.html

Monte Carlo Computation of $\pi$ ¶

We will parallelize a simple algorithm using different techniques. The algorithm is monte carlo computation of $\pi$ . This is an embarrassingly parallel problem. so not much actual algorithm consideration is needed. We mainly use it to get ourselve familiar with different tools.

Python Series Code¶

Here is the algorithm in native python:

import random

def mcpi_loop(n_total):
    n_inside = 0
    for _ in range(n_total):
        x, y = random.random(), random.random()
        if x*x + y*y < 1.0:
            n_inside += 1
    return 4 * n_inside / n_total

pi = mcpi_loop(1000_000)
print(pi)

%timeit mcpi_loop(1000_000)

On my laptop it takes about 80ms to perform 1M samples. The number of significant digits is $\sim \log_{10}\sqrt{N} = 3$ .

Embarrassingly Parallel Computing¶

Since this algorithm is embarrassingly parallelizable, we can simply run it multiple times and compute the mean. Let’s do this as a class exercise using this Google Sheet.

Effectively, we just did a weak scaling test!

Numpy Parallel Code¶

When compiled with BLAS backend, Numpy automatically distribute compute across multiple cores.

import numpy as np

np.__config__.show()

import os
print(os.environ.get('OPENBLAS_NUM_THREADS', 0))
print(os.environ.get('MKL_NUM_THREADS',      0))

def mcpi_numpy(n_total):
    x = np.random.rand(n_total)
    y = np.random.rand(n_total)
    return 4 * np.mean(x*x + y*y < 1.0)

pi = 4 * mcpi_numpy(1000_000) / 1000_000
print(pi)

%timeit mcpi_numpy(1000_000)

os.environ['MKL_NUM_THREADS']      = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
%timeit mcpi_numpy(1000_000)

os.environ['MKL_NUM_THREADS']      = '4'
os.environ['OPENBLAS_NUM_THREADS'] = '4'
%timeit mcpi_numpy(1000_000)

JAX Parallel Code¶

Many operations in JAX, especially linear algebra related, automatically use multiple cores.

from jax import numpy  as jnp
from jax import random as jrd
from jax import jit

def mcpi_jax(n_total):
    key = jrd.key(0)
    key1, key2 = jrd.split(key)
    
    x = jrd.uniform(key1, n_total)
    y = jrd.uniform(key2, n_total)
    return 4 * jnp.mean(x*x + y*y < 1.0)

pi = mcpi_jax(1000_000)
print(pi)

%timeit mcpi_jax(1000_000)

C Series Code¶

Our original python code can be easily translate to C. Please put the following code into a new file mcpi_loop.c:

#include <stdlib.h>

int mcpi_loop(int n_total)
{
	int n_inside = 0;
	for(int i = 0; i < n_total; ++i) {
		double x = (double)rand() / RAND_MAX;
		double y = (double)rand() / RAND_MAX;

		if(x * x + y * y < 1.0)
			++n_inside;
	}

	return (4.0 * n_inside) / n_total;
}

Although the actual C code is pretty simple, unfortunately, there is not really a portable, high-resolution time functionon in bash. Hence, we need to do timing ourselves in the C code:

#include <time.h>

static struct timespec t0;

void tik()
{
	clock_gettime(CLOCK_REALTIME, &t0);
}

double tok()
{
	struct timespec t1, dt;
	clock_gettime(CLOCK_REALTIME, &t1);

	dt.tv_nsec = t1.tv_nsec - t0.tv_nsec;
	dt.tv_sec  = t1.tv_sec  - t0.tv_sec;
	if(dt.tv_nsec < 0) {
		dt.tv_nsec += 1000000000;
		dt.tv_sec--;
	}

	int ms = dt.tv_nsec / 1000000 + dt.tv_sec * 1000;
	int ns = dt.tv_nsec % 1000000;

	return ms + 1e-6 * ns;
}

Then we can put it all together and create the main function:

#include <stdio.h>

int main()
{
	tik();
	double pi = mcpi_loop(1000000);
	double ms = tok();

	printf("pi = %f\n",    pi);
	printf("dt = %f ms\n", ms);

	return 0;
}

We can put all these functions into a single “mcpi_loop.c” file and then compile with:

gcc mcpi_loop.c -O3 -o mcpi_loop
./mcpi_loop

On my laptop, the run takes ~ 36 ms.

Shared Memory: OpenMP C Code¶

OpenMP is a widely used API for writing multithreaded applications in C, C++, and Fortran. It allows you to add simple compiler directives (pragmas) to enable parallel execution on shared-memory systems.

Its key features include:

Easy to parallelize loops using #pragma omp parallel for
Threads share memory, so synchronization and data races must be handled carefully
OpenMP includes mechanisms for reductions, barriers, critical sections, and more

To use OpenMP:

Include the header #include <omp.h>
Compile with gcc -fopenmp

Here is our updated mcpi_omp() function:

#include <omp.h>
#include <stdlib.h>

int mcpi_omp(int n_total)
{
	int n_inside = 0;

	#pragma omp parallel for reduction(+:n_inside)
	for(int i = 0; i < n_total; ++i) {
		double x = (double)rand() / RAND_MAX;
		double y = (double)rand() / RAND_MAX;

		if(x * x + y * y < 1.0)
			++n_inside;
	}

	return n_inside;
}

Copying it with tik(), tok(), and main() to “mcpi_omp.c”, we can now compile our OpenMP version by

gcc mcpi_omp.c -fopenmp -O3 -o mcpi_omp
./mcpi_omp

On my laptop, the run takes ~ 8 ms, matching the JAX implementation.

Distributed Memory: MPI¶

MPI (Message Passing Interface) is the de facto standard for distributed-memory parallel programming. Unlike OpenMP, where threads share memory, MPI processes have separate memory spaces and must explicitly communicate using messages. It is well-suited for large-scale clusters and supercomputers, where each node has its own memory and processors.

Common MPI functions include:

MPI_Init() and MPI_Finalize() for starting and ending MPI programs
MPI_Comm_rank() and MPI_Comm_size() to identify each process and the total number of processes
MPI_Send(), MPI_Recv(), and MPI_Reduce() for data exchange and aggregation

Note: One of the most commonly used implementations of MPI is OpenMPI. Despite the similar name, OpenMPI is unrelated to OpenMP! The name similarity is unfortunately confusing.

OpenMP is for multithreading on shared-memory systems
OpenMPI is a software implementation of the MPI standard for distributed-memory systems

To compile and run an MPI program, use mpicc to compile the program and use mpirun -np N ./program to run it with N processes.

Here is an example implementation:

#include <mpi.h>

int main(int argc, char** argv)
{
	tik();

	MPI_Init(&argc, &argv);

	int rank, size;
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	MPI_Comm_size(MPI_COMM_WORLD, &size);

	int n_total = 1000000;
	int l_total = n_total / size;

	srand(time(NULL)+rank); // ensure different seed per process
	int l_inside = mcpi_loop(l_total);

	int n_inside = 0;
	MPI_Reduce(&l_inside, &n_inside, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

	if(rank == 0) {
		double pi = 4.0 * n_inside / n_total;
		printf("Estimated value of pi: %f\n", pi);
	}

	MPI_Finalize();

	double ms = tok();
	if(rank == 0)
		printf("dt = %f ms\n", ms);

	return 0;
}

On my laptop, when I run mpirun -np 4 ./mcpi_mpi, it took ~ 25 ms. This is significantly higher than the OpenMP version because of the all the startup overhead.

Job Submission to HPC with Slurm¶

On many HPC clusters, job execution is managed by a workload manager. One of the most widely used systems is Slurm (Simple Linux Utility for Resource Management).

Slurm schedules jobs across compute nodes and handles resource allocation, job queues, and execution environments. Instead of running MPI jobs interactively with mpirun, users typically submit jobs using a Slurm script.

The UA HPC website has some documentation of using SLURM.

And here is a sample submission script “submit.sh” that we may use

#!/bin/bash

#SBATCH --job-name=mcpi_mpi
#SBATCH --ntasks=4
#SBATCH --nodes=1
#SBATCH --mem=1gb
#SBATCH --time=00:05:00
#SBATCH --partition=standard
#SBATCH --account=astr501-513
 
# SLURM Inherits your environment. cd $SLURM_SUBMIT_DIR not needed
pwd; hostname; date
 
module load openmpi3
/usr/bin/time -o mcpi_mpi.time mpirun -np 4 mcpi_mpi

To submit the job, simply login to ocelote and run