Shared Memory and OpenMP Basics

First things first

Log into your account:
ssh userid@bwbay.ncsa.illinois.edu, or
ssh userid@bw.ncsa.illinois.edu

Then request an interactive job:
qsub -I -l nodes=1:ppn=32 -l walltime=3:00:00 -l advres=workshopxe
Now we'll move on while this request is processed. Opening a new tab (Command + T on a Mac) will allow us to log in again and start examining and compiling code.

Thread vs. Process

A process is created by the OS to execute a program with a given set of resources (e.g., memory, registers); generally speaking, different processes do not share their memory with one another. A thread is a subset of a process: it shares the resources of its parent process but has its own stack to keep track of its function calls. Multiple threads of a process therefore have access to the same memory (though each can still have local variables).

Shared Memory

Shared-memory parallel computers vary, but they all allow the processors to access all memory as a global address space. So it is not only threads that share memory: these processors operate independently, but share the same memory.

By analogy, recall the painters painting houses example that Dr. Panoff mentioned Monday night, but this time with a twist. Imagine you were one of four artistic painters, all painting on one shared canvas. You each have your own set of brushes, but you share one palette on which to put paints and mix the colors. This shared palette from which you make new colors and the shared canvas on which you paint are analogous to the shared memory from which you will read and to which you will write. If someone mixes colors on the palette and makes a nice new shade of green, you can access that green and use it to paint on the canvas, but if you don't communicate with the other painters, that green may change into something different before you're done using it.

Also recall from Dr. Panoff's example that differences in the location of a resource affect the painting of houses. Shared memory machines can be classified as Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA), based upon memory access times.

Uniform Memory Access

Image of Processors with Uniform Memory Access from https://computing.llnl.gov/tutorials/openMP/

Non-Uniform Memory Access

Image of Processors with Non-Uniform Memory Access from https://computing.llnl.gov/tutorials/openMP/

Getting to know OpenMP

OpenMP is an API built for shared-memory parallelism, usually realized through multithreading. The OpenMP API consists of three distinct components: compiler directives, runtime library routines, and environment variables.
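
As a concrete illustration, here is a minimal "hello world" sketch that touches all three components: a compiler directive (#pragma omp parallel), two runtime library routines (omp_get_thread_num and omp_get_num_threads), and the OMP_NUM_THREADS environment variable. The file name and the gcc -fopenmp compile line below are assumptions for illustration; the compiler wrapper and OpenMP flag on Blue Waters may differ.

    #include <stdio.h>
    #include <omp.h>                          /* declares the runtime library routines */

    int main(void) {
        #pragma omp parallel                  /* compiler directive: fork a team of threads */
        {
            int id = omp_get_thread_num();    /* runtime routine: "Who am I?"       */
            int n  = omp_get_num_threads();   /* runtime routine: "How many of us?" */
            printf("Hello from thread %d of %d\n", id, n);
        }                                     /* the threads join back into one here */
        return 0;
    }

Compile and run it (again, the compiler and flag are assumptions, not a Blue Waters-specific recipe):

gcc -fopenmp hello_omp.c -o hello_omp
export OMP_NUM_THREADS=4 Environment variable: ask for a team of 4 threads
./hello_omp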

Code to copy

Copy code into your workspace:
cp -r ~bplist/2015/openmp .

Code example style (on this page)

example-shell-command
Example line of code (in the file, not the shell)
Comments are also interwoven
  1. parallel compiler directive

    USAGE:
    #pragma omp parallel [clauses]
    {
        Code inside here runs in parallel
        Among the clauses available: declaring private/shared vars
    }
    EXAMPLE:
    #pragma omp parallel private(var1, var2) shared(var3)
    {

    The parallel pragma starts a parallel block. It creates a team of N threads (where N is determined at runtime), all of which execute the next statement or block (a block of code requires a {…} enclosure, and the opening brace must start on the line after the pragma). After the statement or block, the threads join back into one. This is called the fork/join model.

    Source: http://en.wikipedia.org/wiki/Fork%E2%80%93join_model#/media/File:Fork_join.svg
  2. OMP_NUM_THREADS environment variable

    USAGE: OMP_NUM_THREADS=number
    EXAMPLE: export OMP_NUM_THREADS=16 Sets the value in the shell

    This environment variable tells the library how many threads can be used to run the program. If dynamic adjustment of the number of threads is enabled, this number is the maximum number of threads that may be used; otherwise, it is the exact number of threads that will be used.

    The default value is the number of online processors on the machine; on Blue Waters, it seems to be 1, no matter your allocation.

  3. omp_get_thread_num function

    EXAMPLE: uid = omp_get_thread_num()

    This function asks the thread that is executing it to identify itself by returning its unique thread number. [answers "Who am I?"]

    What would this return if called outside of a parallel block?

  4. omp_get_num_threads function

    EXAMPLE: threadCount = omp_get_num_threads()

    This function returns the number of threads in the team currently executing the parallel block from which it is called. [answers "How many of us?"] If this "get" function exists, might there be a corresponding "set" function?

  5. for compiler directive

    USAGE:
    #pragma omp for [clauses]
        for loop

    EXAMPLE:
    #pragma omp parallel shared(a,b,c,chunk) private(i)
    {
    	#pragma omp for
    		for (i=0; i < N; i++){

    This directive has the iterations of the loop that immediately follows it executed in parallel by the team of threads. The iterations (or chunks of them) can be assigned in a 'round-robin' style before any iteration is processed (static scheduling), or they can be handed out first-come, first-served as threads finish their previous chunks (dynamic scheduling).

    Does the code in one iteration of the loop need to be independent of all other iterations? Why/not?

    This assumes a parallel region has already been initiated; otherwise the loop executes serially on a single processor. The loop iteration variable is private in scope throughout the loop's execution. (A short, self-contained sketch using this directive appears after this list.)

    N.B. Be aware of the difference between this for pragma and the combined parallel for pragma. Using the word parallel again inside a region that is already parallel makes each existing thread spawn its own new team, multiplying your parallelism.

  6. The reduction clause

    USAGE: reduction ( operator : list )

    A reduction is a repeated operation that combines multiple values into a single value. The reduction clause performs such an operation (using the specified operator) on each variable in its list: every thread works on its own private copy, and the copies are combined into the original shared variable when the region ends. Operators include: +, *, -, &, ^, |, &&, ||. (See the reduction sketch after this list.)

    EXAMPLE: #pragma omp parallel for private(privateVar) reduction(+:runningTotal)
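
    A minimal sketch of the for directive from item 5 (the array names and size are made up for illustration). Each iteration writes only its own element of c, so the iterations are independent and the team can split them up safely:

        #include <stdio.h>
        #include <omp.h>

        #define N 1000

        int main(void) {
            double a[N], b[N], c[N];
            int i;

            for (i = 0; i < N; i++) {      /* serial setup */
                a[i] = 1.0 * i;
                b[i] = 2.0 * i;
            }

            #pragma omp parallel shared(a, b, c) private(i)
            {
                #pragma omp for            /* split the loop iterations among the team */
                for (i = 0; i < N; i++)
                    c[i] = a[i] + b[i];    /* iteration i touches only c[i] */
            }                              /* threads join; every iteration is done here */

            printf("c[%d] = %f\n", N - 1, c[N - 1]);
            return 0;
        }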
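
    For the reduction clause from item 6, here is a sketch in the combined parallel for form; the running-total pattern is the same one the pi exercise below needs. The variable names and the series being summed are made up for illustration:

        #include <stdio.h>
        #include <omp.h>

        #define N 100000

        int main(void) {
            double runningTotal = 0.0;     /* reduction variable: each thread accumulates into
                                              its own private copy, and the copies are summed
                                              into this shared variable when the loop ends */
            int i;

            #pragma omp parallel for private(i) reduction(+:runningTotal)
            for (i = 1; i <= N; i++)
                runningTotal += 1.0 / i;   /* partial sums of 1/1 + 1/2 + ... + 1/N */

            printf("sum = %f\n", runningTotal);
            return 0;
        }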

Exercise

Go to the code in your workspace:
openmp/pi-serial.c

Team up with someone you haven't worked with before. Modify this code or your own serial pi code to be parallel, using the reduction clause.