Shared Memory and OpenMP Basics

First things first

Log into your account:
ssh userid@bwbay.ncsa.illinois.edu, or
ssh userid@bw.ncsa.illinois.edu

Then request an interactive job:
qsub -I -l nodes=1:ppn=32 -l walltime=3:00:00 -l advres=workshopxe
Now we'll move on while this request is processed. Opening a new tab (Command + T on a Mac) will allow us to log in again and start examining and compiling code.

Thread vs. Process

A process is created by the OS to execute a program with a given set of resources (e.g., memory, registers); generally speaking, different processes do not share their memory with one another. A thread is a subset of a process: it shares the resources of its parent process but has its own stack to keep track of its function calls. Multiple threads of a process therefore have access to the same memory (though each can still have local variables).

Shared Memory

Shared-memory parallel computers vary, but they all allow the processors to access all memory as a global address space. So it is not only threads that share memory: these processors operate independently, but share the same memory.

By analogy, recall the painters painting houses example that Dr. Panoff mentioned Monday night, but this time with a twist. Imagine you were one of four artistic painters, all painting on one shared canvas. You each have your own set of brushes, but you share one palette on which to put paints and mix the colors. This shared palette from which you make new colors and the shared canvas on which you paint are analogous to the shared memory from which you will read and to which you will write. If someone mixes colors on the palette and makes a nice new shade of green, you can access that green and use it to paint on the canvas, but if you don't communicate with the other painters, that green may change into something different before you're done using it.

Also recall from Dr. Panoff's example that differences in the location of a resource affect the painting of houses. Shared memory machines can be classified as Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA), based upon memory access times.

Uniform Memory Access

Image of Processors with Uniform Memory Access from https://computing.llnl.gov/tutorials/openMP/

Non-Uniform Memory Access

Image of Processors with Non-Uniform Memory Access from https://computing.llnl.gov/tutorials/openMP/

Getting to know OpenMP

OpenMP is an API built for shared-memory parallelism, usually realized through multithreading. The OpenMP API consists of three distinct components: compiler directives, runtime library routines, and environment variables.
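
As a concrete illustration, here is a minimal "hello world" sketch that touches all three components: a compiler directive (#pragma omp parallel), two runtime library routines (omp_get_thread_num and omp_get_num_threads), and the OMP_NUM_THREADS environment variable. The file name and the gcc -fopenmp compile line below are assumptions for illustration; the compiler wrapper and OpenMP flag on Blue Waters may differ.

    #include <stdio.h>
    #include <omp.h>                          /* declares the runtime library routines */

    int main(void) {
        #pragma omp parallel                  /* compiler directive: fork a team of threads */
        {
            int id = omp_get_thread_num();    /* runtime routine: "Who am I?"       */
            int n  = omp_get_num_threads();   /* runtime routine: "How many of us?" */
            printf("Hello from thread %d of %d\n", id, n);
        }                                     /* the threads join back into one here */
        return 0;
    }

Compile and run it (again, the compiler and flag are assumptions, not a Blue Waters-specific recipe):

gcc -fopenmp hello_omp.c -o hello_omp
export OMP_NUM_THREADS=4 Environment variable: ask for a team of 4 threads
./hello_omp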

Code to copy

Copy code into your workspace:
cp -r ~bplist/2015/openmp .

Code example style (on this page)

example-shell-command
Example line of code (in the file, not the shell)
Comments are also interwoven
  1. parallel compiler directive

    USAGE:
    #pragma omp parallel [clauses]
    {
        Code inside here runs in parallel
        Among the clauses available: declaring private/shared vars
    }
    EXAMPLE:
    #pragma omp parallel private(var1, var2) shared(var3)
    {

    The parallel pragma starts a parallel block. It creates a team of N threads (where N is determined at runtime), all of which execute the next statement or block (a block of code requires a {…} enclosure, and the opening brace must start on the line after the pragma). After the statement or block, the threads join back into one. This is called the fork/join model.

    Source: http://en.wikipedia.org/wiki/Fork%E2%80%93join_model#/media/File:Fork_join.svg
  2. OMP_NUM_THREADS environment variable

    USAGE: OMP_NUM_THREADS=number
    EXAMPLE: export OMP_NUM_THREADS=16 Sets the value in the shell

    This environment variable tells the library how many threads can be used to run the program. If dynamic adjustment of the number of threads is enabled, this number is the maximum number of threads that may be used; otherwise, it is the exact number of threads that will be used.

    The default value is the number of online processors on the machine; on Blue Waters, it seems to be 1, no matter your allocation.

  3. omp_get_thread_num function

    EXAMPLE: uid = omp_get_thread_num()

    This function asks the thread that is executing it to identify itself by returning its unique thread number. [answers "Who am I?"]

    What would this return if called outside of a parallel block?

  4. omp_get_num_threads function

    EXAMPLE: threadCount = omp_get_num_threads()

    This function returns the number of threads in the team currently executing the parallel block from which it is called. [answers "How many of us?"] If this "get" function exists, might there be a corresponding "set" function?

  5. for compiler directive

    USAGE:
    #pragma omp for [clauses]
        for loop

    EXAMPLE:
    #pragma omp parallel shared(a,b,c,chunk) private(i)
    {
    	#pragma omp for
    		for (i=0; i < N; i++){

    This directive has the iterations of the loop that immediately follows it executed in parallel by the team of threads. The iterations (or chunks of them) can be assigned in a 'round-robin' style before any iteration is processed (static scheduling), or they can be handed out first-come, first-served as threads finish their previous chunks (dynamic scheduling).

    Does the code in one iteration of the loop need to be independent of all other iterations? Why/not?

    This assumes a parallel region has already been initiated; otherwise the loop executes serially on a single processor. The loop iteration variable is private in scope throughout the loop's execution. (A short, self-contained sketch using this directive appears after this list.)

    N.B. Be aware of the difference between this for pragma and the combined parallel for pragma. Using the word parallel again inside a region that is already parallel makes each existing thread spawn its own new team, multiplying your parallelism.

  6. The reduction clause

    USAGE: reduction ( operator : list )

    A reduction is a repeated operation that combines multiple values into a single value. The reduction clause performs such an operation (using the specified operator) on each variable in its list: every thread works on its own private copy, and the copies are combined into the original shared variable when the region ends. Operators include: +, *, -, &, ^, |, &&, ||. (See the reduction sketch after this list.)

    EXAMPLE: #pragma omp parallel for private(privateVar) reduction(+:runningTotal)
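
    A minimal sketch of the for directive from item 5 (the array names and size are made up for illustration). Each iteration writes only its own element of c, so the iterations are independent and the team can split them up safely:

        #include <stdio.h>
        #include <omp.h>

        #define N 1000

        int main(void) {
            double a[N], b[N], c[N];
            int i;

            for (i = 0; i < N; i++) {      /* serial setup */
                a[i] = 1.0 * i;
                b[i] = 2.0 * i;
            }

            #pragma omp parallel shared(a, b, c) private(i)
            {
                #pragma omp for            /* split the loop iterations among the team */
                for (i = 0; i < N; i++)
                    c[i] = a[i] + b[i];    /* iteration i touches only c[i] */
            }                              /* threads join; every iteration is done here */

            printf("c[%d] = %f\n", N - 1, c[N - 1]);
            return 0;
        }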
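
    For the reduction clause from item 6, here is a sketch in the combined parallel for form; the running-total pattern is the same one the pi exercise below needs. The variable names and the series being summed are made up for illustration:

        #include <stdio.h>
        #include <omp.h>

        #define N 100000

        int main(void) {
            double runningTotal = 0.0;     /* reduction variable: each thread accumulates into
                                              its own private copy, and the copies are summed
                                              into this shared variable when the loop ends */
            int i;

            #pragma omp parallel for private(i) reduction(+:runningTotal)
            for (i = 1; i <= N; i++)
                runningTotal += 1.0 / i;   /* partial sums of 1/1 + 1/2 + ... + 1/N */

            printf("sum = %f\n", runningTotal);
            return 0;
        }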

Exercise

Go to the code in your workspace:
openmp/pi-serial.c

Team up with someone you haven't worked with before. Modify this code or your own serial pi code to be parallel, using the reduction clause.