Log into your account:
ssh userid@bwbay.ncsa.illinois.edu
or
ssh userid@bw.ncsa.illinois.edu
Then request an interactive job:
qsub -I -l nodes=1:ppn=32 -l walltime=3:00:00 -l advres=workshopxe
Now we'll move on while this request is processed. Opening a new tab (Command + T on a Mac) will allow us to log in again and start examining and compiling code.
A process is created by the OS to execute a program with given resources (e.g., memory, registers); generally speaking, different processes do not share their memory with one another. A thread is a subset of a process; it shares the resources of its parent process but has its own stack to keep track of function calls. Multiple threads of a process have access to the same memory (but can have their own local variables).
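To make this concrete (a minimal sketch in C that previews the OpenMP directive introduced below; all variable names here are illustrative), the lines below show one variable that lives in the process's memory and is visible to every thread, and one variable that lives on each thread's own stack:
#include <stdio.h>
#include <omp.h>
int shared_counter = 0;            /* in the process's memory: all threads see the same variable */
int main(void) {
  #pragma omp parallel
  {
    int my_local = 0;              /* on this thread's own stack: each thread has its own copy */
    my_local = my_local + 1;
    #pragma omp atomic             /* updates to shared data must be coordinated */
    shared_counter = shared_counter + 1;
  }
  printf("shared_counter = %d\n", shared_counter);
  return 0;
}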
Shared memory parallel computers vary, but they all allow the processors to access all memory as a global address space. So it is not only threads that share memory: these processors operate independently, but share the same memory.
By analogy, recall the painters painting houses example that Dr. Panoff mentioned Monday night, but this will have a twist. Imagine you were one of four artistic painters, all painting on one shared canvas. You each have your own set of brushes, but you share one palette on which to put paints and mix the colors. This shared palette from which you make new colors and the shared canvas on which you paint are analogous to the shared memory from which you will read and to which you will write. If someone mixes colors on the palette and makes a nice new shade of green, you can access that green and use it to paint on the canvas, but if you don't communicate with the other painters, that green may change into something different before you're done using it.
Also recall from Dr. Panoff's example that differences in the location of a resource affect the painting of houses. Shared memory machines can be classified as Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA), based upon memory access times.
OpenMP is an API built for shared-memory parallelism. This is usually realized by multi-threading. The OpenMP API consists of three distinct components: compiler directives, runtime library routines, and environment variables. Programs that use it include a header file (omp.h in C/C++).
Copy code into your workspace:
cp -r ~bplist/2015/openmp .
In what follows, commands and code are shown like these:
example-shell-command
Example line of code (in the file, not the shell)
Comments are also interwoven
parallel
compiler directive
#pragma omp parallel [options]
{
  Code inside here runs in parallel
  Among the options available: declaring private/shared variables
}
#pragma omp parallel private(var1, var2) shared(var3)
The parallel pragma starts a parallel block. It creates a team of N threads (where N is determined at runtime), all of which execute the next statement or block (a block of code requires a {…} enclosure). After the statement, the threads join back into one. This is called a fork/join model.
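A minimal sketch of the fork/join model (compile with OpenMP support enabled, e.g. -fopenmp for the GNU compilers; the file name is up to you):
#include <stdio.h>
#include <omp.h>
int main(void) {
  printf("Before the parallel block: one thread\n");
  #pragma omp parallel
  {
    printf("Hello from one of the threads in the team\n");   /* executed once by every thread */
  }
  printf("After the parallel block: the threads have joined back into one\n");
  return 0;
}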
OMP_NUM_THREADS
environment variable
OMP_NUM_THREADS=number
export OMP_NUM_THREADS=16
Sets the value in the shell
This environment variable tells the library how many threads can be used when running the program. If dynamic adjustment of the number of threads is enabled, this number is the maximum number of threads that can be used; otherwise, it is the exact number of threads that will be used.
The default value is the number of online processors on the machine; on Blue Waters, it seems to be 1, no matter your allocation.
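One way to check what the library will actually use (a sketch; omp_get_max_threads is a standard runtime routine that reports how many threads the next parallel region may use):
#include <stdio.h>
#include <omp.h>
int main(void) {
  /* Reflects OMP_NUM_THREADS if it is set, otherwise the implementation's default */
  printf("Up to %d threads will be used\n", omp_get_max_threads());
  return 0;
}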
omp_get_thread_num
function
uid = omp_get_thread_num()
This function asks the thread that is executing it to identify itself by returning its unique number. [answers "Who am I?"]
omp_get_num_threads
function
threadCount = omp_get_num_threads()
This function returns the number of threads in the team currently executing the parallel block from which it is called. [answers "How many of us?"] If this "get" function exists, might there be a corresponding "set" function?
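Putting the two routines together, a minimal sketch that has every thread introduce itself:
#include <stdio.h>
#include <omp.h>
int main(void) {
  #pragma omp parallel
  {
    int uid = omp_get_thread_num();            /* "Who am I?" */
    int threadCount = omp_get_num_threads();   /* "How many of us?" */
    printf("I am thread %d of %d\n", uid, threadCount);
  }
  return 0;
}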
for
compiler directive
#pragma omp for [clause]
  for loop
#pragma omp parallel shared(a,b,c,chunk) private(i)
{
  #pragma omp for
  for (i = 0; i < N; i++) {
    c[i] = a[i] + b[i];  /* illustrative loop body: each iteration handled by one thread */
  }
}
This directive has the iterations of the upcoming loop executed in parallel by the team of threads. The iterations (or chunks of them) can be assigned in a 'round-robin' style before any iteration is processed, or they can be assigned first-come, first-served.
This assumes a parallel region has already been initiated, otherwise it executes in serial on a single processor. The loop iteration variable is private in scope throughout the loop execution.
N.B. Be aware of the use of the word parallel: the omp for directive above uses the threads of an existing parallel region, while the combined omp parallel for directive creates a new team of threads itself. Multiple uses of the term parallel (for example, a parallel for nested inside a region that is already parallel) can lead to a multiplication of your parallelism.
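For comparison, here is a sketch of the combined form (array names are illustrative, as above): it forks the team and divides the loop iterations in one directive, so it should not itself be placed inside a region that is already parallel.
#pragma omp parallel for shared(a,b,c) private(i)
for (i = 0; i < N; i++) {
  c[i] = a[i] + b[i];
}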
reduction
clause
reduction(operator : list)
A reduction is a repeated operation that operates over multiple values and yields one value. The reduction clause performs an operation (using the specified operator) on all of the variables in its list. A reduction variable is a shared variable in the enclosing region; each thread works on its own private copy, and the copies are combined when the threads join. Operators include: +, *, -, &, ^, |, &&, ||.
#pragma omp parallel for private(privateVar) reduction(+:runningTotal)
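As a small worked sketch (separate from the pi exercise below; the array x and its length N are assumed to be declared already): each thread accumulates into its own copy of total, and the copies are added together when the threads join.
double total = 0.0;
int i;
#pragma omp parallel for private(i) reduction(+:total)
for (i = 0; i < N; i++) {
  total += x[i];   /* each thread adds into its own private copy of total */
}
/* here total holds the combined sum from all threads */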
Go to the code in your workspace:
openmp/pi-serial.c
Team up with someone you haven't worked with before. Modify this code or your own serial pi code to be parallel, using the reduction clause.