BYU

Office of Research Computing

Block-cyclic programming example

This page describes an example of an MPI program that assigns data points out in a cyclic fashion, known as a ''block-cyclic'' distribution. This will not be a good match for all applications. Please read the description below for more information.

Assumptions made by this algorithm

This algorithm, and the example code, makes several assumptions:

  • All the processes involved in the program will be run on nodes with approximately equal processing capacity - there is no dynamic load sharing here
  • All the processes involved in the calculation are running on the same processor architecture and operating system - this implementation will not tolerate differences in data size, orientation, or endianness
  • The data points are arranged in 2-dimensions, and therefore have ''x'' and ''y'' coordinates
  • The ''x'' and ''y'' coordinates are the only input that is needed for the calculations for that data point

Types of problems this will/won't work for

This algorithm and example implementation will work for problems where the calculations are completely independent of each other. For example, calculating the Mandelbrot Set would work well here, since the calculation of each data point depends entirely on the input coordinates, and on nothing else. However, a Five-point Stencil problem, as is often used for thermodynamics calculations, would not be a good fit, since the calculations depend on the previous iteration for nearby data points.

Compiling and Running

To compile the example listed below, simply use your favorite variation of the ''mpicc'' compiler, using syntax similar to the following:

mpicc -o block_cyclic_boilerplate block_cyclic_boilerplate.c

Similarly, to run the program, simply use an MPI job launcher (either ''mpirun'' and ''mpiexec''). Note that this implementation requires that you supply maximum dimension values for the ''x'' and ''y'' coordinates. For example, inside of a submitted job, you could use syntax like this:

mpirun ./block_cyclic_boilerplate --xmax=300 --ymax=150

Note that in this example, the calculations would be done on a 300x150 matrix, and that the ''x'' variable would range from 0 to 299, and the ''y'' variable from 0 to 149.

General Algorithm Description

In this algorithm, the data points are assigned to processes in a cyclic fashion. For ''n'' processes, the first process will get data point ''0'', data point ''n'', data point ''2n'', ''3n'', etc, while the second process will get data point ''1'', ''n+1'', ''2n+1'', and so on.

Each process is given the same number of data points, with the following exception: The first process (process ''0'') handles any extra data points at the end of the calculation, beyond the equal division.

To illustrate this, see the following diagram:

Alt

In this example, we have 18 total data points, and 5 processes. Each process is given 3 data points to calculate, and process 0 also handles the 3 extra at the end.

Example Code

If you're modifying the code for a new application, please note the do_work() and the output_data() functions, each of which is called, once per data point.

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>
#include <getopt.h>
#include <errno.h>

//options populated by '--xmax' and '--ymax' variables
static int xmax;
static int ymax;
//flags specifying that xmax and ymax have been set
int xmax_set = 0;
int ymax_set = 0;

//flag specifying that output shouldn't/should be quiet (only errors)
int quiet_output = 0;

//datatype definition for output of "do_work" function
typedef int datatype;

//function definitions
//General utility functions
void handle_options(int argc, char * argv[]);
int handle_conversion(const char * strtoconv);

//encapsulations of the specifics for doing work, and outputting data
datatype do_work(int x,int y);
void output_data(int x, int y, datatype result);


int main(int argc, char * argv[]) {

    int my_rank,nproc,points_per_process,lx,ly,seq,count,i;
    int *full_result_x, *full_result_y, *result_x, *result_y;
    datatype *full_result;
    int exitcode = 0;

    //deal with command line parameters (eg --xmax, etc.)
    handle_options(argc, argv);

    //initialize MPI
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    //check if parameters are set
    if(my_rank == 0) {
        if(!xmax_set || !ymax_set) {
            fprintf(stderr, "Error: An error has occurred setting one of the limts (xmax or ymax)."
                    "  Please be sure you've set them both via the \"--xmax=\" and"
                    " \"--ymax=\" parameters\n");
            exit(-1);
        } else {
            if(!quiet_output) {
                printf("Beginning run with xmax=%u, and ymax=%u\n", xmax, ymax);
            }
        }
    }

    //every process will calculate the same number of data points; rank 0 will pick up the slack at the end
    points_per_process = xmax*ymax/nproc;

    //allocate local buffers to gather results
    result_x = (int *)malloc(points_per_process * sizeof(int));
    result_y = (int *)malloc(points_per_process * sizeof(int));
    datatype * result = (datatype *)malloc(points_per_process * sizeof(datatype));

    if(my_rank == 0) {
        //prepare space for gathering data back to rank 0
        full_result_x = (int *)malloc(xmax * ymax * sizeof(int));
        full_result_y = (int *)malloc(xmax * ymax * sizeof(int));
        full_result = (datatype *)malloc(xmax * ymax * sizeof(datatype));

        if(full_result_x == NULL || full_result_y == NULL || full_result == NULL) {
            fprintf(stderr, "Error in rank 0: unable to allocate one or more full_result buffers\n");
            exit(-1);
        }
    }


    if(result_x != NULL && result_y != NULL && result != NULL) {

        //all processes do this

        //calculate for just my assigned values
        //initial_conditions
        lx = my_rank%xmax;
        ly = my_rank/xmax;

        for(count = 0; count < points_per_process; count++) {
            result[count] = do_work(lx,ly);
            result_x[count] = lx;
            result_y[count] = ly;

            //increment lx and ly
            lx += nproc;
            if(lx >= xmax) {
                lx %= xmax;
                ly ++;
            }
        }


        //gather the results back
        MPI_Gather( result_x, 
                points_per_process, 
                MPI_INT, 
                full_result_x, 
                points_per_process, 
                MPI_INT, 
                0, 
                MPI_COMM_WORLD);
        MPI_Gather( result_y, 
                points_per_process, 
                MPI_INT, 
                full_result_y, 
                points_per_process, 
                MPI_INT, 
                0, 
                MPI_COMM_WORLD);

        /*Note that this assumes that the datatype is something that can be 
         * transferred in binary between hosts.  A much better way would to 
         * create a derived datatype in MPI, in case the program is used 
         * with processes on different architectures, ith different 
         * endianness, etc.  But it works for now.  Just make sure you're 
         * running on homogeneous hardware and kernels*/
        MPI_Gather( result, 
                points_per_process*sizeof(datatype)/sizeof(int), 
                MPI_INT, 
                full_result, 
                points_per_process, 
                MPI_INT, 
                0, 
                MPI_COMM_WORLD);

        //clean up local result memory
        free(result_x);
        free(result_y);
        free(result);

        if(my_rank == 0) {

            //do the work for all the points between the end of the block cycles, 
            // at the end of the buffer
            //basically pick up the slack in rank 0; should be no more than nproc 
            // extra data points
            for(i = points_per_process*nproc; i < xmax*ymax; i++) {
                lx = i%xmax;
                ly = i/xmax;
                full_result_x[i]=lx;
                full_result_y[i]=ly;
                full_result[i]=do_work(lx,ly);
            }

            //output the results
            //output resequenced results (in sorted order) from the distributed version
            //for example, if nproc=4, then we get the first data point from process 0, then 
            // the first one from process 1, then the first one from process 2, then the first 
            // from process 3, then the second from process 0, etc.
            for(count = 0; count < points_per_process; count++) {
                for(i=0; i<nproc; i++) {
                    output_data(    full_result_x[count+i*points_per_process], 
                            full_result_y[count+i*points_per_process], 
                            full_result[count+i*points_per_process]);
                }
            }

            //output extra results from the end, that rank 0 calculated, to pick up the slack
            for(i = points_per_process*nproc; i < xmax*ymax; i++) {
                output_data(full_result_x[i], full_result_y[i], full_result[i]);
            }

            //free up main arrays where data was gathered
            free(full_result_x);
            free(full_result_y);
            free(full_result);
        }


    } else {
        fprintf(stderr, 
            "Error in rank %u: unable to allocate one or more result buffers\n", 
            my_rank);
        exitcode = -1;
    }
    MPI_Finalize();

    exit(exitcode);

}


void handle_options(int argc, char * argv[]) {

    while(1) {

        static struct option long_options[] = {
            {"xmax", required_argument, 0, 'x'},
            {"ymax", required_argument, 0, 'y'},
            {"help", no_argument, 0, 'h'}
        };

        int option_index = 0;
        int c;

        c = getopt_long(argc, argv, "x:y:q", long_options, &option_index);

        if( c == -1 )
            break;

        switch (c) {

            case 'x':
                //printf("option -x with value %s\n", optarg);
                if(!xmax_set) {
                    xmax = handle_conversion(optarg);
                    xmax_set = 1;
                }
                break;
            case 'y':
                //printf("option -y with value %s\n", optarg);
                if(!ymax_set) {
                    ymax = handle_conversion(optarg);
                    ymax_set = 1;
                }
                break;
            case 'q':
                quiet_output = 1;
            case '?':
                break;
            case 'h':
            default:
                printf("To run this program, you must supply the following parameters\n"
                    "\t-x num or --xmax=num, to have the x variable range from 0 to"
                    " num-1\n\t-y num or --ymax=num, to have the y variable range from"
                    " 0 to num-1\n\nBOTH VARIABLES ARE REQUIRED\n"
                    "You can also suppress all non-error output by appending the \"-q\""
                    " option.");
                exit(-1);
        }

    }
}

int handle_conversion(const char * strtoconv) {
    errno = 0;
    char * endptr;
    long val = strtol(strtoconv, &endptr, 10);

    if(errno == ERANGE) {
        fprintf(stderr, 
            "Error: unable to convert string \"%s\" to range; data out of range\n", 
            strtoconv);
        exit(-1);
    } else if (endptr == strtoconv) {
        //no convertable strings
        fprintf(stderr, 
            "Error: unable to convert string \"%s\" to integer\n", strtoconv);
        exit(-1);
    } else {
        return (int)val;
    }
}

datatype do_work(int x, int y) {
    return x*y;
}

void output_data(int x, int y, datatype result) {
    if(!quiet_output) {
        printf("data point %u - (x, y, result) = (%u, %u, %u)\n", y*xmax+x, x, y, result);
    }
}