Practical Use of the BLAS
The basic rules so far are:
- Whenever possible, use a BLAS operation and call the optimized built-in
library.
- Whenever possible, convert the algorithm to use a higher level BLAS
function.
- Within a single BLAS level, prefer the kernel with the lowest ratio of
memory references to flops.
Four versions of matrix-matrix multiply have been shown,
five including directly calling the BLAS version dgemm().
Now consider how to find and use the BLAS library.
BLAS Nomenclature
How did I know to look for something as weirdly named
as "dgemm()"? The BLAS have a (mostly) systematic naming convention.
- For almost all BLAS, the first letter of the routine is one from the set
{S,D,C,Z}. Those indicate single precision real, double precision real, single
precision complex, and double precision complex, respectively.
Exception: i*amax(n,x,incx), where * is one of S,D,C,Z.
Since P573 uses double precision everywhere unless otherwise stated, all of the
functions called will have the first letter of "D", except for idamax(). The
numerical computing community often will use the term "saxpy" for a vector
update, even when the data is double precision. So beware of that minor
inconsistency; it's the numerical linear algebra equivalent of ending a sentence
with a preposition: not exactly correct, but everyone knows what is meant.
- BLAS2/BLAS3 have second and third letters giving the type of matrix
involved:
- GE = general matrix
- GB = general banded matrix
- SY = symmetric matrix
- SP = symmetric "packed" matrix
- HE = Hermitian matrix
- TR = triangular matrix
- ...
- BLAS2/BLAS3 have fourth and fifth letters giving the type of operation.
- MV = matrix-vector product.
- SV = solve (for triangular matrices only).
- MM = matrix-matrix product.
- R* = low rank update (R for rank-1, R2 for rank-2).
- Vector operations have an "increment" associated
with each vector. For example, the
daxpy call is daxpy(n, alpha, x, incx, y, incy).
The increment is the stride in memory between consecutive elements in the
vector. For example, if the vector is a column in a 2D array stored in
row-major order, the stride will be the declared trailing dimension of
the array.
How do you find out what BLAS routines exist? Use the man pages, or just
plain old Google. You can also use the
BLAS Quick Reference Card (a Postscript file),
although learning to read the card requires some practice.
Part of the naming obscurity comes from a limitation of Fortran
compilers when the BLAS were first proposed. An identifier in Fortran 77 could
have at most 6 characters (some compilers allowed 8), so the only safely
portable way to proceed was to limit
names to 6 characters. The latest standards have increased that limit up to
64 characters, which makes a major difference in readability. A function
named matrix_matrix_multiply() is readable and easy to figure out;
the name dgemm() is not.
Slightly related: Fortran has a built-in function for matrix-matrix multiply,
called matmul(), which handles both matrix*vector and matrix*matrix
multiplication. The Fortran dotproduct function is called dot_product().
These are "intrinsic" functions, meaning they are part of the language
standard, and are available for use without having to link in any libraries.
A good BLAS library, however, will always be at least as fast and in some
cases much faster than the language intrinsic version.
BLAS Calling Conventions
When using the BLAS, beware that
- The BLAS have to be callable from both Fortran and C/C++. At the time they were
developed (the 1970s), Fortran passed all arguments by address.[1]
That means all arguments
passed in are addresses. So if the length of the vector is n, from C you
need to pass the argument &n. Of course, from a Fortran program it is called
with argument n; the default in Fortran is to pass the address, not a copy of the
actual datum. C-interface versions of the BLAS can make life easier when
writing in C/C++. The problem is that the interface is nonstandard, so code
using it may not port to another platform. Furthermore, your library routine may be
used by a code that itself calls libraries that use the standard BLAS, which can
cause trouble for linkers.
tl;dr version: Always use the standard interface, not a wrapper.
A similar warning applies to modern Fortran: inquiry functions let you avoid
passing in the declared dimension and the number of rows and columns of a 2D
array, and there are F95 wrappers for the BLAS. Do not use those; they are not
standard.
- 2D arrays are presumed to be in column-major order; that is, in the
computer's memory, consecutive locations correspond to going down one column after
another. Again, see the note about the C-interface versions: some assume a
row-by-row layout, others the usual column-major order, which makes using a
C/C++ interface non-portable.
- The leading dimension of each 2D array needs to be passed in, usually
in an integer variable lda. Mapping matrix coordinates (i,j) into
singly-indexed computer memory coordinates requires the array's declared
leading dimension, as well as the sizes of the matrix stored in the 2D array.
- When calling Fortran-based BLAS from a C routine, the invoked BLAS function
must be declared as "extern". For example,
extern void dgemv_(char *trans, int *m, int *n,
double *alpha, double *a, int *lda, double *x, int *incx,
double *beta, double *y, int *incy );
- When calling from a C++ function, you will need to
subvert the name-mangling that C++ does. This can be done
using a header file that says to use the C-calling
convention for the loader:
extern "C" {
extern void dgemv_(char *trans, int *m, int *n,
double *alpha, double *a, int *lda, double *x, int *incx,
double *beta, double *y, int *incy ); };
Inside the braces, put all the BLAS (and C and Fortran) functions you will be
calling. Usually a header file is provided along with the BLAS that does this
for you, and you just have to #include it.
- Linking in the BLAS library requires
using something like -lblas on the link line of compilation. For Solaris
systems, use -xlic_lib=sunperf. A recent version of the Intel BLAS library
installed in default locations can be linked in using
-L/opt/intel/composer_xe_2013.3.163/mkl/lib/intel64 -lmkl_rt -lpthread -lm
so beware: the actual location and library name need to be tracked down for
each system. Furthermore, systems used for scientific computing may have multiple
implementations of the BLAS installed. Avoid the ones created for the original
compatibility libraries on netlib.org; those were only intended as stopgaps
for vendors to base more efficient ones upon.
- Since the BLAS are built using -dalign, you need to
do the same for your functions/routines in any language. Most compilers will
align data on double-word boundaries nowadays (2015 CE) in any case.[2]
- When using C/C++ as the main program and calling the standard BLAS (i.e., the
Fortran-convention ones), some Fortran support libraries may also need to be
linked in. Look for a library like -lftn or something similar.
- When calling the standard BLAS from C/C++, most machines require appending an
underscore to the name of the BLAS routine. So a call is to daxpy_() instead of
daxpy(). IBM machines do not follow this "underscore convention", which is irksome
for mixed-language codes. When the underscore convention is followed, calling the
BLAS matrix-vector product function looks like
double alpha = 1.0, beta = 0.0;
int incx = 1, incy = 1;
int n = 100;       // order of the matrix
int tda = 100;     // trailing dimension of A as declared in C
double A[10000];   // for a 100x100 matrix. or a 10x1000 matrix. or
                   // any matrix with fewer than 10001 entries
double x[100], y[100];
char trans = 'N';
...
dgemv_(&trans,&n,&n,&alpha,A,&tda,x,&incx,&beta,y,&incy);
Again, note that the values are passed in as addresses when calling
the standard BLAS from C.
The matrix-matrix multiply results
obtained by calling dgemm came from a C++ code
calling the standard
interfaces to the BLAS, so the exact same code works on all platforms
(Sun, IBM, SGI, Intel) without changing any of the source.
Not a big deal ... until you have to modify all 143 of the BLAS calls in a 128k line
fluid dynamics code.
[1] Fortran also could (and still can) pass data using "common blocks", a software
engineering nightmare because they do not appear in any argument list. Since 1989
Fortran can also pass arguments by value, but that was long after the BLAS were
cast in the software equivalent of concrete.
[2] Common blocks in Fortran allow a user to align doubles on single word
boundaries, and the compiler and run time system must honor that foolishness.
Another reason common blocks are regarded as evil by computer scientists.
Next page: A brief linear algebra review, preparatory
to deriving a fast linear system solver using the BLAS.
- Started: 17 Aug 2011
- Modified: Wed 24 Sep 2014, 10:11 AM for BLAS linkage and use notes.
- Last Modified: Mon 12 Sep 2016, 07:22 AM