Practical Use of the BLAS
The basic rules so far are:
- Whenever possible, use a BLAS operation and call the optimized built-in
library.
- Whenever possible, convert the algorithm to use a higher level BLAS
function.
- Within a single BLAS level, prefer the kernel with the lowest ratio of
memory references to flops.
Four versions of matrix-matrix multiply have been shown,
five including directly calling the BLAS version dgemm().
Now consider how to find and use the BLAS library.
BLAS Nomenclature
How did I know to look for something as weirdly named
as "dgemm()"? The BLAS have a (mostly) systematic naming convention.
- For almost all BLAS, the first letter of the routine is one from the set
{S,D,C,Z}. Those indicate single precision real, double precision real, single
precision complex, and double precision complex, respectively.
Exception: i*amax(n,x,incx), where * is one of S,D,C,Z.
Since P573 uses double precision everywhere unless otherwise stated, all of the
functions called will have the first letter of "D", except for idamax(). The
numerical computing community often will use the term "saxpy" for a vector
update, even when the data is double precision. So beware of that minor
inconsistency; it's the numerical linear algebra equivalent of ending a sentence
with a preposition: not exactly correct, but everyone knows what is meant.
- BLAS2/BLAS3 have second and third letters giving the type of matrix
involved:
- GE = general matrix
- GB = general banded matrix
- SY = symmetric matrix
- SP = symmetric "packed" matrix
- HE = Hermitian matrix
- TR = triangular matrix
- ...
- BLAS2/BLAS3 have fourth and fifth letters giving the type of operation.
- MV = matrix-vector product.
- SV = solve (for triangular matrices only).
- MM = matrix-matrix product.
- R* = low rank update (R for rank-1, R2 for rank-2).
- Vector operations have an "increment" associated
with each vector. For example, the
daxpy call is daxpy(n, alpha, x, incx, y, incy).
The increment is the stride in memory between consecutive elements in the
vector. For example, if the vector is a column in a 2D array stored in
row-major order, the stride will be the declared trailing dimension of
the array.
How do you find out what BLAS routines exist? Use the man pages, or just
plain old Google. You can also use the
BLAS Quick Reference Card (a Postscript file),
although learning to read the card requires some practice.
Part of the naming obscurity comes from a limitation of Fortran
compilers when the BLAS were first proposed. An identifier in Fortran 77 could
have at most 6 characters (some compilers allowed 8), so the only safely
portable way to proceed was to limit
names to 6 characters. The latest standards have increased that limit up to
64 characters, which makes a major difference in readability. A function
named matrix_matrix_multiply() is readable and easy to figure out;
the name dgemm() is not.
Slightly related: Fortran has a built-in function for matrix-matrix multiply,
called matmul(), which handles both matrix*vector and matrix*matrix
multiplication. The Fortran dotproduct function is called dot_product().
These are "intrinsic" functions, meaning they are part of the language
standard, and are available for use without having to link in any libraries.
A good BLAS library, however, will always be at least as fast and in some
cases much faster than the language intrinsic version.
BLAS Calling Conventions
When using the BLAS, beware that
- The BLAS have to be callable from both Fortran and C/C++. At the time they were
developed (the 1970s), Fortran passed all arguments by address.[1]
That means all arguments
passed in are addresses. So if the length of the vector is n, from C you
need to pass the argument &n. Of course, from a Fortran program it is called
with argument n; the default in Fortran is to pass the address, not a copy of the
actual datum. C-interface versions of the BLAS can make life easier when
writing in C/C++. The problem is that the interface is nonstandard, so code
using it may not port to another platform. Furthermore, your library routine may be
used by a code that itself calls libraries that use the standard BLAS, which can
cause trouble for linkers.
tl;dr version: Always use the standard interface, not a wrapper.
A similar warning applies to modern Fortran: inquiry functions let you avoid
passing in the declared dimension and the number of rows and columns of a 2D
array, and there are F95 wrappers for the BLAS. Do not use those; they are not
standard.
- 2D arrays are presumed to be in column-major order; that is, in the
computer's memory, consecutive locations correspond to going down one column after
another. Again, see the note about the C-interface versions: some assume a
row-by-row layout, others the usual column-major order, which makes using a
C/C++ interface non-portable.
- The leading dimension of each 2D array needs to be passed in, usually
in an integer variable lda. Mapping matrix coordinates (i,j) into
singly-indexed computer memory coordinates requires the array's declared
leading dimension, as well as the sizes of the matrix stored in the 2D array.
- When calling Fortran-based BLAS from a C routine, the invoked BLAS function
must be declared as "extern". For example,
extern void dgemv_(char *trans, int *m, int *n,
double *alpha, double *a, int *lda, double *x, int *incx,
double *beta, double *y, int *incy );
- When calling from a C++ function, you will need to
subvert the name-mangling that C++ does. This can be done
using a header file that says to use the C-calling
convention for the loader:
extern "C" {
extern void dgemv_(char *trans, int *m, int *n,
double *alpha, double *a, int *lda, double *x, int *incx,
double *beta, double *y, int *incy ); };
Inside the braces, put all the BLAS (and C and Fortran) functions you will be
calling. Usually a header file is provided along with the BLAS that does this
for you, and you just have to #include it.
- Linking in the BLAS library requires
using something like -lblas on the link line of compilation. For Solaris
systems, use -xlic_lib=sunperf. A recent version of the Intel BLAS library
installed in default locations can be linked in using
-L/opt/intel/composer_xe_2013.3.163/mkl/lib/intel64 -lmkl_rt -lpthread -lm
so beware: the actual location and library name need to be tracked down for
each system. Furthermore, systems used for scientific computing may have multiple
implementations of the BLAS installed. Avoid the ones created for the original
compatibility libraries on netlib.org; those were only intended as stopgaps
for vendors to base more efficient ones upon.
- Since the BLAS are built using -dalign, you need to
do the same for your functions/routines in any language. Most compilers will
align data on double-word boundaries nowadays (2015 CE) in any case.[2]
- When using C/C++ as the main program and calling the standard BLAS (i.e., the
Fortran-convention ones), some Fortran support libraries may also need to be
linked in. Look for a library like -lftn or something similar.
- When calling the standard BLAS from C/C++, most machines require appending an
underscore to the name of the BLAS routine. So a call is to daxpy_() instead of
daxpy(). IBM machines do not follow this "underscore convention", which is irksome
for mixed-language codes. When the underscore convention is followed, calling the
BLAS matrix-vector product function looks like
double alpha = 1.0, beta = 0.0;
int incx = 1, incy = 1;
int n = 100;       // order of the matrix
int tda = 100;     // trailing dimension of A as declared in C
double A[10000];   // for a 100x100 matrix. or a 10x1000 matrix. or
                   // any matrix with fewer than 10001 entries
double x[100], y[100];
char trans = 'N';
...
dgemv_(&trans,&n,&n,&alpha,A,&tda,x,&incx,&beta,y,&incy);
Again, note that the values are passed in as addresses when calling
the standard BLAS from C.
The matrix-matrix multiply results
obtained by calling dgemm came from a C++ code
calling the standard
interfaces to the BLAS, so the exact same code works on all platforms
(Sun, IBM, SGI, Intel) without changing any of the source.
Not a big deal ... until you have to modify all 143 of the BLAS calls in a 128k line
fluid dynamics code.
[1] Fortran also could (and still can) pass data using "common blocks", a software
engineering nightmare because they do not appear in any argument list. Since 1989
Fortran can also pass arguments by value, but that was long after the BLAS were
cast in the software equivalent of concrete.
[2] Common blocks in Fortran allow a user to align doubles on single word
boundaries, and the compiler and run time system must honor that foolishness.
Another reason common blocks are regarded as evil by computer scientists.
Next page: A brief linear algebra review, preparatory
to deriving a fast linear system solver using the BLAS.
- Started: 17 Aug 2011
- Modified: Wed 24 Sep 2014, 10:11 AM for BLAS linkage and use notes.
- Last Modified: Mon 12 Sep 2016, 07:22 AM