
Intermittent NaNs appearing in results from calling shared-library blis from Julia #12

Closed
tkelman opened this issue Jul 3, 2014 · 32 comments

Comments

@tkelman
Contributor

tkelman commented Jul 3, 2014

CPU: Core2 Duo E8400 (old machine)
OS: Ubuntu 14.04, x86-64

Compiled BLIS reference configuration, setting BLIS_ENABLE_DYNAMIC_BUILD := yes. By itself, BLIS passes its own make test.

I'm calling into BLIS from Julia by the following steps:

git clone https://github.com/julialang/julia
cd julia
mkdir -p $PWD/usr/lib
cp /path/to/libblis.so $PWD/usr/lib
echo 'override USE_SYSTEM_BLAS = 1' > Make.user
echo 'override USE_BLAS64 = 0' >> Make.user
echo 'override LIBBLAS = -L$(build_libdir) -lblis -lm' >> Make.user
echo 'override LIBBLASNAME = libblis' >> Make.user
make -j8   # this will take quite a while - Julia has lots of big dependencies
make testall

This gives me a different failure each time I repeat make testall. Here are some examples, from the first couple of files in Julia's test suite (linalg1 and linalg2):
https://gist.github.com/47dbd5517c4a6f56fb2e
https://gist.github.com/51835795c8ada7c0f2a1
https://gist.github.com/552ec09e5e78d1cd3da7
https://gist.github.com/7cce0057bf9a3009a92a
https://gist.github.com/0f01364072634cc95be9
https://gist.github.com/9292fc3fe1a2e9afe09d

I'll see if I can translate a few of these test cases into C to figure out whether the problem is reproducible outside of Julia. I'll also try setting the BLIS integer size to 64-bit and deleting the USE_BLAS64 = 0 line, to see whether that changes anything.

@fgvanzee
Member

fgvanzee commented Jul 3, 2014

Tyler took a look at your output and had an idea: is Julia calling BLIS via multiple threads? (The reference configuration is not thread-safe.)

@tkelman
Contributor Author

tkelman commented Jul 3, 2014

Aha. Well, not exactly multiple threads, but the master julia process launches separate worker julia processes to run the tests in parallel. The subprocesses should have separate memory spaces in which they load dynamic libraries, I think. Even when I run the tests in a single julia process, though, I still get similar-looking non-reproducible results and intermittent NaNs.

Is there a list of which configurations are or are not thread-safe anywhere? Or should I assume none of them are at this point?

@fgvanzee
Member

fgvanzee commented Jul 3, 2014

Is there a list of which configurations are or are not thread-safe anywhere? Or should
I assume none of them are at this point?

Sorry, I should have been more descriptive. The reference configuration isn't thread-safe because it is single-threaded and does not enable OpenMP. (OpenMP is currently the only threading model we support, but POSIX threads could be supported in the future.) In order to achieve thread safety, a critical section in the BLIS memory allocator must be enabled. Currently, it's only enabled when OpenMP is enabled, via BLIS_ENABLE_OPENMP.
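
For illustration, the kind of guard involved looks roughly like this. This is a minimal sketch of a buffer-pool allocator protected by an OpenMP critical section, with invented names; it is not BLIS's actual allocator code.

#include <stddef.h>

/* Hypothetical pool of packing buffers; all names here are invented. */
static void * pool_blocks[ 16 ];
static int    pool_top = 0;

void * pool_acquire_block( void )
{
    void * block = NULL;

    /* Without this critical section, two threads could pop the same block
       (or corrupt pool_top). The guard only takes effect when the code is
       compiled with OpenMP, which is why a plain single-threaded build is
       not safe to call from multiple threads. */
    #pragma omp critical ( pool_lock )
    {
        if ( pool_top > 0 ) block = pool_blocks[ --pool_top ];
    }

    return block;
}

void pool_release_block( void * block )
{
    #pragma omp critical ( pool_lock )
    {
        if ( pool_top < 16 ) pool_blocks[ pool_top++ ] = block;
    }
}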

Now, I'm not sure if the critical section works when BLIS is called from two different application threads. Tyler thinks that it may not do what we want it to do in that situation.

This all may be unrelated, of course. But it would be interesting to enable OpenMP and see if that fixes some or all of the NaNs.

@tkelman
Contributor Author

tkelman commented Jul 3, 2014

But it would be interesting to enable OpenMP and see if that fixes some or all of the NaNs.

Sure, happy to try. Where would that flag be located?

@fgvanzee
Member

fgvanzee commented Jul 3, 2014

Sure, happy to try. Where would that flag be located?

Just add

#define BLIS_ENABLE_OPENMP

to bli_config.h.

@tkelman
Contributor Author

tkelman commented Jul 4, 2014

OK, I also had to uncomment -fopenmp from CMISCFLAGS and add it to LDFLAGS (and LIBBLAS in Julia's Make.user). This is giving a different set of errors - either index out of bounds, or LAPACKException(6597069766656), or SingularException(140011638882304), or determinants of the wrong signs.

Trying with 64-bit integers in BLIS resulted in a segfault; I'll see if I can get a backtrace.

@Maratyszcza
Contributor

This kind of error might also result from reading uninitialized memory.
Could you try running it under valgrind memcheck to eliminate this possibility?


@tkelman
Contributor Author

tkelman commented Jul 4, 2014

Not sure whether this is useful: https://gist.github.com/tkelman/f03fb2b9779b7adac894 edit: whoops, definitely not useful; valgrind run on make doesn't follow the child processes that make launches

Regarding the 64-bit integer segfaults, something seemed to be calling lapack dpotrf at the very end of creating Julia's system image, which then went into BLIS with what looked to be garbage dimensions. I wouldn't expect anything to be doing a Cholesky factorization at that moment, so I'm not sure what was happening there.

@Maratyszcza
Contributor

I think you'll need to run it as:
valgrind --tool=memcheck --trace-children=yes -- make testall
By default, valgrind does not instrument child processes.


@tkelman
Contributor Author

tkelman commented Jul 4, 2014

Thanks, yeah, I realized that a few minutes after initially posting.

So there's a lot going on here https://gist.github.com/tkelman/c55da37d849d01585739 that I don't quite know how to make sense of, though the calls to lapack sgecon_ and dgecon_ look most relevant.

@ViralBShah now might not be a good time, but in case you want to play with an alternate BLAS at all...

@tkelman
Contributor Author

tkelman commented Jul 5, 2014

I think I've figured out what's happening. These first few tests in linalg often go through lapack, which complicates matters. It was more useful to start with the tests that use blas directly.

Julia offers both in-place and allocating versions of BLAS calls. The allocating versions do not explicitly initialize the contents of the output array before calling BLAS; they just send in whatever garbage is in that memory and use beta = 0. BLIS appears to be following the mathematical statement of the BLAS operations a bit more literally than the reference and other implementations I'm aware of, and is actually multiplying my input garbage by zero. When my input garbage contains NaNs (or presumably +/- Inf as well, though that's less likely to occur randomly), the output ends up with corresponding NaNs in the same locations.

Should BLIS be checking for beta == 0 and special-casing to an assign-only operation, instead of multiplying the input by 0? Many applications may have grown accustomed to relying on this behavior over the years without realizing it.
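
For reference, the arithmetic at the root of this is easy to demonstrate in isolation (a minimal standalone C example, nothing BLIS-specific):

#include <math.h>
#include <stdio.h>

int main(void)
{
    float garbage  = NAN;             /* stand-in for uninitialized output memory */
    float scaled   = 0.0f * garbage;  /* literal beta*C with beta == 0: still NaN */
    float assigned = 0.0f;            /* special-cased beta == 0: clean overwrite */

    printf("0*NaN  = %f\n", scaled);    /* prints nan */
    printf("assign = %f\n", assigned);  /* prints 0.000000 */
    return 0;
}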

@ViralBShah

@tkelman I definitely want to try this out, but only after the 0.3 release. It would be great to have a BLIS.jl Julia package for now. You could just replace the calls to Array with zeros on the Julia side so the output arrays get zero-initialized.

@tkelman
Contributor Author

tkelman commented Jul 5, 2014

@ViralBShah that's about what I thought you'd say right now, no worries. I'll see what I can do with Clang auto-wrapping and coming up with a package.

With this patch to Julia https://gist.github.com/a16ce380d174b67561cd to zero-initialize the outputs of the allocating BLAS calls, I see the BLAS tests passing reliably for Float32 and Float64. There appears to be a problem with the imaginary part of the result of cdotc; this is likely due to the complex return-value convention. Julia had switched here JuliaLang/julia#5291 to use cblas for these calls due to some troubles with MKL, but BLIS doesn't provide a cblas compatibility API. Setting BLIS_ENABLE_C99_COMPLEX appears to make no difference.

The linalg tests still see a few NaNs; LAPACK might be making beta == 0 initialization assumptions similar to the ones unpatched Julia was making.

@fgvanzee
Member

fgvanzee commented Jul 5, 2014

Should BLIS be checking for beta == 0 and special-casing to an assign-only operation,
instead of multiplying the input by 0? Many applications may have grown accustomed to
relying on this behavior over the years without realizing it.

Generally speaking, BLIS should already be employing this write vs. update optimization for beta == 0 cases. Could you give me more details about how you are running BLIS? Are you configuring with the reference configuration? Which operation is being called that causes the NaNs to pop up?

Looking at the reference gemm micro-kernel, I can't see any reason it would allow uninitialized memory to get in the way of a beta = 0 invocation.

Maybe it's a different operation altogether (e.g. level-1 / level-2) that is causing the problem?

@tkelman
Contributor Author

tkelman commented Jul 5, 2014

Are you configuring with the reference configuration?

Yes, only change here is BLIS_ENABLE_DYNAMIC_BUILD := yes.

Which operation is being called that causes the NaNs to pop up? Looking at the reference gemm micro-kernel

Not sure whether gemm was ever causing trouble; I'll see how many test failures I can trigger from direct BLAS calls. syrk was definitely leaving input NaNs in place with beta = 0.

@tkelman
Contributor Author

tkelman commented Jul 5, 2014

I can cause a problem with sgemm_, but dgemm_ looks ok. Test program, blistest.c:

#include <stdio.h>
#include "blis.h"

int main() {
    float A[16], B[16], C[16], alpha = 1.0, beta = 0.0;
    int i, ld = 4;
    char trans = 'N';
    for (i=0; i<16; i++) {
        A[i] = 0.0;
        B[i] = 0.0;
        C[i] = NAN;
    }
    A[0]  = 1.0;
    A[5]  = 1.0;
    A[10] = 1.0;
    A[15] = 1.0;
    B[0]  = 1.0;
    B[5]  = 1.0;
    B[10] = 1.0;
    B[15] = 1.0;
    sgemm_(&trans, &trans, &ld, &ld, &ld, &alpha, A, &ld, B, &ld, &beta, C, &ld);
    for (i=0; i<16; i++)
        printf("C[%d] = %e\n", i, C[i]);
}

Compiled and run as:

tkelman@ygdesk:~/github/blis-build06$ gcc blistest.c -o blistest -Iinclude/blis -Llib -lblis -lm -std=gnu99
tkelman@ygdesk:~/github/blis-build06$ LD_LIBRARY_PATH=$PWD/lib ./blistest
C[0] = nan
C[1] = nan
C[2] = nan
C[3] = nan
C[4] = nan
C[5] = nan
C[6] = nan
C[7] = nan
C[8] = nan
C[9] = nan
C[10] = nan
C[11] = nan
C[12] = nan
C[13] = nan
C[14] = nan
C[15] = nan

@fgvanzee
Member

fgvanzee commented Jul 5, 2014

Thanks for isolating the problem, Tony. I have been able to reproduce the bug. It turns out there are two places where the "overwrite or update" logic must exist. The first is in the micro-kernel itself, which is the first place that comes to mind when I think of this issue. The second is in the macro-kernel, immediately after the micro-kernel is called for edge cases. There, the micro-kernel computes into a temporary buffer, and then only the elements that actually exist in the edge case are copied to the output matrix. This second location only has "update" logic, hence the bug. However, I can't figure out why it is not also manifesting for double-precision.
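
To make the two code paths concrete, here is a rough sketch of the edge-case copy-out step being described (names and shapes simplified; this is not the actual BLIS macro-kernel code):

/* For a partial (edge-case) micro-tile of size m_edge x n_edge, the
   micro-kernel computes into a local temporary buffer tmp, and only the
   elements that actually exist are then written to C. The bug: this
   copy-out always performed an "update" (tmp + beta*C), even when
   beta == 0, so NaNs in uninitialized C survived via 0*NaN. */
void copy_out_edge( int m_edge, int n_edge,
                    const float * tmp, int ldt,
                    float beta, float * c, int ldc )
{
    int i, j;

    for ( j = 0; j < n_edge; ++j )
        for ( i = 0; i < m_edge; ++i )
        {
            if ( beta == 0.0f )
                c[ i + j*ldc ] = tmp[ i + j*ldt ];               /* overwrite */
            else
                c[ i + j*ldc ] = tmp[ i + j*ldt ]
                                 + beta * c[ i + j*ldc ];        /* update    */
        }
}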

I'll add this fix to the top of my queue.

@fgvanzee
Member

fgvanzee commented Jul 7, 2014

@tkelman Just a quick update: I figured out why the NaNs were only being produced by single-precision real, and not double-precision real gemm. It has to do with default micro-kernel sizes. The default sgemm micro-kernel size is 8x4, whereas for dgemm it is 4x4. The latter perfectly matches your 4x4 test case, and thus there is no edge case handling needed there (unlike for sgemm). It is in this edge case handling where the NaNs sneak through. (If you were to change the test matrix size in your test driver, for the double-precision case, they would likely show up.)

@tkelman
Contributor Author

tkelman commented Jul 7, 2014

Yep, good catch, thanks for the heads-up. 3x3 double lets NaNs through too. For 9x9 single or double, just the last row and last column are NaNs.

@fgvanzee
Member

fgvanzee commented Jul 8, 2014

@tkelman Should be fixed now. Honestly, I don't know how this bug went undetected for so long, aside from dumb luck that, up until now, none of the infs or NaNs were inside the test matrices' edge cases.

fgvanzee closed this as completed Jul 8, 2014
@tkelman
Contributor Author

tkelman commented Jul 8, 2014

Great, thanks Field. This is now passing Julia's blas tests for single and double. I'll see if I can reproduce any issues with complex in a standalone example and open a new issue if so.

@tkelman
Contributor Author

tkelman commented Jul 8, 2014

Okay, I'm not sure what's at fault with the complex dot product, and I don't think it's worth opening a new issue over. The problem only occurs with cdotu_ and cdotc_; they give the correct real part but the wrong imaginary part. It might be Julia's C FFI getting slightly confused by the function returning a struct of two floats, though it also happens if I turn on BLIS_ENABLE_C99_COMPLEX.

I can work around the issue by applying this patch to Julia http://gist.github.com/43cece304e7247307e47 to account for the extra inputs to bli_cdotv relative to cblas_cdotu_sub. The good news is that, with this change, Julia's entire test suite passes using BLIS in place of OpenBLAS.
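
For context, these are the two interfaces the patch has to bridge. The cblas_cdotu_sub prototype is the standard CBLAS one; the bli_cdotv prototype is inferred from the ccall below and is only a sketch of BLIS's typed API, so treat the exact parameter types as assumptions rather than verified declarations:

/* CBLAS: the result is written through the final pointer argument. */
void cblas_cdotu_sub( const int n, const void * x, const int incx,
                      const void * y, const int incy, void * dotu );

/* BLIS typed API (sketch): two extra leading arguments select whether
   x and/or y should be conjugated, and the result is written through rho. */
void bli_cdotv( int conjx, int conjy, int n,
                void * x, int incx, void * y, int incy, void * rho );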

@fgvanzee
Member

fgvanzee commented Jul 8, 2014

@tkelman Hmm, if the problem were about return-value conventions, you would think it would affect zdotc_ and zdotu_ too.

I did a quick test on my end and was not able to reproduce any issue using 4x1 vectors.

Would it be easy to check exactly how wrong the imaginary result is? (Perhaps this is a subtle numerical issue, which single-precision floats are of course quite vulnerable to.)

fgvanzee reopened this Jul 8, 2014
@tkelman
Contributor Author

tkelman commented Jul 8, 2014

It's far enough off that this isn't a matter of numerical differences.

julia> elty = Complex64; n32 = int32(10); inc32 = int32(1); result = Array(Complex64, 1);

julia> z1 = convert(Vector{elty}, complex(randn(10),randn(10)))
10-element Array{Complex{Float32},1}:
   0.530397-0.547332im
 -0.0749473-0.128646im
     1.5251-0.105443im
   0.785446-0.0529454im
   -1.02396+0.624246im
   0.743135-1.66501im
   -0.46022+0.786384im
  -0.952381-0.12021im
   -1.65526+0.263255im
  -0.316383+1.36757im

julia> z2 = convert(Vector{elty}, complex(randn(10),randn(10)))
10-element Array{Complex{Float32},1}:
 -0.538045+0.899379im
 -0.303094-0.121614im
   0.25909+0.881399im
 -0.968432+0.460136im
  -1.29269+0.17956im
   1.04964+0.295628im
  0.533711+0.66925im
  0.374576+0.708107im
 -0.310224+0.204892im
 -0.437401-0.184823im

julia> sum(z1.*z2)
2.2747393f0 - 1.5521777f0im

julia> ccall((:cdotu_, "libblis"), Complex64, (Ptr{Int32}, Ptr{Complex64}, Ptr{Int32}, Ptr{Complex64}, Ptr{Int32}), &n32, z1, &inc32, z2, &inc32)
2.2747393f0 + 0.39114362f0im

julia> ccall((:cdotu_, "libblas"), Complex64, (Ptr{Int32}, Ptr{Complex64}, Ptr{Int32}, Ptr{Complex64}, Ptr{Int32}), &n32, z1, &inc32, z2, &inc32)
2.2747393f0 + 0.39114362f0im

julia> ccall((:cblas_cdotu_sub, "libblas"), Complex64, (Int32, Ptr{Complex64}, Int32, Ptr{Complex64}, Int32, Ptr{Complex64}), n32, z1, inc32, z2, inc32, result)
-1.5521777f0 + 0.39114362f0im

julia> result[1]
2.2747393f0 - 1.5521777f0im

julia> result2 = Array(Complex64, 1);

julia> ccall((:bli_cdotv, "libblis"), Void, (Cint, Cint, Int32, Ptr{Complex64}, Int32, Ptr{Complex64}, Int32, Ptr{Complex64}), 0x0, 0x0, n32, z1, inc32, z2, inc32, result2)

julia> result2[1]
2.2747393f0 - 1.5521777f0im

So libblis and libblas (the system reference BLAS from the Ubuntu package) give the same wrong result from cdotu_, which leads me to believe it's a Julia C FFI quirk. Interestingly, if I ask for a Complex64 (Julia's name for complex single precision, 32 bits each for the real and imaginary parts) return value from cblas_cdotu_sub out of libblas, which should actually return void, I get the imaginary part of the correct dot product returned in the real part, and the imaginary part of the incorrect dot product returned in the imaginary part. I'm not sure where that value is coming from, but the connection is interesting.

@tkelman
Contributor Author

tkelman commented Jul 8, 2014

cc @andreasnoackjensen does this stuff with complex return types from ccall look familiar to you?

@fgvanzee
Member

fgvanzee commented Jul 8, 2014

@tkelman If netlib BLAS exhibits the problem too, then it sounds like it's on Julia's side. I'll let you take it from here! :)

@ViralBShah

The cdotc issue is ABI-related; it has to do with returning complex values from Fortran functions. I wouldn't worry too much about it for now. I can't find the relevant Julia issue right now.

@ViralBShah

One possibility is to just use the CBLAS wrapper instead of calling it directly.

@fgvanzee
Member

fgvanzee commented Jul 8, 2014

FWIW, I ran into this issue a lot back when Kazushige Goto maintained GotoBLAS. It basically boiled down to making sure that the Fortran components of both the BLAS and the application were compiled with either -ff2c or -fno-f2c. (The important thing was that they matched, though I preferred the saner -fno-f2c option.)

https://gcc.gnu.org/onlinedocs/gfortran/Code-Gen-Options.html

I assume the issue with Julia is something similar.
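
To spell out what has to match: the two modes return a COMPLEX function result in incompatible ways. A C-side sketch of the two conventions as I understand them (illustrative prototypes, not authoritative declarations):

/* Default gfortran (-fno-f2c) convention: the complex value is returned
   directly, like a C99 _Complex return value. */
float _Complex cdotu_( const int * n,
                       const float _Complex * x, const int * incx,
                       const float _Complex * y, const int * incy );

/* f2c (-ff2c) convention: the result is written through a hidden first
   argument and the function itself returns void. */
void cdotu_( float _Complex * ret, const int * n,
             const float _Complex * x, const int * incx,
             const float _Complex * y, const int * incy );

A caller built around one convention but linked against a BLAS built with the other misinterprets the return value, which is the kind of symptom described above.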

@tkelman
Contributor Author

tkelman commented Jul 8, 2014

Thanks Viral. I'll have to check; I think the Ubuntu libblas package is cblas rather than Netlib Fortran blas. edit: never mind, this is wrong

The Julia issue was JuliaLang/julia#5291, where we did switch to using cblas for these functions; the trouble is that BLIS doesn't recreate an exact copy of cblas. BLIS' C API is similar to cblas for this particular function, but more general.

tkelman closed this as completed Jul 8, 2014
@ViralBShah

We can't use f2c with Julia, as we use the C calling convention for calling gfortran-compiled libraries.

@fgvanzee
Member

fgvanzee commented Jul 8, 2014

We can't use f2c with Julia, as we use the C calling convention for calling gfortran compiled libraries.

That's probably for the best. f2c calling conventions are horrible. (I only mentioned it because the real/imaginary return value issue seemed familiar.)
