Reported by Valentin Zauner, Sep 27, 2013
dneupd selects/calculates wrong eigenpairs if rvec = true Steps to reproduce the problem: 1. Have an installation of arpack-ng, lapack and blas, have fortran and C++ compiler (C++11 capable) 2. Download Armadillo library http://arma.sourceforge.net/ (no installation, only include folder needed) 3. compile both arpack_test_c.cpp and arpack_test_f from https://gist.github.com/Darkdragon84/6728023 as described below 4. play around some with arpack_test_c until the result for nev=1 is wrong Expected result: after specifying which='LR', DNEUPD should calculate and return the NEV eigenvalues with largest real part and the corresponding eigenvectors from the arnoldi factorization computed by DNAUPD Actual result: if eigenvectors are also desired (rvec = true), DNEUPD sometimes selects the wrong eigenpairs to compute, and even sometimes computes eigenpairs which are plain wrong for nev=1. Also for nev>1 the ordering of the eigenpairs is often scrambled, this might be related to issue 667 http://forge.scilab.org/index.php/p/arpack-ng/issues/667/ === FURTHER ELABORATION === My task is the follwing: I have a real, nonsymmetric MxM matrix A that is guaranteed that its eigenvalue with largest real part is real and positive (see Remark 1 below if details are of interest). I want to calculate only this one eigenvalue and its corresponding eigenvector. I therefore use DNAUPD with mode = 1, bmat = 'I', which = 'LR', nev = 1, TOL=1e-15 (I also use ISHIFT = 1) and DNEUPD with rvec = 1, howmny='A'. I have been doing this in MATLAB (which also uses arpack in the background) for a long time now and never got any wrong results/wrongly selected eigenpairs. As I am now porting my code to C++ I implemented a simple, matlab-'eigs'-like interface to ARPACK. To represent vectors and matrices in C++ I use the Armadillo library http://arma.sourceforge.net/ === IMPLEMENTATION === I have written a C++ test program that generates random matrices of the above kind (and saves them in binary format) and tries to calculate nev=1 eigenvalues of largest real part and their eigenvectors. As a control I also do this for nev=4 and compare if nev=1 gives the right result (To be super sure I also check with matlab). Now on a random basis the nev=1 result is either completely off (also not fulfilling the eigencondition A*x = lambda*x) or I just get the wrong eigenpair (not the one with largest real part). If I don't seed the random number generator I can reproduce the result if I load the problematic matrix again (I generate my own random starting vector RESID for DNAUPD, which of course uses the same seed then). To check if this is a C++/Fortran interface problem I have written a separate fortran program, using the example program dndrv1.f in the EXAMPLES directory as a template (sry this was basically my first fortran program, so it's pretty sloppy, but does the job). This program loads the stored matrix 'Amat.bin' and calls the same DNAUPD + DNEUPD combination. In many cases, where C++ fails, the fortran program also fails, but there are also instances where only 1 of the 2 fails (so sometimes fortran fails when C++ succeeds). It has to be noted that fails are more frequent when the matrix dimension is higher (-> i suspected memory overlap/corruption, see below) The source codes can be found at: https://gist.github.com/Darkdragon84/6728023 to build arpack_test_c.cpp (where CC is e.g. g++ or icpc): CC -g -m64 -I/home/valentin/armadillo/include/ -std=c++0x -o arpack_test_c arpack_test_c.cpp -lblas -llapack -larpack to run arpack_test_c: ./arpack_test_c N (where N*N is the total matrix dimension of A) or ./arpack_test_c name.bin (to load a previously saved matrix name.bin) to build arpack_test_f.f (where FF is e.g. gfortran or ifort): FF -g -I/path/to/arpack-ng/SRC/ -o arpack_test arpack_test.f -lblas -llapack -larpack to run (a matrix file Amat.bin generated by arpack_test_c must be present in this directory): ./arpack_test_f === DELVING FURTHER === I started delving through the fortan code of DNAUPD and DNEUPD. As DNAUPD seems to do its job right, I looked into DNEUPD.f. After a few checks of debugging traces (turning on mneupd = 3 defined in ARPACKROOT/SRC/debug.h) I found what I believe is the source of all this evil: In the case of failure, the ordering of the eigenvalues of the upper Hessenberg matrix H, computed by LAPACK's dlahqr, is different than the ordering of the Ritz values passed by DNAUPD. The logical array 'select' determines, which eigenpairs should be computed and is computed upon the results of ARPACK's dngets (which selects the wanted eigenvalues, in this case 'LR'). dngets however works with the ordering of the ritz values passed by DNAUPD, whereas select is furtheron used to select from the eigenvalues computed by dlahqr. Ergo, if the orderings of the ritz values passed by DNAUPD and the eigenvalues computed by dlahqr differ, wrong eigenvalues will be selected! I however have no idea, what triggers the ordering mismatch in my failure cases. IMHO it would be much smarter to determine 'select' upon the results of dlahqr, rather than the original ritz values passed by DNAUPD, or is this not good for some reason? This fact also sometimes causes a screwup in the ordering of eigenpairs for nev>1. In that case mostly because we are targeting several eigenpairs, we will also get the ones we want by chance, but not always! I have included some debug traces of 3 characteristic failure cases (trace1-3.txt) here: https://gist.github.com/Darkdragon84/6728023 If the corresponding matrices Amat1-3.bin are wanted, let me know. I actually cannot believe I am the first person to observe this, after all DNAUPD and DNEUPD are heavily used AFAIK. So if there is a simple, obvious way of doing this right, please tell me what I have missed!! I also want to stress, that MATLAB never had this problem and is also using arpack. From all this I am suspecting something is wrong with the LAPACK routine dlahqr === BUILD ENVIRONMENT === I use a machine with an intel core-i7 CPU running Linux Ubuntu 12.04 64bit. I have installed the repository BLAS (atlas 3.8.4) and LAPACK (3.3.1) packages. I also have an installation of the current Intel Composer 2013.sp1 (both C++ and Fortran Compiler 14.0) and MKL (which have their own BLAS and LAPACK implementations) I installed arpack-ng using gnu gfortran/g++ and the supplied automake. I also produced a static libarpack-ng using the intel Fortran Compiler following http://forge.scilab.org/index.php/p/arpack-ng/page/Compile-arpack-ng- without-configure/ I tried both the C++ and Fortran programs with 1.) the gnu compilers (g++ / gfortran), where I linked against the system BLAS, LAPACK and automade ARPACK-NG 2.) the intel compilers (icpc / ifort), where I linked against MKL and the static libarpack-ng I also tried these combinations on other machines (laptop, work PC, home PC, ...) I have also performed memory checks with valgrind (to see if any arrays overlap), which reported no errors. I have tried several flags and switches for the compilers: 1) I use -m64 for both C++ and Fortran, both gnu and intel 2) For intel C++ and Intel Fortran to generate the static libarpack-ng I have tried -align for both to assure that memory is properly aligned 3) I have tried -fpic to generate the static libarpack-ng 4) I have tried to generate CPU specific code by using -march=corei7-avx ALL the different combinations/flags of trying the programs had the same effect: nev=1 sometimes fails (roughly 1 out of 5-10), which is PRETTY BAD! None of the above mentioned things made ANY difference. As of now I cannot say if this is 1) an arpack bug 2) a lapack/blas bug 3) a C++/fortran interface bug (although I highly doubt it given that the pure fortran program also fails) 4) possibly something else === REMARKS === Remark 1: A is actually a superoperator acting on NxN matrices and is similar to a completely positive (CP) map. The action of the superoperator on a matrix C is defined as A(C) = herm(B)*C*B, with B a given NxN matrix describing the superoperator A and herm() the hermitian conjugate. By reshaping NxN matrices into vectors of length N*N one can represent the superoperator as a MxM matrix, where M=N*N. In my case the reshaped superoperator matrix can be constructed as A = kron(conj(B) , B), kron is the kronecker tensor product and conj() means complex conjugation (in my case I use real matrices for now, so A=kron(B,B) really). Operators of this kind are guaranteed that their dominant eigenvalue is real and positive.
Comment 1 by Valentin Zauner, Oct 3, 2013
I just saw that a similar issue 852 had been filed some time ago, but closed due to the lack of examples: http://forge.scilab.org/index.php/p/arpack-ng/issues/852/
Comment 2 by Jordi Gutiérrez Hermoso, Oct 3, 2013
Please be patient, I'm digesting this mammoth bug report.
Comment 3 by Edouard Canot, Oct 13, 2013
May be the following discussion helps: http://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=2&t=3196 Roughly, it explains that if you get ARPACK errors with dlahqr (from LAPACK 3.x version), you may obtain correct results by using the olf dlahqr version provided with the original ARPACK library, and which was from the LAPACK 2.0
Comment 4 by Edouard Canot, Oct 13, 2013
In addition to the previous comment, we can observe that the last version of matlab seems to follow the same way... If we search for the dlahqr symbol in libmwarpack.so, we obtain: $ nm libmwarpack.so | grep dlahqr 00000000000050f0 T dlahqr2_ This result would tend to demonstrate that the Matlab arpack library itself do not use the official dlahqr of Lapack, but a renamed one, dlahqr2, which is certainly the 2.0 version provided in old Arpack. Hope this will help.
Comment 5 by Valentin Zauner, Oct 14, 2013
Hi Edouard! Great, thanks for this insight! I was already suspecting a possible change of the dlahqr routine between different LAPACK versions. As it appears there was a major change between dlahqr from LAPACK 3.0 to 3.1 (but possibly also sooner ones), which is probably responsible for the reordering of the eigenvalues. For now I see 2 possible fixes: 1.) Use an old version of dlahqr.f from LAPACK 2.0 or so, rename it to dlahqr2 (or similar), change the corresponding calls within ARPACK and include dlahqr2 into the ARPACK distro. This would be a pity however, because dlahqr in the newer LAPACK is much more optimized than the old version. 2.) The workaround that I used: I changed the routine dneupd.f quite a bit by reordering the steps taken for rvec= .true. : a) call dlahqr right away. b) copy the real and imag parts of the obtained eigenvalues (workl(iheigr) and workl(iheigi)) over the values still stored from dnaupd (workl(irr) and workl(iri)) c) call dngets as before d) set the select array using the ordering results from dngets but DO NOT PERFORM ritz estimate bound checking (they are still in the dnaupd - order) This way the info-flag -15 becomes obsolete, but it appears to me one will only get this error if one screws with the workspace variables between calling dnaupd and dneupd anyways, so I wholeheartedly discarded it. On the other hand one can use the optimized and newer version of dlahqr while only having to do 2 additional (very cheap) memcopys of arrays of dimension NCV. I have attached my edited dneupd routine as dneupd2.f, it is also available on https://gist.github.com/Darkdragon84/6728023 cheers and thanks for the clarification!
Comment 6 by Edouard Canot, Oct 14, 2013
I am looking for a matrix (not too small) for which the original dneupd routine, linked to the Lapack-3-dlahqr, gives wrong results... Could you give me such a matrix?
Comment 7 by Edouard Canot, Oct 14, 2013
Valentin, I provide, for a practical reason, this new version of dneupd2.f. It is exactly the same algorithm as the previous one, but the two utilitaries routines at the end (printmatvec and printmat) have been removed. Also, the blank trailing space at some lines have been restored, so a diff between dneupd.f and dneupd2.f will reveal only your algorithm changes. I have also added a small comment (at lines 303-304) which show the reason of the modification and also your name.
Comment 8 by Valentin Zauner, Oct 15, 2013
Edouard, Hey thanks for the cleanup and the modifications and yeah, oops, I forgot to remove the 2 aux routines at the end... ^^ About a problematic matrix: In my case this current issue is actually of stochastic nature. So a matrix that was problematic before can be fine after logout/login or reboot or just several hours later. I however noticed that the bug tends to show up more often for large matrix sizes. If you want you can try the test programs that I have supplied in the first comment... I don't know what causes the different behavior in the new version of dlahqr and why it sometimes behaves like the old routine and sometimes gives different orderings... Anyways, I actually think it's a flaw in the original dneupd algorithm setup to trust an external routine to always show a certain behavior that is nowhere guaranteed. Thanks for your help and input d_(^^)_b
Comment 9 by Valentin Zauner, Oct 15, 2013
on a side note, I think the newly introduced for-loop from line 642-646 in the source-code is not of FORTRAN77 standard, but uses do ... statements end do instead of do [label] ... statements [label] continue idk if backwards compatibility to F77 is wanted. if so, this for loop should be edited using a (not yet used) label...
Comment 10 by Edouard Canot, Oct 15, 2013
Valentin, Keep the do-enddo loop. It is more readable and most (if not all) Fortran compilers understand it. Besides, I have tried your new dneupd2.f routine in my library but, at least for one 8x8 matrix, I got the old wrong result. The matrix is stored in the ASCII attached file: A_8x8.dat I'm trying to compute the two smallest eigenvalues, in the two cases: (i) without eigenvectors (rvec=.false.) (ii) with eigenvectors (rvec=.true.) The reference is Matlab-8.2 which gives the expected values: 2.5347,+ 0.0000i 2.6160,- 1.2918i Note that Octave-3.6.4 gives also, curiously enough, the correct above values. Scilab-5.4.1 gives a wrong result (for sure, as it use the original dneupd routine): 2.6160,+ 1.2918i 2.6160,- 1.2918i Unfortunately, your dneupd2 routine gives also the wrong result. I don't why. Even with rvec=.false. !
- A_8x8.dat - 1.54 kB
Comment 11 by Valentin Zauner, Oct 16, 2013
Hi Edouard, That is unfortunate :-( I have to admit however that I was only considering and testing the case where I need either the eigenvalues of largest magnitude ('LM') or largest real part ('LR'). When you say smallest eigenvalue do you mean 'SR' or 'SM'? When you look at the matlab source code for eigs they actually set for 'SM': which = 'LM' sigma = 0 style = 'G' (generalized!!) mode = 3 and they do sth a little bit different in the IDO = 1 case... isk if this is the case you are considering? Have you checked if the ritz values from dnaupd have converged as you wanted them? If they have and dneupd2 still gives you wrong results then i suspect that sth is wrong with the dtrsen.f routine which select the wanted ritz values. I will see if I can get the eigenvalues you wanted from the matrix you supplied if I have time. cheers, Valentin
Comment 12 by Edouard Canot, Oct 16, 2013
In my library also, for not too bad conditionned matrix (this is the case for A_8x8), 'SM' implies larger eigenvalues for OP=inv(A). I verified that all ritz values converged!
Comment 13 by Valentin Zauner, Oct 16, 2013
OK, so does that mean to obtain the smallest eigenvalues you are using 'LM' and OP=inv(A) (which is basically 'SM')? Does dtrsen give you the correct ordering then? Sorry I haven't implemented anything else than 'LR' and 'LM' yet, do it's a bit hard for me to check :-P
Comment 14 by Edouard Canot, Oct 16, 2013
>> Does that mean to obtain the smallest eigenvalues you are using 'LM' and OP=inv(A) (which is basically 'SM')? Yes >> Does dtrsen give you the correct ordering then? No idea! I further checked the first option, which was "using the old dlahqr routine, renamed dlahqr2"; for my A_8x8 matrix, calling dnaupd/dneupd for smallest eigenvalues, with rvec=.true., gives now the good result. BUT, curiously enough, the same calling sequence with rvec=.true. (i.e. no eigenvectors) gives the wrong answer!
Comment 15 by Valentin Zauner, Oct 16, 2013
hmm... you can check the output of dtrsen by setting the appropriate debug message level (set the variable mneupd to anything > 2). And oops, sorry, I actually always meant the routine dngets (which orders the eigenvalues the way you want them, e.g. 'SM': smallest first) when I said dtrsen (which does a reordering of the schur-decomposition done by dlahqr. So: after calling dngets, the debug output should show an array, where the wanted eigenvectors are IN THE LAST FEW ENTRIES. if the desired NCONV eigenvalues are not among the first NCONV eigenvalues obtained by dlahqr, the routine dtrsen is called to reorder the schur-decomposition. With or without that call, in the end the desired eigenvalues should be among the first NCONV eigenvalues of the schur decomposition. This can all be checked by looking at the debug info (like the .txt files I posted in my first comment) If all this is fine but you still get wrong results, then I suspect something later on is going wrong, that is not called when WHICH = 'LR' or 'LM'. >> BUT, curiously enough, the same calling sequence with >> rvec=.true. (i.e. no eigenvectors) gives the wrong >> answer! Did you mean rvec=.false. or (i.e. WITH eigenvectors), sry I'm confused... cheers
Comment 16 by Edouard Canot, Oct 18, 2013
Sorry, Valentin! I have to say that your 'dneupd2.f' is correct. I used it to many tests with my library Muesli. The problem was the storing of eigenvectors (as in Lapack), and it is not so easy when you want to sort them according the module of eigenvalues, either in ascending or descending order. My tests are not yet complete, but I think that all is good. Many thanks, I've improved my numerical library! To the arpack-ng maintainers, I recommend to patch the dneupd() routine using the file 'dneupd2.f' of V. Zauner. Regards, (and sorry again for my wrong news two days ago).
Comment 17 by Valentin Zauner, Oct 18, 2013
Muesli, lol :-) I'm glad to hear it works in your case as well! That suggests that there are no more adaptations to be made to the rest of the routine after the selecting of the wanted eigenpairs. I suspect a similar adaptation should be made in all _neupd.f routines. cheers, Valentin
Comment 18 by Jordi Gutiérrez Hermoso, Oct 24, 2013
I'm evidently falling behind applying this patch. Would either of you feel competent enough to handle git and apply the patch yourself? I'm asking, do either of you want to become an arpack-ng maintainer? If Sylvester agrees, I think we could use more help.
Comment 19 by Sylvestre Ledru, Oct 24, 2013
I would be very happy to see new maintainers in arpack :) Edouard, do you want me to give you permissions on the git repository ? I can help you with the system if you want.
Comment 20 by Edouard Canot, Oct 25, 2013
Jordi and Valentin, I think we have to delayed the patch of dneupd. Yesterday, I have found a matrix for which dneupd failed, alas :-( I must check whether it comes from my library or from dneupd. Sylvestre, concerning the fact to become a new Arpack maintainer, I would like to accept but for the moment I cannot :-(
Comment 21 by Valentin Zauner, Oct 29, 2013
hi guys, sorry for the delay. Thanks for the offer as a maintainer! I however also have to refuse as I have only very rudimentary knowledge of fortran and my time also doesn't really permit it :-( I am however happy to help clarify further issues with dneupd Edouard: the debugging information that arpack provides was really very helpful for me in finding out the culprit. Maybe it helps you also to see what goes wrong for your special case? cheers
Comment 22 by Jordi Gutiérrez Hermoso, Oct 30, 2013
You two have already spent far more time debugging this issue than I have ever spent in my entire time as a maintainer of ARPACK. All this offer gives you, really, is the right to upload a fix once you have it ready, and it comes with absolutely no obligations other whatever you decide to impose on yourself.
Comment 23 by Sylvestre Ledru, Nov 6, 2013
Edouard, have you been able to make some progress on the patch ? Thanks
Comment 24 by Chris cmiller730, Feb 17, 2014
Sorry to be late to the party. I'm trying to write Haskell wrappers to arpack and beleive that I'm encountering this issue as well. I'm just doing a sanity check with a diagonal matrix (using the non-symmetric routine), A_ii = (i+1) 0 <= i < 1000. Here's a summary of my ARPACK params bmat = 'I' N = 1000 which = "LM" nev = 9 tol = 0 ncv = 2*nev + 1 rvec = 1 The computed eigenvalues should be 1000, 999,...,992. I'm getting 21.240820 209.280818 491.134969 1000.000000 999.000000 998.000000 997.000000 996.000000 995.000000 The failure mode appears to depend on nev. For nev = 1..8 I get the expected results, breaks for 9 as described, nev = 10 returns [999,998,..990] (missing 1000), 11 fails in a manner similar to 9, 15 returns correctly. Sorry I got carried away. By induction it's broken ;). All of these examples return ido = 99 info = 0 As others have reported I also always get correct results when rvec = 0. I'm attaching c code for this example.
Comment 25 by Chris cmiller730, Feb 17, 2014
I now also confirm that the above fix to dneupd.f corrects the problem at least in this simple test case. What would be involved in patching this? Would updates to _neupd be enough?
Comment 26 by Sylvestre Ledru, Feb 17, 2014
Well, to be clear, I don't have the skills to fix this bug I will be happy to apply a merge and use the various tests cases as unit tests.
Comment 27 by Chris cmiller730, Feb 17, 2014
I don't really feel like I have the skills either. It's been a long time since I've seriously looked at Fortran code past "how do I call it from C?" I guess I have three concerns 1) The problem appears to be related to a change in the way lapack handles things. Would the fix break people using older lapack? Does anyone care? 2) I don't really have a full grasp of what exactly the problem is so I'm hesitant to apply someone else's fix. 3) Unless I'm missing something, the test coverage in arpack-ng seems a little sparse. I'd worry that I'd break something down the line. If I have time and can get a better handle on what's going wrong I'll try and make a patch and write a few verification tests to send to the maintainers. I've setup a fork at https://github.com/cmiller730/arpack-ng to work on this in case anyone's interested in my progress.
Comment 28 by Tim Mitchell, Mar 5, 2014
Issue 1315 just caught my attention as it sounds very similar to issue 664 which I discovered and patched a couple years ago and that issue 1315 could be related to issue 667 which I also reported. Having read through the entire thread, I’m not 100% convinced there really is a bug here because there is more than one issue here being conflated together, thus making it difficult to determine what is causing what. First, the business of migrating to LAPACK 3.0 or later routines seems a very premature step. It seems part of the “bug" was caused by overlooking ARPACK instructions to use the included LAPACK 2.0 routines, which I’d say is user-error. The “bug" should first be replicated with the LAPACK routines included with ARPACK, not any other version. Migrating to a newer LAPACK version may be a good idea long-term but feature enhancement should be a completely separate issue from determining whether this bug exists, its cause, and its solution, unless we are certain that 2.0 routines are somehow at fault. So before moving to the much modified dneupd2, I think it would be best to focus on making sure this bug truly exists with the regular dneupd and LAPACK 2.0 routines. With that in mind, the next very important issue is whether the demo codes are properly giving ARPACK a fixed initial vector. The fact that bug seems to only occur sometimes for a given matrix likely indicates that the ARPACK is generating a random initial vector every call, which would explain why Valentin Zauner is reporting that the problem doesn’t always happen every time. I just skimmed Valentin Zauner’s demo codes and it looks like that indeed INFO is set to 0, thus telling ARPACK to ignore whatever is in RESID and to instead generate its own random initial vector every time it is called. See lines 123-127 of the comments for dnaupd.f for RESID and INFO: http://forge.scilab.org/index.php/p/arpack-ng/source/tree/master/SRC/ dnaupd.f I did this very quickly so perhaps someone could double-check my source-code skimming? I am pretty confident that if RESID is initially set to a vector of ones (or any other fixed set of values) and INFO is set to 1, then the bug will either always occur or it always won’t. So I think the number one priority moving forward should be to create a demo code that generates a random vector for RESID with INFO := 1 and then stores RESID to disk before calling ARPACK. That way, if the bug then occurs with that initial vector, we can load up the copy of the initial RESID vector on disk and can replicate the bug every time with a specific matrix and that specific initial vector. Krylov methods are of course sensitive to the choice of initial vector and it could be that sometimes just a bad initial vector is randomly generated and as a result, not all the desired Ritz values converge, which would be normal. With a really bad choice of initial vector (at least non-zero though… since zero would indeed be very very bad), ARPACK may not even have a single Ritz value converge before hitting its max iteration count. To confirm that we have a bug, we need to confirm that indeed ARPACK is returning undesired / not-converged Ritz values with eigenvectors when in fact, ARPACK has already computed more desired converged Ritz values that it isn’t returning for some reason. It’s not yet clear to me that this is really happening and we can’t do that test unless we have a fixed initial vector and matrix that supposedly replicates the problem. To sum up, my advice at this point is the following: 1) Use only the LAPACK routines included with ARPACK for the timing being and not the modified dneupd2 2) obtain a fixed example matrix A and a fixed initial vector x0 that replicates the problem every time 3) then with this A and x0, verify whether Matlab replicates the same behavior or not. At this point, I doubt the claim that the problem doesn’t occur in Matlab only because the initial vector and LAPACK routines don’t seem to have been adequately controlled for. If after all that, we have a matrix A and vector x0 which causes a problem with ARPACK-NG but in Matlab, then I think we’ll have verified that something fishy is going on here and we can start looking into whether ARPACK is not properly returning the desired set of converged Ritz values and corresponding eigenvectors. Unfortunately, I don’t have the time at the moment to do the investigative legwork at the moment, but if someone else here can follow up on getting a concrete problematic example generated that recreates the problem every time, I’d be happy to consult on it and time permitting, get more involved then.
Comment 29 by Valentin Zauner, Mar 5, 2014
Hi Tim, Thank you for your detailed analysis of what you choose not to call a bug. If indeed the ARPACK-NG documentation specifically points out to use included LAPACK 2.0 routines then I agree with you that the problem I described is user induced, although somewhat finicky. Could you please point out to me where exactly this is stated (and it has to be in the NG documentation, not the "old" ARPACK one!). Even if it is so, the LAPACK routine that causes the problem (dlahqr) is actually not meant to be called directly by the user from what I can tell from the LAPACK documentation. So we could continue to be finicky along that direction... Let me elaborate on your other points: 1) This has already been suggested as the simplest bugfix 2) Actually this is of course what I tried to supply in the beginning, as of course it would greatly help to have a set of input data to ALWAYS reproduce the error. The problem however is the following: Let's say you give me a fixed matrix A. I run my code many times with a random initial vectors x0 until I see the problem occuring. I save A and the problematic x0. If I then load x0 again in the next run and use it as an initial vector I can reproduce the problem many times over and over (I tried all this!). If however in the meantime I run some other program, close/open some programs or reboot the machine, the same x0 which caused problems before is now fine. So if I then give you the A and x0 that I saved you will probably not be able to reproduce the problem. I suspect that the routine dlahqr also uses some stochastic algorithm and somehow this gets messed up by opening/closing/rebooting etc. Anyways there is some undtable behavior and that's trouble enough. 3) Every problematic x0 I double checked with MATLAB. MATLAB has NEVER EVER failed a single time (the considered matrices were well behaved). I don't know exactly what you want to say in 3) but as stated above somewhere, MATLAB actually also commits what you call a user-error by using a current release of LAPACK and just using the LAPACK 2.0 version of dlahqr. In this sense, MATHWORKS has also found and fixed this bug and confirms that the routine dlahqr is the culprit. on a side remark: I have confirmed that the requested ritz values have converged but ARPACK just returns the wrong eigenpairs by supplying debug traces (if you care you can skim through them at https://gist.github.com/Darkdragon84/6728023). To conclude: - even if ARPACK tells you to use LAPACK 2.0 I think this is way out of date and users should be at least SPECIFICALLY pointed to that issue. - I actually just wanted to check if anyone else experienced this problem and alongside supplied my own (admittedly) dirty hack to fix the problem. It works fine for me, I never requested it to be included in the library. Whoever likes it can use it, if not, well then not :-) Hope this clarifies a few things, if not, let's exchange another few hundred lines ;-) cheers, Valentin
Comment 30 by Chris cmiller730, Mar 5, 2014
If it says only use the included lapack that's fine. But I didn't notice it anywhere in the README. What I did read in the README said 4. Unlike ARPACK, ARPACK-NG is providing autotools based build system. Therefor, the classical: $ ./configure $ make $ make install should work as expected. If nothing else the build script should be changed so it doesn't link with the user's LAPACK and/or the README should be updated. I'm also 99.9999% sure the problem has nothing to do with sensitivity to starting vectors. 1) ARPACK reports convergence when the results clearly violate the convergence criterion. Even if the failure is related to starting vectors this would still be a bug. 2) The results are always wrong for me in exactly the same ways. I've also tried seeding with a vector of all ones and and info = 1 I get the same incorrect results. 3) It's failing on matricies it has no buisness failing on (Diagonal) no matter what the starting vector is. I don't have matlab but I've verified that the matrix from my trial script works with scipy's sparse.linalg.eigs.
Comment 31 by Tim Mitchell, Mar 5, 2014
Hi Valentin and Chris, Thanks for the quick reply and providing additional details. To clarify my stance, I wasn’t stating that what you've reported is not a bug or that is a bug. All I was claiming is that it wasn’t really clear what was being reported. Given the additional replies by you both, it does sound like something wrong is going on. Concerning the initial vector issue, it also wasn’t clear to me from the codes provided here that is was accounted for, since INFO was set to 0 and I didn’t see any code relevant to storing or uploading a specific initial vector in either Valentin’s or Chris’s demo codes. It’s good to know that they are both reporting that problems occur even with a vector of 1’s as input. However, even with a fixed initial vector of 1’s, does the problem happen every time for you Chris or is it intermittent like Valentin reports? If it’s intermittent, can you at least get to occur “on demand” quickly by calling ARPACK-NG with the same matrix and initial vector in a loop over and over again? Concerning using the newer LAPACK routines, is it confirmed that this bug you only happens when using the newer 3.x versions of LAPACK and not the older 2.0 release that ARPACK included with it and that ARPACK-NG stripped out of its source tree? I could have sworn I had seen something regarding only using certain version of routines in LAPACK with ARPACK but this was long time ago and I can’t find it now so maybe I am mistaken. So yeah, my use of “user-error” was probably too harsh. Still, I think expecting ARPACK[-NG] to work fine with newer major version updates of LAPACK is on a plane of optimism that I neither possess nor even imagine having. :) It’s been nearly 20 years (!) since ARPACK was first released and it hasn’t been maintained by the original authors for a /long/ time. If it’s just the newer LAPACK routines that are inducing the problem in ARPACK-NG, priority number one should be instructing ARPACK-NG users and distros to only link to LAPACK 2.0 (or the most recent non-problematic release) for the time being until a new official update of ARPACK-NG is ready to be released with compatibility with the latest LAPACK versions. Certainly correctness of results are more important than any speed gains given by newer LAPACK code.
Comment 32 by Roland Wirth, May 14, 2014
> I could have sworn I had seen something regarding only using certain version of routines in LAPACK with ARPACK but this was long time ago and I can’t find it now so maybe I am mistaken. The passage you are referring to is in the README of the original ARPACK package. There, they strongly recommend that you use the bundled LAPACK if the system LAPACK is not 2.0. They also state that the current LAPACK was 3.0 at the time of writing. So, it might be that you have rediscovered a long-forgotten incompatibility.
Comment 33 by Tim Mitchell, May 14, 2014
Hi Roland, Thank you so much for reminding me where I had seen that warning concerning using LAPACK 2.0 instead of 3.0 with ARPACK. I knew I had seen it somewhere but I was rather swamped with other things last time I checked in here to go digging around for it. It seems the proper course of action moving forward is to put the 2.0 LAPACK code included with the official ARPACK back into the arpack-ng tree, such that it is always linked to upon compiling, along with new documentation warnings that specifically detail this incompatibility with LAPACK 3.0. I'd recommend also putting such a comment/warning in the makefile so no one in the future falls into the same trap. Once that is done, I believe issue 1315 should then be changed from "type defect" to "type enhancement" and when someone in the future has time and motivation, the arpack-ng code can be updated in a future release to properly support LAPACK 3.0, perhaps by following the work Valentin has already done on that front. Thanks to everyone for bringing attention to this regression that had been compared to the official ARPACK. Sylvestre or other admins, can you make these updates?
Comment 34 by Jordi Gutiérrez Hermoso, May 14, 2014
This isn't an "unofficial" ARPACK. We contacted the ARPACK developers telling them that we were doing this, and they agreed to it (well, they didn't disagree to it, they just never really responded except with promises to maintain ARPACK, which never happened). We shouldn't be packaging our own LAPACK either. This isn't a real fix. The whole point of creating arpack-ng in the first place was to avoid shipping embedded libraries. The real fix is to figure out how to make ARPACK work with LAPACK other than 2.0.
Comment 35 by Valentin Zauner, May 15, 2014
I agree with Jordi. As I gather, the point of arpack-ng is exactly to provide more than the (not well maintained) original ARPACK, i.e. also ensure compatibility with current libraries. In our case this is just LAPACK and BLAS. I understand that this might be a cumbersome task, but otherwise I see no point in arpack-ng in the first place. Sorry if this might sound a bit harsh, but I think otherwise arpack-ng would just be the usual ARPACK with a nice make environment...
Comment 36 by Sylvestre Ledru, May 15, 2014
Clearly, we won't be shipping a copy of LAPACK in Arpack-ng. However, I will be happy to apply any patch to support LAPACK 3.0 (but I won't be the one doing the work, sorry).
Comment 37 by Valentin Zauner, May 15, 2014
OK, this rules out the MATLAB approach, where they use a compiled current LAPACK library, where just the troublemaker routine is replaced with the old one from the 2.0 release.
Comment 38 by Sylvestre Ledru, May 15, 2014
Valentin, it is just about one file, we can take it as a workaround. I don't know if it is possible with Fortran but a load on the fly could be a workaround...
Comment 39 by Valentin Zauner, May 15, 2014
By "one file", you mean the ARPACK fortran file, i.e. dneupd.f? I actually strongly suspect that the same issue will also arise for all the other drivers (i.e. single/double real/complex symmetric/non-symmetric) that compute eigenvectors. One would have to test all of these and apply the corresponding workaround. I haven't done this since in my application I only need the double real non-symmetric routine.
Comment 40 by Tim Mitchell, May 15, 2014
Valentin, yes, I am 99.999% certain that the same issue will arise for all the other drivers. However, I believe Sylvestre was suggesting to only take the one LAPACK 2.0 file dlahqr and linking to to just that specific version for arpack-ng, possibly dynamically. I'd be fine with that approach as a temporary workaround (or by taking the Matlab approach which presumably includes it as dlahqr2 as Edouard has suggested). Is that what you meant Sylvestre? I agree with everyone that it'd be great for arpack-ng to push forward and support the newer LAPACK releases. However, until those improvements are ready and thoroughly tested, I feel it is prudent to provide a temporary workaround that deals with this incompatibility with newer LAPACK routine(s). Though I admit it's a bit of compromise to include and/or link to specific 2.0 versions as necessary, the alternative of having arpack-ng users live with some fairly erroneous results seems a worse trade-off, particularly since it seems like it will be some time before compatibility with LAPACK 3.0 is fully handled in arpack-ng. Valentin, as to the point of arpack-ng, I completely disagree that without feature enhancements that arpack-ng would just be regular ARPACK with a nice make environment. The official ARPACK tree is not being maintained /at all/ and as a result, the bug reports and fixes were that have cropped up over the years weren't being organized or compiled into one centralized source. I personally am very grateful that arpack-ng exists so that there is now one centralized and organized repository for all ARPACK related bugs and fixes to be available from. Even if arpack-ng intended to only be just such a maintenance release, that's still a huge benefit to the community.
Comment 41 by Viral Shah, May 25, 2014
Having the erroneous answer is certainly the worse solution. It would be really great if we can include DLAHQR2 as part of arpack-ng, so that users get a correct working solution out of the box.
Comment 42 by Johannes Jünemann, Jun 11, 2014
Hello everybody, I am encountering a problem that is related to your discussion. I am trying to find the 'LR' eigenvalues of given matrices (with real eigenvalues!) using the dnaupd/dneupd methods, however, the lowest eigenvalues do not always appear in the right order. Below a typical matrix example: A = RESHAPE((/ 0d0 , 1d-4, 1d-4, -1d0, & 1d-4, 0d0, -1d0, 1d-4, & 1d-4, -1d0, 0d0, 1d-4, & -1d0, 1d-4, 1d-4, 0d0 & /), (/4,4/) ) When calling the method repeatedly, the ordering of the two-lowest EV (-1.0002 resp. -0.9998) goes wrong in some cases. I used the arpack_test_f program (as provided by Valentin Zauner) to double-check and tried both the fix for the dlahqr.f and using dneupd2.f, but none of the two changes the problem. The ordering (at least for this particular example) goes right when choosing 'LR','SM' or 'LM'. I would be very glad if one of you could confirm this problem in oder to know if the issue should have been fixed by your patches (and I didn't follow the instructions through completely) or if this is an unrelated error. Thanks for your help, Johannes
Comment 43 by Johannes Jünemann, Jun 11, 2014
I am sorry, the second sentence was supposed to read: I am trying to find the 'SR' eigenvalues of given matrices (with real eigenvalues!) using the dnaupd/dneupd methods...
Comment 44 by Valentin Zauner, Jun 11, 2014
I have never used the 'SR' mode, maybe some additional stuff happens when you call this? there should be some shift-and-invert type of operation involved, maybe this screws it up? Also, your matrix is symmetric and you should use the symmetric driver dsaupd/dseupd to begin with...
Comment 45 by Johannes Jünemann, Jun 12, 2014
Thank you for the very quick response. I am using the non-symmetric libraries, as I apply the procedure to a set of different (not necessarily) symmetric matrices. The problem persists when slightly changing the matrix entries as to make it non-symmetric.
Comment 46 by Marco Caliari, Feb 12, 2015
Dear all, I had similar problems with rvec = .true. It was not clear to me how the Ritz values (computed, sorted, selected, ...) by dnaupd were related with the eigenvalues computed by the famous dlahqr.f. Till today, when I saw in arpack/src/dlaqrb.f c This is mostly a modification of the LAPACK routine dlahqr. Now it is clear: they are related because computed in the same way (by the same routine). So, I compared dlaqrb.f with dlahqr.f (2.0) and made the same small changes to dlahqr.f (3.5.0) and called the result dlaqrb35.f (attached). All my problems disappeared! If you all confirm that it works also for you, I can continue with the job (modify the description, the comments and the other routines).
Comment 48 by Valentin Zauner, Feb 12, 2015
Hi! Tracking the changes back to within the auxiliary routine is probably a better way :-) I will see if it works for me when I have the time! Thanks for your work! Cheers, Valentin
Comment 49 by Sylvestre Ledru, Feb 12, 2015
This bug is fixed in github: https://github.com/opencollab/arpack-ng/issues/3 FYI, this forge is no longer used.
Comment 50 by Sylvestre Ledru, Feb 12, 2015
Comment 51 by Valentin Zauner, Feb 12, 2015
So does that mean that the whole issue is now fully fixed in the current 3.1.5 release? Also for other drivers such as dseupd, zneupd, etc?
Comment 52 by Sylvestre Ledru, Feb 12, 2015
Please ask your question in github. This project is only here for legacy purposes.