arpack-ng  

Issue 1315: dneupd selects/calculates wrong eigenpairs if rvec = true

Reported by Valentin Zauner, Sep 27, 2013

dneupd selects/calculates wrong eigenpairs if rvec = true

Steps to reproduce the problem:

1. Have an installation of arpack-ng, lapack and blas, have fortran 
and C++ compiler (C++11 capable)
2. Download Armadillo library http://arma.sourceforge.net/ (no 
installation, only include folder needed)
3. compile both arpack_test_c.cpp and arpack_test_f from 
https://gist.github.com/Darkdragon84/6728023 as described below
4. play around some with arpack_test_c until the result for nev=1 is 
wrong

Expected result:

after specifying which='LR', DNEUPD should calculate and return the 
NEV eigenvalues with largest real part and the corresponding 
eigenvectors from the arnoldi factorization computed by DNAUPD

Actual result:

if eigenvectors are also desired (rvec = true), DNEUPD sometimes 
selects the wrong eigenpairs to compute, and even sometimes computes 
eigenpairs which are plain wrong for nev=1. Also for nev>1 the 
ordering of the eigenpairs is often scrambled, this might be related 
to issue 667
http://forge.scilab.org/index.php/p/arpack-ng/issues/667/

=== FURTHER ELABORATION ===

My task is the follwing: I have a real, nonsymmetric MxM matrix A 
that is guaranteed that its eigenvalue with largest real part is 
real and positive (see Remark 1 below if details are of interest). I 
want to calculate only this one eigenvalue and its corresponding 
eigenvector. I therefore use DNAUPD with mode = 1, bmat = 'I', which 
= 'LR', nev = 1, TOL=1e-15 (I also use ISHIFT = 1) and DNEUPD with 
rvec = 1, howmny='A'. 

I have been doing this in MATLAB (which also uses arpack in the 
background) for a long time now and never got any wrong 
results/wrongly selected eigenpairs. As I am now porting my code to 
C++ I implemented a simple, matlab-'eigs'-like interface to ARPACK. 
To represent vectors and matrices in C++ I use the Armadillo library 
http://arma.sourceforge.net/

=== IMPLEMENTATION ===

I have written a C++ test program that generates random matrices of 
the above kind (and saves them in binary format) and tries to 
calculate nev=1 eigenvalues of largest real part and their 
eigenvectors. As a control I also do this for nev=4 and compare if 
nev=1 gives the right result (To be super sure I also check with 
matlab).

Now on a random basis the nev=1 result is either completely off 
(also not fulfilling the eigencondition A*x = lambda*x) or I just 
get the wrong eigenpair (not the one with largest real part). If I 
don't seed the random number generator I can reproduce the result if 
I load the problematic matrix again (I generate my own random 
starting vector RESID for DNAUPD, which of course uses the same seed 
then).

To check if this is a C++/Fortran interface problem I have written a 
separate fortran program, using the example program dndrv1.f in the 
EXAMPLES directory as a template (sry this was basically my first 
fortran program, so it's pretty sloppy, but does the job). This 
program loads the stored matrix 'Amat.bin' and calls the same DNAUPD 
+ DNEUPD combination. In many cases, where C++ fails, the fortran 
program also fails, but there are also instances where only 1 of the 
2 fails (so sometimes fortran fails when C++ succeeds).

It has to be noted that fails are more frequent when the matrix 
dimension is higher (-> i suspected memory overlap/corruption, 
see below)

The source codes can be found at:
https://gist.github.com/Darkdragon84/6728023

to build arpack_test_c.cpp (where CC is e.g. g++ or icpc): CC -g 
-m64 -I/home/valentin/armadillo/include/ -std=c++0x -o arpack_test_c 
arpack_test_c.cpp -lblas -llapack -larpack
to run arpack_test_c:
./arpack_test_c N  (where N*N is the total matrix dimension of A)
or
./arpack_test_c name.bin (to load a previously saved matrix 
name.bin)

to build arpack_test_f.f (where FF is e.g. gfortran or ifort): FF -g 
-I/path/to/arpack-ng/SRC/ -o arpack_test arpack_test.f -lblas 
-llapack -larpack
to run (a matrix file Amat.bin generated by arpack_test_c must be 
present in this directory): ./arpack_test_f

=== DELVING FURTHER ===

I started delving through the fortan code of DNAUPD and DNEUPD. As 
DNAUPD seems to do its job right, I looked into DNEUPD.f. After a 
few checks of debugging traces (turning on mneupd = 3 defined in 
ARPACKROOT/SRC/debug.h) I found what I believe is the source of all 
this evil:

In the case of failure, the ordering of the eigenvalues of the upper 
Hessenberg matrix H, computed by LAPACK's dlahqr, is different than 
the ordering of the Ritz values passed by DNAUPD. The logical array 
'select' determines, which eigenpairs should be computed and is 
computed upon the results of ARPACK's dngets (which selects the 
wanted eigenvalues, in this case 'LR'). dngets however works with 
the ordering of the ritz values passed by DNAUPD, whereas select is 
furtheron used to select from the eigenvalues computed by dlahqr.

Ergo, if the orderings of the ritz values passed by DNAUPD and the 
eigenvalues computed by dlahqr differ, wrong eigenvalues will be 
selected!

I however have no idea, what triggers the ordering mismatch in my 
failure cases. IMHO it would be much smarter to determine 'select' 
upon the results of dlahqr, rather than the original ritz values 
passed by DNAUPD, or is this not good for some reason?

This fact also sometimes causes a screwup in the ordering of 
eigenpairs for nev>1. In that case mostly because we are 
targeting several eigenpairs, we will also get the ones we want by 
chance, but not always!

I have included some debug traces of 3 characteristic failure cases 
(trace1-3.txt) here:
https://gist.github.com/Darkdragon84/6728023

If the corresponding matrices Amat1-3.bin are wanted, let me know.

I actually cannot believe I am the first person to observe this, 
after all DNAUPD and DNEUPD are heavily used AFAIK. So if there is a 
simple, obvious way of doing this right, please tell me what I have 
missed!!

I also want to stress, that MATLAB never had this problem and is 
also using arpack. From all this I am suspecting something is wrong 
with the LAPACK routine dlahqr
=== BUILD ENVIRONMENT  ===

I use a machine with an intel core-i7 CPU running Linux Ubuntu 12.04 
64bit. I have installed the repository BLAS (atlas 3.8.4) and LAPACK 
(3.3.1) packages. 
I also have an installation of the current Intel Composer 2013.sp1 
(both C++ and Fortran Compiler 14.0) and MKL (which have their own 
BLAS and LAPACK implementations)

I installed arpack-ng using gnu gfortran/g++ and the supplied 
automake. I also produced a static libarpack-ng using the intel 
Fortran Compiler following
http://forge.scilab.org/index.php/p/arpack-ng/page/Compile-arpack-ng-
without-configure/

I tried both the C++ and Fortran programs with
1.) the gnu compilers (g++ / gfortran), where I linked against the 
system BLAS, LAPACK and automade ARPACK-NG
2.) the intel compilers (icpc / ifort), where I linked against MKL 
and the static libarpack-ng 

I also tried these combinations on other machines (laptop, work PC, 
home PC, ...)

I have also performed memory checks with valgrind (to see if any 
arrays overlap), which reported no errors.

I have tried several flags and switches for the compilers:
1) I use -m64 for both C++ and Fortran, both gnu and intel
2) For intel C++ and Intel Fortran to generate the static 
libarpack-ng I have tried -align for both to assure that memory is 
properly aligned
3) I have tried -fpic to generate the static libarpack-ng
4) I have tried to generate CPU specific code by using 
-march=corei7-avx

ALL the different combinations/flags of trying the programs had the 
same effect: nev=1 sometimes fails (roughly 1 out of 5-10), which is 
PRETTY BAD! None of the above mentioned things made ANY difference.
As of now I cannot say if this is 
1) an arpack bug
2) a lapack/blas bug
3) a C++/fortran interface bug (although I highly doubt it given 
that the pure fortran program also fails)
4) possibly something else

=== REMARKS ===

Remark 1: 
A is actually a superoperator acting on NxN matrices and is similar 
to a completely positive (CP) map. The action of the superoperator 
on a matrix C is defined as A(C) = herm(B)*C*B, with B a given NxN 
matrix describing the superoperator A and herm() the hermitian 
conjugate. By reshaping NxN matrices into vectors of length N*N one 
can represent the superoperator as a MxM matrix, where M=N*N. In my 
case the reshaped superoperator matrix can be constructed as A = 
kron(conj(B) , B), kron is the kronecker tensor product and conj() 
means complex conjugation (in my case I use real matrices for now, 
so A=kron(B,B) really). Operators of this kind are guaranteed that 
their dominant eigenvalue is real and positive.

Comment 1 by Valentin Zauner, Oct 3, 2013

I just saw that a similar issue 852 had been filed some time ago, 
but closed due to the lack of examples:
http://forge.scilab.org/index.php/p/arpack-ng/issues/852/

Comment 2 by Jordi Gutiérrez Hermoso, Oct 3, 2013

Please be patient, I'm digesting this mammoth bug report.

Comment 3 by Edouard Canot, Oct 13, 2013

May be the following discussion helps:
http://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=2&t=3196

Roughly, it explains that if you get ARPACK errors with dlahqr (from 
LAPACK 3.x version), you may obtain correct results by using the olf 
dlahqr version provided with the original ARPACK library, and which 
was from the LAPACK 2.0

Comment 4 by Edouard Canot, Oct 13, 2013

In addition to the previous comment, we can observe that the last 
version of matlab seems to follow the same way...

If we search for the dlahqr symbol in libmwarpack.so, we obtain:
  $ nm libmwarpack.so | grep dlahqr
  00000000000050f0 T dlahqr2_

This result would tend to demonstrate that the Matlab arpack library 
itself do not use the official dlahqr of Lapack, but a renamed one, 
dlahqr2, which is certainly the 2.0 version provided in old Arpack.

Hope this will help.

Comment 5 by Valentin Zauner, Oct 14, 2013

Hi Edouard!

Great, thanks for this insight! I was already suspecting a possible 
change of the dlahqr routine between different LAPACK versions. As 
it appears there was a major change between dlahqr from LAPACK 3.0 
to 3.1 (but possibly also sooner ones), which is probably 
responsible for the reordering of the eigenvalues.

For now I see 2 possible fixes:

1.) Use an old version of dlahqr.f from LAPACK 2.0 or so, rename it 
to dlahqr2 (or similar), change the corresponding calls within 
ARPACK and include dlahqr2 into the ARPACK distro. This would be a 
pity however, because dlahqr in the newer LAPACK is much more 
optimized than the old version.

2.) The workaround that I used: 
I changed the routine dneupd.f quite a bit by reordering the steps 
taken for rvec= .true. :

a) call dlahqr right away.
b) copy the real and imag parts of the obtained eigenvalues 
(workl(iheigr) and workl(iheigi)) over the values still stored from 
dnaupd (workl(irr) and workl(iri))
c) call dngets as before
d) set the select array using the ordering results from dngets but 
DO NOT PERFORM ritz estimate bound checking (they are still in the 
dnaupd - order)

This way the info-flag -15 becomes obsolete, but it appears to me 
one will only get this error if one screws with the workspace 
variables between calling dnaupd and dneupd anyways, so I 
wholeheartedly discarded it.

On the other hand one can use the optimized and newer version of 
dlahqr while only having to do 2 additional (very cheap) memcopys of 
arrays of dimension NCV.

I have attached my edited dneupd routine as dneupd2.f, it is also 
available on https://gist.github.com/Darkdragon84/6728023

cheers and thanks for the clarification!

Comment 6 by Edouard Canot, Oct 14, 2013

I am looking for a matrix (not too small) for which the original 
dneupd routine, linked to the Lapack-3-dlahqr, gives wrong 
results...

Could you give me such a matrix?

Comment 7 by Edouard Canot, Oct 14, 2013

Valentin,

I provide, for a practical reason, this new version of dneupd2.f. It 
is exactly the same algorithm as the previous one, but the two 
utilitaries routines at the end (printmatvec and printmat) have been 
removed. Also, the blank trailing space at some lines have been 
restored, so a diff between dneupd.f and dneupd2.f will reveal only 
your algorithm changes.

I have also added a small comment (at lines 303-304) which show the 
reason of the modification and also your name.

Comment 8 by Valentin Zauner, Oct 15, 2013

Edouard,

Hey thanks for the cleanup and the modifications and yeah, oops, I 
forgot to remove the 2 aux routines at the end... ^^

About a problematic matrix: In my case this current issue is 
actually of stochastic nature. So a matrix that was problematic 
before can be fine after logout/login or reboot or just several 
hours later. I however noticed that the bug tends to show up more 
often for large matrix sizes. If you want you can try the test 
programs that I have supplied in the first comment...

I don't know what causes the different behavior in the new version 
of dlahqr and why it sometimes behaves like the old routine and 
sometimes gives different orderings...

Anyways, I actually think it's a flaw in the original dneupd 
algorithm setup to trust an external routine to always show a 
certain behavior that is nowhere guaranteed.

Thanks for your help and input d_(^^)_b

Comment 9 by Valentin Zauner, Oct 15, 2013

on a side note, I think the newly introduced for-loop from line 
642-646 in the source-code is not of FORTRAN77 standard, but uses

do ...
   statements
end do

instead of

do [label] ...
   statements
[label] continue

idk if backwards compatibility to F77 is wanted. if so, this for 
loop should be edited using a (not yet used) label...

Comment 10 by Edouard Canot, Oct 15, 2013

Valentin,

Keep the do-enddo loop. It is more readable and most (if not all) 
Fortran compilers understand it.

Besides, I have tried your new dneupd2.f routine in my library but, 
at least for one 8x8 matrix, I got the old wrong result.

The matrix is stored in the ASCII attached file: A_8x8.dat

I'm trying to compute the two smallest eigenvalues, in the two 
cases:
  (i) without eigenvectors (rvec=.false.)
  (ii) with eigenvectors (rvec=.true.)

The reference is Matlab-8.2 which gives the expected values:
    2.5347,+ 0.0000i
    2.6160,- 1.2918i

Note that Octave-3.6.4 gives also, curiously enough, the correct 
above values.

Scilab-5.4.1 gives a wrong result (for sure, as it use the original 
dneupd routine):
    2.6160,+ 1.2918i
    2.6160,- 1.2918i

Unfortunately, your dneupd2 routine gives also the wrong result. I 
don't why. Even with rvec=.false. !

Comment 11 by Valentin Zauner, Oct 16, 2013

Hi Edouard,

That is unfortunate :-( I have to admit however that I was only 
considering and testing the case where I need either the eigenvalues 
of largest magnitude ('LM') or largest real part ('LR'). 

When you say smallest eigenvalue do you mean 'SR' or 'SM'? When you 
look at the matlab source code for eigs they actually set for 'SM':
which = 'LM'
sigma = 0
style = 'G' (generalized!!)
mode = 3

and they do sth a little bit different in the IDO = 1 case... isk if 
this is the case you are considering?

Have you checked if the ritz values from dnaupd have converged as 
you wanted them? If they have and dneupd2 still gives you wrong 
results then i suspect that sth is wrong with the dtrsen.f routine 
which select the wanted ritz values.

I will see if I can get the eigenvalues you wanted from the matrix 
you supplied if I have time.

cheers,
Valentin

Comment 12 by Edouard Canot, Oct 16, 2013

In my library also, for not too bad conditionned matrix (this is the 
case for A_8x8), 'SM' implies larger eigenvalues for OP=inv(A).

I verified that all ritz values converged!

Comment 13 by Valentin Zauner, Oct 16, 2013

OK, so does that mean to obtain the smallest eigenvalues you are 
using 'LM' and OP=inv(A) (which is basically 'SM')? Does dtrsen give 
you the correct ordering then? 

Sorry I haven't implemented anything else than 'LR' and 'LM' yet, do 
it's a bit hard for me to check :-P

Comment 14 by Edouard Canot, Oct 16, 2013

>> Does that mean to obtain the smallest eigenvalues you are 
using 'LM' and OP=inv(A) (which is basically 'SM')? Yes

>> Does dtrsen give you the correct ordering then? No idea!

I further checked the first option, which was "using the old 
dlahqr routine, renamed dlahqr2"; for my A_8x8 matrix, calling 
dnaupd/dneupd for smallest eigenvalues, with rvec=.true., gives now 
the good result.

BUT, curiously enough, the same calling sequence with rvec=.true. 
(i.e. no eigenvectors) gives the wrong answer!

Comment 15 by Valentin Zauner, Oct 16, 2013

hmm... you can check the output of dtrsen by setting the appropriate 
debug message level (set the variable mneupd to anything > 2).

And oops, sorry, I actually always meant the routine dngets (which 
orders the eigenvalues the way you want them, e.g. 'SM': smallest 
first) when I said dtrsen (which does a reordering of the 
schur-decomposition done by dlahqr.

So: 
after calling dngets, the debug output should show an array, where 
the wanted eigenvectors are IN THE LAST FEW ENTRIES.

if the desired NCONV eigenvalues are not among the first NCONV 
eigenvalues obtained by dlahqr, the routine dtrsen is called to 
reorder the schur-decomposition. With or without that call, in the 
end the desired eigenvalues should be among the first NCONV 
eigenvalues of the schur decomposition. This can all be checked by 
looking at the debug info (like the .txt files I posted in my first 
comment)

If all this is fine but you still get wrong results, then I suspect 
something later on is going wrong, that is not called when WHICH = 
'LR' or 'LM'. 


>> BUT, curiously enough, the same calling sequence with 
>> rvec=.true. (i.e. no eigenvectors) gives the wrong
>> answer!
Did you mean rvec=.false. or (i.e. WITH eigenvectors), sry I'm 
confused...

cheers

Comment 16 by Edouard Canot, Oct 18, 2013

Sorry, Valentin! I have to say that your 'dneupd2.f' is correct. I 
used it to many tests with my library Muesli.

The problem was the storing of eigenvectors (as in Lapack), and it 
is not so easy when you want to sort them according the module of 
eigenvalues, either in ascending or descending order.

My tests are not yet complete, but I think that all is good. Many 
thanks, I've improved my numerical library!

To the arpack-ng maintainers, I recommend to patch the dneupd() 
routine using the file 'dneupd2.f' of V. Zauner.

Regards, (and sorry again for my wrong news two days ago).

Comment 17 by Valentin Zauner, Oct 18, 2013

Muesli, lol :-)

I'm glad to hear it works in your case as well! That suggests that 
there are no more adaptations to be made to the rest of the routine 
after the selecting of the wanted eigenpairs.

I suspect a similar adaptation should be made in all _neupd.f 
routines.

cheers,
Valentin

Comment 18 by Jordi Gutiérrez Hermoso, Oct 24, 2013

I'm evidently falling behind applying this patch. Would either of 
you feel competent enough to handle git and apply the patch 
yourself? I'm asking, do either of you want to become an arpack-ng 
maintainer? If Sylvester agrees, I think we could use more help.

Comment 19 by Sylvestre Ledru, Oct 24, 2013

I would be very happy to see new maintainers in arpack :)

Edouard, do you want me to give you permissions on the git 
repository ?
I can help you with the system if you want.

Comment 20 by Edouard Canot, Oct 25, 2013

Jordi and Valentin, I think we have to delayed the patch of dneupd. 
Yesterday, I have found a matrix for which dneupd failed, alas :-(

I must check whether it comes from my library or from dneupd.

Sylvestre, concerning the fact to become a new Arpack maintainer, I 
would like to accept but for the moment I cannot :-(

Comment 21 by Valentin Zauner, Oct 29, 2013

hi guys, sorry for the delay.

Thanks for the offer as a maintainer! I however also have to refuse 
as I have only very rudimentary knowledge of fortran and my time 
also doesn't really permit it :-(

I am however happy to help clarify further issues with dneupd

Edouard: the debugging information that arpack provides was really 
very helpful for me in finding out the culprit. Maybe it helps you 
also to see what goes wrong for your special case?

cheers

Comment 22 by Jordi Gutiérrez Hermoso, Oct 30, 2013

You two have already spent far more time debugging this issue than I 
have ever spent in my entire time as a maintainer of ARPACK.

All this offer gives you, really, is the right to upload a fix once 
you have it ready, and it comes with absolutely no obligations other 
whatever you decide to impose on yourself.

Comment 23 by Sylvestre Ledru, Nov 6, 2013

Edouard, have you been able to make some progress on the patch ? 
Thanks

Comment 24 by Chris cmiller730, Feb 17, 2014

Sorry to be late to the party. I'm trying to write Haskell wrappers 
to arpack and beleive that I'm encountering this issue as well. I'm 
just doing a sanity check with a diagonal matrix (using the 
non-symmetric routine),

A_ii = (i+1)  0 <= i < 1000.

Here's a summary of my ARPACK params

bmat = 'I'
N = 1000
which = "LM"
nev = 9
tol = 0
ncv = 2*nev + 1
rvec = 1

The computed eigenvalues should be 1000, 999,...,992. I'm getting

21.240820
209.280818
491.134969
1000.000000
999.000000
998.000000
997.000000
996.000000
995.000000

The failure mode appears to depend on nev. For nev = 1..8 I get the 
expected results, breaks for 9 as described,  nev = 10 returns 
[999,998,..990] (missing 1000), 11 fails in a manner similar to 9, 
15 returns correctly. Sorry I got carried away. By induction it's 
broken ;). All of these examples return ido = 99 info = 0

As others have reported I also always get correct results when rvec 
= 0. I'm attaching c code for this example.

Comment 25 by Chris cmiller730, Feb 17, 2014

I now also confirm that the above fix to dneupd.f corrects the 
problem at least in this simple test case. 

What would be involved in patching this? Would updates to _neupd be 
enough?

Comment 26 by Sylvestre Ledru, Feb 17, 2014

Well, to be clear, I don't have the skills to fix this bug I will be 
happy to apply a merge and use the various tests cases as unit tests.

Comment 27 by Chris cmiller730, Feb 17, 2014

I don't really feel like I have the skills either. It's been a long 
time since I've seriously looked at Fortran code past "how do I 
call it from C?"

I guess I have three concerns

1) The problem appears to be related to a change in the way lapack 
handles things. Would the fix break people using older lapack? Does 
anyone care?

2) I don't really have a full grasp of what exactly the problem is 
so I'm hesitant to apply someone else's fix. 

3) Unless I'm missing something, the test coverage in arpack-ng 
seems a little sparse. I'd worry that I'd break something down the 
line.

If I have time and can get a better handle on what's going wrong 
I'll try and make a patch and write a few verification tests to send 
to the maintainers.

I've setup a fork at https://github.com/cmiller730/arpack-ng to work 
on this in case anyone's interested in my progress.

Comment 28 by Tim Mitchell, Mar 5, 2014

Issue 1315 just caught my attention as it sounds very similar to 
issue 664 which I discovered and patched a couple years ago and that 
issue 1315 could be related to issue 667 which I also reported.

Having read through the entire thread, I’m not 100% convinced there 
really is a bug here because there is more than one issue here being 
conflated together, thus making it difficult to determine what is 
causing what. 

First, the business of migrating to LAPACK 3.0 or later routines 
seems a very premature step.  It seems part of the “bug" was 
caused by overlooking ARPACK instructions to use the included LAPACK 
2.0 routines, which I’d say is user-error.  The “bug" should 
first be replicated with the LAPACK routines included with ARPACK, 
not any other version.  Migrating to a newer LAPACK version may be a 
good idea long-term but feature enhancement should be a completely 
separate issue from determining whether this bug exists, its cause, 
and its solution, unless we are certain that 2.0 routines are 
somehow at fault.  So before moving to the much modified dneupd2, I 
think it would be best to focus on making sure this bug truly exists 
with the regular dneupd and LAPACK 2.0 routines.

With that in mind, the next very important issue is whether the demo 
codes are properly giving ARPACK a fixed initial vector.  The fact 
that bug seems to only occur sometimes for a given matrix likely 
indicates that the ARPACK is generating a random initial vector 
every call, which would explain why Valentin Zauner is reporting 
that the problem doesn’t always happen every time.  I just skimmed 
Valentin Zauner’s demo codes and it looks like that indeed INFO is 
set to 0, thus telling ARPACK to ignore whatever is in RESID and to 
instead generate its own random initial vector every time it is 
called.  See lines 123-127 of the comments for dnaupd.f for RESID 
and INFO:

http://forge.scilab.org/index.php/p/arpack-ng/source/tree/master/SRC/
dnaupd.f

I did this very quickly so perhaps someone could double-check my 
source-code skimming?  I am pretty confident that if RESID is 
initially set to a vector of ones (or any other fixed set of values) 
and INFO is set to 1, then the bug will either always occur or it 
always won’t.  

So I think the number one priority moving forward should be to 
create a demo code that generates a random vector for RESID with 
INFO := 1 and then stores RESID to disk before calling ARPACK.  That 
way, if the bug then occurs with that initial vector, we can load up 
the copy of the initial RESID vector on disk and can replicate the 
bug every time with a specific matrix and that specific initial 
vector.

Krylov methods are of course sensitive to the choice of initial 
vector and it could be that sometimes just a bad initial vector is 
randomly generated and as a result, not all the desired Ritz values 
converge, which would be normal.  With a really bad choice of 
initial vector (at least non-zero though… since zero would indeed be 
very very bad), ARPACK may not even have a single Ritz value 
converge before hitting its max iteration count.  To confirm that we 
have a bug, we need to confirm that indeed ARPACK is returning 
undesired / not-converged Ritz values with eigenvectors when in 
fact, ARPACK has already computed more desired converged Ritz values 
that it isn’t returning for some reason.  It’s not yet clear to me 
that this is really happening and we can’t do that test unless we 
have a fixed initial vector and matrix that supposedly replicates 
the problem. 

To sum up, my advice at this point is the following:

1) Use only the LAPACK routines included with ARPACK for the timing 
being and not the modified dneupd2

2) obtain a fixed example matrix A and a fixed initial vector x0 
that replicates the problem every time

3) then with this A and x0, verify whether Matlab replicates the 
same behavior or not.  At this point, I doubt the claim that the 
problem doesn’t occur in Matlab only because the initial vector and 
LAPACK routines don’t seem to have been adequately controlled for.

If after all that, we have a matrix A and vector x0 which causes a 
problem with ARPACK-NG but in Matlab, then I think we’ll have 
verified that something fishy is going on here and we can start 
looking into whether ARPACK is not properly returning the desired 
set of converged Ritz values and corresponding eigenvectors.  

Unfortunately, I don’t have the time at the moment to do the 
investigative legwork at the moment, but if someone else here can 
follow up on getting a concrete problematic example generated that 
recreates the problem every time, I’d be happy to consult on it and 
time permitting, get more involved then.

Comment 29 by Valentin Zauner, Mar 5, 2014

Hi Tim,

Thank you for your detailed analysis of what you choose not to call 
a bug. If indeed the ARPACK-NG documentation specifically points out 
to use included LAPACK 2.0 routines then I agree with you that the 
problem I described is user induced, although somewhat finicky.

Could you please point out to me where exactly this is stated (and 
it has to be in the NG documentation, not the "old" ARPACK 
one!).

Even if it is so, the LAPACK routine that causes the problem 
(dlahqr) is actually not meant to be called directly by the user 
from what I can tell from the LAPACK documentation. So we could 
continue to be finicky along that direction...

Let me elaborate on your other points:

1) This has already been suggested as the simplest bugfix

2) Actually this is of course what I tried to supply in the 
beginning, as of course it would greatly help to have a set of input 
data to ALWAYS reproduce the error. The problem however is the 
following: 

Let's say you give me a fixed matrix A. I run my code many times 
with a random initial vectors x0 until I see the problem occuring. I 
save A and the problematic x0.

If I then load x0 again in the next run and use it as an initial 
vector I can reproduce the problem many times over and over (I tried 
all this!).

If however in the meantime I run some other program, close/open some 
programs or reboot the machine, the same x0 which caused problems 
before is now fine. So if I then give you the A and x0 that I saved 
you will probably not be able to reproduce the problem.

I suspect that the routine dlahqr also uses some stochastic 
algorithm and somehow this gets messed up by 
opening/closing/rebooting etc. Anyways there is some undtable 
behavior and that's trouble enough.

3) Every problematic x0 I double checked with MATLAB. MATLAB has 
NEVER EVER failed a single time (the considered matrices were well 
behaved). I don't know exactly what you want to say in 3) but as 
stated above somewhere, MATLAB actually also commits what you call a 
user-error by using a current release of LAPACK and just using the 
LAPACK 2.0 version of dlahqr. In this sense, MATHWORKS has also 
found and fixed this bug and confirms that the routine dlahqr is the 
culprit.

on a side remark: I have confirmed that the requested ritz values 
have converged but ARPACK just returns the wrong eigenpairs by 
supplying debug traces (if you care you can skim through them at 
https://gist.github.com/Darkdragon84/6728023).

To conclude: 
- even if ARPACK tells you to use LAPACK 2.0 I think this is way out 
of date and users should be at least SPECIFICALLY pointed to that 
issue.
- I actually just wanted to check if anyone else experienced this 
problem and alongside supplied my own (admittedly) dirty hack to fix 
the problem. It works fine for me, I never requested it to be 
included in the library. Whoever likes it can use it, if not, well 
then not :-)

Hope this clarifies a few things, if not, let's exchange another few 
hundred lines ;-)

cheers,
Valentin

Comment 30 by Chris cmiller730, Mar 5, 2014

If it says only use the included lapack that's fine. But I didn't 
notice it anywhere in the README. What I did read in the README said 


4. Unlike ARPACK, ARPACK-NG is providing autotools based build 
system.
   Therefor, the classical:
   $ ./configure
   $ make
   $ make install
should work as expected.

If nothing else the build script should be changed so it doesn't 
link with the user's LAPACK and/or the README should be updated.

I'm also 99.9999% sure the problem has nothing to do with 
sensitivity to starting vectors.

1) ARPACK reports convergence when the results clearly violate the 
convergence criterion. Even if the failure is related to starting 
vectors this would still be a bug.

2) The results are always wrong for me in exactly the same ways. 
I've also tried seeding with a vector of all ones and and info = 1 I 
get the same incorrect results.

3) It's failing on matricies it has no buisness failing on 
(Diagonal) no matter what the starting vector is.

I don't have matlab but I've verified that the matrix from my trial 
script works with scipy's sparse.linalg.eigs.

Comment 31 by Tim Mitchell, Mar 5, 2014

Hi Valentin and Chris,

Thanks for the quick reply and providing additional details.  To 
clarify my stance, I wasn’t stating that what you've reported is not 
a bug or that is a bug.  All I was claiming is that it wasn’t really 
clear what was being reported.  Given the additional replies by you 
both, it does sound like something wrong is going on.  

Concerning the initial vector issue, it also wasn’t clear to me from 
the codes provided here that is was accounted for, since INFO was 
set to 0 and I didn’t see any code relevant to storing or uploading 
a specific initial vector in either Valentin’s or Chris’s demo 
codes.  It’s good to know that they are both reporting that problems 
occur even with a vector of 1’s as input.  However, even with a 
fixed initial vector of 1’s, does the problem happen every time for 
you Chris or is it intermittent like Valentin reports?  If it’s 
intermittent, can you at least get to occur “on demand” quickly by 
calling ARPACK-NG with the same matrix and initial vector in a loop 
over and over again?  

Concerning using the newer LAPACK routines, is it confirmed that 
this bug you only happens when using the newer 3.x versions of 
LAPACK and not the older 2.0 release that ARPACK included with it 
and that ARPACK-NG stripped out of its source tree?   I could have 
sworn I had seen something regarding only using certain version of 
routines in LAPACK with ARPACK but this was long time ago and I 
can’t find it now so maybe I am mistaken.  So yeah, my use of 
“user-error” was probably too harsh.  Still, I think expecting 
ARPACK[-NG] to work fine with newer major version updates of LAPACK 
is on a plane of optimism that I neither possess nor even imagine 
having.  :)  It’s been nearly 20 years (!) since ARPACK was first 
released and it hasn’t been maintained by the original authors for a 
/long/ time.  

If it’s just the newer LAPACK routines that are inducing the problem 
in ARPACK-NG, priority number one should be instructing ARPACK-NG 
users and distros to only link to LAPACK 2.0 (or the most recent 
non-problematic release) for the time being until a new official 
update of ARPACK-NG is ready to be released with compatibility with 
the latest LAPACK versions.   Certainly correctness of results are 
more important than any speed gains given by newer LAPACK code.

Comment 32 by Roland Wirth, May 14, 2014

> I could have sworn I had seen something regarding only using 
certain version of routines in LAPACK with ARPACK but this was long 
time ago and I can’t find it now so maybe I am mistaken.

The passage you are referring to is in the README of the original 
ARPACK package. There, they strongly recommend that you use the 
bundled LAPACK if the system LAPACK is not 2.0. They also state that 
the current LAPACK was 3.0 at the time of writing. So, it might be 
that you have rediscovered a long-forgotten incompatibility.

Comment 33 by Tim Mitchell, May 14, 2014

Hi Roland,

Thank you so much for reminding me where I had seen that warning 
concerning using LAPACK 2.0 instead of 3.0 with ARPACK.  I knew I 
had seen it somewhere but I was rather swamped with other things 
last time I checked in here to go digging around for it.

It seems the proper course of action moving forward is to put the 
2.0 LAPACK code included with the official ARPACK back into the 
arpack-ng tree, such that it is always linked to upon compiling, 
along with new documentation warnings that specifically detail this 
incompatibility with LAPACK 3.0.  I'd recommend also putting such a 
comment/warning in the makefile so no one in the future falls into 
the same trap. 

Once that is done, I believe issue 1315 should then be changed from 
"type defect" to "type enhancement" and when 
someone in the future has time and motivation, the arpack-ng code 
can be updated in a future release to properly support LAPACK 3.0, 
perhaps by following the work Valentin has already done on that 
front.

Thanks to everyone for bringing attention to this regression that 
had been compared to the official ARPACK.

Sylvestre or other admins, can you make these updates?

Comment 34 by Jordi Gutiérrez Hermoso, May 14, 2014

This isn't an "unofficial" ARPACK. We contacted the ARPACK 
developers telling them that we were doing this, and they agreed to 
it (well, they didn't disagree to it, they just never really 
responded except with promises to maintain ARPACK, which never 
happened).

We shouldn't be packaging our own LAPACK either. This isn't a real 
fix. The whole point of creating arpack-ng in the first place was to 
avoid shipping embedded libraries. The real fix is to figure out how 
to make ARPACK work with LAPACK other than 2.0.

Comment 35 by Valentin Zauner, May 15, 2014

I agree with Jordi. 

As I gather, the point of arpack-ng is exactly to provide more than 
the (not well maintained) original ARPACK, i.e. also ensure 
compatibility with current libraries. In our case this is just 
LAPACK and BLAS. I understand that this might be a cumbersome task, 
but otherwise I see no point in arpack-ng in the first place. Sorry 
if this might sound a bit harsh, but I think otherwise arpack-ng 
would just be the usual ARPACK with a nice make environment...

Comment 36 by Sylvestre Ledru, May 15, 2014

Clearly, we won't be shipping a copy of LAPACK in Arpack-ng.

However, I will be happy to apply any patch to support LAPACK 3.0 
(but I won't be the one doing the work, sorry).

Comment 37 by Valentin Zauner, May 15, 2014

OK, this rules out the MATLAB approach, where they use a compiled 
current LAPACK library, where just the troublemaker routine is 
replaced with the old one from the 2.0 release.

Comment 38 by Sylvestre Ledru, May 15, 2014

Valentin, it is just about one file, we can take it as a workaround. 


I don't know if it is possible with Fortran but a load on the fly 
could be a workaround...

Comment 39 by Valentin Zauner, May 15, 2014

By "one file", you mean the ARPACK fortran file, i.e. 
dneupd.f? 

I actually strongly suspect that the same issue will also arise for 
all the other drivers (i.e. single/double real/complex 
symmetric/non-symmetric) that compute eigenvectors. One would have 
to test all of these and apply the corresponding workaround.

I haven't done this since in my application I only need the double 
real non-symmetric routine.

Comment 40 by Tim Mitchell, May 15, 2014

Valentin, yes, I am 99.999% certain that the same issue will arise 
for all the other drivers.  However, I believe Sylvestre was 
suggesting to only take the one LAPACK 2.0 file dlahqr and linking 
to to just that specific version for arpack-ng, possibly 
dynamically.  I'd be fine with that approach as a temporary 
workaround (or by taking the Matlab approach which presumably 
includes it as dlahqr2 as Edouard has suggested). Is that what you 
meant Sylvestre?

I agree with everyone that it'd be great for arpack-ng to push 
forward and support the newer LAPACK releases.  However, until those 
improvements are ready and thoroughly tested, I feel it is prudent 
to provide a temporary workaround that deals with this 
incompatibility with newer LAPACK routine(s).  Though I admit it's a 
bit of compromise to include and/or link to specific 2.0 versions as 
necessary, the alternative of having arpack-ng users live with some 
fairly erroneous results seems a worse trade-off, particularly since 
it seems like it will be some time before compatibility with LAPACK 
3.0 is fully handled in arpack-ng.

Valentin, as to the point of arpack-ng, I completely disagree that 
without feature enhancements that arpack-ng would just be regular 
ARPACK with a nice make environment.  The official ARPACK tree is 
not being maintained /at all/ and as a result, the bug reports and 
fixes were that have cropped up over the years weren't being 
organized or compiled into one centralized source.  I personally am 
very grateful that arpack-ng exists so that there is now one 
centralized and organized repository for all ARPACK related bugs and 
fixes to be available from.  Even if arpack-ng intended to only be 
just such a maintenance release, that's still a huge benefit to the 
community.

Comment 41 by Viral Shah, May 25, 2014

Having the erroneous answer is certainly the worse solution. It 
would be really great if we can include DLAHQR2 as part of 
arpack-ng, so that users get a correct working solution out of the 
box.

Comment 42 by Johannes Jünemann, Jun 11, 2014

Hello everybody, I am encountering a problem that is related to your 
discussion. I am trying to find the 'LR' eigenvalues of given 
matrices (with real eigenvalues!) using the dnaupd/dneupd methods, 
however, the lowest eigenvalues do not always appear in the right 
order. Below a typical matrix example:

A = RESHAPE((/ 0d0 , 1d-4, 1d-4, -1d0,
 &             1d-4, 0d0, -1d0, 1d-4,
 &             1d-4, -1d0, 0d0, 1d-4,
 &             -1d0, 1d-4, 1d-4, 0d0
 &      /), (/4,4/) )

When calling the method repeatedly, the ordering of the two-lowest 
EV (-1.0002 resp. -0.9998) goes wrong in some cases. I used the 
arpack_test_f program (as provided by Valentin Zauner) to 
double-check and tried both the fix for the dlahqr.f and using 
dneupd2.f, but none of the two changes the problem. The ordering (at 
least for this particular example) goes right when choosing 
'LR','SM' or 'LM'.

I would be very glad if one of you could confirm this problem in 
oder to know if the issue should have been fixed by your patches 
(and I didn't follow the instructions through completely) or if this 
is an unrelated error.

Thanks for your help, Johannes

Comment 43 by Johannes Jünemann, Jun 11, 2014

I am sorry, the second sentence was supposed to read:

I am trying to find the 'SR' eigenvalues of given 
matrices (with real eigenvalues!) using the dnaupd/dneupd methods...

Comment 44 by Valentin Zauner, Jun 11, 2014

I have never used the 'SR' mode, maybe some additional stuff happens 
when you call this? there should be some shift-and-invert type of 
operation involved, maybe this screws it up?

Also, your matrix is symmetric and you should use the symmetric 
driver dsaupd/dseupd to begin with...

Comment 45 by Johannes Jünemann, Jun 12, 2014

Thank you for the very quick response. I am using the non-symmetric 
libraries, as I apply the procedure to a set of different (not 
necessarily) symmetric matrices.

The problem persists when slightly changing the matrix entries as to 
make it non-symmetric.

Comment 46 by Marco Caliari, Feb 12, 2015

Dear all,

I had similar problems with rvec = .true.
It was not clear to me how the Ritz values (computed, sorted, 
selected, ...) by dnaupd were related with the eigenvalues computed 
by the famous dlahqr.f. Till today, when I saw in 
arpack/src/dlaqrb.f

c  This is mostly a modification of the LAPACK routine dlahqr.

Now it is clear: they are related because computed in the same way 
(by the same routine). So, I compared dlaqrb.f with dlahqr.f (2.0) 
and made the same small changes to dlahqr.f (3.5.0) and called the 
result dlaqrb35.f (attached). All my problems disappeared! If you 
all confirm that it works also for you, I can continue with the job 
(modify the description, the comments and the other routines).

Comment 47 by Marco Caliari, Feb 12, 2015

the file

Comment 48 by Valentin Zauner, Feb 12, 2015

Hi!

Tracking the changes back to within the auxiliary routine is 
probably a better way :-) I will see if it works for me when I have 
the time! Thanks for your work!

Cheers,
Valentin

Comment 49 by Sylvestre Ledru, Feb 12, 2015

This bug is fixed in github:
https://github.com/opencollab/arpack-ng/issues/3

FYI, this forge is no longer used.

Comment 50 by Sylvestre Ledru, Feb 12, 2015

Status: Fixed

Comment 51 by Valentin Zauner, Feb 12, 2015

So does that mean that the whole issue is now fully fixed in the 
current 3.1.5 release? Also for other drivers such as dseupd, 
zneupd, etc?

Comment 52 by Sylvestre Ledru, Feb 12, 2015

Please ask your question in github.
This project is only here for legacy purposes.

Created: 4 years 2 months ago by Valentin Zauner

Updated: 2 years 10 months ago

Status: Fixed

Followed by: 9 persons

Labels:
Type:Defect
Priority:Medium

This issue duplicates
1397 - Not...t with diagonal matrices