Registered Member
|
Here is an observation from implementing a BFGS nonlinear optimiser. It is just prototype code and uses VectorXd and MatrixXd as intermediate types.
The prototype basically does the following: 1) compute the numerical Hessian, 2) get its inverse via ldlt().solve(Identity(n, n)), 3) enter the BFGS loop, applying the inverse BFGS update and a matrix multiplication to get the next location. I am doing some preliminary tests on Ubuntu with GCC 4.4. When I compile with gcc -O3, solving a 100-variable extended Rosenbrock problem takes about 1.5 s, but gcc -O3 -msse takes about 35 seconds. When I have time I will try to reduce it to a test case. In the meantime, does anyone know what the problem might be? |
Moderator
|
With -msse only, Eigen does nothing different; we start to vectorize with -msse2. Also, are you sure you're not already running a 64-bit system? (uname -a will tell you)
|
Registered Member
|
32-bit. With -msse2 the time dropped by almost 50% (as expected)! Any idea why -msse would actually slow down the performance? (Actually this question is unimportant to me now.)
|
Registered Member
|
The real question is: why would you select Katmai as the target architecture?
To be honest, it is hard to answer <your question> without proper profiling and further knowledge of the hardware, optimization flags, etc., but my guesses are:
- Eigen (as has been mentioned) does not vectorize when the target instruction set is SSE(1).
- I presume -msse instructs gcc to select the PIII (Katmai?) as the target architecture, and it is known that tuning for the PIII has an adverse effect on instruction flow and execution-unit utilization on newer processors (probably even more adverse than tuning for Willamette and running on Penryn or an older architecture).
If you are really interested, do event-based profiling on both binaries and check the generated assembly; that should point you in the right direction. |