vectorization - Page 2 - Developer IT

Problem: Vectorizing Code with Intel Visual FORTRAN for X64

- by user313209

I'm compiling my Fortran 90 code with Intel Visual Fortran on Windows Server 2003 Enterprise x64 Edition. When I compile for the 32-bit architecture with the automatic and manual vectorization options, the code compiles and is reported as vectorized. When I run it on an 8-core system, it uses about 70% of the CPU, which tells me that vectorization is working. But when I compile with the 64-bit compiler, it reports that the code is vectorized, yet at run time CPU usage is only about 12%, which is full usage of one core out of 8. So although the compiler says the code is vectorized, vectorization does not appear to be working. This is strange to me because this is an x64 edition of Windows and I expected the opposite result: I thought code compiled for the 64-bit architecture would run better on 64-bit Windows. Does anyone have any idea why the 64-bit build cannot use the full power of multiple cores? Thanks in advance for your responses.


Faster method for Matrix vector product for large matrix in C or C++ for use in GMRES

- by user35959

I have a large, dense matrix A, and I want to solve the linear system Ax=b using an iterative method (the plan was to use MATLAB's built-in GMRES). With more than 10,000 rows, A is too large for my computer to store in memory, but I know that its entries are constructed from two known vectors x and y of length N and satisfy:

    A(i,j) = .5*((x[i]-x[j])^2+(y[i]-y[j])^2) * log((x[i]-x[j])^2+(y[i]-y[j])^2)

MATLAB's GMRES command accepts as input a function that computes the matrix-vector product A*x, which lets me handle larger matrices than I can store in memory. To write the matrix-vector product function, I first tried this in MATLAB by going row by row with some vectorization, without ever forming the entire array A (since it would be too large). Unfortunately, this was fairly slow in my GMRES application.

My plan was then to write a MEX file for MATLAB in C, which ideally should be significantly faster than the MATLAB code. I'm rather new to C, so this went rather poorly: my naive attempt in C was slower than my partially vectorized attempt in MATLAB.

    #include <math.h>
    #include "mex.h"

    void Aproduct(double *x, double *ctrs_x, double *ctrs_y, double *b, mwSize n)
    {
        mwSize i;
        mwSize j;
        double val;

        for (i=0; i<n; i++) {
            /* entries left of the diagonal */
            for (j=0; j<i; j++) {
                val = pow(ctrs_x[i]-ctrs_x[j],2)+pow(ctrs_y[i]-ctrs_y[j],2);
                b[i] = b[i] + .5* val * log(val) * x[j];
            }
            /* entries right of the diagonal; j == i is skipped */
            for (j=i+1; j<n; j++) {
                val = pow(ctrs_x[i]-ctrs_x[j],2)+pow(ctrs_y[i]-ctrs_y[j],2);
                b[i] = b[i] + .5* val * log(val) * x[j];
            }
        }
    }

The above is the computational portion of the code for the MATLAB MEX file (which is slightly modified C, if I understand correctly). Please note that I skip the case i=j, since in that case the entry would be 0*log(0), which should be interpreted as 0 for me, so I just skip it.

Is there a more efficient or faster way to write this? When I call this C function via the MEX file in MATLAB, it is quite slow, slower even than the MATLAB method I used. This surprises me, since I expected C code to be much faster than MATLAB. The alternative, partially vectorized MATLAB method that I am comparing it with is:

    function Ax = Aprod(x,ctrs)
    n = length(x);
    Ax = zeros(n,1);
    for j=1:(n-3)
        v = .5*((ctrs(j,1)-ctrs(:,1)).^2+(ctrs(j,2)-ctrs(:,2)).^2).*log((ctrs(j,1)-ctrs(:,1)).^2+(ctrs(j,2)-ctrs(:,2)).^2);
        v(j)=0;
        Ax(j) = dot(v,x(1:n-3));
    end

(The n-3 is because there are actually 3 extra components, but they are dealt with separately, so I excluded that code.) This is partly vectorized and only needs one for loop, so it makes some sense that it is faster. However, I was hoping I could go even faster with a C MEX file.

Any suggestions or help would be greatly appreciated! Thanks!

EDIT: I should be more clear. I am open to any faster method that can help me use GMRES to solve the system with this matrix, which requires a faster way of doing the matrix-vector product without explicitly loading the array into memory. Thanks!


rotating bitmaps. In code.

- by Marco van de Voort

Is there a faster way to rotate a large bitmap by 90 or 270 degrees than simply doing a nested loop with inverted coordinates?

The bitmaps are 8bpp and typically 2048*2400*8bpp.

Currently I do this by simply copying with argument inversion, roughly (pseudo code):

    for x = 0 to 2048-1
      for y = 0 to 2048-1
        dest[x][y]=src[y][x];

(In reality I do it with pointers, for a bit more speed, but that is roughly the same magnitude.)

GDI is quite slow with large images, and GPU load/store times for textures (GF7 cards) are in the same magnitude as the current CPU time.

Any tips, pointers? An in-place algorithm would be even better, but speed is more important than being in-place.

Target is Delphi, but it is more an algorithmic question. SSE(2) vectorization is no problem; it is a big enough problem for me to code it in assembler.

Duplicates: How do you rotate a two dimensional array?

Follow up to Nils' answer

Image: 2048x2700 -> 2700x2048
Compiler: Turbo Explorer 2006 with optimization on.
Windows: Power scheme set to "Always on". (important!!!!)
Machine: Core2 6600 (2.4 GHz)

    time with old routine: 32ms (step 1)
    time with stepsize 8 : 12ms
    time with stepsize 16 : 10ms
    time with stepsize 32+ : 9ms

Meanwhile I also tested on an Athlon 64 X2 (5200+ iirc), and the speed up there was slightly more than a factor four (80 to 19 ms).

The speed up is well worth it, thanks. Maybe during the summer months I'll torture myself with an SSE(2) version. However, I already thought about how to tackle that, and I think I'll run out of SSE2 registers for a straight implementation:

    for n:=0 to 7 do
      begin
        load r0, <source+n*rowsize>
        shift byte from r0 into r1
        shift byte from r0 into r2
        ..
        shift byte from r0 into r8
      end;
    store r1, <target>
    store r2, <target+1*rowsize>
    ..
    store r8, <target+7*rowsize>

So 8x8 needs 9 registers, but 32-bits SSE only has 8.
Anyway that is something for the summer months :-)

Note that the pointer thing is something that I do out of instinct, but it could be there is actually something to it: if your dimensions are not hardcoded, the compiler can't turn the mul into a shift. While muls as such are cheap nowadays, they also generate more register pressure afaik.

The code (validated by subtracting the result from the "naive" rotate1 implementation):

    const stepsize = 32;

    procedure rotatealign(Source: tbw8image; Target:tbw8image);
    var
      stepsx,stepsy,restx,resty : Integer;
      RowPitchSource, RowPitchTarget : Integer;
      pSource, pTarget,ps1,ps2 : pchar;
      x,y,i,j: integer;
      rpstep : integer;
    begin
      RowPitchSource := source.RowPitch; // bytes to jump to next line. Can be negative (includes alignment)
      RowPitchTarget := target.RowPitch;
      rpstep:=RowPitchTarget*stepsize;
      stepsx:=source.ImageWidth div stepsize;
      stepsy:=source.ImageHeight div stepsize;
      // check if mod 16=0 here for both dimensions, if so -> SSE2.
      for y := 0 to stepsy - 1 do
        begin
          psource:=source.GetImagePointer(0,y*stepsize); // gets pointer to pixel x,y
          ptarget:=Target.GetImagePointer(target.imagewidth-(y+1)*stepsize,0);
          for x := 0 to stepsx - 1 do
            begin
              for i := 0 to stepsize - 1 do
                begin
                  ps1:=@psource[rowpitchsource*i]; // ( 0,i)
                  ps2:=@ptarget[stepsize-1-i]; // (maxx-i,0);
                  for j := 0 to stepsize - 1 do
                    begin
                      ps2[0]:=ps1[j];
                      inc(ps2,RowPitchTarget);
                    end;
                end;
              inc(psource,stepsize);
              inc(ptarget,rpstep);
            end;
        end;
      // 3 more areas to do, with dimensions
      // - stepsy*stepsize * restx // right most column of restx width
      // - stepsx*stepsize * resty // bottom row with resty height
      // - restx*resty // bottom-right rectangle.
      restx:=source.ImageWidth mod stepsize; // typically zero because width is
                                             // typically 1024 or 2048
      resty:=source.Imageheight mod stepsize;
      if restx>0 then
        begin
          // one loop less, since we know this fits in one line of "blocks"
          psource:=source.GetImagePointer(source.ImageWidth-restx,0); // gets pointer to pixel x,y
          ptarget:=Target.GetImagePointer(Target.imagewidth-stepsize,Target.imageheight-restx);
          for y := 0 to stepsy - 1 do
            begin
              for i := 0 to stepsize - 1 do
                begin
                  ps1:=@psource[rowpitchsource*i]; // ( 0,i)
                  ps2:=@ptarget[stepsize-1-i]; // (maxx-i,0);
                  for j := 0 to restx - 1 do
                    begin
                      ps2[0]:=ps1[j];
                      inc(ps2,RowPitchTarget);
                    end;
                end;
              inc(psource,stepsize*RowPitchSource);
              dec(ptarget,stepsize);
            end;
        end;
      if resty>0 then
        begin
          // one loop less, since we know this fits in one line of "blocks"
          psource:=source.GetImagePointer(0,source.ImageHeight-resty); // gets pointer to pixel x,y
          ptarget:=Target.GetImagePointer(0,0);
          for x := 0 to stepsx - 1 do
            begin
              for i := 0 to resty- 1 do
                begin
                  ps1:=@psource[rowpitchsource*i]; // ( 0,i)
                  ps2:=@ptarget[resty-1-i]; // (maxx-i,0);
                  for j := 0 to stepsize - 1 do
                    begin
                      ps2[0]:=ps1[j];
                      inc(ps2,RowPitchTarget);
                    end;
                end;
              inc(psource,stepsize);
              inc(ptarget,rpstep);
            end;
        end;
      if (resty>0) and (restx>0) then
        begin
          // another loop less, since only one block
          psource:=source.GetImagePointer(source.ImageWidth-restx,source.ImageHeight-resty); // gets pointer to pixel x,y
          ptarget:=Target.GetImagePointer(0,target.ImageHeight-restx);
          for i := 0 to resty- 1 do
            begin
              ps1:=@psource[rowpitchsource*i]; // ( 0,i)
              ps2:=@ptarget[resty-1-i]; // (maxx-i,0);
              for j := 0 to restx - 1 do
                begin
                  ps2[0]:=ps1[j];
                  inc(ps2,RowPitchTarget);
                end;
            end;
        end;
    end;
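Since this is more of an algorithmic question, the same block-wise idea can be sketched compactly in C. This is an illustration under assumptions, not a port of the routine above: rotate90_tiled, min_sz and TILE are hypothetical names, the pitches are unsigned (so the negative-pitch case mentioned in the Delphi comments is not covered), and the three remainder areas are handled by clamping each tile to the image edge instead of by separate loops. The mapping is the same as in rotatealign: source pixel (x, y) ends up at target (height-1-y, x).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TILE = 32 };  /* block edge; mirrors stepsize = 32 above */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Rotate an 8bpp image 90 degrees clockwise, one TILE x TILE block at a time.
   src is w pixels wide and h pixels high with src_pitch bytes per row;
   dst is h pixels wide and w pixels high with dst_pitch bytes per row. */
void rotate90_tiled(const uint8_t *src, size_t w, size_t h, size_t src_pitch,
                    uint8_t *dst, size_t dst_pitch)
{
    size_t ty, tx, y, x;

    assert(src != dst);  /* this sketch is not an in-place algorithm */

    for (ty = 0; ty < h; ty += TILE) {
        const size_t y_end = min_sz(ty + TILE, h);      /* clamp partial tile */
        for (tx = 0; tx < w; tx += TILE) {
            const size_t x_end = min_sz(tx + TILE, w);  /* clamp partial tile */
            for (y = ty; y < y_end; y++) {
                const uint8_t *s = src + y * src_pitch; /* source row y */
                uint8_t *d = dst + (h - 1 - y);         /* target column h-1-y */
                for (x = tx; x < x_end; x++)
                    d[x * dst_pitch] = s[x];            /* (x,y) -> (h-1-y,x) */
            }
        }
    }
}
```

Keeping both the reads and the strided writes inside a small block is the same reason the stepsize variants above beat the step-1 routine; the best TILE value is machine dependent, as the timings suggest.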
