optimization level - Page 7

Objective-C optimization

- by SpecialK

Are there standard optimization tricks for Objective-C to make for faster execution along the lines of "inlining" frequent methods as in C++ or the "g++ -fast" tag?

Performance Optimization for Matrix Rotation

Hello everyone: I'm now trapped by a performance optimization lab in the book "Computer System from a Programmer's Perspective" described as following: In a N*N matrix M, where N is multiple of 32, the rotate operation can be represented as: Transpose: interchange elements M(i,j) and M(j,i) Exchange rows: Row i is exchanged with row N-1-i A example for matrix rotation(N is 3 instead of 32 for simplicity): ------- ------- |1|2|3| |3|6|9| ------- ------- |4|5|6| after rotate is |2|5|8| ------- ------- |7|8|9| |1|4|7| ------- ------- A naive implementation is: #define RIDX(i,j,n) ((i)*(n)+(j)) void naive_rotate(int dim, pixel *src, pixel *dst) { int i, j; for (i = 0; i < dim; i++) for (j = 0; j < dim; j++) dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)]; } I come up with an idea by inner-loop-unroll. The result is: Code Version Speed Up original 1x unrolled by 2 1.33x unrolled by 4 1.33x unrolled by 8 1.55x unrolled by 16 1.67x unrolled by 32 1.61x I also get a code snippet from pastebin.com that seems can solve this problem: void rotate(int dim, pixel *src, pixel *dst) { int stride = 32; int count = dim >> 5; src += dim - 1; int a1 = count; do { int a2 = dim; do { int a3 = stride; do { *dst++ = *src; src += dim; } while(--a3); src -= dim * stride + 1; dst += dim - stride; } while(--a2); src += dim * (stride + 1); dst -= dim * dim - stride; } while(--a1); } After carefully read the code, I think main idea of this solution is treat 32 rows as a data zone, and perform the rotating operation respectively. Speed up of this version is 1.85x, overwhelming all the loop-unroll version. Here are the questions: In the inner-loop-unroll version, why does increment slow down if the unrolling factor increase, especially change the unrolling factor from 8 to 16, which does not effect the same when switch from 4 to 8? Does the result have some relationship with depth of the CPU pipeline? If the answer is yes, could the degrade of increment reflect pipeline length? What is the probable reason for the optimization of data-zone version? It seems that there is no too much essential difference from the original naive version. EDIT: My test environment is Intel Centrino Duo processor and the verion of gcc is 4.4 Any advice will be highly appreciated! Kind regards!

Search Results

Search found 14719 results on 589 pages for 'optimization level'.

Page 7/589 | < Previous Page | 3 4 5 6 7 8 9 10 11 12 13 14 | Next Page >

- by SpecialK

- by Summer_More_More_Tea

- by ??iu

- by int3

- by ML

- by Zoltan Hamori

- by Ish Kumar

- by rano

- by Kip Birgen

- by user173552

- by topright

- by Bittercoder

- by metdos

- by ana

- by jerryjvl

- by Shea Levy

- by Vitaly

- by aaa

- by keithjgrant

- by Dženan

- by mafutrct

- by Charlton Santana

- by jmlane

- by Tom

- by jmlane

< Previous Page | 3 4 5 6 7 8 9 10 11 12 13 14 | Next Page >