Search Results

Search found 11822 results on 473 pages for 'external assembly'.

Page 95/473 | < Previous Page | 91 92 93 94 95 96 97 98 99 100 101 102  | Next Page >

  • easy asm program(nasm)

    - by GLeBaTi
    org 0x100 SEGMENT .CODE mov ah,0x9 mov dx, Msg1 int 0x21 ;string input mov ah,0xA mov dx,buff int 0x21 mov ax,0 mov al,[buff+1]; length ;string UPPERCASE mov cl, al mov si, buff cld loop1: lodsb; cmp al, 'a' jnb upper loop loop1 ;output mov ah,0x9 mov dx, buff int 0x21 exit: mov ah, 0x8 int 0x21 int 0x20 upper: sub al,32 jmp loop1 SEGMENT .DATA Msg1 db 'Press string: $' buff db 254,0 this code perform poorly. I think that problem in "jnb upper". This program make small symbols into big symbols.

    Read the article

  • Electronic resources for learning Z/OS assembler?

    - by Jared
    This is a follow up to this question. I'm totally blind so printed books aren't an option. All the recommended books appear to have been published before electronic publishing got started. I've been able to learn the very basics but would like something between here's what a register is, and the IBM reference material. Searching the normal places like Safari Books Online has come up dry.

    Read the article

  • Is there any kind of standard for 8086 multiprocessing?

    - by Earlz
    Back when I made an 8086 emulator I noticed that there was the LOCK prefix intended for synchonization in a multiprocessor environment. Yet the only multitasking I know of for the x86 arch. involves use of the APIC which didn't come around until either the Pentiums or 486s. Was there any kind of standard for 8086 multitasking or was it done by some manufacturer specific extensions to the instruction set and/or special ports? By standard, I mean things like: How do you separate the 2 processors if they both use the same memory? This is impossible without some kind of way to make each processor execute a different piece of code. (or cause an interrupt on only one processor)

    Read the article

  • ARM Simulator on Windows

    - by Betamoo
    I am studying ARM Processors from a textbook... I thought it will be more useful if I could apply what I learn on an ARM simulator... writing code then watching results and different execution stages would be more fun... I have searched for it, but all I could find was either a freeware on linux or a demo on windows Is there a simulator that allow me to see execution steps and different changes for ARM processor (any version!) that runs on windows?? Thanks

    Read the article

  • translate ia32 into C

    - by David Lee
    I am trying to translate the following: Action: push %ebp #function prolog mov %esp, %ebp sub $0x10, %esp mov 0x8(%ebp), %eax #first line compiles to these 4 lines imul 0x8(%ebp), %eax sub $0x7, %eax mov %eax, -0x4(%ebp) addl $0x8, 0xc(%ebp) #second line mov -0x4(%ebp), %eax #third line mov 0xc(%ebp), %edx mov (%edx, %eax, 4), %eax add $0x3, %eax movb $0x41, (%eax) leave ret So far I have the following: //What am I missing? void Action(int x, char **y) { int z = x * x - 7; y+=8; //missing third line } What is the best way to translate this?

    Read the article

  • Why does it NOT give a segmentation violation?

    - by user198729
    The code below is said to give a segmentation violation: #include <stdio.h> #include <string.h> void function(char *str) { char buffer[16]; strcpy(buffer,str); } int main() { char large_string[256]; int i; for( i = 0; i < 255; i++) large_string[i] = 'A'; function(large_string); return 1; } It's compiled and run like this: gcc -Wall -Wextra hw.cpp && a.exe But there is nothing output. NOTE The above code indeed overwrites the ret address and so on if you really understand what's going underneath.

    Read the article

  • C++ word to bytes

    - by Vit
    Hi, I tried to read CPUID using assembler in C++. I know there is function for it in , but I want the asm way. So, after CPUID is executed, it should fill eax,ebx,ecx registers with ASCII coded string. But my problem is, since I can in asm adress only full, or half eax register, how to break that 32 bits into 4 bytes. I used this: #include <iostream> #include <stdlib.h> int main() { _asm { cpuid /*There I need to mov values from eax,ebx and ecx to some propriate variables*/ } system("PAUSE"); return(0); }

    Read the article

  • Project management and bundling dependencies

    - by Joshua
    I've been looking for ways to learn about the right way to manage a software project, and I've stumbled upon the following blog post. I've learned some of the things mentioned the hard way, others make sense, and yet others are still unclear to me. To sum up, the author lists a bunch of features of a project and how much those features contribute to a project's 'suckiness' for a lack of a better term. You can find the full article here: http://spot.livejournal.com/308370.html In particular, I don't understand the author's stance on bundling dependencies with your project. These are: == Bundling == Your source only comes with other code projects that it depends on [ +20 points of FAIL ] Why is this a problem, (especially given the last point)? If your source code cannot be built without first building the bundled code bits [ +10 points of FAIL ] Doesn't this necessarily have to be the case for software built against 3rd party libs? Your code needs that other code to be compiled into its library before the linker can work? If you have modified those other bundled code bits [ +40 points of FAIL ] If this is necessary for your project, then it naturally follows that you've bundled said code with yours. If you want to customize a build of some lib,say WxWidgets, you'll have to edit that projects build scripts to bulid the library that you want. Subsequently, you'll have to publish those changes to people who wish to build your code, so why not use a high level make script with the params already written in, and distribute that? Furthermore, (especially in a windows env) if your code base is dependent on a particular version of a lib (that you also need to custom compile for your project) wouldn't it be easier to give the user the code yourself (because in this case, it is unlikely that the user will already have the correct version installed)? So how would you respond to these comments, and what points may I be failing to take into consideration? Would you agree or disagree with the author's take (or mine), and why?

    Read the article

  • Illegal instruction gcc assembler.

    - by Bernt
    In assembler: .globl _test _test: pushl %ebp movl %esp, %ebp movl $0, %eax pushl %eax popl %ebp ret Calling from c main() { _test(); } Compile: gcc -m32 -o test test.c test.s This code gives me illegal instruction sometimes and segment fault other times. In gdc i always get illegal instruction, this is just a simple test, i had a larger program that was working and suddenly after no apperant reason stopped working, now i always get this error even if i start from scratch like above. I have narrowed it down to pushl %eax (or any other register....), if i comment out that line the code runs fine. Any ideas? (I'm running the program at my universities linux cluster, so I have not changed any settings..)

    Read the article

  • How to benchmark on multi-core processors

    - by Pascal Cuoq
    I am looking for ways to perform micro-benchmarks on multi-core processors. Context: At about the same time desktop processors introduced out-of-order execution that made performance hard to predict, they, perhaps not coincidentally, also introduced special instructions to get very precise timings. Example of these instructions are rdtsc on x86 and rftb on PowerPC. These instructions gave timings that were more precise than could ever be allowed by a system call, allowed programmers to micro-benchmark their hearts out, for better or for worse. On a yet more modern processor with several cores, some of which sleep some of the time, the counters are not synchronized between cores. We are told that rdtsc is no longer safe to use for benchmarking, but I must have been dozing off when we were explained the alternative solutions. Question: Some systems may save and restore the performance counter and provide an API call to read the proper sum. If you know what this call is for any operating system, please let us know in an answer. Some systems may allow to turn off cores, leaving only one running. I know Mac OS X Leopard does when the right Preference Pane is installed from the Developers Tools. Do you think that this make rdtsc safe to use again? More context: Please assume I know what I am doing when trying to do a micro-benchmark. If you are of the opinion that if an optimization's gains cannot be measured by timing the whole application, it's not worth optimizing, I agree with you, but I cannot time the whole application until the alternative data structure is finished, which will take a long time. In fact, if the micro-benchmark were not promising, I could decide to give up on the implementation now; I need figures to provide in a publication whose deadline I have no control over.

    Read the article

  • What are CFI directives in Gnu Assembler (GAS) used for?

    - by claws
    There seem to be a .CFI directive after every line and also there are wide varities of these ex.,.cfi_startproc , .cfi_endproc etc.. more here. .file "temp.c" .text .globl main .type main, @function main: .LFB0: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 movq %rsp, %rbp .cfi_offset 6, -16 .cfi_def_cfa_register 6 movl $0, %eax leave ret .cfi_endproc .LFE0: .size main, .-main .globl func .type func, @function func: .LFB1: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 movq %rsp, %rbp .cfi_offset 6, -16 .cfi_def_cfa_register 6 movl %edi, -4(%rbp) movl %esi, %eax movb %al, -8(%rbp) leave ret .cfi_endproc .LFE1: .size func, .-func .ident "GCC: (Ubuntu 4.4.1-4ubuntu9) 4.4.1" .section .note.GNU-stack,"",@progbits I didn't get the purpose of these.

    Read the article

  • "C variable type sizes are machine dependent." Is it really true? signed & unsigned numbers ;

    - by claws
    Hello, I've been told that C types are machine dependent. Today I wanted to verify it. void legacyTypes() { /* character types */ char k_char = 'a'; //Signedness --> signed & unsigned signed char k_char_s = 'a'; unsigned char k_char_u = 'a'; /* integer types */ int k_int = 1; /* Same as "signed int" */ //Signedness --> signed & unsigned signed int k_int_s = -2; unsigned int k_int_u = 3; //Size --> short, _____, long, long long short int k_s_int = 4; long int k_l_int = 5; long long int k_ll_int = 6; /* real number types */ float k_float = 7; double k_double = 8; } I compiled it on a 32-Bit machine using minGW C compiler _legacyTypes: pushl %ebp movl %esp, %ebp subl $48, %esp movb $97, -1(%ebp) # char movb $97, -2(%ebp) # signed char movb $97, -3(%ebp) # unsigned char movl $1, -8(%ebp) # int movl $-2, -12(%ebp)# signed int movl $3, -16(%ebp) # unsigned int movw $4, -18(%ebp) # short int movl $5, -24(%ebp) # long int movl $6, -32(%ebp) # long long int movl $0, -28(%ebp) movl $0x40e00000, %eax movl %eax, -36(%ebp) fldl LC2 fstpl -48(%ebp) leave ret I compiled the same code on 64-Bit processor (Intel Core 2 Duo) on GCC (linux) legacyTypes: .LFB2: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 movq %rsp, %rbp .cfi_offset 6, -16 .cfi_def_cfa_register 6 movb $97, -1(%rbp) # char movb $97, -2(%rbp) # signed char movb $97, -3(%rbp) # unsigned char movl $1, -12(%rbp) # int movl $-2, -16(%rbp)# signed int movl $3, -20(%rbp) # unsigned int movw $4, -6(%rbp) # short int movq $5, -32(%rbp) # long int movq $6, -40(%rbp) # long long int movl $0x40e00000, %eax movl %eax, -24(%rbp) movabsq $4620693217682128896, %rax movq %rax, -48(%rbp) leave ret Observations char, signed char, unsigned char, int, unsigned int, signed int, short int, unsigned short int, signed short int all occupy same no. of bytes on both 32-Bit & 64-Bit Processor. The only change is in long int & long long int both of these occupy 32-bit on 32-bit machine & 64-bit on 64-bit machine. And also the pointers, which take 32-bit on 32-bit CPU & 64-bit on 64-bit CPU. Questions: I cannot say, what the books say is wrong. But I'm missing something here. What exactly does "Variable types are machine dependent mean?" As you can see, There is no difference between instructions for unsigned & signed numbers. Then how come the range of numbers that can be addressed using both is different? I was reading http://stackoverflow.com/questions/2511246/how-to-maintain-fixed-size-of-c-variable-types-over-different-machines I didn't get the purpose of the question or their answers. What maintaining fixed size? They all are the same. I didn't understand how those answers are going to ensure the same size.

    Read the article

  • How do I patch a Windows API at runtime so that it to returns 0 in x64?

    - by Jorge Vasquez
    In x86, I get the function address using GetProcAddress() and write a simple XOR EAX,EAX; RET; in it. Simple and effective. How do I do the same in x64? bool DisableSetUnhandledExceptionFilter() { const BYTE PatchBytes[5] = { 0x33, 0xC0, 0xC2, 0x04, 0x00 }; // XOR EAX,EAX; RET; // Obtain the address of SetUnhandledExceptionFilter HMODULE hLib = GetModuleHandle( _T("kernel32.dll") ); if( hLib == NULL ) return false; BYTE* pTarget = (BYTE*)GetProcAddress( hLib, "SetUnhandledExceptionFilter" ); if( pTarget == 0 ) return false; // Patch SetUnhandledExceptionFilter if( !WriteMemory( pTarget, PatchBytes, sizeof(PatchBytes) ) ) return false; // Ensures out of cache FlushInstructionCache(GetCurrentProcess(), pTarget, sizeof(PatchBytes)); // Success return true; } static bool WriteMemory( BYTE* pTarget, const BYTE* pSource, DWORD Size ) { // Check parameters if( pTarget == 0 ) return false; if( pSource == 0 ) return false; if( Size == 0 ) return false; if( IsBadReadPtr( pSource, Size ) ) return false; // Modify protection attributes of the target memory page DWORD OldProtect = 0; if( !VirtualProtect( pTarget, Size, PAGE_EXECUTE_READWRITE, &OldProtect ) ) return false; // Write memory memcpy( pTarget, pSource, Size ); // Restore memory protection attributes of the target memory page DWORD Temp = 0; if( !VirtualProtect( pTarget, Size, OldProtect, &Temp ) ) return false; // Success return true; } This example is adapted from code found here: http://www.debuginfo.com/articles/debugfilters.html#overwrite .

    Read the article

  • Software Protection: Shuffeling my application?

    - by Martijn Courteaux
    Hi, I want to continue on my previous question: http://stackoverflow.com/questions/3007168/torrents-can-i-protect-my-software-by-sending-wrong-bytes Developer Art suggested to add a unique key to the application, to identifier the cracker. But JAB said that crackers can search where my unique key is located by checking for binary differences, if the cracker has multiple copies of my software. Then crackers change that key to make them self anonymous. That is true. Now comes the question: If I want to add a unique key, are there tools to shuffle (a kind of obfuscation) the program modules? So, that a binary compare would say that the two files are completely different. So they can't locate the identifier key. I'm pretty sure it is possible (maybe by replacing assembler blocks and make some jumps). I think it would be enough to make 30 to 40 shuffles of my software. Thanks

    Read the article

  • Inline assembler get address of pointer Visual Studio

    - by Joe
    I have a function in VS where I pass a pointer to the function. I then want to store the pointer in a register to further manipulate. How do you do that? I have tried void f(*p) { __asm mov eax, p // try one FAIL __asm mov eax, [p] // try two FAIL __asm mov eax, &p // try three FAIL } Both 1 and 2 are converted to the same code and load the value pointed to. I just want the address. Oddly, option 1 works just fine with integers. void f() { int i = 5; __asm mov eax, i // SUCCESS? }

    Read the article

  • Interrupt On GAS

    - by Nathan Campos
    I'm trying to convert my simple program from Intel syntax to the AT&T(to compile it with GAS). I've successfully converted a big part of my application, but I'm still getting an error with the int(the interrupts). My function is like this: printf: mov $0x0e, %ah mov $0x07, %bl nextchar: lodsb or %al, %al jz return int 10 jmp nextchar return: ret msg db "Welcome To Track!", 0Ah But when I compile it, I got this: hello.S: Assembler messages: hello.S:13: Error: operand size mismatch for int' hello.S:19: Error: no such instruction:msg db "Hello, World!",0Ah' What I need to do?

    Read the article

  • Why increased pipeline depth does not always mean increased throughput?

    - by worlds-apart89
    This is perhaps more of a discussion question, but I thought stackoverflow could be the right place to ask it. I am studying the concept of instruction pipelining. I have been taught that a pipeline's instruction throughput is increased once the number of pipeline stages is increased, but in some cases, throughput might not change. Under what conditions, does this happen? I am thinking stalling and branching could be the answer to the question, but I wonder if I am missing something crucial.

    Read the article

  • Why is FLD1 loading NaN instead?

    - by Bernd Jendrissek
    I have a one-liner C function that is just return value * pow(1.+rate, -delay); - it discounts a future value to a present value. The interesting part of the disassembly is 0x080555b9 : neg %eax 0x080555bb : push %eax 0x080555bc : fildl (%esp) 0x080555bf : lea 0x4(%esp),%esp 0x080555c3 : fldl 0xfffffff0(%ebp) 0x080555c6 : fld1 0x080555c8 : faddp %st,%st(1) 0x080555ca : fxch %st(1) 0x080555cc : fstpl 0x8(%esp) 0x080555d0 : fstpl (%esp) 0x080555d3 : call 0x8051ce0 0x080555d8 : fmull 0xfffffff8(%ebp) While single-stepping through this function, gdb says (rate is 0.02, delay is 2; you can see them on the stack): (gdb) si 0x080555c6 30 return value * pow(1.+rate, -delay); (gdb) info float R7: Valid 0x4004a6c28f5c28f5c000 +41.68999999999999773 R6: Valid 0x4004e15c28f5c28f6000 +56.34000000000000341 R5: Valid 0x4004dceb851eb851e800 +55.22999999999999687 R4: Valid 0xc0008000000000000000 -2 =R3: Valid 0x3ff9a3d70a3d70a3d800 +0.02000000000000000042 R2: Valid 0x4004ff147ae147ae1800 +63.77000000000000313 R1: Valid 0x4004e17ae147ae147800 +56.36999999999999744 R0: Valid 0x4004efb851eb851eb800 +59.92999999999999972 Status Word: 0x1861 IE PE SF TOP: 3 Control Word: 0x037f IM DM ZM OM UM PM PC: Extended Precision (64-bits) RC: Round to nearest Tag Word: 0x0000 Instruction Pointer: 0x73:0x080555c3 Operand Pointer: 0x7b:0xbff41d78 Opcode: 0xdd45 And after the fld1: (gdb) si 0x080555c8 30 return value * pow(1.+rate, -delay); (gdb) info float R7: Valid 0x4004a6c28f5c28f5c000 +41.68999999999999773 R6: Valid 0x4004e15c28f5c28f6000 +56.34000000000000341 R5: Valid 0x4004dceb851eb851e800 +55.22999999999999687 R4: Valid 0xc0008000000000000000 -2 R3: Valid 0x3ff9a3d70a3d70a3d800 +0.02000000000000000042 =R2: Special 0xffffc000000000000000 Real Indefinite (QNaN) R1: Valid 0x4004e17ae147ae147800 +56.36999999999999744 R0: Valid 0x4004efb851eb851eb800 +59.92999999999999972 Status Word: 0x1261 IE PE SF C1 TOP: 2 Control Word: 0x037f IM DM ZM OM UM PM PC: Extended Precision (64-bits) RC: Round to nearest Tag Word: 0x0020 Instruction Pointer: 0x73:0x080555c6 Operand Pointer: 0x7b:0xbff41d78 Opcode: 0xd9e8 After this, everything goes to hell. Things get grossly over or undervalued, so even if there were no other bugs in my freeciv AI attempt, it would choose all the wrong strategies. Like sending the whole army to the arctic. (Sigh, if only I were getting that far.) I must be missing something obvious, or getting blinded by something, because I can't believe that fld1 should ever possibly fail. Even less that it should fail only after a handful of passes through this function. On earlier passes the FPU correctly loads 1 into ST(0). The bytes at 0x080555c6 definitely encode fld1 - checked with x/... on the running process. What gives?

    Read the article

  • where did the _syscallN macros go in <linux/unistd.h>?

    - by Evan Teran
    It used to be the case that if you needed to make a system call directly in linux without the use of an existing library, you could just include <linux/unistd.h> and it would define a macro similar to this: #define _syscall3(type,name,type1,arg1,type2,arg2,type3,arg3) \ type name(type1 arg1,type2 arg2,type3 arg3) \ { \ long __res; \ __asm__ volatile ("int $0x80" \ : "=a" (__res) \ : "0" (__NR_##name),"b" ((long)(arg1)),"c" ((long)(arg2)), \ "d" ((long)(arg3))); \ if (__res>=0) \ return (type) __res; \ errno=-__res; \ return -1; \ } Then you could just put somewhere in your code: _syscall3(ssize_t, write, int, fd, const void *, buf, size_t, count); which would define a write function for you that properly performed the system call. It seems that this system has been superseded by something (i am guessing that "[vsyscall]" page that every process gets) more robust. So what is the proper way (please be specific) for a program to perform a system call directly on newer linux kernels? I realize that I should be using libc and let it do the work for me. But let's assume that I have a decent reason for wanting to know how to do this :-).

    Read the article

  • Where are the function literals in c++?

    - by academicRobot
    First of all, maybe literals is not the right term for this concept, but its the closest I could think of (not literals in the sense of functions as first class citizens). The idea is that when you make a conventional function call, it compiles to something like this: callq <immediate address> But if you make a function call using a function pointer, it compiles to something like this: mov <memory location>,%rax callq *%rax Which is all well and good. However, what if I'm writing a template library that requires a callback of some sort with a specified argument list and the user of the library is expected to know what function they want to call at compile time? Then I would like to write my template to accept a function literal as a template parameter. So, similar to template <int int_literal> struct my_template {...};` I'd like to write template <func_literal_t func_literal> struct my_template {...}; and have calls to func_literal within my_template compile to callq <immediate address>. Is there a facility in C++ for this, or a work around to achieve the same effect? If not, why not (e.g. some cataclysmic side effects)? How about C++0x or another language? Solutions that are not portable are fine. Solutions that include the use of member function pointers would be ideal. I'm not particularly interested in being told "You are a <socially unacceptable term for a person of low IQ>, just use function pointers/functors." This is a curiosity based question, and it seems that it might be useful in some (albeit limited) applications. It seems like this should be possible since function names are just placeholders for a (relative) memory address, so why not allow more liberal use (e.g. aliasing) of this placeholder. p.s. I use function pointers and functions objects all the the time and they are great. But this post got me thinking about the don't pay for what you don't use principle in relation to function calls, and it seems like forcing the use of function pointers or similar facility when the function is known at compile time is a violation of this principle, though a small one.

    Read the article

  • Where are the function address literals in c++?

    - by academicRobot
    First of all, maybe literals is not the right term for this concept, but its the closest I could think of (not literals in the sense of functions as first class citizens). <UPDATE> After some reading with help from answer by Chris Dodd, what I'm looking for is literal function addresses as template parameters. Chris' answer indicates how to do this for standard functions, but how can the addresses of member functions be used as template parameters? Since the standard prohibits non-static member function addresses as template parameters (c++03 14.3.2.3), I suspect the work around is quite complicated. Any ideas for a workaround? Below the original form of the question is left as is for context. </UPDATE> The idea is that when you make a conventional function call, it compiles to something like this: callq <immediate address> But if you make a function call using a function pointer, it compiles to something like this: mov <memory location>,%rax callq *%rax Which is all well and good. However, what if I'm writing a template library that requires a callback of some sort with a specified argument list and the user of the library is expected to know what function they want to call at compile time? Then I would like to write my template to accept a function literal as a template parameter. So, similar to template <int int_literal> struct my_template {...};` I'd like to write template <func_literal_t func_literal> struct my_template {...}; and have calls to func_literal within my_template compile to callq <immediate address>. Is there a facility in C++ for this, or a work around to achieve the same effect? If not, why not (e.g. some cataclysmic side effects)? How about C++0x or another language? Solutions that are not portable are fine. Solutions that include the use of member function pointers would be ideal. I'm not particularly interested in being told "You are a <socially unacceptable term for a person of low IQ>, just use function pointers/functors." This is a curiosity based question, and it seems that it might be useful in some (albeit limited) applications. It seems like this should be possible since function names are just placeholders for a (relative) memory address, so why not allow more liberal use (e.g. aliasing) of this placeholder. p.s. I use function pointers and functions objects all the the time and they are great. But this post got me thinking about the don't pay for what you don't use principle in relation to function calls, and it seems like forcing the use of function pointers or similar facility when the function is known at compile time is a violation of this principle, though a small one. Edit The intent of this question is not to implement delegates, rather to identify a pattern that will embed a conventional function call, (in immediate mode) directly into third party code, possibly a template.

    Read the article

  • Making JSON Feed Work Server to Server

    - by Oscar Godson
    I have a JSON feed (which I don't directly control, I have to ask the server admins to update it), that is returning NULL, or otherwise not working when I try to access it from another server. The actual feed works fine, but I can't get it to work like a normal API. Why can I use Flickr, WhitePages, Twitter, etc's JSON APIs, but for: http://portlandoregon.gov/shared/cfm/json.cfm It returns NULL in the browser when using JS. If I call it on the same server, it's fine though. If this is a server issue, what do I ask the admins to change, and if this is a local (browser) based issue, what am I doing wrong for this one feed that works for every other JSON API?

    Read the article

  • Odd optimization problem under MSVC

    - by Goz
    I've seen this blog: http://igoro.com/archive/gallery-of-processor-cache-effects/ The "weirdness" in part 7 is what caught my interest. My first thought was "Thats just C# being weird". Its not I wrote the following C++ code. volatile int* p = (volatile int*)_aligned_malloc( sizeof( int ) * 8, 64 ); memset( (void*)p, 0, sizeof( int ) * 8 ); double dStart = t.GetTime(); for (int i = 0; i < 200000000; i++) { //p[0]++;p[1]++;p[2]++;p[3]++; // Option 1 //p[0]++;p[2]++;p[4]++;p[6]++; // Option 2 p[0]++;p[2]++; // Option 3 } double dTime = t.GetTime() - dStart; The timing I get on my 2.4 Ghz Core 2 Quad go as follows: Option 1 = ~8 cycles per loop. Option 2 = ~4 cycles per loop. Option 3 = ~6 cycles per loop. Now This is confusing. My reasoning behind the difference comes down to the cache write latency (3 cycles) on my chip and an assumption that the cache has a 128-bit write port (This is pure guess work on my part). On that basis in Option 1: It will increment p[0] (1 cycle) then increment p[2] (1 cycle) then it has to wait 1 cycle (for cache) then p[1] (1 cycle) then wait 1 cycle (for cache) then p[3] (1 cycle). Finally 2 cycles for increment and jump (Though its usually implemented as decrement and jump). This gives a total of 8 cycles. In Option 2: It can increment p[0] and p[4] in one cycle then increment p[2] and p[6] in another cycle. Then 2 cycles for subtract and jump. No waits needed on cache. Total 4 cycles. In option 3: It can increment p[0] then has to wait 2 cycles then increment p[2] then subtract and jump. The problem is if you set case 3 to increment p[0] and p[4] it STILL takes 6 cycles (which kinda blows my 128-bit read/write port out of the water). So ... can anyone tell me what the hell is going on here? Why DOES case 3 take longer? Also I'd love to know what I've got wrong in my thinking above, as i obviously have something wrong! Any ideas would be much appreciated! :) It'd also be interesting to see how GCC or any other compiler copes with it as well! Edit: Jerry Coffin's idea gave me some thoughts. I've done some more tests (on a different machine so forgive the change in timings) with and without nops and with different counts of nops case 2 - 0.46 00401ABD jne (401AB0h) 0 nops - 0.68 00401AB7 jne (401AB0h) 1 nop - 0.61 00401AB8 jne (401AB0h) 2 nops - 0.636 00401AB9 jne (401AB0h) 3 nops - 0.632 00401ABA jne (401AB0h) 4 nops - 0.66 00401ABB jne (401AB0h) 5 nops - 0.52 00401ABC jne (401AB0h) 6 nops - 0.46 00401ABD jne (401AB0h) 7 nops - 0.46 00401ABE jne (401AB0h) 8 nops - 0.46 00401ABF jne (401AB0h) 9 nops - 0.55 00401AC0 jne (401AB0h) I've included the jump statetements so you can see that the source and destination are in one cache line. You can also see that we start to get a difference when we are 13 bytes or more apart. Until we hit 16 ... then it all goes wrong. So Jerry isn't right (though his suggestion DOES help a bit), however something IS going on. I'm more and more intrigued to try and figure out what it is now. It does appear to be more some sort of memory alignment oddity rather than some sort of instruction throughput oddity. Anyone want to explain this for an inquisitive mind? :D Edit 3: Interjay has a point on the unrolling that blows the previous edit out of the water. With an unrolled loop the performance does not improve. You need to add a nop in to make the gap between jump source and destination the same as for my good nop count above. Performance still sucks. Its interesting that I need 6 nops to improve performance though. I wonder how many nops the processor can issue per cycle? If its 3 then that account for the cache write latency ... But, if thats it, why is the latency occurring? Curiouser and curiouser ...

    Read the article

< Previous Page | 91 92 93 94 95 96 97 98 99 100 101 102  | Next Page >