Search Results

Search found 2568 results on 103 pages for 'x86'.

Page 14/103

  • Actual long double precision does not agree with std::numeric_limits

    - by dmb
    Working on Mac OS X 10.6.2, Intel, with i686-apple-darwin10-g++-4.2.1, and compiling with the -arch x86_64 flag, I just noticed that while... std::numeric_limits<long double>::max_exponent10 = 4932 ...as is expected, when a long double is actually set to a value with an exponent greater than 308, it becomes inf, i.e. in reality it only has 64-bit precision instead of 80-bit. Also, sizeof() is showing long doubles to be 16 bytes, which they should be. Finally, using gives the same results as . Does anyone know where the discrepancy might be? long double x = 1e308, y = 1e309; cout << std::numeric_limits<long double>::max_exponent10 << endl; cout << x << '\t' << y << endl; cout << sizeof(x) << endl; gives 4932 1e+308 inf 16


  • masm division overflow

    - by Help I'm in college
    I'm trying to divide two numbers in assembly. I'm working out of the Irvine assembly book for Intel computers and I can't make division work for the life of me. Here's my code: .code main PROC call division exit main ENDP division PROC mov eax, 4 mov ebx, 2 div ebx call WriteDec ret division ENDP END main Here WriteDec should write whatever number is in the eax register (which should hold the quotient after the div instruction). Instead, every time I run it Visual Studio crashes (the program does compile, however).
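    One thing worth checking (an assumption, since the post does not show the register state): div ebx divides the 64-bit value EDX:EAX, not EAX alone, so leftover bits in EDX can make the quotient too large for EAX and raise a divide error. The C model below only illustrates that rule, and div32 is a made-up name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models x86 `div r/m32`: the dividend is the 64-bit value EDX:EAX.
 * Returns false where the CPU would raise a divide error (divisor of
 * zero, or a quotient that does not fit in 32 bits). On success EAX
 * receives the quotient and EDX the remainder. */
bool div32(uint32_t *eax, uint32_t *edx, uint32_t divisor)
{
    if (divisor == 0)
        return false;

    uint64_t dividend = ((uint64_t)*edx << 32) | *eax;
    uint64_t quotient = dividend / divisor;

    if (quotient > UINT32_MAX)
        return false;               /* quotient overflow */

    *edx = (uint32_t)(dividend % divisor);
    *eax = (uint32_t)quotient;
    return true;
}
```

    With EDX cleared first (for example with xor edx, edx), 4 / 2 succeeds in this model. With a stray 5 left in EDX, the same division is reported as an overflow.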


  • how to know location of return address on stack c/c++

    - by Dr Deo
    I have been reading about a function that can overwrite its return address. void foo(const char* input) { char buf[10]; //What? No extra arguments supplied to printf? //It's a cheap trick to view the stack 8-) //We'll see this trick again when we look at format strings. printf("My stack looks like:\n%p\n%p\n%p\n%p\n%p\n%p\n\n"); //%p ie expect pointers //Pass the user input straight to secure code public enemy #1. strcpy(buf, input); printf("%s\n", buf); printf("Now the stack looks like:\n%p\n%p\n%p\n%p\n%p\n%p\n\n"); } It was suggested that this is how the stack would look: Address of foo = 00401000 My stack looks like: 00000000 00000000 7FFDF000 0012FF80 0040108A <-- We want to overwrite the return address for foo. 00410EDE Questions: - Why did the author arbitrarily choose the second-to-last value as the return address of foo()? - Are values added to the stack from the bottom or from the top? - Apart from the function return address, what are the other values I apparently see on the stack, i.e. why isn't it filled with zeros? Thanks.


  • x86_64 printf segfault after brk call

    - by gmb11
    While I was trying to use brk (int 0x80 with 45 in %rax) to implement a simple memory manager program in assembly and print the blocks in order, I kept getting a segfault. After a while I could reproduce the error, but I have no idea why it is happening: .section .data helloworld: .ascii "hello world" .section .text .globl _start _start: push %rbp mov %rsp, %rbp movq $45, %rax movq $0, %rbx #brk(0) should just return the current break of the program int $0x80 #incq %rax #segfault #addq $1, %rax #segfault movq $0, %rax #works fine? #addq $1, %rax #segfault again? movq $helloworld, %rdi call printf movq $1, %rax #exit int $0x80 In the example here, if the commented lines are uncommented, I get a segfault, but some instructions (like the movq $0, %rax) work just fine. In my other program, the first couple of printf calls work, but the third crashes... Looking at other questions, I read that printf sometimes allocates memory, and that brk shouldn't be used because in this case it corrupts the heap or something... I'm very confused; does anyone know something about that? EDIT: I've just found out that for printf to work you need %rax=0.


  • Multithreading and Interrupts

    - by Nicholas Flynt
    I'm doing some work on the input buffers for my kernel, and I had some questions. On Dual Core machines, I know that more than one "process" can be running simultaneously. What I don't know is how the OS and the individual programs work to protect collisions in data. There are two things I'd like to know on this topic: (1) Where do interrupts occur? Are they guaranteed to occur on one core and not the other, and could this be used to make sure that real-time operations on one core were not interrupted by, say, file IO which could be handled on the other core? (I'd logically assume that the interrupts would happen on the 1st core, but is that always true, and how would you tell? Or perhaps does each core have its own settings for interrupts? Wouldn't that lead to a scenario where each core could react simultaneously to the same interrupt, possibly in different ways?) (2) How does the dual core processor handle opcode memory collision? If one core is reading an address in memory at exactly the same time that another core is writing to that same address in memory, what happens? Is an exception thrown, or is a value read? (I'd assume the write would work either way.) If a value is read, is it guaranteed to be either the old or new value at the time of the collision? I understand that programs should ideally be written to avoid these kinds of complications, but the OS certainly can't expect that, and will need to be able to handle such events without choking on itself.


  • How is thread synchronization implemented, at the assembly language level?

    - by Martin
    While I'm familiar with concurrent programming concepts such as mutexes and semaphores, I have never understood how they are implemented at the assembly language level. I imagine there being a set of memory "flags" saying: lock A is held by thread 1 lock B is held by thread 3 lock C is not held by any thread etc But how is access to these flags synchronized between threads? Something like this naive example would only create a race condition: mov edx, [myThreadId] wait: cmp [lock], 0 jne wait mov [lock], edx ; I wanted an exclusive lock but the above ; three instructions are not an atomic operation :(
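    The usual way out is an instruction that performs the compare and the store as one indivisible step (on x86, for example, lock cmpxchg or xchg). Here is a minimal sketch in C11, assuming <stdatomic.h> is available. struct spinlock and the spin_* names are made up here, and a production lock would add back-off and tuned memory ordering.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* owner == 0 means "not held"; any other value is the holder's thread id. */
struct spinlock {
    atomic_int owner;
};

/* Try to take the lock once. The compare ("is owner 0?") and the store
 * ("owner = my_thread_id") happen atomically, so two threads can never
 * both see 0 and both store their id. */
bool spin_try_lock(struct spinlock *l, int my_thread_id)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&l->owner, &expected, my_thread_id);
}

/* Busy-wait until the lock is acquired. */
void spin_lock(struct spinlock *l, int my_thread_id)
{
    while (!spin_try_lock(l, my_thread_id))
        ; /* spin */
}

/* Release the lock by marking it as not held. */
void spin_unlock(struct spinlock *l)
{
    atomic_store(&l->owner, 0);
}
```

    This replaces the cmp / jne / mov sequence from the naive example with a single atomic read-modify-write, which is the part the three separate instructions could not guarantee.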


  • How to install python2.6-devel package under CentOS 5

    - by Creotiv
    I need to install mysql-python under python2.6. The mysql-python package needs the python2.6-devel package, which depends on libpython2.6.so.1.0(64bit). I found some python2.6-devel packages on the net, but can't find libpython2.6. The server architecture is x86_64. Maybe someone has this lib, or knows where I can find it. Thanks for the help.


  • Combining prefixes in SSE

    - by Nathan Fellman
    In SSE the prefixes 066h (operand size override), 0F2h (REPNE) and 0F3h (REPE) are part of the opcode. In non-SSE code 066h switches between 32-bit (or 64-bit) and 16-bit operation, and 0F2h and 0F3h are used for string operations. They can be combined so that 066h and 0F2h (or 0F3h) are used in the same instruction, because this is meaningful. What is the behavior in an SSE instruction? For instance, we have (ignoring mod/rm for now): 0f 58 -- addps 66 0f 58 -- addpd f2 0f 58 -- addsd f3 0f 58 -- addss But what is this? 66 f2 0f 58 And how about this? f2 66 0f 58 Not to mention the following, which has two conflicting REP prefixes: f2 f3 0f 58 What is the spec for these?


  • ASP.NET application developed in 32 bit environment not working in 64 bit environment

    - by jgonchik
    We have developed an ASP.NET website on a Windows 7 - 32 bit platform using Visual Studio 2008. This website is being hosted at a hosting company where we share a server with hundreds of other ASP.NET websites. We are in the process of changing our hosting to a dedicated Windows 2008 - 64 bit server. We have installed Visual Studio on this new server in order to debug our application. If we try to start the application on this new server using Visual Studio 2008's own web server (not IIS 7) we get the error below. We have tried to compile the application in both 32 as well as 64 bit mode. We also tried to compile to "Any CPU". But nothing helps. We also tried running Visual Studio as an administrator but without success. We get the following error: Server Error in '/' Application. The specified module could not be found. (Exception from HRESULT: 0x8007007E) Description: An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code. Exception Details: System.IO.FileNotFoundException: The specified module could not be found. (Exception from HRESULT: 0x8007007E) Source Error: An unhandled exception was generated during the execution of the current web request. Information regarding the origin and location of the exception can be identified using the exception stack trace below. Stack Trace: [FileNotFoundException: The specified module could not be found.
(Exception from HRESULT: 0x8007007E)] System.Reflection.Assembly._nLoad(AssemblyName fileName, String codeBase, Evidence assemblySecurity, Assembly locationHint, StackCrawlMark& stackMark, Boolean throwOnFileNotFound, Boolean forIntrospection) +0 System.Reflection.Assembly.nLoad(AssemblyName fileName, String codeBase, Evidence assemblySecurity, Assembly locationHint, StackCrawlMark& stackMark, Boolean throwOnFileNotFound, Boolean forIntrospection) +43 System.Reflection.Assembly.InternalLoad(AssemblyName assemblyRef, Evidence assemblySecurity, StackCrawlMark& stackMark, Boolean forIntrospection) +127 System.Reflection.Assembly.InternalLoad(String assemblyString, Evidence assemblySecurity, StackCrawlMark& stackMark, Boolean forIntrospection) +142 System.Reflection.Assembly.Load(String assemblyString) +28 System.Web.Configuration.CompilationSection.LoadAssemblyHelper(String assemblyName, Boolean starDirective) +46 [ConfigurationErrorsException: The specified module could not be found. (Exception from HRESULT: 0x8007007E)] System.Web.Configuration.CompilationSection.LoadAssemblyHelper(String assemblyName, Boolean starDirective) +613 System.Web.Configuration.CompilationSection.LoadAllAssembliesFromAppDomainBinDirectory() +203 System.Web.Configuration.CompilationSection.LoadAssembly(AssemblyInfo ai) +105 System.Web.Compilation.BuildManager.GetReferencedAssemblies(CompilationSection compConfig) +178 System.Web.Compilation.BuildProvidersCompiler..ctor(VirtualPath configPath, Boolean supportLocalization, String outputAssemblyName) +54 System.Web.Compilation.ApplicationBuildProvider.GetGlobalAsaxBuildResult(Boolean isPrecompiledApp) +232 System.Web.Compilation.BuildManager.CompileGlobalAsax() +51 System.Web.Compilation.BuildManager.EnsureTopLevelFilesCompiled() +337 [HttpException (0x80004005): The specified module could not be found. 
(Exception from HRESULT: 0x8007007E)] System.Web.Compilation.BuildManager.ReportTopLevelCompilationException() +58 System.Web.Compilation.BuildManager.EnsureTopLevelFilesCompiled() +512 System.Web.Hosting.HostingEnvironment.Initialize(ApplicationManager appManager, IApplicationHost appHost, IConfigMapPathFactory configMapPathFactory, HostingEnvironmentParameters hostingParameters) +729 [HttpException (0x80004005): The specified module could not be found. (Exception from HRESULT: 0x8007007E)] System.Web.HttpRuntime.FirstRequestInit(HttpContext context) +8897659 System.Web.HttpRuntime.EnsureFirstRequestInit(HttpContext context) +85 System.Web.HttpRuntime.ProcessRequestInternal(HttpWorkerRequest wr) +259 Does anyone know why this error appears and how to solve it?


  • 80x86 16-bit asm: lea cx, [cx*8+cx] causes error on NASM (compiling .com file)

    - by larz
    Title says it all. The error NASM gives (despite my working OS) is "invalid effective address". Now I've seen many examples of how to use LEA and I think I got it right, but NASM still dislikes it. I tried "lea cx, [cx+9]" and it worked; "lea cx, [bx+cx]" didn't. Now if I extended my registers to 32 bits (i.e. "lea ecx, [ecx*8+ecx]") everything would be well, but I am restricted to 16- and 8-bit registers only. Is there anyone knowledgeable enough to explain WHY my assembler doesn't let me use lea the way I thought it should be used? Thanks.
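    For background: 16-bit effective addresses have no scale factor, and only a few register combinations are encodable, so a form like [cx*8+cx] has no 16-bit encoding. The same value can be computed with a shift and an add (in assembly, roughly: copy cx, shift the copy left by 3, add the original). The C sketch below shows the arithmetic with 16-bit wraparound. times9_u16 is a made-up name, not something from the post.

```c
#include <stdint.h>

/* Computes x * 9 as (x << 3) + x, truncated to 16 bits the way a
 * 16-bit register would wrap. Illustrative helper only. */
uint16_t times9_u16(uint16_t x)
{
    uint32_t shifted = (uint32_t)x << 3;   /* x * 8 */
    return (uint16_t)(shifted + x);        /* x * 8 + x, low 16 bits */
}
```

    For example, times9_u16(7) is 63, and times9_u16(0xFFFF) wraps to 0xFFF7, the low 16 bits of 0x8FFF7.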


  • Doubts in System call mechanism in linux

    - by bala1486
    We transition from ring3 to ring0 using 'int' or the newer 'syscall/sysenter' instructions. Does that mean that the page tables and other things that need to be modified for the kernel are automatically handled by the 'int' instruction, or does the interrupt handler for 'int 0x80' do the required work and then jump to the respective system call? Also, when returning from a system call, we need to go back to user space. For this we need to know the instruction address in user space at which to continue the user application. Where is that address stored? Does the 'ret' instruction automatically change the ring from ring0 to ring3, or where/how does this ring change take place? Then, I read that changing from ring3 to ring0 is not as costly as changing from ring0 to ring3. Why is this so? Thanks, Bala


  • Why is a 16-bit register used with BSR instruction in this code snippet?

    - by sharptooth
    In this hardcore article there's a function find_maskwidth() that basically detects the number of bits required to represent itemCount distinct values: unsigned int find_maskwidth( unsigned int itemCount ) { unsigned int maskWidth, count = itemCount; __asm { mov eax, count mov ecx, 0 mov maskWidth, ecx dec eax bsr cx, ax jz next inc cx mov maskWidth, ecx next: } return maskWidth; } The question is: why do they use the ax and cx registers instead of eax and ecx?
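    For comparison, here is a portable C sketch of what the inline assembly appears to compute: the position of the highest set bit of itemCount - 1, plus one, or 0 when itemCount - 1 is zero. It works on the full width of unsigned int, so it differs from the 16-bit bsr for counts above 65536. The loop is an illustrative rewrite, not code from the article, and mask_width_portable is a made-up name.

```c
/* Number of bits needed to index item_count distinct values
 * (0 for a single item). Illustrative rewrite of the asm above. */
unsigned int mask_width_portable(unsigned int item_count)
{
    unsigned int n = item_count - 1;   /* largest index, like `dec eax` */
    unsigned int width = 0;

    while (n != 0) {                   /* count bits up to the highest set bit */
        width++;
        n >>= 1;
    }
    return width;
}
```

    With this version, 1 item needs 0 bits, 2 items need 1 bit, 8 items need 3 bits and 9 items need 4 bits.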


  • Segmentation Fault when using "mov" in Assembly

    - by quithakay207
    I am working on a simple assembly program for a class, and am encountering an odd segmentation fault. It's a pretty simple program to convert bytes into kilobytes. However, within the function that does the conversion, I get a segmentation fault when I try to move the value 1024 into the ebx register. I've never had this kind of problem before when working with registers. Does someone know what could be causing this? I imagine it is something simple that I'm overlooking. Thank you! asm_main: enter 0,0 pusha mov eax, 0 mov ebx, 0 call read_int push eax call functionA popa mov leave ret functionA: mov eax, [esp + 4] call print_int call print_nl mov ebx, 1024 ;segmentation fault occurs here div ebx call print_int ret UPDATE: One interesting discovery is that if I delete the lines interacting with the stack, push eax and mov eax, [esp + 4], there is no longer a segmentation fault. However, I get a crazy result in eax after performing div ebx.


  • How does loop address alignment affect the speed on Intel x86_64?

    - by Alexander Gololobov
    I'm seeing a 15% performance degradation of the same C++ code compiled to exactly the same machine instructions but located at differently aligned addresses. When my tiny main loop starts at 0x415220 it's faster than when it is at 0x415250. I'm running this on an Intel Core2 Duo. I use gcc 4.4.5 on x86_64 Ubuntu. Can anybody explain the cause of the slowdown and how I can force gcc to optimally align the loop? Here is the disassembly for both cases with profiler annotation: 415220 576 12.56% |XXXXXXXXXXXXXX 48 c1 eb 08 shr $0x8,%rbx 415224 110 2.40% |XX 0f b6 c3 movzbl %bl,%eax 415227 0.00% | 41 0f b6 04 00 movzbl (%r8,%rax,1),%eax 41522c 40 0.87% | 48 8b 04 c1 mov (%rcx,%rax,8),%rax 415230 806 17.58% |XXXXXXXXXXXXXXXXXXX 4c 63 f8 movslq %eax,%r15 415233 186 4.06% |XXXX 48 c1 e8 20 shr $0x20,%rax 415237 102 2.22% |XX 4c 01 f9 add %r15,%rcx 41523a 414 9.03% |XXXXXXXXXX a8 0f test $0xf,%al 41523c 680 14.83% |XXXXXXXXXXXXXXXX 74 45 je 415283 ::Run(char const*, char const*)+0x4b3 41523e 0.00% | 41 89 c7 mov %eax,%r15d 415241 0.00% | 41 83 e7 01 and $0x1,%r15d 415245 0.00% | 41 83 ff 01 cmp $0x1,%r15d 415249 0.00% | 41 89 c7 mov %eax,%r15d 415250 679 13.05% |XXXXXXXXXXXXXXXX 48 c1 eb 08 shr $0x8,%rbx 415254 124 2.38% |XX 0f b6 c3 movzbl %bl,%eax 415257 0.00% | 41 0f b6 04 00 movzbl (%r8,%rax,1),%eax 41525c 43 0.83% |X 48 8b 04 c1 mov (%rcx,%rax,8),%rax 415260 828 15.91% |XXXXXXXXXXXXXXXXXXX 4c 63 f8 movslq %eax,%r15 415263 388 7.46% |XXXXXXXXX 48 c1 e8 20 shr $0x20,%rax 415267 141 2.71% |XXX 4c 01 f9 add %r15,%rcx 41526a 634 12.18% |XXXXXXXXXXXXXXX a8 0f test $0xf,%al 41526c 749 14.39% |XXXXXXXXXXXXXXXXXX 74 45 je 4152b3 ::Run(char const*, char const*)+0x4c3 41526e 0.00% | 41 89 c7 mov %eax,%r15d 415271 0.00% | 41 83 e7 01 and $0x1,%r15d 415275 0.00% | 41 83 ff 01 cmp $0x1,%r15d 415279 0.00% | 41 89 c7 mov %eax,%r15d


  • Problem with asm program (nasm)

    - by GLeBaTi
    org 0x100 SEGMENT .CODE mov ah,0x9 mov dx, Msg1 int 0x21 ;string input mov ah,0xA mov dx,buff int 0x21 mov ax,0 mov al,[buff+1]; length ;string UPPERCASE mov cl, al mov si, buff cld loop1: lodsb; cmp al, 'a' jnb upper loop loop1 ;output mov ah,0x9 mov dx, buff int 0x21 exit: mov ah, 0x8 int 0x21 int 0x20 upper: sub al,32 jmp loop1 SEGMENT .DATA Msg1 db 'Press string: $' buff db 254,0 This code performs poorly. I think the problem is in "jnb upper". The program is supposed to turn lowercase letters into uppercase ones.


  • help understanding differences between #define, const and enum in C and C++ on assembly level.

    - by martin
    Recently I have been looking at the assembly code generated for #define, const and enum. C code (#define): 3 #define pi 3 4 int main(void) 5 { 6 int a,r=1; 7 a=2*pi*r; 8 return 0; 9 } Assembly code (for lines 6 and 7 of the C code) generated by GCC: 6 mov $0x1, -0x4(%ebp) 7 mov -0x4(%ebp), %edx 7 mov %edx, %eax 7 add %eax, %eax 7 add %edx, %eax 7 add %eax, %eax 7 mov %eax, -0x8(%ebp) C code (enum): 2 int main(void) 3 { 4 int a,r=1; 5 enum{pi=3}; 6 a=2*pi*r; 7 return 0; 8 } Assembly code (for lines 4 and 6 of the C code) generated by GCC: 6 mov $0x1, -0x4(%ebp) 7 mov -0x4(%ebp), %edx 7 mov %edx, %eax 7 add %eax, %eax 7 add %edx, %eax 7 add %eax, %eax 7 mov %eax, -0x8(%ebp) C code (const): 4 int main(void) 5 { 6 int a,r=1; 7 const int pi=3; 8 a=2*pi*r; 9 return 0; 10 } Assembly code (for lines 7 and 8 of the C code) generated by GCC: 6 movl $0x3, -0x8(%ebp) 7 movl $0x3, -0x4(%ebp) 8 mov -0x4(%ebp), %eax 8 add %eax, %eax 8 imul -0x8(%ebp), %eax 8 mov %eax, 0xc(%ebp) I found that with #define and enum the assembly code is the same: the compiler uses 3 add instructions to perform the multiplication. However, when const is used, an imul instruction is used. Does anyone know the reason behind that?


  • Setting processor to 32-bit mode

    - by dboarman-FissureStudios
    It seems that the following is a common method given in many tutorials on switching a processor from 16-bit to 32-bit: mov eax, cr0 ; set bit 0 in CR0-go to pmode or eax, 1 mov cr0, eax Why wouldn't I simply do the following: or cr0, 1 Is there something I'm missing? Possibly the only thing I can think of is that I cannot perform an operation like this on the cr0 register.


  • Why is FLD1 loading NaN instead?

    - by Bernd Jendrissek
    I have a one-liner C function that is just return value * pow(1.+rate, -delay); - it discounts a future value to a present value. The interesting part of the disassembly is 0x080555b9 : neg %eax 0x080555bb : push %eax 0x080555bc : fildl (%esp) 0x080555bf : lea 0x4(%esp),%esp 0x080555c3 : fldl 0xfffffff0(%ebp) 0x080555c6 : fld1 0x080555c8 : faddp %st,%st(1) 0x080555ca : fxch %st(1) 0x080555cc : fstpl 0x8(%esp) 0x080555d0 : fstpl (%esp) 0x080555d3 : call 0x8051ce0 0x080555d8 : fmull 0xfffffff8(%ebp) While single-stepping through this function, gdb says (rate is 0.02, delay is 2; you can see them on the stack): (gdb) si 0x080555c6 30 return value * pow(1.+rate, -delay); (gdb) info float R7: Valid 0x4004a6c28f5c28f5c000 +41.68999999999999773 R6: Valid 0x4004e15c28f5c28f6000 +56.34000000000000341 R5: Valid 0x4004dceb851eb851e800 +55.22999999999999687 R4: Valid 0xc0008000000000000000 -2 =R3: Valid 0x3ff9a3d70a3d70a3d800 +0.02000000000000000042 R2: Valid 0x4004ff147ae147ae1800 +63.77000000000000313 R1: Valid 0x4004e17ae147ae147800 +56.36999999999999744 R0: Valid 0x4004efb851eb851eb800 +59.92999999999999972 Status Word: 0x1861 IE PE SF TOP: 3 Control Word: 0x037f IM DM ZM OM UM PM PC: Extended Precision (64-bits) RC: Round to nearest Tag Word: 0x0000 Instruction Pointer: 0x73:0x080555c3 Operand Pointer: 0x7b:0xbff41d78 Opcode: 0xdd45 And after the fld1: (gdb) si 0x080555c8 30 return value * pow(1.+rate, -delay); (gdb) info float R7: Valid 0x4004a6c28f5c28f5c000 +41.68999999999999773 R6: Valid 0x4004e15c28f5c28f6000 +56.34000000000000341 R5: Valid 0x4004dceb851eb851e800 +55.22999999999999687 R4: Valid 0xc0008000000000000000 -2 R3: Valid 0x3ff9a3d70a3d70a3d800 +0.02000000000000000042 =R2: Special 0xffffc000000000000000 Real Indefinite (QNaN) R1: Valid 0x4004e17ae147ae147800 +56.36999999999999744 R0: Valid 0x4004efb851eb851eb800 +59.92999999999999972 Status Word: 0x1261 IE PE SF C1 TOP: 2 Control Word: 0x037f IM DM ZM OM UM PM PC: Extended Precision (64-bits) RC: Round to 
nearest Tag Word: 0x0020 Instruction Pointer: 0x73:0x080555c6 Operand Pointer: 0x7b:0xbff41d78 Opcode: 0xd9e8 After this, everything goes to hell. Things get grossly over or undervalued, so even if there were no other bugs in my freeciv AI attempt, it would choose all the wrong strategies. Like sending the whole army to the arctic. (Sigh, if only I were getting that far.) I must be missing something obvious, or getting blinded by something, because I can't believe that fld1 should ever possibly fail. Even less that it should fail only after a handful of passes through this function. On earlier passes the FPU correctly loads 1 into ST(0). The bytes at 0x080555c6 definitely encode fld1 - checked with x/... on the running process. What gives?
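    One reading of the two dumps, offered as an interpretation rather than something the post states: before the fld1 every register is tagged Valid (Tag Word 0x0000), so all eight x87 registers are occupied, and afterwards the Status Word shows IE and SF with C1 set, which is how the FPU reports a stack overflow. With the invalid-operation exception masked (IM is set in the Control Word), a push onto an occupied register stores the QNaN "real indefinite" instead of the value. The toy C model below only imitates that rule. struct x87_model and its functions are invented for illustration.

```c
#include <math.h>
#include <stdbool.h>

/* Toy model of the x87 register stack: 8 slots, a TOP index and a
 * per-slot "occupied" tag. Not a real FPU emulator. */
struct x87_model {
    double st[8];
    bool used[8];
    int top;
    bool stack_fault;   /* stands in for the IE+SF status bits */
};

/* Push (as fld/fld1 would): TOP is decremented modulo 8. If the new
 * top slot is still occupied, record a stack fault and store NaN
 * (the masked response) instead of the requested value. */
void x87_push(struct x87_model *f, double value)
{
    f->top = (f->top + 7) & 7;
    if (f->used[f->top]) {
        f->stack_fault = true;
        f->st[f->top] = NAN;
    } else {
        f->st[f->top] = value;
    }
    f->used[f->top] = true;
}

/* Pop (as fstp would): mark the top slot empty and increment TOP. */
void x87_pop(struct x87_model *f)
{
    f->used[f->top] = false;
    f->top = (f->top + 1) & 7;
}
```

    In this model eight pushes without a matching pop fill the stack, and the ninth push yields NaN. That would also fit the observation that the function only fails after a handful of passes, if something leaves one extra value on the FPU stack each time. That last part is a guess.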


  • Odd optimization problem under MSVC

    - by Goz
    I've seen this blog: http://igoro.com/archive/gallery-of-processor-cache-effects/ The "weirdness" in part 7 is what caught my interest. My first thought was "That's just C# being weird". It's not. I wrote the following C++ code. volatile int* p = (volatile int*)_aligned_malloc( sizeof( int ) * 8, 64 ); memset( (void*)p, 0, sizeof( int ) * 8 ); double dStart = t.GetTime(); for (int i = 0; i < 200000000; i++) { //p[0]++;p[1]++;p[2]++;p[3]++; // Option 1 //p[0]++;p[2]++;p[4]++;p[6]++; // Option 2 p[0]++;p[2]++; // Option 3 } double dTime = t.GetTime() - dStart; The timings I get on my 2.4 GHz Core 2 Quad are as follows: Option 1 = ~8 cycles per loop. Option 2 = ~4 cycles per loop. Option 3 = ~6 cycles per loop. Now this is confusing. My reasoning behind the difference comes down to the cache write latency (3 cycles) on my chip and an assumption that the cache has a 128-bit write port (this is pure guesswork on my part). On that basis, in Option 1: it will increment p[0] (1 cycle) then increment p[2] (1 cycle) then it has to wait 1 cycle (for cache) then p[1] (1 cycle) then wait 1 cycle (for cache) then p[3] (1 cycle). Finally 2 cycles for increment and jump (though it's usually implemented as decrement and jump). This gives a total of 8 cycles. In Option 2: it can increment p[0] and p[4] in one cycle then increment p[2] and p[6] in another cycle. Then 2 cycles for subtract and jump. No waits needed on cache. Total 4 cycles. In Option 3: it can increment p[0] then has to wait 2 cycles then increment p[2] then subtract and jump. The problem is if you set case 3 to increment p[0] and p[4] it STILL takes 6 cycles (which kinda blows my 128-bit read/write port out of the water). So ... can anyone tell me what the hell is going on here? Why DOES case 3 take longer? Also I'd love to know what I've got wrong in my thinking above, as I obviously have something wrong! Any ideas would be much appreciated! :) It'd also be interesting to see how GCC or any other compiler copes with it as well! Edit: Jerry Coffin's idea gave me some thoughts. I've done some more tests (on a different machine, so forgive the change in timings) with and without nops and with different counts of nops: case 2 - 0.46 00401ABD jne (401AB0h) 0 nops - 0.68 00401AB7 jne (401AB0h) 1 nop - 0.61 00401AB8 jne (401AB0h) 2 nops - 0.636 00401AB9 jne (401AB0h) 3 nops - 0.632 00401ABA jne (401AB0h) 4 nops - 0.66 00401ABB jne (401AB0h) 5 nops - 0.52 00401ABC jne (401AB0h) 6 nops - 0.46 00401ABD jne (401AB0h) 7 nops - 0.46 00401ABE jne (401AB0h) 8 nops - 0.46 00401ABF jne (401AB0h) 9 nops - 0.55 00401AC0 jne (401AB0h) I've included the jump statements so you can see that the source and destination are in one cache line. You can also see that we start to get a difference when we are 13 bytes or more apart. Until we hit 16 ... then it all goes wrong. So Jerry isn't right (though his suggestion DOES help a bit), however something IS going on. I'm more and more intrigued to try and figure out what it is now. It does appear to be some sort of memory alignment oddity rather than some sort of instruction throughput oddity. Anyone want to explain this for an inquisitive mind? :D Edit 3: Interjay has a point on the unrolling that blows the previous edit out of the water. With an unrolled loop the performance does not improve. You need to add a nop to make the gap between jump source and destination the same as for my good nop count above. Performance still sucks. It's interesting that I need 6 nops to improve performance though. I wonder how many nops the processor can issue per cycle? If it's 3 then that accounts for the cache write latency ... But, if that's it, why is the latency occurring? Curiouser and curiouser ...

