Search Results

Search found 2842 results on 114 pages for 'fo x86'.

Page 17/114 | < Previous Page | 13 14 15 16 17 18 19 20 21 22 23 24  | Next Page >

  • Why is a 16-bit register used with BSR instruction in this code snippet?

    - by sharptooth
    In this hardcore article there's a function find_maskwidth() that basically detects the number of bits required to represent itemCount dictinct values: unsigned int find_maskwidth( unsigned int itemCount ) { unsigned int maskWidth, count = itemCount; __asm { mov eax, count mov ecx, 0 mov maskWidth, ecx dec eax bsr cx, ax jz next inc cx mov maskWidth, ecx next: } return maskWidth; } the question is why do they use ax and cx registers instead of eax and ecx?

    Read the article

  • Segmentation Fault when using "mov" in Assembly

    - by quithakay207
    I am working on a simple assembly program for a class, and am encountering an odd segmentation fault. It's a pretty simple program to convert bytes into kilobytes. However, within the function that does the conversion, I get a segmentation fault when I try to move the value 1024 into the ebx register. I've never had this kind of problem before when working with registers. Does someone know what could be causing this? I imagine it is something simple that I'm overlooking. Thank you! asm_main: enter 0,0 pusha mov eax, 0 mov ebx, 0 call read_int push eax call functionA popa mov leave ret functionA: mov eax, [esp + 4] call print_int call print_nl mov ebx, 1024 ;segmentation fault occurs here div ebx call print_int ret UPDATE: One interesting discovery is that if I delete the lines interacting with the stack, push eax and mov eax, [esp + 4], there is no longer a segmentation fault. However, I get a crazy result in eax after performing div ebx.

    Read the article

  • How does loop address alignment affect the speed on Intel x86_64?

    - by Alexander Gololobov
    I'm seeing 15% performance degradation of the same C++ code compiled to exactly same machine instructions but located on differently aligned addresses. When my tiny main loop starts at 0x415220 it's faster then when it is at 0x415250. I'm running this on Intel Core2 Duo. I use gcc 4.4.5 on x86_64 Ubuntu. Can anybody explain the cause of slowdown and how I can force gcc to optimally align the loop? Here is the disassembly for both cases with profiler annotation: 415220 576 12.56% |XXXXXXXXXXXXXX 48 c1 eb 08 shr $0x8,%rbx 415224 110 2.40% |XX 0f b6 c3 movzbl %bl,%eax 415227 0.00% | 41 0f b6 04 00 movzbl (%r8,%rax,1),%eax 41522c 40 0.87% | 48 8b 04 c1 mov (%rcx,%rax,8),%rax 415230 806 17.58% |XXXXXXXXXXXXXXXXXXX 4c 63 f8 movslq %eax,%r15 415233 186 4.06% |XXXX 48 c1 e8 20 shr $0x20,%rax 415237 102 2.22% |XX 4c 01 f9 add %r15,%rcx 41523a 414 9.03% |XXXXXXXXXX a8 0f test $0xf,%al 41523c 680 14.83% |XXXXXXXXXXXXXXXX 74 45 je 415283 ::Run(char const*, char const*)+0x4b3 41523e 0.00% | 41 89 c7 mov %eax,%r15d 415241 0.00% | 41 83 e7 01 and $0x1,%r15d 415245 0.00% | 41 83 ff 01 cmp $0x1,%r15d 415249 0.00% | 41 89 c7 mov %eax,%r15d 415250 679 13.05% |XXXXXXXXXXXXXXXX 48 c1 eb 08 shr $0x8,%rbx 415254 124 2.38% |XX 0f b6 c3 movzbl %bl,%eax 415257 0.00% | 41 0f b6 04 00 movzbl (%r8,%rax,1),%eax 41525c 43 0.83% |X 48 8b 04 c1 mov (%rcx,%rax,8),%rax 415260 828 15.91% |XXXXXXXXXXXXXXXXXXX 4c 63 f8 movslq %eax,%r15 415263 388 7.46% |XXXXXXXXX 48 c1 e8 20 shr $0x20,%rax 415267 141 2.71% |XXX 4c 01 f9 add %r15,%rcx 41526a 634 12.18% |XXXXXXXXXXXXXXX a8 0f test $0xf,%al 41526c 749 14.39% |XXXXXXXXXXXXXXXXXX 74 45 je 4152b3 ::Run(char const*, char const*)+0x4c3 41526e 0.00% | 41 89 c7 mov %eax,%r15d 415271 0.00% | 41 83 e7 01 and $0x1,%r15d 415275 0.00% | 41 83 ff 01 cmp $0x1,%r15d 415279 0.00% | 41 89 c7 mov %eax,%r15d

    Read the article

  • Problem with asm program (nasm)

    - by GLeBaTi
    org 0x100 SEGMENT .CODE mov ah,0x9 mov dx, Msg1 int 0x21 ;string input mov ah,0xA mov dx,buff int 0x21 mov ax,0 mov al,[buff+1]; length ;string UPPERCASE mov cl, al mov si, buff cld loop1: lodsb; cmp al, 'a' jnb upper loop loop1 ;output mov ah,0x9 mov dx, buff int 0x21 exit: mov ah, 0x8 int 0x21 int 0x20 upper: sub al,32 jmp loop1 SEGMENT .DATA Msg1 db 'Press string: $' buff db 254,0 this code perform poorly. I think that problem in "jnb upper". This program make small symbols into big symbols.

    Read the article

  • help understanding differences between #define, const and enum in C and C++ on assembly level.

    - by martin
    recently, i am looking into assembly codes for #define, const and enum: C codes(#define): 3 #define pi 3 4 int main(void) 5 { 6 int a,r=1; 7 a=2*pi*r; 8 return 0; 9 } assembly codes(for line 6 and 7 in c codes) generated by GCC: 6 mov $0x1, -0x4(%ebp) 7 mov -0x4(%ebp), %edx 7 mov %edx, %eax 7 add %eax, %eax 7 add %edx, %eax 7 add %eax, %eax 7 mov %eax, -0x8(%ebp) C codes(enum): 2 int main(void) 3 { 4 int a,r=1; 5 enum{pi=3}; 6 a=2*pi*r; 7 return 0; 8 } assembly codes(for line 4 and 6 in c codes) generated by GCC: 6 mov $0x1, -0x4(%ebp) 7 mov -0x4(%ebp), %edx 7 mov %edx, %eax 7 add %eax, %eax 7 add %edx, %eax 7 add %eax, %eax 7 mov %eax, -0x8(%ebp) C codes(const): 4 int main(void) 5 { 6 int a,r=1; 7 const int pi=3; 8 a=2*pi*r; 9 return 0; 10 } assembly codes(for line 7 and 8 in c codes) generated by GCC: 6 movl $0x3, -0x8(%ebp) 7 movl $0x3, -0x4(%ebp) 8 mov -0x4(%ebp), %eax 8 add %eax, %eax 8 imul -0x8(%ebp), %eax 8 mov %eax, 0xc(%ebp) i found that use #define and enum, the assembly codes are the same. The compiler use 3 add instructions to perform multiplication. However, when use const, imul instruction is used. Anyone knows the reason behind that?

    Read the article

  • Setting processor to 32-bit mode

    - by dboarman-FissureStudios
    It seems that the following is a common method given in many tutorials on switching a processor from 16-bit to 32-bit: mov eax, cr0 ; set bit 0 in CR0-go to pmode or eax, 1 mov cr0, eax Why wouldn't I simply do the following: or cr0, 1 Is there something I'm missing? Possibly the only thing I can think of is that I cannot perform an operation like this on the cr0 register.

    Read the article

  • Why is FLD1 loading NaN instead?

    - by Bernd Jendrissek
    I have a one-liner C function that is just return value * pow(1.+rate, -delay); - it discounts a future value to a present value. The interesting part of the disassembly is 0x080555b9 : neg %eax 0x080555bb : push %eax 0x080555bc : fildl (%esp) 0x080555bf : lea 0x4(%esp),%esp 0x080555c3 : fldl 0xfffffff0(%ebp) 0x080555c6 : fld1 0x080555c8 : faddp %st,%st(1) 0x080555ca : fxch %st(1) 0x080555cc : fstpl 0x8(%esp) 0x080555d0 : fstpl (%esp) 0x080555d3 : call 0x8051ce0 0x080555d8 : fmull 0xfffffff8(%ebp) While single-stepping through this function, gdb says (rate is 0.02, delay is 2; you can see them on the stack): (gdb) si 0x080555c6 30 return value * pow(1.+rate, -delay); (gdb) info float R7: Valid 0x4004a6c28f5c28f5c000 +41.68999999999999773 R6: Valid 0x4004e15c28f5c28f6000 +56.34000000000000341 R5: Valid 0x4004dceb851eb851e800 +55.22999999999999687 R4: Valid 0xc0008000000000000000 -2 =R3: Valid 0x3ff9a3d70a3d70a3d800 +0.02000000000000000042 R2: Valid 0x4004ff147ae147ae1800 +63.77000000000000313 R1: Valid 0x4004e17ae147ae147800 +56.36999999999999744 R0: Valid 0x4004efb851eb851eb800 +59.92999999999999972 Status Word: 0x1861 IE PE SF TOP: 3 Control Word: 0x037f IM DM ZM OM UM PM PC: Extended Precision (64-bits) RC: Round to nearest Tag Word: 0x0000 Instruction Pointer: 0x73:0x080555c3 Operand Pointer: 0x7b:0xbff41d78 Opcode: 0xdd45 And after the fld1: (gdb) si 0x080555c8 30 return value * pow(1.+rate, -delay); (gdb) info float R7: Valid 0x4004a6c28f5c28f5c000 +41.68999999999999773 R6: Valid 0x4004e15c28f5c28f6000 +56.34000000000000341 R5: Valid 0x4004dceb851eb851e800 +55.22999999999999687 R4: Valid 0xc0008000000000000000 -2 R3: Valid 0x3ff9a3d70a3d70a3d800 +0.02000000000000000042 =R2: Special 0xffffc000000000000000 Real Indefinite (QNaN) R1: Valid 0x4004e17ae147ae147800 +56.36999999999999744 R0: Valid 0x4004efb851eb851eb800 +59.92999999999999972 Status Word: 0x1261 IE PE SF C1 TOP: 2 Control Word: 0x037f IM DM ZM OM UM PM PC: Extended Precision (64-bits) RC: Round to nearest Tag Word: 0x0020 Instruction Pointer: 0x73:0x080555c6 Operand Pointer: 0x7b:0xbff41d78 Opcode: 0xd9e8 After this, everything goes to hell. Things get grossly over or undervalued, so even if there were no other bugs in my freeciv AI attempt, it would choose all the wrong strategies. Like sending the whole army to the arctic. (Sigh, if only I were getting that far.) I must be missing something obvious, or getting blinded by something, because I can't believe that fld1 should ever possibly fail. Even less that it should fail only after a handful of passes through this function. On earlier passes the FPU correctly loads 1 into ST(0). The bytes at 0x080555c6 definitely encode fld1 - checked with x/... on the running process. What gives?

    Read the article

  • use callback function to report stack backtrace

    - by user353394
    Assume I have the following: typedef struct { char *name; char binding; int address; } Fn_Symbol //definition of function symbol static Fn_Symbol *fnSymbols; //array of function symbols in a file statc int total; //number of symbol functions in the array and file static void PrintBacktrace(int sigum, siginfo_t * siginfo, void *context) { printf("\nSignal received %d (%s)\n", signum, strsignal(signum)); const int eip_index = 14; void *eip = (void *)((struct ucontext *)context)->uc_mcontext.gregs[eip_index]; printf("Error at [%p] %s (+0x%x), eip, fnName, offset from start); //????? exit(0); } I have this so far, but what is the best way using the fnSymbols static global pointer to identify the function where the error occured and then back trace through the stack to identify each calling function by address, name, and offset?

    Read the article

  • Odd optimization problem under MSVC

    - by Goz
    I've seen this blog: http://igoro.com/archive/gallery-of-processor-cache-effects/ The "weirdness" in part 7 is what caught my interest. My first thought was "Thats just C# being weird". Its not I wrote the following C++ code. volatile int* p = (volatile int*)_aligned_malloc( sizeof( int ) * 8, 64 ); memset( (void*)p, 0, sizeof( int ) * 8 ); double dStart = t.GetTime(); for (int i = 0; i < 200000000; i++) { //p[0]++;p[1]++;p[2]++;p[3]++; // Option 1 //p[0]++;p[2]++;p[4]++;p[6]++; // Option 2 p[0]++;p[2]++; // Option 3 } double dTime = t.GetTime() - dStart; The timing I get on my 2.4 Ghz Core 2 Quad go as follows: Option 1 = ~8 cycles per loop. Option 2 = ~4 cycles per loop. Option 3 = ~6 cycles per loop. Now This is confusing. My reasoning behind the difference comes down to the cache write latency (3 cycles) on my chip and an assumption that the cache has a 128-bit write port (This is pure guess work on my part). On that basis in Option 1: It will increment p[0] (1 cycle) then increment p[2] (1 cycle) then it has to wait 1 cycle (for cache) then p[1] (1 cycle) then wait 1 cycle (for cache) then p[3] (1 cycle). Finally 2 cycles for increment and jump (Though its usually implemented as decrement and jump). This gives a total of 8 cycles. In Option 2: It can increment p[0] and p[4] in one cycle then increment p[2] and p[6] in another cycle. Then 2 cycles for subtract and jump. No waits needed on cache. Total 4 cycles. In option 3: It can increment p[0] then has to wait 2 cycles then increment p[2] then subtract and jump. The problem is if you set case 3 to increment p[0] and p[4] it STILL takes 6 cycles (which kinda blows my 128-bit read/write port out of the water). So ... can anyone tell me what the hell is going on here? Why DOES case 3 take longer? Also I'd love to know what I've got wrong in my thinking above, as i obviously have something wrong! Any ideas would be much appreciated! :) It'd also be interesting to see how GCC or any other compiler copes with it as well! Edit: Jerry Coffin's idea gave me some thoughts. I've done some more tests (on a different machine so forgive the change in timings) with and without nops and with different counts of nops case 2 - 0.46 00401ABD jne (401AB0h) 0 nops - 0.68 00401AB7 jne (401AB0h) 1 nop - 0.61 00401AB8 jne (401AB0h) 2 nops - 0.636 00401AB9 jne (401AB0h) 3 nops - 0.632 00401ABA jne (401AB0h) 4 nops - 0.66 00401ABB jne (401AB0h) 5 nops - 0.52 00401ABC jne (401AB0h) 6 nops - 0.46 00401ABD jne (401AB0h) 7 nops - 0.46 00401ABE jne (401AB0h) 8 nops - 0.46 00401ABF jne (401AB0h) 9 nops - 0.55 00401AC0 jne (401AB0h) I've included the jump statetements so you can see that the source and destination are in one cache line. You can also see that we start to get a difference when we are 13 bytes or more apart. Until we hit 16 ... then it all goes wrong. So Jerry isn't right (though his suggestion DOES help a bit), however something IS going on. I'm more and more intrigued to try and figure out what it is now. It does appear to be more some sort of memory alignment oddity rather than some sort of instruction throughput oddity. Anyone want to explain this for an inquisitive mind? :D Edit 3: Interjay has a point on the unrolling that blows the previous edit out of the water. With an unrolled loop the performance does not improve. You need to add a nop in to make the gap between jump source and destination the same as for my good nop count above. Performance still sucks. Its interesting that I need 6 nops to improve performance though. I wonder how many nops the processor can issue per cycle? If its 3 then that account for the cache write latency ... But, if thats it, why is the latency occurring? Curiouser and curiouser ...

    Read the article

  • How do you use printf from Assembly?

    - by bobobobo
    I have an MSVC++ project set up to compile and run assembly code. In main.c: #include <stdio.h> void go() ; int main() { go() ; // call the asm routine } In go.asm: .586 .model flat, c .code go PROC invoke puts,"hi" RET go ENDP end But when I compile and run, I get an error in go.asm: error A2006: undefined symbol : puts How do I define the symbols in <stdio.h> for the .asm files in the project?

    Read the article

  • What am I doing wrong? (Simple Assembly Loop)

    - by sunnyohno
    It won't let me post the picture. Btw, Someone from Reddit.programming sent me over here. So thanks! TITLE MASM Template ; Description ; ; Revision date: INCLUDE Irvine32.inc .data myArray BYTE 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 .code main PROC call Clrscr mov esi, OFFSET myArray mov ecx, LENGTHOF myArray mov eax, 0 L1: add eax, [esi] inc esi loop L1 call WriteInt exit main ENDP END main Results in: -334881242

    Read the article

  • Examining C/C++ Heap memory statistics in gdb

    - by fd
    I'm trying to investigate the state of the C/C++ heap from within gdb on Linux amd64, is there a nice way to do this? One approach I've tried is to "call mallinfo()" but unfortunately I can't then extract the values I want since gdb deal with the return value properly. I'm not easily able to write a function to be compiled into the binary for the process I am attached to, so I can simply implement my own function to extract the values by calling mallinfo() in my own code this way. Is there perhaps a clever trick that will allow me to do this on-the-fly? Another option could be to locate the heap and traverse the malloc headers / free list; I'd appreciate any pointers to where I could start in finding the location and layout of these. I've been trying to Google and read around the problem for about 2 hours and I've learnt some fascinating stuff but still not found what I need.

    Read the article

  • Poco C++ library on OSX 10.8.2: Undefined symbols for architecture x86_64

    - by Arman
    I'm trying to use Poco C++ library to do the simple http requests in C++ on Mac OS X 10.8.2. I installed Poco, copy-pasted the http_request.cc code from this tutorial, ran it with 'g++ -o http_get http_get.cc -lPocoNet', but got: Undefined symbols for architecture x86_64: "Poco::StreamCopier::copyStream(std::basic_istream<char, std::char_traits<char> >&, std::basic_ostream<char, std::char_traits<char> >&, unsigned long)", referenced from: _main in ccKuZb1g.o "Poco::URI::URI(char const*)", referenced from: _main in ccKuZb1g.o "Poco::URI::~URI()", referenced from: _main in ccKuZb1g.o "Poco::URI::getPathAndQuery() const", referenced from: _main in ccKuZb1g.o "Poco::URI::getPort() const", referenced from: _main in ccKuZb1g.o "Poco::Exception::displayText() const", referenced from: _main in ccKuZb1g.o "typeinfo for Poco::Exception", referenced from: GCC_except_table1 in ccKuZb1g.o ld: symbol(s) not found for architecture x86_64 collect2: ld returned 1 exit status Have been struggling with this for couple of hours. Any idea how to fix this? Thanks in advance!

    Read the article

  • Where is the bottleneck in this code?

    - by Mikhail
    I have the following tight loop that makes up the serial bottle neck of my code. Ideally I would parallelize the function that calls this but that is not possible. //n is about 60 for (int k = 0;k < n;k++) { double fone = z[k*n+i+1]; double fzer = z[k*n+i]; z[k*n+i+1]= s*fzer+c*fone; z[k*n+i] = c*fzer-s*fone; } Are there any optimizations that can be made such as vectorization or some evil inline that can help this code? I am looking into finding eigen solutions of tridiagonal matrices. http://www.cimat.mx/~posada/OptDoglegGraph/DocLogisticDogleg/projects/adjustedrecipes/tqli.cpp.html

    Read the article

  • NASM shift operators

    - by Hudson Worden
    How would you go about doing a bit shift in NASM on a register? I read the manual and it only seems to mention these operators , <<. When I try to use them NASM complains about the shift operator working on scalar values. Can you explain what a scalar value is and give an example of how to use and <<. Also, I thought there were a shr or shl operators. If they do exist can you give an example of how to use them? Thank you for your time.

    Read the article

  • Calling assembly in GCC????

    - by rbr200
    include static inline uint xchg(volatile unsigned int *addr, unsigned int newval) { uint result; asm volatile("lock; xchgl %0, %1" : "+m" (*addr), "=a" (result) : "1" (newval) : "cc"); return result; } Can some one tell me what this code does exactly. I mean I have an idea or the parts of this command. "1" newval is the input, "=a" is to flush out its previous value and update it. "m" is for the memory operation but I am confused about the functionality of this function. What does the "+m" sign do? Does this function do sumthing like m=a; m = newval; return a

    Read the article

  • How to move value from the stack to ST(0)?

    - by George Edison
    I am having trouble believing the following code is the most efficient way to move a value from the stack to ST(0): .data var dd 4.2 tmp dd ? .code mov EAX, var push EAX ; top of stack now contains a value ; move it to ST(0) pop EAX mov tmp, EAX fld tmp Is the temporary variable really necessary? Further, is there an easier way to get a value from the stack to ST(0)?

    Read the article

  • Accessing PCI Device from user space programs

    - by crissangel
    I have a device which would be interface with my processor through pcie. I have written driver for it using the existing pci file operations. Now my problem is how do I access it from user space programs? PCI File operations do not have IOCTL support and hence I cant make an ioctl call unlike other char devices. I cannot use pci_config_read_byte etc. functions as they are meant for kernel space(included in linux/pci.h).

    Read the article

  • Effective optimization strategies on modern C++ compilers

    - by user168715
    I'm working on scientific code that is very performance-critical. An initial version of the code has been written and tested, and now, with profiler in hand, it's time to start shaving cycles from the hot spots. It's well-known that some optimizations, e.g. loop unrolling, are handled these days much more effectively by the compiler than by a programmer meddling by hand. Which techniques are still worthwhile? Obviously, I'll run everything I try through a profiler, but if there's conventional wisdom as to what tends to work and what doesn't, it would save me significant time. I know that optimization is very compiler- and architecture- dependent. I'm using Intel's C++ compiler targeting the Core 2 Duo, but I'm also interested in what works well for gcc, or for "any modern compiler." Here are some concrete ideas I'm considering: Is there any benefit to replacing STL containers/algorithms with hand-rolled ones? In particular, my program includes a very large priority queue (currently a std::priority_queue) whose manipulation is taking a lot of total time. Is this something worth looking into, or is the STL implementation already likely the fastest possible? Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays? I've found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I'm interesting in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn't need to modify the result, of course)? How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops? Given the scientific nature of the program, floating-point numbers are used everywhere. A significant bottleneck in my code used to be conversions from floating point to integers: the compiler would emit code to save the current rounding mode, change it, perform the conversion, then restore the old rounding mode --- even though nothing in the program ever changed the rounding mode! Disabling this behavior significantly sped up my code. Are there any similar floating-point-related gotchas I should be aware of? One consequence of C++ being compiled and linked separately is that the compiler is unable to do what would seem to be very simple optimizations, such as move method calls like strlen() out of the termination conditions of loop. Are there any optimization like this one that I should look out for because they can't be done by the compiler and must be done by hand? On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler's ability to automatically optimize code? Lastly, to nip certain kinds of answers in the bud: I understand that optimization has a cost in terms of complexity, reliability, and maintainability. For this particular application, increased performance is worth these costs. I understand that the best optimizations are often to improve the high-level algorithms, and this has already been done.

    Read the article

  • Implementing traceback on i386

    - by markelliott2000
    Hi, I am currently porting our code from an alpha (Tru64) to an i386 processor (Linux) in C. Everything has gone pretty smoothly up until I looked into porting our exception handling routine. Currently we have a parent process which spawns lots of sub processes, and when one of these sub-processes fatal's (unfielded) I have routines to catch the process. I am currently struggling to find the best method of implementing a traceback routine which can list the function addresses in the error log, currently my routine just prints the the signal which caused the exception and the exception qualifier code. Any help would be greatly received, ideally I would write error handling for all processors, however at this stage I only really care about i386, and x86_64. Thanks Mark

    Read the article

< Previous Page | 13 14 15 16 17 18 19 20 21 22 23 24  | Next Page >