Search Results

Search found 3905 results on 157 pages for 'assembly'.

Page 37/157 | < Previous Page | 33 34 35 36 37 38 39 40 41 42 43 44  | Next Page >

  • Using ret with FASM on Win32

    - by Jon Purdy
    I'm using SDL with FASM, and have code that's minimally like the following: format ELF extrn _SDL_Init extrn _SDL_SetVideoMode extrn _SDL_Quit extrn _exit SDL_INIT_VIDEO equ 0x00000020 section '.text' public _SDL_main _SDL_main: ccall _SDL_Init, SDL_INIT_VIDEO ccall _SDL_SetVideoMode, 640, 480, 32, 0 ccall _SDL_Quit ccall _exit, 0 ; Success, or ret ; failure. With the following quick-and-dirty makefile: SOURCES = main.asm OBJECTS = main.o TARGET = SDLASM.exe FASM = C:\fasm\fasm.exe release : $(OBJECTS) ld $(OBJECTS) -LC:/SDL/lib/ -lSDLmain -lSDL -LC:/MinGW/lib/ -lmingw32 -lcrtdll -o $(TARGET) --subsystem windows cleanrelease : del $(OBJECTS) %.o : %.asm $(FASM) $< $@ Using exit() (or Windows' ExitProcess()) seems to be the only way to get this program to exit cleanly, even though I feel like I should be able to use retn/retf. When I just ret without calling exit(), the application does not terminate and needs to be killed. Could anyone shed some light on this? It only happens when I make the call to SDL_SetVideoMode().

    Read the article

  • ASP.NET reading files from BIN

    - by nettguy
    I am processing some CSV file which i have copied in Bin folder of My ASP.NET Website. When i execute using (IDataReader csv = new CsvReader (new StreamReader("sample.txt"), true, '|')) { ..... } it complains me that "sample.txt" not found in "c:\Program Files\.....\" Won't the runtime automatically look into the bin folder? what modification do i need to do?

    Read the article

  • More about the Standard Entry Sequence

    - by Mask
    quoted from here: _function: push ebp ;store the old base pointer mov ebp, esp ;make the base pointer point to the current ;stack location – at the top of the stack is the ;old ebp, followed by the return address and then ;the parameters. sub esp, x ;x is the size, in bytes, of all ;"automatic variables" in the function What's stored in esp in the above code snippet?

    Read the article

  • Why is a 16-bit register used with BSR instruction in this code snippet?

    - by sharptooth
    In this hardcore article there's a function find_maskwidth() that basically detects the number of bits required to represent itemCount dictinct values: unsigned int find_maskwidth( unsigned int itemCount ) { unsigned int maskWidth, count = itemCount; __asm { mov eax, count mov ecx, 0 mov maskWidth, ecx dec eax bsr cx, ax jz next inc cx mov maskWidth, ecx next: } return maskWidth; } the question is why do they use ax and cx registers instead of eax and ecx?

    Read the article

  • NDepend: How to not display 'tier' assemblies in dependency graph?

    - by Edward Buatois
    I was able to do this in an earlier version of nDepend by going to tools-options and setting which assemblies would be part of the analysis (and ignore the rest). The latest version of the trial version of nDepend lets me set it, but it seems to ignore the setting and always analyze all assemblies whether I want it to or not. I tried to delete the "tier" assemblies by moving them over to the "application assemblies" list, but when I delete them out of there, they just get added back to the "tier" list, which I can't ignore. I don't want my dependency graph to contain assemblies like "system," "system.xml," and "system.serialization!" I want only MY assemblies in the dependency graph! Or is that a paid-version feature now? Is there a way to do what I'm talking about?

    Read the article

  • See if any application has a DLL from the GAC loaded

    - by rwmnau
    I'm trying to deploy new copies of my DLL to the GAC on remote servers, but I need to identify if any processes currently running have a loaded copy of the DLL I'm replacing - I'd like to restart them, or at least tell the user. For example, Biztalk seems to load the DLLs it needs the first time they're used, and then replacing them keeps the old copy in memory until the Host Instances are restarted - something I could easily do as part of my deployment. Is there a way to tell using .NET which processes have loaded a particular DLL from the GAC? UPDATE: Some further investigation shows that both Process Explorer has this functionality, and another Sysinternals tool, ListDLL, does exactly what I want to be able to do. I'd like to know how they do it, since I'd love to replicate this functionality in my application without having to include and screen-scrape ListDLL (if that's even allowed inside the license).

    Read the article

  • Ret Failure with SDL using FASM on Win32

    - by Jon Purdy
    I'm using SDL with FASM, and have code that's minimally like the following: format ELF extrn _SDL_Init extrn _SDL_SetVideoMode extrn _SDL_Quit extrn _exit SDL_INIT_VIDEO equ 0x00000020 section '.text' public _SDL_main _SDL_main: ccall _SDL_Init, SDL_INIT_VIDEO ccall _SDL_SetVideoMode, 640, 480, 32, 0 ccall _SDL_Quit ccall _exit, 0 ; Success, or ret ; failure. With the following quick-and-dirty makefile: SOURCES = main.asm OBJECTS = main.o TARGET = SDLASM.exe FASM = C:\fasm\fasm.exe release : $(OBJECTS) ld $(OBJECTS) -LC:/SDL/lib/ -lSDLmain -lSDL -LC:/MinGW/lib/ -lmingw32 -lcrtdll -o $(TARGET) --subsystem windows cleanrelease : del $(OBJECTS) %.o : %.asm $(FASM) $< $@ Using exit() (or Windows' ExitProcess()) seems to be the only way to get this program to exit cleanly, even though I feel like I should be able to use retn/retf. When I just ret without calling exit(), the application does not terminate and needs to be killed. Could anyone shed some light on this? It only happens when I make the call to SDL_SetVideoMode().

    Read the article

  • In which scenario it is useful to use Disassembly on python?

    - by systempuntoout
    The dis module can be effectively used to disassemble Python methods, functions and classes into low-level interpreter instructions. I know that dis information can be used for: 1. Find race condition in programs that use threads 2. Find possible optimizations From your experience, do you know any other scenarios where Disassembly Python feature could be useful?

    Read the article

  • How Do You Make An Assembler?

    - by mudge
    I'd like to make a simple x86 assembler. I'm wondering if there's any tutorials for making your own assembler. Or if there's a simple assembler that I could study. Also, I wonder what tools are used in looking at and handling the binary/hex of programs.

    Read the article

  • x86_64 assembler: only one call per subroutine?

    - by zneak
    Hello everyone, I decided yesterday to start doing assembler. Most of it is okay (well, as okay as assembler can be), but I'm getting some problems with gas. It seems that I can call functions only once. After that, any subsequent call opcode with the same function name will fail. I must be doing something terribly wrong, though I can't see what. Take this small C function for instance: void path_free(path_t path) { if (path == NULL) return; free(((point_list_t*)path)->points); free(path); } I "translated" it to assembler like that: .globl _path_free _path_free: push rbp mov rbp, rsp cmp rdi, 0 jz byebye push rdi mov rdi, qword ptr [rdi] call _free pop rdi sub rsp, 8 call _free byebye: leave ret This triggers the following error for the second call _free: suffix or operands invalid for ``call''. And if I change it to something else, like free2, everything works (until link time, that is). Assembler code gcc -S gave me looks very similar to what I've done (except it's in AT&T syntax), so I'm kind of lost. I'm doing this on Mac OS X under the x86_64 architecture.

    Read the article

  • (x86) Assembler Optimization

    - by Pindatjuh
    I'm building a compiler/assembler/linker in Java for the x86-32 (IA32) processor targeting Windows. High-level concepts of a "language" (in essential a Java API for creating executables) are translated into opcodes, which then are wrapped and outputted to a file. The translation process has several phases, one is the translation between languages: the highest-level code is translated into the medium-level code which is then translated into the lowest-level code (probably more than 3 levels). My problem is the following; if I have higher-level code (X and Y) translated to lower-level code (x, y, U and V), then an example of such a translation is, in pseudo-code: x + U(f) // generated by X + V(f) + y // generated by Y (An easy example) where V is the opposite of U (compare with a stack push as U and a pop as V). This needs to be 'optimized' into: x + y (essentially removing the "useless" code) My idea was to use regular expressions. For the above case, it'll be a regular expression looking like this: x:(U(x)+V(x)):null, meaning for all x find U(x) followed by V(x) and replace by null. Imagine more complex regular expressions, for more complex optimizations. This should work on all levels. What do you suggest? What would be a good approach to optimize in these situations?

    Read the article

  • OpenGL Calls Lock/Freeze

    - by Necrolis
    I am using some dell workstations(running WinXP Pro SP 2 & DeepFreeze) for development, but something was recenlty loaded onto these machines that prevents any opengl call(the call locks) from completing(and I know the code works as I have tested it on 'clean' machines, I also tested with simple opengl apps generated by dev-cpp, which will also lock on the dell machines). I have tried to debug my own apps to see where exactly the gl calls freeze, but there is some global system hook on ZwQueryInformationProcess that messes up calls to ZwQueryInformationThread(used by ExitThread), preventing me from debugging at all(it causes the debugger, OllyDBG, to go into an access violation reporting loop or the program to crash if the exception is passed along). the hook: ntdll.ZwQueryInformationProcess 7C90D7E0 B8 9A000000 MOV EAX,9A 7C90D7E5 BA 0003FE7F MOV EDX,7FFE0300 7C90D7EA FF12 CALL DWORD PTR DS:[EDX] 7C90D7EC - E9 0F28448D JMP 09D50000 7C90D7F1 9B WAIT 7C90D7F2 0000 ADD BYTE PTR DS:[EAX],AL 7C90D7F4 00BA 0003FE7F ADD BYTE PTR DS:[EDX+7FFE0300],BH 7C90D7FA FF12 CALL DWORD PTR DS:[EDX] 7C90D7FC C2 1400 RETN 14 7C90D7FF 90 NOP ntdll.ZwQueryInformationToken 7C90D800 B8 9C000000 MOV EAX,9C the messed up function + call: ntdll.ZwQueryInformationThread 7C90D7F0 8D9B 000000BA LEA EBX,DWORD PTR DS:[EBX+BA000000] 7C90D7F6 0003 ADD BYTE PTR DS:[EBX],AL 7C90D7F8 FE ??? ; Unknown command 7C90D7F9 7F FF JG SHORT ntdll.7C90D7FA 7C90D7FB 12C2 ADC AL,DL 7C90D7FD 14 00 ADC AL,0 7C90D7FF 90 NOP ntdll.ZwQueryInformationToken 7C90D800 B8 9C000000 MOV EAX,9C So firstly, anyone know what if anything would lead to OpenGL calls cause an infinite lock,and if there are any ways around it? and what would be creating such a hook in kernal memory ? Update: After some more fiddling, I have discovered a few more kernal hooks, a lot of them are used to nullify data returned by system information calls(such as the remote debugging port), I also managed to find out the what ever is doing this is using madchook.dll(by madshi) to do this, this dll is also injected into every running process(these seem to be some anti debugging code). Also, on the OpenGL side, it seems Direct X is fine/unaffected(I ran one of the DX 9 demo's without problems), so could one of these kernal hooks somehow affect OpenGL?

    Read the article

  • Throwing a C++ exception after an inline-asm jump

    - by SoapBox
    I have some odd self modifying code, but at the root of it is a pretty simple problem: I want to be able to execute a jmp (or a call) and then from that arbitrary point throw an exception and have it caught by the try/catch block that contained the jmp/call. But when I do this (in gcc 4.4.1 x86_64) the exception results in a terminate() as it would if the exception was thrown from outside of a try/catch. I don't really see how this is different than throwing an exception from inside of some far-flung library, yet it obviously is because it just doesn't work. How can I execute a jmp or call but still throw an exception back to the original try/catch? Why doesn't this try/catch continue to handle these exceptions as it would if the function was called normally? The code: #include <iostream> #include <stdexcept> using namespace std; void thrower() { cout << "Inside thrower" << endl; throw runtime_error("some exception"); } int main() { cout << "Top of main" << endl; try { asm volatile ( "jmp *%0" // same thing happens with a call instead of a jmp : : "r"((long)thrower) : ); } catch (exception &e) { cout << "Caught : " << e.what() << endl; } cout << "Bottom of main" << endl << endl; } The expected output: Top of main Inside thrower Caught : some exception Bottom of main The actual output: Top of main Inside thrower terminate called after throwing an instance of 'std::runtime_error' what(): some exception Aborted

    Read the article

  • Intrinsics program (SSE) - g++ - help needed

    - by Sriram
    Hi all, This is the first time I am posting a question on stackoverflow, so please try and overlook any errors I may have made in formatting my question/code. But please do point the same out to me so I may be more careful. I was trying to write some simple intrinsics routines for the addition of two 128-bit (containing 4 float variables) numbers. I found some code on the net and was trying to get it to run on my system. The code is as follows: //this is a sample Intrinsics program to add two vectors. #include <iostream> #include <iomanip> #include <xmmintrin.h> #include <stdio.h> using namespace std; struct vector4 { float x, y, z, w; }; //functions to operate on them. vector4 set_vector(float x, float y, float z, float w = 0) { vector4 temp; temp.x = x; temp.y = y; temp.z = z; temp.w = w; return temp; } void print_vector(const vector4& v) { cout << " This is the contents of vector: " << endl; cout << " > vector.x = " << v.x << endl; cout << " vector.y = " << v.y << endl; cout << " vector.z = " << v.z << endl; cout << " vector.w = " << v.w << endl; } vector4 sse_vector4_add(const vector4&a, const vector4& b) { vector4 result; asm volatile ( "movl $a, %eax" //move operands into registers. "\n\tmovl $b, %ebx" "\n\tmovups (%eax), xmm0" //move register contents into SSE registers. "\n\tmovups (%ebx), xmm1" "\n\taddps xmm0, xmm1" //add the elements. addps operates on single-precision vectors. "\n\t movups xmm0, result" //move result into vector4 type data. ); return result; } int main() { vector4 a, b, result; a = set_vector(1.1, 2.1, 3.2, 4.5); b = set_vector(2.2, 4.2, 5.6); result = sse_vector4_add(a, b); print_vector(a); print_vector(b); print_vector(result); return 0; } The g++ parameters I use are: g++ -Wall -pedantic -g -march=i386 -msse intrinsics_SSE_example.C -o h The errors I get are as follows: intrinsics_SSE_example.C: Assembler messages: intrinsics_SSE_example.C:45: Error: too many memory references for movups intrinsics_SSE_example.C:46: Error: too many memory references for movups intrinsics_SSE_example.C:47: Error: too many memory references for addps intrinsics_SSE_example.C:48: Error: too many memory references for movups I have spent a lot of time on trying to debug these errors, googled them and so on. I am a complete noob to Intrinsics and so may have overlooked some important things. Any help is appreciated, Thanks, Sriram.

    Read the article

  • Error in Print Function in Bubble Sort MIPS?

    - by m00nbeam360
    Sorry that this is such a long block of code, but do you see any obvious syntax errors in this? I feel like the problem is that the code isn't printing correctly since the sort and swap methods were from my textbook. Please help if you can! .data save: .word 1,2,4,2,5,6 size: .word 6 .text swap: sll $t1, $a1, 2 #shift bits by 2 add $t1, $a1, $t1 #set $t1 address to v[k] lw $t0, 0($t1) #load v[k] into t1 lw $t2, 4($t1) #load v[k+1] into t1 sw $t2, 0($t1) #swap addresses sw $t0, 4($t1) #swap addresses jr $ra #return sort: addi $sp, $sp, -20 #make enough room on the stack for five registers sw $ra, 16($sp) #save the return address on the stack sw $s3, 12($sp) #save $s3 on the stack sw $s2, 8($sp) #save Ss2 on the stack sw $s1, 4($sp) #save $s1 on the stack sw $s0, 0($sp) #save $s0 on the stack move $s2, $a0 #copy the parameter $a0 into $s2 (save $a0) move $s3, $a1 #copy the parameter $a1 into $s3 (save $a1) move $s0, $zero #start of for loop, i = 0 for1tst: slt $t0, $s0, $s3 #$t0 = 0 if $s0 S $s3 (i S n) beq $t0, $zero, exit1 #go to exit1 if $s0 S $s3 (i S n) addi $s1, $s0, -1 #j - i - 1 for2tst: slti $t0, $s1, 0 #$t0 = 1 if $s1 < 0 (j < 0) bne $t0, $zero, exit2 #$t0 = 1 if $s1 < 0 (j < 0) sll $t1, $s1, 2 #$t1 = j * 4 (shift by 2 bits) add $t2, $s2, $t1 #$t2 = v + (j*4) lw $t3, 0($t2) #$t3 = v[j] lw $t4, 4($t2) #$t4 = v[j+1] slt $t0, $t4, $t3 #$t0 = 0 if $t4 S $t3 beq $t0, $zero, exit2 #go to exit2 if $t4 S $t3 move $a0, $s2 #1st parameter of swap is v(old $a0) move $a1, $s1 #2nd parameter of swap is j jal swap #swap addi $s1, $s1, -1 j for2tst #jump to test of inner loop j print exit2: addi $s0, $s0, 1 #i = i + 1 j for1tst #jump to test of outer loop exit1: lw $s0, 0($sp) #restore $s0 from stack lw $s1, 4($sp) #resture $s1 from stack lw $s2, 8($sp) #restore $s2 from stack lw $s3, 12($sp) #restore $s3 from stack lw $ra, 16($sp) #restore $ra from stack addi $sp, $sp, 20 #restore stack pointer jr $ra #return to calling routine .data space:.asciiz " " # space to insert between numbers head: .asciiz "The sorted numbers are:\n" .text print:add $t0, $zero, $a0 # starting address of array add $t1, $zero, $a1 # initialize loop counter to array size la $a0, head # load address of print heading li $v0, 4 # specify Print String service syscall # print heading out: lw $a0, 0($t0) # load fibonacci number for syscall li $v0, 1 # specify Print Integer service syscall # print fibonacci number la $a0, space # load address of spacer for syscall li $v0, 4 # specify Print String service syscall # output string addi $t0, $t0, 4 # increment address addi $t1, $t1, -1 # decrement loop counter bgtz $t1, out # repeat if not finished jr $ra # return

    Read the article

  • help me improve my sse yuv to rgb ssse3 code

    - by David McPaul
    Hello, I am looking to optimise some sse code I wrote for converting yuv to rgb (both planar and packed yuv functions). i am using SSSE3 at the moment but if there are useful functions from later sse versions thats ok. I am mainly interested in how I would work out processor stalls and the like. Anyone know of any tools that do static analysis of sse code? ; ; Copyright (C) 2009-2010 David McPaul ; ; All rights reserved. Distributed under the terms of the MIT License. ; ; A rather unoptimised set of ssse3 yuv to rgb converters ; does 8 pixels per loop ; inputer: ; reads 128 bits of yuv 8 bit data and puts ; the y values converted to 16 bit in xmm0 ; the u values converted to 16 bit and duplicated into xmm1 ; the v values converted to 16 bit and duplicated into xmm2 ; conversion: ; does the yuv to rgb conversion using 16 bit integer and the ; results are placed into the following registers as 8 bit clamped values ; r values in xmm3 ; g values in xmm4 ; b values in xmm5 ; outputer: ; writes out the rgba pixels as 8 bit values with 0 for alpha ; xmm6 used for scratch ; xmm7 used for scratch %macro cglobal 1 global _%1 %define %1 _%1 align 16 %1: %endmacro ; conversion code %macro yuv2rgbsse2 0 ; u = u - 128 ; v = v - 128 ; r = y + v + v >> 2 + v >> 3 + v >> 5 ; g = y - (u >> 2 + u >> 4 + u >> 5) - (v >> 1 + v >> 3 + v >> 4 + v >> 5) ; b = y + u + u >> 1 + u >> 2 + u >> 6 ; subtract 16 from y movdqa xmm7, [Const16] ; loads a constant using data cache (slower on first fetch but then cached) psubsw xmm0,xmm7 ; y = y - 16 ; subtract 128 from u and v movdqa xmm7, [Const128] ; loads a constant using data cache (slower on first fetch but then cached) psubsw xmm1,xmm7 ; u = u - 128 psubsw xmm2,xmm7 ; v = v - 128 ; load r,b with y movdqa xmm3,xmm0 ; r = y pshufd xmm5,xmm0, 0xE4 ; b = y ; r = y + v + v >> 2 + v >> 3 + v >> 5 paddsw xmm3, xmm2 ; add v to r movdqa xmm7, xmm1 ; move u to scratch pshufd xmm6, xmm2, 0xE4 ; move v to scratch psraw xmm6,2 ; divide v by 4 paddsw xmm3, xmm6 ; and add to r psraw xmm6,1 ; divide v by 2 paddsw xmm3, xmm6 ; and add to r psraw xmm6,2 ; divide v by 4 paddsw xmm3, xmm6 ; and add to r ; b = y + u + u >> 1 + u >> 2 + u >> 6 paddsw xmm5, xmm1 ; add u to b psraw xmm7,1 ; divide u by 2 paddsw xmm5, xmm7 ; and add to b psraw xmm7,1 ; divide u by 2 paddsw xmm5, xmm7 ; and add to b psraw xmm7,4 ; divide u by 32 paddsw xmm5, xmm7 ; and add to b ; g = y - u >> 2 - u >> 4 - u >> 5 - v >> 1 - v >> 3 - v >> 4 - v >> 5 movdqa xmm7,xmm2 ; move v to scratch pshufd xmm6,xmm1, 0xE4 ; move u to scratch movdqa xmm4,xmm0 ; g = y psraw xmm6,2 ; divide u by 4 psubsw xmm4,xmm6 ; subtract from g psraw xmm6,2 ; divide u by 4 psubsw xmm4,xmm6 ; subtract from g psraw xmm6,1 ; divide u by 2 psubsw xmm4,xmm6 ; subtract from g psraw xmm7,1 ; divide v by 2 psubsw xmm4,xmm7 ; subtract from g psraw xmm7,2 ; divide v by 4 psubsw xmm4,xmm7 ; subtract from g psraw xmm7,1 ; divide v by 2 psubsw xmm4,xmm7 ; subtract from g psraw xmm7,1 ; divide v by 2 psubsw xmm4,xmm7 ; subtract from g %endmacro ; outputer %macro rgba32sse2output 0 ; clamp values pxor xmm7,xmm7 packuswb xmm3,xmm7 ; clamp to 0,255 and pack R to 8 bit per pixel packuswb xmm4,xmm7 ; clamp to 0,255 and pack G to 8 bit per pixel packuswb xmm5,xmm7 ; clamp to 0,255 and pack B to 8 bit per pixel ; convert to bgra32 packed punpcklbw xmm5,xmm4 ; bgbgbgbgbgbgbgbg movdqa xmm0, xmm5 ; save bg values punpcklbw xmm3,xmm7 ; r0r0r0r0r0r0r0r0 punpcklwd xmm5,xmm3 ; lower half bgr0bgr0bgr0bgr0 punpckhwd xmm0,xmm3 ; upper half bgr0bgr0bgr0bgr0 ; write to output ptr movntdq [edi], xmm5 ; output first 4 pixels bypassing cache movntdq [edi+16], xmm0 ; output second 4 pixels bypassing cache %endmacro SECTION .data align=16 Const16 dw 16 dw 16 dw 16 dw 16 dw 16 dw 16 dw 16 dw 16 Const128 dw 128 dw 128 dw 128 dw 128 dw 128 dw 128 dw 128 dw 128 UMask db 0x01 db 0x80 db 0x01 db 0x80 db 0x05 db 0x80 db 0x05 db 0x80 db 0x09 db 0x80 db 0x09 db 0x80 db 0x0d db 0x80 db 0x0d db 0x80 VMask db 0x03 db 0x80 db 0x03 db 0x80 db 0x07 db 0x80 db 0x07 db 0x80 db 0x0b db 0x80 db 0x0b db 0x80 db 0x0f db 0x80 db 0x0f db 0x80 YMask db 0x00 db 0x80 db 0x02 db 0x80 db 0x04 db 0x80 db 0x06 db 0x80 db 0x08 db 0x80 db 0x0a db 0x80 db 0x0c db 0x80 db 0x0e db 0x80 ; void Convert_YUV422_RGBA32_SSSE3(void *fromPtr, void *toPtr, int width) width equ ebp+16 toPtr equ ebp+12 fromPtr equ ebp+8 ; void Convert_YUV420P_RGBA32_SSSE3(void *fromYPtr, void *fromUPtr, void *fromVPtr, void *toPtr, int width) width1 equ ebp+24 toPtr1 equ ebp+20 fromVPtr equ ebp+16 fromUPtr equ ebp+12 fromYPtr equ ebp+8 SECTION .text align=16 cglobal Convert_YUV422_RGBA32_SSSE3 ; reserve variables push ebp mov ebp, esp push edi push esi push ecx mov esi, [fromPtr] mov edi, [toPtr] mov ecx, [width] ; loop width / 8 times shr ecx,3 test ecx,ecx jng ENDLOOP REPEATLOOP: ; loop over width / 8 ; YUV422 packed inputer movdqa xmm0, [esi] ; should have yuyv yuyv yuyv yuyv pshufd xmm1, xmm0, 0xE4 ; copy to xmm1 movdqa xmm2, xmm0 ; copy to xmm2 ; extract both y giving y0y0 pshufb xmm0, [YMask] ; extract u and duplicate so each u in yuyv becomes u0u0 pshufb xmm1, [UMask] ; extract v and duplicate so each v in yuyv becomes v0v0 pshufb xmm2, [VMask] yuv2rgbsse2 rgba32sse2output ; endloop add edi,32 add esi,16 sub ecx, 1 ; apparently sub is better than dec jnz REPEATLOOP ENDLOOP: ; Cleanup pop ecx pop esi pop edi mov esp, ebp pop ebp ret cglobal Convert_YUV420P_RGBA32_SSSE3 ; reserve variables push ebp mov ebp, esp push edi push esi push ecx push eax push ebx mov esi, [fromYPtr] mov eax, [fromUPtr] mov ebx, [fromVPtr] mov edi, [toPtr1] mov ecx, [width1] ; loop width / 8 times shr ecx,3 test ecx,ecx jng ENDLOOP1 REPEATLOOP1: ; loop over width / 8 ; YUV420 Planar inputer movq xmm0, [esi] ; fetch 8 y values (8 bit) yyyyyyyy00000000 movd xmm1, [eax] ; fetch 4 u values (8 bit) uuuu000000000000 movd xmm2, [ebx] ; fetch 4 v values (8 bit) vvvv000000000000 ; extract y pxor xmm7,xmm7 ; 00000000000000000000000000000000 punpcklbw xmm0,xmm7 ; interleave xmm7 into xmm0 y0y0y0y0y0y0y0y0 ; extract u and duplicate so each becomes 0u0u punpcklbw xmm1,xmm7 ; interleave xmm7 into xmm1 u0u0u0u000000000 punpcklwd xmm1,xmm7 ; interleave again u000u000u000u000 pshuflw xmm1,xmm1, 0xA0 ; copy u values pshufhw xmm1,xmm1, 0xA0 ; to get u0u0 ; extract v punpcklbw xmm2,xmm7 ; interleave xmm7 into xmm1 v0v0v0v000000000 punpcklwd xmm2,xmm7 ; interleave again v000v000v000v000 pshuflw xmm2,xmm2, 0xA0 ; copy v values pshufhw xmm2,xmm2, 0xA0 ; to get v0v0 yuv2rgbsse2 rgba32sse2output ; endloop add edi,32 add esi,8 add eax,4 add ebx,4 sub ecx, 1 ; apparently sub is better than dec jnz REPEATLOOP1 ENDLOOP1: ; Cleanup pop ebx pop eax pop ecx pop esi pop edi mov esp, ebp pop ebp ret SECTION .note.GNU-stack noalloc noexec nowrite progbits

    Read the article

  • Problem with asm program (nasm)

    - by GLeBaTi
    org 0x100 SEGMENT .CODE mov ah,0x9 mov dx, Msg1 int 0x21 ;string input mov ah,0xA mov dx,buff int 0x21 mov ax,0 mov al,[buff+1]; length ;string UPPERCASE mov cl, al mov si, buff cld loop1: lodsb; cmp al, 'a' jnb upper loop loop1 ;output mov ah,0x9 mov dx, buff int 0x21 exit: mov ah, 0x8 int 0x21 int 0x20 upper: sub al,32 jmp loop1 SEGMENT .DATA Msg1 db 'Press string: $' buff db 254,0 this code perform poorly. I think that problem in "jnb upper". This program make small symbols into big symbols.

    Read the article

  • How to specify execution time of x86 and PowerPC instructions?

    - by Goofy
    Hello! I have to approximate execution time of PowerPC and x86 assembler code.I understand that I cannot compute exact it dependson many problems (current processor state - x86 processor dicides internal instructions in microinstructions, memory access time obtainig code from cache of from slower memory etc.). I found some information in Intel Optimizaction reference (APPENDIX C), but it does not provide information about all general purpose instructions. Is there any complete reference about it? What about PowerPC processors? Where can I fund such information?

    Read the article

< Previous Page | 33 34 35 36 37 38 39 40 41 42 43 44  | Next Page >