Search Results

Search found 14719 results on 589 pages for 'optimization level'.

Page 91/589 | < Previous Page | 87 88 89 90 91 92 93 94 95 96 97 98  | Next Page >

  • how to speed up the code??

    - by kaushik
    in my program i have a method which requires about 4 files to be open each time it is called,as i require to take some data.all this data from the file i have been storing in list for manupalation. I approximatily need to call this method about 10,000 times.which is making my program very slow? any method for handling this files in a better ways and is storing the whole data in list time consuming what is better alternatives for list? I can give some code,but my previous question was closed as that only confused everyone as it is a part of big program and need to be explained completely to understand,so i am not giving any code,please suggest ways thinking this as a general question... thanks in advance

    Read the article

  • Unicorn: Which number of worker processes to use?

    - by blackbird07
    I am running a Ruby on Rails app on a virtual Linux server that is capped at 1GB RAM. Currently, I am constantly hitting the limit and would like to optimize memory utilization. One option I am looking at is reducing the number of unicorn workers. So what is the best way to determine the number of unicorn workers to use? The current setting is 10 workers, but the maximum number of requests per second I have seen on Google Analytics Real-Time is 3 (only scored once at a peak time; in 99% of the time not going above 1 request per second). So is it a save assumption that I can - for now - go with 4 workers, leaving room for unexpected amounts of requests? What are the metrics I should have a look at for determining the number of workers and what are the tools I can use for that on my Ubuntu machine?

    Read the article

  • Quickest way to write to file in java

    - by user1097772
    I'm writing an application which compares directory structure. First I wrote an application which writes gets info about files - one line about each file or directory. My soulution is: calling method toFile Static PrintWriter pw = new PrintWriter(new BufferedWriter( new FileWriter("DirStructure.dlis")), true); String line; // info about file or directory public void toFile(String line) { pw.println(line); } and of course pw.close(), at the end. My question is, can I do it quicker? What is the quickest way? Edit: quickest way = quickest writing in the file

    Read the article

  • Should i use HttpResponse.End() for a fast webapp?

    - by acidzombie24
    HttpResponse.End() seems to throw an exception according to msdn. Right now i have the choice of returning a value to say end thread (it only goes 2 functions deep) or i can call end(). I know that throwing exceptions is significantly slower (read the comment for a C#/.NET test) so if i want a fast webapp should i consider not calling it when it is trivially easy to not call it? -edit- I do have a function call in certain functions and in the constructor in classes to ensure the user is logged in. So i call HttpResponse.End() in enough places although hopefully in regular site usage it doesn't occur too often.

    Read the article

  • Trying to reduce the speed overhead of an almost-but-not-quite-int number class

    - by Fumiyo Eda
    I have implemented a C++ class which behaves very similarly to the standard int type. The difference is that it has an additional concept of "epsilon" which represents some tiny value that is much less than 1, but greater than 0. One way to think of it is as a very wide fixed point number with 32 MSBs (the integer parts), 32 LSBs (the epsilon parts) and a huge sea of zeros in between. The following class works, but introduces a ~2x speed penalty in the overall program. (The program includes code that has nothing to do with this class, so the actual speed penalty of this class is probably much greater than 2x.) I can't paste the code that is using this class, but I can say the following: +, -, +=, <, > and >= are the only heavily used operators. Use of setEpsilon() and getInt() is extremely rare. * is also rare, and does not even need to consider the epsilon values at all. Here is the class: #include <limits> struct int32Uepsilon { typedef int32Uepsilon Self; int32Uepsilon () { _value = 0; _eps = 0; } int32Uepsilon (const int &i) { _value = i; _eps = 0; } void setEpsilon() { _eps = 1; } Self operator+(const Self &rhs) const { Self result = *this; result._value += rhs._value; result._eps += rhs._eps; return result; } Self operator-(const Self &rhs) const { Self result = *this; result._value -= rhs._value; result._eps -= rhs._eps; return result; } Self operator-( ) const { Self result = *this; result._value = -result._value; result._eps = -result._eps; return result; } Self operator*(const Self &rhs) const { return this->getInt() * rhs.getInt(); } // XXX: discards epsilon bool operator<(const Self &rhs) const { return (_value < rhs._value) || (_value == rhs._value && _eps < rhs._eps); } bool operator>(const Self &rhs) const { return (_value > rhs._value) || (_value == rhs._value && _eps > rhs._eps); } bool operator>=(const Self &rhs) const { return (_value >= rhs._value) || (_value == rhs._value && _eps >= rhs._eps); } Self &operator+=(const Self &rhs) { this->_value += rhs._value; this->_eps += rhs._eps; return *this; } Self &operator-=(const Self &rhs) { this->_value -= rhs._value; this->_eps -= rhs._eps; return *this; } int getInt() const { return(_value); } private: int _value; int _eps; }; namespace std { template<> struct numeric_limits<int32Uepsilon> { static const bool is_signed = true; static int max() { return 2147483647; } } }; The code above works, but it is quite slow. Does anyone have any ideas on how to improve performance? There are a few hints/details I can give that might be helpful: 32 bits are definitely insufficient to hold both _value and _eps. In practice, up to 24 ~ 28 bits of _value are used and up to 20 bits of _eps are used. I could not measure a significant performance difference between using int32_t and int64_t, so memory overhead itself is probably not the problem here. Saturating addition/subtraction on _eps would be cool, but isn't really necessary. Note that the signs of _value and _eps are not necessarily the same! This broke my first attempt at speeding this class up. Inline assembly is no problem, so long as it works with GCC on a Core i7 system running Linux!

    Read the article

  • How to optimize this user ranking query

    - by James Simpson
    I have 2 databases (users, userRankings) for a system that needs to have rankings updated every 10 minutes. I use the following code to update these rankings which works fairly well, but there is still a full table scan involved which slows things down with a few hundred thousand users. mysql_query("TRUNCATE TABLE userRankings"); mysql_query("INSERT INTO userRankings (userid) SELECT id FROM users ORDER BY score DESC"); mysql_query("UPDATE users a, userRankings b SET a.rank = b.rank WHERE a.id = b.userid"); In the userRankings table, rank is the primary key and userid is an index. Both tables are MyISAM (I've wondered if it might be beneficial to make userRankings InnoDB).

    Read the article

  • Design for a Debate club assignment application

    - by Amir Rachum
    Hi all, For my university's debate club, I was asked to create an application to assign debate sessions and I'm having some difficulties as to come up with a good design for it. I will do it in Java. Here's what's needed: What you need to know about BP debates: There are four teams of 2 debaters each and a judge. The four groups are assigned a specific position: gov1, gov2, op1, op2. There is no significance to the order within a team. The goal of the application is to get as input the debaters who are present (for example, if there are 20 people, we will hold 2 debates) and assign them to teams and roles with regards to the history of each debater so that: Each debater should debate with (be on the same team) as many people as possible. Each debater should uniformly debate in different positions. The debate should be fair - debaters have different levels of experience and this should be as even as possible - i.e., there shouldn't be a team of two very experienced debaters and a team of junior debaters. There should be an option for the user to restrict the assignment in various ways, such as: Specifying that two people should debate together, in a specific position or not. Specifying that a single debater should be in a specific position, regardless of the partner. etc... If anyone can try to give me some pointers for a design for this application, I'll be so thankful! Also, I've never implemented a GUI before, so I'd appreciate some pointers on that as well, but it's not the major issue right now.

    Read the article

  • In PHP is faster to get a value from an if statement or from an array?

    - by Vittorio Vittori
    Maybe this is a stupid question but what is faster? <?php function getCss1 ($id = 0) { if ($id == 1) { return 'red'; } else if ($id == 2) { return 'yellow'; } else if ($id == 3) { return 'green'; } else if ($id == 4) { return 'blue'; } else if ($id == 5) { return 'orange'; } else { return 'grey'; } } function getCss2 ($id = 0) { $css[] = 'grey'; $css[] = 'red'; $css[] = 'yellow'; $css[] = 'green'; $css[] = 'blue'; $css[] = 'orange'; return $css[$id]; } echo getCss1(3); echo getCss2(3); ?> I suspect is faster the if statement but I prefere to ask!

    Read the article

  • Algorithm for optimally choosing actions to perform a task

    - by Jules
    There are two data types: tasks and actions. An action costs a certain time to complete, and a set of tasks this actions consists of. A task has a set of actions, and our job is to choose one of them. So: class Task { Set<Action> choices; } class Action { float time; Set<Task> dependencies; } For example the primary task could be "Get a house". The possible actions for this task: "Buy a house" or "Build a house". The action "Build a house" costs 10 hours and has the dependencies "Get bricks" and "Get cement", etcetera. The total time is the sum of all the times of the actions required to perform. We want to choose actions such that the total time is minimal. Note that the dependencies can be diamond shaped. For example "Get bricks" could require "Get a car" (to transport the bricks) and "Get cement" would also require a car. Even if you do "Get bricks" and "Get cement" you only have to count the time it takes to get a car once. Note also that the dependencies can be circular. For example "Money" - "Job" - "Car" - "Money". This is no problem for us, we simply select all of "Money", "Job" and "Car". The total time is simply the sum of the time of these 3 things. Mathematical description: Let actions be the chosen actions. valid(task) = ?action ? task.choices. (action ? actions ? ?tasks ? action.dependencies. valid(task)) time = sum {action.time | action ? actions} minimize time subject to valid(primaryTask)

    Read the article

  • Find point which sum of distances to set of other points is minimal

    - by Pawel Markowski
    I have one set (X) of points (not very big let's say 1-20 points) and the second (Y), much larger set of points. I need to choose some point from Y which sum of distances to all points from X is minimal. I came up with an idea that I would treat X as a vertices of a polygon and find centroid of this polygon, and then I will choose a point from Y nearest to the centroid. But I'm not sure whether centroid minimizes sum of its distances to the vertices of polygon, so I'm not sure whether this is a good way? Is there any algorithm for solving this problem? Points are defined by geographical coordinates.

    Read the article

  • An image from byte to optimized web page presentation

    - by blgnklc
    I get the data of the stored image on database as byte[] array; then I convert it to System.Drawing.Image like the code shown below; public System.Drawing.Image CreateImage(byte[] bytes) { System.IO.MemoryStream memoryStream = new System.IO.MemoryStream(bytes); System.Drawing.Image image = System.Drawing.Image.FromStream(memoryStream); return image; } (*) On the other hand I am planning to show a list of images on asp.net pages as the client scrolls downs the page. The more user gets down and down on the page he/she does see the more photos. So it means fast page loads and rich user experience. (you may see what I mean on www.mashable.com, just take care the new loads of the photos as you scroll down.) Moreover, the returned imgae object from the method above, how can i show it in a loop dynamically using the (*) conditions above. Regards bk

    Read the article

  • How to index a date column with null values?

    - by Heinz Z.
    How should I index a date column when some rows has null values? We have to select rows between a date range and rows with null dates. We use Oracle 9.2 and higher. Options I found Using a bitmap index on the date column Using an index on date column and an index on a state field which value is 1 when the date is null Using an index on date column and an other granted not null column My thoughts to the options are: to 1: the table have to many different values to use an bitmap index to 2: I have to add an field only for this purpose and to change the query when I want to retrieve the null date rows to 3: locks tricky to add an field to an index which is not really needed What is the best practice for this case? Thanks in advance Some infos I have read: Oracle Date Index When does Oracle index null column values?

    Read the article

  • Does the <script> tag position in HTML affects performance of the webpage?

    - by Rahul Joshi
    If the script tag is above or below the body in a HTML page, does it matter for the performance of a website? And what if used in between like this: <body> ..blah..blah.. <script language="JavaScript" src="JS_File_100_KiloBytes"> function f1() { .. some logic reqd. for manipulating contents in a webpage } </script> ... some text here too ... </body> Or is this better?: <script language="JavaScript" src="JS_File_100_KiloBytes"> function f1() { .. some logic reqd. for manipulating contents in a webpage } </script> <body> ..blah..blah.. ..call above functions on some events like onclick,onfocus,etc.. </body> Or this one?: <body> ..blah..blah.. ..call above functions on some events like onclick,onfocus,etc.. <script language="JavaScript" src="JS_File_100_KiloBytes"> function f1() { .. some logic reqd. for manipulating contents in a webpage } </script> </body> Need not tell everything is again in the <html> tag!! How does it affect performance of webpage while loading? Does it really? Which one is the best, either out of these 3 or some other which you know? And one more thing, I googled a bit on this, from which I went here: Best Practices for Speeding Up Your Web Site and it suggests put scripts at the bottom, but traditionally many people put it in <head> tag which is above the <body> tag. I know it's NOT a rule but many prefer it that way. If you don't believe it, just view source of this page! And tell me what's the better style for best performance.

    Read the article

  • Does replacing statements by expressions using the C++ comma operator could allow more compiler opti

    - by Gabriel Cuvillier
    The C++ comma operator is used to chain individual expressions, yielding the value of the last executed expression as the result. For example the skeleton code (6 statements, 6 expressions): step1; step2; if (condition) step3; return step4; else return step5; May be rewritten to: (1 statement, 6 expressions) return step1, step2, condition? step3, step4 : step5; I noticed that it is not possible to perform step-by-step debugging of such code, as the expression chain seems to be executed as a whole. Does it means that the compiler is able to perform special optimizations which are not possible with the traditional statement approach (specially if the steps are const or inline)? Note: I'm not talking about the coding style merit of that way of expressing sequence of expressions! Just about the possible optimisations allowed by replacing statements by expressions.

    Read the article

  • Optimize a MySQL count each duplicate Query

    - by Onema
    I have the following query That gets the city name, city id, the region name, and a count of duplicate names for that record: SELECT Country_CA.City AS currentCity, Country_CA.CityID, globe_region.region_name, ( SELECT count(Country_CA.City) FROM Country_CA WHERE City LIKE currentCity ) as counter FROM Country_CA LEFT JOIN globe_region ON globe_region.region_id = Country_CA.RegionID AND globe_region.country_code = Country_CA.CountryCode ORDER BY City This example is for Canada, and the cities will be displayed on a dropdown list. There are a few towns in Canada, and in other countries, that have the same names. Therefore I want to know if there is more than one town with the same name region name will be appended to the town name. Region names are found in the globe_region table. Country_CA and globe_region look similar to this (I have changed a few things for visualization purposes) CREATE TABLE IF NOT EXISTS `Country_CA` ( `City` varchar(75) NOT NULL DEFAULT '', `RegionID` varchar(10) NOT NULL DEFAULT '', `CountryCode` varchar(10) NOT NULL DEFAULT '', `CityID` int(11) NOT NULL DEFAULT '0', PRIMARY KEY (`City`,`RegionID`), KEY `CityID` (`CityID`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8; AND CREATE TABLE IF NOT EXISTS `globe_region` ( `country_code` char(2) COLLATE utf8_unicode_ci NOT NULL, `region_code` char(2) COLLATE utf8_unicode_ci NOT NULL, `region_name` varchar(50) COLLATE utf8_unicode_ci NOT NULL, PRIMARY KEY (`country_code`,`region_code`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci; The query on the top does exactly what I want it to do, but It takes way too long to generate a list for 5000 records. I would like to know if there is a way to optimize the sub-query in order to obtain the same results faster. the results should look like this City CityID region_name counter sheraton 2349269 British Columbia 1 sherbrooke 2349270 Quebec 2 sherbrooke 2349271 Nova Scotia 2 shere 2349273 British Columbia 1 sherridon 2349274 Manitoba 1

    Read the article

  • Where is the bottleneck in this code?

    - by Mikhail
    I have the following tight loop that makes up the serial bottle neck of my code. Ideally I would parallelize the function that calls this but that is not possible. //n is about 60 for (int k = 0;k < n;k++) { double fone = z[k*n+i+1]; double fzer = z[k*n+i]; z[k*n+i+1]= s*fzer+c*fone; z[k*n+i] = c*fzer-s*fone; } Are there any optimizations that can be made such as vectorization or some evil inline that can help this code? I am looking into finding eigen solutions of tridiagonal matrices. http://www.cimat.mx/~posada/OptDoglegGraph/DocLogisticDogleg/projects/adjustedrecipes/tqli.cpp.html

    Read the article

  • How to store static content across branches in a single location in version control

    - by Shravan
    [Just a random thought] I have a pdf doc that is downloaded when the user clicks on 'help' on my website. Now, this is a pretty huge document and is saved in version control (SVN) and is thus copied for all branches that exist in SVN. This is static content and something that developers are not working on, and does not change often. Is there a more efficient way to store it (that would not hamper local deployments) that would make SVN checkouts and updates relatively faster. I know the benefit we get is not huge, this is something that came to my head none the less.

    Read the article

  • How to insert zeros between bits in a bitmap?

    - by anatolyg
    I have some performance-heavy code that performs bit manipulations. It can be reduced to the following well-defined problem: Given a 13-bit bitmap, construct a 26-bit bitmap that contains the original bits spaced at even positions. To illustrate: 0000000000000000000abcdefghijklm (input, 32 bits) 0000000a0b0c0d0e0f0g0h0i0j0k0l0m (output, 32 bits) I currently have it implemented in the following way in C: if (input & (1 << 12)) output |= 1 << 24; if (input & (1 << 11)) output |= 1 << 22; if (input & (1 << 10)) output |= 1 << 20; ... My compiler (MS Visual Studio) turned this into the following: test eax,1000h jne 0064F5EC or edx,1000000h ... (repeated 13 times with minor differences in constants) I wonder whether i can make it any faster. I would like to have my code written in C, but switching to assembly language is possible. Can i use some MMX/SSE instructions to process all bits at once? Maybe i can use multiplication? (multiply by 0x11111111 or some other magical constant) Would it be better to use condition-set instruction (SETcc) instead of conditional-jump instruction? If yes, how can i make the compiler produce such code for me? Any other idea how to make it faster? Any idea how to do the inverse bitmap transformation (i have to implement it too, bit it's less critical)?

    Read the article

  • some pointer to understanding GCC source code

    - by user299570
    hi, I'm student working on optimizing GCC for multi-core processor. I tried going through the source code, it is difficult to follow through it since I need to add some code to the back end. Can anyone suggest some good resource which explains the code flow through the different phases. Also suggest some development environment for debugging GCC mainly to step through the code. Is it possible on windows?

    Read the article

  • MySQL subqueries

    - by swamprunner7
    Can we do this query without subqueries? SELECT login, post_n, (SELECT SUM(vote) FROM votes WHERE votes.post_n=posts.post_n)AS votes, (SELECT COUNT(comments.post_n) FROM comments WHERE comments.post_n=posts.post_n)AS comments_count FROM users, posts WHERE posts.id=users.id AND (visibility=2 OR visibility=3) ORDER BY date DESC LIMIT 0, 15 tables: Users: id, login Posts: post_n, id, visibility Votes: post_n, vote id — it`s user id, Users the main table.

    Read the article

  • help me improve my sse yuv to rgb ssse3 code

    - by David McPaul
    Hello, I am looking to optimise some sse code I wrote for converting yuv to rgb (both planar and packed yuv functions). i am using SSSE3 at the moment but if there are useful functions from later sse versions thats ok. I am mainly interested in how I would work out processor stalls and the like. Anyone know of any tools that do static analysis of sse code? ; ; Copyright (C) 2009-2010 David McPaul ; ; All rights reserved. Distributed under the terms of the MIT License. ; ; A rather unoptimised set of ssse3 yuv to rgb converters ; does 8 pixels per loop ; inputer: ; reads 128 bits of yuv 8 bit data and puts ; the y values converted to 16 bit in xmm0 ; the u values converted to 16 bit and duplicated into xmm1 ; the v values converted to 16 bit and duplicated into xmm2 ; conversion: ; does the yuv to rgb conversion using 16 bit integer and the ; results are placed into the following registers as 8 bit clamped values ; r values in xmm3 ; g values in xmm4 ; b values in xmm5 ; outputer: ; writes out the rgba pixels as 8 bit values with 0 for alpha ; xmm6 used for scratch ; xmm7 used for scratch %macro cglobal 1 global _%1 %define %1 _%1 align 16 %1: %endmacro ; conversion code %macro yuv2rgbsse2 0 ; u = u - 128 ; v = v - 128 ; r = y + v + v >> 2 + v >> 3 + v >> 5 ; g = y - (u >> 2 + u >> 4 + u >> 5) - (v >> 1 + v >> 3 + v >> 4 + v >> 5) ; b = y + u + u >> 1 + u >> 2 + u >> 6 ; subtract 16 from y movdqa xmm7, [Const16] ; loads a constant using data cache (slower on first fetch but then cached) psubsw xmm0,xmm7 ; y = y - 16 ; subtract 128 from u and v movdqa xmm7, [Const128] ; loads a constant using data cache (slower on first fetch but then cached) psubsw xmm1,xmm7 ; u = u - 128 psubsw xmm2,xmm7 ; v = v - 128 ; load r,b with y movdqa xmm3,xmm0 ; r = y pshufd xmm5,xmm0, 0xE4 ; b = y ; r = y + v + v >> 2 + v >> 3 + v >> 5 paddsw xmm3, xmm2 ; add v to r movdqa xmm7, xmm1 ; move u to scratch pshufd xmm6, xmm2, 0xE4 ; move v to scratch psraw xmm6,2 ; divide v by 4 paddsw xmm3, xmm6 ; and add to r psraw xmm6,1 ; divide v by 2 paddsw xmm3, xmm6 ; and add to r psraw xmm6,2 ; divide v by 4 paddsw xmm3, xmm6 ; and add to r ; b = y + u + u >> 1 + u >> 2 + u >> 6 paddsw xmm5, xmm1 ; add u to b psraw xmm7,1 ; divide u by 2 paddsw xmm5, xmm7 ; and add to b psraw xmm7,1 ; divide u by 2 paddsw xmm5, xmm7 ; and add to b psraw xmm7,4 ; divide u by 32 paddsw xmm5, xmm7 ; and add to b ; g = y - u >> 2 - u >> 4 - u >> 5 - v >> 1 - v >> 3 - v >> 4 - v >> 5 movdqa xmm7,xmm2 ; move v to scratch pshufd xmm6,xmm1, 0xE4 ; move u to scratch movdqa xmm4,xmm0 ; g = y psraw xmm6,2 ; divide u by 4 psubsw xmm4,xmm6 ; subtract from g psraw xmm6,2 ; divide u by 4 psubsw xmm4,xmm6 ; subtract from g psraw xmm6,1 ; divide u by 2 psubsw xmm4,xmm6 ; subtract from g psraw xmm7,1 ; divide v by 2 psubsw xmm4,xmm7 ; subtract from g psraw xmm7,2 ; divide v by 4 psubsw xmm4,xmm7 ; subtract from g psraw xmm7,1 ; divide v by 2 psubsw xmm4,xmm7 ; subtract from g psraw xmm7,1 ; divide v by 2 psubsw xmm4,xmm7 ; subtract from g %endmacro ; outputer %macro rgba32sse2output 0 ; clamp values pxor xmm7,xmm7 packuswb xmm3,xmm7 ; clamp to 0,255 and pack R to 8 bit per pixel packuswb xmm4,xmm7 ; clamp to 0,255 and pack G to 8 bit per pixel packuswb xmm5,xmm7 ; clamp to 0,255 and pack B to 8 bit per pixel ; convert to bgra32 packed punpcklbw xmm5,xmm4 ; bgbgbgbgbgbgbgbg movdqa xmm0, xmm5 ; save bg values punpcklbw xmm3,xmm7 ; r0r0r0r0r0r0r0r0 punpcklwd xmm5,xmm3 ; lower half bgr0bgr0bgr0bgr0 punpckhwd xmm0,xmm3 ; upper half bgr0bgr0bgr0bgr0 ; write to output ptr movntdq [edi], xmm5 ; output first 4 pixels bypassing cache movntdq [edi+16], xmm0 ; output second 4 pixels bypassing cache %endmacro SECTION .data align=16 Const16 dw 16 dw 16 dw 16 dw 16 dw 16 dw 16 dw 16 dw 16 Const128 dw 128 dw 128 dw 128 dw 128 dw 128 dw 128 dw 128 dw 128 UMask db 0x01 db 0x80 db 0x01 db 0x80 db 0x05 db 0x80 db 0x05 db 0x80 db 0x09 db 0x80 db 0x09 db 0x80 db 0x0d db 0x80 db 0x0d db 0x80 VMask db 0x03 db 0x80 db 0x03 db 0x80 db 0x07 db 0x80 db 0x07 db 0x80 db 0x0b db 0x80 db 0x0b db 0x80 db 0x0f db 0x80 db 0x0f db 0x80 YMask db 0x00 db 0x80 db 0x02 db 0x80 db 0x04 db 0x80 db 0x06 db 0x80 db 0x08 db 0x80 db 0x0a db 0x80 db 0x0c db 0x80 db 0x0e db 0x80 ; void Convert_YUV422_RGBA32_SSSE3(void *fromPtr, void *toPtr, int width) width equ ebp+16 toPtr equ ebp+12 fromPtr equ ebp+8 ; void Convert_YUV420P_RGBA32_SSSE3(void *fromYPtr, void *fromUPtr, void *fromVPtr, void *toPtr, int width) width1 equ ebp+24 toPtr1 equ ebp+20 fromVPtr equ ebp+16 fromUPtr equ ebp+12 fromYPtr equ ebp+8 SECTION .text align=16 cglobal Convert_YUV422_RGBA32_SSSE3 ; reserve variables push ebp mov ebp, esp push edi push esi push ecx mov esi, [fromPtr] mov edi, [toPtr] mov ecx, [width] ; loop width / 8 times shr ecx,3 test ecx,ecx jng ENDLOOP REPEATLOOP: ; loop over width / 8 ; YUV422 packed inputer movdqa xmm0, [esi] ; should have yuyv yuyv yuyv yuyv pshufd xmm1, xmm0, 0xE4 ; copy to xmm1 movdqa xmm2, xmm0 ; copy to xmm2 ; extract both y giving y0y0 pshufb xmm0, [YMask] ; extract u and duplicate so each u in yuyv becomes u0u0 pshufb xmm1, [UMask] ; extract v and duplicate so each v in yuyv becomes v0v0 pshufb xmm2, [VMask] yuv2rgbsse2 rgba32sse2output ; endloop add edi,32 add esi,16 sub ecx, 1 ; apparently sub is better than dec jnz REPEATLOOP ENDLOOP: ; Cleanup pop ecx pop esi pop edi mov esp, ebp pop ebp ret cglobal Convert_YUV420P_RGBA32_SSSE3 ; reserve variables push ebp mov ebp, esp push edi push esi push ecx push eax push ebx mov esi, [fromYPtr] mov eax, [fromUPtr] mov ebx, [fromVPtr] mov edi, [toPtr1] mov ecx, [width1] ; loop width / 8 times shr ecx,3 test ecx,ecx jng ENDLOOP1 REPEATLOOP1: ; loop over width / 8 ; YUV420 Planar inputer movq xmm0, [esi] ; fetch 8 y values (8 bit) yyyyyyyy00000000 movd xmm1, [eax] ; fetch 4 u values (8 bit) uuuu000000000000 movd xmm2, [ebx] ; fetch 4 v values (8 bit) vvvv000000000000 ; extract y pxor xmm7,xmm7 ; 00000000000000000000000000000000 punpcklbw xmm0,xmm7 ; interleave xmm7 into xmm0 y0y0y0y0y0y0y0y0 ; extract u and duplicate so each becomes 0u0u punpcklbw xmm1,xmm7 ; interleave xmm7 into xmm1 u0u0u0u000000000 punpcklwd xmm1,xmm7 ; interleave again u000u000u000u000 pshuflw xmm1,xmm1, 0xA0 ; copy u values pshufhw xmm1,xmm1, 0xA0 ; to get u0u0 ; extract v punpcklbw xmm2,xmm7 ; interleave xmm7 into xmm1 v0v0v0v000000000 punpcklwd xmm2,xmm7 ; interleave again v000v000v000v000 pshuflw xmm2,xmm2, 0xA0 ; copy v values pshufhw xmm2,xmm2, 0xA0 ; to get v0v0 yuv2rgbsse2 rgba32sse2output ; endloop add edi,32 add esi,8 add eax,4 add ebx,4 sub ecx, 1 ; apparently sub is better than dec jnz REPEATLOOP1 ENDLOOP1: ; Cleanup pop ebx pop eax pop ecx pop esi pop edi mov esp, ebp pop ebp ret SECTION .note.GNU-stack noalloc noexec nowrite progbits

    Read the article

< Previous Page | 87 88 89 90 91 92 93 94 95 96 97 98  | Next Page >