nano optimization - Page 61

Verifying compiler optimizations in gcc/g++ by analyzing assembly listings

- by Victor Liu

I just asked a question related to how the compiler optimizes certain C++ code, and I was looking around SO for any questions about how to verify that the compiler has performed certain optimizations. I was trying to look at the assembly listing generated with g++ (g++ -c -g -O2 -Wa,-ahl=file.s file.c) to possibly see what is going on under the hood, but the output is too cryptic to me. What techniques do people use to tackle this problem, and are there any good references on how to interpret the assembly listings of optimized code or articles specific to the GCC toolchain that talk about this problem?

Read the article

Optimizing Vector elements swaps using CUDA

- by Orion Nebula

Hi all, Since I am new to cuda .. I need your kind help I have this long vector, for each group of 24 elements, I need to do the following: for the first 12 elements, the even numbered elements are multiplied by -1, for the second 12 elements, the odd numbered elements are multiplied by -1 then the following swap takes place: Graph: because I don't yet have enough points, I couldn't post the image so here it is: http://www.freeimagehosting.net/image.php?e4b88fb666.png I have written this piece of code, and wonder if you could help me further optimize it to solve for divergence or bank conflicts .. //subvector is a multiple of 24, Mds and Nds are shared memory _shared_ double Mds[subVector]; _shared_ double Nds[subVector]; int tx = threadIdx.x; int tx_mod = tx ^ 0x0001; int basex = __umul24(blockDim.x, blockIdx.x); Mds[tx] = M.elements[basex + tx]; __syncthreads(); // flip the signs if (tx < (tx/24)*24 + 12) { //if < 12 and even if ((tx & 0x0001)==0) Mds[tx] = -Mds[tx]; } else if (tx < (tx/24)*24 + 24) { //if >12 and < 24 and odd if ((tx & 0x0001)==1) Mds[tx] = -Mds[tx]; } __syncthreads(); if (tx < (tx/24)*24 + 6) { //for the first 6 elements .. swap with last six in the 24elements group (see graph) Nds[tx] = Mds[tx_mod + 18]; Mds [tx_mod + 18] = Mds [tx]; Mds[tx] = Nds[tx]; } else if (tx < (tx/24)*24 + 12) { // for the second 6 elements .. swp with next adjacent group (see graph) Nds[tx] = Mds[tx_mod + 6]; Mds [tx_mod + 6] = Mds [tx]; Mds[tx] = Nds[tx]; } __syncthreads(); Thanks in advance ..

Read the article

How to store static content across branches in a single location in version control

- by Shravan

[Just a random thought] I have a pdf doc that is downloaded when the user clicks on 'help' on my website. Now, this is a pretty huge document and is saved in version control (SVN) and is thus copied for all branches that exist in SVN. This is static content and something that developers are not working on, and does not change often. Is there a more efficient way to store it (that would not hamper local deployments) that would make SVN checkouts and updates relatively faster. I know the benefit we get is not huge, this is something that came to my head none the less.

Read the article

MySQL won't use index for query?

- by Jack Sleight

I have this table: CREATE TABLE `point` ( `id` INT(11) NOT NULL AUTO_INCREMENT, `siteid` INT(11) NOT NULL, `lft` INT(11) DEFAULT NULL, `rgt` INT(11) DEFAULT NULL, `level` SMALLINT(6) DEFAULT NULL, PRIMARY KEY (`id`), KEY `point_siteid_site_id` (`siteid`), CONSTRAINT `point_siteid_site_id` FOREIGN KEY (`siteid`) REFERENCES `site` (`id`) ON DELETE CASCADE ) ENGINE=INNODB AUTO_INCREMENT=35 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci And this query: SELECT * FROM `point` WHERE siteid = 1; Which results in this EXPLAIN information: +----+-------------+-------+------+----------------------+------+---------+------+------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+------+----------------------+------+---------+------+------+-------------+ | 1 | SIMPLE | point | ALL | point_siteid_site_id | NULL | NULL | NULL | 6 | Using where | +----+-------------+-------+------+----------------------+------+---------+------+------+-------------+ Question is, why isn't the query using the point_siteid_site_id index?

Read the article

When compiling programs to run inside a VM, what should march and mtune be set to?

- by Russ

With VMs being slave to whatever the host machine is providing, what compiler flags should be provided to gcc? I would normally think that -march=native would be what you would use when compiling for a dedicated box, but the fine detail that -march=native is going to as indicated in this article makes me extremely wary of using it. So... what to set -march and -mtune to inside a VM? For a specific example... My specific case right now is compiling python (and more) in a linux guest inside a KVM-based "cloud" host that I have no real control over the host hardware (aside from 'simple' stuff like CPU GHz m CPU count, and available RAM). Currently, cpuinfo tells me I've got an "AMD Opteron(tm) Processor 6176" but I honestly don't know (yet) if that is reliable and whether the guest can get moved around to different architectures on me to meet the host's infrastructure shuffling needs (sounds hairy/unlikely). All I can really guarantee is my OS, which is a 64-bit linux kernel where uname -m yields x86_64.

Read the article

Is there a faster TList implementation ?

- by dmauric.mp

My application makes heavy use of TList, so I was wondering if there are any alternative implementations that are faster or optimized for particular use case. I know of RtlVCLOptimize.pas 2.77, which has optimized implementations of several TList methods. But I'd like to know if there is anything else out there. I also don't require it to be a TList descendant, I just need the TList functionality regardless of how it's implemented. It's entirely possible, given the rather basic functionality TList provides, that there is not much room for improvement, but would still like to verify that, hence this question.

Read the article

Hierarchical Hibernate, how many queries are executed?

- by ghost1

So I've been dealing with a home brew DB framework that has some seriously flaws, the justification for use being that not using an ORM will save on the number of queries executed. If I'm selecting all possibile records from the top level of a joinable object hierarchy, how many separate calls to the DB will be made when using an ORM (such as Hibernate)? I feel like calling bullshit on this, as joinable entities should be brought down in one query , right? Am I missing something here? note: lazy initialization doesn't matter in this scenario as all records will be used.

Read the article

Database structure, optimizing table with lots of rows

- by aepheus

I have a table with a lot of rows. The structure is something like this: UserID | ItemID | Item Data... Would I see any gains in query time if I separated this into a table per user, or per smaller group of users? Queries are always single user getting/modifying a selection of items.

Read the article

How can I Query only key on a Google Appengine PolyModel child?

- by Gabriel

So the situation is: I want to optimize my code some for doing counting of records. So I have a parent Model class Base, a PolyModel class Entry, and a child class of Entry Article: How would I query Article.key so I can reduce the query load but only get the Article count. My first thought was to use: q = db.GqlQuery("SELECT __key__ from Article where base = :1", i_base) but it turns out GqlQuery doesn't like that because articles are actually stored in a table called Entry. Would it be possible to Query the class attribute? something like: q = db.GqlQuery("select __key__ from Entry where base = :1 and :2 in class", i_base, 'Article') neither of which work. Turns out the answer is even easier. But I am going to finish this question because I looked everywhere for this. q = db.GqlQuery("select __key__ from Entry where base = :1 and class = :2", i_base, 'Article')

Read the article

Why is doing a top(1) on an indexed column in SQL Server slow?

- by reinier

I'm puzzled by the following. I have a DB with around 10 million rows, and (among other indices) on 1 column (campaignid_int) is an index. Now I have 700k rows where the campaignid is indeed 3835 For all these rows, the connectionid is the same. I just want to find out this connectionid. use messaging_db; SELECT TOP (1) connectionid FROM outgoing_messages WITH (NOLOCK) WHERE (campaignid_int = 3835) Now this query takes approx 30 seconds to perform! I (with my small db knowledge) would expect that it would take any of the rows, and return me that connectionid If I test this same query for a campaign which only has 1 entry, it goes really fast. So the index works. How would I tackle this and why does this not work? edit: estimated execution plan: select (0%) - top (0%) - clustered index scan (100%)

Read the article

Why is Javascript's Math.floor the slowest way to calculate floor in Javascript?

- by z5h

I'm generally not a fan of microbenchmarks. But this one has a very interesting result. http://ernestdelgado.com/archive/benchmark-on-the-floor/ It suggests that Math.floor is the SLOWEST way to calculate floor in Javascript. ~~n, n|n, n&n all being faster. This seems pretty shocking as I would expect that people implementing Javascript in today's modern browsers would be some pretty smart people. Does floor do something important that the other methods fail to do? Is there any reason to use it?

Read the article

Splitting tables by field to optimize MySQL?

- by AK

Do splitting fields into multiple tables ever yield faster queries? Consider the following two scenarios: Table1 ----------- int PersonID text Value1 float Value2 or Table1 ----------- int PersonID text Value1 Table2 ----------- int PersonID float Value2 If Value1 and Value2 are always being displayed together, I imagine Table1 is always faster because the second schema would require two SELECT statements. But are there any situations where you would choose the second? If the number of records were expected to be really large?

Read the article

Is this implementation truely tail-recursive?

- by CFP

Hello everyone! I've come up with the following code to compute in a tail-recursive way the result of an expression such as 3 4 * 1 + cos 8 * (aka 8*cos(1+(3*4))) The code is in OCaml. I'm using a list refto emulate a stack. type token = Num of float | Fun of (float->float) | Op of (float->float->float);; let pop l = let top = (List.hd !l) in l := List.tl (!l); top;; let push x l = l := (x::!l);; let empty l = (l = []);; let pile = ref [];; let eval data = let stack = ref data in let rec _eval cont = match (pop stack) with | Num(n) -> cont n; | Fun(f) -> _eval (fun x -> cont (f x)); | Op(op) -> _eval (fun x -> cont (op x (_eval (fun y->y)))); in _eval (fun x->x) ;; eval [Fun(fun x -> x**2.); Op(fun x y -> x+.y); Num(1.); Num(3.)];; I've used continuations to ensure tail-recursion, but since my stack implements some sort of a tree, and therefore provides quite a bad interface to what should be handled as a disjoint union type, the call to my function to evaluate the left branch with an identity continuation somehow irks a little. Yet it's working perfectly, but I have the feeling than in calling the _eval (fun y->y) bit, there must be something wrong happening, since it doesn't seem that this call can replace the previous one in the stack structure... Am I misunderstanding something here? I mean, I understand that with only the first call to _eval there wouldn't be any problem optimizing the calls, but here it seems to me that evaluation the _eval (fun y->y) will require to be stacked up, and therefore will fill the stack, possibly leading to an overflow... Thanks!

Read the article

MySQL subqueries

- by swamprunner7

Can we do this query without subqueries? SELECT login, post_n, (SELECT SUM(vote) FROM votes WHERE votes.post_n=posts.post_n)AS votes, (SELECT COUNT(comments.post_n) FROM comments WHERE comments.post_n=posts.post_n)AS comments_count FROM users, posts WHERE posts.id=users.id AND (visibility=2 OR visibility=3) ORDER BY date DESC LIMIT 0, 15 tables: Users: id, login Posts: post_n, id, visibility Votes: post_n, vote id — it`s user id, Users the main table.

Read the article

In PHP is faster to get a value from an if statement or from an array?

- by Vittorio Vittori

Maybe this is a stupid question but what is faster? <?php function getCss1 ($id = 0) { if ($id == 1) { return 'red'; } else if ($id == 2) { return 'yellow'; } else if ($id == 3) { return 'green'; } else if ($id == 4) { return 'blue'; } else if ($id == 5) { return 'orange'; } else { return 'grey'; } } function getCss2 ($id = 0) { $css[] = 'grey'; $css[] = 'red'; $css[] = 'yellow'; $css[] = 'green'; $css[] = 'blue'; $css[] = 'orange'; return $css[$id]; } echo getCss1(3); echo getCss2(3); ?> I suspect is faster the if statement but I prefere to ask!

Read the article

Javascriptlibrary more efficient than Rickshaw for realtime visualizations

- by dan kutz

I want to visualize data as time-series graphs on mobile devices(tablets) and therefore stumbled upon rickshaw, which is based on D3. First I must say I was a little bit confused when I realized that realtime in web design is defined totally different to realtime in engineering which has fixed(and often very short) timeframes. Anyway my aim is to visualize the data as fast as possible, and on older tablets visualization with rickshaw is quite slow. Can anybody recommend another library, which may be more efficient in rendering? Or is there no way out and I have to go native? regards Dan.

Read the article

Avoid the use of loops (for) with R

- by albergali

Hi, I'm working with R and I have a code like this: i<-1 j<-1 for (i in 1:10) for (j in 1:100) if (data[i] == paths[j,1]) cluster[i,4] <- paths[j,2] where : data is a vector with 100 rows and 1 column paths is a matrix with 100 rows and 5 columns cluster is a matrix with 100 rows and 5 columns My question is: how could I avoid the use of "for" loops to iterate through the matrix? I don't know whether apply functions (lapply, tapply...) are useful in this case. This is a problem when j=10000 for example, because execution time is very long. Thank you

Read the article

Drawbacks of Dynamic Query in Sqlserver 2005 ?

- by KuldipMCA

I have using the many dynamic Query in my database for the procedures because my filter is not fix so i have taken @filter as parameter and pass in the procedure. Declare @query as varchar(8000) Declare @Filter as varchar(1000) set @query = 'Select * from Person.Address where 1=1 and ' + @Filter exec(@query) Like that my filter contain any Field from the table for comparison. It will affect my performance or not ? is there any alternate way to achieve this type of things

Read the article

NHibernate, each property is filled with a different select statement

- by Eitan

I'm retrieving a list of nhibernate entites which have relationships to other tables/entities. I've noticed instead of NHibernate performing JOINS and populating the properties, it retrieves the entity and then calls a select for each property. For example if a user can have many roles and I retrieve a user from the DB, Nhibernate retrieves the user and then populates the roles with another select statement. The problem is that I want to retrieve oh let's say a list of products which have various many-to-many relationships and relationships to items which have their own relationships. In the end I'm left with over a thousand DB calls to retrieve a list of 30 products. Thanks. I've also set default lazy loading to false because whenever I save the list of entities to a session, I get an error when trying to retrieve it on another page: LazyInitializationException: could not initialize proxy If anybody could shed any light I would truly appreciate it. Thanks. Eitan

Read the article

How to insert zeros between bits in a bitmap?

- by anatolyg

I have some performance-heavy code that performs bit manipulations. It can be reduced to the following well-defined problem: Given a 13-bit bitmap, construct a 26-bit bitmap that contains the original bits spaced at even positions. To illustrate: 0000000000000000000abcdefghijklm (input, 32 bits) 0000000a0b0c0d0e0f0g0h0i0j0k0l0m (output, 32 bits) I currently have it implemented in the following way in C: if (input & (1 << 12)) output |= 1 << 24; if (input & (1 << 11)) output |= 1 << 22; if (input & (1 << 10)) output |= 1 << 20; ... My compiler (MS Visual Studio) turned this into the following: test eax,1000h jne 0064F5EC or edx,1000000h ... (repeated 13 times with minor differences in constants) I wonder whether i can make it any faster. I would like to have my code written in C, but switching to assembly language is possible. Can i use some MMX/SSE instructions to process all bits at once? Maybe i can use multiplication? (multiply by 0x11111111 or some other magical constant) Would it be better to use condition-set instruction (SETcc) instead of conditional-jump instruction? If yes, how can i make the compiler produce such code for me? Any other idea how to make it faster? Any idea how to do the inverse bitmap transformation (i have to implement it too, bit it's less critical)?

Read the article

Does the order of columns in a query matter?

- by James Simpson

When selecting columns from a MySQL table, is performance affected by the order that you select the columns as compared to their order in the table (not considering indexes that may cover the columns)? For example, you have a table with rows uid, name, bday, and you have the following query. SELECT uid, name, bday FROM table Does MySQL see the following query any differently and thus cause any sort of performance hit? SELECT uid, bday, name FROM table

Read the article

Unicorn: Which number of worker processes to use?

- by blackbird07

I am running a Ruby on Rails app on a virtual Linux server that is capped at 1GB RAM. Currently, I am constantly hitting the limit and would like to optimize memory utilization. One option I am looking at is reducing the number of unicorn workers. So what is the best way to determine the number of unicorn workers to use? The current setting is 10 workers, but the maximum number of requests per second I have seen on Google Analytics Real-Time is 3 (only scored once at a peak time; in 99% of the time not going above 1 request per second). So is it a save assumption that I can - for now - go with 4 workers, leaving room for unexpected amounts of requests? What are the metrics I should have a look at for determining the number of workers and what are the tools I can use for that on my Ubuntu machine?

Read the article

Optimizing an iphone app for 3G in landscape with opengl, camera, quartz

- by Joey

I have an iphone app that basically uses the camera, an opengl layer, and UIViews (some drawing with Quartz). It runs ok on 3GS, but on the 3G it is unusable. Particularly, when I press a UIButton, it literally takes sometimes 10 seconds to register the press. Shark doesn't do me much good because it crashes when I try to profile even a tiny portion, and I've tried turning off some of the layers to see if they might be obvious contributors to the lag. I've noticed that turning off the camera really helps. I'm wondering if anyone has any familiarity with this and might suggest some likely causes. I had issues with extreme slowdown from running my app in landscape mode and using transforms, so considered that might be a cause, but I'm wondering if hoping for a 3G to run something with all of the above elements is just not really possible considering the camera seems to really cost a lot. The fact that the buttons are horribly delayed in their response makes me think there is something fundamental that I might be missing.

Read the article

Should Python import statements always be at the top of a module?

- by Adam J. Forster

PEP 08 states: Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants. However if the class/method/function that I am importing is only used in rare cases, surely it is more efficient to do the import when it is needed? Isn't this: class SomeClass(object): def not_often_called(self) from datetime import datetime self.datetime = datetime.now() more efficient than this? from datetime import datetime class SomeClass(object): def not_often_called(self) self.datetime = datetime.now()

Read the article

WPF: Improving Performance for Running on Older PCs

- by Phil Sandler

So, I'm building a WPF app and did a test deployment today, and found that it performed pretty poorly. I was surprised, as we are really not doing much in the way of visual effects or animations. I deployed on two machines: the fastest and the slowest that will need to run the application (the slowest PC has an Intel Celeron 1.80GHz with 2GB RAM). The application ran pretty well on the faster machine, but was choppy on the slower machine. And when I say "choppy", I mean the cursor jumped even just passing it over any open window of the app that had focus. I opened the Task Manager Performance window, and could see that the CPU usage jumped whenever the app had focus and the cursor was moving over it. If I gave focus to another (e.g. Excel), the CPU usage went back down after a second. This happened on both machines, but the choppiness was only noticeable on the slower machine. I had very limited time to tinker on the deployment machines, so didn't do a lot of detailed testing. The app runs fine on my development machine, but I also see the CPU spiking up to 10% there, just running the cursor over the window. I downloaded the WPF performance tool from MS and have been tinkering with it (on my dev machine). The docs say this about the "Frame Rate" metric in the Perforator tool: For applications without animation, this value should be near 0. The app is not doing any heavy animation, but the frame rate stays near 50 when the cursor is over any window. The screens I tested on have column headers in a grid that "highlight" and buttons that change color and appearance when scrolled over. Even moving the mouse on blank areas of the windows cause the same Frame rate and CPU usage (doesn't seem to be related to these minor animations). (Also, I am unable to figure out how to get anything but the two default tools--Perforator and Visual Profiler--installed into the WPF performance tool. That is probably a separate question). I also have Redgate's profiling tool, but I'm not sure if that can shed any light on rendering performance. So, I realize this is not an easy thing to troubleshoot without specifics or sample code (which I can't post). My questions are: What are some general things to look for (or avoid) in the code to improve performance? What steps can I take using the WPF performance tool to narrow down the problem? Is the PC spec listed above (Intel Celeron 1.80GHz with 2GB RAM) too slow to be running even vanilla WPF applications?

Search Results

Search found 3706 results on 149 pages for 'nano optimization'.

Page 61/149 | < Previous Page | 57 58 59 60 61 62 63 64 65 66 67 68 | Next Page >

- by Victor Liu

- by Orion Nebula

- by Shravan

- by Jack Sleight

- by Russ

- by dmauric.mp

- by ghost1

- by aepheus

- by Gabriel

- by reinier

- by z5h

- by AK

- by CFP

- by swamprunner7

- by Vittorio Vittori

- by dan kutz

- by albergali

- by KuldipMCA

- by Eitan

- by anatolyg

- by James Simpson

- by blackbird07

- by Joey

- by Adam J. Forster

- by Phil Sandler

< Previous Page | 57 58 59 60 61 62 63 64 65 66 67 68 | Next Page >