Search Results

Search found 5698 results on 228 pages for 'cpu throttling'.

Page 84/228 | < Previous Page | 80 81 82 83 84 85 86 87 88 89 90 91 | Next Page >

OpenMP + SSE gives no speedup

- by Sayan Ghosh

Hi, My Professor found out this interesting experiment of 3D Linearly separable Kernel Convolution using SSE and OpenMP, and gave the task to me to benchmark the statistics on our system. The author claims a crazy 18 fold speedup from the serial approach! Might not be always, but we were expecting at least a 2-4 times speedup running this on a Dual Core Intel. http://software.intel.com/en-us/articles/16bit-3d-convolution-sse4openmp-implementation-on-penryn-cpu/#comment-41994 Alas, we could find exactly no speedup. The serial code performs always better, with or without OpenMP. I am using Linux, and observed a certain trend...when no other processes are running on the system, after a while the loadavg starts increasing, and the the %CPU utilization falls down. Another probable false positive which I ran into accidentally...I started the program, then immediately paused it. Then I ran it on background with bg, and saw a speedup of more than 2. This happens all the time! Any advice would be great. Thanks, Sayan

Read the article
Referencing a x86 assembly in a 64bit one

- by Jörg Battermann

In one project we have a dependency on a legacy system and its .Net assembly which is delivered as a 'x86' and they do not provide a 'Any CPU' and/or 64bit one. Now the project itself juggles with lots of data and we hit the limitations we have with a forced x86 on the whole project due to that one assembly (if we used 64bit/any cpu it would raise a BadImageFormatException once that x86 is loaded on a 64bit machine). Now is there a (workaround)way to use a x86 assembly in a 64bit .net host app? That tiny dependency on that x86 assembly makes and keeps 99% of the project heavily limited.

Read the article
What's a reliable and practical way to protect software with a user license ?

- by Frank

I know software companies use licenses to protect their softwares, but I also know there are keygen programs to bypass them. I'm a Java developer, if I put my program online for sale, what's a reliable and practical way to protect it ? How about something like this, would it work ? <1> I use ProGuard to protect the source code. <2> Sign the executable Jar file. <3> Since my Java program only need to work on PC [I need to use JDIC in it], I wrap the final executable Jar into an .exe file which makes it harder to decompile. <4> When a user first downloads and runs my app, it checks for a Pass file on his PC. <5> If the Pass file doesn't exist, run the app in demo mode, exits in 5 minutes. <6> When demo exits a panel opens with a "Buy Now" button. This demo mode repeats forever unless step <7> happens. <7> If user clicks the "Buy Now" button, he fills out a detailed form [name, phone, email ...], presses a "Verify Info" button to save the form to a Pass file, leaving license Key # field empty in this newly generated Pass file. <8> Pressing "Verify Info" button will take him to a html form pre-filled with his info to verify what he is buying, also hidden in the form's input filed is a license Key number. He can now press a "Pay Now" button to goto Paypal to finish the process. <9> The hidden license Key # will be passed to Paypal as product Id info and emailed to me. <10> After I got the payment and Paypal email, I'll add the license Key # to a valid license Key list, and put it on my site, only I know the url. The list is updated hourly. <11> Few hours later when the user runs the app again, it can find the Pass file on his PC, but the license Key # value is empty, so it goes to the valid list url to see if its license Key # is on the list, if so, write the license Key # into the Pass file, and the next time it starts again, it will find the valid license Key # and start in purchased mode without exiting in 5 minutes. <12> If it can't find its license Key # on the list from my url, run in demo mode. <13> In order to prevent a user from copying and using another paid user's valid Pass file, the license Key # is unique to each PC [I'm trying to find how], so a valid Pass file only works on one PC. Only after a user has paid will Paypal email me the valid license Key # with his payment. <14> The Id checking goes like this : Use the CPU ID : "CPU_01-02-ABC" for example, encrypt it to the result ID : "XeR5TY67rgf", and compare it to the list on my url, if "XeR5TY67rgf" is not on my valid user list, run in demo mode. If it exists write "XeR5TY67rgf" into the Pass File license field. In order to get a unique license Key, can I use his PC's CPU Id ? Or something unique and useful [ relatively less likely to change ]. If so let's say this CPU ID is "CPU_01-02-ABC", I can encrypt it to something like "XeR5TY67rgf", and pass it to Paypal as product Id in the hidden html form field, then I'll get it from Paypal's email notification, and add it to the valid license Key # list on the url. So, even if a hacker knows it uses CPU Id, he can't write it into the Pass file field, because only encrypted Ids are valid Ids. And only my program knows how to generate the encrypted Ids. And even if another hacker knows the encrypted Id is hidden in the html form input field, as long as it's not on my url list, it's still invalid. Can anyone find any flaw in the above system ? Is it practical ? And most importantly how do I get hold of this unique ID that can represent a user's PC ? Frank

Read the article
Is it too early to start designing for Task Parallel Library?

- by Joe Erickson

I have been following the development of the .NET Task Parallel Library (TPL) with great interest since Microsoft first announced it. There is no doubt in my mind that we will eventually take advantage of TPL. What I am questioning is whether it makes sense to start taking advantage of TPL when Visual Studio 2010 and .NET 4.0 are released, or whether it makes sense to wait a while longer. Why Start Now? The .NET 4.0 Task Parallel Library appears to be well designed and some relatively simple tests demonstrate that it works well on today's multi-core CPUs. I have been very interested in the potential advantages of using multiple lightweight threads to speed up our software since buying my first quad processor Dell Poweredge 6400 about seven years ago. Experiments at that time indicated that it was not worth the effort, which I attributed largely to the overhead of moving data between each CPU's cache (there was no shared cache back then) and RAM. Competitive advantage - some of our customers can never get enough performance and there is no doubt that we can build a faster product using TPL today. It sounds fun. Yes, I realize that some developers would rather poke themselves in the eye with a sharp stick, but we really enjoy maximizing performance. Why Wait? Are today's Intel Nehalem CPUs representative of where we are going as multi-core support matures? You can purchase a Nehalem CPU with 4 cores which share a single level 3 cache today, and most likely a 6 core CPU sharing a single level 3 cache by the time Visual Studio 2010 / .NET 4.0 are released. Obviously, the number of cores will go up over time, but what about the architecture? As the number of cores goes up, will they still share a cache? One issue with Nehalem is the fact that, even though there is a very fast interconnect between the cores, they have non-uniform memory access (NUMA) which can lead to lower performance and less predictable results. Will future multi-core architectures be able to do away with NUMA? Similarly, will the .NET Task Parallel Library change as it matures, requiring modifications to code to fully take advantage of it? Limitations Our core engine is 100% C# and has to run without full trust, so we are limited to using .NET APIs.

Read the article
Why Eventmachine Defer slower than Ruby Thread?

- by allenwei

I have two scripts which using mechanize to fetch google index page. I assuming Eventmachine will faster than ruby thread, but not. Eventmachine code cost "0.24s user 0.08s system 2% cpu 12.682 total" Ruby Thread code cost "0.22s user 0.08s system 5% cpu 5.167 total " Am I use eventmachine in wrong way? Who can explain it to me, thanks! 1 Eventmachine require 'rubygems' require 'mechanize' require 'eventmachine' trap("INT") {EM.stop} EM.run do num = 0 operation = proc { agent = Mechanize.new sleep 1 agent.get("http://google.com").body.to_s.size } callback = proc { |result| sleep 1 puts result num+=1 EM.stop if num == 9 } 10.times do EventMachine.defer operation, callback end end 2 Ruby Thread require 'rubygems' require 'mechanize' threads = [] 10.times do threads << Thread.new do agent = Mechanize.new sleep 1 puts agent.get("http://google.com").body.to_s.size sleep 1 end end threads.each do |aThread| aThread.join end

Read the article
SQL 2008 Encryption Scan

- by Mike K.

We recently upgraded a database server from SQL 2005 to SQL 2008 64 bit. CPU utilization is oftentimes running at 100% on all four processors now (this never happended on the SQL 2005 server). When I run sp_lock I see a number of processes waiting on a resource called [ENCRYPTION_SCAN]. I am not using any SQL 2008 encryption features. Does anyone know why I would have tasks waiting on this resource? It appears that whenever I have four processes waiting on this resource, CPU hits 100% on all four processors.

Read the article
iPhone/ARM calling convention

- by Seva Alekseyev

Hi all, on iPhone/ARM, which CPU registers are functions supposed to preserve, if any?

Read the article
Computationally intensive scala process using actors hangs uncooperatively

- by Chick Markley

I have a computationally intensive scala application that hangs. By hangs I means it is sitting in the process stack using 1% CPU but does not respond to kill -QUIT nor can it be attached via jdb attach. Runs 2-12 hours at 800-900% CPU before it gets stuck The application is using ~10 scala.actors. Until now I have had great success with kill -QUIT but I am bit stumped as to how to proceed. The actors write a fair amount to stdout using println which is redirected to a text file but has not been helpful so far diagnostically. I am just hoping there is some obvious technique when kill -QUIT fails that I am ignorant of. Or just confirmation that having multiple actors println asynchronously is a real bad idea (though I've been doing it for a long time only recently with these results) Details scala 2.8.1 & 2.8.0 mac osx 10.6.5 java version "1.6.0_22" Thanks

Read the article
Does set affinity ensure that only one core resources are used?

- by Kailash

Hello, I just wanted to find out if setting cpu affinity ensure that the application runs only on that core ?

Read the article
Are all the system's floating points operations the same?

- by Jj

We're making this web app in PHP and when working in the reports we have Excel files to compare our results to make sure our coding is doing the right operations. Now we're running into some differences due floating point arithmetics. We're doing the same divisions and multiplications and running into slightly different numbers, that add up to a notable difference. My question is if Excel is delegating it's floating point arithmetic to the CPU and PHP is also relying in the CPU for it's operations. Or does each application implements its own set of math algorithms?

Read the article
How many layers are between my program and the hardware?

- by sub

I somehow have the feeling that modern systems, including runtime libraries, this exception handler and that built-in debugger build up more and more layers between my (C++) programs and the CPU/rest of the hardware. I'm thinking of something like this: 1 + 2 OS top layer Runtime library/helper/error handler a hell lot of DLL modules OS kernel layer Do you really want to run 1 + 2?-Windows popup (don't take this serious) OS kernel layer Hardware abstraction Hardware Go through at least 100 miles of circuits Eventually arrive at the CPU ADD 1, 2 Go all the way back to my program Nearly all technical things are simply wrong and in some random order, but you get my point right? How much longer/shorter is this chain when I run a C++ program that calculates 1 + 2 at runtime on Windows? How about when I do this in an interpreter? (Python|Ruby|PHP) Is this chain really as dramatic in reality? Does Windows really try "not to stand in the way"? e.g.: Direct connection my binary < hardware?

Read the article
Improving the efficiency of Kinect for Windows DTWGestureRecognition Application

- by Ray

Currently I am using the DTWGestureRecognition open source tool for Kinect SDK v1.5. I have recorded a few gestures and use them to navigate through Windows 7. I also have implemented voice control for simple things such as opening PowerPoint, Chrome, etc. My main issue is that the application uses quite a bit of my CPU power which causes it to become slow. During gestures and voice commands, the CPU usage sometimes spikes to 80-90%, which causes the application to be unresponsive for a few seconds. I am running it on a 64 bit Windows 7 machine with an i5 processor and 8 GB of RAM. I was wondering if anyone with any experience using this tool or Kinect in general has made it more efficient and less performance hogging. Right now I removed sections which display the RGB video and the Depth video but even doing that did not make a big impact. Any help is appreciated, thanks!

Read the article
Continuously checking database from a Windows service

- by JonF

I am making a Windows service which needs to continuously check for database entries that can be added at any time to tell it to execute some code. It is looking to see if it's status is set to pending, and it's execute time entry is than the current time. Is the only way to do this to just run select statements over and over? It might need to execute the code every minute which means I need to run the select statement every minute looking for entries in the database. I'm trying to avoid unneccesary cpu time because I'm probably going to end up paying for cpu cycles on the hosting provider

Read the article
ffmpeg libavcodec.so missing while compile with cygwin

- by nick

I am building ffmpeg for android by following this tutorial now i got android folder inside the ffmpeg2.0.1 folder but there is no libavcodec-55.so file. instead of that i have lib/libavcodec.a how can i get libavcodec.so file? build_android.sh #!/bin/bash NDK=$HOME/Desktop/adt/android-ndk-r9 SYSROOT=$NDK/platforms/android-9/arch-arm/ TOOLCHAIN=$NDK/toolchains/arm-linux-androideabi-4.8/prebuilt/windows-x86_64 function build_one { ./configure \ --prefix=$PREFIX \ --enable-shared \ --disable-static \ --disable-doc \ --disable-ffmpeg \ --disable-ffplay \ --disable-ffprobe \ --disable-ffserver \ --disable-avdevice \ --disable-doc \ --disable-symver \ --cross-prefix=$TOOLCHAIN/bin/arm-linux-androideabi- \ --target-os=linux \ --arch=arm \ --enable-cross-compile \ --sysroot=$SYSROOT \ --extra-cflags="-Os -fpic $ADDI_CFLAGS" \ --extra-ldflags="$ADDI_LDFLAGS" \ $ADDITIONAL_CONFIGURE_FLAG make clean make make install } CPU=arm PREFIX=$(pwd)/android/$CPU ADDI_CFLAGS="-marm" build_one

Read the article
measuring performance - using real clicks vs "ab" command

- by shanyu

I have a web site in closed beta, developed in Django, runs with Mysql on Debian. In the last few days, the main page has been showing a slowdown. For every ten clicks, one or two receives extremely slow response (10 secs or more), others are as fast as they used to be. When I was searching for the problem, I ran into this issue that I couldn't grasp: top command shows that when I request the main page, mysql shoots up to 90% - 100% cpu usage. I get the page just as the cpu use gets back to normal. So, I thought, it is db. Then I called ab with parameters -n 1000 -c 5, I got decent performance, about 100 pages per second, just as it was before the slowdown. I would imagine a worse performance as 10-20% of requests take 10 secs to load. Is this conflict between ab and "real" clicks normal, or am I using ab in a wrong configuration?

Read the article
Cannot run 32bit compiled WPF applications on Windows 7 64bit

- by adriaanp

I created a WPF project in VS2008 and compiled it with Any CPU, x64 and x86. Any CPU and x64 works, but compiling to x86 the application is hanging when running through VS2008 and crashing when running without debugging. Debugging it with WinDbg I can see a StackOverflowException and sometimes a MissingMethodException relating to WPF methods. Common sense is telling something here that the CLR is not loading the correct assemblies or something when running 32bit WPF apps. I tried reinstalling .NET Framework 3.5 SP1, but it does not fix the problem. I don't know how to go about checking if the correct assemblies are loaded or used. Any ideas? UPDATE: Not a real solution but the best I could do quickly was to reinstall Windows 7

Read the article
Neo4j 1.9.4 (REST Server,CYPHER) performance issue

- by user2968943

I have Neo4j 1.9.4 installed on 24 core 24Gb ram (centos) machine and for most queries CPU usage spikes goes to 200% with only few concurrent requests. Domain: some sort of social application where few types of nodes(profiles) with 3-30 text/array properties and 36 relationship types with at least 3 properties. Most of nodes currently has ~300-500 relationships. Current data set footprint(from console): LogicalLogSize=4294907 (32MB) ArrayStoreSize=1675520 (12MB) NodeStoreSize=1342170 (10MB) PropertyStoreSize=1739548 (13MB) RelationshipStoreSize=6395202 (48MB) StringStoreSize=1478400 (11MB) which is IMHO really small. most queries looks like this one(with more or less WITH .. MATCH .. statements and few queries with variable length relations but the often fast): START targetUser=node({id}), currentUser=node({current}) MATCH targetUser-[contact:InContactsRelation]->n, n-[:InLocationRelation]->l, n-[:InCategoryRelation]->c WITH currentUser, targetUser,n, l,c, contact.fav is not null as inFavorites MATCH n<-[followers?:InContactsRelation]-() WITH currentUser, targetUser,n, l,c,inFavorites, COUNT(followers) as numFollowers RETURN id(n) as id, n.name? as name, n.title? as title, n._class as _class, n.avatar? as avatar, n.avatar_type? as avatar_type, l.name as location__name, c.name as category__name, true as isInContacts, inFavorites as isInFavorites, numFollowers it runs in ~1s-3s(for first run) and ~1s-70ms (for consecutive and it depends on query) and there is about 5-10 queries runs for each impression. Another interesting behavior is when i try run query from console(neo4j) on my local machine many consecutive times(just press ctrl+enter for few seconds) it has almost constant execution time but when i do it on server it goes slower exponentially and i guess it somehow related with my problem. Problem: So my problem is that neo4j is very CPU greedy(for 24 core machine its may be not an issue but its obviously overkill for small project). First time i used AWS EC2 m1.large instance but over all performance was bad, during testing, CPU always was over 100%. Some relevant parts of configuration: neostore.nodestore.db.mapped_memory=1280M wrapper.java.maxmemory=8192 note: I already tried configuration where all memory related parameters where HIGH and it didn't worked(no change at all). Question: Where to digg? configuration? scheme? queries? what i'm doing wrong? if need more info(logs, configs) just ask ;)

Read the article
segmented reduction with scattered segments

- by Christian Rau

I got to solve a pretty standard problem on the GPU, but I'm quite new to practical GPGPU, so I'm looking for ideas to approach this problem. I have many points in 3-space which are assigned to a very small number of groups (each point belongs to one group), specifically 15 in this case (doesn't ever change). Now I want to compute the mean and covariance matrix of all the groups. So on the CPU it's roughly the same as: for each point p { mean[p.group] += p.pos; covariance[p.group] += p.pos * p.pos; ++count[p.group]; } for each group g { mean[g] /= count[g]; covariance[g] = covariance[g]/count[g] - mean[g]*mean[g]; } Since the number of groups is extremely small, the last step can be done on the CPU (I need those values on the CPU, anyway). The first step is actually just a segmented reduction, but with the segments scattered around. So the first idea I came up with, was to first sort the points by their groups. I thought about a simple bucket sort using atomic_inc to compute bucket sizes and per-point relocation indices (got a better idea for sorting?, atomics may not be the best idea). After that they're sorted by groups and I could possibly come up with an adaption of the segmented scan algorithms presented here. But in this special case, I got a very large amount of data per point (9-10 floats, maybe even doubles if the need arises), so the standard algorithms using a shared memory element per thread and a thread per point might make problems regarding per-multiprocessor resources as shared memory or registers (Ok, much more on compute capability 1.x than 2.x, but still). Due to the very small and constant number of groups I thought there might be better approaches. Maybe there are already existing ideas suited for these specific properties of such a standard problem. Or maybe my general approach isn't that bad and you got ideas for improving the individual steps, like a good sorting algorithm suited for a very small number of keys or some segmented reduction algorithm minimizing shared memory/register usage. I'm looking for general approaches and don't want to use external libraries. FWIW I'm using OpenCL, but it shouldn't really matter as the general concepts of GPU computing don't really differ over the major frameworks.

Read the article
Which Ipod touch generation should I buy? 2nd or 3rd?

- by kukabunga

I want to create games for Iphone/Ipod touch. Unfortunately I don't have a lot of money so I can buy only one device. Ipod is cheaper than Iphone, so I decided to bought Ipod touch. But I am afraid of buying 3rd generation - because it has more memory, more faster CPU, etc. And I think if I post my app on appstore - people with 2nd generation Ipod might have trouble with my app (because I was testing it on 3rd generation). But on the other hand - I am planning to create 3d/cpu demanding game - and it would be easy for me to implement it on device with more calculation power... What should I do in this situation? Any advice is appreciate.

Read the article
Cassandra performance slow down with counter column

- by tubcvt

I have a cluster (4 node ) and a node have 16 core and 24 gb ram: 192.168.23.114 datacenter1 rack1 Up Normal 44.48 GB 25.00% 192.168.23.115 datacenter1 rack1 Up Normal 44.51 GB 25.00% 192.168.23.116 datacenter1 rack1 Up Normal 44.51 GB 25.00% 192.168.23.117 datacenter1 rack1 Up Normal 44.51 GB 25.00% We use about 10 column family (counter column) to make some system statistic report. Problem on here is that When i set replication_factor of this keyspace from 1 to 2 (contain 10 counter column family ), all cpu of node increase from 10% ( when use replication factor=1) to --- 90%. :( :( who can help me work around that :( . why counter column consume too much cpu time :(. thanks all

Read the article
When compiling programs to run inside a VM, what should march and mtune be set to?

- by Russ

With VMs being slave to whatever the host machine is providing, what compiler flags should be provided to gcc? I would normally think that -march=native would be what you would use when compiling for a dedicated box, but the fine detail that -march=native is going to as indicated in this article makes me extremely wary of using it. So... what to set -march and -mtune to inside a VM? For a specific example... My specific case right now is compiling python (and more) in a linux guest inside a KVM-based "cloud" host that I have no real control over the host hardware (aside from 'simple' stuff like CPU GHz m CPU count, and available RAM). Currently, cpuinfo tells me I've got an "AMD Opteron(tm) Processor 6176" but I honestly don't know (yet) if that is reliable and whether the guest can get moved around to different architectures on me to meet the host's infrastructure shuffling needs (sounds hairy/unlikely). All I can really guarantee is my OS, which is a 64-bit linux kernel where uname -m yields x86_64.

Read the article
Direct video card access

- by icemanind

Guys, I am trying to write a class in C# that can be used as a direct replacement for the C# Bitmap class. What I want to do instead though is perform all graphic functions done on the bitmap using the power of the video card. From what I understand, functions such as DrawLine or DrawArc or DrawText are primitive functions that use simple cpu math algorithms to perform the job. I, instead, want to use the graphics card cpu and memory to do these and other advanced functions, such as skinning a bitmap (applying a texture) and true transparancy. My problem is, in C#, how do I access direct video functions? Is there a library or something I need?

Read the article
CUDA small kernel 2d convolution - how to do it

- by paulAl

I've been experimenting with CUDA kernels for days to perform a fast 2D convolution between a 500x500 image (but I could also vary the dimensions) and a very small 2D kernel (a laplacian 2d kernel, so it's a 3x3 kernel.. too small to take a huge advantage with all the cuda threads). I created a CPU classic implementation (two for loops, as easy as you would think) and then I started creating CUDA kernels. After a few disappointing attempts to perform a faster convolution I ended up with this code: http://www.evl.uic.edu/sjames/cs525/final.html (see the Shared Memory section), it basically lets a 16x16 threads block load all the convolution data he needs in the shared memory and then performs the convolution. Nothing, the CPU is still a lot faster. I didn't try the FFT approach because the CUDA SDK states that it is efficient with large kernel sizes. Whether or not you read everything I wrote, my question is: how can I perform a fast 2D convolution between a relatively large image and a very small kernel (3x3) with CUDA?

Read the article
Can this loop be sped up in pure Python?

- by Noctis Skytower

I was trying out an experiment with Python, trying to find out how many times it could add one to an integer in one minute's time. Assuming two computers are the same except for the speed of the CPUs, this should give an estimate of how fast some CPU operations may take for the computer in question. The code below is an example of a test designed to fulfill the requirements given above. This version is about 20% faster than the first attempt and 150% faster than the third attempt. Can anyone make any suggestions as to how to get the most additions in a minute's time span? Higher numbers are desireable. EDIT: This experiment is being written in Python 3.1 and is 15% faster than the fourth speed-up attempt. def start(seconds): import time, _thread def stop(seconds, signal): time.sleep(seconds) signal.pop() total, signal = 0, [None] _thread.start_new_thread(stop, (seconds, signal)) while signal: total += 1 return total if __name__ == '__main__': print('Testing the CPU speed ...') print('Relative speed:', start(60))

Read the article
Given a function which produces a random integer in the range 1 to 5, write a function which produce

- by praveen

what is simple solution what is effective solution to less minimum memory and(or) cpu speed?

Read the article

< Previous Page | 80 81 82 83 84 85 86 87 88 89 90 91 | Next Page >