Search Results

Search found 9 results on 1 pages for 'rdtsc'.

Page 1/1 | 1 

  • Calculating CPU frequency in C with RDTSC always returns 0

    - by Nazgulled
    Hi, The following piece of code was given to us from our instructor so we could measure some algorithms performance: #include <stdio.h> #include <unistd.h> static unsigned cyc_hi = 0, cyc_lo = 0; static void access_counter(unsigned *hi, unsigned *lo) { asm("rdtsc; movl %%edx,%0; movl %%eax,%1" : "=r" (*hi), "=r" (*lo) : /* No input */ : "%edx", "%eax"); } void start_counter() { access_counter(&cyc_hi, &cyc_lo); } double get_counter() { unsigned ncyc_hi, ncyc_lo, hi, lo, borrow; double result; access_counter(&ncyc_hi, &ncyc_lo); lo = ncyc_lo - cyc_lo; borrow = lo > ncyc_lo; hi = ncyc_hi - cyc_hi - borrow; result = (double) hi * (1 << 30) * 4 + lo; return result; } However, I need this code to be portable to machines with different CPU frequencies. For that, I'm trying to calculate the CPU frequency of the machine where the code is being run like this: int main(void) { double c1, c2; start_counter(); c1 = get_counter(); sleep(1); c2 = get_counter(); printf("CPU Frequency: %.1f MHz\n", (c2-c1)/1E6); printf("CPU Frequency: %.1f GHz\n", (c2-c1)/1E9); return 0; } The problem is that the result is always 0 and I can't understand why. I'm running Linux (Arch) as guest on VMware. On a friend's machine (MacBook) it is working to some extent; I mean, the result is bigger than 0 but it's variable because the CPU frequency is not fixed (we tried to fix it but for some reason we are not able to do it). He has a different machine which is running Linux (Ubuntu) as host and it also reports 0. This rules out the problem being on the virtual machine, which I thought it was the issue at first. Any ideas why this is happening and how can I fix it?

    Read the article

  • How to benchmark on multi-core processors

    - by Pascal Cuoq
    I am looking for ways to perform micro-benchmarks on multi-core processors. Context: At about the same time desktop processors introduced out-of-order execution that made performance hard to predict, they, perhaps not coincidentally, also introduced special instructions to get very precise timings. Example of these instructions are rdtsc on x86 and rftb on PowerPC. These instructions gave timings that were more precise than could ever be allowed by a system call, allowed programmers to micro-benchmark their hearts out, for better or for worse. On a yet more modern processor with several cores, some of which sleep some of the time, the counters are not synchronized between cores. We are told that rdtsc is no longer safe to use for benchmarking, but I must have been dozing off when we were explained the alternative solutions. Question: Some systems may save and restore the performance counter and provide an API call to read the proper sum. If you know what this call is for any operating system, please let us know in an answer. Some systems may allow to turn off cores, leaving only one running. I know Mac OS X Leopard does when the right Preference Pane is installed from the Developers Tools. Do you think that this make rdtsc safe to use again? More context: Please assume I know what I am doing when trying to do a micro-benchmark. If you are of the opinion that if an optimization's gains cannot be measured by timing the whole application, it's not worth optimizing, I agree with you, but I cannot time the whole application until the alternative data structure is finished, which will take a long time. In fact, if the micro-benchmark were not promising, I could decide to give up on the implementation now; I need figures to provide in a publication whose deadline I have no control over.

    Read the article

  • Context switches much slower in new linux kernels

    - by Michael Goldshteyn
    We are looking to upgrade the OS on our servers from Ubuntu 10.04 LTS to Ubuntu 12.04 LTS. Unfortunately, it seems that the latency to run a thread that has become runnable has significantly increased from the 2.6 kernel to the 3.2 kernel. In fact the latency numbers we are getting are hard to believe. Let me be more specific about the test. We have a program that has two threads. The first thread gets the current time (in ticks using RDTSC) and then signals a condition variable once a second. The second thread waits on the condition variable and wakes up when it is signaled. It then gets the current time (in ticks using RDTSC). The difference between the time in the second thread and the time in the first thread is computed and displayed on the console. After this the second thread waits on the condition variable once more. So, we get a thread to thread signaling latency measurement once a second as a result. In linux 2.6.32, this latency is somewhere on the order of 2.8-3.5 us, which is reasonable. In linux 3.2.0, this latency is somewhere on the order of 40-100 us. I have excluded any differences in hardware between the two host hosts. They run on identical hardware (dual socket X5687 {Westmere-EP} processors running at 3.6 GHz with hyperthreading, speedstep and all C states turned off). We are changing the affinity to run both threads on physical cores of the same socket (i.e., the first thread is run on Core 0 and the second thread is run on Core 1), so there is no bouncing of threads on cores or bouncing/communication between sockets. The only difference between the two hosts is that one is running Ubuntu 10.04 LTS with kernel 2.6.32-28 (the fast context switch box) and the other is running the latest Ubuntu 12.04 LTS with kernel 3.2.0-23 (the slow context switch box). Have there been any changes in the kernel that could account for this ridiculous slow down in how long it takes for a thread to be scheduled to run?

    Read the article

  • Did I implement clock drift properly?

    - by David Titarenco
    I couldn't find any clock drift RNG code for Windows anywhere so I attempted to implement it myself. I haven't run the numbers through ent or DIEHARD yet, and I'm just wondering if this is even remotely correct... void QueryRDTSC(__int64* tick) { __asm { xor eax, eax cpuid rdtsc mov edi, dword ptr tick mov dword ptr [edi], eax mov dword ptr [edi+4], edx } } __int64 clockDriftRNG() { __int64 CPU_start, CPU_end, OS_start, OS_end; // get CPU ticks -- uses RDTSC on the Processor QueryRDTSC(&CPU_start); Sleep(1); QueryRDTSC(&CPU_end); // get OS ticks -- uses the Motherboard clock QueryPerformanceCounter((LARGE_INTEGER*)&OS_start); Sleep(1); QueryPerformanceCounter((LARGE_INTEGER*)&OS_end); // CPU clock is ~1000x faster than mobo clock // return raw return ((CPU_end - CPU_start)/(OS_end - OS_start)); // or // return a random number from 0 to 9 // return ((CPU_end - CPU_start)/(OS_end - OS_start)%10); } If you're wondering why I Sleep(1), it's because if I don't, OS_end - OS_start returns 0 consistently (because of the bad timer resolution, I presume). Basically, (CPU_end - CPU_start)/(OS_end - OS_start) always returns around 1000 with a slight variation based on the entropy of CPU load, maybe temperature, quartz crystal vibration imperfections, etc. Anyway, the numbers have a pretty decent distribution, but this could be totally wrong. I have no idea.

    Read the article

  • Fast, cross-platform timer?

    - by dsimcha
    I'm looking to improve the D garbage collector by adding some heuristics to avoid garbage collection runs that are unlikely to result in significant freeing. One heuristic I'd like to add is that GC should not be run more than once per X amount of time (maybe once per second or so). To do this I need a timer with the following properties: It must be able to grab the correct time with minimal overhead. Calling core.stdc.time takes an amount of time roughly equivalent to a small memory allocation, so it's not a good option. Ideally, should be cross-platform (both OS and CPU), for maintenance simplicity. Super high resolution isn't terribly important. If the times are accurate to maybe 1/4 of a second, that's good enough. Must work in a multithreaded/multi-CPU context. The x86 rdtsc instruction won't work.

    Read the article

  • How to get an ARM CPU clock speed in Linux?

    - by MiKy
    I have an ARM-based embedded machine based on S3C2416 board. According to the specifications I have available there should be a 533 MHz ARM9 (ARM926EJ-S according to /proc/cpuinfo), however the software running on it "feels" slow, compared to the same software on my Android phone with a 528MHz ARM CPU. /proc/cpuinfo tells me that BogoMIPS is 266.24. I know that I should not trust BogoMIPS regarding performance ("Bogo" = bogus), however I would like to get a measurement on the actual CPU speed. On x86, I could use the rdtsc instruction to get the time stamp counter, wait a second (sleep(1)), read the counter again to get an approximation on the CPU speed, and according to my experience, this value was close enough to the real CPU speed. How can I find the actual CPU speed of given ARM processor? Update I found this simple Pi calculator, which I compiled both for my Android phone and the ARM board. The results are as follows: S3C2416 # cat /proc/cpuinfo Processor : ARM926EJ-S rev 5 (v5l) BogoMIPS : 266.24 Features : swp half fastmult edsp java ... #./pi_arm 10000 Calculation of PI using FFT and AGM, ver. LG1.1.2-MP1.5.2a.memsave ... 8.50 sec. (real time) Android # cat /proc/cpuinfo Processor : ARMv6-compatible processor rev 2 (v6l) BogoMIPS : 527.56 Features : swp half thumb fastmult edsp java # ./pi_android 10000 Calculation of PI using FFT and AGM, ver. LG1.1.2-MP1.5.2a.memsave ... 5.95 sec. (real time) So it seems that the ARM926EJ-S is slower than my Android phone, but not twice slower as I would expect by the BogoMIPS figures. I am still unsure about the clock speed of the ARM9 CPU.

    Read the article

  • Measuring the CPU frequency scaling effect

    - by Bryan Fok
    Recently I am trying to measure the effect of the cpu scaling. Is it accurate if I use this clock to measure it? template<std::intmax_t clock_freq> struct rdtsc_clock { typedef unsigned long long rep; typedef std::ratio<1, clock_freq> period; typedef std::chrono::duration<rep, period> duration; typedef std::chrono::time_point<rdtsc_clock> time_point; static const bool is_steady = true; static time_point now() noexcept { unsigned lo, hi; asm volatile("rdtsc" : "=a" (lo), "=d" (hi)); return time_point(duration(static_cast<rep>(hi) << 32 | lo)); } }; Update: According to the comment from my another post, I believe redtsc cannot use for measure the effect of cpu frequency scaling because the counter from the redtsc does not affected by the CPU frequency, am i right?

    Read the article

  • Performance Optimization &ndash; It Is Faster When You Can Measure It

    - by Alois Kraus
    Performance optimization in bigger systems is hard because the measured numbers can vary greatly depending on the measurement method of your choice. To measure execution timing of specific methods in your application you usually use Time Measurement Method Potential Pitfalls Stopwatch Most accurate method on recent processors. Internally it uses the RDTSC instruction. Since the counter is processor specific you can get greatly different values when your thread is scheduled to another core or the core goes into a power saving mode. But things do change luckily: Intel's Designer's vol3b, section 16.11.1 "16.11.1 Invariant TSC The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor's support for invariant TSC is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run at a constant rate in all ACPI P-, C-. and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource." DateTime.Now Good but it has only a resolution of 16ms which can be not enough if you want more accuracy.   Reporting Method Potential Pitfalls Console.WriteLine Ok if not called too often. Debug.Print Are you really measuring performance with Debug Builds? Shame on you. Trace.WriteLine Better but you need to plug in some good output listener like a trace file. But be aware that the first time you call this method it will read your app.config and deserialize your system.diagnostics section which does also take time.   In general it is a good idea to use some tracing library which does measure the timing for you and you only need to decorate some methods with tracing so you can later verify if something has changed for the better or worse. In my previous article I did compare measuring performance with quantum mechanics. This analogy does work surprising well. When you measure a quantum system there is a lower limit how accurately you can measure something. The Heisenberg uncertainty relation does tell us that you cannot measure of a quantum system the impulse and location of a particle at the same time with infinite accuracy. For programmers the two variables are execution time and memory allocations. If you try to measure the timings of all methods in your application you will need to store them somewhere. The fastest storage space besides the CPU cache is the memory. But if your timing values do consume all available memory there is no memory left for the actual application to run. On the other hand if you try to record all memory allocations of your application you will also need to store the data somewhere. This will cost you memory and execution time. These constraints are always there and regardless how good the marketing of tool vendors for performance and memory profilers are: Any measurement will disturb the system in a non predictable way. Commercial tool vendors will tell you they do calculate this overhead and subtract it from the measured values to give you the most accurate values but in reality it is not entirely true. After falling into the trap to trust the profiler timings several times I have got into the habit to Measure with a profiler to get an idea where potential bottlenecks are. Measure again with tracing only the specific methods to check if this method is really worth optimizing. Optimize it Measure again. Be surprised that your optimization has made things worse. Think harder Implement something that really works. Measure again Finished! - Or look for the next bottleneck. Recently I have looked into issues with serialization performance. For serialization DataContractSerializer was used and I was not sure if XML is really the most optimal wire format. After looking around I have found protobuf-net which uses Googles Protocol Buffer format which is a compact binary serialization format. What is good for Google should be good for us. A small sample app to check out performance was a matter of minutes: using ProtoBuf; using System; using System.Diagnostics; using System.IO; using System.Reflection; using System.Runtime.Serialization; [DataContract, Serializable] class Data { [DataMember(Order=1)] public int IntValue { get; set; } [DataMember(Order = 2)] public string StringValue { get; set; } [DataMember(Order = 3)] public bool IsActivated { get; set; } [DataMember(Order = 4)] public BindingFlags Flags { get; set; } } class Program { static MemoryStream _Stream = new MemoryStream(); static MemoryStream Stream { get { _Stream.Position = 0; _Stream.SetLength(0); return _Stream; } } static void Main(string[] args) { DataContractSerializer ser = new DataContractSerializer(typeof(Data)); Data data = new Data { IntValue = 100, IsActivated = true, StringValue = "Hi this is a small string value to check if serialization does work as expected" }; var sw = Stopwatch.StartNew(); int Runs = 1000 * 1000; for (int i = 0; i < Runs; i++) { //ser.WriteObject(Stream, data); Serializer.Serialize<Data>(Stream, data); } sw.Stop(); Console.WriteLine("Did take {0:N0}ms for {1:N0} objects", sw.Elapsed.TotalMilliseconds, Runs); Console.ReadLine(); } } The results are indeed promising: Serializer Time in ms N objects protobuf-net   807 1000000 DataContract 4402 1000000 Nearly a factor 5 faster and a much more compact wire format. Lets use it! After switching over to protbuf-net the transfered wire data has dropped by a factor two (good) and the performance has worsened by nearly a factor two. How is that possible? We have measured it? Protobuf-net is much faster! As it turns out protobuf-net is faster but it has a cost: For the first time a type is de/serialized it does use some very smart code-gen which does not come for free. Lets try to measure this one by setting of our performance test app the Runs value not to one million but to 1. Serializer Time in ms N objects protobuf-net 85 1 DataContract 24 1 The code-gen overhead is significant and can take up to 200ms for more complex types. The break even point where the code-gen cost is amortized by its faster serialization performance is (assuming small objects) somewhere between 20.000-40.000 serialized objects. As it turned out my specific scenario involved about 100 types and 1000 serializations in total. That explains why the good old DataContractSerializer is not so easy to take out of business. The final approach I ended up was to reduce the number of types and to serialize primitive types via BinaryWriter directly which turned out to be a pretty good alternative. It sounded good until I measured again and found that my optimizations so far do not help much. After looking more deeper at the profiling data I did found that one of the 1000 calls did take 50% of the time. So how do I find out which call it was? Normal profilers do fail short at this discipline. A (totally undeserved) relatively unknown profiler is SpeedTrace which does unlike normal profilers create traces of your applications by instrumenting your IL code at runtime. This way you can look at the full call stack of the one slow serializer call to find out if this stack was something special. Unfortunately the call stack showed nothing special. But luckily I have my own tracing as well and I could see that the slow serializer call did happen during the serialization of a bool value. When you encounter after much analysis something unreasonable you cannot explain it then the chances are good that your thread was suspended by the garbage collector. If there is a problem with excessive GCs remains to be investigated but so far the serialization performance seems to be mostly ok.  When you do profile a complex system with many interconnected processes you can never be sure that the timings you just did measure are accurate at all. Some process might be hitting the disc slowing things down for all other processes for some seconds as well. There is a big difference between warm and cold startup. If you restart all processes you can basically forget the first run because of the OS disc cache, JIT and GCs make the measured timings very flexible. When you are in need of a random number generator you should measure cold startup times of a sufficiently complex system. After the first run you can try again getting different and much lower numbers. Now try again at least two times to get some feeling how stable the numbers are. Oh and try to do the same thing the next day. It might be that the bottleneck you found yesterday is gone today. Thanks to GC and other random stuff it can become pretty hard to find stuff worth optimizing if no big bottlenecks except bloatloads of code are left anymore. When I have found a spot worth optimizing I do make the code changes and do measure again to check if something has changed. If it has got slower and I am certain that my change should have made it faster I can blame the GC again. The thing is that if you optimize stuff and you allocate less objects the GC times will shift to some other location. If you are unlucky it will make your faster working code slower because you see now GCs at times where none were before. This is where the stuff does get really tricky. A safe escape hatch is to create a repro of the slow code in an isolated application so you can change things fast in a reliable manner. Then the normal profilers do also start working again. As Vance Morrison does point out it is much more complex to profile a system against the wall clock compared to optimize for CPU time. The reason is that for wall clock time analysis you need to understand how your system does work and which threads (if you have not one but perhaps 20) are causing a visible delay to the end user and which threads can wait a long time without affecting the user experience at all. Next time: Commercial profiler shootout.

    Read the article

1