Search Results

Search found 1692 results on 68 pages for 'trending and statistics'.

Page 17/68 | < Previous Page | 13 14 15 16 17 18 19 20 21 22 23 24 | Next Page >

Programmatically determine the relative "popularities" of a list of items (books, songs, movies, etc

- by Horace Loeb

Given a list of (say) songs, what's the best way to determine their relative "popularity"? My first thought is to use Google Trends. This list of songs: Subterranean Homesick Blues Empire State of Mind California Gurls produces the following Google Trends report: (to find out what's popular now, I restricted the report to the last 30 days) Empire State of Mind is marginally more popular than California Gurls, and Subterranean Homesick Blues is far less popular than either. So this works pretty well, but what happens when your list is 100 or 1000 songs long? Google Trends only allows you to compare 5 terms at once, so absent a huge round-robin, what's the right approach? Another option is to just do a Google Search for each song and see which has the most results, but this doesn't really measure the same thing

Read the article
Automating R Script

- by ETD

I have written an R script that pulls some data from a database, performs several operations on it and post the output to a new database. I would like this script to run every day at a specific time but I can not find any way to do this effectively. Can anyone recommend a resource I could look at to solve this issue? I am running this script on a Windows machine.

Read the article
Does anyone know of a Cocoa/Obj-C library that can be used to gather application usage data

- by Shackelford R

I would like to be able to gather info like how often certain windows are opened, what types of user data are accessed, how often menu items are clicked, etc. Does anyone know of a 3rd party (open source or commercial) Cocoa/Obj-C library or plugin that would allow me to gather this info?

Read the article
Determining smallest number of samples for 99% accuracy

- by test

I'm trying to compare 100,000 records on a local database (L) with 100,000 records on a remote database (R). Basically I want to know if an elment in L exusts in R. To determine that, I have to make a request against the R for each L, which takes a long time (I know, there should be a better way, there isn't, that's the API I've got). So I would like to test a small sample of L against R, and then infer with some level of confidence how many are present in the whole R. How many do I have to test to have a 99% confidence level?

Read the article
Does Intel Smart Response provide any statistics on the cache usage?

- by Tom Seddon

I've set up my Z68-based Core i7 PC with a 60GB SSD dedicated as a Smart Response cache drive. Is there any way I can get any statistics out of it? It would be nice to have some information on how much cache space is actually being used, maybe how much of it was actually accessed recently, and how many reads in general are coming from the SSD rather than from the mechanical disk. These statistics might help to quickly provide some evidence for or against the use of Smart Response, without my having to reinstall Windows on the SSD (etc.) to find out. The Windows ReadyBoost feature has some performance counters you can access via the Windows 7 perfmon tool, for example, which is the kind of thing I'm hoping is somehow available. Smart Response provides no perfmon counters, though, and the Intel Rapid Storage Utility tells you pretty much nothing except that Smart Response is switched on.

Read the article
How to check if a Statistics is auto-created in a SQL Server 2000 DB using T-SQL?

- by The Shaper

Hi all. A while back I had to come up with a way to clean up all indexes and user-created statistics from some tables in a SQL Server 2005 database. After a few attempts it worked, but now I gotta have it working in SQL Server 2000 databases as well. For SQL Server 2005, I used SELECT Name FROM sys.stats WHERE object_id = object_id(@tableName) AND auto_created = 0 to fetch Statistics that were user-created. However, SQL 2000 doesn't have a sys.stats table. I managed to fetch the indexes and statistics in a distinguishable way from the sysindexes table, but I just couldn't figure out what the sys.stats.auto_created match is for SQL 2000. Any pointers? BTW: T-SQL please.

Read the article
What statistics app should I use for my website?

- by Camran

I have my own server (with root access). I need statistics of users who visit my website etc etc... I have looked at an app called Webalyzer... Is this a good choice? I run apache2 on a Ubuntu 9 system... If you know of any good statistics apps for servers please let me know. And a follow-up question: All statistics are saved in log-files right? So how large would these log-files become then? Possibility to split them would be good, dont know if this is possible with Webalyzer though...

Read the article
Calculating the Size (in Bytes and MB) of a Oracle Coherence Cache

- by Ricardo Ferreira

The concept and usage of data grids are becoming very popular in this days since this type of technology are evolving very fast with some cool lead products like Oracle Coherence. Once for a while, developers need an programmatic way to calculate the total size of a specific cache that are residing in the data grid. In this post, I will show how to accomplish this using Oracle Coherence API. This example has been tested with 3.6, 3.7 and 3.7.1 versions of Oracle Coherence. To start the development of this example, you need to create a POJO ("Plain Old Java Object") that represents a data structure that will hold user data. This data structure will also create an internal fat so I call that should increase considerably the size of each instance in the heap memory. Create a Java class named "Person" as shown in the listing below. package com.oracle.coherence.domain; import java.io.Serializable; import java.util.ArrayList; import java.util.HashMap; import java.util.List; import java.util.Random; @SuppressWarnings("serial") public class Person implements Serializable { private String firstName; private String lastName; private List<Object> fat; private String email; public Person() { generateFat(); } public Person(String firstName, String lastName, String email) { setFirstName(firstName); setLastName(lastName); setEmail(email); generateFat(); } private void generateFat() { fat = new ArrayList<Object>(); Random random = new Random(); for (int i = 0; i < random.nextInt(18000); i++) { HashMap<Long, Double> internalFat = new HashMap<Long, Double>(); for (int j = 0; j < random.nextInt(10000); j++) { internalFat.put(random.nextLong(), random.nextDouble()); } fat.add(internalFat); } } public String getFirstName() { return firstName; } public void setFirstName(String firstName) { this.firstName = firstName; } public String getLastName() { return lastName; } public void setLastName(String lastName) { this.lastName = lastName; } public String getEmail() { return email; } public void setEmail(String email) { this.email = email; } } Now let's create a Java program that will start a data grid into Coherence and will create a cache named "People", that will hold people instances with sequential integer keys. Each person created in this program will trigger the execution of a custom constructor created in the People class that instantiates an internal fat (the random amount of data generated to increase the size of the object) for each person. Create a Java class named "CreatePeopleCacheAndPopulateWithData" as shown in the listing below. package com.oracle.coherence.demo; import com.oracle.coherence.domain.Person; import com.tangosol.net.CacheFactory; import com.tangosol.net.NamedCache; public class CreatePeopleCacheAndPopulateWithData { public static void main(String[] args) { // Asks Coherence for a new cache named "People"... NamedCache people = CacheFactory.getCache("People"); // Creates three people that will be putted into the data grid. Each person // generates an internal fat that should increase its size in terms of bytes... Person pessoa1 = new Person("Ricardo", "Ferreira", "[email protected]"); Person pessoa2 = new Person("Vitor", "Ferreira", "[email protected]"); Person pessoa3 = new Person("Vivian", "Ferreira", "[email protected]"); // Insert three people at the data grid... people.put(1, pessoa1); people.put(2, pessoa2); people.put(3, pessoa3); // Waits for 5 minutes until the user runs the Java program // that calculates the total size of the people cache... try { System.out.println("---> Waiting for 5 minutes for the cache size calculation..."); Thread.sleep(300000); } catch (InterruptedException ie) { ie.printStackTrace(); } } } Finally, let's create a Java program that, using the Coherence API and JMX, will calculate the total size of each cache that the data grid is currently managing. The approach used in this example was retrieve every cache that the data grid are currently managing, but if you are interested on an specific cache, the same approach can be used, you should only filter witch cache will be looked for. Create a Java class named "CalculateTheSizeOfPeopleCache" as shown in the listing below. package com.oracle.coherence.demo; import java.text.DecimalFormat; import java.util.Map; import java.util.Set; import java.util.TreeMap; import javax.management.MBeanServer; import javax.management.MBeanServerFactory; import javax.management.ObjectName; import com.tangosol.net.CacheFactory; public class CalculateTheSizeOfPeopleCache { @SuppressWarnings({ "unchecked", "rawtypes" }) private void run() throws Exception { // Enable JMX support in this Coherence data grid session... System.setProperty("tangosol.coherence.management", "all"); // Create a sample cache just to access the data grid... CacheFactory.getCache(MBeanServerFactory.class.getName()); // Gets the JMX server from Coherence data grid... MBeanServer jmxServer = getJMXServer(); // Creates a internal data structure that would maintain // the statistics from each cache in the data grid... Map cacheList = new TreeMap(); Set jmxObjectList = jmxServer.queryNames(new ObjectName("Coherence:type=Cache,*"), null); for (Object jmxObject : jmxObjectList) { ObjectName jmxObjectName = (ObjectName) jmxObject; String cacheName = jmxObjectName.getKeyProperty("name"); if (cacheName.equals(MBeanServerFactory.class.getName())) { continue; } else { cacheList.put(cacheName, new Statistics(cacheName)); } } // Updates the internal data structure with statistic data // retrieved from caches inside the in-memory data grid... Set<String> cacheNames = cacheList.keySet(); for (String cacheName : cacheNames) { Set resultSet = jmxServer.queryNames( new ObjectName("Coherence:type=Cache,name=" + cacheName + ",*"), null); for (Object resultSetRef : resultSet) { ObjectName objectName = (ObjectName) resultSetRef; if (objectName.getKeyProperty("tier").equals("back")) { int unit = (Integer) jmxServer.getAttribute(objectName, "Units"); int size = (Integer) jmxServer.getAttribute(objectName, "Size"); Statistics statistics = (Statistics) cacheList.get(cacheName); statistics.incrementUnit(unit); statistics.incrementSize(size); cacheList.put(cacheName, statistics); } } } // Finally... print the objects from the internal data // structure that represents the statistics from caches... cacheNames = cacheList.keySet(); for (String cacheName : cacheNames) { Statistics estatisticas = (Statistics) cacheList.get(cacheName); System.out.println(estatisticas); } } public MBeanServer getJMXServer() { MBeanServer jmxServer = null; for (Object jmxServerRef : MBeanServerFactory.findMBeanServer(null)) { jmxServer = (MBeanServer) jmxServerRef; if (jmxServer.getDefaultDomain().equals(DEFAULT_DOMAIN) || DEFAULT_DOMAIN.length() == 0) { break; } jmxServer = null; } if (jmxServer == null) { jmxServer = MBeanServerFactory.createMBeanServer(DEFAULT_DOMAIN); } return jmxServer; } private class Statistics { private long unit; private long size; private String cacheName; public Statistics(String cacheName) { this.cacheName = cacheName; } public void incrementUnit(long unit) { this.unit += unit; } public void incrementSize(long size) { this.size += size; } public long getUnit() { return unit; } public long getSize() { return size; } public double getUnitInMB() { return unit / (1024.0 * 1024.0); } public double getAverageSize() { return size == 0 ? 0 : unit / size; } public String toString() { StringBuffer sb = new StringBuffer(); sb.append("\nCache Statistics of '").append(cacheName).append("':\n"); sb.append(" - Total Entries of Cache -----> " + getSize()).append("\n"); sb.append(" - Used Memory (Bytes) --------> " + getUnit()).append("\n"); sb.append(" - Used Memory (MB) -----------> " + FORMAT.format(getUnitInMB())).append("\n"); sb.append(" - Object Average Size --------> " + FORMAT.format(getAverageSize())).append("\n"); return sb.toString(); } } public static void main(String[] args) throws Exception { new CalculateTheSizeOfPeopleCache().run(); } public static final DecimalFormat FORMAT = new DecimalFormat("###.###"); public static final String DEFAULT_DOMAIN = ""; public static final String DOMAIN_NAME = "Coherence"; } I've commented the overall example so, I don't think that you should get into trouble to understand it. Basically we are dealing with JMX. The first thing to do is enable JMX support for the Coherence client (ie, an JVM that will only retrieve values from the data grid and will not integrate the cluster) application. This can be done very easily using the runtime "tangosol.coherence.management" system property. Consult the Coherence documentation for JMX to understand the possible values that could be applied. The program creates an in memory data structure that holds a custom class created called "Statistics". This class represents the information that we are interested to see, which in this case are the size in bytes and in MB of the caches. An instance of this class is created for each cache that are currently managed by the data grid. Using JMX specific methods, we retrieve the information that are relevant for calculate the total size of the caches. To test this example, you should execute first the CreatePeopleCacheAndPopulateWithData.java program and after the CreatePeopleCacheAndPopulateWithData.java program. The results in the console should be something like this: 2012-06-23 13:29:31.188/4.970 Oracle Coherence 3.6.0.4 <Info> (thread=Main Thread, member=n/a): Loaded operational configuration from "jar:file:/E:/Oracle/Middleware/oepe_11gR1PS4/workspace/calcular-tamanho-cache-coherence/lib/coherence.jar!/tangosol-coherence.xml" 2012-06-23 13:29:31.219/5.001 Oracle Coherence 3.6.0.4 <Info> (thread=Main Thread, member=n/a): Loaded operational overrides from "jar:file:/E:/Oracle/Middleware/oepe_11gR1PS4/workspace/calcular-tamanho-cache-coherence/lib/coherence.jar!/tangosol-coherence-override-dev.xml" 2012-06-23 13:29:31.219/5.001 Oracle Coherence 3.6.0.4 <D5> (thread=Main Thread, member=n/a): Optional configuration override "/tangosol-coherence-override.xml" is not specified 2012-06-23 13:29:31.266/5.048 Oracle Coherence 3.6.0.4 <D5> (thread=Main Thread, member=n/a): Optional configuration override "/custom-mbeans.xml" is not specified Oracle Coherence Version 3.6.0.4 Build 19111 Grid Edition: Development mode Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved. 2012-06-23 13:29:33.156/6.938 Oracle Coherence GE 3.6.0.4 <Info> (thread=Main Thread, member=n/a): Loaded Reporter configuration from "jar:file:/E:/Oracle/Middleware/oepe_11gR1PS4/workspace/calcular-tamanho-cache-coherence/lib/coherence.jar!/reports/report-group.xml" 2012-06-23 13:29:33.500/7.282 Oracle Coherence GE 3.6.0.4 <Info> (thread=Main Thread, member=n/a): Loaded cache configuration from "jar:file:/E:/Oracle/Middleware/oepe_11gR1PS4/workspace/calcular-tamanho-cache-coherence/lib/coherence.jar!/coherence-cache-config.xml" 2012-06-23 13:29:35.391/9.173 Oracle Coherence GE 3.6.0.4 <D4> (thread=Main Thread, member=n/a): TCMP bound to /192.168.177.133:8090 using SystemSocketProvider 2012-06-23 13:29:37.062/10.844 Oracle Coherence GE 3.6.0.4 <Info> (thread=Cluster, member=n/a): This Member(Id=2, Timestamp=2012-06-23 13:29:36.899, Address=192.168.177.133:8090, MachineId=55685, Location=process:244, Role=Oracle, Edition=Grid Edition, Mode=Development, CpuCount=2, SocketCount=2) joined cluster "cluster:0xC4DB" with senior Member(Id=1, Timestamp=2012-06-23 13:29:14.031, Address=192.168.177.133:8088, MachineId=55685, Location=process:1128, Role=CreatePeopleCacheAndPopulateWith, Edition=Grid Edition, Mode=Development, CpuCount=2, SocketCount=2) 2012-06-23 13:29:37.172/10.954 Oracle Coherence GE 3.6.0.4 <D5> (thread=Cluster, member=n/a): Member 1 joined Service Cluster with senior member 1 2012-06-23 13:29:37.188/10.970 Oracle Coherence GE 3.6.0.4 <D5> (thread=Cluster, member=n/a): Member 1 joined Service Management with senior member 1 2012-06-23 13:29:37.188/10.970 Oracle Coherence GE 3.6.0.4 <D5> (thread=Cluster, member=n/a): Member 1 joined Service DistributedCache with senior member 1 2012-06-23 13:29:37.188/10.970 Oracle Coherence GE 3.6.0.4 <Info> (thread=Main Thread, member=n/a): Started cluster Name=cluster:0xC4DB Group{Address=224.3.6.0, Port=36000, TTL=4} MasterMemberSet ( ThisMember=Member(Id=2, Timestamp=2012-06-23 13:29:36.899, Address=192.168.177.133:8090, MachineId=55685, Location=process:244, Role=Oracle) OldestMember=Member(Id=1, Timestamp=2012-06-23 13:29:14.031, Address=192.168.177.133:8088, MachineId=55685, Location=process:1128, Role=CreatePeopleCacheAndPopulateWith) ActualMemberSet=MemberSet(Size=2, BitSetCount=2 Member(Id=1, Timestamp=2012-06-23 13:29:14.031, Address=192.168.177.133:8088, MachineId=55685, Location=process:1128, Role=CreatePeopleCacheAndPopulateWith) Member(Id=2, Timestamp=2012-06-23 13:29:36.899, Address=192.168.177.133:8090, MachineId=55685, Location=process:244, Role=Oracle) ) RecycleMillis=1200000 RecycleSet=MemberSet(Size=0, BitSetCount=0 ) ) TcpRing{Connections=[1]} IpMonitor{AddressListSize=0} 2012-06-23 13:29:37.891/11.673 Oracle Coherence GE 3.6.0.4 <D5> (thread=Invocation:Management, member=2): Service Management joined the cluster with senior service member 1 2012-06-23 13:29:39.203/12.985 Oracle Coherence GE 3.6.0.4 <D5> (thread=DistributedCache, member=2): Service DistributedCache joined the cluster with senior service member 1 2012-06-23 13:29:39.297/13.079 Oracle Coherence GE 3.6.0.4 <D4> (thread=DistributedCache, member=2): Asking member 1 for 128 primary partitions Cache Statistics of 'People': - Total Entries of Cache -----> 3 - Used Memory (Bytes) --------> 883920 - Used Memory (MB) -----------> 0.843 - Object Average Size --------> 294640 I hope that this post could save you some time when calculate the total size of Coherence cache became a requirement for your high scalable system using data grids. See you!

Read the article
Where can I find statistics / figures on how long testing should / could take?

- by NoCarrier

I'm trying to convince management that testing/QA takes considerably longer than non-developers think. Some smaller shops don't have budgets for testers and phbs automatically assume the developer will spend a few minutes after every build "testing" and deliver a perfectly functional system. Can someone point me to some numbers? e.g. Testing should be XX% of your total man hour count , etc etc? Or perhaps some real world experience? My goal is to have some numbers that are grounded in real life so I can make time/effort allocation justifications for "proper" testing when preparing estimates and timelines for applications. Maybe not full blown 100% TDD, but pragmatically close to it. I apologize if I seem vague.

Read the article
Will using HTTPS hurt my site's SEO or other statistics?

- by yannbane

I've set up a WordPress blog. Since I have to log into it from many different locations/machines, I've also got an SSL certificate, and set up Apache to redirect HTTP to HTTPS. It all works, but I'm wondering whether that's an overkill. Since most people who go to my site don't have to log in, I'm starting to wonder whether HTTPS has some drawbacks. If so, should I look for a way to make HTTPS optional?

Read the article
User activity vs. System activity on the Index Usage Statistics report

- by Zachary G Jensen

I recently decided to crawl over the indexes on one of our most heavily used databases to see which were suboptimal. I generated the built-in Index Usage Statistics report from SSMS, and it's showing me a great deal of information that I'm unsure how to understand. I found an article at Carpe Datum about the report, but it doesn't tell me much more than I could assume from the column titles. In particular, the report differentiates between User activity and system activity, and I'm unsure what qualifies as each type of activity. I assume that any query that uses a given index increases the '# of user X' columns. But what increases the system columns? building statistics? Is there anything that depends on the user or role(s) of a user that's running the query?

Read the article
More CPU cores may not always lead to better performance – MAXDOP and query memory distribution in spotlight

- by sqlworkshops

More hardware normally delivers better performance, but there are exceptions where it can hinder performance. Understanding these exceptions and working around it is a major part of SQL Server performance tuning. When a memory allocating query executes in parallel, SQL Server distributes memory to each task that is executing part of the query in parallel. In our example the sort operator that executes in parallel divides the memory across all tasks assuming even distribution of rows. Common memory allocating queries are that perform Sort and do Hash Match operations like Hash Join or Hash Aggregation or Hash Union. In reality, how often are column values evenly distributed, think about an example; are employees working for your company distributed evenly across all the Zip codes or mainly concentrated in the headquarters? What happens when you sort result set based on Zip codes? Do all products in the catalog sell equally or are few products hot selling items? One of my customers tested the below example on a 24 core server with various MAXDOP settings and here are the results:MAXDOP 1: CPU time = 1185 ms, elapsed time = 1188 msMAXDOP 4: CPU time = 1981 ms, elapsed time = 1568 msMAXDOP 8: CPU time = 1918 ms, elapsed time = 1619 msMAXDOP 12: CPU time = 2367 ms, elapsed time = 2258 msMAXDOP 16: CPU time = 2540 ms, elapsed time = 2579 msMAXDOP 20: CPU time = 2470 ms, elapsed time = 2534 msMAXDOP 0: CPU time = 2809 ms, elapsed time = 2721 ms - all 24 cores.In the above test, when the data was evenly distributed, the elapsed time of parallel query was always lower than serial query. Why does the query get slower and slower with more CPU cores / higher MAXDOP? Maybe you can answer this question after reading the article; let me know: [email protected]. Well you get the point, let’s see an example. The best way to learn is to practice. To create the below tables and reproduce the behavior, join the mailing list by using this link: www.sqlworkshops.com/ml and I will send you the table creation script. Let’s update the Employees table with 49 out of 50 employees located in Zip code 2001. update Employees set Zip = EmployeeID / 400 + 1 where EmployeeID % 50 = 1 update Employees set Zip = 2001 where EmployeeID % 50 != 1 go update statistics Employees with fullscan go Let’s create the temporary table #FireDrill with all possible Zip codes. drop table #FireDrill go create table #FireDrill (Zip int primary key) insert into #FireDrill select distinct Zip from Employees update statistics #FireDrill with fullscan go Let’s execute the query serially with MAXDOP 1. --Example provided by www.sqlworkshops.com --Execute query with uneven Zip code distribution --First serially with MAXDOP 1 set statistics time on go declare @EmployeeID int, @EmployeeName varchar(48),@zip int select @EmployeeName = e.EmployeeName, @zip = e.Zip from Employees e inner join #FireDrill fd on (e.Zip = fd.Zip) order by e.Zip option (maxdop 1) goThe query took 1011 ms to complete. The execution plan shows the 77816 KB of memory was granted while the estimated rows were 799624. No Sort Warnings in SQL Server Profiler. Now let’s execute the query in parallel with MAXDOP 0. --Example provided by www.sqlworkshops.com --Execute query with uneven Zip code distribution --In parallel with MAXDOP 0 set statistics time on go declare @EmployeeID int, @EmployeeName varchar(48),@zip int select @EmployeeName = e.EmployeeName, @zip = e.Zip from Employees e inner join #FireDrill fd on (e.Zip = fd.Zip) order by e.Zip option (maxdop 0) go The query took 1912 ms to complete. The execution plan shows the 79360 KB of memory was granted while the estimated rows were 799624. The estimated number of rows between serial and parallel plan are the same. The parallel plan has slightly more memory granted due to additional overhead. Sort properties shows the rows are unevenly distributed over the 4 threads. Sort Warnings in SQL Server Profiler. Intermediate Summary: The reason for the higher duration with parallel plan was sort spill. This is due to uneven distribution of employees over Zip codes, especially concentration of 49 out of 50 employees in Zip code 2001. Now let’s update the Employees table and distribute employees evenly across all Zip codes. update Employees set Zip = EmployeeID / 400 + 1 go update statistics Employees with fullscan go Let’s execute the query serially with MAXDOP 1. --Example provided by www.sqlworkshops.com --Execute query with uneven Zip code distribution --Serially with MAXDOP 1 set statistics time on go declare @EmployeeID int, @EmployeeName varchar(48),@zip int select @EmployeeName = e.EmployeeName, @zip = e.Zip from Employees e inner join #FireDrill fd on (e.Zip = fd.Zip) order by e.Zip option (maxdop 1) go The query took 751 ms to complete. The execution plan shows the 77816 KB of memory was granted while the estimated rows were 784707. No Sort Warnings in SQL Server Profiler. Now let’s execute the query in parallel with MAXDOP 0. --Example provided by www.sqlworkshops.com --Execute query with uneven Zip code distribution --In parallel with MAXDOP 0 set statistics time on go declare @EmployeeID int, @EmployeeName varchar(48),@zip int select @EmployeeName = e.EmployeeName, @zip = e.Zip from Employees e inner join #FireDrill fd on (e.Zip = fd.Zip) order by e.Zip option (maxdop 0) go The query took 661 ms to complete. The execution plan shows the 79360 KB of memory was granted while the estimated rows were 784707. Sort properties shows the rows are evenly distributed over the 4 threads. No Sort Warnings in SQL Server Profiler. Intermediate Summary: When employees were distributed unevenly, concentrated on 1 Zip code, parallel sort spilled while serial sort performed well without spilling to tempdb. When the employees were distributed evenly across all Zip codes, parallel sort and serial sort did not spill to tempdb. This shows uneven data distribution may affect the performance of some parallel queries negatively. For detailed discussion of memory allocation, refer to webcasts available at www.sqlworkshops.com/webcasts. Some of you might conclude from the above execution times that parallel query is not faster even when there is no spill. Below you can see when we are joining limited amount of Zip codes, parallel query will be fasted since it can use Bitmap Filtering. Let’s update the Employees table with 49 out of 50 employees located in Zip code 2001. update Employees set Zip = EmployeeID / 400 + 1 where EmployeeID % 50 = 1 update Employees set Zip = 2001 where EmployeeID % 50 != 1 go update statistics Employees with fullscan go Let’s create the temporary table #FireDrill with limited Zip codes. drop table #FireDrill go create table #FireDrill (Zip int primary key) insert into #FireDrill select distinct Zip from Employees where Zip between 1800 and 2001 update statistics #FireDrill with fullscan go Let’s execute the query serially with MAXDOP 1. --Example provided by www.sqlworkshops.com --Execute query with uneven Zip code distribution --Serially with MAXDOP 1 set statistics time on go declare @EmployeeID int, @EmployeeName varchar(48),@zip int select @EmployeeName = e.EmployeeName, @zip = e.Zip from Employees e inner join #FireDrill fd on (e.Zip = fd.Zip) order by e.Zip option (maxdop 1) go The query took 989 ms to complete. The execution plan shows the 77816 KB of memory was granted while the estimated rows were 785594. No Sort Warnings in SQL Server Profiler. Now let’s execute the query in parallel with MAXDOP 0. --Example provided by www.sqlworkshops.com --Execute query with uneven Zip code distribution --In parallel with MAXDOP 0 set statistics time on go declare @EmployeeID int, @EmployeeName varchar(48),@zip int select @EmployeeName = e.EmployeeName, @zip = e.Zip from Employees e inner join #FireDrill fd on (e.Zip = fd.Zip) order by e.Zip option (maxdop 0) go The query took 1799 ms to complete. The execution plan shows the 79360 KB of memory was granted while the estimated rows were 785594. Sort Warnings in SQL Server Profiler. The estimated number of rows between serial and parallel plan are the same. The parallel plan has slightly more memory granted due to additional overhead. Intermediate Summary: The reason for the higher duration with parallel plan even with limited amount of Zip codes was sort spill. This is due to uneven distribution of employees over Zip codes, especially concentration of 49 out of 50 employees in Zip code 2001. Now let’s update the Employees table and distribute employees evenly across all Zip codes. update Employees set Zip = EmployeeID / 400 + 1 go update statistics Employees with fullscan go Let’s execute the query serially with MAXDOP 1. --Example provided by www.sqlworkshops.com --Execute query with uneven Zip code distribution --Serially with MAXDOP 1 set statistics time on go declare @EmployeeID int, @EmployeeName varchar(48),@zip int select @EmployeeName = e.EmployeeName, @zip = e.Zip from Employees e inner join #FireDrill fd on (e.Zip = fd.Zip) order by e.Zip option (maxdop 1) go The query took 250 ms to complete. The execution plan shows the 9016 KB of memory was granted while the estimated rows were 79973.8. No Sort Warnings in SQL Server Profiler. Now let’s execute the query in parallel with MAXDOP 0. --Example provided by www.sqlworkshops.com --Execute query with uneven Zip code distribution --In parallel with MAXDOP 0 set statistics time on go declare @EmployeeID int, @EmployeeName varchar(48),@zip int select @EmployeeName = e.EmployeeName, @zip = e.Zip from Employees e inner join #FireDrill fd on (e.Zip = fd.Zip) order by e.Zip option (maxdop 0) go The query took 85 ms to complete. The execution plan shows the 13152 KB of memory was granted while the estimated rows were 784707. No Sort Warnings in SQL Server Profiler. Here you see, parallel query is much faster than serial query since SQL Server is using Bitmap Filtering to eliminate rows before the hash join. Parallel queries are very good for performance, but in some cases it can hinder performance. If one identifies the reason for these hindrances, then it is possible to get the best out of parallelism. I covered many aspects of monitoring and tuning parallel queries in webcasts (www.sqlworkshops.com/webcasts) and articles (www.sqlworkshops.com/articles). I suggest you to watch the webcasts and read the articles to better understand how to identify and tune parallel query performance issues. Summary: One has to avoid sort spill over tempdb and the chances of spills are higher when a query executes in parallel with uneven data distribution. Parallel query brings its own advantage, reduced elapsed time and reduced work with Bitmap Filtering. So it is important to understand how to avoid spills over tempdb and when to execute a query in parallel. I explain these concepts with detailed examples in my webcasts (www.sqlworkshops.com/webcasts), I recommend you to watch them. The best way to learn is to practice. To create the above tables and reproduce the behavior, join the mailing list at www.sqlworkshops.com/ml and I will send you the relevant SQL Scripts. Register for the upcoming 3 Day Level 400 Microsoft SQL Server 2008 and SQL Server 2005 Performance Monitoring & Tuning Hands-on Workshop in London, United Kingdom during March 15-17, 2011, click here to register / Microsoft UK TechNet.These are hands-on workshops with a maximum of 12 participants and not lectures. For consulting engagements click here. Disclaimer and copyright information:This article refers to organizations and products that may be the trademarks or registered trademarks of their various owners. Copyright of this article belongs to R Meyyappan / www.sqlworkshops.com. You may freely use the ideas and concepts discussed in this article with acknowledgement (www.sqlworkshops.com), but you may not claim any of it as your own work. This article is for informational purposes only; you use any of the suggestions given here entirely at your own risk. Register for the upcoming 3 Day Level 400 Microsoft SQL Server 2008 and SQL Server 2005 Performance Monitoring & Tuning Hands-on Workshop in London, United Kingdom during March 15-17, 2011, click here to register / Microsoft UK TechNet.These are hands-on workshops with a maximum of 12 participants and not lectures. For consulting engagements click here. R Meyyappan [email protected] LinkedIn: http://at.linkedin.com/in/rmeyyappan

Read the article
What's a good tool for collecting statistics on filesystem usage?

- by Kamil Kisiel

We have a number of filesystems for our computational cluster, with a lot of users that store a lot of really large files. We'd like to monitor the filesystem and help optimize their usage of it, as well as plan for expansion. In order to this, we need some way to monitor how these filesystems are used. Essentially I'd like to know all sorts of statistics about the files: Age Frequency of access Last accessed times Types Sizes Ideally this information would be available in aggregate form for any directory so that we could monitor it based on project or user. Short of writing something up myself in Python, I haven't been able to find any tools capable of performing these duties. Any recommendations?

Read the article
SQL SERVER – Weekly Series – Memory Lane – #035

- by Pinal Dave

Here is the list of selected articles of SQLAuthority.com across all these years. Instead of just listing all the articles I have selected a few of my most favorite articles and have listed them here with additional notes below it. Let me know which one of the following is your favorite article from memory lane. 2007 Row Overflow Data Explanation In SQL Server 2005 one table row can contain more than one varchar(8000) fields. One more thing, the exclusions has exclusions also the limit of each individual column max width of 8000 bytes does not apply to varchar(max), nvarchar(max), varbinary(max), text, image or xml data type columns. Comparison Index Fragmentation, Index De-Fragmentation, Index Rebuild – SQL SERVER 2000 and SQL SERVER 2005 An old but like a gold article. Talks about lots of concepts related to Index and the difference from earlier version to the newer version. I strongly suggest that everyone should read this article just to understand how SQL Server has moved forward with the technology. Improvements in TempDB SQL Server 2005 had come up with quite a lots of improvements and this blog post describes them and explains the same. If you ask me what is my the most favorite article from early career. I must point out to this article as when I wrote this one I personally have learned a lot of new things. Recompile All The Stored Procedure on Specific TableI prefer to recompile all the stored procedure on the table, which has faced mass insert or update. sp_recompiles marks stored procedures to recompile when they execute next time. This blog post explains the same with the help of a script. 2008 SQLAuthority Download – SQL Server Cheatsheet You can download and print this cheat sheet and use it for your personal reference. If you have any suggestions, please let me know and I will see if I can update this SQL Server cheat sheet. Difference Between DBMS and RDBMS What is the difference between DBMS and RDBMS? DBMS – Data Base Management System RDBMS – Relational Data Base Management System or Relational DBMS High Availability – Hot Add Memory Hot Add CPU and Hot Add Memory are extremely interesting features of the SQL Server, however, personally I have not witness them heavily used. These features also have few restriction as well. I blogged about them in detail. 2009 Delete Duplicate Rows I have demonstrated in this blog post how one can identify and delete duplicate rows. Interesting Observation of Logon Trigger On All Servers – Solution The question I put forth in my previous article was – In single login why the trigger fires multiple times; it should be fired only once. I received numerous answers in thread as well as in my MVP private news group. Now, let us discuss the answer for the same. The answer is – It happens because multiple SQL Server services are running as well as intellisense is turned on. Blog post demonstrates how we can do the same with the help of SQL scripts. Management Studio New Features I have selected my favorite 5 features and blogged about it. IntelliSense for Query Editing Multi Server Query Query Editor Regions Object Explorer Enhancements Activity Monitors Maximum Number of Index per Table One of the questions I asked in my user group was – What is the maximum number of Index per table? I received lots of answers to this question but only two answers are correct. Let us now take a look at them in this blog post. 2010 Default Statistics on Column – Automatic Statistics on Column The truth is, Statistics can be in a table even though there is no Index in it. If you have the auto- create and/or auto-update Statistics feature turned on for SQL Server database, Statistics will be automatically created on the Column based on a few conditions. Please read my previously posted article, SQL SERVER – When are Statistics Updated – What triggers Statistics to Update, for the specific conditions when Statistics is updated. 2011 T-SQL Scripts to Find Maximum between Two Numbers In this blog post there are two different scripts listed which demonstrates way to find the maximum number between two numbers. I need your help, which one of the script do you think is the most accurate way to find maximum number? Find Details for Statistics of Whole Database – DMV – T-SQL Script I was recently asked is there a single script which can provide all the necessary details about statistics for any database. This question made me write following script. I was initially planning to use sp_helpstats command but I remembered that this is marked to be deprecated in future. 2012 Introduction to Function SIGN SIGN Function is very fundamental function. It will return the value 1, -1 or 0. If your value is negative it will return you negative -1 and if it is positive it will return you positive +1. Let us start with a simple small example. Template Browser – A Very Important and Useful Feature of SSMS Templates are like a quick cheat sheet or quick reference. Templates are available to create objects like databases, tables, views, indexes, stored procedures, triggers, statistics, and functions. Templates are also available for Analysis Services as well. The template scripts contain parameters to help you customize the code. You can Replace Template Parameters dialog box to insert values into the script. An invalid floating point operation occurred If you run any of the above functions they will give you an error related to invalid floating point. Honestly there is no workaround except passing the function appropriate values. SQRT of a negative number will give you result in real numbers which is not supported at this point of time as well LOG of a negative number is not possible (because logarithm is the inverse function of an exponential function and the exponential function is NEVER negative). Validating Spatial Object with IsValidDetailed Function SQL Server 2012 has introduced the new function IsValidDetailed(). This function has made my life very easy. In simple words, this function will check if the spatial object passed is valid or not. If it is valid it will give information that it is valid. If the spatial object is not valid it will return the answer that it is not valid and the reason for the same. This makes it very easy to debug the issue and make the necessary correction. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Memory Lane, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL, Technology

Read the article
Advanced TSQL Tuning: Why Internals Knowledge Matters

- by Paul White

There is much more to query tuning than reducing logical reads and adding covering nonclustered indexes. Query tuning is not complete as soon as the query returns results quickly in the development or test environments. In production, your query will compete for memory, CPU, locks, I/O and other resources on the server. Today’s entry looks at some tuning considerations that are often overlooked, and shows how deep internals knowledge can help you write better TSQL. As always, we’ll need some example data. In fact, we are going to use three tables today, each of which is structured like this: Each table has 50,000 rows made up of an INTEGER id column and a padding column containing 3,999 characters in every row. The only difference between the three tables is in the type of the padding column: the first table uses CHAR(3999), the second uses VARCHAR(MAX), and the third uses the deprecated TEXT type. A script to create a database with the three tables and load the sample data follows: USE master; GO IF DB_ID('SortTest') IS NOT NULL DROP DATABASE SortTest; GO CREATE DATABASE SortTest COLLATE LATIN1_GENERAL_BIN; GO ALTER DATABASE SortTest MODIFY FILE ( NAME = 'SortTest', SIZE = 3GB, MAXSIZE = 3GB ); GO ALTER DATABASE SortTest MODIFY FILE ( NAME = 'SortTest_log', SIZE = 256MB, MAXSIZE = 1GB, FILEGROWTH = 128MB ); GO ALTER DATABASE SortTest SET ALLOW_SNAPSHOT_ISOLATION OFF ; ALTER DATABASE SortTest SET AUTO_CLOSE OFF ; ALTER DATABASE SortTest SET AUTO_CREATE_STATISTICS ON ; ALTER DATABASE SortTest SET AUTO_SHRINK OFF ; ALTER DATABASE SortTest SET AUTO_UPDATE_STATISTICS ON ; ALTER DATABASE SortTest SET AUTO_UPDATE_STATISTICS_ASYNC ON ; ALTER DATABASE SortTest SET PARAMETERIZATION SIMPLE ; ALTER DATABASE SortTest SET READ_COMMITTED_SNAPSHOT OFF ; ALTER DATABASE SortTest SET MULTI_USER ; ALTER DATABASE SortTest SET RECOVERY SIMPLE ; USE SortTest; GO CREATE TABLE dbo.TestCHAR ( id INTEGER IDENTITY (1,1) NOT NULL, padding CHAR(3999) NOT NULL, CONSTRAINT [PK dbo.TestCHAR (id)] PRIMARY KEY CLUSTERED (id), ) ; CREATE TABLE dbo.TestMAX ( id INTEGER IDENTITY (1,1) NOT NULL, padding VARCHAR(MAX) NOT NULL, CONSTRAINT [PK dbo.TestMAX (id)] PRIMARY KEY CLUSTERED (id), ) ; CREATE TABLE dbo.TestTEXT ( id INTEGER IDENTITY (1,1) NOT NULL, padding TEXT NOT NULL, CONSTRAINT [PK dbo.TestTEXT (id)] PRIMARY KEY CLUSTERED (id), ) ; -- ============= -- Load TestCHAR (about 3s) -- ============= INSERT INTO dbo.TestCHAR WITH (TABLOCKX) ( padding ) SELECT padding = REPLICATE(CHAR(65 + (Data.n % 26)), 3999) FROM ( SELECT TOP (50000) n = ROW_NUMBER() OVER (ORDER BY (SELECT 0)) - 1 FROM master.sys.columns C1, master.sys.columns C2, master.sys.columns C3 ORDER BY n ASC ) AS Data ORDER BY Data.n ASC ; -- ============ -- Load TestMAX (about 3s) -- ============ INSERT INTO dbo.TestMAX WITH (TABLOCKX) ( padding ) SELECT CONVERT(VARCHAR(MAX), padding) FROM dbo.TestCHAR ORDER BY id ; -- ============= -- Load TestTEXT (about 5s) -- ============= INSERT INTO dbo.TestTEXT WITH (TABLOCKX) ( padding ) SELECT CONVERT(TEXT, padding) FROM dbo.TestCHAR ORDER BY id ; -- ========== -- Space used -- ========== -- EXECUTE sys.sp_spaceused @objname = 'dbo.TestCHAR'; EXECUTE sys.sp_spaceused @objname = 'dbo.TestMAX'; EXECUTE sys.sp_spaceused @objname = 'dbo.TestTEXT'; ; CHECKPOINT ; That takes around 15 seconds to run, and shows the space allocated to each table in its output: To illustrate the points I want to make today, the example task we are going to set ourselves is to return a random set of 150 rows from each table. The basic shape of the test query is the same for each of the three test tables: SELECT TOP (150) T.id, T.padding FROM dbo.Test AS T ORDER BY NEWID() OPTION (MAXDOP 1) ; Test 1 – CHAR(3999) Running the template query shown above using the TestCHAR table as the target, we find that the query takes around 5 seconds to return its results. This seems slow, considering that the table only has 50,000 rows. Working on the assumption that generating a GUID for each row is a CPU-intensive operation, we might try enabling parallelism to see if that speeds up the response time. Running the query again (but without the MAXDOP 1 hint) on a machine with eight logical processors, the query now takes 10 seconds to execute – twice as long as when run serially. Rather than attempting further guesses at the cause of the slowness, let’s go back to serial execution and add some monitoring. The script below monitors STATISTICS IO output and the amount of tempdb used by the test query. We will also run a Profiler trace to capture any warnings generated during query execution. DECLARE @read BIGINT, @write BIGINT ; SELECT @read = SUM(num_of_bytes_read), @write = SUM(num_of_bytes_written) FROM tempdb.sys.database_files AS DBF JOIN sys.dm_io_virtual_file_stats(2, NULL) AS FS ON FS.file_id = DBF.file_id WHERE DBF.type_desc = 'ROWS' ; SET STATISTICS IO ON ; SELECT TOP (150) TC.id, TC.padding FROM dbo.TestCHAR AS TC ORDER BY NEWID() OPTION (MAXDOP 1) ; SET STATISTICS IO OFF ; SELECT tempdb_read_MB = (SUM(num_of_bytes_read) - @read) / 1024. / 1024., tempdb_write_MB = (SUM(num_of_bytes_written) - @write) / 1024. / 1024., internal_use_MB = ( SELECT internal_objects_alloc_page_count / 128.0 FROM sys.dm_db_task_space_usage WHERE session_id = @@SPID ) FROM tempdb.sys.database_files AS DBF JOIN sys.dm_io_virtual_file_stats(2, NULL) AS FS ON FS.file_id = DBF.file_id WHERE DBF.type_desc = 'ROWS' ; Let’s take a closer look at the statistics and query plan generated from this: Following the flow of the data from right to left, we see the expected 50,000 rows emerging from the Clustered Index Scan, with a total estimated size of around 191MB. The Compute Scalar adds a column containing a random GUID (generated from the NEWID() function call) for each row. With this extra column in place, the size of the data arriving at the Sort operator is estimated to be 192MB. Sort is a blocking operator – it has to examine all of the rows on its input before it can produce its first row of output (the last row received might sort first). This characteristic means that Sort requires a memory grant – memory allocated for the query’s use by SQL Server just before execution starts. In this case, the Sort is the only memory-consuming operator in the plan, so it has access to the full 243MB (248,696KB) of memory reserved by SQL Server for this query execution. Notice that the memory grant is significantly larger than the expected size of the data to be sorted. SQL Server uses a number of techniques to speed up sorting, some of which sacrifice size for comparison speed. Sorts typically require a very large number of comparisons, so this is usually a very effective optimization. One of the drawbacks is that it is not possible to exactly predict the sort space needed, as it depends on the data itself. SQL Server takes an educated guess based on data types, sizes, and the number of rows expected, but the algorithm is not perfect. In spite of the large memory grant, the Profiler trace shows a Sort Warning event (indicating that the sort ran out of memory), and the tempdb usage monitor shows that 195MB of tempdb space was used – all of that for system use. The 195MB represents physical write activity on tempdb, because SQL Server strictly enforces memory grants – a query cannot ‘cheat’ and effectively gain extra memory by spilling to tempdb pages that reside in memory. Anyway, the key point here is that it takes a while to write 195MB to disk, and this is the main reason that the query takes 5 seconds overall. If you are wondering why using parallelism made the problem worse, consider that eight threads of execution result in eight concurrent partial sorts, each receiving one eighth of the memory grant. The eight sorts all spilled to tempdb, resulting in inefficiencies as the spilled sorts competed for disk resources. More importantly, there are specific problems at the point where the eight partial results are combined, but I’ll cover that in a future post. CHAR(3999) Performance Summary: 5 seconds elapsed time 243MB memory grant 195MB tempdb usage 192MB estimated sort set 25,043 logical reads Sort Warning Test 2 – VARCHAR(MAX) We’ll now run exactly the same test (with the additional monitoring) on the table using a VARCHAR(MAX) padding column: DECLARE @read BIGINT, @write BIGINT ; SELECT @read = SUM(num_of_bytes_read), @write = SUM(num_of_bytes_written) FROM tempdb.sys.database_files AS DBF JOIN sys.dm_io_virtual_file_stats(2, NULL) AS FS ON FS.file_id = DBF.file_id WHERE DBF.type_desc = 'ROWS' ; SET STATISTICS IO ON ; SELECT TOP (150) TM.id, TM.padding FROM dbo.TestMAX AS TM ORDER BY NEWID() OPTION (MAXDOP 1) ; SET STATISTICS IO OFF ; SELECT tempdb_read_MB = (SUM(num_of_bytes_read) - @read) / 1024. / 1024., tempdb_write_MB = (SUM(num_of_bytes_written) - @write) / 1024. / 1024., internal_use_MB = ( SELECT internal_objects_alloc_page_count / 128.0 FROM sys.dm_db_task_space_usage WHERE session_id = @@SPID ) FROM tempdb.sys.database_files AS DBF JOIN sys.dm_io_virtual_file_stats(2, NULL) AS FS ON FS.file_id = DBF.file_id WHERE DBF.type_desc = 'ROWS' ; This time the query takes around 8 seconds to complete (3 seconds longer than Test 1). Notice that the estimated row and data sizes are very slightly larger, and the overall memory grant has also increased very slightly to 245MB. The most marked difference is in the amount of tempdb space used – this query wrote almost 391MB of sort run data to the physical tempdb file. Don’t draw any general conclusions about VARCHAR(MAX) versus CHAR from this – I chose the length of the data specifically to expose this edge case. In most cases, VARCHAR(MAX) performs very similarly to CHAR – I just wanted to make test 2 a bit more exciting. MAX Performance Summary: 8 seconds elapsed time 245MB memory grant 391MB tempdb usage 193MB estimated sort set 25,043 logical reads Sort warning Test 3 – TEXT The same test again, but using the deprecated TEXT data type for the padding column: DECLARE @read BIGINT, @write BIGINT ; SELECT @read = SUM(num_of_bytes_read), @write = SUM(num_of_bytes_written) FROM tempdb.sys.database_files AS DBF JOIN sys.dm_io_virtual_file_stats(2, NULL) AS FS ON FS.file_id = DBF.file_id WHERE DBF.type_desc = 'ROWS' ; SET STATISTICS IO ON ; SELECT TOP (150) TT.id, TT.padding FROM dbo.TestTEXT AS TT ORDER BY NEWID() OPTION (MAXDOP 1, RECOMPILE) ; SET STATISTICS IO OFF ; SELECT tempdb_read_MB = (SUM(num_of_bytes_read) - @read) / 1024. / 1024., tempdb_write_MB = (SUM(num_of_bytes_written) - @write) / 1024. / 1024., internal_use_MB = ( SELECT internal_objects_alloc_page_count / 128.0 FROM sys.dm_db_task_space_usage WHERE session_id = @@SPID ) FROM tempdb.sys.database_files AS DBF JOIN sys.dm_io_virtual_file_stats(2, NULL) AS FS ON FS.file_id = DBF.file_id WHERE DBF.type_desc = 'ROWS' ; This time the query runs in 500ms. If you look at the metrics we have been checking so far, it’s not hard to understand why: TEXT Performance Summary: 0.5 seconds elapsed time 9MB memory grant 5MB tempdb usage 5MB estimated sort set 207 logical reads 596 LOB logical reads Sort warning SQL Server’s memory grant algorithm still underestimates the memory needed to perform the sorting operation, but the size of the data to sort is so much smaller (5MB versus 193MB previously) that the spilled sort doesn’t matter very much. Why is the data size so much smaller? The query still produces the correct results – including the large amount of data held in the padding column – so what magic is being performed here? TEXT versus MAX Storage The answer lies in how columns of the TEXT data type are stored. By default, TEXT data is stored off-row in separate LOB pages – which explains why this is the first query we have seen that records LOB logical reads in its STATISTICS IO output. You may recall from my last post that LOB data leaves an in-row pointer to the separate storage structure holding the LOB data. SQL Server can see that the full LOB value is not required by the query plan until results are returned, so instead of passing the full LOB value down the plan from the Clustered Index Scan, it passes the small in-row structure instead. SQL Server estimates that each row coming from the scan will be 79 bytes long – 11 bytes for row overhead, 4 bytes for the integer id column, and 64 bytes for the LOB pointer (in fact the pointer is rather smaller – usually 16 bytes – but the details of that don’t really matter right now). OK, so this query is much more efficient because it is sorting a very much smaller data set – SQL Server delays retrieving the LOB data itself until after the Sort starts producing its 150 rows. The question that normally arises at this point is: Why doesn’t SQL Server use the same trick when the padding column is defined as VARCHAR(MAX)? The answer is connected with the fact that if the actual size of the VARCHAR(MAX) data is 8000 bytes or less, it is usually stored in-row in exactly the same way as for a VARCHAR(8000) column – MAX data only moves off-row into LOB storage when it exceeds 8000 bytes. The default behaviour of the TEXT type is to be stored off-row by default, unless the ‘text in row’ table option is set suitably and there is room on the page. There is an analogous (but opposite) setting to control the storage of MAX data – the ‘large value types out of row’ table option. By enabling this option for a table, MAX data will be stored off-row (in a LOB structure) instead of in-row. SQL Server Books Online has good coverage of both options in the topic In Row Data. The MAXOOR Table The essential difference, then, is that MAX defaults to in-row storage, and TEXT defaults to off-row (LOB) storage. You might be thinking that we could get the same benefits seen for the TEXT data type by storing the VARCHAR(MAX) values off row – so let’s look at that option now. This script creates a fourth table, with the VARCHAR(MAX) data stored off-row in LOB pages: CREATE TABLE dbo.TestMAXOOR ( id INTEGER IDENTITY (1,1) NOT NULL, padding VARCHAR(MAX) NOT NULL, CONSTRAINT [PK dbo.TestMAXOOR (id)] PRIMARY KEY CLUSTERED (id), ) ; EXECUTE sys.sp_tableoption @TableNamePattern = N'dbo.TestMAXOOR', @OptionName = 'large value types out of row', @OptionValue = 'true' ; SELECT large_value_types_out_of_row FROM sys.tables WHERE [schema_id] = SCHEMA_ID(N'dbo') AND name = N'TestMAXOOR' ; INSERT INTO dbo.TestMAXOOR WITH (TABLOCKX) ( padding ) SELECT SPACE(0) FROM dbo.TestCHAR ORDER BY id ; UPDATE TM WITH (TABLOCK) SET padding.WRITE (TC.padding, NULL, NULL) FROM dbo.TestMAXOOR AS TM JOIN dbo.TestCHAR AS TC ON TC.id = TM.id ; EXECUTE sys.sp_spaceused @objname = 'dbo.TestMAXOOR' ; CHECKPOINT ; Test 4 – MAXOOR We can now re-run our test on the MAXOOR (MAX out of row) table: DECLARE @read BIGINT, @write BIGINT ; SELECT @read = SUM(num_of_bytes_read), @write = SUM(num_of_bytes_written) FROM tempdb.sys.database_files AS DBF JOIN sys.dm_io_virtual_file_stats(2, NULL) AS FS ON FS.file_id = DBF.file_id WHERE DBF.type_desc = 'ROWS' ; SET STATISTICS IO ON ; SELECT TOP (150) MO.id, MO.padding FROM dbo.TestMAXOOR AS MO ORDER BY NEWID() OPTION (MAXDOP 1, RECOMPILE) ; SET STATISTICS IO OFF ; SELECT tempdb_read_MB = (SUM(num_of_bytes_read) - @read) / 1024. / 1024., tempdb_write_MB = (SUM(num_of_bytes_written) - @write) / 1024. / 1024., internal_use_MB = ( SELECT internal_objects_alloc_page_count / 128.0 FROM sys.dm_db_task_space_usage WHERE session_id = @@SPID ) FROM tempdb.sys.database_files AS DBF JOIN sys.dm_io_virtual_file_stats(2, NULL) AS FS ON FS.file_id = DBF.file_id WHERE DBF.type_desc = 'ROWS' ; TEXT Performance Summary: 0.3 seconds elapsed time 245MB memory grant 0MB tempdb usage 193MB estimated sort set 207 logical reads 446 LOB logical reads No sort warning The query runs very quickly – slightly faster than Test 3, and without spilling the sort to tempdb (there is no sort warning in the trace, and the monitoring query shows zero tempdb usage by this query). SQL Server is passing the in-row pointer structure down the plan and only looking up the LOB value on the output side of the sort. The Hidden Problem There is still a huge problem with this query though – it requires a 245MB memory grant. No wonder the sort doesn’t spill to tempdb now – 245MB is about 20 times more memory than this query actually requires to sort 50,000 records containing LOB data pointers. Notice that the estimated row and data sizes in the plan are the same as in test 2 (where the MAX data was stored in-row). The optimizer assumes that MAX data is stored in-row, regardless of the sp_tableoption setting ‘large value types out of row’. Why? Because this option is dynamic – changing it does not immediately force all MAX data in the table in-row or off-row, only when data is added or actually changed. SQL Server does not keep statistics to show how much MAX or TEXT data is currently in-row, and how much is stored in LOB pages. This is an annoying limitation, and one which I hope will be addressed in a future version of the product. So why should we worry about this? Excessive memory grants reduce concurrency and may result in queries waiting on the RESOURCE_SEMAPHORE wait type while they wait for memory they do not need. 245MB is an awful lot of memory, especially on 32-bit versions where memory grants cannot use AWE-mapped memory. Even on a 64-bit server with plenty of memory, do you really want a single query to consume 0.25GB of memory unnecessarily? That’s 32,000 8KB pages that might be put to much better use. The Solution The answer is not to use the TEXT data type for the padding column. That solution happens to have better performance characteristics for this specific query, but it still results in a spilled sort, and it is hard to recommend the use of a data type which is scheduled for removal. I hope it is clear to you that the fundamental problem here is that SQL Server sorts the whole set arriving at a Sort operator. Clearly, it is not efficient to sort the whole table in memory just to return 150 rows in a random order. The TEXT example was more efficient because it dramatically reduced the size of the set that needed to be sorted. We can do the same thing by selecting 150 unique keys from the table at random (sorting by NEWID() for example) and only then retrieving the large padding column values for just the 150 rows we need. The following script implements that idea for all four tables: SET STATISTICS IO ON ; WITH TestTable AS ( SELECT * FROM dbo.TestCHAR ), TopKeys AS ( SELECT TOP (150) id FROM TestTable ORDER BY NEWID() ) SELECT T1.id, T1.padding FROM TestTable AS T1 WHERE T1.id = ANY (SELECT id FROM TopKeys) OPTION (MAXDOP 1) ; WITH TestTable AS ( SELECT * FROM dbo.TestMAX ), TopKeys AS ( SELECT TOP (150) id FROM TestTable ORDER BY NEWID() ) SELECT T1.id, T1.padding FROM TestTable AS T1 WHERE T1.id IN (SELECT id FROM TopKeys) OPTION (MAXDOP 1) ; WITH TestTable AS ( SELECT * FROM dbo.TestTEXT ), TopKeys AS ( SELECT TOP (150) id FROM TestTable ORDER BY NEWID() ) SELECT T1.id, T1.padding FROM TestTable AS T1 WHERE T1.id IN (SELECT id FROM TopKeys) OPTION (MAXDOP 1) ; WITH TestTable AS ( SELECT * FROM dbo.TestMAXOOR ), TopKeys AS ( SELECT TOP (150) id FROM TestTable ORDER BY NEWID() ) SELECT T1.id, T1.padding FROM TestTable AS T1 WHERE T1.id IN (SELECT id FROM TopKeys) OPTION (MAXDOP 1) ; SET STATISTICS IO OFF ; All four queries now return results in much less than a second, with memory grants between 6 and 12MB, and without spilling to tempdb. The small remaining inefficiency is in reading the id column values from the clustered primary key index. As a clustered index, it contains all the in-row data at its leaf. The CHAR and VARCHAR(MAX) tables store the padding column in-row, so id values are separated by a 3999-character column, plus row overhead. The TEXT and MAXOOR tables store the padding values off-row, so id values in the clustered index leaf are separated by the much-smaller off-row pointer structure. This difference is reflected in the number of logical page reads performed by the four queries: Table 'TestCHAR' logical reads 25511 lob logical reads 000 Table 'TestMAX'. logical reads 25511 lob logical reads 000 Table 'TestTEXT' logical reads 00412 lob logical reads 597 Table 'TestMAXOOR' logical reads 00413 lob logical reads 446 We can increase the density of the id values by creating a separate nonclustered index on the id column only. This is the same key as the clustered index, of course, but the nonclustered index will not include the rest of the in-row column data. CREATE UNIQUE NONCLUSTERED INDEX uq1 ON dbo.TestCHAR (id); CREATE UNIQUE NONCLUSTERED INDEX uq1 ON dbo.TestMAX (id); CREATE UNIQUE NONCLUSTERED INDEX uq1 ON dbo.TestTEXT (id); CREATE UNIQUE NONCLUSTERED INDEX uq1 ON dbo.TestMAXOOR (id); The four queries can now use the very dense nonclustered index to quickly scan the id values, sort them by NEWID(), select the 150 ids we want, and then look up the padding data. The logical reads with the new indexes in place are: Table 'TestCHAR' logical reads 835 lob logical reads 0 Table 'TestMAX' logical reads 835 lob logical reads 0 Table 'TestTEXT' logical reads 686 lob logical reads 597 Table 'TestMAXOOR' logical reads 686 lob logical reads 448 With the new index, all four queries use the same query plan (click to enlarge): Performance Summary: 0.3 seconds elapsed time 6MB memory grant 0MB tempdb usage 1MB sort set 835 logical reads (CHAR, MAX) 686 logical reads (TEXT, MAXOOR) 597 LOB logical reads (TEXT) 448 LOB logical reads (MAXOOR) No sort warning I’ll leave it as an exercise for the reader to work out why trying to eliminate the Key Lookup by adding the padding column to the new nonclustered indexes would be a daft idea Conclusion This post is not about tuning queries that access columns containing big strings. It isn’t about the internal differences between TEXT and MAX data types either. It isn’t even about the cool use of UPDATE .WRITE used in the MAXOOR table load. No, this post is about something else: Many developers might not have tuned our starting example query at all – 5 seconds isn’t that bad, and the original query plan looks reasonable at first glance. Perhaps the NEWID() function would have been blamed for ‘just being slow’ – who knows. 5 seconds isn’t awful – unless your users expect sub-second responses – but using 250MB of memory and writing 200MB to tempdb certainly is! If ten sessions ran that query at the same time in production that’s 2.5GB of memory usage and 2GB hitting tempdb. Of course, not all queries can be rewritten to avoid large memory grants and sort spills using the key-lookup technique in this post, but that’s not the point either. The point of this post is that a basic understanding of execution plans is not enough. Tuning for logical reads and adding covering indexes is not enough. If you want to produce high-quality, scalable TSQL that won’t get you paged as soon as it hits production, you need a deep understanding of execution plans, and as much accurate, deep knowledge about SQL Server as you can lay your hands on. The advanced database developer has a wide range of tools to use in writing queries that perform well in a range of circumstances. By the way, the examples in this post were written for SQL Server 2008. They will run on 2005 and demonstrate the same principles, but you won’t get the same figures I did because 2005 had a rather nasty bug in the Top N Sort operator. Fair warning: if you do decide to run the scripts on a 2005 instance (particularly the parallel query) do it before you head out for lunch… This post is dedicated to the people of Christchurch, New Zealand. © 2011 Paul White email: @[email protected] twitter: @SQL_Kiwi

Read the article
Retrieve Heap memory size and its usage statistics etc...?

- by AKN

Lets say I open some application or process. Did some work with that. Now I closed it. Need to know whether this application caused any memory leak. i.e used up some heap memory and not cleared it properly. Can I get this statistics some how? I'm using Visual Studio (for development) under Windows OS. Even I would be interested in knowing this information for any 3rd party application.

Read the article
Where to check for heap memory size and its usage statistics etc... in windows?

- by AKN

Lets say I open some application or process. Did some work with that. Now I closed it. Need to know whether this application caused any memory leak. i.e used up some heap memory and not cleared it properly. Can I get this statistics some how? I'm using Visual Studio (for development) under Windows OS. Even I would be interested in knowing this information for any 3rd party application.

Read the article
Retrieve Performance Data from SOA Infrastructure Database

- by fip

My earlier blog posting shows how to enable, retrieve and interpret BPEL engine performance statistics to aid performance troubleshooting. The strength of BPEL engine statistics at EM is its break down per request. But there are some limitations with the BPEL performance statistics mentioned in that blog posting: The statistics were stored in memory instead of being persisted. To avoid memory overflow, the data are stored to a buffer with limited size. When the statistic entries exceed the limitation, old data will be flushed out to give ways to new statistics. Therefore it can only keep the last X number of entries of data. The statistics 5 hour ago may not be there anymore. The BPEL engine performance statistics only includes latencies. It does not provide throughputs. Fortunately, Oracle SOA Suite runs with the SOA Infrastructure database and a lot of performance data are naturally persisted there. It is at a more coarse grain than the in-memory BPEL Statistics, but it does have its own strengths as it is persisted. Here I would like offer examples of some basic SQL queries you can run against the infrastructure database of Oracle SOA Suite 11G to acquire the performance statistics for a given period of time. You can run it immediately after you modify the date range to match your actual system. 1. Asynchronous/one-way messages incoming rates The following query will show number of messages sent to one-way/async BPEL processes during a given time period, organized by process names and states select composite_name composite, state, count(*) Count from dlv_message where receive_date >= to_timestamp('2012-10-24 21:00:00','YYYY-MM-DD HH24:MI:SS') and receive_date <= to_timestamp('2012-10-24 21:59:59','YYYY-MM-DD HH24:MI:SS') group by composite_name, state order by Count; 2. Throughput of BPEL process instances The following query shows the number of synchronous and asynchronous process instances created during a given time period. It list instances of all states, including the unfinished and faulted ones. The results will include all composites cross all SOA partitions select state, count(*) Count, composite_name composite, component_name,componenttype from cube_instance where creation_date >= to_timestamp('2012-10-24 21:00:00','YYYY-MM-DD HH24:MI:SS') and creation_date <= to_timestamp('2012-10-24 21:59:59','YYYY-MM-DD HH24:MI:SS') group by composite_name, component_name, componenttype order by count(*) desc; 3. Throughput and latencies of BPEL process instances This query is augmented on the previous one, providing more comprehensive information. It gives not only throughput but also the maximum, minimum and average elapse time BPEL process instances. select composite_name Composite, component_name Process, componenttype, state, count(*) Count, trunc(Max(extract(day from (modify_date-creation_date))*24*60*60 + extract(hour from (modify_date-creation_date))*60*60 + extract(minute from (modify_date-creation_date))*60 + extract(second from (modify_date-creation_date))),4) MaxTime, trunc(Min(extract(day from (modify_date-creation_date))*24*60*60 + extract(hour from (modify_date-creation_date))*60*60 + extract(minute from (modify_date-creation_date))*60 + extract(second from (modify_date-creation_date))),4) MinTime, trunc(AVG(extract(day from (modify_date-creation_date))*24*60*60 + extract(hour from (modify_date-creation_date))*60*60 + extract(minute from (modify_date-creation_date))*60 + extract(second from (modify_date-creation_date))),4) AvgTime from cube_instance where creation_date >= to_timestamp('2012-10-24 21:00:00','YYYY-MM-DD HH24:MI:SS') and creation_date <= to_timestamp('2012-10-24 21:59:59','YYYY-MM-DD HH24:MI:SS') group by composite_name, component_name, componenttype, state order by count(*) desc; 4. Combine all together Now let's combine all of these 3 queries together, and parameterize the start and end time stamps to make the script a bit more robust. The following script will prompt for the start and end time before querying against the database: accept startTime prompt 'Enter start time (YYYY-MM-DD HH24:MI:SS)' accept endTime prompt 'Enter end time (YYYY-MM-DD HH24:MI:SS)' Prompt "==== Rejected Messages ===="; REM 2012-10-24 21:00:00 REM 2012-10-24 21:59:59 select count(*), composite_dn from rejected_message where created_time >= to_timestamp('&&StartTime','YYYY-MM-DD HH24:MI:SS') and created_time <= to_timestamp('&&EndTime','YYYY-MM-DD HH24:MI:SS') group by composite_dn; Prompt " "; Prompt "==== Throughput of one-way/asynchronous messages ===="; select state, count(*) Count, composite_name composite from dlv_message where receive_date >= to_timestamp('&StartTime','YYYY-MM-DD HH24:MI:SS') and receive_date <= to_timestamp('&EndTime','YYYY-MM-DD HH24:MI:SS') group by composite_name, state order by Count; Prompt " "; Prompt "==== Throughput and latency of BPEL process instances ====" select state, count(*) Count, trunc(Max(extract(day from (modify_date-creation_date))*24*60*60 + extract(hour from (modify_date-creation_date))*60*60 + extract(minute from (modify_date-creation_date))*60 + extract(second from (modify_date-creation_date))),4) MaxTime, trunc(Min(extract(day from (modify_date-creation_date))*24*60*60 + extract(hour from (modify_date-creation_date))*60*60 + extract(minute from (modify_date-creation_date))*60 + extract(second from (modify_date-creation_date))),4) MinTime, trunc(AVG(extract(day from (modify_date-creation_date))*24*60*60 + extract(hour from (modify_date-creation_date))*60*60 + extract(minute from (modify_date-creation_date))*60 + extract(second from (modify_date-creation_date))),4) AvgTime, composite_name Composite, component_name Process, componenttype from cube_instance where creation_date >= to_timestamp('&StartTime','YYYY-MM-DD HH24:MI:SS') and creation_date <= to_timestamp('&EndTime','YYYY-MM-DD HH24:MI:SS') group by composite_name, component_name, componenttype, state order by count(*) desc;

Read the article
How to generate graphs and statistics from SQLAlchemy tables [Python]?

- by Az

Hi all, After running a bunch of simulations I'm going to be outputting the results into a table created using SQLAlchemy. I plan to use this data to generatw statistics - mean and variance being key. These, in turn, will be used to generate some graphs - histograms/line graphs, pie-charts and box-and-whisker plots specifically. I'm aware of the Python graphing libraries like matplotlib. The thing is, I'm not sure how to have this integrate with the information contained within the database tables. Any suggestions on how to make these two play with each other? The main problem is that I'm not sure how to supply the information as "data sets" to the graphing library. Thanks in advance.

Read the article
How do I modify gitstats to only utilize a specified file extension for it's statistics?

- by Fake Code Monkey Rashid

Hello good people! The website of the statistics generator in question is: http://gitstats.sourceforge.net/ It's git repo can be cloned from: git clone git://repo.or.cz/gitstats.git What I want to do is something like: ./gitstatus --ext=".py" /input/foo /output/bar Failing being able to easily pass the above option without heavy modification, I'd just hardcore the file extentsion I want to be included. However, I'm unsure of the relevant section of code to modify and even if I did no, I'm unsure of how to start such modifications. It's seems like it'd be rather simple but alas...

Read the article
Large Product catalog with statistics - alternatives to Sql Server?

- by Eric P

I am building UI for a large product catalog (millions of products). I am using Sql Server, FreeText search and ASP.NET MVC. Tables are normalized and indexed. Most queries take less then a second to return. The issue is this. Let's say user does the search by keyword. On search results page I need to display/query for: First 20 matching products (paged, sorted) Total count of matching products for paging List of stores only of matching products List of brands only of matching products List of colors only of matching products Each query takes about .5 to 1 seconds. Altogether it is like 5 seconds. I would like to get the whole page to load under 1 second. There are several approaches: Optimize queries even more. I already spent a lot of time on this one, so not sure it can be pushed further. Load products first, then load the rest of the information using AJAX. More like a workaround. Will need to revise UI. Re-organize data to be more Report friendly. Already aggregated a lot of fields. I checked out several similar sites. For ex. zappos.com. Not only they display the same information as I would like in under 1 second, but they also include statistics (number of results in each category). The following is the search for keyword "white" http://www.zappos.com/white How do sites like zappos, amazon make their results, filters and stats appear almost instantly?

Read the article
Migrating to Amazon AWS etc: What key statistics/questions should be analyzed and asked?

- by cerd

I searched SOverflow pretty extensively for something similar to this set of questions. BACKGROUND: We are a growing 'big(ish)' data chemical data company that are outgrowing our lab and our dedicated production workhorses. Make no mistake, we need to do some serious query optimization. Our data (It comes from a certain govt. agency so the schema and lack of indexing is atrocious). So yes, I know, AWS or EC2 is not a silver bullet in the face of spending time to maybe rework your queries/code entirely 'out of the box'. With that said I would appreciate any input on the following questions: We produce on CentOS and lab on Ubuntu LTS which I prefer especially with their growing cloud / AWS integration. If we are mysql centric, and our biggest problem is these big cartesian products that produce slow queries, should we roll out what we know after more optimization with respect to Ubuntu/mySQL with the added Amazon horsepower? Or is there some merit to the NoSQL and other technologies they offer? What are the key metrics I need to gather from apache and mysql other than like: Disk I/O operations, Data up/down avgs and trends and special high usage periods/scenarios? I've reviewed AWS/EC2 fine print, but want 2nd opinions. What other services aside from the basic web/database have proven valuable to you? I know nothing of Hadoop or many other technologies they offer, echoing my prev. question, do you sometimes find it worth it (Initially having it be a gamble aside from basic homework) to dive/break into a whole new environment and try to/or end up finding a way of more efficiently producing your data/site product? Anything I should watch out for in projecting costs, or any other general advice when working with AWS folks from anyone else where your company is very niche and very very technical (Scientifically - or anybody for that matter)? Thanks very much for your input - I think this thread could be valuable to others as well.

Read the article
Are there any statistics on the number of networks using Ethernet Jumbo frames?

- by Russell Heilling

I am often told by Sales Engineers and Product managers that a Layer 2 Ethernet service is not fit for purpose if the maximum supported frame size is in the 1518-1522 range (enough to support standard Ethernet frames or VLAN tagged frames). Or in other words: an MTU of 1500 is not enough (see this blog post for my definition of MTU and a short rant on how the term is often misused) I am never able to get any details from them on: a) What proportion of customers (Enterprise / SMB) require Jumbo frames b) What are typical expectations on frame size for Jumbos on WAN links? 1600, 2000, 9k, etc... I know that in Telco and Data Centre environments Jumbos are pretty common, but I am after some insight as to how common this is within Enterprise and SMB networks.

Read the article
How to get DNS SERVER Statistics ? Total Queries per day ? (windows Server 2003)

- by Xavier C

Hi Pros, Using a script (powershell for example) How can i get the total Queries on a DNS Server (Windows Server 2003 - Active Directory Environment) ? Thanks

Read the article
Which JMX statistics to watch out for in Catalina/Tomcat?

- by geoaxis

I have configured OpenNMS to collect all kinds of numeric data coming out of tomcat7 jmx. There are a lot of things. I am interested in monitoring this tomcat instance so that I can avoid down time and lockups. What metrics should I be watching out for? I am already monitoring things like CPU, Memory, Network via SNMP. With this JMX connection the things that I find interesting are Catalina:type=GlobalRequestProcessor,name="ajp-bio-/a.b.c.d-XXXX" RequestsCount so far. Catalina:type=Manager,context=/myApp,host=localhost Active sessions and its maximum so far

Read the article

< Previous Page | 13 14 15 16 17 18 19 20 21 22 23 24 | Next Page >