paul clarke - Page 5 - Developer IT

With WebMatrix, How do I Connect to a MySQL Database on a Colleague's Machine?

- by Ash Clarke

I have scoured Google trying to discover how to do this, but essentially I want to connect to a colleague's MySQL database for working together on a Wordpress installation. I am having no luck and keep getting an error about the connection not being possible: Unable to connect to any of the specified MySQL hosts. MySql.Data.MySqlClient.MySqlException (0x80004005): Unable to connect to any of the specified MySQL hosts. at MySql.Data.MySqlClient.NativeDriver.Open() at MySql.Data.MySqlClient.Driver.Open() at MySql.Data.MySqlClient.Driver.Create(MySqlConnectionStringBuilder settings) at MySql.Data.MySqlClient.MySqlPool.GetPooledConnection() at MySql.Data.MySqlClient.MySqlPool.TryToGetDriver() at MySql.Data.MySqlClient.MySqlPool.GetConnection() at MySql.Data.MySqlClient.MySqlConnection.Open() at Microsoft.WebMatrix.DatabaseManager.MySqlDatabase.MySqlDatabaseProvider.TestConnection(String connectionString) at Microsoft.WebMatrix.DatabaseManager.IisDbManagerModuleService.TestConnection(DatabaseConnection databaseConnection, String configPathState) at Microsoft.WebMatrix.DatabaseManager.Client.ClientConnection.Test(ManagementConfigurationPath configPath) at Microsoft.WebMatrix.DatabaseManager.Client.DatabaseHierarchyInfo.EnsureLoaded() The connection details are copied from my colleague's connection string, with the exception of the server being modified to match the IP address of his machine. I'm not sure if there is a firewall port I have to open or a configuration file I have to modify, but I'm not having much luck so far. (There is a strong chance that, by default, web matrix / iis express doesn't set the mysql database it creates to accept remote connections. If anyone knows how to change this, that would be grand!) Anyone have any ideas?

Read the article

APC Smart UPS network shutdown issue

- by Rob Clarke

Here is a bit about our setup: We have 2x Smart-UPS RT 6000 XL units with network management cards We are running Powerchute from a network server Powerchute is connected to the management cards of both UPSs UPSs are set to do a graceful shutdown via Powerchute when the battery duration is under 20 minutes We also have a command file that runs with Powerchute Although our setup is redundant we do not have an equal load on each server due to APC switches for single power devices The problem is that as we do not have an equal load on each server the batteries drain at different rates. This means that the UPSs both get to the specified low battery duration at completely different times. The problem here is that UPS 1 may have run down to 5 minutes and is in desperate need of initiating a Powerchute shutdown - UPS 2 still has 25 minutes of runtime so no shutdown is initiated. Consequently UPS 1 goes down and takes all the servers with and then shuts down UPS 2 as well! What we need to happen are 1 of either 2 things: Powerchute initiates the shutdown as soon as either UPS reaches the 20 minutes low battery duration setting - and doesnt wait for both The UPS with the heavier load expends its entire battery but does not shutdown both UPSs and lets the load be switched across to the UPS that still has runtime remaining. That way when the UPS that still has runtime reaches its low battery duration it can proceed with the graceful shutdown via Powerchute. Hope that makes sense, any help is greatly appreciated!

Read the article

LTO 3 tape drive needing repaired

- by DO it all Paul

We have an IBM LTO 3 tape drive that needs repaired and with the £400 price tag i'm having to shop around for quotes. My question is has anyone actually repaired one before and how was in done? The first error LED was showing a 6, then i cleared the mangled tape only for it to start flashing alternate 'o' on the 7 segment display, simliar to a half 8, flashing top to bottom and it would just flash away like that coupled with a flashing amber light. I tried a reset holding the eject button for it to show an 'r' the go back to flashing again as before. I checked the IBM solutions for the codes but this flashing isn't documented at all. Would be great if anyone had any experience in this area. Thank you, Paul

Read the article

HTTP downloads slow - FTP of same file very fast - Windows 2003

- by Paul Hinett

I am having some issues with download speeds on my site via http, i am averaging around 70kbps downloading a file that is around 70mb. But if i connect to my server via FTP and download the same file on the same computer / connection i am averaging about 300+kbps. I know my server has alot of connections at any one time, probably around 400 connections. My server has a 1gbps connection to the internet so there is plenty of bandwidth available, as proven with the FTP. I have no throttling of any kind enabled in IIS. If interested there is a test file here you can download to check the speed: http://filesd.house-mixes.com/test.zip I am based in the UK and the server is in Washington, USA if that makes any difference. Paul

Read the article

Setting up routing for MS DirectAccess to a VMWare EsXi Host

- by Paul D'Ambra

I'm trying to set up DirectAccess on a virtual machine so I can demonstrate it's value and then if need be add a physical machine to host it. I'm hitting a problem because the Direct Access machine (DA01) needs to have 2 public addresses actually configured on the external adapter but there is a Zyxel Zywall USG300 between the VMware ESXi host and the outside world. I've summarised my setup in this diagram If I ping from the LAN to 212.x.y.89 I get a response but if I ping from the VM I get destination host unreachable. I used "route add 212.x.y.89 192.c.d.1" and get request timed out. At that point I see outbound traffic allowed on the Zyxel firewall but nothing coming back. I'm past my understanding of routing and VMWare so am not sure how to tie down where my problem lies (or even if this setup is possible). So any help massively appreciated. Paul

Read the article

APC Smart UPS network shutdown issue

- by Rob Clarke

Here is a bit about our setup: We have 2x Smart-UPS RT 6000 XL units with network management cards We are running Powerchute from a network server Powerchute is connected to the management cards of both UPSs UPSs are set to do a graceful shutdown via Powerchute when the battery duration is under 20 minutes We also have a command file that runs with Powerchute Although our setup is redundant we do not have an equal load on each server due to APC switches for single power devices The problem is that as we do not have an equal load on each server the batteries drain at different rates. This means that the UPSs both get to the specified low battery duration at completely different times. The problem here is that UPS 1 may have run down to 5 minutes and is in desperate need of initiating a Powerchute shutdown - UPS 2 still has 25 minutes of runtime so no shutdown is initiated. Consequently UPS 1 goes down and takes all the servers with and then shuts down UPS 2 as well! What we need to happen are 1 of either 2 things: Powerchute initiates the shutdown as soon as either UPS reaches the 20 minutes low battery duration setting - and doesnt wait for both The UPS with the heavier load expends its entire battery but does not shutdown both UPSs and lets the load be switched across to the UPS that still has runtime remaining. That way when the UPS that still has runtime reaches its low battery duration it can proceed with the graceful shutdown via Powerchute. Hope that makes sense, any help is greatly appreciated!

Read the article

KVM Guest not reachable from host

- by Paul

Hello, I'm running Ubuntu server 9.10, installed KVM etc. Created the bridge network following instructions on help.ubuntu.com/community/KVM/Networking Created a windows 2008 guest using virt-install command line (using virt-manager GUI from a remote Ubuntu desktop would not let me select the ISO location). I can however use a remote virt-manager to connect to the guest and complete the windows install. Within windows 2008 I changed the IP address but cannot ping from outside world. The bridge network appears fine - I'm not sure what else to look at! Here is the interfaces file: The loopback network interface auto lo iface lo inet loopback The primary network interface auto eth0 iface eth0 inet manual # auto br0 iface br0 inet static address 60.234.64.50 netmask 255.255.255.248 network 60.234.0.0 broadcast 60.234.0.255 gateway 60.234.64.49 bridge_ports eth0 bridge_stp off bridge_fd 0 bridge_maxwait 0 auto eth1 iface eth1 inet static address 192.168.12.2 netmask 255.255.255.0 broadcast 192.168.12.255 The ip of the windows server is 60.234.64.52 What else should I check? Regards Paul.

Read the article

Silverlight Version 4 latest build for Win7 64bit and WinXP 32bit

- by Paul

I have a requirement where a few people need the latest version of Silverlight 4 installed. I know the latest version is 5.xx... but apparently with some new software we're having installed we have to use version 4 After a bit of googling i can see that the latest version is... Build 4.1.10329.0 Released May 8, 2012 We have a mix of Win7 64-bit machines and WinXP 32-bit machines. Q: Is there a different version for each OS or the same one fits all. (This seems strangely hard to decipher by googling) Q: Does anyone know where i can download the latest version 4? Microsoft do not seem to offer it anymore unless i'm just not finding it. Q: Is there a separate browser version of it or will installing it also handle any browser needs (our new software will be browser based) Any pointers much appreciated. Paul

Read the article

How can I convert MP4 video files to DVD?

- by Martin Clarke

I've got some MP4 video files that I'd like to put onto DVD. What is the best software (preferably free) for accomplishing such a task? Thanks in advance!

Read the article

No HDMI audio - Windows 8 - ASUS H81M-PLUS

- by Paul Wright

I have an issue with HDMI audio on Windows 8 using an ASUS H81M-PLUS motherboard (without an external GFX card). There are many forum posts advising you to go into playback devices and setting HDMI to be default - I have done this. To eliminate what works and what doesn't work: I have not been able to get sound from my HDTV using HDMI. I have used this HDMI cable with my PS3, so this cable should be fine. I am able to use the HDMI cable in extended mode, so that I have two monitors (including the TV), just no audio. This HDMI cable goes straight from the motherboard to the TV. Below I have included 'Device manager', and 'Playback Devices' (Sound). Device Manager Playback Devices, showing disabled and disconnected devices I am at a loss. I have uninstalled all drivers, and then rebooted and made windows look for the correct ones, made sure the HDMI device was default. Thanks, Paul

Read the article

CentOS 5.5 x86_64 VPS - A lot of inbound traffic when idle?

- by Matt Clarke

I have a CentOS VPS from UKWSD and I'm getting inbound traffic that I cannot understand. The VPS was setup yesterday and I installed vnstat this morning around 10am, since then the server was basically idle and doing nothing from 12pm but it's showing activity inbound which is way over what it should be and i'd say the outbound is pretty much over to top too. Here is vnstat (snapshot taken at 10:30pm GMT) http://i.imgur.com/XnORb.jpg Here is the iptables http://pastebin.com/uGxX2Ucw The reason I'm concerned is.. 1) I have no idea why this is happening, and I like to know what's going on :D 2) I've calculated (briefly) that this pointless traffic would use around 15-20GB of bandwidth per month, and when your on a 150GB limit - it's quite an issue. I'm struggling to understand this and I thought I'd get some advice before asking my ISP (and risk looking completely stupid) Regards Matt

Read the article

Connecting my iPhone to iTunes causes my Acer laptop to crash

- by Paul Sheldrake

Hello I have an Acer Travelmate 8200 laptop and whenever I connect my Iphone to it, it crashes with the BSOD(Blue Screen Of Death). I have figured out that if I delete all the pictures in my phone I can get it to connect but that is not a ideal long term solution. I also read that it may be a conflict with the built in web-cam I have but I've upgraded the driver and I still get the crashing problem. Any suggestions would be appreciated! Thanks Paul! edit: Here is the BSOD message I get

Read the article

Why is MySQL table_cache full but never used

- by Jeremy Clarke

I have been using the tuning-primer.sh script to tune my my.cnf settings. I have most things working well but the part about TABLE CACHE makes no sense: TABLE CACHE Current table_cache value = 900 tables. You have a total of 0 tables You have 900 open tables. Current table_cache hit rate is 1% , while 100% of your table cache is in use. You should probably increase your table_cache When I do SHOW STATUS; I get the following table-related numbers: Open_tables = 900 Opened_tables = 0 It seems like something is going wrong. I have some extra memory I could use on increasing the table_cache size, but my sense is that the 900 tables already available aren't doing anything, and increasing it will just waste more energy. Why might this be happening? Are there other settings that could cause all my table_cache slots to be used even though there are no hits to them? I have 150 max connections and probably no more than 4 tables per join, FWIW. Here is the tuner script output for temp tables, which I've also been tuning: TEMP TABLES Current max_heap_table_size = 90 M Current tmp_table_size = 90 M Of 11032358 temp tables, 40% were created on disk Perhaps you should increase your tmp_table_size and/or max_heap_table_size to reduce the number of disk-based temporary tables. Note! BLOB and TEXT columns are not allow in memory tables. If you are using these columns raising these values might not impact your ratio of on disk temp tables.

Read the article

What is the easiest way to get XtraDB for MySQL running on CentOS 5

- by Jeremy Clarke

I'm having a lot of issues with a dedicated MySQL server and it seems like upgrading to the XtraDB version of InnoDB will probably have a positive effect, but I'm hesitant to get involved with it since I am not really a sysadmin and prefer to stick with things that start with "yum update". What is the easiest way to get XtraDB installed? Should I use the Percona server? MariaDB? OurDelta? Is there a way to avoid using custom RPMs and sticking to a repo instead? The current yum version of MySQL is 5.0.xx, whereas a lot of the alternate MySQL builds are based on 5.1.xx. How does this factor in? Do I need to figure out 5.1 on CentOS before working on getting XtraDB in? For bonus points: Do I need to seriously test XtraDB with my server before implementing it, or is it relatively safe to have the brief downtime for switching servers followed by putting the site back online with XtraDB?

Read the article

i accidentally deleted the recovery folder on a partition (win vista home)

- by paul

i accidentally deleted the recovery folder on the recovery partition (win vista home) i think it was some sort of scheduled maintenance of some program that i did not configure properly? oops... lol i called toshiba and they said i needed to buy a recovery program, which i didnt bother doing. I bought a legal copy of vista and would like to install the correct files and in a way that when my computer starts looking for files it will eventually find them or i can point to the partition. i'm pretty sure it's not a matter of copy and paste (is it?) thanks Paul

Read the article

In APC+PHP, how much RAM is too much? Is it okay to set apc.shm_size to many GB?

- by Jeremy Clarke

On our server we have a LOT of RAM for our traffic levels (16GB). The HTTP processes regularly eat up all CPU and need to be restarted without even getting close to using swap memory, so I'm looking for ways to spend RAM to ease the load on Apache (and/or help the seperate MySQL server which may be breaking Apache). I have many WordPress installs on the HTTPD instance so APC sometimes uses as much as 900MB of ram (according to the apc.php charts). Just in case I have apc.shm_size set to 1600MB which is more than it needs but not more than I can spare. This means there is usually lots of extra RAM available to APC but also very little turnover and fragmentation is never more than 1%. Is this dangerous? Should I be slimming down APC to less than 1GB just on principle? Should I be expecting some turnover within APC in the name of bringing it's overall footprint down? Having so much memory devoted to APC means that in top/htop every single httpd process shows ~1.9GB in the VIRT memory column. Obviously this is shared memory and not used per-process, but could it be hurting our server? NOTE: The problem with the server remains unclear but the effect is that about 60 times a day all 8 CPU's fill up to 100% and everything stops working until Monit sees that Apache is broken and restarts it (Monin also saves the MySQL server). I'm not sure if APC is even part of the problem but I'm trying to optimize everything just in case.

Read the article

Creating different margins on the first page of a word template

- by Paul

I have a letterhead template and I need the first page left margin to be larger than subsequent pages. I've seen the option of placing a text box or image box in the header to push the text but this ends up throwing off the tabs and bullet list indentation markers. I thought of setting up the first page using two columns and pushing the text to start on the second column but I can't seem to find a way to get the text to switch back to 1 column on the second page when it is created from text overflowing. Does anyone know how something like this is possible? Thanks in advance, Paul

Read the article

Disable OS X Portable Home Directories for specific hosts for all users, not just individuals?

- by Paul Nendick

Would it be possible to block any and all Portable Home Directory services for specific hosts? Something like MCX's "MobileAccountNeverAsk-" but for the whole workstation? We have a network with both portable and stationary machines. I'd like our users to be able to use all machines, going portable on the MacBook but not being bothering with syncing when logged into stationary iMacs or Mac Pros. The Open Directory servers are running Snow Leopard (for now) and all clients are running Lion. Thanks! Paul

Read the article

IIS 7 - allow http for part of site, https for rest?

- by Martin Clarke

In IIS 7, is there a way to set two urls on the same site to allow http and https, and the rest to be https only? - http://mysite/url1 or https://mysite/url1 is accepted and stays on that protocol. - http://mysite/url2 or https://mysite/url2 is accepted and stays on that protocol. - any other item, i.e. http://mysite/whatever redirects to https://mysite/whatever - https://mysite/whatever is accepted. Edited because first question wasn't clear enough.

Read the article

Incremental RPM package version "numbers" for x.y.z > x.y.z-beta (or alpha, rc, etc)

- by Jonathan Clarke

In order to publish RPM packages of several different versions of some software, I'm looking for a way to specify version "numbers" that are considered "upgrades", and include the differentiation of several pre-release versions, such as (in order): "2.4.0 alpha 1", "2.4.0 alpha 2", "2.4.0 alpha 3", "2.4.0 beta 1", "2.4.0 beta 2", "2.4.0 release candidate", "2.4.0 final", "2.4.1", "2.4.2", etc. The main issue I have with this is that RPM considers that "2.4.0" comes earlier than "2.4.0.alpha1", so I can't just add the suffix on the end of the final version number. I could try "2.4.0.alpha1", "2.4.0.beta1", "2.4.0.final", which would work, except for the "release candidate" that would be considered later than "2.4.0.final". An alternative I considered is using the "epoch:" section of the RPM version number (the epoch: prefix is considered before the main version number so that "1:2.4.0" is actually earlier than "2:1.0.0"). By putting a timestamp in the epoch: field, all the versions get ordered as expected by RPM, because their versions appear to increment in time. However, this fails when new releases are made on several major versions at the same time (for example, 2.3.2 is released after 2.4.0, but their version for RPM are "20121003:2.3.2" and "20120928:2.4.0" and systems on 2.3.2 can't get "upgraded" to 2.4.0, because rpm sees it as an older version). In this case, yum/zypper/etc refuse to upgrade to 2.4.0, thus my problem. What version numbers can I use to achieve this, and make sure that RPM always considers the version numbers to be in order. Or if not version numbers, other mechanism in RPM packaging? Note 1: I would like to keep the "Release:" field of the spec file for it's original purpose (several releases of packages, including packaging changes, for the same version of the packaged software). Note 2: This should work on current production versions of major distributions, such as RHEL/CentOS 6 and SLES 11. But I'm interested in solutions that don't, too, so long as they don't involve recompiling rpm! Note 3: On Debian-like systems, dpkg uses a special component in the version number which is the "~" (tilde) character. This causes dpkg to count the suffix as "negative" ordering, so that "2.4.0~anything" will come before "2.4.0". Then, normal ordering applies after the "~", so "2.4.0~alpha1" comes before "2.4.0~beta1" because "alpha" comes before "beta" alphabetically. I'm not necessarily looking to use the same scheme for RPM packages (I'm pretty sure no such equivalent exists), so this is just FYI.

Read the article

Fun with Aggregates

- by Paul White

There are interesting things to be learned from even the simplest queries. For example, imagine you are given the task of writing a query to list AdventureWorks product names where the product has at least one entry in the transaction history table, but fewer than ten. One possible query to meet that specification is: SELECT p.Name FROM Production.Product AS p JOIN Production.TransactionHistory AS th ON p.ProductID = th.ProductID GROUP BY p.ProductID, p.Name HAVING COUNT_BIG(*) < 10; That query correctly returns 23 rows (execution plan and data sample shown below): The execution plan looks a bit different from the written form of the query: the base tables are accessed in reverse order, and the aggregation is performed before the join. The general idea is to read all rows from the history table, compute the count of rows grouped by ProductID, merge join the results to the Product table on ProductID, and finally filter to only return rows where the count is less than ten. This ‘fully-optimized’ plan has an estimated cost of around 0.33 units. The reason for the quote marks there is that this plan is not quite as optimal as it could be – surely it would make sense to push the Filter down past the join too? To answer that, let’s look at some other ways to formulate this query. This being SQL, there are any number of ways to write logically-equivalent query specifications, so we’ll just look at a couple of interesting ones. The first query is an attempt to reverse-engineer T-SQL from the optimized query plan shown above. It joins the result of pre-aggregating the history table to the Product table before filtering: SELECT p.Name FROM ( SELECT th.ProductID, cnt = COUNT_BIG(*) FROM Production.TransactionHistory AS th GROUP BY th.ProductID ) AS q1 JOIN Production.Product AS p ON p.ProductID = q1.ProductID WHERE q1.cnt < 10; Perhaps a little surprisingly, we get a slightly different execution plan: The results are the same (23 rows) but this time the Filter is pushed below the join! The optimizer chooses nested loops for the join, because the cardinality estimate for rows passing the Filter is a bit low (estimate 1 versus 23 actual), though you can force a merge join with a hint and the Filter still appears below the join. In yet another variation, the < 10 predicate can be ‘manually pushed’ by specifying it in a HAVING clause in the “q1” sub-query instead of in the WHERE clause as written above. The reason this predicate can be pushed past the join in this query form, but not in the original formulation is simply an optimizer limitation – it does make efforts (primarily during the simplification phase) to encourage logically-equivalent query specifications to produce the same execution plan, but the implementation is not completely comprehensive. Moving on to a second example, the following query specification results from phrasing the requirement as “list the products where there exists fewer than ten correlated rows in the history table”: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID HAVING COUNT_BIG(*) < 10 ); Unfortunately, this query produces an incorrect result (86 rows): The problem is that it lists products with no history rows, though the reasons are interesting. The COUNT_BIG(*) in the EXISTS clause is a scalar aggregate (meaning there is no GROUP BY clause) and scalar aggregates always produce a value, even when the input is an empty set. In the case of the COUNT aggregate, the result of aggregating the empty set is zero (the other standard aggregates produce a NULL). To make the point really clear, let’s look at product 709, which happens to be one for which no history rows exist: -- Scalar aggregate SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = 709; -- Vector aggregate SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = 709 GROUP BY th.ProductID; The estimated execution plans for these two statements are almost identical: You might expect the Stream Aggregate to have a Group By for the second statement, but this is not the case. The query includes an equality comparison to a constant value (709), so all qualified rows are guaranteed to have the same value for ProductID and the Group By is optimized away. In fact there are some minor differences between the two plans (the first is auto-parameterized and qualifies for trivial plan, whereas the second is not auto-parameterized and requires cost-based optimization), but there is nothing to indicate that one is a scalar aggregate and the other is a vector aggregate. This is something I would like to see exposed in show plan so I suggested it on Connect. Anyway, the results of running the two queries show the difference at runtime: The scalar aggregate (no GROUP BY) returns a result of zero, whereas the vector aggregate (with a GROUP BY clause) returns nothing at all. Returning to our EXISTS query, we could ‘fix’ it by changing the HAVING clause to reject rows where the scalar aggregate returns zero: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID HAVING COUNT_BIG(*) BETWEEN 1 AND 9 ); The query now returns the correct 23 rows: Unfortunately, the execution plan is less efficient now – it has an estimated cost of 0.78 compared to 0.33 for the earlier plans. Let’s try adding a redundant GROUP BY instead of changing the HAVING clause: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY th.ProductID HAVING COUNT_BIG(*) < 10 ); Not only do we now get correct results (23 rows), this is the execution plan: I like to compare that plan to quantum physics: if you don’t find it shocking, you haven’t understood it properly :) The simple addition of a redundant GROUP BY has resulted in the EXISTS form of the query being transformed into exactly the same optimal plan we found earlier. What’s more, in SQL Server 2008 and later, we can replace the odd-looking GROUP BY with an explicit GROUP BY on the empty set: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () HAVING COUNT_BIG(*) < 10 ); I offer that as an alternative because some people find it more intuitive (and it perhaps has more geek value too). Whichever way you prefer, it’s rather satisfying to note that the result of the sub-query does not exist for a particular correlated value where a vector aggregate is used (the scalar COUNT aggregate always returns a value, even if zero, so it always ‘EXISTS’ regardless which ProductID is logically being evaluated). The following query forms also produce the optimal plan and correct results, so long as a vector aggregate is used (you can probably find more equivalent query forms): WHERE Clause SELECT p.Name FROM Production.Product AS p WHERE ( SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () ) < 10; APPLY SELECT p.Name FROM Production.Product AS p CROSS APPLY ( SELECT NULL FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () HAVING COUNT_BIG(*) < 10 ) AS ca (dummy); FROM Clause SELECT q1.Name FROM ( SELECT p.Name, cnt = ( SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () ) FROM Production.Product AS p ) AS q1 WHERE q1.cnt < 10; This last example uses SUM(1) instead of COUNT and does not require a vector aggregate…you should be able to work out why :) SELECT q.Name FROM ( SELECT p.Name, cnt = ( SELECT SUM(1) FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID ) FROM Production.Product AS p ) AS q WHERE q.cnt < 10; The semantics of SQL aggregates are rather odd in places. It definitely pays to get to know the rules, and to be careful to check whether your queries are using scalar or vector aggregates. As we have seen, query plans do not show in which ‘mode’ an aggregate is running and getting it wrong can cause poor performance, wrong results, or both. © 2012 Paul White Twitter: @SQL_Kiwi email: [email protected]

Read the article

I see no LOBs!

- by Paul White

Is it possible to see LOB (large object) logical reads from STATISTICS IO output on a table with no LOB columns? I was asked this question today by someone who had spent a good fraction of their afternoon trying to work out why this was occurring – even going so far as to re-run DBCC CHECKDB to see if any corruption had taken place. The table in question wasn’t particularly pretty – it had grown somewhat organically over time, with new columns being added every so often as the need arose. Nevertheless, it remained a simple structure with no LOB columns – no TEXT or IMAGE, no XML, no MAX types – nothing aside from ordinary INT, MONEY, VARCHAR, and DATETIME types. To add to the air of mystery, not every query that ran against the table would report LOB logical reads – just sometimes – but when it did, the query often took much longer to execute. Ok, enough of the pre-amble. I can’t reproduce the exact structure here, but the following script creates a table that will serve to demonstrate the effect: IF OBJECT_ID(N'dbo.Test', N'U') IS NOT NULL DROP TABLE dbo.Test GO CREATE TABLE dbo.Test ( row_id NUMERIC IDENTITY NOT NULL, col01 NVARCHAR(450) NOT NULL, col02 NVARCHAR(450) NOT NULL, col03 NVARCHAR(450) NOT NULL, col04 NVARCHAR(450) NOT NULL, col05 NVARCHAR(450) NOT NULL, col06 NVARCHAR(450) NOT NULL, col07 NVARCHAR(450) NOT NULL, col08 NVARCHAR(450) NOT NULL, col09 NVARCHAR(450) NOT NULL, col10 NVARCHAR(450) NOT NULL, CONSTRAINT [PK dbo.Test row_id] PRIMARY KEY CLUSTERED (row_id) ) ; The next script loads the ten variable-length character columns with one-character strings in the first row, two-character strings in the second row, and so on down to the 450th row: WITH Numbers AS ( -- Generates numbers 1 - 450 inclusive SELECT TOP (450) n = ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM master.sys.columns C1, master.sys.columns C2, master.sys.columns C3 ORDER BY n ASC ) INSERT dbo.Test WITH (TABLOCKX) SELECT REPLICATE(N'A', N.n), REPLICATE(N'B', N.n), REPLICATE(N'C', N.n), REPLICATE(N'D', N.n), REPLICATE(N'E', N.n), REPLICATE(N'F', N.n), REPLICATE(N'G', N.n), REPLICATE(N'H', N.n), REPLICATE(N'I', N.n), REPLICATE(N'J', N.n) FROM Numbers AS N ORDER BY N.n ASC ; Once those two scripts have run, the table contains 450 rows and 10 columns of data like this: Most of the time, when we query data from this table, we don’t see any LOB logical reads, for example: -- Find the maximum length of the data in -- column 5 for a range of rows SELECT result = MAX(DATALENGTH(T.col05)) FROM dbo.Test AS T WHERE row_id BETWEEN 50 AND 100 ; But with a different query… -- Read all the data in column 1 SELECT result = MAX(DATALENGTH(T.col01)) FROM dbo.Test AS T ; …suddenly we have 49 LOB logical reads, as well as the ‘normal’ logical reads we would expect. The Explanation If we had tried to create this table in SQL Server 2000, we would have received a warning message to say that future INSERT or UPDATE operations on the table might fail if the resulting row exceeded the in-row storage limit of 8060 bytes. If we needed to store more data than would fit in an 8060 byte row (including internal overhead) we had to use a LOB column – TEXT, NTEXT, or IMAGE. These special data types store the large data values in a separate structure, with just a small pointer left in the original row. Row Overflow SQL Server 2005 introduced a feature called row overflow, which allows one or more variable-length columns in a row to move to off-row storage if the data in a particular row would otherwise exceed 8060 bytes. You no longer receive a warning when creating (or altering) a table that might need more than 8060 bytes of in-row storage; if SQL Server finds that it can no longer fit a variable-length column in a particular row, it will silently move one or more of these columns off the row into a separate allocation unit. Only variable-length columns can be moved in this way (for example the (N)VARCHAR, VARBINARY, and SQL_VARIANT types). Fixed-length columns (like INTEGER and DATETIME for example) never move into ‘row overflow’ storage. The decision to move a column off-row is done on a row-by-row basis – so data in a particular column might be stored in-row for some table records, and off-row for others. In general, if SQL Server finds that it needs to move a column into row-overflow storage, it moves the largest variable-length column record for that row. Note that in the case of an UPDATE statement that results in the 8060 byte limit being exceeded, it might not be the column that grew that is moved! Sneaky LOBs Anyway, that’s all very interesting but I don’t want to get too carried away with the intricacies of row-overflow storage internals. The point is that it is now possible to define a table with non-LOB columns that will silently exceed the old row-size limit and result in ordinary variable-length columns being moved to off-row storage. Adding new columns to a table, expanding an existing column definition, or simply storing more data in a column than you used to – all these things can result in one or more variable-length columns being moved off the row. Note that row-overflow storage is logically quite different from old-style LOB and new-style MAX data type storage – individual variable-length columns are still limited to 8000 bytes each – you can just have more of them now. Having said that, the physical mechanisms involved are very similar to full LOB storage – a column moved to row-overflow leaves a 24-byte pointer record in the row, and the ‘separate storage’ I have been talking about is structured very similarly to both old-style LOBs and new-style MAX types. The disadvantages are also the same: when SQL Server needs a row-overflow column value it needs to follow the in-row pointer a navigate another chain of pages, just like retrieving a traditional LOB. And Finally… In the example script presented above, the rows with row_id values from 402 to 450 inclusive all exceed the total in-row storage limit of 8060 bytes. A SELECT that references a column in one of those rows that has moved to off-row storage will incur one or more lob logical reads as the storage engine locates the data. The results on your system might vary slightly depending on your settings, of course; but in my tests only column 1 in rows 402-450 moved off-row. You might like to play around with the script – updating columns, changing data type lengths, and so on – to see the effect on lob logical reads and which columns get moved when. You might even see row-overflow columns moving back in-row if they are updated to be smaller (hint: reduce the size of a column entry by at least 1000 bytes if you hope to see this). Be aware that SQL Server will not warn you when it moves ‘ordinary’ variable-length columns into overflow storage, and it can have dramatic effects on performance. It makes more sense than ever to choose column data types sensibly. If you make every column a VARCHAR(8000) or NVARCHAR(4000), and someone stores data that results in a row needing more than 8060 bytes, SQL Server might turn some of your column data into pseudo-LOBs – all without saying a word. Finally, some people make a distinction between ordinary LOBs (those that can hold up to 2GB of data) and the LOB-like structures created by row-overflow (where columns are still limited to 8000 bytes) by referring to row-overflow LOBs as SLOBs. I find that quite appealing, but the ‘S’ stands for ‘small’, which makes expanding the whole acronym a little daft-sounding…small large objects anyone? © Paul White 2011 email: [email protected] twitter: @SQL_Kiwi

Read the article

When is a Seek not a Seek?

- by Paul White

The following script creates a single-column clustered table containing the integers from 1 to 1,000 inclusive. IF OBJECT_ID(N'tempdb..#Test', N'U') IS NOT NULL DROP TABLE #Test ; GO CREATE TABLE #Test ( id INTEGER PRIMARY KEY CLUSTERED ); ; INSERT #Test (id) SELECT V.number FROM master.dbo.spt_values AS V WHERE V.[type] = N'P' AND V.number BETWEEN 1 AND 1000 ; Let’s say we need to find the rows with values from 100 to 170, excluding any values that divide exactly by 10. One way to write that query would be: SELECT T.id FROM #Test AS T WHERE T.id IN ( 101,102,103,104,105,106,107,108,109, 111,112,113,114,115,116,117,118,119, 121,122,123,124,125,126,127,128,129, 131,132,133,134,135,136,137,138,139, 141,142,143,144,145,146,147,148,149, 151,152,153,154,155,156,157,158,159, 161,162,163,164,165,166,167,168,169 ) ; That query produces a pretty efficient-looking query plan: Knowing that the source column is defined as an INTEGER, we could also express the query this way: SELECT T.id FROM #Test AS T WHERE T.id >= 101 AND T.id <= 169 AND T.id % 10 > 0 ; We get a similar-looking plan: If you look closely, you might notice that the line connecting the two icons is a little thinner than before. The first query is estimated to produce 61.9167 rows – very close to the 63 rows we know the query will return. The second query presents a tougher challenge for SQL Server because it doesn’t know how to predict the selectivity of the modulo expression (T.id % 10 > 0). Without that last line, the second query is estimated to produce 68.1667 rows – a slight overestimate. Adding the opaque modulo expression results in SQL Server guessing at the selectivity. As you may know, the selectivity guess for a greater-than operation is 30%, so the final estimate is 30% of 68.1667, which comes to 20.45 rows. The second difference is that the Clustered Index Seek is costed at 99% of the estimated total for the statement. For some reason, the final SELECT operator is assigned a small cost of 0.0000484 units; I have absolutely no idea why this is so, or what it models. Nevertheless, we can compare the total cost for both queries: the first one comes in at 0.0033501 units, and the second at 0.0034054. The important point is that the second query is costed very slightly higher than the first, even though it is expected to produce many fewer rows (20.45 versus 61.9167). If you run the two queries, they produce exactly the same results, and both complete so quickly that it is impossible to measure CPU usage for a single execution. We can, however, compare the I/O statistics for a single run by running the queries with STATISTICS IO ON: Table '#Test'. Scan count 63, logical reads 126, physical reads 0. Table '#Test'. Scan count 01, logical reads 002, physical reads 0. The query with the IN list uses 126 logical reads (and has a ‘scan count’ of 63), while the second query form completes with just 2 logical reads (and a ‘scan count’ of 1). It is no coincidence that 126 = 63 * 2, by the way. It is almost as if the first query is doing 63 seeks, compared to one for the second query. In fact, that is exactly what it is doing. There is no indication of this in the graphical plan, or the tool-tip that appears when you hover your mouse over the Clustered Index Seek icon. To see the 63 seek operations, you have click on the Seek icon and look in the Properties window (press F4, or right-click and choose from the menu): The Seek Predicates list shows a total of 63 seek operations – one for each of the values from the IN list contained in the first query. I have expanded the first seek node to show the details; it is seeking down the clustered index to find the entry with the value 101. Each of the other 62 nodes expands similarly, and the same information is contained (even more verbosely) in the XML form of the plan. Each of the 63 seek operations starts at the root of the clustered index B-tree and navigates down to the leaf page that contains the sought key value. Our table is just large enough to need a separate root page, so each seek incurs 2 logical reads (one for the root, and one for the leaf). We can see the index depth using the INDEXPROPERTY function, or by using the a DMV: SELECT S.index_type_desc, S.index_depth FROM sys.dm_db_index_physical_stats ( DB_ID(N'tempdb'), OBJECT_ID(N'tempdb..#Test', N'U'), 1, 1, DEFAULT ) AS S ; Let’s look now at the Properties window when the Clustered Index Seek from the second query is selected: There is just one seek operation, which starts at the root of the index and navigates the B-tree looking for the first key that matches the Start range condition (id >= 101). It then continues to read records at the leaf level of the index (following links between leaf-level pages if necessary) until it finds a row that does not meet the End range condition (id <= 169). Every row that meets the seek range condition is also tested against the Residual Predicate highlighted above (id % 10 > 0), and is only returned if it matches that as well. You will not be surprised that the single seek (with a range scan and residual predicate) is much more efficient than 63 singleton seeks. It is not 63 times more efficient (as the logical reads comparison would suggest), but it is around three times faster. Let’s run both query forms 10,000 times and measure the elapsed time: DECLARE @i INTEGER, @n INTEGER = 10000, @s DATETIME = GETDATE() ; SET NOCOUNT ON; SET STATISTICS XML OFF; ; WHILE @n > 0 BEGIN SELECT @i = T.id FROM #Test AS T WHERE T.id IN ( 101,102,103,104,105,106,107,108,109, 111,112,113,114,115,116,117,118,119, 121,122,123,124,125,126,127,128,129, 131,132,133,134,135,136,137,138,139, 141,142,143,144,145,146,147,148,149, 151,152,153,154,155,156,157,158,159, 161,162,163,164,165,166,167,168,169 ) ; SET @n -= 1; END ; PRINT DATEDIFF(MILLISECOND, @s, GETDATE()) ; GO DECLARE @i INTEGER, @n INTEGER = 10000, @s DATETIME = GETDATE() ; SET NOCOUNT ON ; WHILE @n > 0 BEGIN SELECT @i = T.id FROM #Test AS T WHERE T.id >= 101 AND T.id <= 169 AND T.id % 10 > 0 ; SET @n -= 1; END ; PRINT DATEDIFF(MILLISECOND, @s, GETDATE()) ; On my laptop, running SQL Server 2008 build 4272 (SP2 CU2), the IN form of the query takes around 830ms and the range query about 300ms. The main point of this post is not performance, however – it is meant as an introduction to the next few parts in this mini-series that will continue to explore scans and seeks in detail. When is a seek not a seek? When it is 63 seeks © Paul White 2011 email: [email protected] twitter: @SQL_kiwi

Read the article

Fun with Aggregates

- by Paul White

There are interesting things to be learned from even the simplest queries. For example, imagine you are given the task of writing a query to list AdventureWorks product names where the product has at least one entry in the transaction history table, but fewer than ten. One possible query to meet that specification is: SELECT p.Name FROM Production.Product AS p JOIN Production.TransactionHistory AS th ON p.ProductID = th.ProductID GROUP BY p.ProductID, p.Name HAVING COUNT_BIG(*) < 10; That query correctly returns 23 rows (execution plan and data sample shown below): The execution plan looks a bit different from the written form of the query: the base tables are accessed in reverse order, and the aggregation is performed before the join. The general idea is to read all rows from the history table, compute the count of rows grouped by ProductID, merge join the results to the Product table on ProductID, and finally filter to only return rows where the count is less than ten. This ‘fully-optimized’ plan has an estimated cost of around 0.33 units. The reason for the quote marks there is that this plan is not quite as optimal as it could be – surely it would make sense to push the Filter down past the join too? To answer that, let’s look at some other ways to formulate this query. This being SQL, there are any number of ways to write logically-equivalent query specifications, so we’ll just look at a couple of interesting ones. The first query is an attempt to reverse-engineer T-SQL from the optimized query plan shown above. It joins the result of pre-aggregating the history table to the Product table before filtering: SELECT p.Name FROM ( SELECT th.ProductID, cnt = COUNT_BIG(*) FROM Production.TransactionHistory AS th GROUP BY th.ProductID ) AS q1 JOIN Production.Product AS p ON p.ProductID = q1.ProductID WHERE q1.cnt < 10; Perhaps a little surprisingly, we get a slightly different execution plan: The results are the same (23 rows) but this time the Filter is pushed below the join! The optimizer chooses nested loops for the join, because the cardinality estimate for rows passing the Filter is a bit low (estimate 1 versus 23 actual), though you can force a merge join with a hint and the Filter still appears below the join. In yet another variation, the < 10 predicate can be ‘manually pushed’ by specifying it in a HAVING clause in the “q1” sub-query instead of in the WHERE clause as written above. The reason this predicate can be pushed past the join in this query form, but not in the original formulation is simply an optimizer limitation – it does make efforts (primarily during the simplification phase) to encourage logically-equivalent query specifications to produce the same execution plan, but the implementation is not completely comprehensive. Moving on to a second example, the following query specification results from phrasing the requirement as “list the products where there exists fewer than ten correlated rows in the history table”: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID HAVING COUNT_BIG(*) < 10 ); Unfortunately, this query produces an incorrect result (86 rows): The problem is that it lists products with no history rows, though the reasons are interesting. The COUNT_BIG(*) in the EXISTS clause is a scalar aggregate (meaning there is no GROUP BY clause) and scalar aggregates always produce a value, even when the input is an empty set. In the case of the COUNT aggregate, the result of aggregating the empty set is zero (the other standard aggregates produce a NULL). To make the point really clear, let’s look at product 709, which happens to be one for which no history rows exist: -- Scalar aggregate SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = 709; -- Vector aggregate SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = 709 GROUP BY th.ProductID; The estimated execution plans for these two statements are almost identical: You might expect the Stream Aggregate to have a Group By for the second statement, but this is not the case. The query includes an equality comparison to a constant value (709), so all qualified rows are guaranteed to have the same value for ProductID and the Group By is optimized away. In fact there are some minor differences between the two plans (the first is auto-parameterized and qualifies for trivial plan, whereas the second is not auto-parameterized and requires cost-based optimization), but there is nothing to indicate that one is a scalar aggregate and the other is a vector aggregate. This is something I would like to see exposed in show plan so I suggested it on Connect. Anyway, the results of running the two queries show the difference at runtime: The scalar aggregate (no GROUP BY) returns a result of zero, whereas the vector aggregate (with a GROUP BY clause) returns nothing at all. Returning to our EXISTS query, we could ‘fix’ it by changing the HAVING clause to reject rows where the scalar aggregate returns zero: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID HAVING COUNT_BIG(*) BETWEEN 1 AND 9 ); The query now returns the correct 23 rows: Unfortunately, the execution plan is less efficient now – it has an estimated cost of 0.78 compared to 0.33 for the earlier plans. Let’s try adding a redundant GROUP BY instead of changing the HAVING clause: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY th.ProductID HAVING COUNT_BIG(*) < 10 ); Not only do we now get correct results (23 rows), this is the execution plan: I like to compare that plan to quantum physics: if you don’t find it shocking, you haven’t understood it properly :) The simple addition of a redundant GROUP BY has resulted in the EXISTS form of the query being transformed into exactly the same optimal plan we found earlier. What’s more, in SQL Server 2008 and later, we can replace the odd-looking GROUP BY with an explicit GROUP BY on the empty set: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () HAVING COUNT_BIG(*) < 10 ); I offer that as an alternative because some people find it more intuitive (and it perhaps has more geek value too). Whichever way you prefer, it’s rather satisfying to note that the result of the sub-query does not exist for a particular correlated value where a vector aggregate is used (the scalar COUNT aggregate always returns a value, even if zero, so it always ‘EXISTS’ regardless which ProductID is logically being evaluated). The following query forms also produce the optimal plan and correct results, so long as a vector aggregate is used (you can probably find more equivalent query forms): WHERE Clause SELECT p.Name FROM Production.Product AS p WHERE ( SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () ) < 10; APPLY SELECT p.Name FROM Production.Product AS p CROSS APPLY ( SELECT NULL FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () HAVING COUNT_BIG(*) < 10 ) AS ca (dummy); FROM Clause SELECT q1.Name FROM ( SELECT p.Name, cnt = ( SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () ) FROM Production.Product AS p ) AS q1 WHERE q1.cnt < 10; This last example uses SUM(1) instead of COUNT and does not require a vector aggregate…you should be able to work out why :) SELECT q.Name FROM ( SELECT p.Name, cnt = ( SELECT SUM(1) FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID ) FROM Production.Product AS p ) AS q WHERE q.cnt < 10; The semantics of SQL aggregates are rather odd in places. It definitely pays to get to know the rules, and to be careful to check whether your queries are using scalar or vector aggregates. As we have seen, query plans do not show in which ‘mode’ an aggregate is running and getting it wrong can cause poor performance, wrong results, or both. © 2012 Paul White Twitter: @SQL_Kiwi email: [email protected]

Read the article

Heaps of Trouble?

- by Paul White NZ

If you’re not already a regular reader of Brad Schulz’s blog, you’re missing out on some great material. In his latest entry, he is tasked with optimizing a query run against tables that have no indexes at all. The problem is, predictably, that performance is not very good. The catch is that we are not allowed to create any indexes (or even new statistics) as part of our optimization efforts. In this post, I’m going to look at the problem from a slightly different angle, and present an alternative solution to the one Brad found. Inevitably, there’s going to be some overlap between our entries, and while you don’t necessarily need to read Brad’s post before this one, I do strongly recommend that you read it at some stage; he covers some important points that I won’t cover again here. The Example We’ll use data from the AdventureWorks database, copied to temporary unindexed tables. A script to create these structures is shown below: CREATE TABLE #Custs ( CustomerID INTEGER NOT NULL, TerritoryID INTEGER NULL, CustomerType NCHAR(1) COLLATE SQL_Latin1_General_CP1_CI_AI NOT NULL, ); GO CREATE TABLE #Prods ( ProductMainID INTEGER NOT NULL, ProductSubID INTEGER NOT NULL, ProductSubSubID INTEGER NOT NULL, Name NVARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AI NOT NULL, ); GO CREATE TABLE #OrdHeader ( SalesOrderID INTEGER NOT NULL, OrderDate DATETIME NOT NULL, SalesOrderNumber NVARCHAR(25) COLLATE SQL_Latin1_General_CP1_CI_AI NOT NULL, CustomerID INTEGER NOT NULL, ); GO CREATE TABLE #OrdDetail ( SalesOrderID INTEGER NOT NULL, OrderQty SMALLINT NOT NULL, LineTotal NUMERIC(38,6) NOT NULL, ProductMainID INTEGER NOT NULL, ProductSubID INTEGER NOT NULL, ProductSubSubID INTEGER NOT NULL, ); GO INSERT #Custs ( CustomerID, TerritoryID, CustomerType ) SELECT C.CustomerID, C.TerritoryID, C.CustomerType FROM AdventureWorks.Sales.Customer C WITH (TABLOCK); GO INSERT #Prods ( ProductMainID, ProductSubID, ProductSubSubID, Name ) SELECT P.ProductID, P.ProductID, P.ProductID, P.Name FROM AdventureWorks.Production.Product P WITH (TABLOCK); GO INSERT #OrdHeader ( SalesOrderID, OrderDate, SalesOrderNumber, CustomerID ) SELECT H.SalesOrderID, H.OrderDate, H.SalesOrderNumber, H.CustomerID FROM AdventureWorks.Sales.SalesOrderHeader H WITH (TABLOCK); GO INSERT #OrdDetail ( SalesOrderID, OrderQty, LineTotal, ProductMainID, ProductSubID, ProductSubSubID ) SELECT D.SalesOrderID, D.OrderQty, D.LineTotal, D.ProductID, D.ProductID, D.ProductID FROM AdventureWorks.Sales.SalesOrderDetail D WITH (TABLOCK); The query itself is a simple join of the four tables: SELECT P.ProductMainID AS PID, P.Name, D.OrderQty, H.SalesOrderNumber, H.OrderDate, C.TerritoryID FROM #Prods P JOIN #OrdDetail D ON P.ProductMainID = D.ProductMainID AND P.ProductSubID = D.ProductSubID AND P.ProductSubSubID = D.ProductSubSubID JOIN #OrdHeader H ON D.SalesOrderID = H.SalesOrderID JOIN #Custs C ON H.CustomerID = C.CustomerID ORDER BY P.ProductMainID ASC OPTION (RECOMPILE, MAXDOP 1); Remember that these tables have no indexes at all, and only the single-column sampled statistics SQL Server automatically creates (assuming default settings). The estimated query plan produced for the test query looks like this (click to enlarge): The Problem The problem here is one of cardinality estimation – the number of rows SQL Server expects to find at each step of the plan. The lack of indexes and useful statistical information means that SQL Server does not have the information it needs to make a good estimate. Every join in the plan shown above estimates that it will produce just a single row as output. Brad covers the factors that lead to the low estimates in his post. In reality, the join between the #Prods and #OrdDetail tables will produce 121,317 rows. It should not surprise you that this has rather dire consequences for the remainder of the query plan. In particular, it makes a nonsense of the optimizer’s decision to use Nested Loops to join to the two remaining tables. Instead of scanning the #OrdHeader and #Custs tables once (as it expected), it has to perform 121,317 full scans of each. The query takes somewhere in the region of twenty minutes to run to completion on my development machine. A Solution At this point, you may be thinking the same thing I was: if we really are stuck with no indexes, the best we can do is to use hash joins everywhere. We can force the exclusive use of hash joins in several ways, the two most common being join and query hints. A join hint means writing the query using the INNER HASH JOIN syntax; using a query hint involves adding OPTION (HASH JOIN) at the bottom of the query. The difference is that using join hints also forces the order of the join, whereas the query hint gives the optimizer freedom to reorder the joins at its discretion. Adding the OPTION (HASH JOIN) hint results in this estimated plan: That produces the correct output in around seven seconds, which is quite an improvement! As a purely practical matter, and given the rigid rules of the environment we find ourselves in, we might leave things there. (We can improve the hashing solution a bit – I’ll come back to that later on). Faster Nested Loops It might surprise you to hear that we can beat the performance of the hash join solution shown above using nested loops joins exclusively, and without breaking the rules we have been set. The key to this part is to realize that a condition like (A = B) can be expressed as (A <= B) AND (A >= B). Armed with this tremendous new insight, we can rewrite the join predicates like so: SELECT P.ProductMainID AS PID, P.Name, D.OrderQty, H.SalesOrderNumber, H.OrderDate, C.TerritoryID FROM #OrdDetail D JOIN #OrdHeader H ON D.SalesOrderID >= H.SalesOrderID AND D.SalesOrderID <= H.SalesOrderID JOIN #Custs C ON H.CustomerID >= C.CustomerID AND H.CustomerID <= C.CustomerID JOIN #Prods P ON P.ProductMainID >= D.ProductMainID AND P.ProductMainID <= D.ProductMainID AND P.ProductSubID = D.ProductSubID AND P.ProductSubSubID = D.ProductSubSubID ORDER BY D.ProductMainID OPTION (RECOMPILE, LOOP JOIN, MAXDOP 1, FORCE ORDER); I’ve also added LOOP JOIN and FORCE ORDER query hints to ensure that only nested loops joins are used, and that the tables are joined in the order they appear. The new estimated execution plan is: This new query runs in under 2 seconds. Why Is It Faster? The main reason for the improvement is the appearance of the eager Index Spools, which are also known as index-on-the-fly spools. If you read my Inside The Optimiser series you might be interested to know that the rule responsible is called JoinToIndexOnTheFly. An eager index spool consumes all rows from the table it sits above, and builds a index suitable for the join to seek on. Taking the index spool above the #Custs table as an example, it reads all the CustomerID and TerritoryID values with a single scan of the table, and builds an index keyed on CustomerID. The term ‘eager’ means that the spool consumes all of its input rows when it starts up. The index is built in a work table in tempdb, has no associated statistics, and only exists until the query finishes executing. The result is that each unindexed table is only scanned once, and just for the columns necessary to build the temporary index. From that point on, every execution of the inner side of the join is answered by a seek on the temporary index – not the base table. A second optimization is that the sort on ProductMainID (required by the ORDER BY clause) is performed early, on just the rows coming from the #OrdDetail table. The optimizer has a good estimate for the number of rows it needs to sort at that stage – it is just the cardinality of the table itself. The accuracy of the estimate there is important because it helps determine the memory grant given to the sort operation. Nested loops join preserves the order of rows on its outer input, so sorting early is safe. (Hash joins do not preserve order in this way, of course). The extra lazy spool on the #Prods branch is a further optimization that avoids executing the seek on the temporary index if the value being joined (the ‘outer reference’) hasn’t changed from the last row received on the outer input. It takes advantage of the fact that rows are still sorted on ProductMainID, so if duplicates exist, they will arrive at the join operator one after the other. The optimizer is quite conservative about introducing index spools into a plan, because creating and dropping a temporary index is a relatively expensive operation. It’s presence in a plan is often an indication that a useful index is missing. I want to stress that I rewrote the query in this way primarily as an educational exercise – I can’t imagine having to do something so horrible to a production system. Improving the Hash Join I promised I would return to the solution that uses hash joins. You might be puzzled that SQL Server can create three new indexes (and perform all those nested loops iterations) faster than it can perform three hash joins. The answer, again, is down to the poor information available to the optimizer. Let’s look at the hash join plan again: Two of the hash joins have single-row estimates on their build inputs. SQL Server fixes the amount of memory available for the hash table based on this cardinality estimate, so at run time the hash join very quickly runs out of memory. This results in the join spilling hash buckets to disk, and any rows from the probe input that hash to the spilled buckets also get written to disk. The join process then continues, and may again run out of memory. This is a recursive process, which may eventually result in SQL Server resorting to a bailout join algorithm, which is guaranteed to complete eventually, but may be very slow. The data sizes in the example tables are not large enough to force a hash bailout, but it does result in multiple levels of hash recursion. You can see this for yourself by tracing the Hash Warning event using the Profiler tool. The final sort in the plan also suffers from a similar problem: it receives very little memory and has to perform multiple sort passes, saving intermediate runs to disk (the Sort Warnings Profiler event can be used to confirm this). Notice also that because hash joins don’t preserve sort order, the sort cannot be pushed down the plan toward the #OrdDetail table, as in the nested loops plan. Ok, so now we understand the problems, what can we do to fix it? We can address the hash spilling by forcing a different order for the joins: SELECT P.ProductMainID AS PID, P.Name, D.OrderQty, H.SalesOrderNumber, H.OrderDate, C.TerritoryID FROM #Prods P JOIN #Custs C JOIN #OrdHeader H ON H.CustomerID = C.CustomerID JOIN #OrdDetail D ON D.SalesOrderID = H.SalesOrderID ON P.ProductMainID = D.ProductMainID AND P.ProductSubID = D.ProductSubID AND P.ProductSubSubID = D.ProductSubSubID ORDER BY D.ProductMainID OPTION (MAXDOP 1, HASH JOIN, FORCE ORDER); With this plan, each of the inputs to the hash joins has a good estimate, and no hash recursion occurs. The final sort still suffers from the one-row estimate problem, and we get a single-pass sort warning as it writes rows to disk. Even so, the query runs to completion in three or four seconds. That’s around half the time of the previous hashing solution, but still not as fast as the nested loops trickery. Final Thoughts SQL Server’s optimizer makes cost-based decisions, so it is vital to provide it with accurate information. We can’t really blame the performance problems highlighted here on anything other than the decision to use completely unindexed tables, and not to allow the creation of additional statistics. I should probably stress that the nested loops solution shown above is not one I would normally contemplate in the real world. It’s there primarily for its educational and entertainment value. I might perhaps use it to demonstrate to the sceptical that SQL Server itself is crying out for an index. Be sure to read Brad’s original post for more details. My grateful thanks to him for granting permission to reuse some of his material. Paul White Email: [email protected] Twitter: @PaulWhiteNZ

Search Results

Search found 1978 results on 80 pages for 'paul clarke'.

Page 5/80 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >

- by Ash Clarke

- by Rob Clarke

- by DO it all Paul

- by Paul Hinett

- by Paul D'Ambra

- by Rob Clarke

- by Paul

- by Paul

- by Martin Clarke

- by Paul Wright

- by Matt Clarke

- by Paul Sheldrake

- by Jeremy Clarke

- by Jeremy Clarke

- by paul

- by Jeremy Clarke

- by Paul

- by Paul Nendick

- by Martin Clarke

- by Jonathan Clarke

- by Paul White

- by Paul White

- by Paul White

- by Paul White

- by Paul White NZ

< Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >