Our SQL 2005 server began timing out and finally stopped responding on Oct 26th. The application logs showed a ton of 17883 events leading up to a reboot. After the reboot everything was fine but we were still scratching our heads. Fast forward 6 days... it happened again. Then again 2 days later. The last night. Today it has happened three times to far. The timeline is fairly predictable when it happens:
Trans log backups.
Login failure for "user2".
Minidump
Another minidump for the scheduler
Repeated 17883 events.
Server fails little by little until it won't accept any requests.
Reboot is all that gets us going again (a band-aid)
Interesting, though, is that the server box itself doesn't seem to have any problems. CPU usage is normal. Network connectivity is fine. We can remote in and look at logs. Management studio does eventually bog down, though.
Today, for the first time, we tried stopping services instead of a reboot. All services stopped on their own except for the SQL Server service. We finally did an "end task" on that one and were able to bring everything back up. It worked fine for about 30 minutes until we started seeing timeouts and 17883's again. This time, probably because we didn't reboot all the way, we saw a bunch of 844 events mixed in with the 17883's.
Our entire tech team here is scratching heads... some ideas we're kicking around:
MS Cumulative Update hit around the same time as when we first had a problem. Since then, we've rolled it back. Maybe it didn't rollback all the way.
The situation looks and feels like an unhandled "stack overflow" (no relation) in that it starts small and compounds over time. Problem with this is that there isn't significant CPU usage. At any rate, we're not ruling SQL 2005 bug out at all.
Maybe we added one too many import processes and have reached our limit on this box. (hard to believe).
Looking at SQLDUMP0151.log at the time of one of the crashes. There are some "login failures" and then there are two stack dumps. 1st a normal stack dump, 2nd for a scheduler dump. Here's a snippet: (sorry for the lack of line breaks)
2009-11-10 11:59:14.95 spid63 Using 'xpsqlbot.dll' version '2005.90.3042' to execute extended stored procedure 'xp_qv'. This is an informational message only; no user action is required.
2009-11-10 11:59:15.09 spid63 Using 'xplog70.dll' version '2005.90.3042' to execute extended stored procedure 'xp_msver'. This is an informational message only; no user action is required.
2009-11-10 12:02:33.24 Logon Error: 18456, Severity: 14, State: 16.
2009-11-10 12:02:33.24 Logon Login failed for user 'standard_user2'. [CLIENT: 50.36.172.101]
2009-11-10 12:08:21.12 Logon Error: 18456, Severity: 14, State: 16.
2009-11-10 12:08:21.12 Logon Login failed for user 'standard_user2'. [CLIENT: 50.36.172.101]
2009-11-10 12:13:49.38 Logon Error: 18456, Severity: 14, State: 16.
2009-11-10 12:13:49.38 Logon Login failed for user 'standard_user2'. [CLIENT: 50.36.172.101]
2009-11-10 12:15:16.88 Logon Error: 18456, Severity: 14, State: 16.
2009-11-10 12:15:16.88 Logon Login failed for user 'standard_user2'. [CLIENT: 50.36.172.101]
2009-11-10 12:18:24.41 Logon Error: 18456, Severity: 14, State: 16.
2009-11-10 12:18:24.41 Logon Login failed for user 'standard_user2'. [CLIENT: 50.36.172.101]
2009-11-10 12:18:38.88 spid111 Using 'dbghelp.dll' version '4.0.5'
2009-11-10 12:18:39.02 spid111 *Stack Dump being sent to C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0149.txt
2009-11-10 12:18:39.02 spid111 SqlDumpExceptionHandler: Process 111 generated fatal exception c0000005 EXCEPTION_ACCESS_VIOLATION. SQL Server is terminating this process.
2009-11-10 12:18:39.02 spid111 * *****************************************************************************
2009-11-10 12:18:39.02 spid111 *
2009-11-10 12:18:39.02 spid111 * BEGIN STACK DUMP:
2009-11-10 12:18:39.02 spid111 * 11/10/09 12:18:39 spid 111
2009-11-10 12:18:39.02 spid111 *
2009-11-10 12:18:39.02 spid111 *
2009-11-10 12:18:39.02 spid111 * Exception Address = 0159D56F Module(sqlservr+0059D56F)
2009-11-10 12:18:39.02 spid111 * Exception Code = c0000005 EXCEPTION_ACCESS_VIOLATION
2009-11-10 12:18:39.02 spid111 * Access Violation occurred writing address 00000000
2009-11-10 12:18:39.02 spid111 * Input Buffer 138 bytes -
2009-11-10 12:18:39.02 spid111 * " N R S C _ P T A 22 00 4e 00 52 00 53 00 43 00 5f 00 50 00 54 00 41 00
2009-11-10 12:18:39.02 spid111 * C _ Q A . d b o . 43 00 5f 00 51 00 41 00 2e 00 64 00 62 00 6f 00 2e 00
2009-11-10 12:18:39.02 spid111 * U s p S e l N e x 55 00 73 00 70 00 53 00 65 00 6c 00 4e 00 65 00 78 00
2009-11-10 12:18:39.02 spid111 * t A c c o u n t 74 00 41 00 63 00 63 00 6f 00 75 00 6e 00 74 00 00 00
2009-11-10 12:18:39.02 spid111 * @ i n t F o r m I 0a 40 00 69 00 6e 00 74 00 46 00 6f 00 72 00 6d 00 49
2009-11-10 12:18:39.02 spid111 * D & 8 @ t x 00 44 00 00 26 04 04 38 00 00 00 09 40 00 74 00 78 00
2009-11-10 12:18:39.02 spid111 * t A l i a s § 74 00 41 00 6c 00 69 00 61 00 73 00 00 a7 0f 00 09 04
2009-11-10 12:18:39.02 spid111 * Ð GQE9732 d0 00 00 07 00 47 51 45 39 37 33 32
2009-11-10 12:18:39.02 spid111 *
2009-11-10 12:18:39.02 spid111 *
2009-11-10 12:18:39.02 spid111 * MODULE BASE END SIZE
2009-11-10 12:18:39.02 spid111 * sqlservr 01000000 02C09FFF 01c0a000
2009-11-10 12:18:39.02 spid111 * ntdll 7C800000 7C8C1FFF 000c2000
2009-11-10 12:18:39.02 spid111 * kernel32 77E40000 77F41FFF 00102000