Friday, March 30, 2012

Question Title: 2000 SQL Cluster Failure (Active/Passive)

(sorry to post this here, but it looks like there is no activity on the
cluster newsgroup)
We are having an issue: about every 3 months or so, our SQL cluster
(active/passive) fails & goes completely offline. The only recourse is to
power the boxes off & on to restore connectivity.
By the looks of it, the primary SQL node fails & dumps; the passive node
senses the failure, comes online, brings up the resources & starts SQL, but
shortly after that the SQL Service fails & the whole cluster goes down & is
unreachable (via TCP or desktop). We have to cycle the power.
I appreciate any advice or insight you can give me on this situation.
Details are below.
Have a good weekend!
James
Details:
SQL Server Enterprise (2000)
Build: 8.00.760 (SP3)
Windows Enterprise Server (2003)
Build: 5.2(3790)
Basic Timeline & Errors:
SQLN01:
No Events written to Application or System Windows Event Logs
No errors in the Cluster Log. Just INFO logged
SQL Server Error Log: Error: 1203, Severity: 20, State 1
SQL Server Error Log: Process ID 58 attempting to unlock unowned resource
RID: 8:1:339:43
Those SQL errors report numerous times & then:
SQL Server Error Log: SQL Server Assertion: File: <lckmgr.cpp>, line-4792,
Failed Assertion = 'lockFound ==TRUE'
SQL Server Error Log: Stack Signature for the dump is 0xFEDF6C17
SQL Server Error Log: Using 'dbghelp.dll' version '4.0.5'*Dump thread...
SQL Server Error Log: Login failed for user 'sa'
That repeats about 30 times & that's all for the logs...
SQLN02:
Numerous System Events record lost communication with the cluster & bringing
SQLN02 into active mode. (Events 1123, 1209 & 1200)
System & Application Events show start of SQL Services.
System Event: 1069 (Failover Mgr) Cluster resource "SQL Server" in Resource
Group XYZ failed
System Event: 7035 The SQLSERVERAGENT service successfully sent a stop control
System Event: 7036 The SQLSERVERAGENT service successfully stopped
System Event: Multiple Events record successful startup of SQL & Cluster
Service & then nothing in system events until "previous shutdown was
unexpected".
Application Event: 17052 [sqsrvres] CheckServiceAlive: Service is dead
Application Event: 17052 [sqsrvres] OnlineThread: service stopped while
waiting for QP.
Application Event: 102 SQLServerAgent service successfully stopped.
Application Event: Multiple Events showing the restart of SQL & then
nothing until the power is cycled.
From the Cluster logs on SQLN02 (only WARN or ERR messages):
0000090c.00000974::2005/05/19-20:16:02.815 WARN [NM] Communication was lost
with interface 2115bee3-ada3-4b23-94ab-67de328d0969 (node: SQLN01, network:
PUBLIC)
0000090c.00000a90::2005/05/19-20:16:02.815 WARN [NM] Updating local
connectivity info for network ac5dcdf2-4ec6-431c-8d0e-75a9f388f945.
00000bb8.00000d88::2005/05/19-20:16:28.895 WARN Physical Disk <Disk Q:>:
[DiskArb] Assume ownership of the device.
0000090c.00000a90::2005/05/19-20:16:28.911 WARN [NM] Leadership changed.
Cancelling connectivity report for network
ac5dcdf2-4ec6-431c-8d0e-75a9f388f945.
00000bb8.0000152c::2005/05/19-20:16:33.645 WARN Network Name <Cluster Name>:
Unable to read CreatingDC parameter, error=2
00000bb8.000004c0::2005/05/19-20:16:33.677 WARN Network Name <SQL Network
Name(VRSQL)>: Unable to read ResourceData parameter, error=2
00000bb8.000004c0::2005/05/19-20:16:33.770 WARN [ClNet] Tcpip is not bound
to adapter 0470CF36-FDC3-446D-8738-756DE859CB7A. (comment -> disabled network
adapter)
00000bb8.00000940::2005/05/19-20:16:38.286 WARN Physical Disk <Disk L:>:
[DiskArb] Assume ownership of the device.
00000bb8.0000165c::2005/05/19-20:17:19.759 ERR SQL Server <SQL Server>:
[sqsrvres] OnlineThread: service stopped while waiting for QP.
00000bb8.0000165c::2005/05/19-20:17:19.759 ERR SQL Server <SQL Server>:
[sqsrvres] OnlineThread: Error 1 bringing resource online.
00000bb8.00000bd4::2005/05/19-20:17:23.931 ERR SQL Server <SQL Server>:
[sqsrvres] CheckServiceAlive: Service is dead
0000090c.000009c4::2005/05/19-20:17:23.946 WARN [FM]
FmpHandleResourceTransition: Resource Name =
eea79949-1b03-42e8-a690-9dc987e72063 [SQL Server] old state=2 new state=4
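For anyone sifting through a similar cluster.log, here is a minimal Python sketch of the "only WARN or ERR" filtering used for the excerpt above. It assumes each entry starts with `process.thread::YYYY/MM/DD-HH:MM:SS.mmm LEVEL` exactly as in the lines quoted here, and that any line not starting that way is a wrapped continuation of the previous entry. The names `LogEntry` and `parse_cluster_log` are made up for illustration and are not part of any cluster tooling.

```python
import re
from datetime import datetime
from typing import Iterable, List, NamedTuple


class LogEntry(NamedTuple):
    process_id: str
    thread_id: str
    timestamp: datetime
    level: str
    message: str


# Assumed entry layout, taken from the excerpt in this post:
# 0000090c.00000974::2005/05/19-20:16:02.815 WARN [NM] ...
_ENTRY_START = re.compile(
    r"^([0-9a-fA-F]{8})\.([0-9a-fA-F]{8})::"
    r"(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(INFO|WARN|ERR)\s+(.*)$"
)


def parse_cluster_log(lines: Iterable[str]) -> List[LogEntry]:
    """Parse log lines, folding wrapped continuation lines into their entry."""
    entries: List[LogEntry] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = _ENTRY_START.match(line)
        if match:
            pid, tid, stamp, level, message = match.groups()
            when = datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S.%f")
            entries.append(LogEntry(pid, tid, when, level, message))
        elif entries:
            # Not a new entry: treat it as a wrapped continuation line.
            last = entries[-1]
            entries[-1] = last._replace(message=last.message + " " + line)
    return entries


SAMPLE = """\
0000090c.00000974::2005/05/19-20:16:02.815 WARN [NM] Communication was lost
with interface 2115bee3-ada3-4b23-94ab-67de328d0969 (node: SQLN01, network:
PUBLIC)
00000bb8.0000165c::2005/05/19-20:17:19.759 ERR SQL Server <SQL Server>:
[sqsrvres] OnlineThread: service stopped while waiting for QP.
00000bb8.00000bd4::2005/05/19-20:17:23.931 ERR SQL Server <SQL Server>:
[sqsrvres] CheckServiceAlive: Service is dead
""".splitlines()

entries = parse_cluster_log(SAMPLE)
problems = [e for e in entries if e.level in ("WARN", "ERR")]
elapsed = (entries[-1].timestamp - entries[0].timestamp).total_seconds()

for entry in problems:
    print(entry.timestamp.isoformat(), entry.level, entry.message)
print(f"Seconds from lost communication to dead service: {elapsed:.3f}")
```

On the three sample entries, the gap between the first lost-communication warning and the final "Service is dead" error works out to about 81 seconds, which matches the timestamps in the excerpt.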
Yes, we run DBCC every weekend.
We also have AWE enabled, with 12GB RAM.
Looks like we will have to wait for a fix to SP4.
Thanks for your help!
James
"Mike Epprecht (SQL MVP)" wrote:
> Hi
> Have you run DBCC CheckDB on the databases?
> If you do not have more than 2GB RAM, you could install SP4 for SQL Server
> as there was one known issue after SP3a that could have caused this error.
> Regards
> --
> Mike Epprecht, Microsoft SQL Server MVP
> Zurich, Switzerland
> MVP Program: http://www.microsoft.com/mvp
> Blog: http://www.msmvps.com/epprecht/
>
> "Death_n_Gravity" wrote:
|||Hi
Have you run DBCC CheckDB on the databases?
If you do not have more than 2GB RAM, you could install SP4 for SQL Server
as there was one known issue after SP3a that could have caused this error.
Regards
Mike Epprecht, Microsoft SQL Server MVP
Zurich, Switzerland
MVP Program: http://www.microsoft.com/mvp
Blog: http://www.msmvps.com/epprecht/
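As a starting point for deciding which database to check first, the lock resource in the 1203 error can be decoded. This is a minimal Python sketch, assuming the RID string follows the usual database_id:file_id:page:slot layout for row lock resources; the names `RidResource` and `parse_rid` are hypothetical helpers, not part of any SQL Server tool.

```python
import re
from typing import NamedTuple


class RidResource(NamedTuple):
    database_id: int
    file_id: int
    page: int
    slot: int


# Assumed layout of a row lock resource: RID: database_id:file_id:page:slot
_RID_PATTERN = re.compile(r"RID:\s*(\d+):(\d+):(\d+):(\d+)")


def parse_rid(text: str) -> RidResource:
    """Extract the first RID lock resource found in an error log line."""
    match = _RID_PATTERN.search(text)
    if match is None:
        raise ValueError("no RID resource found in: " + repr(text))
    return RidResource(*(int(part) for part in match.groups()))


# Error text from the SQLN01 error log, with the wrapped line rejoined.
line = "Process ID 58 attempting to unlock unowned resource RID: 8:1:339:43"
rid = parse_rid(line)
print(rid)
```

Under that assumption the error points at database ID 8, and `SELECT DB_NAME(8)` on the instance would give the name of the database to run DBCC CHECKDB against first.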
"Death_n_Gravity" wrote:

