Wednesday, March 28, 2012

Named Pipes problem on Clustered SQL Server

We recently have had issues connecting to clustered SQL Server
instances and have been working with PSS. I was unable to find any
useful information to assist in the troubleshooting process, so I have
decided to post this here in case anyone else searches on this like I
did.
Our Configuration:
2 Nodes, Windows 2003 Enterprise SP1, 5 SQL Server Named Instances
(SP4) in active/active configuration.
Our Initial Problem:
We were receiving GNE (General Network Error) when running some jobs or
connecting remotely to any of the named instances via Named Pipes.
When we looked in the SQL Server logs for each instance, we saw that
all instances reported "Listening on TCP, Shared Memory, Named Pipes"
During the troubleshooting we eventually found that we were unable to
connect to any of the instances using named pipes that were owned by of
the cluster nodes unless an alias was created.
Troubleshooting steps that helped us:
Try to connect to each instance using Query Analyzer. In the "Connect
to SQL Server"s "Connect to:" field try the following for each
instance:
Examples use a server called: SQLSERVER1\INSTANCE running on 1433.
After each successful connect, you should verify the connection's net
library using the following query:
select net_library from sysprocesses where spid = @.@.spid
1.) tcp:192.168.1.1,1433
2.) tcp:SQLSERVER1,1433
3.) tcp:SQLSERVER1\INSTANCE
4.) np:\\SQLSERVER1\pipe\MSSQL$INSTANCE\sql\query
5.) np:SQLSERVER1\INSTANCE
Here's the rundown of how each one is testing connectivity:
Number 1 is basic TCP connectivity. If this fails, look at TCP,
cluster VIP for that instance, port blocking on firewalls, etc. This
method does NOT use SSRP (SQL Server Resolution Protocol/Service).
Number 2 is like #1 except now DNS and/or WINS resolution is involved.
If #1 works and #2 doesn't, look at name resolution over TCP/IP. #3
uses the same name resolution as #2, however now you are using SSRP to
get the port number of the instance. (More on SSRP below). #4 uses
TCP name resolution and then connects to the SQL Server Name on the
cluster using the exact pipe name of the instance. If #4 fails, try
number 2 to test name resolution and also ensure that there is a log
entry in the SQL Server log that says "Listening on Named Pipes". #4
Uses Name Resolution and SSRP to query the pipe name.
What is SSRP?
This is the million dollar question, especially in a SQL Server
Clustered Environment. Good luck finding a nice whitepaper by
Microsoft about it, because I was unable to. After piecing together a
bunch of different posts, articles and other resources I was able to
glean the following (please note, this is my interpretation, not
absolute fact)
SSRP listens on each instance of SQL Server installed in the cluster
environment using UDP Port 1434. When a request is made for a named
pipes instance of SQL Server, the client request first queries UDP 1434
using a hex packet of 0x3. The request is handed to the SSRP Active
Server (which is the first SQL Server instance on each node that was
brought online) The SSRP Active Server then enumerates the instances
from the registry key:
HKLM\Software\Microsoft\MicrosoftSQLServer\InstalledInstances
and then SSRP Enumerates the connection information for each instance
using the key(s):
HKLM\Software\Microsoft\MicrosoftSQLServer\(instance
name)\MSSQLServer\SuperSocketNetLib\
SSRP then returns on UDP1434 to the client is a list of listening
instances and their configuration information.
You can actually see this in action using PortQry or a tool called UDP
Port Tool by http://www.simplecomtools.com/
I was unable to get PortQry to work in my environment, so I used the
UDP Port Tool. You will need to bind 1434 on your local machine (right
side of window) then enter the connection information (IP address of
SQL Server IP and port 1434 (left side of window) and enter "03"
without the quotes and set the data type to hex. This will return the
information provided by SSRP.
Now, the meat of the problem....
We found that on one of our nodes NONE of the instances would connect
using method 5 (above). However, method 4 would work. When we queried
SSRP on any of these instances, no data was returned. By logging on to
that node and executing:
netstat -an -b -p UDP
We were able to see that 0.0.0.0,1434 UDP was bound to PID xxxx. This
SHOULD have been the PID of the FIRST SQL Instance to be brought online
to that node. When we looked at both Task Manager Processes (selecting
the column for PID) there was no PID running with the number xxxx.
After looking in the historical logs, we found that there was an
instance at one point that was running on that node with PID xxxx,
however it had crashed and restarted under a new PID, however the
1434UDP bind was never released, therefore it did not thing it needed
to be the Active SSRP Server. A full cluster reboot is the Microsoft
PSS recommendation for this situation.
I hope this helps someone. We banged our heads against the wall for a
while on this, and if this only helps one person out there, hopefully
my newsgroup karma gets better. Please feel free to reply to this
with technical corrections or comments. Like I mentioned above, this
is what I found and is my interpretation.
ThanksOne other interesting note regarding SSRP that I came across while
reserching this did not apply to us, but may apply is registry
permissions while enumerating. There is a technet article on this:
http://support.microsoft.com/default.aspx?scid=kb;en-us;922131&sd=rss&spid=2852|||Wrong article!! This is the article regarding registry permissions:
http://support.microsoft.com/kb/888529
The other article has to do with SQL2005 installed over SQL2000 and
then removed.
wuteva wrote:
> One other interesting note regarding SSRP that I came across while
> reserching this did not apply to us, but may apply is registry
> permissions while enumerating. There is a technet article on this:
> http://support.microsoft.com/default.aspx?scid=kb;en-us;922131&sd=rss&spid=2852

No comments:

Post a Comment