Since a sitewide upgrade to Windows 7 on our desktops, I've started having a problem with virus checking.
Specifically, it happens when doing a rename operation on a (filer-hosted) CIFS share. The virus checker seems to be triggering a set of messages on the filer:
[filerB: auth.trace.authenticateUser.loginTraceIP:info]: AUTH: Login attempt by user server-wk8-r2$ of domain MYDOMAIN from client machine 10.1.1.20 (server-wk8-r2).
[filerB: auth.dc.trace.DCConnection.statusMsg:info]: AUTH: TraceDC- attempting authentication with domain controller \\MYDC.
[filerB: auth.trace.authenticateUser.loginRejected:info]: AUTH: Login attempt by user rejected by the domain controller with error 0xc0000199: STATUS_NOLOGON_WORKSTATION_TRUST_ACCOUNT.
[filerB: auth.trace.authenticateUser.loginTraceMsg:info]: AUTH: Delaying the response by 5 seconds due to continuous failed login attempts by user server-wk8-r2$ of domain MYDOMAIN from client machine 10.1.1.20.
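To get a feel for how much delay these messages add up to, something like the following could be run over a saved copy of the filer log. It is only a rough sketch: the regular expression is built from the wording of the messages above, the function name is my own, and other ONTAP versions may word the message differently.

```python
import re
from collections import Counter

# Sample lines copied from the filer messages above.
LOG_LINES = [
    "[filerB: auth.trace.authenticateUser.loginTraceIP:info]: AUTH: Login attempt by user server-wk8-r2$ of domain MYDOMAIN from client machine 10.1.1.20 (server-wk8-r2).",
    "[filerB: auth.trace.authenticateUser.loginRejected:info]: AUTH: Login attempt by user rejected by the domain controller with error 0xc0000199: STATUS_NOLOGON_WORKSTATION_TRUST_ACCOUNT.",
    "[filerB: auth.trace.authenticateUser.loginTraceMsg:info]: AUTH: Delaying the response by 5 seconds due to continuous failed login attempts by user server-wk8-r2$ of domain MYDOMAIN from client machine 10.1.1.20.",
]

# Pattern derived from the "Delaying the response" message text shown above.
DELAY_RE = re.compile(
    r"Delaying the response by (?P<secs>\d+) seconds due to continuous failed "
    r"login attempts by user (?P<user>\S+) of domain (?P<domain>\S+) "
    r"from client machine (?P<ip>\d+\.\d+\.\d+\.\d+)"
)


def summarise_delays(lines):
    """Return {(user, client_ip): total_delay_seconds} for all delay messages."""
    totals = Counter()
    for line in lines:
        m = DELAY_RE.search(line)
        if m:
            totals[(m.group("user"), m.group("ip"))] += int(m.group("secs"))
    return dict(totals)


if __name__ == "__main__":
    for (user, ip), secs in summarise_delays(LOG_LINES).items():
        print(f"{user} from {ip}: {secs}s of delayed responses")
```

Grouping by user and client IP should show whether the delays all come from machine accounts (names ending in $) rather than from ordinary users.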
This seems to trigger specifically on a rename, so what we think is going on is that the virus checker sees a 'new' file and tries to do an on-access scan. The virus checker, which runs as LocalSystem and previously sent null as its authentication request, now looks rather like a DoS attack, causing the filer to temporarily blacklist the client.
This 5s lockout on each 'access attempt' is a minor nuisance most of the time, but really quite significant for some operations, e.g. large file transfers, where every file takes 5s.
Having done some digging, this seems to be related to NTLM authentication:
Symptoms
Error message:
System error 1808 has occurred.
The account used is a computer account. Use your global user account or local user account to access this server.
A packet trace of the failure will show the error as:
STATUS_NOLOGON_WORKSTATION_TRUST_ACCOUNT (0xC0000199)
Cause
Microsoft has changed the functionality of how a Local System account identifies itself
during NTLM authentication. This only impacts NTLM authentication. It does not impact
Kerberos Authentication.
Solution
On the host, please set the following group policy entry and reboot the host.
Network Security: Allow Local System to use computer identity for NTLM: Disabled
Defining this group policy makes Windows Server 2008 R2 and Windows 7 function like Windows Server 2008 SP1.
So we've now got a couple of workarounds, neither of which is particularly nice. One is to change this security option.
The other is to disable virus checking, or otherwise exempt part of the infrastructure.
And here's where I come to my request for assistance from ServerFault - what is the best way forwards? I lack Windows experience to be sure of what I'm seeing.
I'm not entirely sure why NTLM is part of this picture in the first place - I thought we were using Kerberos authentication. I'm not sure how to start diagnosing or troubleshooting this. (We are going cross domain - workstation machine accounts are in a separate AD and DNS domain to my filer. Normal user authentication works fine however.)
And failing that, can anyone suggest other lines of enquiry? I'd like to avoid a site wide security option change, or if I do go that way I'll need to be able to supply detailed reasoning. Likewise - disabling virus checking works as a short term workaround, and applying exclusions may help... but I'd rather not, and don't think that solves the underlying problem.
EDIT:
Filers in AD LDAP have SPNs for:
nfs/host.fully.qualified.domain
nfs/host
HOST/host.fully.qualified.domain
HOST/host
(Sorry, have to obfuscate those).
Could it be that without a 'cifs/host.fully.qualified.domain' it's not going to work? (or some other SPN? )
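To make the question concrete, here is a small sketch of the check I have in mind: given the SPNs registered on the filer's account, is there one that matches a request for cifs/host? The function and the host_alias_classes parameter are my own illustration. Active Directory can alias some service classes onto HOST (via the sPNMappings attribute), but whether cifs is covered in this forest is an assumption that would need verifying, so the alias is off by default.

```python
# SPNs as listed above (host names are obfuscated in the original post).
SPNS = [
    "nfs/host.fully.qualified.domain",
    "nfs/host",
    "HOST/host.fully.qualified.domain",
    "HOST/host",
]


def matching_spns(spns, service, hostname, host_alias_classes=()):
    """Return the registered SPNs that could satisfy service/hostname.

    host_alias_classes is an ASSUMPTION hook: service classes you believe the
    directory maps onto HOST. Leave it empty to require an explicit SPN.
    """
    wanted = {service.lower()}
    if service.lower() in {c.lower() for c in host_alias_classes}:
        wanted.add("host")
    hits = []
    for spn in spns:
        svc, _, host = spn.partition("/")
        if svc.lower() in wanted and host.lower() == hostname.lower():
            hits.append(spn)
    return hits


if __name__ == "__main__":
    fqdn = "host.fully.qualified.domain"
    print("explicit only:", matching_spns(SPNS, "cifs", fqdn))
    print("if cifs aliases to HOST:", matching_spns(SPNS, "cifs", fqdn, ("cifs",)))
```

With an explicit match required, nothing is found for cifs. If cifs is treated as an alias of HOST, the existing HOST entries would match, in which case a missing cifs SPN would not be the explanation.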
Edit: As part of the searching I've been doing I've found:
http://itwanderer.wordpress.com/2011/04/14/tread-lightly-kerberos-encryption-types/
Which suggests that several encryption types were disabled by default in Win7/2008R2. This might be pertinent, as we've definitely had a similar problem with Kerberized NFSv4.
There is a hidden option which may help some future Kerberos users:
options nfs.rpcsec.trace on
(This hasn't given me anything yet though, so may just be NFS specific).
Edit:
Further digging has me tracking it back to cross domain authentication. It looks like my Windows 7 workstation (in one domain) is not getting Kerberos tickets for the other domain, in which my NetApp filer is CIFS joined.
I've done this separately against a standalone server (Win2003 and Win2008) and didn't get Kerberos tickets for those either.
Which means I think Kerberos might be broken, but I've no idea how to troubleshoot further.
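For anyone wanting to repeat the check, this is roughly how the ticket cache can be summarised from the text output of klist on the workstation. It is a sketch under assumptions: the sample output below is hypothetical and heavily trimmed, the realm and host names are invented, and the parser only looks at the "Server:" lines.

```python
import re

# HYPOTHETICAL, trimmed klist output; realm and host names are invented.
SAMPLE_KLIST = """\
Cached Tickets: (2)

#0>     Client: someuser @ WORKSTATIONS.EXAMPLE
        Server: krbtgt/WORKSTATIONS.EXAMPLE @ WORKSTATIONS.EXAMPLE
#1>     Client: someuser @ WORKSTATIONS.EXAMPLE
        Server: cifs/fileserver.workstations.example @ WORKSTATIONS.EXAMPLE
"""

# Matches lines of the form "Server: <principal> @ <REALM>".
SERVER_RE = re.compile(r"\s*Server:\s*(\S+)\s*@\s*(\S+)")


def tickets_by_realm(klist_text):
    """Group the service principals of cached tickets by realm (upper-cased)."""
    realms = {}
    for line in klist_text.splitlines():
        m = SERVER_RE.match(line)
        if m:
            principal, realm = m.group(1), m.group(2)
            realms.setdefault(realm.upper(), []).append(principal)
    return realms


if __name__ == "__main__":
    cache = tickets_by_realm(SAMPLE_KLIST)
    for realm, principals in cache.items():
        print(realm, principals)
    # STORAGE.EXAMPLE stands in for the filer's domain (hypothetical name).
    print("tickets for filer realm:", cache.get("STORAGE.EXAMPLE", []))
```

In the sample, every ticket belongs to the workstation's own realm and nothing is cached for the filer's realm, which is the pattern I am describing.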
Edit: A further update: It looks like this may be down to Kerberos tickets not being issued cross domain. This then triggers NTLM fallback, which then runs into this problem (since Windows 7). First port of call will be to investigate the Kerberos side of things, but in neither case do we have anything pointing at the filer being the root cause. As such - as the storage engineer - it's out of my hands.
However, if anyone can point me in the direction of troubleshooting Kerberos spanning two Windows AD domains (Kerberos Realms) then that would be appreciated.
Options we're going to be considering for resolution:
Amend policy option on all workstations via GPO (as above).
Talking to AV vendor about the rename triggering scanning.
Talking to AV vendor regarding running AV as service account.
Investigating Kerberos authentication (why it's not working, whether it should be).