Resolved
The problem was Hyper-V on that machine. I removed Hyper-V, installed VMware Server, ran the same VM. Time sync issues went away (< 100ms difference after a day).
My setup is like this:
HYV1 - HyperV machine (non domain) - sync irrelevant
AD1 - VM AD server on HYV1, sync'd to time.nist.gov. HyperV time sync off.
S1 - Physical machine, sync'd to domain.
S2 - Physical machine running HyperV, sync'd to domain.
V1 - Linux VM machine on S2, sync'd to AD1. No HyperV integration.
AD1 and S1 have fine sync -- stripchart shows less than 100ms difference.
S2 drifts like crazy. Here's a bit of the stripchart against AD1:
18:33:22 d:+00.0010138s o:+05.4101899s
18:33:24 d:+00.0010138s o:+05.4319765s
18:33:26 d:+00.0000000s o:+05.4788429s
18:33:28 d:+00.0000000s o:+05.6089942s
18:33:30 d:+00.0010138s o:+05.7240269s
18:33:32 d:+00.0000000s o:+06.0421911s
18:33:34 d:+00.0081104s o:+06.5613708s
18:33:37 d:+00.0000000s o:+06.9096594s
18:33:39 d:+00.0000000s o:+06.8867838s
18:33:41 d:+00.0010127s o:+06.8936401s
In 20 seconds, it drifted over a second. If I manually reset it to within 1s, within a few minutes it'll be back drifting about 2 seconds. Overnight it went from ~2s to ~5s. The Linux VM inside S2 has perfect sync with AD1.
Here's the config:
C:\Users\mgg>w32tm /dumpreg /subkey:Parameters
Value Name Value Type Value Data
------------------------------------------------------------
ServiceDll REG_EXPAND_SZ %systemroot%\system32\w32time.dll
ServiceMain REG_SZ SvchostEntry_W32Time
ServiceDllUnloadOnStop REG_DWORD 1
Type REG_SZ NT5DS
NtpServer REG_SZ ad01.mydomain ad02.mydomain
C:\Users\mgg>w32tm /dumpreg /subkey:Config
Value Name Value Type Value Data
-----------------------------------------------------------
FrequencyCorrectRate REG_DWORD 4
PollAdjustFactor REG_DWORD 5
LargePhaseOffset REG_DWORD 50000000
SpikeWatchPeriod REG_DWORD 900
LocalClockDispersion REG_DWORD 9
HoldPeriod REG_DWORD 5
PhaseCorrectRate REG_DWORD 1
UpdateInterval REG_DWORD 30000
EventLogFlags REG_DWORD 2
AnnounceFlags REG_DWORD 5
TimeJumpAuditOffset REG_DWORD 28800
MinPollInterval REG_DWORD 2
MaxPollInterval REG_DWORD 8
MaxNegPhaseCorrection REG_DWORD -1
MaxPosPhaseCorrection REG_DWORD -1
MaxAllowedPhaseOffset REG_DWORD 300
I looked at the event log, and apart from warnings about sync (after it gets way out of sync), there's no other warnings.
How can I go about troubleshooting this? It's the only machine that is having this problem. All the other machines (physical and virtual) are doing fine.
Edit: To clarify: The VM (AD1) has integration turned off and syncs to time.nist.gov. AD1 is fine. It's the physical machine S1 that can't sync to AD1 and drifts all over. All the other physical servers are able to sync to AD1 just fine.
Update
So, it appears to be an issue of running the VM. The clock slips slowly with the VM off. Turned on, it immediately starts losing seconds. I swt the VM to only use half the resources, and that seems to have slightly mitigated it, for now. Thanks!