Diagnosing Random Network Lag
- by uesp
I'm having trouble diagnosing some random lag on a 6 server LAMP cluster serving a MediaWiki site. While we're serving some 100 pages/sec the servers themselves are running fine with less than 0.5 load, no locked processes, no paging, no errors being logged, etc....
Lag is present on all servers and is random: one minute its fine the next it's there.
DNS lookups on the servers are randomly slow. For example time nslookup google.com varies randomly from a few milliseconds to several seconds and sometimes times out entirely. While we use IP addresses internally on the cluster this may be a symptom of the root issue. We are not running our own DNS server.
The Apache server-status pages randomly lag or time out. Benchmarking using ab between servers shows a few loads sometimes take 3000 ms (almost exactly). Benchmarking server-status on the local server itself usually shows no issue (it showed a lag only once among a few hundred tests).
The servers are sitting behind a switch and a firewall which I don't have any access to so I don't know their setup or status. While we are under heavier than normal load a 2 Mbps incoming and 20 Mbps outgoing traffic shouldn't be stressing the switch or firewall should it? My feeling is that it is the switch/firewall or something above them in the ISP like their DNS but can't confirm it.
I need some other tests or methods of diagnosing this lag to try and narrow down the ultimate cause.