Diagnosing Random Network Lag

Posted by uesp on Server Fault
Published on 2011-11-14T15:54:27Z Indexed on 2011/11/18 17:53 UTC

I'm having trouble diagnosing random lag on a six-server LAMP cluster serving a MediaWiki site. While we're serving about 100 pages/sec, the servers themselves are running fine: load is under 0.5, with no locked processes, no paging, and no errors being logged.

  • Lag is present on all servers and is random: one minute it's fine, the next it's there.
  • DNS lookups on the servers are randomly slow. For example, running time nslookup google.com takes anywhere from a few milliseconds to several seconds, and sometimes times out entirely. Although we use IP addresses internally on the cluster, this may be a symptom of the root issue. We are not running our own DNS server.
  • The Apache server-status pages randomly lag or time out. Benchmarking with ab between servers shows that a few requests sometimes take 3000 ms (almost exactly). Benchmarking server-status on the local server itself usually shows no issue (it showed a lag only once in a few hundred tests).
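The DNS symptom above can be measured more systematically than with one-off nslookup runs. The following is a minimal Python sketch, not a definitive tool: the function names (time_call, summarize, probe_dns), the 1000 ms "slow" threshold, and the run count are all assumptions made for illustration. It resolves a name repeatedly through the system resolver and reports the median, the maximum, and how many lookups were slow or failed.

```python
import socket
import statistics
import time


def time_call(fn, *args):
    """Run fn(*args) once and return (elapsed_ms, error_or_None)."""
    start = time.perf_counter()
    error = None
    try:
        fn(*args)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        error = exc
    return (time.perf_counter() - start) * 1000.0, error


def summarize(samples_ms, slow_ms=1000.0):
    """Summarize lookup times; slow_ms is an arbitrary illustrative threshold."""
    return {
        "count": len(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "max_ms": max(samples_ms),
        "slow": sum(1 for ms in samples_ms if ms >= slow_ms),
    }


def probe_dns(hostname="google.com", runs=50, pause_s=0.5,
              resolver=socket.getaddrinfo):
    """Resolve hostname `runs` times and report timing plus failure count."""
    samples, failures = [], 0
    for _ in range(runs):
        elapsed_ms, error = time_call(resolver, hostname, 80)
        samples.append(elapsed_ms)
        if error is not None:
            failures += 1
        time.sleep(pause_s)
    report = summarize(samples)
    report["failures"] = failures
    return report
```

Calling print(probe_dns()) on each server, and again from a machine outside the switch/firewall if one is available, would show whether the slow lookups are specific to the cluster's network path. Note that socket.getaddrinfo goes through the system resolver, so a local cache (if one is running) could hide slow upstream lookups.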

The servers sit behind a switch and a firewall that I have no access to, so I don't know their setup or status. We are under heavier than normal load, but 2 Mbps incoming and 20 Mbps outgoing traffic shouldn't be stressing the switch or firewall, should it? My feeling is that the problem is the switch/firewall or something above them at the ISP, such as their DNS, but I can't confirm it.

I need some other tests or methods for diagnosing this lag so I can narrow down the ultimate cause.
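One test along these lines is to time bare TCP connects between two of the servers, which takes Apache and MediaWiki out of the picture. The sketch below is only an illustration under stated assumptions: sample_connects and bucket are hypothetical helper names, and the 500 ms bucket width is arbitrary. If the slow samples cluster tightly in one bucket (for instance around the 3000 ms seen with ab), that may point to a fixed timer such as a retransmission timeout somewhere on the path rather than to server load, though that remains a hypothesis to check.

```python
import socket
import time
from collections import Counter


def sample_connects(host, port, runs=200, timeout=10.0, pause_s=0.1,
                    connect=socket.create_connection):
    """Time `runs` TCP connects to (host, port); returns elapsed times in ms.

    Failed or timed-out attempts are still recorded with their elapsed time.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        try:
            conn = connect((host, port), timeout=timeout)
            conn.close()
        except OSError:  # includes timeouts and refused connections
            pass
        samples.append((time.perf_counter() - start) * 1000.0)
        time.sleep(pause_s)
    return samples


def bucket(samples_ms, width_ms=500):
    """Group samples into width_ms-wide buckets keyed by the bucket's lower bound."""
    counts = Counter(int(ms // width_ms) * width_ms for ms in samples_ms)
    return dict(sorted(counts.items()))
```

For example, print(bucket(sample_connects("10.0.0.2", 80))) run from one server against another, where 10.0.0.2 is a placeholder for a peer's internal address. Running the same probe against 127.0.0.1 gives a local baseline for comparison.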

© Server Fault or respective owner
