Diagnosing Random Network Lag
Posted by uesp on Server Fault
Published on 2011-11-14T15:54:27Z
Indexed on 2011/11/18 17:53 UTC
I'm having trouble diagnosing some random lag on a six-server LAMP cluster serving a MediaWiki site. While we're serving some 100 pages/sec, the servers themselves are running fine, with load below 0.5, no locked processes, no paging, no errors being logged, etc.
- Lag is present on all servers and is random: one minute it's fine, the next it's there.
- DNS lookups on the servers are randomly slow. For example, `time nslookup google.com` varies randomly from a few milliseconds to several seconds and sometimes times out entirely. While we use IP addresses internally on the cluster, this may be a symptom of the root issue. We are not running our own DNS server.
- The Apache `server-status` pages randomly lag or time out. Benchmarking using `ab` between servers shows a few loads sometimes take 3000 ms (almost exactly). Benchmarking `server-status` on the local server itself usually shows no issue (it showed a lag only once among a few hundred tests).
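To put numbers on the DNS symptom above, lookup latency could be sampled repeatedly and summarized rather than checked by hand. The following is only a minimal sketch, assuming Python 3 is available on the servers. The function names, the 60-sample run, and the 1-second "slow" threshold are illustrative choices, not part of the existing setup. It uses `socket.getaddrinfo`, which goes through the system resolver (and any local cache), so its timings may not match `nslookup` exactly.

```python
import socket
import statistics
import time


def time_lookup(hostname):
    """Return the seconds taken to resolve hostname, or None if resolution failed."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(hostname, 80)
    except socket.gaierror:
        return None
    return time.monotonic() - start


def summarize(samples, slow_threshold=1.0):
    """Summarize latencies given in seconds; None entries count as failures."""
    ok = [s for s in samples if s is not None]
    return {
        "count": len(samples),
        "failures": len(samples) - len(ok),
        # slow_threshold is an arbitrary cut-off chosen for illustration
        "slow": sum(1 for s in ok if s >= slow_threshold),
        "median": statistics.median(ok) if ok else None,
        "max": max(ok) if ok else None,
    }


if __name__ == "__main__":
    samples = []
    for _ in range(60):  # one lookup per second for about a minute
        samples.append(time_lookup("google.com"))
        time.sleep(1)
    print(summarize(samples))
```

Running this on several servers at the same time would show whether the slow lookups hit all of them at the same moments or each one independently.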
The servers are sitting behind a switch and a firewall which I don't have any access to, so I don't know their setup or status. While we are under heavier-than-normal load, 2 Mbps of incoming and 20 Mbps of outgoing traffic shouldn't be stressing the switch or firewall, should it? My feeling is that it is the switch/firewall or something above them at the ISP, like their DNS, but I can't confirm it.
I need some other tests or methods of diagnosing this lag to try and narrow down the ultimate cause.
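One test of this kind would be to time the TCP connection setup separately from the HTTP response when fetching `server-status` from another server. A delay in the connect phase would point more toward the network path, while a delay after connecting would point more toward Apache. This is a rough sketch under assumptions, not a definitive tool: the address `10.0.0.2` is a placeholder for another server in the cluster, port 80 and the 1-second threshold are guesses, and `probe` and `classify` are hypothetical helper names.

```python
import socket
import time


def probe(host, port=80, path="/server-status", timeout=10.0):
    """Return (connect_seconds, response_seconds) for one HTTP request.

    Either value is None if that phase failed or timed out.
    """
    start = time.monotonic()
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return None, None
    connected = time.monotonic()
    try:
        with sock:
            request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
            sock.sendall(request.encode("ascii"))
            sock.recv(1)  # block until the first byte of the response arrives
    except OSError:
        return connected - start, None
    return connected - start, time.monotonic() - connected


def classify(connect_s, response_s, threshold=1.0):
    """Label which phase of a probe was slow, using an arbitrary threshold."""
    if connect_s is None:
        return "connect-failed"
    if connect_s >= threshold:
        return "slow-connect"
    if response_s is None:
        return "response-failed"
    if response_s >= threshold:
        return "slow-response"
    return "ok"


if __name__ == "__main__":
    # 10.0.0.2 is a placeholder; substitute the internal IP of another server.
    for _ in range(100):
        c, r = probe("10.0.0.2")
        print(time.strftime("%H:%M:%S"), c, r, classify(c, r))
        time.sleep(0.5)
```

Logging the timestamp with each result makes it possible to line up slow probes on different servers and with the slow DNS lookups.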
© Server Fault or respective owner