I have a DNS server that resolves all queries for an internal group of servers.
It is a bind on CentOS 5.5 (same as RHEL5) and I have set it up to allow recursion and resolve direction without any forwarders.
The problem I am facing is that it takes a freakishly long amount of time to resolve a name for the first time. (in the magnitudes of 20 sec) This causes clients to give timeout.
When I set it to forward all to Google's public DNS, i.e. 8.8.8.8+8.8.4.4, it works very nicely (within a second).
I tried monitoring the traffic on the net to see why it is doing this:
[root@ns1 ~]# tcpdump -nnvvvA -s0 udp
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
23:06:36.137797 IP (tos 0x0, ttl 64, id 35903, offset 0, flags [none], proto: UDP (17), length: 60) 172.17.1.10.36942 > 172.17.1.4.53: [udp sum ok] 19773+ A? www.paypal.
com. (32)
E..<
[email protected]...
.....N.5.(6.M=...........www.paypal.
com.....
23:06:36.140594 IP (tos 0x0, ttl 64, id 56477, offset 0, flags [none], proto: UDP (17), length: 71) 172.17.1.4.6128 > 192.35.51.30.53: [udp sum ok] 10105 [1au] A? www.paypal.
com. ar: . OPT UDPsize=4096 (43)
E..G....@........#3....5.3fR'y...........www.paypal.
com.......)........
23:06:38.149756 IP (tos 0x0, ttl 64, id 13078, offset 0, flags [none], proto: UDP (17), length: 71) 172.17.1.4.52425 > 192.54.112.30.53: [udp sum ok] 54892 [1au] A? www.paypal.
com. ar: . OPT UDPsize=4096 (43)
[email protected]&.....6p....5.3.q.l...........www.paypal.
com.......)........
23:06:40.159725 IP (tos 0x0, ttl 64, id 43016, offset 0, flags [none], proto: UDP (17), length: 71) 172.17.1.4.24059 > 192.42.93.30.53: [udp sum ok] 11205 [1au] A? www.paypal.
com. ar: . OPT UDPsize=4096 (43)
E..G....@..@.....*].]..5.3..+............www.paypal.
com.......)........
23:06:41.141403 IP (tos 0x0, ttl 64, id 35904, offset 0, flags [none], proto: UDP (17), length: 60) 172.17.1.10.36942 > 172.17.1.4.53: [udp sum ok] 19773+ A? www.paypal.
com. (32)
E..<.@..@..@...
.....N.5.(6.M=...........www.paypal.
com.....
23:06:42.169652 IP (tos 0x0, ttl 64, id 44001, offset 0, flags [none], proto: UDP (17), length: 60) 172.17.1.4.9141 > 192.55.83.30.53: [udp sum ok] 1184 A? www.paypal.
com. (32)
E..<
[email protected].#..5.(...............www.paypal.
com.....
23:06:42.207295 IP (tos 0x0, ttl 54, id 38004, offset 0, flags [none], proto: UDP (17), length: 205) 192.55.83.30.53 > 172.17.1.4.9141: [udp sum ok] 1184- q: A? www.paypal.
com. 0/3/3 ns: paypal.
com. NS ns1.isc-sns.net., paypal.
com. NS ns2.isc-sns.
com., paypal.
com. NS ns3.isc-sns.info. ar: ns1.isc-sns.net. AAAA 2001:470:1a::1, ns1.isc-sns.net. A 72.52.71.1, ns2.isc-sns.
com. A 38.103.2.1 (177)
E....t..6./A.7S......5#..................www.paypal.
com..................ns1.isc-sns.net..............ns2.isc-sns...............ns3.isc-sns.info..,.......... ..p.............,..........H4G..I..........&g..
(this goes on for a few more seconds)
If you look carefully, you will see that the first 3-4 root servers did not respond at all.
This wastes 7-8 seconds, until one of them responded.
Do you think I have setup something wrong here? Interestingly, when I dig directly from the root servers that did not respond, the always respond very fast (showing the firewall/nat is not the issue here). E.g.
dig www.paypal.
com @192.35.51.30
works perfectly, consistently, and very fast. What do you think about this mystery?