Bind9 as a caching resolver fails with mismatch ID on localhost but not external IP

Posted by argibbs on Server Fault
Published on 2013-10-18T17:07:13Z Indexed on 2013/10/19 21:58 UTC

I'm running Ubuntu 12.04 LTS on a machine on my private network.

I have bind9 (v9.8.1-P1) installed via aptitude, so it appears to have put all the bits in the right places, and the service starts automatically. I plan on adding some zones later, but first I'm just trying to get it working as a caching resolver. I installed bind, configured it, and started using it. Initially I thought it was working OK, but then I found some sites weren't being resolved. I've pinned it down to the size of the result and the retry in TCP mode.

So: I'm trying to find out why bind fails when I query for a domain and the result is >512 bytes (causing truncation and a retry over TCP). Specifically, it fails with ID mismatches if I point dig at localhost, but works when I query the machine's own IP (192.168.0.2). This is the reverse of the problem most people seem to have with bind (fails on the external IP, works on localhost).

If I do dig @localhost google.com (which has a response of <512 bytes) then it works; I get no warnings, and plenty of output.

$ dig @localhost google.com
; <<>> DiG 9.8.1-P1 <<>> @localhost google.com
[snip lots of output]
;; Query time: 39 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Oct 17 23:08:34 2013
;; MSG SIZE  rcvd: 495

If I do dig @localhost play.google.com (which has a larger response) then I get back something like:

$ dig @localhost play.google.com
;; Truncated, retrying in TCP mode.
;; ERROR: ID mismatch: expected ID 3696, got 27130

The fallback itself seems to be standard, documented behaviour: when the response does not fit in a plain UDP reply (here the limit is 512 bytes), the reply is truncated and dig retries over TCP. The ID mismatch is not expected, though.
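For anyone unfamiliar with the mechanics, here is a minimal Python sketch of the two checks involved: the TC ("truncated") bit in the UDP reply that triggers the retry, and the ID comparison that produces the error. It follows the RFC 1035 header layout and is not dig's actual code. The helper names (`make_header`, `check_reply_id`, and so on) are made up for illustration, and the two messages are synthetic headers built from the IDs in the output above, not captured packets.

```python
import struct

TC_FLAG = 0x0200  # "truncated" bit in the 16-bit DNS flags word


def parse_id_and_flags(message: bytes) -> tuple:
    """Return (id, flags) from the first four bytes of a DNS message."""
    return struct.unpack("!HH", message[:4])


def is_truncated(udp_reply: bytes) -> bool:
    """True if the server set TC, meaning the client should retry over TCP."""
    _, flags = parse_id_and_flags(udp_reply)
    return bool(flags & TC_FLAG)


def strip_tcp_length_prefix(stream: bytes) -> bytes:
    """DNS over TCP prefixes each message with a 2-byte length field."""
    (length,) = struct.unpack("!H", stream[:2])
    return stream[2:2 + length]


def check_reply_id(expected_id: int, reply: bytes) -> None:
    """Raise if the reply's ID differs from the ID of the query that was sent."""
    got, _ = parse_id_and_flags(reply)
    if got != expected_id:
        raise ValueError(f"ID mismatch: expected ID {expected_id}, got {got}")


def make_header(msg_id: int, flags: int) -> bytes:
    """Hypothetical helper: a bare 12-byte header with QDCOUNT=1."""
    return struct.pack("!HHHHHH", msg_id, flags, 1, 0, 0, 0)


# Synthetic UDP reply with QR|TC|RD|RA set, using the expected ID from the post.
udp_reply = make_header(3696, 0x8380)
# Synthetic TCP reply carrying the unexpected ID from the post.
tcp_stream = struct.pack("!H", 12) + make_header(27130, 0x8180)

if is_truncated(udp_reply):
    try:
        check_reply_id(3696, strip_tcp_length_prefix(tcp_stream))
    except ValueError as err:
        print(err)
```

The point of the sketch is only that the mismatch means the message read back on the TCP connection carries a different ID from the query dig believes it sent.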

If I do dig @192.168.0.2 play.google.com then I still get the warning about using TCP mode, but it otherwise works:

$ dig @192.168.0.2 play.google.com
;; Truncated, retrying in TCP mode.
; <<>> DiG 9.8.1-P1 <<>> @192.168.0.2 play.google.com
[snip most of the output]
;; Query time: 5 msec
;; SERVER: 192.168.0.2#53(192.168.0.2)
;; WHEN: Thu Oct 17 23:05:55 2013
;; MSG SIZE  rcvd: 521

At the moment I've not set up any zones in my local instance, so it's just acting as a caching resolver. My options config is pretty much unchanged from the default; I've got the following set:

options {
    directory "/var/cache/bind";
    allow-query { 192.168/16; 127.0.0.1; };
    forwarders { 8.8.8.8; 8.8.4.4; };
    dnssec-validation auto;
    edns-udp-size 4096 ;
    allow-transfer { any; };
    auth-nxdomain no;    # conform to RFC1035
    listen-on-v6 { any; };
};

And my /etc/resolv.conf is just

nameserver 127.0.0.1
search .local

The problem definitely seems linked to the fallback to TCP mode: if I do dig +bufsize=4096 @localhost play.google.com then it works, with no warning about retrying in TCP mode, no ID mismatch, and a standard-looking result. To be honest, if there were a way to force bind to use a much larger UDP buffer, that would probably be good enough for me, but the only option I've been able to find is max-udp-size 4096, and that doesn't change the behaviour in any way.
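As I understand it, +bufsize sets the UDP buffer size that dig advertises via EDNS0, which is done by adding an OPT pseudo-record to the query. The Python sketch below builds such a query by hand to show what changes on the wire. It assumes the RFC 1035 / RFC 6891 layouts; `encode_name` and `build_query` are names invented for this example, and the query ID is arbitrary.

```python
import struct


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name: str, msg_id: int, udp_size=None) -> bytes:
    """Build an A/IN query; if udp_size is given, append an EDNS0 OPT record."""
    arcount = 1 if udp_size is not None else 0
    # Flags 0x0100 = RD (recursion desired) only.
    header = struct.pack("!HHHHHH", msg_id, 0x0100, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    opt = b""
    if udp_size is not None:
        # OPT pseudo-record: root name, TYPE=41, CLASS field carries the
        # advertised UDP payload size, TTL field (ext. rcode/version/flags)
        # left at 0, and an empty RDATA.
        opt = b"\x00" + struct.pack("!HHIH", 41, udp_size, 0, 0)
    return header + question + opt


plain = build_query("play.google.com", 3696)
with_edns = build_query("play.google.com", 3696, udp_size=4096)
print(len(plain), len(with_edns))
```

Without the OPT record the server has to assume the client can only take 512 bytes over UDP, which is consistent with the 521-byte answer being truncated unless +bufsize is given.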

I've also tried setting edns-udp-size 512 in case the problem is some weird EDNS issue with my router (which seems unlikely since the +bufsize=4096 flag works fine).

I've also tried dig +trace @localhost play.google.com; this works. No truncation/TCP warning, and a full result.

I've also tried changing the servers used in the forwarder (e.g. to OpenDNS), but that makes no difference.

There's one last data point: if I repeatedly do dig @localhost play.google.com I don't always get an ID mismatch; sometimes I get a REFUSED error instead. I'm much more likely to get a REFUSED error if I dig the non-localhost IP (192.168.0.2) first:

$ dig @localhost play.google.com
;; Truncated, retrying in TCP mode.
; <<>> DiG 9.8.1-P1 <<>> @localhost play.google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 35104
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;play.google.com.       IN  A
;; Query time: 4 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Oct 17 23:20:13 2013
;; MSG SIZE  rcvd: 33
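To make that output easier to read, here is a small Python sketch that decodes a DNS header the way dig's HEADER and flags lines present it. The reply bytes are reconstructed from the values dig printed above (id 35104, flags qr rd ra, status REFUSED, one question, no records); they are not a packet capture. `describe_header` is a made-up name, and the tables cover only the common rcodes and flag bits.

```python
import struct

RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}
FLAG_BITS = [("qr", 0x8000), ("aa", 0x0400), ("tc", 0x0200),
             ("rd", 0x0100), ("ra", 0x0080)]


def describe_header(message: bytes) -> dict:
    """Decode the 12-byte DNS header into the fields dig prints."""
    msg_id, flags, qd, an, ns, ar = struct.unpack("!HHHHHH", message[:12])
    rcode = flags & 0x000F  # low four bits of the flags word
    return {
        "id": msg_id,
        "status": RCODES.get(rcode, f"RCODE{rcode}"),
        "flags": [name for name, bit in FLAG_BITS if flags & bit],
        "counts": (qd, an, ns, ar),
    }


# Reconstructed reply: header (QR|RD|RA, rcode 5) plus the echoed question.
header = struct.pack("!HHHHHH", 35104, 0x8185, 1, 0, 0, 0)
question = b"\x04play\x06google\x03com\x00" + struct.pack("!HH", 1, 1)
reply = header + question

print(describe_header(reply))
print(len(reply))
```

The reconstructed message comes to 33 bytes, the same as the MSG SIZE dig reports, so the REFUSED reply is nothing more than a header plus the echoed question.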

Any insights or things to try would be much appreciated.

© Server Fault or respective owner
