I noticed a problem with postfix logs, that information on some of emails delivery status is missing. The issue affects about 1% of emails.
"Healthy" log:
<server># grep 8EB992EFBB44 postfix_log/mail04.log
Jun 5 03:09:29 mail04 postfix/smtpd[8537]: 8EB992EFBB44: client=xxx.xxx.xxx[xxx.xxx.xxx.xxx]
Jun 5 03:09:29 mail04 postfix/cleanup[34349]: 8EB992EFBB44: message-id=<
[email protected]>
Jun 5 03:12:02 mail04 postfix/qmgr[76377]: 8EB992EFBB44: from=<
[email protected]>, size=48845, nrcpt=1 (queue active)
Jun 5 03:15:12 mail04 postfix/smtp[35058]: 8EB992EFBB44: to=<
[email protected]>, relay=mx.baz.com[xxx.xxx.xxx.xxx]:25, conn_use=70, delay=343, delays=153/190/0/0.24, dsn=2.0.0, status=sent (250 ok)
Jun 5 03:15:12 mail04 postfix/qmgr[76377]: 8EB992EFBB44: removed
"Broken" log:
<server># grep F3C362EF37CA postfix_log/mail04.log
Jun 5 04:03:27 mail04 postfix/smtpd[39666]: F3C362EF37CA: client=xxx.xxx.xxx[xxx.xxx.xxx.xxx]
Jun 5 04:03:27 mail04 postfix/cleanup[41287]: F3C362EF37CA: message-id=<
[email protected]>
Jun 5 04:03:28 mail04 postfix/qmgr[76377]: F3C362EF37CA: from=<
[email protected]>, size=48892, nrcpt=1 (queue active)
** here should be a log line from postfix/smtp but there is none **
Jun 5 04:03:29 mail04 postfix/qmgr[76377]: F3C362EF37CA: removed
Background information:
system: FreeBSD xxx.xxx.xxx 8.2-RELEASE FreeBSD 8.2-RELEASE #0: Thu Feb 17 02:41:51 UTC 2011
[email protected]:/usr/obj/usr/src/sys/GENERIC amd64
Postfix is installed inside jail. Logs are on the same machine, log dir is mounted thru nullfs. The site has spikes of heavy load, causing disks (local) to run at 100%.
Update
The log is rotated daily, current size is ~ 500MB.
I made a test by queuing 99000 messages to same destination (in order to rule out dns/network/mx issues). 5715 messages don't have any DSN record. Failed messages queue time is spread evenly over time, i don't see any time-bound issues.
Some undelivered emails:
envelopeid | processed_time
--------------+----------------------------
8D7652EF3BAE | 2012-06-06 13:19:11.072715
DD53A2EF3C5C | 2012-06-06 13:33:24.374783
8C52F2EF4E3F | 2012-06-06 13:39:15.810616
BBC572EF525C | 2012-06-06 13:44:22.762812
E95822EF54D1 | 2012-06-06 13:52:01.134533
839DD2EF4FBB | 2012-06-06 14:13:48.511236
017EE2EF6234 | 2012-06-06 15:04:48.618963
Those are a few picks, such records of undelivered email occur almost every second.
<server># egrep '(8D7652EF3BAE|BBC572EF525C|017EE2EF6234)' mail04.log
Jun 6 13:19:10 mail04 postfix/smtpd[20350]: 8D7652EF3BAE: client=xxx.xxx.xxx[xxx.xxx.xxx.xxx]
Jun 6 13:19:10 mail04 postfix/cleanup[21024]: 8D7652EF3BAE: message-id=<
[email protected]>
Jun 6 13:19:10 mail04 postfix/qmgr[7939]: 8D7652EF3BAE: from=<
[email protected]>, size=63718, nrcpt=1 (queue active)
Jun 6 13:19:11 mail04 postfix/qmgr[7939]: 8D7652EF3BAE: removed
Jun 6 13:44:22 mail04 postfix/smtpd[20346]: BBC572EF525C: client=xxx.xxx.xxx[xxx.xxx.xxx.xxx]
Jun 6 13:44:22 mail04 postfix/cleanup[24811]: BBC572EF525C: message-id=<
[email protected]>
Jun 6 13:44:22 mail04 postfix/qmgr[7939]: BBC572EF525C: from=<
[email protected]>, size=63758, nrcpt=1 (queue active)
Jun 6 15:04:49 mail04 postfix/smtpd[20344]: 017EE2EF6234: client=xxx.xxx.xxx[xxx.xxx.xxx.xxx]
Jun 6 15:04:49 mail04 postfix/cleanup[35585]: 017EE2EF6234: message-id=<
[email protected]>
Jun 6 15:04:49 mail04 postfix/qmgr[7939]: 017EE2EF6234: from=<
[email protected]>, size=63706, nrcpt=1 (queue active)
<server>#
<server># find /var/spool/postfix/active/ -type f -print | wc -l
1
<server>#
IMPORTANT: As you can see above some of emails doesn't event have the removed line.