ZFS for Database Log Files
- by user12620111
I've been troubled by dropouts in CPU usage on my application server,
characterized by the CPUs suddenly going from close to 90% busy to
almost completely idle for a few seconds. Here is an example of a
dropout, shown in a snippet of vmstat data taken while the
application server was under a heavy workload:
# vmstat 1
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s3 s4 s5 s6 in sy cs us sy id
1 0 0 130160176 116381952 0 16 0 0 0 0 0 0 0 0 0 207377 117715 203884 70 21 9
12 0 0 130160160 116381936 0 25 0 0 0 0 0 0 0 0 0 200413 117162 197250 70 20 9
11 0 0 130160176 116381920 0 16 0 0 0 0 0 0 1 0 0 203150 119365 200249 72 21 7
8 0 0 130160176 116377808 0 19 0 0 0 0 0 0 0 0 0 169826 96144 165194 56 17 27
0 0 0 130160176 116377800 0 16 0 0 0 0 0 0 0 0 1 10245 9376 9164 2 1 97
0 0 0 130160176 116377792 0 16 0 0 0 0 0 0 0 0 2 15742 12401 14784 4 1 95
0 0 0 130160176 116377776 2 16 0 0 0 0 0 0 1 0 0 19972 17703 19612 6 2 92
14 0 0 130160176 116377696 0 16 0 0 0 0 0 0 0 0 0 202794 116793 199807 71 21 8
9 0 0 130160160 116373584 0 30 0 0 0 0 0 0 18 0 0 203123 117857 198825 69 20 11
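Spotting these dips by eye in a long vmstat capture is tedious. Here is a quick sketch that flags intervals where the machine goes mostly idle under what should be a steady load, assuming the column layout above (id is the last field) and an arbitrary 50% threshold:
# vmstat 1 | awk 'NR > 2 && $NF > 50 { print "idle spike:", $0 }'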
This behavior occurred consistently while the application server was
processing synthetic transactions: HTTP requests from JMeter running on
an external machine.
I explored many theories to explain the dropouts, including:
Unexpected JMeter behavior
Network contention
Java Garbage Collection
Application Server thread pool problems
Connection pool problems
Database transaction processing
Database I/O contention
Graphing the CPU %idle led to a breakthrough: several of the dropouts
were 30 seconds apart. With that insight, I went digging through the
data again, looking for other outliers on a 30-second cadence. In the
database server statistics, I found spikes in the iostat "asvc_t" column
(average service time of active disk transactions, in milliseconds) for
the disk drive that held the database log files.
Here is an example:
                 extended device statistics
   r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
   0.0 2053.6    0.0  8234.3  0.0  0.2    0.0    0.1   0  24 c3t60080E5...F4F6d0s0
   0.0 2162.2    0.0  8652.8  0.0  0.3    0.0    0.1   0  28 c3t60080E5...F4F6d0s0
   0.0 1102.5    0.0 10012.8  0.0  4.5    0.0    4.1   0  69 c3t60080E5...F4F6d0s0
   0.0   74.0    0.0  7920.6  0.0 10.0    0.0  135.1   0 100 c3t60080E5...F4F6d0s0
   0.0  568.7    0.0  6674.0  0.0  6.4    0.0   11.2   0  90 c3t60080E5...F4F6d0s0
   0.0 1358.0    0.0  5456.0  0.0  0.6    0.0    0.4   0  55 c3t60080E5...F4F6d0s0
   0.0 1314.3    0.0  5285.2  0.0  0.7    0.0    0.5   0  70 c3t60080E5...F4F6d0s0
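A 30-second period is a telling signature: ZFS batches writes into transaction groups, which on Solaris releases of that era were committed every 30 seconds by default (newer releases shortened this to 5 seconds). To check the interval on a live kernel, something like this should work, assuming the tunable is named zfs_txg_timeout on your release:
# echo "zfs_txg_timeout/D" | mdb -k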
Here is a little more information about my database configuration:
The database and application server were running on two different SPARC servers.
Storage for the database was on a storage array connected via 8 Gb/s Fibre Channel.
Data storage and log files were on different physical disk drives.
Reliable low-latency I/O was provided by battery-backed NVRAM.
Highly available:
Two Fibre Channel links accessed via MPxIO
Two mirrored cache controllers
The log file physical disks were mirrored in the storage device
Database log files were on a ZFS filesystem with cutting-edge features, such as copy-on-write and end-to-end checksumming.
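As an aside, the "Tuning ZFS for Database Performance" reference below also recommends matching the ZFS record size to the database block size for the data files. A sketch of that, assuming an 8 KB database block size and a hypothetical dataset name (recordsize only affects files created after the change):
# zfs set recordsize=8k ZFS-db-41/data
# zfs get recordsize ZFS-db-41/data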
Why would I be getting service time spikes in my high-end storage?
First, I wanted to verify that the database log disk service time spikes
aligned with the application server CPU dropouts, and they did.
At first, I guessed that the disk service time spikes might be
related to flushing the write-through cache on the storage device, but I
was unable to validate that theory.
After searching the WWW for a while, I decided to try using a separate log device:
# zpool add ZFS-db-41 log c3t60080E500017D55C000015C150A9F8A7d0
The ZFS log device is configured in the same manner as described above: two physical disks mirrored in the storage array.
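Since the mirroring happens inside the storage array, ZFS sees the log as a single LUN. If I had wanted ZFS itself to do the mirroring, the syntax would be something like the following, with hypothetical device names:
# zpool add ZFS-db-41 log mirror c3t...0001d0 c3t...0002d0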
This change to the database storage configuration eliminated the application server CPU dropouts.
Here is the zpool configuration:
# zpool status ZFS-db-41
  pool: ZFS-db-41
 state: ONLINE
  scan: none requested
config:

        NAME                   STATE
        ZFS-db-41              ONLINE
          c3t60080E5...F4F6d0  ONLINE
        logs
          c3t60080E5...F8A7d0  ONLINE
Now, the I/O spikes look like this:
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 1053.5 0.0 4234.1 0.0 0.8 0.0 0.7 0 75 c3t60080E5...F8A7d0s0
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 1131.8 0.0 4555.3 0.0 0.8 0.0 0.7 0 76 c3t60080E5...F8A7d0s0
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 1167.6 0.0 4682.2 0.0 0.7 0.0 0.6 0 74 c3t60080E5...F8A7d0s0
0.0 162.2 0.0 19153.9 0.0 0.7 0.0 4.2 0 12 c3t60080E5...F4F6d0s0
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 1247.2 0.0 4992.6 0.0 0.7 0.0 0.6 0 71 c3t60080E5...F8A7d0s0
0.0 41.0 0.0 70.0 0.0 0.1 0.0 1.6 0 2 c3t60080E5...F4F6d0s0
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 1241.3 0.0 4989.3 0.0 0.8 0.0 0.6 0 75 c3t60080E5...F8A7d0s0
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 1193.2 0.0 4772.9 0.0 0.7 0.0 0.6 0 71 c3t60080E5...F8A7d0s0
We can see the steady flow of 4 KB writes to the ZIL device (...F8A7d0) from the database's O_SYNC log file writes. The periodic bursts of larger writes to the data device (...F4F6d0) come from flushing the transaction group.
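If you want to watch the synchronous commit activity directly, DTrace can count it. A minimal sketch, assuming the fbt provider can see the kernel's zil_commit function on your release:
# dtrace -n 'fbt::zil_commit:entry { @[execname] = count(); } tick-10s { exit(0); }'
Every O_SYNC write to a file in the pool should show up here as a zil_commit call attributed to the database process.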
Like almost all problems that I run into, once I thoroughly
understand the problem, I find that other people have documented similar
experiences. Thanks to all of you who have documented alternative approaches.
Saved for another day: now that the problem is obvious, I should try
"zfs:zfs_immediate_write_sz" as recommended in the ZFS Evil Tuning
Guide.
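For reference, that tunable goes in /etc/system; a minimal sketch, with the value shown purely for illustration (32 KB is the commonly cited default, and the right setting depends on the workload and the guide's advice):
* Threshold (in bytes) above which synchronous writes are logged
* indirectly rather than copied into the intent log (illustrative value).
set zfs:zfs_immediate_write_sz = 0x8000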
References:
The ZFS Intent Log
Solaris ZFS, Synchronous Writes and the ZIL Explained
ZFS Evil Tuning Guide: Cache Flushes
ZFS Evil Tuning Guide: Tuning ZFS for Database Performance