Using awk to split text file every 10,000 lines

Posted by Sneaky Wombat on Super User See other posts from Super User or by Sneaky Wombat
Published on 2012-10-08T20:59:11Z Indexed on 2012/10/08 21:40 UTC
Read the original article Hit count: 241

Filed under:

I have a large gzip'd text file. I'd like to something like:

zcat BIGFILE.GZ | awk (snag 10,000 lines and redirect to...)|gzip -9 smallerPartFile.gz

the awk part up there, I basically want it to take 10,000 lines and send it to gzip and then repeat until all lines in the original input file are consumed. I found a script that claims to do this, but when I run it on my files and then diff the original to the ones that were split and then merged, lines are missing. So, something is wrong with the awk part and I'm not sure what part is broken.

Here's the code. Can someone tell me why this doesn't yield a file that can be split and merged and then diff'd to the original successfully?

# Generate files part0.dat.gz, part1.dat.gz, etc.
# restore with: zcat foo* | gzip -9 > restoredFoo.sql.gz (or something like that)
prefix="foo"
count=0
suffix=".sql"

lines=10000 # Split every 10000 line.

zcat /home/foo/foo.sql.gz |
while true; do
  partname=${prefix}${count}${suffix}

  # Use awk to read the required number of lines from the input stream.
  awk -v lines=${lines} 'NR <= lines {print} NR == lines {exit}' >${partname}

  if [[ -s ${partname} ]]; then
    # Compress this part file.
    gzip -9 ${partname}
    (( ++count ))
  else
    # Last file generated is empty, delete it.
    rm -f ${partname}
    break
  fi
done

© Super User or respective owner

Related posts about awk