Sunday, June 3, 2012

Making system backup smaller

While taking a snapshot of my system disc, I discovered that the image file produced by dd had grown way too large. This post explains what happened and how to solve it.

dd takes anything

As described in this post, I use dd to take system snapshots, in case I mess up my installation. What dd does is a bit-for-bit copy of the input device, no matter the content. It does not care about the file system, the format, or even whether there is any real information at a given location.
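
For reference, a snapshot taken this way boils down to a single line of this shape (the device name and the destination path are placeholders for the example, not necessarily the ones I use):

dd if=/dev/sda | gzip > /mnt/backup/system_snapshot.img.gz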

System disc full

I had a little mishap while backing up my data with rsync.
A mistake in the destination disc resulted in all my data (500 GB) being copied to the system disc (120 GB).
Since there was not enough room for all the data, rsync copied what it could and then stopped when the system disc was full.
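
The command itself was nothing special; with made-up paths it had this shape, and the whole mistake was that the destination directory turned out to live on the system disc instead of on the backup disc:

rsync -a /data/ /mnt/backup/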

Backup size

The next time I took a system snapshot with dd, the resulting image file was 57 GB. Since the original clean image was 14 GB, this size was unexpected.

Deleted files

When files are deleted from a disc, the memory area they occupy is marked as free. Any program that wants to store data may then use this freed area and overwrite its content.

The act of deleting does not discard the data, though. The bit content is still in place, only marked as overwritable. This is why deleting the content of large folders is so much faster than copying it (about five seconds for 200 GB of data, as I accidentally got to experience on my machine): marking areas as free is much faster than overwriting all the bits.
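
One way to see this for yourself: write a short, recognisable string to a file on the partition in question, sync, delete the file, then search the raw device for the string. It is usually still there. In this sketch, /dev/sda1 is a placeholder for the partition holding the current directory, reading the raw device needs root, and scanning a whole partition takes a while:

echo "an-unlikely-marker-string" > marker.txt
sync
rm marker.txt
sudo grep -a "an-unlikely-marker-string" /dev/sda1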

After filling up my system disc with backup data, the formerly free areas had all been written to. Deleting this data did mark the areas as free again, but the bit content was left in place.

As mentioned before, dd takes all bits, no matter what they mean. This includes the free memory areas. My dd system backup thus included all the backup data I had mistakenly put on the disc and then erased.

The system backups I make are also run through gzip. The areas of my system disc that had never been written to contained homogeneous data (all zeroes or all ones), which gzip compresses very effectively. With real data in these areas, gzip could not compress nearly as much.
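
The difference is easy to measure in isolation: a chunk of zeroes compresses down to almost nothing, while the same amount of random data does not shrink at all. Exact numbers will vary, but the contrast is what matters:

head -c 100M /dev/zero | gzip | wc -c
head -c 100M /dev/urandom | gzip | wc -c

The first command prints a count in the region of a hundred kilobytes, the second slightly more than the full 100 MB.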

In short:
  • The system disc initially had all zeroes or all ones in its unused memory areas,
  • the erroneous data backup wrote data to these formerly unused locations,
  • the data backup was deleted, freeing the memory areas again but leaving the bit values as they were,
  • the next dd backed up these freed memory areas into the image file,
  • gzip could not compress them as well as it compresses large chunks of homogeneous data.

Clearing the free areas

Filling up the free areas is easy: just create a big file full of zeroes, and this file will be placed wherever there is free memory.

Delete the file full of zeroes, and these areas will be marked as free again.

To do this, dd can be used again, with /dev/zero as input. /dev/zero is a pseudo-device that produces an endless stream of zeroes, for as long as it is read.

dd if=/dev/zero of=/big_zero_file
rm /big_zero_file

Since no size limit is given on that command line, dd simply goes on until the disc is full, then stops with a "no space left on device" error, which is expected here.
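
For what it is worth, the same idea with two small additions: a larger block size so the writing does not take forever, and a sync to make sure the zeroes have actually reached the disc before the file is removed.

dd if=/dev/zero of=/big_zero_file bs=1M
sync
rm /big_zero_file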

Next system backup

... was only 19 GB instead of 57. No actual data was removed; only the free areas were filled with zeroes, to give gzip a little push.

This can also serve as a reminder: deleted files are easy to recover until the areas where they were stored are overwritten. Zeroing the free areas as described here is not sufficient either: there are analysis methods that can recreate data even after it has been overwritten. If you really want to make sure that data is destroyed, you should overwrite it with random data using /dev/random, and do that several times. Or bring your disc to a grinder.
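
For the overwrite-with-random-data option, a rough sketch (the device name is a placeholder and the command is destructive, so triple-check it; I use /dev/urandom here rather than /dev/random simply because it does not block while waiting for entropy):

dd if=/dev/urandom of=/dev/sdX bs=1M

Repeat as many times as you see fit. GNU coreutils also ships shred, which performs several random passes on its own:

shred -v -n 3 /dev/sdX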
