
How do I compress my unused data?

Why should I compress my data?

There are many reasons to compress your data. First and foremost is to shrink the size of your files or directories. If you are close to your quota and need to make room, compressing older data is a good place to start. That is not to say you shouldn't delete files you no longer need, just that if you have older files you can't get rid of, compression is a way to reduce the space they take up.

Another reason is for moving data around. It is generally easier to send a single tarred and compressed file to someone you are sharing files with than a full directory tree.

It is also a standard distribution format. If you are writing and distributing software, note that many developers create tarballs for users to download rather than offering individual files or cloning from a repository. This allows them to easily control exactly what users get.

Running the tar commands below to create compressed archives does not remove the originals. If you are trying to save space, you will need to delete the originals once you are confident the archive contains what you need. Note that gzip and bzip2, when run on an individual file, do replace that file with the compressed version.
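
One cautious way to handle the "confident you have what you need" step is to check the archive before deleting anything. The following is a minimal sketch, assuming GNU tar (the -d compare option may not exist in other tar implementations); the directory name old_results and its contents are hypothetical and are created in a temporary directory just for the demonstration.

```shell
# Sketch: verify a tarball before deleting the original directory.
# "old_results" is a hypothetical directory created here for the demo.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

mkdir old_results
printf 'sample data\n' > old_results/run1.txt

# Create the gzip-compressed tarball; the original directory stays in place.
tar -czf old_results.tar.gz old_results/

# Check that the compressed stream is intact, then compare the archive
# members against the files still on disk (GNU tar --diff).
# Both commands exit non-zero if they find a problem, which stops the script.
gzip -t old_results.tar.gz
tar -dzf old_results.tar.gz

# Only now remove the original to free the space.
rm -r old_results
ls
```

Because of set -e, the rm line is only reached if both checks succeed.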

Creating a tarball (compressing a directory)

A common way to compress large amounts of data in Linux is to create a tarball. A tarball is an archive of files created with tar, usually compressed afterwards. Tar is a command originally written to write files to tape for backups (T[ape] AR[chiver]). You have probably seen tarballs when downloading software for Linux.

The simplest way to create a tarball with gzip compression is to do it at the directory level, like the following:

tar zcvf directory_name.tar.gz directory_name/

To do the same, but compress with bzip2, do the following:

tar jcvf directory_name.tar.bz2 directory_name/

Note - these do not remove the original directories
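
To see how much space a tarball actually saves, you can compare sizes with du. This is a rough sketch using generated, highly repetitive sample data, so the ratio shown will be better than for most real data. The bzip2 step is skipped if bzip2 is not installed.

```shell
# Sketch: compare the size of a directory with its compressed tarballs.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Build a sample directory with repetitive text (compresses very well).
mkdir directory_name
i=0
while [ "$i" -lt 2000 ]; do
    echo "line $i of repetitive sample data" >> directory_name/data.txt
    i=$((i + 1))
done

tar -czf directory_name.tar.gz directory_name/
du -sh directory_name directory_name.tar.gz    # human-readable sizes

# Optional: the same with bzip2, if it is available on this system.
if command -v bzip2 >/dev/null 2>&1; then
    tar -cjf directory_name.tar.bz2 directory_name/
    du -sh directory_name.tar.bz2
fi
```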

Compress and decompress individual files

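If you want to compress an individual file:

Using gzip:
gzip file_to_compress
Using bzip2:
bzip2 file_to_compress

These commands replace the original file with file_to_compress.gz or file_to_compress.bz2.

To decompress a file:

Using gzip:
gunzip file_to_decompress.gz
Using bzip2:
bunzip2 file_to_decompress.bz2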
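
By default gzip and bzip2 replace the file they are given with its compressed version. If you would rather keep the original until you have checked the result, one option is to write to standard output with -c and redirect it. A minimal sketch, with a hypothetical file notes.txt created in a temporary directory:

```shell
# Sketch: compress a single file while keeping the original.
set -eu

workdir=$(mktemp -d)
cd "$workdir"
printf 'hello from the cluster\n' > notes.txt    # hypothetical sample file

# -c writes to standard output, so notes.txt itself is left untouched.
gzip -c notes.txt > notes.txt.gz

# Decompress to a new name and confirm it matches the original byte for byte.
gunzip -c notes.txt.gz > notes_restored.txt
cmp notes.txt notes_restored.txt && echo "round trip OK"
```

The same -c approach works with bzip2 and bunzip2.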

Working with tarballs

To get the list of files/directories in a gzip-compressed tarball:

tar ztvf filename.tar.gz

To get the list of files/directories in a bzip2-compressed tarball:

tar jtvf filename.tar.bz2

To extract a gzip-compressed or bzip2-compressed tarball:

tar zxvf filename.tar.gz
tar jxvf filename.tar.bz2
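
By default tar extracts into the current directory. The sketch below shows two common variations: extracting into another directory with -C, and extracting a single member by name. The project directory and its files are hypothetical and are created only for the demonstration.

```shell
# Sketch: extract into another directory, and extract a single member.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Build a small sample tarball.
mkdir -p project/src
printf 'int main(void) { return 0; }\n' > project/src/main.c
printf 'demo\n' > project/README
tar -czf project.tar.gz project/

# Extract everything into a separate directory (it must already exist).
mkdir restore_all
tar -xzf project.tar.gz -C restore_all

# Extract just one member; the path must match what "tar -tzf" lists.
mkdir restore_one
tar -xzf project.tar.gz -C restore_one project/README

ls -R restore_all restore_one
```

Listing the archive first with the t option is the easiest way to find the exact member path to pass on the command line.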

Creating tarballs with parallel compression tools

Sometimes the compression can be the longest part of creating a tarball. In these cases you can use a parallel compression tool to speed things up. To do this, you tell tar to call an external program for the compression rather than using its built-in z or j options.

To create a tarball using pigz (parallel gzip), use the following:

tar -cvf directory_name.tar.gz -I pigz directory_name

To create a tarball using pbzip2 (parallel bzip2), use the following:

tar -cvf directory_name.tar.bz2 -I pbzip2 directory_name

Note: you can use the same method when decompressing tarballs by replacing c with x (for example, tar -xvf directory_name.tar.gz -I pigz).
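
The following sketch shows a full create and extract round trip with an external compressor. It assumes GNU tar, where -I is short for --use-compress-program, and it falls back to plain gzip when pigz is not installed. The compressor variable and the extracted directory are names chosen for this example.

```shell
# Sketch: create and extract a tarball through an external compressor.
set -eu

workdir=$(mktemp -d)
cd "$workdir"
mkdir directory_name
printf 'sample\n' > directory_name/file.txt

# Use pigz when it is installed; otherwise fall back to plain gzip.
if command -v pigz >/dev/null 2>&1; then
    compressor=pigz
else
    compressor=gzip
fi

# Create: tar pipes the archive through the chosen program.
tar -cf directory_name.tar.gz -I "$compressor" directory_name

# Extract: tar runs the same program in decompress mode.
mkdir extracted
tar -xf directory_name.tar.gz -I "$compressor" -C extracted

ls -R extracted
```

Files written by pigz are ordinary gzip files, so they can also be read with the z option or with gunzip.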

Better understanding tar options

To better understand the options you are using in tar, here is a quick rundown:

  • x  -  extract an archive
  • c  -  create an archive
  • z  -  gzip or gunzip the archive
  • j  -  bzip2 or bunzip2 the archive
  • v  -  use verbose mode; this shows what tar is doing rather than running silently
  • f  -  the archive file to use or create
  • t  -  list the files in an archive
  • J  -  use the less common xz compression method
  • I  -  use the named external program (such as pigz or pbzip2) for compression
To get even more information, you can read the man page for any of these commands, for example "man tar".
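
The option letters can be bundled together, as in the examples on this page, or written out one at a time. The sketch below creates the same archive both ways and confirms that the two listings match. The file names are chosen for this example.

```shell
# Sketch: bundled option letters versus separately written options.
set -eu

workdir=$(mktemp -d)
cd "$workdir"
mkdir directory_name
printf 'sample\n' > directory_name/file.txt

# Bundled letters: c (create) + z (gzip) + f (archive file).
tar czf bundled.tar.gz directory_name/

# The same request with each option written out separately.
tar -c -z -f separate.tar.gz directory_name/

# List both archives (t) and check that they contain the same members.
tar tzf bundled.tar.gz > bundled.list
tar -t -z -f separate.tar.gz > separate.list
cmp bundled.list separate.list && echo "same contents"
```

In both forms the archive name must come directly after f, since f takes the file name as its argument.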