Get uncompressed size of a .gz file in python
Using gzip, tell() returns the offset in the uncompressed file. In order to show a progress bar, I want to know the original (uncompressed) size of the file. Is there an easy way to find out?
The gzip format specifies a field called ISIZE:
This contains the size of the original (uncompressed) input data modulo 2^32.
In gzip.py, which I assume is what you're using for gzip support, there is a method called _read_eof, defined as such:
    def _read_eof(self):
        # We've read to the end of the file, so we have to rewind in order
        # to reread the 8 bytes containing the CRC and the file size.
        # We check the that the computed CRC and size of the
        # uncompressed data matches the stored values. Note that the size
        # stored is the true file size mod 2**32.
        self.fileobj.seek(-8, 1)
        crc32 = read32(self.fileobj)
        isize = U32(read32(self.fileobj))  # may exceed 2GB
        if U32(crc32) != U32(self.crc):
            raise IOError, "CRC check failed"
        elif isize != LOWU32(self.size):
            raise IOError, "Incorrect length of data produced"
There you can see that the ISIZE field is being read, but only to compare it to self.size for error detection. This should then mean that GzipFile.size stores the actual uncompressed size. However, I think it's not exposed publicly, so you might have to hack it in to expose it. Not so sure, sorry.
I just looked all of this up right now, and I haven't tried it so I could be wrong. I hope this is of some use to you. Sorry if I misunderstood your question.
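For the progress bar itself, one alternative that avoids touching GzipFile internals is to measure progress in compressed bytes: open the raw file yourself, pass it to gzip.GzipFile through the fileobj argument, and compare the raw file's tell() with the size on disk. This is only a sketch. The helper name iter_chunks_with_progress and the chunk size are my own choices, and since gzip reads the raw file in buffered blocks, the fraction runs somewhat ahead of the data actually delivered.

```python
import gzip
import os


def iter_chunks_with_progress(path, chunk_size=64 * 1024):
    """Yield (chunk, fraction) pairs; fraction is measured in compressed bytes.

    Hypothetical helper, not part of the gzip module.
    """
    compressed_size = os.path.getsize(path)
    with open(path, 'rb') as raw:
        # GzipFile pulls its input from `raw`, so raw.tell() shows how much
        # of the compressed file has been consumed so far.
        with gzip.GzipFile(fileobj=raw, mode='rb') as gz:
            while True:
                chunk = gz.read(chunk_size)
                if not chunk:
                    break
                yield chunk, raw.tell() / float(compressed_size)
```

Because the total is the compressed size on disk, this does not depend on ISIZE and is unaffected by the 4GB wrap-around.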
The uncompressed size is stored in the last 4 bytes of the gzip file. We can read the binary data and convert it to an int. (This will only work for files under 4GB.)

    import struct

    def getuncompressedsize(filename):
        with open(filename, 'rb') as f:
            f.seek(-4, 2)  # the last 4 bytes hold ISIZE
            # '<I' = little-endian unsigned 32-bit; unpack returns a tuple
            return struct.unpack('<I', f.read(4))[0]
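    import subprocess

    def get_uncompressed_size(file):
        # `gzip -l` prints a header line, then one line with the columns:
        # compressed, uncompressed, ratio, name
        output = subprocess.check_output(['gzip', '-l', file]).decode()
        fields = output.splitlines()[1].split()
        return int(fields[1])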
Unix way: run "gunzip -l file.gz" via subprocess.check_output / os.popen, then capture and parse its output.
The last 4 bytes of the .gz hold the original size of the file
sudo apt-get install gzip. Once installed, you can run the tool with the -l command-line option and the name of the archive to see the size-related details: gzip -l [archive-name]. The figure listed under the 'uncompressed' column is what you want.
I am not sure about performance, but this could be achieved without knowing gzip magic by using:

    import gzip
    import io

    with gzip.open(filepath, 'rb') as file_obj:
        file_size = file_obj.seek(0, io.SEEK_END)

This should also work for other (compressed) stream readers like bz2 or the plain open.
EDIT: As suggested in the comments, the 2 in the second line was replaced by io.SEEK_END, which is definitely more readable and probably more future-proof.
EDIT: Works only in Python 3.
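On the performance question: seeking to the end of a gzip stream still has to decompress everything before that point, so the cost grows with the size of the file. Where seeking from the end is not available, the same idea can be sketched as a plain read-and-count loop. The name count_uncompressed_bytes is a hypothetical helper and the 1 MiB chunk size is an arbitrary choice.

```python
import gzip


def count_uncompressed_bytes(path, chunk_size=1024 * 1024):
    """Return the uncompressed size by decompressing the file chunk by chunk.

    Hypothetical helper: it reads the whole file, so it is slow for large
    archives, but it never holds more than one chunk in memory.
    """
    total = 0
    with gzip.open(path, 'rb') as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

Unlike reading the last 4 bytes, this counts past 4GB and also adds up every member of a file that was appended to.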
Sample gzip -l output line (compressed size, uncompressed size, ratio, name): 15024079 50187117 70.1% compressedfile
If you want to determine the uncompressed size of a gzip file from within a program, you can extract the original file size from the gzip file itself. This size is stored in the last 4 bytes of the file. This will only provide the correct value if the uncompressed file was smaller than 4 GB.
From the Python 3.8.3 gzip documentation: when fileobj is given, the filename argument is only used to be included in the gzip file header, which may include the original filename of the uncompressed file.
The gzip module provides the GzipFile class, which is modeled after Python's file object. The GzipFile class reads and writes gzip-format files, automatically compressing or decompressing the data so that it looks like an ordinary file object.
The Python gzip module provides a very simple way to compress and decompress files, and works in a similar manner to the GNU programs gzip and gunzip. In this lesson, we will study which classes in this module allow us to perform these operations, along with the additional functions it provides.
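As a small illustration of that file-like behaviour, here is a minimal round trip through gzip.open. The payload and the temporary file are made up for the example. After reading everything, tell() reports the offset in the uncompressed stream, which is why it cannot be compared directly with the size of the .gz file on disk.

```python
import gzip
import os
import tempfile

payload = b'example line\n' * 1000

# Create a throwaway .gz file; gzip.open in 'wb' mode returns a GzipFile.
fd, path = tempfile.mkstemp(suffix='.gz')
os.close(fd)
with gzip.open(path, 'wb') as gz_out:
    gz_out.write(payload)

# Reading looks like reading an ordinary binary file.
with gzip.open(path, 'rb') as gz_in:
    restored = gz_in.read()
    position = gz_in.tell()  # offset in the uncompressed data

compressed_size = os.path.getsize(path)
os.remove(path)

print(len(restored), position, compressed_size)
```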
- I guess this is good enough. In the case of a file larger than 4G, it is easy to add some heuristics to the progress bar to set the file size to 4G + ISIZE if tell() indicates we are too close to ISIZE.
- I need to do the same thing and I am trying to extend the GzipFile class to give the file size, but I am unsuccessful. How did you get it working?
- Update: This function works for me: code.activestate.com/lists/python-list/245777
- Note this isn't completely foolproof insofar as a gzip file that was appended to will only have the size of the last appended portion... See: pastebin.com/82zyV3k9 - the second '1000' here should actually be 2000, but it's just the size of the last block that was appended...
- Open the file
error: unpack requires a string argument of length 4.
- This is exactly what is shown in Jorge Israel Peña's older answer, so while your answer provides a handy function, it does not add much to the topic. Moreover, as the comments on the older answer say, depending only on the last 4 bytes is actually NOT 100% foolproof, as the gzip format allows you to append new blocks at the end of the file.
- Never touch operating systems that are older than me... Seriously speaking: I'm looking for a python solution, as the code is for all platforms.
- Windows is at least 24 or 25 years old. Version 1 came out around 1985 or so. How old are you?
- 44.5 (and last used Unix at 18)
- The last 4 bytes are the "size of the original (uncompressed) input data modulo 2^32." (gzip.org/zlib/rfc-gzip.html)
file_size = file_obj.seek(0, io.SEEK_END)
- Python 3! No Python 2 tho!
ValueError: Seek from end not supported. However: struct.unpack works on 2.7!
- Mark, I've recently been working on some programmatic manipulation of gzip files and often see many of your answers at the bottom of the stack with one or two votes. I guess people don't recognize you. Thanks for your tremendous contributions and for contributing your answers to compression questions despite the lack of recognition.
- .tell() works great. What I'm looking for is the original file size.
- What if the file is enormous?
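The progress-bar heuristic from the comments (treat the size as 4G + ISIZE once tell() passes ISIZE) can be sketched as below. It assumes isize was read from the last 4 bytes of the file and position comes from tell(). The function name estimate_total_size is hypothetical. The sketch generalizes the idea slightly by adding another 2**32 each time the reader passes the current estimate.

```python
def estimate_total_size(isize, position):
    """Guess the uncompressed size from ISIZE and the current tell() offset.

    ISIZE is the true size modulo 2**32, so the true size is
    isize + k * 2**32 for some k >= 0. This picks the smallest k for
    which the estimate is not already behind the reader.
    """
    wrap = 2 ** 32
    total = isize
    while total < position:
        total += wrap
    return total
```

A progress bar using this would jump backwards each time the estimate grows, and the guess is still wrong for a file that was appended to, since ISIZE then only describes the last appended portion.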