Python decompress lzip

So the context is this: a zip file is uploaded into a web service, and Python then needs to extract it and analyze and deal with each file within. In this particular application, it looks at each file's individual name and size, compares that to what has already been uploaded in AWS S3, and if the file is believed to be different or new, it gets uploaded to AWS S3.
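A check along those lines might look like this; a minimal sketch assuming boto3, where the needs_upload helper and its arguments are illustrative rather than taken from the post:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def needs_upload(bucket, key, size):
    # Compare against what's already in S3: upload only if the
    # object is missing or its stored size differs.
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
    except ClientError:
        return True  # not uploaded yet
    return head["ContentLength"] != size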

The challenge is that these zip files that come in are huuuge. The average is 560MB but some are as much as 1GB. Within them there are mostly plain text files, but there are some binary files in there too that are huge. It's not unusual that each zip file contains 100 files, and 1-3 of those make up 95% of the zip file size.

At first I tried unzipping the file, in memory, and dealing with one file at a time. That failed spectacularly with various memory explosions and EC2 running out of memory: first you have the 1GB file in RAM, then you unzip each file and now you have possibly 2-3GB all in memory. So the solution, after much testing, was to dump the zip file to disk (in a temporary directory in /tmp) and then iterate over the files.
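Roughly like this; a sketch of the dump-to-disk approach, where process_upload and its file-like stream argument are assumptions for illustration:

import os
import shutil
import tempfile
import zipfile

def process_upload(stream, dest):
    # Spool the uploaded zip to disk instead of holding it all in RAM.
    with tempfile.NamedTemporaryFile(suffix=".zip", dir="/tmp", delete=False) as tmp:
        shutil.copyfileobj(stream, tmp)
        path = tmp.name
    try:
        with zipfile.ZipFile(path) as zf:
            # Deal with one member at a time.
            for member in zf.infolist():
                zf.extract(member, dest)
    finally:
        os.remove(path)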

This worked much better, but I still noticed the whole unzipping was taking up a huge amount of time. Is there perhaps a way to optimize that?

Baseline function

First, it's these common functions that simulate actually doing something with the files in the zip file:

import concurrent.futures
import os
import zipfile

def _count_file(fn):
    # Simulated "work": read the extracted file back and count its bytes.
    total = 0
    with open(fn, 'rb') as f:
        while True:
            chunk = f.read(32 * 1024)
            if not chunk:
                break
            total += len(chunk)
    return total

def unzip_member_f3(zip_filepath, filename, dest):
    # Each worker opens the zip file from disk and extracts one member.
    with open(zip_filepath, 'rb') as f:
        zf = zipfile.ZipFile(f)
        zf.extract(filename, dest)
    fn = os.path.join(dest, filename)
    return _count_file(fn)

def f3(fn, dest):
    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        futures = []
        with concurrent.futures.ProcessPoolExecutor() as executor:
            # Submit one extraction job per member in the archive.
            for member in zf.infolist():
                futures.append(
                    executor.submit(
                        unzip_member_f3,
                        fn,
                        member.filename,
                        dest,
                    )
                )
            total = 0
            for future in concurrent.futures.as_completed(futures):
                total += future.result()
    return total
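Running it could look like the following; the paths are placeholders, and the __main__ guard matters because ProcessPoolExecutor starts worker processes that re-import the module on spawn-based platforms (Windows, recent macOS):

import time

if __name__ == "__main__":
    t0 = time.time()
    total = f3("/tmp/big.zip", "/tmp/out")
    print(total, "bytes processed in", round(time.time() - t0, 2), "seconds")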


The problem with using a pool of processes is that it requires that the original zip file exists on disk: everything handed to a worker process has to be pickled, so each worker receives the file's path and opens the archive itself rather than sharing one in-memory file object.
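That constraint is easy to demonstrate; a purely illustrative snippet (the path string is made up):

import io
import pickle
import zipfile

# submit() pickles its arguments before sending them to a worker.
# A path string pickles fine...
pickle.dumps("/tmp/upload.zip")

# ...but an open archive backed by an in-memory buffer does not.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "hello")

try:
    pickle.dumps(zipfile.ZipFile(buf))
except TypeError as exc:
    print(exc)  # e.g. cannot pickle '_io.BytesIO' object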