Bug report
Bug description:
I'm using the zipfile module to create a ZIP file with thousands of members. For each member, I first create the corresponding zipfile.Path object and then call .open() on it.
The implementation of open() contains a check whether the file already exists when opening a file in read mode:
if not self.exists() and zip_mode == 'r':
    raise FileNotFoundError(self)
However, self.exists() is called even in write mode, because it is the first operand of the and operator and is therefore evaluated before zip_mode is checked.
The call to self.exists() is quite slow, because it requires computing the ZIP file's .namelist(), which in turn requires computing all the implied directories. This is the reason the FastLookup optimization exists for ZIP files in read mode.
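To illustrate what "implied directories" means here, the following is a small sketch with a made-up archive containing a single member, `a/b/c.txt`. The function name `implied_directory_demo` is hypothetical. The archive stores no entry for `a/b/`, yet zipfile.Path reports it as existing, which is why the name list has to be expanded before an existence check can be answered.

```python
import io
import zipfile


def implied_directory_demo():
    """Return the raw member names and two existence checks on directories."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w") as zf:
        zf.writestr("a/b/c.txt", "content")
    data = buffer.getvalue()

    # A plain ZipFile lists only the entries that are actually stored.
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        raw_names = zf.namelist()

    # zipfile.Path also recognises parent directories that have no entry.
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        implied_exists = zipfile.Path(zf, "a/b/").exists()
        missing_exists = zipfile.Path(zf, "missing/").exists()

    return raw_names, implied_exists, missing_exists
```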
I found this issue after profiling my application and was surprised that more than half of its execution time was spent in computing implied ZIP directories.
I propose swapping the operands of and so that the check looks like this:
if zip_mode == "r" and not self.exists():
    raise FileNotFoundError(self)
This will cause the self.exists() check to only be run in read mode, where the check is fast because of FastLookup anyway.
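The effect of the operand order can be shown with a small stand-in class. This is only a sketch: `CountingPath`, `check_original`, `check_proposed`, and `count_exists_calls` are hypothetical names and not part of zipfile. The class counts how often exists() runs under each version of the condition.

```python
class CountingPath:
    """Hypothetical stand-in for zipfile.Path that counts exists() calls."""

    def __init__(self, names):
        self.names = set(names)
        self.exists_calls = 0

    def exists(self):
        self.exists_calls += 1
        return "member.txt" in self.names

    def check_original(self, zip_mode):
        # exists() is the first operand, so it always runs.
        if not self.exists() and zip_mode == "r":
            raise FileNotFoundError(self)

    def check_proposed(self, zip_mode):
        # The mode test short-circuits, so exists() runs only in read mode.
        if zip_mode == "r" and not self.exists():
            raise FileNotFoundError(self)


def count_exists_calls(check_name, zip_mode):
    """Run one check on an empty archive stand-in and report exists() calls."""
    path = CountingPath(names=[])
    try:
        getattr(path, check_name)(zip_mode)
    except FileNotFoundError:
        pass
    return path.exists_calls
```

In write mode the original ordering performs one exists() call and the proposed ordering performs none. In read mode both orderings perform exactly one call and raise the same error for a missing member.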
I'm happy to provide a pull request with this change.
CPython versions tested on:
3.11, 3.12
Operating systems tested on:
Linux, macOS
Linked PRs
zipfile.Path.exists check in write mode #126576
zipfile.Path.exists check in write mode (GH-126576) #126642
zipfile.Path.exists check in write mode (GH-126576) #126643