Ben Hoyt
2014-07-14 00:33:16 UTC
Hi folks,
Thanks Victor, Nick, Ethan, and others for continued discussion on the
scandir PEP 471 (most recent thread starts at
https://mail.python.org/pipermail/python-dev/2014-July/135377.html).
Just an aside ... I was reminded again recently why scandir() matters:
a scandir user emailed me the other day, saying "I used scandir to
dump the contents of a network dir in under 15 seconds. 13 root dirs,
60,000 files in the structure. This will replace some old VBA code
embedded in a spreadsheet that was taking 15-20 minutes to do the
exact same thing." I asked if he could run scandir's benchmark.py on
his directory tree, and here's what it printed out:
C:\Python34\scandir-master>benchmark.py "\\my\network\directory"
Using fast C version of scandir
Priming the system's cache...
Benchmarking walks on \\my\network\directory, repeat 1/3...
Benchmarking walks on \\my\network\directory, repeat 2/3...
Benchmarking walks on \\my\network\directory, repeat 3/3...
os.walk took 8739.851s, scandir.walk took 129.500s -- 67.5x as fast
That's right -- os.walk() with scandir was almost 70x as fast as the
current version! Admittedly this is a network file system, but that's
still a real and important use case. It really pays not to throw away
information the OS gives you for free. :-)
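To make "not throwing away information" concrete, here is a minimal sketch of a top-down walk built on scandir. It is not the scandir module's actual implementation: it assumes the os.scandir() API as it later shipped in the standard library (entry.path, entry.is_dir(), entry.is_symlink(); Python 3.6+ for the with-statement form), and simple_walk is a hypothetical name used only for illustration.

```python
import os


def simple_walk(top, followlinks=False):
    """Yield (top, dirnames, filenames) tuples, top-down, like os.walk()."""
    dirs, files, to_visit = [], [], []
    with os.scandir(top) as entries:
        for entry in entries:
            # is_dir() can usually be answered from data the OS already
            # returned while listing the directory, so no extra stat call
            # is needed for most entries.
            if entry.is_dir():
                dirs.append(entry.name)
                # Like os.walk(), list symlinks to directories in dirs,
                # but only descend into them when followlinks is set.
                if followlinks or not entry.is_symlink():
                    to_visit.append(entry.path)
            else:
                files.append(entry.name)
    yield top, dirs, files
    for path in to_visit:
        yield from simple_walk(path, followlinks)
```

Unlike os.walk(), this sketch does no error handling and does not let the caller prune the traversal by editing the dirs list in place.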
On the recent python-dev thread, Victor especially made some
well-thought-out suggestions. It seems to me there's general agreement
that the basic API in PEP 471 is good (Ethan wasn't a fan at first, but
it seems he's on board after further discussion :-).
That said, I think there's basically one thing remaining to decide:
whether or not to have DirEntry.is_dir() and .is_file() follow
symlinks by default. I think Victor made a pretty good case that:
(a) following links is usually what you want
(b) that's the precedent set by the similar functions os.path.isdir()
and pathlib.Path.is_dir(), so to do otherwise would be confusing
(c) with the non-link-following version, if you wanted to follow links
you'd have to say something like "if (entry.is_symlink() and
os.path.isdir(entry.full_name)) or entry.is_dir()" instead of just "if
entry.is_dir()"
(d) it's error-prone to have to do (c), as I found out recently when a
bug in my implementation of os.walk() with scandir turned out to come
from getting this exact test wrong
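To illustrate points (c) and (d), here is a minimal sketch of the test you would have to write on top of a non-link-following is_dir(). It assumes the os.scandir() API as it later shipped in the standard library: entry.path stands in for entry.full_name, and is_dir(follow_symlinks=False) stands in for the non-link-following method. The helper names is_dir_following_links and demo are hypothetical.

```python
import os
import tempfile


def is_dir_following_links(entry):
    """Link-following directory test built from non-following pieces."""
    if entry.is_symlink():
        # For a symlink we have to ask about the target, which costs
        # an extra stat call on the full path.
        return os.path.isdir(entry.path)
    # Not a symlink: the entry's own type is the answer.
    return entry.is_dir(follow_symlinks=False)


def demo():
    """Build a tiny tree and classify each entry by name."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "real_dir"))
        open(os.path.join(root, "plain_file"), "w").close()
        # Creating symlinks may need extra privileges on Windows.
        os.symlink(os.path.join(root, "real_dir"),
                   os.path.join(root, "link_to_dir"))
        with os.scandir(root) as entries:
            return {e.name: is_dir_following_links(e) for e in entries}
```

It is easy to get the branches wrong, for example by testing only entry.is_dir(follow_symlinks=False) and silently treating every symlink to a directory as a file.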
If we go with Victor's link-following .is_dir() and .is_file(), then
we probably need to add his suggestion of a follow_symlinks parameter
(defaulting to True) that you can set to False to get information about
the link itself. Either that or you have to say
"stat.S_ISDIR(entry.lstat().st_mode)" instead, which is a little bit
less nice.
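The two spellings can be compared side by side. This is a sketch under assumptions, not the PEP's final wording: it uses the os.scandir() API as it later shipped in the standard library, where entry.stat(follow_symlinks=False) plays the role of entry.lstat(). The helper names classify and demo_classify are hypothetical.

```python
import os
import stat
import tempfile


def classify(entry):
    """Return (followed, not_followed, via_lstat) directory checks."""
    # Default behaviour: follow symlinks, like os.path.isdir().
    followed = entry.is_dir()
    # Keyword form: ask about the directory entry itself.
    not_followed = entry.is_dir(follow_symlinks=False)
    # Longhand equivalent of the keyword form via the lstat-style result.
    via_lstat = stat.S_ISDIR(entry.stat(follow_symlinks=False).st_mode)
    return followed, not_followed, via_lstat


def demo_classify():
    """Classify a real directory, a plain file, and a symlink to a directory."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "real_dir"))
        open(os.path.join(root, "plain_file"), "w").close()
        # Creating symlinks may need extra privileges on Windows.
        os.symlink(os.path.join(root, "real_dir"),
                   os.path.join(root, "link_to_dir"))
        with os.scandir(root) as entries:
            return {e.name: classify(e) for e in entries}
```

The keyword form and the longhand form agree for every entry; the only entry where the default differs from them is the symlink to a directory.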
As a KISS enthusiast, I admit I'm still somewhat partial to the
DirEntry methods just returning (non-link following) info about the
*directory entry* itself. However, I can definitely see the
error-proneness of that, and the advantages given the points above. So
I guess I'm on the fence.
Given the above arguments for symlink-following is_dir()/is_file()
methods (have I missed any, Victor?), what do others think?
I'd be very keen to come to a consensus on this, so that I can make
some final updates to the PEP and see about getting it accepted and/or
implemented. :-)
-Ben