tl;dr - I agree with your points and think that the original PEP 471
proposal is fine. The details here are just clarification of why my
proposal wasn't just "use PEP 471 as written" in the first place...
Post by Ben Hoyt1) It's a nasty API to actually write code with. If you try to use it,
it gives off a "made only for low-level library authors" rather than
"designed for developers" smell. For example, here's a get_tree_size()
function I use written in both versions (original is the PEP 471
"""Return total size of all files in directory tree at path."""
total = 0
total += get_tree_size_original(entry.full_name)
total += entry.lstat().st_size
return total
"""Return total size of all files in directory tree at path."""
total = 0
is_dir = entry.is_dir
size = entry.st_size
st = os.lstat(entry.full_name)
is_dir = stat.S_ISDIR(st.st_mode)
size = st.st_size
total += get_tree_size_new(entry.full_name)
total += size
return total
I know which version I'd rather write and maintain!
Fair point. But *only* because is_dir isn't guaranteed to be
available. I could debate other aspects of your translation to use my
API, but it's not relevant as my proposal was flawed in terms of
is_XXX.
Post by Ben HoytIt seems to me new
users and folks new to Python could easily write the top version, but
the bottom is longer, more complicated, and harder to get right.
Given the is_dir point, agreed.
Post by Ben HoytIt
would also be very easy to write code in a way that works on Windows
but bombs hard on POSIX.
You may have a point here - my Windows bias may be showing. It's
already awfully easy to write code that works on POSIX but bombs hard
on Windows (deleting open files, for example) so I find it tempting to
think "give them a taste of their own medicine" :-)
More seriously, it seems to me that the scandir API is specifically
designed to write efficient code on platforms where the OS gives
information that allows you to do so. Warping the API too much to
cater for platforms where that isn't possible seems to have the
priorities backwards. Making the API not be an accident waiting to
happen is fine, though.
And let's be careful, too. My position is that it's not too hard to
write code that works on Windows, Linux and OS X but you're right you
could miss the problem with platforms that don't even support a free
is_dir(). It's *easier* to write Windows-only code by mistake, but the
fix to cover the "big three" is pretty simple (if not hasattr, lstat).
Post by Ben Hoyt2) It seems like your assumption is that is_dir/is_file/is_symlink are
always available on POSIX via readdir. This isn't actually the case
(this was discussed in the original threads) -- if readdir() returns
dirent.d_type as DT_UNKNOWN, then you actually have to call os.stat()
anyway to get it. So, as the above definition of get_tree_size_new()
is_dir/is_file/is_symlink as well as the st_* attributes.
Ah, the wording in the PEP says "Linux, Windows, OS X". Superficially,
that said "everywhere" to me. It might be worth calling out
specifically some examples where it's not available without an extra
system call, just to make the point explicit.
You're right, though, that blows away the simplicity of my proposal.
The original PEP 471 seems precisely right to me, in that case.
I was only really arguing for attributes because they seem more
obviously static than a method call. And personally I don't care about
that aspect.
Post by Ben Hoyt3) It's not much different in concept to the PEP 471 version, except
that PEP 471 has a built-in .lstat() method, making the user's life
much easier. This is the sense in which it's the worst of both worlds
-- it's a far less nice API to use, but it still has the same issues
with race conditions the original does.
Agreed. My intent was never to remove the race conditions, I see them
as the responsibility of the application to consider (many
applications simply won't care, and those that do will likely want a
specific solution, not a library-level compromise).
Post by Ben HoytFirst, based on the +1's to Paul's new solution, I don't think people
are too concerned about the race condition issue (attributes being
different between the original readdir and the os.stat calls). I think
this is probably fair -- if folks care, they can handle it in an
application-specific way. So that means Paul's new solution and the
original PEP 471 approach are both okay on that score.
+1. That was my main point, in actual fact
Post by Ben HoytSecond, comparing PEP 471 to Nick's solution: error handling is much
more straight-forward and simple to document with the original PEP 471
approach (just try/catch around the function calls) than with Nick's
get_lstat=True approach of doing the stat() if needed inside the
iterator. To catch errors with that approach, you'd either have to do
a "while True" loop and try/catch around next(it) manually (which is
very yucky code), or we'd have to add an onerror callback, which is
somewhat less nice to use and harder to document (signature of the
callback, exception object, etc).
Agreed. If my solution had worked, it would have been by isolating a
few extra cases where you could guarantee errors won't happen. But
actually, errors *can* happen in those cases, on certain systems. So
PEP 471 wins on all counts here too.
Post by Ben HoytSo given all of the above, I'm fairly strongly in favour of the
approach in the original PEP 471 due to it's easy-to-use API and
straight-forward try/catch approach to error handling. (My second
option would be Nick's get_lstat=True with the onerror callback. My
third option would be Paul's attribute-only solution, as it's just
very hard to use.)
Agreed. The solution I proposed isn't just "very hard to use", it's
actually wrong. If is_XXX are optional attributes, that's not my
solution, and I agree it's *awful*.
Paul.
PS I'd suggest adding a "Rejected proposals" section to the PEP which
mentions the race condition issue and points to this discussion as an
indication that people didn't seem to see it as a problem.
Post by Ben HoytThanks for the effort in your response, Paul.
I'm all for KISS, but let's just slow down a bit here.
Post by Paul MooreI think that thin wrapper is needed - even
if the various bells and whistles are useful, they can be built on top
of a low-level version (whereas the converse is not the case).
Yes, but API design is important. For example, urllib2 has a kind of
the "thin wrapper approach", but millions of people use the 3rd-party
"requests" library because it's just so much nicer to use.
There are low-level functions in the "os" module, but there are also a
lot of higher-level functions (os.walk) and functions that smooth over
cross-platform issues (os.stat).
Detailed comments below.
Post by Paul MooreThe return value is an object whose attributes correspond to the data
* name - the object's name
* full_name - the object's full name (including path)
* is_dir - whether the object is a directory
* is file - whether the object is a plain file
* is_symlink - whether the object is a symbolic link
On Windows, the following attributes are also available
* st_size - the size, in bytes, of the object (only meaningful for files)
* st_atime - time of last access
* st_mtime - time of last write
* st_ctime - time of creation
* st_file_attributes - Windows file attribute bits (see the
FILE_ATTRIBUTE_* constants in the stat module)
Again, this seems like a nice simple idea, but I think it's actually a
1) It's a nasty API to actually write code with. If you try to use it,
it gives off a "made only for low-level library authors" rather than
"designed for developers" smell. For example, here's a get_tree_size()
function I use written in both versions (original is the PEP 471
"""Return total size of all files in directory tree at path."""
total = 0
total += get_tree_size_original(entry.full_name)
total += entry.lstat().st_size
return total
"""Return total size of all files in directory tree at path."""
total = 0
is_dir = entry.is_dir
size = entry.st_size
st = os.lstat(entry.full_name)
is_dir = stat.S_ISDIR(st.st_mode)
size = st.st_size
total += get_tree_size_new(entry.full_name)
total += size
return total
I know which version I'd rather write and maintain! It seems to me new
users and folks new to Python could easily write the top version, but
the bottom is longer, more complicated, and harder to get right. It
would also be very easy to write code in a way that works on Windows
but bombs hard on POSIX.
2) It seems like your assumption is that is_dir/is_file/is_symlink are
always available on POSIX via readdir. This isn't actually the case
(this was discussed in the original threads) -- if readdir() returns
dirent.d_type as DT_UNKNOWN, then you actually have to call os.stat()
anyway to get it. So, as the above definition of get_tree_size_new()
is_dir/is_file/is_symlink as well as the st_* attributes.
3) It's not much different in concept to the PEP 471 version, except
that PEP 471 has a built-in .lstat() method, making the user's life
much easier. This is the sense in which it's the worst of both worlds
-- it's a far less nice API to use, but it still has the same issues
with race conditions the original does.
First, based on the +1's to Paul's new solution, I don't think people
are too concerned about the race condition issue (attributes being
different between the original readdir and the os.stat calls). I think
this is probably fair -- if folks care, they can handle it in an
application-specific way. So that means Paul's new solution and the
original PEP 471 approach are both okay on that score.
Second, comparing PEP 471 to Nick's solution: error handling is much
more straight-forward and simple to document with the original PEP 471
approach (just try/catch around the function calls) than with Nick's
get_lstat=True approach of doing the stat() if needed inside the
iterator. To catch errors with that approach, you'd either have to do
a "while True" loop and try/catch around next(it) manually (which is
very yucky code), or we'd have to add an onerror callback, which is
somewhat less nice to use and harder to document (signature of the
callback, exception object, etc).
So given all of the above, I'm fairly strongly in favour of the
approach in the original PEP 471 due to it's easy-to-use API and
straight-forward try/catch approach to error handling. (My second
option would be Nick's get_lstat=True with the onerror callback. My
third option would be Paul's attribute-only solution, as it's just
very hard to use.)
Thoughts?
-Ben