Discussion:
My summary of the scandir (PEP 471)
Victor Stinner
2014-07-01 07:48:49 UTC
Permalink
Hi,

@Ben: it's time to update your PEP to complete it with this
discussion! IMO DirEntry must be as simple as possible and portable:

- os.scandir(str)
- DirEntry.lstat_result object only available on Windows, same result
than os.lstat()
- DirEntry.fullname(): os.path.join(directory, DirEntry.name), where
directory would be an hidden attribute of DirEntry


Notes:

- DirEntry.lstat_result is better than DirEntry.lstat() because it
makes explicitly that lstat_result is only computed once. When I call
DirEntry.lstat(), I expect to get the current status of the file, not
the cached one. It's also hard to explain (document) that
DirEntry.lstat() may or may call a system call. Don't do that, use
DirEntry.lstat_result.

- I don't think that we should support scandir(bytes). If you really
want to support os.scandir(bytes), it must raise an error on Windows
since bytes filename are already deprecated. It wouldn't make sense to
add new function with a deprecated feature. Since we have the PEP 383
(surrogateescape), it's better to advice to use Unicode on all
platforms. Almost all Python functions are able to encode back Unicode
filename automatically. Use os.fsencode() to encode manually if needd.

- We may not define a DirEntry.fullname() method: the directory name
is usually well known. Ok, but every time that I use os.listdir(), I
write os.path.join(directory, name) because in some cases I want the
full path. Example:

interesting = []
for name in os.listdir(path):
fullpath = os.path.join(path, name)
if os.path.isdir(fullpath):
continue
if ... test on the file ...:
# i need the full path here, not the relative path
# (ex: my own recursive "scandir"/"walk" function)
interesting.append(fullpath)

- It must not be possible to "refresh" a DirEntry object. Call
os.stat(entry.fullname()) or pathlib.Path(entry.fullname()) to get
fresh data. DirEntry is only computed once, that's all. It's well
defined.

- No Windows wildcard, you wrote that the feature has many corner
cases, and it's only available on Windows. It's easy to combine
scandir with fnmatch.

Victor
Ben Hoyt
2014-07-01 13:00:32 UTC
Permalink
Thanks for spinning this off to (hopefully) finished the discussion. I
agree it's nearly time to update the PEP.
Post by Victor Stinner
@Ben: it's time to update your PEP to complete it with this
- os.scandir(str)
- DirEntry.lstat_result object only available on Windows, same result
than os.lstat()
- DirEntry.fullname(): os.path.join(directory, DirEntry.name), where
directory would be an hidden attribute of DirEntry
I'm quite strongly against this, and I think it's actually the worst
of both worlds. It is not as good an API because:

(a) it doesn't call stat for you (on POSIX), so you have to check an
attribute and call scandir manually if you need it, turning what
should be one line of code into four. Your proposal above was kind of
how I had it originally, where you had to do extra tests and call
scandir manually if you needed it (see
https://mail.python.org/pipermail/python-dev/2013-May/126119.html)
(b) the .lstat_result attribute is available on Windows but not on
POSIX, meaning it's very easy for Windows developers to write code
that will run and work fine on Windows, but then break horribly on
POSIX; I think it'd be better if it broke hard on Windows to make
writing cross-platform code easy

The two alternates are:

1) the original proposal in the current version of PEP 471, where
DirEntry has an .lstat() method which calls stat() on POSIX but is
free on Windows
2) Nick Coghlan's proposal on the previous thread
(https://mail.python.org/pipermail/python-dev/2014-June/135261.html)
suggesting an ensure_lstat keyword param to scandir if you need the
lstat_result value

I would make one small tweak to Nick Coghlan's proposal to make
writing cross-platform code easier. Instead of .lstat_result being
None sometimes (on POSIX), have it None always unless you specify
ensure_lstat=True. (Actually, call it get_lstat=True to kind of make
this more obvious.) Per (b) above, this means Windows developers
wouldn't accidentally write code which failed on POSIX systems -- it'd
fail fast on Windows too if you accessed .lstat_result without
specifying get_lstat=True.

I'm still unsure which of these I like better. I think #1's API is
slightly nicer without the ensure_lstat parameter, and error handling
of the stat() is more explicit. But #2 always fetches the stat info at
the same time as the dir entry info, so eliminates the problem of
having the file info change between scandir iteration and the .lstat()
call.

I'm leaning towards preferring #2 (Nick's proposal) because it solves
or gets around the caching issue. My one concern is error handling. Is
it an issue if scandir's __next__ can raise an OSError either from the
readdir() call or the call to stat()? My thinking is probably not. In
practice, would it ever really happen that readdir() would succeed but
an os.stat() immediately after would fail? I guess it could if the
file is deleted, but then if it were deleted a microsecond earlier the
readdir() would fail anyway, or not? Or does readdir give you a
consistent, "snap-shotted" view on things?

The one other thing I'm not quite sure about with Nick's proposal is
the name .lstat_result, as it's long. I can see why he suggested that,
as .lstat sounds like a verb, but maybe that's okay? If we can have
.is_dir and .is_file as attributes, my thinking is an .lstat attribute
is fine too. I don't feel too strongly though.
Post by Victor Stinner
- I don't think that we should support scandir(bytes). If you really
want to support os.scandir(bytes), it must raise an error on Windows
since bytes filename are already deprecated. It wouldn't make sense to
add new function with a deprecated feature. Since we have the PEP 383
(surrogateescape), it's better to advice to use Unicode on all
platforms. Almost all Python functions are able to encode back Unicode
filename automatically. Use os.fsencode() to encode manually if needd.
Really, are bytes filenames deprecated? I think maybe they should be,
as they don't work on Windows :-), but the latest Python "os" docs
(https://docs.python.org/3.5/library/os.html) still say that all
functions that accept path names accept either str or bytes, and
return a value of the same type where necessary. So I think scandir()
should do the same thing.
Post by Victor Stinner
- We may not define a DirEntry.fullname() method: the directory name
is usually well known. Ok, but every time that I use os.listdir(), I
write os.path.join(directory, name) because in some cases I want the
full path.
Agreed. I use this a lot too. However, I'd prefer a .fullname
attribute rather than a method, as it's free/cheap to compute and
doesn't require OS calls.

Out of interest, why do we have .is_dir and .stat_result but .fullname
rather than .full_name? .fullname seems reasonable to me, but maybe
consistency is a good thing here?
Post by Victor Stinner
- It must not be possible to "refresh" a DirEntry object. Call
os.stat(entry.fullname()) or pathlib.Path(entry.fullname()) to get
fresh data. DirEntry is only computed once, that's all. It's well
defined.
I agree refresh() is not needed -- just use os.stat() or pathlib.
Post by Victor Stinner
- No Windows wildcard, you wrote that the feature has many corner
cases, and it's only available on Windows. It's easy to combine
scandir with fnmatch.
Agreed.

-Ben
Victor Stinner
2014-07-01 14:28:10 UTC
Permalink
Received: from localhost (HELO mail.python.org) (127.0.0.1)
by albatross.python.org with SMTP; 01 Jul 2014 16:28:32 +0200
Received: from mail-oa0-x236.google.com (unknown
[IPv6:2607:f8b0:4003:c02::236])
(using TLSv1 with cipher ECDHE-RSA-AES128-SHA (128/128 bits))
(No client certificate requested)
by mail.python.org (Postfix) with ESMTPS
for <Python-***@python.org>; Tue, 1 Jul 2014 16:28:32 +0200 (CEST)
Received: by mail-oa0-f54.google.com with SMTP id eb12so10548439oac.41
for <Python-***@python.org>; Tue, 01 Jul 2014 07:28:30 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
h=mime-version:in-reply-to:references:from:date:message-id:subject:to
:cc:content-type;
bh=Ak/AEFc3u2NA4KrijEOd6RmS3uaXlrjXZp8Tt+N3UT4=;
b=OtcHJnBacYWgHsMMDBuc91jXSZbVmKSzz42J1q6aoCARP6YYyEYQIEGy/ALdIa6Gga
GbdRZf0F2spXV9Liqj2Y4/Ciiu2zjRoI406ldAUzgHjuvi9QPJ4pNUr4sNzhO/DDM60J
xKSM23Dwrjg8OB/mDvUaIOhw224SJIfT2UilJTSKWCZ9ejoCzb25vZEDC0XDmWOumpwv
RQR9UMe4w8PZUGk71D0utt0jWSaA0lt6AlFajLwqR5flPPSiAAvxNEpg+QKeEtHVZZ5O
+ZJsMvEdqklW1GJVlny2gCgSgvxJgJ5wla0tXN8XVNRtEnTr3zuVFINDveOH9W4WmzG2
4g6A==
X-Received: by 10.182.66.130 with SMTP id f2mr3452668obt.84.1404224910121;
Tue, 01 Jul 2014 07:28:30 -0700 (PDT)
Received: by 10.182.51.98 with HTTP; Tue, 1 Jul 2014 07:28:10 -0700 (PDT)
In-Reply-To: <CAL9jXCHW+0mHU2a8Xk5z7MjHeQ4O99=***@mail.gmail.com>
X-BeenThere: python-***@python.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Python core developers <python-dev.python.org>
List-Unsubscribe: <https://mail.python.org/mailman/options/python-dev>,
<mailto:python-dev-***@python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/pipermail/python-dev/>
List-Post: <mailto:python-***@python.org>
List-Help: <mailto:python-dev-***@python.org?subject=help>
List-Subscribe: <https://mail.python.org/mailman/listinfo/python-dev>,
<mailto:python-dev-***@python.org?subject=subscribe>
Errors-To: python-dev-bounces+python-python-dev=***@python.org
Sender: "Python-Dev"
<python-dev-bounces+python-python-dev=***@python.org>
Archived-At: <http://permalink.gmane.org/gmane.comp.python.devel/148463>
Post by Ben Hoyt
(a) it doesn't call stat for you (on POSIX), so you have to check an
attribute and call scandir manually if you need it,
Yes, and that's something common when you use the os module. For
example, don't try to call os.fork(), os.getgid() or os.fchmod() on
Windows :-) Closer to your PEP, the following OS attributes are only
available on UNIX: st_blocks, st_blksize, st_rdev, st_flags; and
st_file_attributes is only available on Windows.

I don't think that using lstat_result is a common need when browsing a
directoy. In most cases, you only need is_dir() and the name
attribute.
Post by Ben Hoyt
1) the original proposal in the current version of PEP 471, where
DirEntry has an .lstat() method which calls stat() on POSIX but is
free on Windows
On UNIX, does it mean that .lstat() calls os.lstat() at the first
call, and then always return the same result? It would be different
than os.lstat() and pathlib.Path.stat() :-( I would prefer to have the
same behaviour than pathlib and os (you know, the well known
consistency of Python stdlib). As I wrote, I expect a function call to
always retrieve the new status.
Post by Ben Hoyt
2) Nick Coghlan's proposal on the previous thread
(https://mail.python.org/pipermail/python-dev/2014-June/135261.html)
suggesting an ensure_lstat keyword param to scandir if you need the
lstat_result value
I don't like this idea because it makes error handling more complex.
The syntax to catch exceptions on an iterator is verbose (while: try:
next() except ...).

Whereas calling os.lstat(entry.fullname()) is explicit and it's easy
to surround it with try/except.
Post by Ben Hoyt
.lstat_result being None sometimes (on POSIX),
Don't do that, it's not how Python handles portability. We use hasattr().
Post by Ben Hoyt
would it ever really happen that readdir() would succeed but an os.stat() immediately after would fail?
Yes, it can happen. The filesystem is system-wide and shared by all
users. The file can be deleted.
Post by Ben Hoyt
Really, are bytes filenames deprecated?
Yes, in all functions of the os module since Python 3.3. I'm sure
because I implemented the deprecation :-)

Try open(b'test.txt', w') on Windows with python -Werror.
Post by Ben Hoyt
I think maybe they should be, as they don't work on Windows :-)
Windows has an API dedicated to bytes filenames, the ANSI API. But
this API has annoying bugs: it replaces unencodable characters by
question marks, and there is no option to be noticed on the encoding
error.

Different users complained about that. It was decided to not change
Python since Python is a light wrapper over the kernel system calls.
But bytes filenames are now deprecated to advice users to use the
native type for filenames on Windows: Unicode!
Post by Ben Hoyt
but the latest Python "os" docs
(https://docs.python.org/3.5/library/os.html) still say that all
functions that accept path names accept either str or bytes,
Maybe I forgot to update the documentation :-(
Post by Ben Hoyt
So I think scandir() should do the same thing.
You may support scandir(bytes) on Windows but you will need to emit a
deprecation warning too. (which are silent by default.)

Victor
Jonas Wielicki
2014-07-01 14:59:13 UTC
Permalink
Post by Ben Hoyt
I'm leaning towards preferring #2 (Nick's proposal) because it solves
or gets around the caching issue. My one concern is error handling. Is
it an issue if scandir's __next__ can raise an OSError either from the
readdir() call or the call to stat()? My thinking is probably not. In
practice, would it ever really happen that readdir() would succeed but
an os.stat() immediately after would fail? I guess it could if the
file is deleted, but then if it were deleted a microsecond earlier the
readdir() would fail anyway, or not? Or does readdir give you a
consistent, "snap-shotted" view on things?
No need for a microsecond-timed deletion -- a directory with +r but
without +x will allow you to list the entries, but stat calls on the
files will fail with EPERM:

$ ls -l
drwxr--r--. 2 root root 60 1. Jul 16:52 test

$ sudo ls -l test
total 0
-rw-r--r--. 1 root root 0 1. Jul 16:52 foo

$ ls test
ls: cannot access test/foo: Permission denied
total 0
-????????? ? ? ? ? ? foo

$ stat test/foo
stat: cannot stat ‘test/foo’: Permission denied

I had the idea to treat a failing lstat() inside scandir() as if the
entry wasn’t found at all, but in this context, this seems wrong too.

regards,
jwi
Ben Hoyt
2014-07-01 15:30:37 UTC
Permalink
This post might be inappropriate. Click to display it.
Jonas Wielicki
2014-07-01 18:45:22 UTC
Permalink
Post by Ben Hoyt
Post by Jonas Wielicki
No need for a microsecond-timed deletion -- a directory with +r but
without +x will allow you to list the entries, but stat calls on the
Ah -- very good to know, thanks. This definitely points me in the
direction of wanting better control over error handling.
Speaking of errors, and thinking of handling errors during iteration
-- in what cases (if any) would an individual readdir fail if the
opendir succeeded?
readdir(3) manpage suggests that readdir can only fail if an invalid
directory fd was passed.

regards,
jwi
Post by Ben Hoyt
-Ben
Nick Coghlan
2014-07-01 15:33:06 UTC
Permalink
Post by Victor Stinner
Post by Ben Hoyt
2) Nick Coghlan's proposal on the previous thread
(https://mail.python.org/pipermail/python-dev/2014-June/135261.html)
suggesting an ensure_lstat keyword param to scandir if you need the
lstat_result value
I don't like this idea because it makes error handling more complex.
next() except ...).
Actually, we may need to copy the os.walk API and accept an "onerror"
callback as a scandir argument. Regardless of whether or not we have
"ensure_lstat", the iteration step could fail, so I don't believe we can
just transfer the existing approach of catching exceptions from the listdir
call.
Post by Victor Stinner
Whereas calling os.lstat(entry.fullname()) is explicit and it's easy
to surround it with try/except.
Post by Ben Hoyt
.lstat_result being None sometimes (on POSIX),
Don't do that, it's not how Python handles portability. We use hasattr().
That's not true in general - we do either, depending on context.

With the addition of an os.walk style onerror callback, I'm still in favour
of a "get_lstat" flag (tweaked as Ben suggests to always be None unless
requested, so Windows code is less likely to be inadvertently non-portable)
Post by Victor Stinner
Post by Ben Hoyt
would it ever really happen that readdir() would succeed but an
os.stat() immediately after would fail?
Post by Victor Stinner
Yes, it can happen. The filesystem is system-wide and shared by all
users. The file can be deleted.
We need per-iteration error handling for the readdir call anyway, so I
think an onerror callback is a better option than dropping the ability to
easily obtain full stat information as part of the iteration.

Cheers,
Nick.
Ben Hoyt
2014-07-01 15:42:25 UTC
Permalink
We need per-iteration error handling for the readdir call anyway, so I think
an onerror callback is a better option than dropping the ability to easily
obtain full stat information as part of the iteration.
I don't mind the idea of an "onerror" callback, but it's adding
complexity. Putting aside the question of caching/timing for a second
and assuming .lstat() as per the current PEP 471, do we really need
per-iteration error handling for readdir()? When would that actually
fail in practice?

-Ben
Nick Coghlan
2014-07-01 16:50:48 UTC
Permalink
Post by Ben Hoyt
We need per-iteration error handling for the readdir call anyway, so I think
an onerror callback is a better option than dropping the ability to easily
obtain full stat information as part of the iteration.
I don't mind the idea of an "onerror" callback, but it's adding
complexity. Putting aside the question of caching/timing for a second
and assuming .lstat() as per the current PEP 471, do we really need
per-iteration error handling for readdir()? When would that actually
fail in practice?
An NFS mount dropping the connection or a USB key being removed are
the first that come to mind, but I expect there are others. I find
it's generally better to just assume that any system call may fail for
obscure reasons and put the infrastructure in place to deal with it
rather than getting ugly, hard to track down bugs later.

Cheers,
Nick.
--
Nick Coghlan | ***@gmail.com | Brisbane, Australia
Ethan Furman
2014-07-01 15:34:20 UTC
Permalink
Post by Jonas Wielicki
I had the idea to treat a failing lstat() inside scandir() as if the
entry wasn’t found at all, but in this context, this seems wrong too.
Well, os.walk supports passing in an error handler -- perhaps scandir should as well.

--
~Ethan~
Paul Moore
2014-07-01 21:20:17 UTC
Permalink
Post by Ben Hoyt
2) Nick Coghlan's proposal on the previous thread
(https://mail.python.org/pipermail/python-dev/2014-June/135261.html)
suggesting an ensure_lstat keyword param to scandir if you need the
lstat_result value
I would make one small tweak to Nick Coghlan's proposal to make
writing cross-platform code easier. Instead of .lstat_result being
None sometimes (on POSIX), have it None always unless you specify
ensure_lstat=True. (Actually, call it get_lstat=True to kind of make
this more obvious.) Per (b) above, this means Windows developers
wouldn't accidentally write code which failed on POSIX systems -- it'd
fail fast on Windows too if you accessed .lstat_result without
specifying get_lstat=True.
This is getting very complicated (at least to me, as a Windows user,
where the basic idea seems straightforward).

It seems to me that the right model is the standard "thin wrapper
round the OS feature" that acts as a building block - it's typical of
the rest of the os module. I think that thin wrapper is needed - even
if the various bells and whistles are useful, they can be built on top
of a low-level version (whereas the converse is not the case).
Typically, such thin wrappers expose POSIX semantics by default, and
Windows behaviour follows as closely as possible (see for example
stat, where st_ino makes no sense on Windows, but is present). In this
case, we're exposing Windows semantics, and POSIX is the one needing
to fit the model, but the principle is the same.

On that basis, optional attributes (as used in stat results) seem
entirely sensible.

The documentation for DirEntry could easily be written to parallel
that of a stat result:

"""
The return value is an object whose attributes correspond to the data
the OS returns about a directory entry:

* name - the object's name
* full_name - the object's full name (including path)
* is_dir - whether the object is a directory
* is file - whether the object is a plain file
* is_symlink - whether the object is a symbolic link

On Windows, the following attributes are also available

* st_size - the size, in bytes, of the object (only meaningful for files)
* st_atime - time of last access
* st_mtime - time of last write
* st_ctime - time of creation
* st_file_attributes - Windows file attribute bits (see the
FILE_ATTRIBUTE_* constants in the stat module)
"""

That's no harder to understand (or to work with) than the equivalent
stat result. The only difference is that the unavailable attributes
can be queried on POSIX, there's just a separate system call involved
(with implications in terms of performance, error handling and
potential race conditions).

The version of scandir with the ensure_lstat argument is easy to write
based on one with optional arguments (I'm playing fast and loose with
adding attributes to DirEntry values here, just for the sake of an
example - the details are left as an exercise)

def scandir_ensure(path='.', ensure_lstat=False):
for entry in os.scandir(path):
if ensure_lstat and not hasattr(entry, 'st_size'):
stat_data = os.lstat(entry.full_name)
entry.st_size = stat_data.st_size
entry.st_atime = stat_data.st_atime
entry.st_mtime = stat_data.st_mtime
entry.st_ctime = stat_data.st_ctime
# Ignore file_attributes, as we'll never get here on Windows
yield entry

Variations on how you handle errors in the lstat call, etc, can be
added to taste.

Please, let's stick to a low-level wrapper round the OS API for the
first iteration of this feature. Enhancements can be added later, when
real-world usage has proved their value.

Paul
Glenn Linderman
2014-07-01 21:39:51 UTC
Permalink
Post by Paul Moore
Please, let's stick to a low-level wrapper round the OS API for the
first iteration of this feature. Enhancements can be added later, when
real-world usage has proved their value.
I almost wrote this whole message this morning, but didn't have time.
Thanks, Paul, for digging through the details.

+1
Ethan Furman
2014-07-01 21:30:48 UTC
Permalink
Post by Paul Moore
Please, let's stick to a low-level wrapper round the OS API for the
first iteration of this feature. Enhancements can be added later, when
real-world usage has proved their value.
+1
Chris Angelico
2014-07-02 01:13:56 UTC
Permalink
Post by Paul Moore
I think that thin wrapper is needed - even
if the various bells and whistles are useful, they can be built on top
of a low-level version (whereas the converse is not the case).
+1. Make everything as simple as possible (but no simpler).

ChrisA
Nick Coghlan
2014-07-02 06:35:48 UTC
Permalink
Received: from localhost (HELO mail.python.org) (127.0.0.1)
by albatross.python.org with SMTP; 02 Jul 2014 08:41:07 +0200
Received: from mail-qc0-x22b.google.com (unknown
[IPv6:2607:f8b0:400d:c01::22b])
(using TLSv1 with cipher ECDHE-RSA-AES128-SHA (128/128 bits))
(No client certificate requested)
by mail.python.org (Postfix) with ESMTPS
for <Python-***@python.org>; Wed, 2 Jul 2014 08:41:07 +0200 (CEST)
Received: by mail-qc0-f171.google.com with SMTP id w7so9356848qcr.2
for <Python-***@python.org>; Tue, 01 Jul 2014 23:41:00 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
h=mime-version:in-reply-to:references:date:message-id:subject:from:to
:cc:content-type;
bh=l27Ym6+Jp8mbcQzSsvDcZup9g5XK64U2y3c96CUOWBo=;
b=mQXuq3AHVeG6JQdy3bwXaDOMx5PD43i9DSOeLVGoCUupGRuvk9ldEiL8LpLEzIyllG
xozNsQ6vPpiwzW0OP4Z+A/XrOzqX4LItUgfB5mweU7DquZPhj9m+77r+JPS23fs6jQyo
xJXgF7EqiwNaLiwYxCUt68jlAu/8/EtI5Iqwz7VSeOOfyX8dRkJo6Vpg8LLKDGCIs7mV
qp+CTVx0V8k2ON1SN91yyrzlsWzADUQQ74Hx/Pv0dUCC2lvwivPCE6mEbAjoHIP46Sml
62SBKYzheHknLDInoK8Tubjv2CuQeqW+uqlsiD7fN0sdEK8K/I/oxSTxrE9G/IenTVFs
j2pQ==
X-Received: by 10.224.115.3 with SMTP id g3mr69618830qaq.9.1404282948297; Tue,
01 Jul 2014 23:35:48 -0700 (PDT)
Received: by 10.224.70.71 with HTTP; Tue, 1 Jul 2014 23:35:48 -0700 (PDT)
In-Reply-To: <CACac1F_03K5WRmCVx5VwQC6Zy=pUS8UBv6-***@mail.gmail.com>
X-BeenThere: python-***@python.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Python core developers <python-dev.python.org>
List-Unsubscribe: <https://mail.python.org/mailman/options/python-dev>,
<mailto:python-dev-***@python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/pipermail/python-dev/>
List-Post: <mailto:python-***@python.org>
List-Help: <mailto:python-dev-***@python.org?subject=help>
List-Subscribe: <https://mail.python.org/mailman/listinfo/python-dev>,
<mailto:python-dev-***@python.org?subject=subscribe>
Errors-To: python-dev-bounces+python-python-dev=***@python.org
Sender: "Python-Dev"
<python-dev-bounces+python-python-dev=***@python.org>
Archived-At: <http://permalink.gmane.org/gmane.comp.python.devel/148483>
Post by Paul Moore
Post by Ben Hoyt
2) Nick Coghlan's proposal on the previous thread
(https://mail.python.org/pipermail/python-dev/2014-June/135261.html)
suggesting an ensure_lstat keyword param to scandir if you need the
lstat_result value
I would make one small tweak to Nick Coghlan's proposal to make
writing cross-platform code easier. Instead of .lstat_result being
None sometimes (on POSIX), have it None always unless you specify
ensure_lstat=True. (Actually, call it get_lstat=True to kind of make
this more obvious.) Per (b) above, this means Windows developers
wouldn't accidentally write code which failed on POSIX systems -- it'd
fail fast on Windows too if you accessed .lstat_result without
specifying get_lstat=True.
This is getting very complicated (at least to me, as a Windows user,
where the basic idea seems straightforward).
It seems to me that the right model is the standard "thin wrapper
round the OS feature" that acts as a building block - it's typical of
the rest of the os module. I think that thin wrapper is needed - even
if the various bells and whistles are useful, they can be built on top
of a low-level version (whereas the converse is not the case).
Typically, such thin wrappers expose POSIX semantics by default, and
Windows behaviour follows as closely as possible (see for example
stat, where st_ino makes no sense on Windows, but is present). In this
case, we're exposing Windows semantics, and POSIX is the one needing
to fit the model, but the principle is the same.
On that basis, optional attributes (as used in stat results) seem
entirely sensible.
The documentation for DirEntry could easily be written to parallel
"""
The return value is an object whose attributes correspond to the data
* name - the object's name
* full_name - the object's full name (including path)
* is_dir - whether the object is a directory
* is file - whether the object is a plain file
* is_symlink - whether the object is a symbolic link
On Windows, the following attributes are also available
* st_size - the size, in bytes, of the object (only meaningful for files)
* st_atime - time of last access
* st_mtime - time of last write
* st_ctime - time of creation
* st_file_attributes - Windows file attribute bits (see the
FILE_ATTRIBUTE_* constants in the stat module)
"""
That's no harder to understand (or to work with) than the equivalent
stat result. The only difference is that the unavailable attributes
can be queried on POSIX, there's just a separate system call involved
(with implications in terms of performance, error handling and
potential race conditions).
The version of scandir with the ensure_lstat argument is easy to write
based on one with optional arguments (I'm playing fast and loose with
adding attributes to DirEntry values here, just for the sake of an
example - the details are left as an exercise)
stat_data = os.lstat(entry.full_name)
entry.st_size = stat_data.st_size
entry.st_atime = stat_data.st_atime
entry.st_mtime = stat_data.st_mtime
entry.st_ctime = stat_data.st_ctime
# Ignore file_attributes, as we'll never get here on Windows
yield entry
Variations on how you handle errors in the lstat call, etc, can be
added to taste.
Please, let's stick to a low-level wrapper round the OS API for the
first iteration of this feature. Enhancements can be added later, when
real-world usage has proved their value.
+1 from me - especially if this recipe goes in at least the PEP, and
potentially even the docs.

I'm also OK with postponing onerror support for the time being - that
should be straightforward to add later if we decide we need it.

Cheers,
Nick.
--
Nick Coghlan | ***@gmail.com | Brisbane, Australia
Jonas Wielicki
2014-07-02 10:25:47 UTC
Permalink
[snip]
Please, let's stick to a low-level wrapper round the OS API for the
first iteration of this feature. Enhancements can be added later, when
real-world usage has proved their value.
Paul
+1 to the whole thing. That’s an ingeniously simple solution to the
issues we’re having here.

regards,
jwi
Ben Hoyt
2014-07-02 12:41:28 UTC
Permalink
This post might be inappropriate. Click to display it.
Paul Moore
2014-07-02 13:48:12 UTC
Permalink
tl;dr - I agree with your points and think that the original PEP 471
proposal is fine. The details here are just clarification of why my
proposal wasn't just "use PEP 471 as written" in the first place...
Post by Ben Hoyt
1) It's a nasty API to actually write code with. If you try to use it,
it gives off a "made only for low-level library authors" rather than
"designed for developers" smell. For example, here's a get_tree_size()
function I use written in both versions (original is the PEP 471
"""Return total size of all files in directory tree at path."""
total = 0
total += get_tree_size_original(entry.full_name)
total += entry.lstat().st_size
return total
"""Return total size of all files in directory tree at path."""
total = 0
is_dir = entry.is_dir
size = entry.st_size
st = os.lstat(entry.full_name)
is_dir = stat.S_ISDIR(st.st_mode)
size = st.st_size
total += get_tree_size_new(entry.full_name)
total += size
return total
I know which version I'd rather write and maintain!
Fair point. But *only* because is_dir isn't guaranteed to be
available. I could debate other aspects of your translation to use my
API, but it's not relevant as my proposal was flawed in terms of
is_XXX.
Post by Ben Hoyt
It seems to me new
users and folks new to Python could easily write the top version, but
the bottom is longer, more complicated, and harder to get right.
Given the is_dir point, agreed.
Post by Ben Hoyt
It
would also be very easy to write code in a way that works on Windows
but bombs hard on POSIX.
You may have a point here - my Windows bias may be showing. It's
already awfully easy to write code that works on POSIX but bombs hard
on Windows (deleting open files, for example) so I find it tempting to
think "give them a taste of their own medicine" :-)

More seriously, it seems to me that the scandir API is specifically
designed to write efficient code on platforms where the OS gives
information that allows you to do so. Warping the API too much to
cater for platforms where that isn't possible seems to have the
priorities backwards. Making the API not be an accident waiting to
happen is fine, though.

And let's be careful, too. My position is that it's not too hard to
write code that works on Windows, Linux and OS X but you're right you
could miss the problem with platforms that don't even support a free
is_dir(). It's *easier* to write Windows-only code by mistake, but the
fix to cover the "big three" is pretty simple (if not hasattr, lstat).
Post by Ben Hoyt
2) It seems like your assumption is that is_dir/is_file/is_symlink are
always available on POSIX via readdir. This isn't actually the case
(this was discussed in the original threads) -- if readdir() returns
dirent.d_type as DT_UNKNOWN, then you actually have to call os.stat()
anyway to get it. So, as the above definition of get_tree_size_new()
is_dir/is_file/is_symlink as well as the st_* attributes.
Ah, the wording in the PEP says "Linux, Windows, OS X". Superficially,
that said "everywhere" to me. It might be worth calling out
specifically some examples where it's not available without an extra
system call, just to make the point explicit.

You're right, though, that blows away the simplicity of my proposal.
The original PEP 471 seems precisely right to me, in that case.

I was only really arguing for attributes because they seem more
obviously static than a method call. And personally I don't care about
that aspect.
Post by Ben Hoyt
3) It's not much different in concept to the PEP 471 version, except
that PEP 471 has a built-in .lstat() method, making the user's life
much easier. This is the sense in which it's the worst of both worlds
-- it's a far less nice API to use, but it still has the same issues
with race conditions the original does.
Agreed. My intent was never to remove the race conditions, I see them
as the responsibility of the application to consider (many
applications simply won't care, and those that do will likely want a
specific solution, not a library-level compromise).
Post by Ben Hoyt
First, based on the +1's to Paul's new solution, I don't think people
are too concerned about the race condition issue (attributes being
different between the original readdir and the os.stat calls). I think
this is probably fair -- if folks care, they can handle it in an
application-specific way. So that means Paul's new solution and the
original PEP 471 approach are both okay on that score.
+1. That was my main point, in actual fact
Post by Ben Hoyt
Second, comparing PEP 471 to Nick's solution: error handling is much
more straight-forward and simple to document with the original PEP 471
approach (just try/catch around the function calls) than with Nick's
get_lstat=True approach of doing the stat() if needed inside the
iterator. To catch errors with that approach, you'd either have to do
a "while True" loop and try/catch around next(it) manually (which is
very yucky code), or we'd have to add an onerror callback, which is
somewhat less nice to use and harder to document (signature of the
callback, exception object, etc).
Agreed. If my solution had worked, it would have been by isolating a
few extra cases where you could guarantee errors won't happen. But
actually, errors *can* happen in those cases, on certain systems. So
PEP 471 wins on all counts here too.
Post by Ben Hoyt
So given all of the above, I'm fairly strongly in favour of the
approach in the original PEP 471 due to it's easy-to-use API and
straight-forward try/catch approach to error handling. (My second
option would be Nick's get_lstat=True with the onerror callback. My
third option would be Paul's attribute-only solution, as it's just
very hard to use.)
Agreed. The solution I proposed isn't just "very hard to use", it's
actually wrong. If is_XXX are optional attributes, that's not my
solution, and I agree it's *awful*.

Paul.

PS I'd suggest adding a "Rejected proposals" section to the PEP which
mentions the race condition issue and points to this discussion as an
indication that people didn't seem to see it as a problem.
Post by Ben Hoyt
Thanks for the effort in your response, Paul.
I'm all for KISS, but let's just slow down a bit here.
Post by Paul Moore
I think that thin wrapper is needed - even
if the various bells and whistles are useful, they can be built on top
of a low-level version (whereas the converse is not the case).
Yes, but API design is important. For example, urllib2 has a kind of
the "thin wrapper approach", but millions of people use the 3rd-party
"requests" library because it's just so much nicer to use.
There are low-level functions in the "os" module, but there are also a
lot of higher-level functions (os.walk) and functions that smooth over
cross-platform issues (os.stat).
Detailed comments below.
Post by Paul Moore
The return value is an object whose attributes correspond to the data
* name - the object's name
* full_name - the object's full name (including path)
* is_dir - whether the object is a directory
* is file - whether the object is a plain file
* is_symlink - whether the object is a symbolic link
On Windows, the following attributes are also available
* st_size - the size, in bytes, of the object (only meaningful for files)
* st_atime - time of last access
* st_mtime - time of last write
* st_ctime - time of creation
* st_file_attributes - Windows file attribute bits (see the
FILE_ATTRIBUTE_* constants in the stat module)
Again, this seems like a nice simple idea, but I think it's actually a
1) It's a nasty API to actually write code with. If you try to use it,
it gives off a "made only for low-level library authors" rather than
"designed for developers" smell. For example, here's a get_tree_size()
function I use written in both versions (original is the PEP 471
"""Return total size of all files in directory tree at path."""
total = 0
total += get_tree_size_original(entry.full_name)
total += entry.lstat().st_size
return total
"""Return total size of all files in directory tree at path."""
total = 0
is_dir = entry.is_dir
size = entry.st_size
st = os.lstat(entry.full_name)
is_dir = stat.S_ISDIR(st.st_mode)
size = st.st_size
total += get_tree_size_new(entry.full_name)
total += size
return total
I know which version I'd rather write and maintain! It seems to me new
users and folks new to Python could easily write the top version, but
the bottom is longer, more complicated, and harder to get right. It
would also be very easy to write code in a way that works on Windows
but bombs hard on POSIX.
2) It seems like your assumption is that is_dir/is_file/is_symlink are
always available on POSIX via readdir. This isn't actually the case
(this was discussed in the original threads) -- if readdir() returns
dirent.d_type as DT_UNKNOWN, then you actually have to call os.stat()
anyway to get it. So, as the above definition of get_tree_size_new()
is_dir/is_file/is_symlink as well as the st_* attributes.
3) It's not much different in concept to the PEP 471 version, except
that PEP 471 has a built-in .lstat() method, making the user's life
much easier. This is the sense in which it's the worst of both worlds
-- it's a far less nice API to use, but it still has the same issues
with race conditions the original does.
First, based on the +1's to Paul's new solution, I don't think people
are too concerned about the race condition issue (attributes being
different between the original readdir and the os.stat calls). I think
this is probably fair -- if folks care, they can handle it in an
application-specific way. So that means Paul's new solution and the
original PEP 471 approach are both okay on that score.
Second, comparing PEP 471 to Nick's solution: error handling is much
more straight-forward and simple to document with the original PEP 471
approach (just try/catch around the function calls) than with Nick's
get_lstat=True approach of doing the stat() if needed inside the
iterator. To catch errors with that approach, you'd either have to do
a "while True" loop and try/catch around next(it) manually (which is
very yucky code), or we'd have to add an onerror callback, which is
somewhat less nice to use and harder to document (signature of the
callback, exception object, etc).
So given all of the above, I'm fairly strongly in favour of the
approach in the original PEP 471 due to it's easy-to-use API and
straight-forward try/catch approach to error handling. (My second
option would be Nick's get_lstat=True with the onerror callback. My
third option would be Paul's attribute-only solution, as it's just
very hard to use.)
Thoughts?
-Ben
Ben Hoyt
2014-07-02 14:48:50 UTC
Permalink
Thanks for the clarifications and support.
Post by Paul Moore
Ah, the wording in the PEP says "Linux, Windows, OS X". Superficially,
that said "everywhere" to me. It might be worth calling out
specifically some examples where it's not available without an extra
system call, just to make the point explicit.
Good call. I'll update the wording in the PEP here and try to call out
specific examples of where is_dir() could call os.stat().

Hard-core POSIX people, do you know when readdir() d_type will be
DT_UNKNOWN on (for example) Linux or OS X? I suspect this can happen
on certain network filesystems, but I'm not sure.
Post by Paul Moore
PS I'd suggest adding a "Rejected proposals" section to the PEP which
mentions the race condition issue and points to this discussion as an
indication that people didn't seem to see it as a problem.
Definitely agreed. I'll add this, and clarify various other issues in
the PEP, and then repost.

-Ben
Nikolaus Rath
2014-07-02 21:59:01 UTC
Permalink
Post by Ben Hoyt
Thanks for the clarifications and support.
Post by Paul Moore
Ah, the wording in the PEP says "Linux, Windows, OS X". Superficially,
that said "everywhere" to me. It might be worth calling out
specifically some examples where it's not available without an extra
system call, just to make the point explicit.
Good call. I'll update the wording in the PEP here and try to call out
specific examples of where is_dir() could call os.stat().
Hard-core POSIX people, do you know when readdir() d_type will be
DT_UNKNOWN on (for example) Linux or OS X? I suspect this can happen
on certain network filesystems, but I'm not sure.
Any fuse file system mounted by some other user and without -o
allow_other. For these entries, stat() will fail as well.


Best,
-Nikolaus
--
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«
Loading...