Discussion:
PEP 471 "scandir" accepted
Victor Stinner
2014-07-21 22:26:02 UTC
Permalink
Hi,

I asked privately Guido van Rossum if I can be the BDFL-delegate for
the PEP 471 and he agreed. I accept the latest version of the PEP:

http://legacy.python.org/dev/peps/pep-0471/

I consider that the PEP 471 "scandir" was discussed enough to collect
all possible options (variations of the API) and that main flaws have
been detected. Ben Hoyt modified his PEP to list all these options,
and for each option gives advantages and drawbacks. Great job Ben :-)
Thanks all developers who contributed to the threads on the python-dev
mailing list!

The new version of the PEP has an optional "follow_symlinks" parameter
which is True by default. IMO this API fits better the common case,
list the content of a single directory, and it's now simple to not
follow symlinks to implement a recursive function like os.walk().

The PEP also explicitly mentions that os.walk() will be modified to
benefit of the new os.scandir() function.

I'm happy because the final API is very close to os.path functions and
pathlib.Path methods. Python stays consistent, which is a great power
of this language!

The PEP is accepted. It's time to review the implementation ;-) The
current code can be found at:

https://github.com/benhoyt/scandir

(I don't think that Ben already updated his implementation for the
latest version of the PEP.)

Victor
Ben Hoyt
2014-07-22 02:27:09 UTC
Permalink
Post by Victor Stinner
I asked privately Guido van Rossum if I can be the BDFL-delegate for
http://legacy.python.org/dev/peps/pep-0471/
Thank you!
Post by Victor Stinner
The PEP also explicitly mentions that os.walk() will be modified to
benefit of the new os.scandir() function.
Yes, this was a good suggestion to include that explicitly -- in
actual fact, speeding up os.walk() was my main goal initially.
Post by Victor Stinner
The PEP is accepted.
Superb. Could you please update the PEP with the Resolution and
BDFL-Delegate fields?
Post by Victor Stinner
https://github.com/benhoyt/scandir
(I don't think that Ben already updated his implementation for the
latest version of the PEP.)
I have actually updated my GitHub repo for the current PEP (did this
last Saturday). However, there are still a few open issues, the main
one is that my scandir.py module doesn't handle the bytes/str thing
properly.

I intend to work on the CPython implementation over the next few
weeks. However, a couple of thoughts up-front:

I think if I were doing this from scratch I'd reimplement listdir() in
Python as "return [e.name for e in scandir(path)]". However, I'm not
sure this is a good idea, as I don't really want listdir() to suddenly
use more memory and perform slightly *worse* due to the extra DirEntry
object allocations.

So my basic plan is to have an internal helper function in
posixmodule.c that either yields DirEntry objects or strings. And then
listdir() would simply be defined something like "return
list(_scandir(path, yield_strings=True))" in C or in Python.

My reasoning is that then there'll be much less (if any) code
duplication between scandir() and listdir().

Does this sound like a reasonable approach?

-Ben
Victor Stinner
2014-07-22 07:39:17 UTC
Permalink
Modify os.listdir() to use os.scandir() is not part of the PEP, you should
not do that. If you worry about performances, try to implement my free list
idea.

You may modify the C code of listdir() to share as much code as possible. I
mean you can implement your idea in C.

Victor
Akira Li
2014-07-22 07:33:41 UTC
Permalink
Post by Ben Hoyt
I think if I were doing this from scratch I'd reimplement listdir() in
Python as "return [e.name for e in scandir(path)]".
...
Post by Ben Hoyt
So my basic plan is to have an internal helper function in
posixmodule.c that either yields DirEntry objects or strings. And then
listdir() would simply be defined something like "return
list(_scandir(path, yield_strings=True))" in C or in Python.
My reasoning is that then there'll be much less (if any) code
duplication between scandir() and listdir().
Does this sound like a reasonable approach?
Note: listdir() accepts an integer path (an open file descriptor that
refers to a directory) that is passed to fdopendir() on POSIX [4] i.e.,
*you can't use scandir() to replace listdir() in this case* (as I've
already mentioned in [1]). See the corresponding tests from [2].

[1] https://mail.python.org/pipermail/python-dev/2014-July/135296.html
[2] https://mail.python.org/pipermail/python-dev/2014-June/135265.html
Post by Ben Hoyt
This function can also support specifying a file descriptor; the file
descriptor must refer to a directory.
[3] https://docs.python.org/3.4/library/os.html#os.listdir
[4] http://hg.python.org/cpython/file/3.4/Modules/posixmodule.c#l3736


--
Akira
Ben Hoyt
2014-07-22 15:52:45 UTC
Permalink
Post by Akira Li
Note: listdir() accepts an integer path (an open file descriptor that
refers to a directory) that is passed to fdopendir() on POSIX [4] i.e.,
*you can't use scandir() to replace listdir() in this case* (as I've
already mentioned in [1]). See the corresponding tests from [2].
[1] https://mail.python.org/pipermail/python-dev/2014-July/135296.html
[2] https://mail.python.org/pipermail/python-dev/2014-June/135265.html
Post by Ben Hoyt
This function can also support specifying a file descriptor; the file
descriptor must refer to a directory.
[3] https://docs.python.org/3.4/library/os.html#os.listdir
[4] http://hg.python.org/cpython/file/3.4/Modules/posixmodule.c#l3736
Fair point.

Yes, I hadn't realized listdir supported dir_fd (must have been
looking at 2.x docs), though you've pointed it out at [1] above. and I
guess I wasn't thinking about implementation at the time.

It would be easy enough (I think) to have the helper function support
both, but raise an error in the scandir() function if the type of path
is an integer.

However, given that we have to support this for listdir() anyway, I
think it's worth reconsidering whether scandir()'s directory argument
can be an integer FD. Given that listdir() already supports it, it
will almost certainly be asked for later anyway for someone who's
porting some listdir code that uses an FD. Thoughts, Victor?

-Ben
Victor Stinner
2014-07-22 16:16:14 UTC
Permalink
Post by Ben Hoyt
However, given that we have to support this for listdir() anyway, I
think it's worth reconsidering whether scandir()'s directory argument
can be an integer FD. Given that listdir() already supports it, it
will almost certainly be asked for later anyway for someone who's
porting some listdir code that uses an FD. Thoughts, Victor?
Please focus on what was accepted in the PEP. We should first test
os.scandir(). In a few months, with better feedbacks, we can consider
extending os.scandir() to support a file descriptor. There are
different issues which should be discussed and decided to implement it
(ex: handle the lifetime of the directory file descriptor).

Victor
Nick Coghlan
2014-07-22 20:57:18 UTC
Permalink
Post by Victor Stinner
Post by Ben Hoyt
However, given that we have to support this for listdir() anyway, I
think it's worth reconsidering whether scandir()'s directory argument
can be an integer FD. Given that listdir() already supports it, it
will almost certainly be asked for later anyway for someone who's
porting some listdir code that uses an FD. Thoughts, Victor?
Please focus on what was accepted in the PEP. We should first test
os.scandir(). In a few months, with better feedbacks, we can consider
extending os.scandir() to support a file descriptor. There are
different issues which should be discussed and decided to implement it
(ex: handle the lifetime of the directory file descriptor).
As Victor suggests, getting the core version working and incorporated first
is a good way to go. Future enhancements (like accepting a file descriptor)
and refactorings (like eliminating the code duplication with listdir) don't
need to (and hence shouldn't) go into the initial patch.

Cheers,
Nick.
Post by Victor Stinner
Victor
_______________________________________________
Python-Dev mailing list
https://mail.python.org/mailman/listinfo/python-dev
https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
Ben Hoyt
2014-07-22 21:07:37 UTC
Permalink
Makes sense, thanks. -Ben
Post by Nick Coghlan
Post by Victor Stinner
Post by Ben Hoyt
However, given that we have to support this for listdir() anyway, I
think it's worth reconsidering whether scandir()'s directory argument
can be an integer FD. Given that listdir() already supports it, it
will almost certainly be asked for later anyway for someone who's
porting some listdir code that uses an FD. Thoughts, Victor?
Please focus on what was accepted in the PEP. We should first test
os.scandir(). In a few months, with better feedbacks, we can consider
extending os.scandir() to support a file descriptor. There are
different issues which should be discussed and decided to implement it
(ex: handle the lifetime of the directory file descriptor).
As Victor suggests, getting the core version working and incorporated first
is a good way to go. Future enhancements (like accepting a file descriptor)
and refactorings (like eliminating the code duplication with listdir) don't
need to (and hence shouldn't) go into the initial patch.
Cheers,
Nick.
Post by Victor Stinner
Victor
_______________________________________________
Python-Dev mailing list
https://mail.python.org/mailman/listinfo/python-dev
https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
Akira Li
2014-07-22 23:21:14 UTC
Permalink
Post by Ben Hoyt
Post by Akira Li
Note: listdir() accepts an integer path (an open file descriptor that
refers to a directory) that is passed to fdopendir() on POSIX [4] i.e.,
*you can't use scandir() to replace listdir() in this case* (as I've
already mentioned in [1]). See the corresponding tests from [2].
[1] https://mail.python.org/pipermail/python-dev/2014-July/135296.html
[2] https://mail.python.org/pipermail/python-dev/2014-June/135265.html
Post by Ben Hoyt
This function can also support specifying a file descriptor; the file
descriptor must refer to a directory.
[3] https://docs.python.org/3.4/library/os.html#os.listdir
[4] http://hg.python.org/cpython/file/3.4/Modules/posixmodule.c#l3736
Fair point.
Yes, I hadn't realized listdir supported dir_fd (must have been
looking at 2.x docs), though you've pointed it out at [1] above. and I
guess I wasn't thinking about implementation at the time.
FYI, dir_fd is related but *different*: compare "specifying a file
descriptor" [1] vs. "paths relative to directory descriptors" [2].
Post by Ben Hoyt
Post by Akira Li
Post by Ben Hoyt
import os
os.listdir in os.supports_fd
True
Post by Ben Hoyt
Post by Akira Li
Post by Ben Hoyt
os.listdir in os.supports_dir_fd
False


[1] https://docs.python.org/3/library/os.html#path-fd
[2] https://docs.python.org/3/library/os.html#dir-fd
[3] https://mail.python.org/pipermail/python-dev/2014-July/135296.html

To be clear: *listdir() does not support dir_fd* though it can be
emulated using os.open(dir_fd=..).

You can safely ignore the rest of the e-mail until you want to implement
path-fd [1] support for os.scandir() in several months.

Here's code example that demonstrates both path-fd [1] and dir-fd [2]:

import contextlib
import os

with contextlib.ExitStack() as stack:
dir_fd = os.open('/etc', os.O_RDONLY)
stack.callback(os.close, dir_fd)
fd = os.open('init.d', os.O_RDONLY, dir_fd=dir_fd) # dir-fd [2]
stack.callback(os.close, fd)
print("\n".join(os.listdir(fd))) # path-fd [1]

It is the same as os.listdir('/etc/init.d') unless '/etc' is symlinked
to refer to another directory after the first os.open('/etc',..)
call. See also, os.fwalk(dir_fd=..) [4]

[4] https://docs.python.org/3/library/os.html#os.fwalk
Post by Ben Hoyt
However, given that we have to support this for listdir() anyway, I
think it's worth reconsidering whether scandir()'s directory argument
can be an integer FD.
What is entry.path in this case? If input directory is a file descriptor
(an integer) then os.path.join(directory, entry.name) won't work.

"PEP 471 should explicitly reject the support for specifying a file
descriptor so that a code that uses os.scandir may assume that
entry.path attribute is always present (no exceptions due
to a failure to read /proc/self/fd/NNN or an error while calling
fcntl(F_GETPATH) or GetFileInformationByHandleEx() -- see
http://stackoverflow.com/q/1188757 )." [5]

[5] https://mail.python.org/pipermail/python-dev/2014-July/135441.html

On the other hand os.fwalk() [4] that supports both path-fd [1] and
dir-fd [2] could be implemented without entry.path property if
os.scandir() supports just path-fd [1]. os.fwalk() provides a safe way
to traverse a directory tree without symlink races e.g., [6]:

def get_tree_size(directory):
"""Return total size of files in directory and subdirs."""
return sum(entry.lstat().st_size
for root, dirs, files, rootfd in fwalk(directory)
for entry in files)

[6] http://legacy.python.org/dev/peps/pep-0471/#examples

where fwalk() is the exact copy of os.fwalk() except that it uses
_fwalk() which is defined in terms of scandir():

import os

# adapt os._fwalk() to use scandir() instead of os.listdir()
def _fwalk(topfd, toppath, topdown, onerror, follow_symlinks):
# Note: This uses O(depth of the directory tree) file descriptors:
# if necessary, it can be adapted to only require O(1) FDs, see
# http://bugs.python.org/issue13734

entries = scandir(topfd)
dirs, nondirs = [], []
for entry in entries: #XXX call onerror on OSError on next() and return?
# report symlinks to directories as directories (like os.walk)
# but no recursion into symlinked subdirectories unless
# follow_symlinks is true

# add dangling symlinks as nondirs (DirEntry.is_dir() doesn't
# raise on broken links)
try:
(dirs if entry.is_dir() else nondirs).append(entry)
except FileNotFoundError:
continue # ignore disappeared files

if topdown:
yield toppath, dirs, nondirs, topfd

for entry in dirs:
try:
orig_st = entry.stat(follow_symlinks=follow_symlinks)
#XXX O_DIRECTORY, O_CLOEXEC, [? O_NOCTTY, O_SEARCH ?]
dirfd = os.open(entry.name, os.O_RDONLY, dir_fd=topfd)
except OSError as err:
if onerror is not None:
onerror(err)
return
try:
if follow_symlinks or os.path.samestat(orig_st, os.stat(dirfd)):
dirpath = os.path.join(toppath, entry.name) # entry.path
yield from _fwalk(dirfd, dirpath, topdown, onerror,
follow_symlinks)
finally:
close(dirfd) # or use with entry.opendir() as dirfd: ...

if not topdown:
yield toppath, dirs, nondirs, topfd


i.e., if os.scandir() supports specifying file descriptors [1] then it
is relatively straightforward to define os.fwalk() in terms of it. Would
scandir() provide the same performance benefits as for os.walk()?

entry.stat() can be implemented without entry.path when entry._directory
(or whatever other DirEntry's attribute that stores the first parameter
to os.scandir(fd)) is an open file descriptor that refers to a directory:

def stat(self, *, follow_symlinks=True):
return os.stat(self.name, #NOTE: ignore caching
follow_symlinks=follow_symlinks, dir_fd=self._directory)
lstat = lambda self: self.stat(follow_symlinks=False)


--
Akira
Victor Stinner
2014-07-22 21:57:53 UTC
Permalink
Post by Ben Hoyt
Post by Victor Stinner
The PEP is accepted.
Superb. Could you please update the PEP with the Resolution and
BDFL-Delegate fields?
Done.

Victor
Antoine Pitrou
2014-07-23 01:23:16 UTC
Permalink
Post by Victor Stinner
I'm happy because the final API is very close to os.path functions and
pathlib.Path methods. Python stays consistent, which is a great power
of this language!
By the way, http://bugs.python.org/issue19767 could benefit too.

Regards

Antoine.
Ethan Furman
2014-07-24 00:34:13 UTC
Permalink
Post by Victor Stinner
The PEP is accepted.
Thanks, Victor!

Congratulations, Ben!

--
~Ethan~

Loading...