Discussion:
Updates to PEP 471, the os.scandir() proposal
Ben Hoyt
2014-07-08 13:52:18 UTC
Permalink
Hi folks,

After some very good python-dev feedback on my first version of PEP
471, I've updated the PEP to clarify a few things and added various
"Rejected ideas" subsections. Here's a link to the new version (I've
also copied the full text below):

http://legacy.python.org/dev/peps/pep-0471/ -- new PEP as HTML
http://hg.python.org/peps/rev/0da4736c27e8 -- changes

Specifically, I've made these changes (not an exhaustive list):

* Clarified wording in several places, for example "Linux and OS X" ->
"POSIX-based systems"
* Added a new "Notes on exception handling" section
* Added a thorough "Rejected ideas" section with the various ideas
that have been discussed previously and rejected for various reasons
* Added a description of the .full_name attribute, which folks seemed
to generally agree is a good idea
* Removed the "open issues" section, as the three open issues have
either been included (full_name) or rejected (windows_wildcard)

One known error in the PEP is that the "Notes" sections should be
top-level sections, not be subheadings of "Examples". If someone would
like to give me ("benhoyt") commit access to the peps repo, I can fix
this and any other issues that come up.

I'd love to see this finalized! If you're going to comment with
suggestions to change the API, please ensure you've first read the
"rejected ideas" sections in the PEP as well as the relevant
python-dev discussion (linked to in the PEP).

Thanks,
Ben


PEP: 471
Title: os.scandir() function -- a better and faster directory iterator
Version: $Revision$
Last-Modified: $Date$
Author: Ben Hoyt <***@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 30-May-2014
Python-Version: 3.5
Post-History: 27-Jun-2014, 8-Jul-2014


Abstract
========

This PEP proposes including a new directory iteration function,
``os.scandir()``, in the standard library. This new function adds
useful functionality and increases the speed of ``os.walk()`` by 2-10
times (depending on the platform and file system) by significantly
reducing the number of times ``stat()`` needs to be called.


Rationale
=========

Python's built-in ``os.walk()`` is significantly slower than it needs
to be, because -- in addition to calling ``os.listdir()`` on each
directory -- it executes the ``stat()`` system call or
``GetFileAttributes()`` on each file to determine whether the entry is
a directory or not.

But the underlying system calls -- ``FindFirstFile`` /
``FindNextFile`` on Windows and ``readdir`` on POSIX systems --
already tell you whether the files returned are directories or not, so
no further system calls are needed. Further, the Windows system calls
return all the information for a ``stat_result`` object, such as file
size and last modification time.

In short, you can reduce the number of system calls required for a
tree function like ``os.walk()`` from approximately 2N to N, where N
is the total number of files and directories in the tree. (And because
directory trees are usually wider than they are deep, it's often much
better than this.)

In practice, removing all those extra system calls makes ``os.walk()``
about **8-9 times as fast on Windows**, and about **2-3 times as fast
on POSIX systems**. So we're not talking about micro-
optimizations. See more `benchmarks here`_.

.. _`benchmarks here`: https://github.com/benhoyt/scandir#benchmarks

Somewhat relatedly, many people (see Python `Issue 11406`_) are also
keen on a version of ``os.listdir()`` that yields filenames as it
iterates instead of returning them as one big list. This improves
memory efficiency for iterating very large directories.

So, as well as providing a ``scandir()`` iterator function for calling
directly, Python's existing ``os.walk()`` function could be sped up a
huge amount.

.. _`Issue 11406`: http://bugs.python.org/issue11406


Implementation
==============

The implementation of this proposal was written by Ben Hoyt (initial
version) and Tim Golden (who helped a lot with the C extension
module). It lives on GitHub at `benhoyt/scandir`_.

.. _`benhoyt/scandir`: https://github.com/benhoyt/scandir

Note that this module has been used and tested (see "Use in the wild"
section in this PEP), so it's more than a proof-of-concept. However,
it is marked as beta software and is not extensively battle-tested.
It will need some cleanup and more thorough testing before going into
the standard library, as well as integration into ``posixmodule.c``.



Specifics of proposal
=====================

Specifically, this PEP proposes adding a single function to the ``os``
module in the standard library, ``scandir``, that takes a single,
optional string as its argument::

scandir(path='.') -> generator of DirEntry objects

Like ``listdir``, ``scandir`` calls the operating system's directory
iteration system calls to get the names of the files in the ``path``
directory, but it's different from ``listdir`` in two ways:

* Instead of returning bare filename strings, it returns lightweight
``DirEntry`` objects that hold the filename string and provide
simple methods that allow access to the additional data the
operating system returned.

* It returns a generator instead of a list, so that ``scandir`` acts
as a true iterator instead of returning the full list immediately.

``scandir()`` yields a ``DirEntry`` object for each file and directory
in ``path``. Just like ``listdir``, the ``'.'`` and ``'..'``
pseudo-directories are skipped, and the entries are yielded in
system-dependent order. Each ``DirEntry`` object has the following
attributes and methods:

* ``name``: the entry's filename, relative to the ``path`` argument
(corresponds to the return values of ``os.listdir``)

* ``full_name``: the entry's full path name -- the equivalent of
``os.path.join(path, entry.name)``

* ``is_dir()``: like ``os.path.isdir()``, but much cheaper -- it never
requires a system call on Windows, and usually doesn't on POSIX
systems

* ``is_file()``: like ``os.path.isfile()``, but much cheaper -- it
never requires a system call on Windows, and usually doesn't on
POSIX systems

* ``is_symlink()``: like ``os.path.islink()``, but much cheaper -- it
never requires a system call on Windows, and usually doesn't on
POSIX systems

* ``lstat()``: like ``os.lstat()``, but much cheaper on some systems
-- it only requires a system call on POSIX systems

The ``is_X`` methods may perform a ``stat()`` call under certain
conditions (for example, on certain file systems on POSIX systems),
and therefore possibly raise ``OSError``. The ``lstat()`` method will
call ``stat()`` on POSIX systems and therefore also possibly raise
``OSError``. See the "Notes on exception handling" section for more
details.

The ``DirEntry`` attribute and method names were chosen to be the same
as those in the new ``pathlib`` module for consistency.

Like the other functions in the ``os`` module, ``scandir()`` accepts
either a bytes or str object for the ``path`` parameter, and returns
the ``DirEntry.name`` and ``DirEntry.full_name`` attributes with the
same type as ``path``. However, it is *strongly recommended* to use
the str type, as this ensures cross-platform support for Unicode
filenames.


Examples
========

Below is a good usage pattern for ``scandir``. This is in fact almost
exactly how the scandir module's faster ``os.walk()`` implementation
uses it::

dirs = []
non_dirs = []
for entry in os.scandir(path):
if entry.is_dir():
dirs.append(entry)
else:
non_dirs.append(entry)

The above ``os.walk()``-like code will be significantly faster with
scandir than ``os.listdir()`` and ``os.path.isdir()`` on both Windows
and POSIX systems.

Or, for getting the total size of files in a directory tree, showing
use of the ``DirEntry.lstat()`` method and ``DirEntry.full_name``
attribute::

def get_tree_size(path):
"""Return total size of files in path and subdirs."""
total = 0
for entry in os.scandir(path):
if entry.is_dir():
total += get_tree_size(entry.full_name)
else:
total += entry.lstat().st_size
return total

Note that ``get_tree_size()`` will get a huge speed boost on Windows,
because no extra stat call are needed, but on POSIX systems the size
information is not returned by the directory iteration functions, so
this function won't gain anything there.


Notes on caching
----------------

The ``DirEntry`` objects are relatively dumb -- the ``name`` and
``full_name`` attributes are obviously always cached, and the ``is_X``
and ``lstat`` methods cache their values (immediately on Windows via
``FindNextFile``, and on first use on POSIX systems via a ``stat``
call) and never refetch from the system.

For this reason, ``DirEntry`` objects are intended to be used and
thrown away after iteration, not stored in long-lived data structured
and the methods called again and again.

If developers want "refresh" behaviour (for example, for watching a
file's size change), they can simply use ``pathlib.Path`` objects,
or call the regular ``os.lstat()`` or ``os.path.getsize()`` functions
which get fresh data from the operating system every call.


Notes on exception handling
---------------------------

``DirEntry.is_X()`` and ``DirEntry.lstat()`` are explicitly methods
rather than attributes or properties, to make it clear that they may
not be cheap operations, and they may do a system call. As a result,
these methods may raise ``OSError``.

For example, ``DirEntry.lstat()`` will always make a system call on
POSIX-based systems, and the ``DirEntry.is_X()`` methods will make a
``stat()`` system call on such systems if ``readdir()`` returns a
``d_type`` with a value of ``DT_UNKNOWN``, which can occur under
certain conditions or on certain file systems.

For this reason, when a user requires fine-grained error handling,
it's good to catch ``OSError`` around these method calls and then
handle as appropriate.

For example, below is a version of the ``get_tree_size()`` example
shown above, but with basic error handling added::

def get_tree_size(path):
"""Return total size of files in path and subdirs. If
is_dir() or lstat() fails, print an error message to stderr
and assume zero size (for example, file has been deleted).
"""
total = 0
for entry in os.scandir(path):
try:
is_dir = entry.is_dir()
except OSError as error:
print('Error calling is_dir():', error, file=sys.stderr)
continue
if is_dir:
total += get_tree_size(entry.full_name)
else:
try:
total += entry.lstat().st_size
except OSError as error:
print('Error calling lstat():', error, file=sys.stderr)
return total


Support
=======

The scandir module on GitHub has been forked and used quite a bit (see
"Use in the wild" in this PEP), but there's also been a fair bit of
direct support for a scandir-like function from core developers and
others on the python-dev and python-ideas mailing lists. A sampling:

* **python-dev**: a good number of +1's and very few negatives for
scandir and PEP 471 on `this June 2014 python-dev thread
<https://mail.python.org/pipermail/python-dev/2014-June/135217.html>`_

* **Nick Coghlan**, a core Python developer: "I've had the local Red
Hat release engineering team express their displeasure at having to
stat every file in a network mounted directory tree for info that is
present in the dirent structure, so a definite +1 to os.scandir from
me, so long as it makes that info available."
[`source1 <http://bugs.python.org/issue11406>`_]

* **Tim Golden**, a core Python developer, supports scandir enough to
have spent time refactoring and significantly improving scandir's C
extension module.
[`source2 <https://github.com/tjguk/scandir>`_]

* **Christian Heimes**, a core Python developer: "+1 for something
like yielddir()"
[`source3 <https://mail.python.org/pipermail/python-ideas/2012-November/017772.html>`_]
and "Indeed! I'd like to see the feature in 3.4 so I can remove my
own hack from our code base."
[`source4 <http://bugs.python.org/issue11406>`_]

* **Gregory P. Smith**, a core Python developer: "As 3.4beta1 happens
tonight, this isn't going to make 3.4 so i'm bumping this to 3.5.
I really like the proposed design outlined above."
[`source5 <http://bugs.python.org/issue11406>`_]

* **Guido van Rossum** on the possibility of adding scandir to Python
3.5 (as it was too late for 3.4): "The ship has likewise sailed for
adding scandir() (whether to os or pathlib). By all means experiment
and get it ready for consideration for 3.5, but I don't want to add
it to 3.4."
[`source6 <https://mail.python.org/pipermail/python-dev/2013-November/130583.html>`_]

Support for this PEP itself (meta-support?) was given by Nick Coghlan
on python-dev: "A PEP reviewing all this for 3.5 and proposing a
specific os.scandir API would be a good thing."
[`source7 <https://mail.python.org/pipermail/python-dev/2013-November/130588.html>`_]


Use in the wild
===============

To date, the ``scandir`` implementation is definitely useful, but has
been clearly marked "beta", so it's uncertain how much use of it there
is in the wild. Ben Hoyt has had several reports from people using it.
For example:

* Chris F: "I am processing some pretty large directories and was half
expecting to have to modify getdents. So thanks for saving me the
effort." [via personal email]

* bschollnick: "I wanted to let you know about this, since I am using
Scandir as a building block for this code. Here's a good example of
scandir making a radical performance improvement over os.listdir."
[`source8 <https://github.com/benhoyt/scandir/issues/19>`_]

* Avram L: "I'm testing our scandir for a project I'm working on.
Seems pretty solid, so first thing, just want to say nice work!"
[via personal email]

Others have `requested a PyPI package`_ for it, which has been
created. See `PyPI package`_.

.. _`requested a PyPI package`: https://github.com/benhoyt/scandir/issues/12
.. _`PyPI package`: https://pypi.python.org/pypi/scandir

GitHub stats don't mean too much, but scandir does have several
watchers, issues, forks, etc. Here's the run-down as of the stats as
of July 7, 2014:

* Watchers: 17
* Stars: 57
* Forks: 20
* Issues: 4 open, 26 closed

**However, the much larger point is this:**, if this PEP is accepted,
``os.walk()`` can easily be reimplemented using ``scandir`` rather
than ``listdir`` and ``stat``, increasing the speed of ``os.walk()``
very significantly. There are thousands of developers, scripts, and
production code that would benefit from this large speedup of
``os.walk()``. For example, on GitHub, there are almost as many uses
of ``os.walk`` (194,000) as there are of ``os.mkdir`` (230,000).


Rejected ideas
==============


Naming
------

The only other real contender for this function's name was
``iterdir()``. However, ``iterX()`` functions in Python (mostly found
in Python 2) tend to be simple iterator equivalents of their
non-iterator counterparts. For example, ``dict.iterkeys()`` is just an
iterator version of ``dict.keys()``, but the objects returned are
identical. In ``scandir()``'s case, however, the return values are
quite different objects (``DirEntry`` objects vs filename strings), so
this should probably be reflected by a difference in name -- hence
``scandir()``.

See some `relevant discussion on python-dev
<https://mail.python.org/pipermail/python-dev/2014-June/135228.html>`_.


Wildcard support
----------------

``FindFirstFile``/``FindNextFile`` on Windows support passing a
"wildcard" like ``*.jpg``, so at first folks (this PEP's author
included) felt it would be a good idea to include a
``windows_wildcard`` keyword argument to the ``scandir`` function so
users could pass this in.

However, on further thought and discussion it was decided that this
would be bad idea, *unless it could be made cross-platform* (a
``pattern`` keyword argument or similar). This seems easy enough at
first -- just use the OS wildcard support on Windows, and something
like ``fnmatch`` or ``re`` afterwards on POSIX-based systems.

Unfortunately the exact Windows wildcard matching rules aren't really
documented anywhere by Microsoft, and they're quite quirky (see this
`blog post
<http://blogs.msdn.com/b/oldnewthing/archive/2007/12/17/6785519.aspx>`_),
meaning it's very problematic to emulate using ``fnmatch`` or regexes.

So the consensus was that Windows wildcard support was a bad idea.
It would be possible to add at a later date if there's a
cross-platform way to achieve it, but not for the initial version.

Read more on the `this Nov 2012 python-ideas thread
<https://mail.python.org/pipermail/python-ideas/2012-November/017770.html>`_
and this `June 2014 python-dev thread on PEP 471
<https://mail.python.org/pipermail/python-dev/2014-June/135217.html>`_.


DirEntry attributes being properties
------------------------------------

In some ways it would be nicer for the ``DirEntry`` ``is_X()`` and
``lstat()`` to be properties instead of methods, to indicate they're
very cheap or free. However, this isn't quite the case, as ``lstat()``
will require an OS call on POSIX-based systems but not on Windows.
Even ``is_dir()`` and friends may perform an OS call on POSIX-based
systems if the ``dirent.d_type`` value is ``DT_UNKNOWN`` (on certain
file systems).

Also, people would expect the attribute access ``entry.is_dir`` to
only ever raise ``AttributeError``, not ``OSError`` in the case it
makes a system call under the covers. Calling code would have to have
a ``try``/``except`` around what looks like a simple attribute access,
and so it's much better to make them *methods*.

See `this May 2013 python-dev thread
<https://mail.python.org/pipermail/python-dev/2013-May/126184.html>`_
where this PEP author makes this case and there's agreement from a
core developers.


DirEntry fields being "static" attribute-only objects
-----------------------------------------------------

In `this July 2014 python-dev message
<https://mail.python.org/pipermail/python-dev/2014-July/135303.html>`_,
Paul Moore suggested a solution that was a "thin wrapper round the OS
feature", where the ``DirEntry`` object had only static attributes:
``name``, ``full_name``, and ``is_X``, with the ``st_X`` attributes
only present on Windows. The idea was to use this simpler, lower-level
function as a building block for higher-level functions.

At first there was general agreement that simplifying in this way was
a good thing. However, there were two problems with this approach.
First, the assumption is the ``is_dir`` and similar attributes are
always present on POSIX, which isn't the case (if ``d_type`` is not
present or is ``DT_UNKNOWN``). Second, it's a much harder-to-use API
in practice, as even the ``is_dir`` attributes aren't always present
on POSIX, and would need to be tested with ``hasattr()`` and then
``os.stat()`` called if they weren't present.

See `this July 2014 python-dev response
<https://mail.python.org/pipermail/python-dev/2014-July/135312.html>`_
from this PEP's author detailing why this option is a non-ideal
solution, and the subsequent reply from Paul Moore voicing agreement.


DirEntry fields being static with an ensure_lstat option
--------------------------------------------------------

Another seemingly simpler and attractive option was suggested by
Nick Coghlan in this `June 2014 python-dev message
<https://mail.python.org/pipermail/python-dev/2014-June/135261.html>`_:
make ``DirEntry.is_X`` and ``DirEntry.lstat_result`` properties, and
populate ``DirEntry.lstat_result`` at iteration time, but only if
the new argument ``ensure_lstat=True`` was specified on the
``scandir()`` call.

This does have the advantage over the above in that you can easily get
the stat result from ``scandir()`` if you need it. However, it has the
serious disadvantage that fine-grained error handling is messy,
because ``stat()`` will be called (and hence potentially raise
``OSError``) during iteration, leading to a rather ugly, hand-made
iteration loop::

it = os.scandir(path)
while True:
try:
entry = next(it)
except OSError as error:
handle_error(path, error)
except StopIteration:
break

Or it means that ``scandir()`` would have to accept an ``onerror``
argument -- a function to call when ``stat()`` errors occur during
iteration. This seems to this PEP's author neither as direct nor as
Pythonic as ``try``/``except`` around a ``DirEntry.lstat()`` call.

See `Ben Hoyt's July 2014 reply
<https://mail.python.org/pipermail/python-dev/2014-July/135312.html>`_
to the discussion summarizing this and detailing why he thinks the
original PEP 471 proposal is "the right one" after all.


Return values being (name, stat_result) two-tuples
--------------------------------------------------

Initially this PEP's author proposed this concept as a function called
``iterdir_stat()`` which yielded two-tuples of (name, stat_result).
This does have the advantage that there are no new types introduced.
However, the ``stat_result`` is only partially filled on POSIX-based
systems (most fields set to ``None`` and other quirks), so they're not
really ``stat_result`` objects at all, and this would have to be
thoroughly documented as different from ``os.stat()``.

Also, Python has good support for proper objects with attributes and
methods, which makes for a saner and simpler API than two-tuples. It
also makes the ``DirEntry`` objects more extensible and future-proof
as operating systems add functionality and we want to include this in
``DirEntry``.

See also some previous discussion:

* `May 2013 python-dev thread
<https://mail.python.org/pipermail/python-dev/2013-May/126148.html>`_
where Nick Coghlan makes the original case for a ``DirEntry``-style
object.

* `June 2014 python-dev thread
<https://mail.python.org/pipermail/python-dev/2014-June/135244.html>`_
where Nick Coghlan makes (another) good case against the two-tuple
approach.


Return values being overloaded stat_result objects
--------------------------------------------------

Another alternative discussed was making the return values to be
overloaded ``stat_result`` objects with ``name`` and ``full_name``
attributes. However, apart from this being a strange (and strained!)
kind of overloading, this has the same problems mentioned above --
most of the ``stat_result`` information is not fetched by
``readdir()`` on POSIX systems, only (part of) the ``st_mode`` value.


Return values being pathlib.Path objects
----------------------------------------

With Antoine Pitrou's new standard library ``pathlib`` module, it
at first seems like a great idea for ``scandir()`` to return instances
of ``pathlib.Path``. However, ``pathlib.Path``'s ``is_X()`` and
``lstat()`` functions are explicitly not cached, whereas ``scandir``
has to cache them by design, because it's (often) returning values
from the original directory iteration system call.

And if the ``pathlib.Path`` instances returned by ``scandir`` cached
lstat values, but the ordinary ``pathlib.Path`` objects explicitly
don't, that would be more than a little confusing.

Guido van Rossum explicitly rejected ``pathlib.Path`` caching lstat in
the context of scandir `here
<https://mail.python.org/pipermail/python-dev/2013-November/130583.html>`_,
making ``pathlib.Path`` objects a bad choice for scandir return
values.


Possible improvements
=====================

There are many possible improvements one could make to scandir, but
here is a short list of some this PEP's author has in mind:

* scandir could potentially be further sped up by calling ``readdir``
/ ``FindNextFile`` say 50 times per ``Py_BEGIN_ALLOW_THREADS`` block
so that it stays in the C extension module for longer, and may be
somewhat faster as a result. This approach hasn't been tested, but
was suggested by on Issue 11406 by Antoine Pitrou.
[`source9 <http://bugs.python.org/msg130125>`_]

* scandir could use a free list to avoid the cost of memory allocation
for each iteration -- a short free list of 10 or maybe even 1 may help.
Suggested by Victor Stinner on a `python-dev thread on June 27`_.

.. _`python-dev thread on June 27`:
https://mail.python.org/pipermail/python-dev/2014-June/135232.html


Previous discussion
===================

* `Original thread Ben Hoyt started on python-ideas`_ about speeding
up ``os.walk()``

* Python `Issue 11406`_, which includes the original proposal for a
scandir-like function

* `Further thread Ben Hoyt started on python-dev`_ that refined the
``scandir()`` API, including Nick Coghlan's suggestion of scandir
yielding ``DirEntry``-like objects

* `Another thread Ben Hoyt started on python-dev`_ to discuss the
interaction between scandir and the new ``pathlib`` module

* `Final thread Ben Hoyt started on python-dev`_ to discuss the first
version of this PEP, with extensive discussion about the API.

* `Question on StackOverflow`_ about why ``os.walk()`` is slow and
pointers on how to fix it (this inspired the author of this PEP
early on)

* `BetterWalk`_, this PEP's author's previous attempt at this, on
which the scandir code is based

.. _`Original thread Ben Hoyt started on python-ideas`:
https://mail.python.org/pipermail/python-ideas/2012-November/017770.html
.. _`Further thread Ben Hoyt started on python-dev`:
https://mail.python.org/pipermail/python-dev/2013-May/126119.html
.. _`Another thread Ben Hoyt started on python-dev`:
https://mail.python.org/pipermail/python-dev/2013-November/130572.html
.. _`Final thread Ben Hoyt started on python-dev`:
https://mail.python.org/pipermail/python-dev/2014-June/135215.html
.. _`Question on StackOverflow`:
http://stackoverflow.com/questions/2485719/very-quickly-getting-total-size-of-folder
.. _`BetterWalk`: https://github.com/benhoyt/betterwalk


Copyright
=========

This document has been placed in the public domain.


..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End:
Victor Stinner
2014-07-08 15:13:08 UTC
Permalink
Hi,

2014-07-08 15:52 GMT+02:00 Ben Hoyt <***@gmail.com>:
> After some very good python-dev feedback on my first version of PEP
> 471, I've updated the PEP to clarify a few things and added various
> "Rejected ideas" subsections. Here's a link to the new version (I've
> also copied the full text below):

Thanks, the new PEP looks better.

> * Removed the "open issues" section, as the three open issues have
> either been included (full_name) or rejected (windows_wildcard)

I remember a pending question on python-dev:

- Martin von Loewis asked if the scandir generator would have send()
and close() methods as any Python generator. I didn't see a reply on
the mailing (nor in the PEP).

> One known error in the PEP is that the "Notes" sections should be
> top-level sections, not be subheadings of "Examples". If someone would
> like to give me ("benhoyt") commit access to the peps repo, I can fix
> this and any other issues that come up.

Or just send me your new PEP ;-)

> Notes on caching
> ----------------
>
> The ``DirEntry`` objects are relatively dumb -- the ``name`` and
> ``full_name`` attributes are obviously always cached, and the ``is_X``
> and ``lstat`` methods cache their values (immediately on Windows via
> ``FindNextFile``, and on first use on POSIX systems via a ``stat``
> call) and never refetch from the system.

It is not clear to me which methods share the cache.

On UNIX, is_dir() and is_file() call os.stat(); whereas lstat() and
is_symlink() call os.lstat().

If os.stat() says that the file is not a symlink, I guess that you can
use os.stat() result for lstat() and is_symlink() methods?

In the worst case, if the path is a symlink, would it be possible that
os.stat() and os.lstat() become "inconsistent" if the symlink is
modified between the two calls? If yes, I don't think that it's an
issue, it's just good to know it.

For symlinks, readdir() returns the status of the linked file or of the symlink?

Victor
Ben Hoyt
2014-07-08 18:03:00 UTC
Permalink
> I remember a pending question on python-dev:
>
> - Martin von Loewis asked if the scandir generator would have send()
> and close() methods as any Python generator. I didn't see a reply on
> the mailing (nor in the PEP).

Good call. Looks like you're referring to this message:
https://mail.python.org/pipermail/python-dev/2014-July/135324.html

I'm not actually familiar with the purpose of .close() and
.send()/.throw() on generators. Do you typically call these functions
manually, or are they called automatically by the generator protocol?

> It is not clear to me which methods share the cache.
>
> On UNIX, is_dir() and is_file() call os.stat(); whereas lstat() and
> is_symlink() call os.lstat().
>
> If os.stat() says that the file is not a symlink, I guess that you can
> use os.stat() result for lstat() and is_symlink() methods?
>
> In the worst case, if the path is a symlink, would it be possible that
> os.stat() and os.lstat() become "inconsistent" if the symlink is
> modified between the two calls? If yes, I don't think that it's an
> issue, it's just good to know it.
>
> For symlinks, readdir() returns the status of the linked file or of the symlink?

I think you're misunderstanding is_dir() and is_file(), as these don't
actually call os.stat(). All DirEntry methods either call nothing or
os.lstat() to get the stat info on the entry itself (not the
destination of the symlink). In light of this, I don't think what
you're describing above is an issue.

-Ben
Victor Stinner
2014-07-08 19:55:59 UTC
Permalink
Le mardi 8 juillet 2014, Ben Hoyt <***@gmail.com> a écrit :

>
> > It is not clear to me which methods share the cache.
> >
> > On UNIX, is_dir() and is_file() call os.stat(); whereas lstat() and
> > is_symlink() call os.lstat().
> >
> > If os.stat() says that the file is not a symlink, I guess that you can
> > use os.stat() result for lstat() and is_symlink() methods?
> >
> > In the worst case, if the path is a symlink, would it be possible that
> > os.stat() and os.lstat() become "inconsistent" if the symlink is
> > modified between the two calls? If yes, I don't think that it's an
> > issue, it's just good to know it.
> >
> > For symlinks, readdir() returns the status of the linked file or of the
> symlink?
>
> I think you're misunderstanding is_dir() and is_file(), as these don't
> actually call os.stat(). All DirEntry methods either call nothing or
> os.lstat() to get the stat info on the entry itself (not the
> destination of the symlink).


Oh. Extract of your PEP: "is_dir(): like os.path.isdir(), but much cheaper".

genericpath.isdir() and genericpath.isfile() use os.stat(), whereas
posixpath.islink() uses os.lstat().

Is it a mistake in the PEP?

> In light of this, I don't think what you're describing above is an issue.

I'm not saying that there is an issue, I'm just trying to understand.

Victor
Ben Hoyt
2014-07-08 20:09:36 UTC
Permalink
>> I think you're misunderstanding is_dir() and is_file(), as these don't
>> actually call os.stat(). All DirEntry methods either call nothing or
>> os.lstat() to get the stat info on the entry itself (not the
>> destination of the symlink).
>
>
> Oh. Extract of your PEP: "is_dir(): like os.path.isdir(), but much cheaper".
>
> genericpath.isdir() and genericpath.isfile() use os.stat(), whereas
> posixpath.islink() uses os.lstat().
>
> Is it a mistake in the PEP?

Ah, you're dead right -- this is basically a bug in the PEP, as
DirEntry.is_dir() is not like os.path.isdir() in that it is based on
the entry itself (like lstat), not following the link.

I'll improve the wording here and update the PEP.

-Ben
Victor Stinner
2014-07-09 13:05:05 UTC
Permalink
2014-07-08 22:09 GMT+02:00 Ben Hoyt <***@gmail.com>:
>>> I think you're misunderstanding is_dir() and is_file(), as these don't
>>> actually call os.stat(). All DirEntry methods either call nothing or
>>> os.lstat() to get the stat info on the entry itself (not the
>>> destination of the symlink).
>>
>>
>> Oh. Extract of your PEP: "is_dir(): like os.path.isdir(), but much cheaper".
>>
>> genericpath.isdir() and genericpath.isfile() use os.stat(), whereas
>> posixpath.islink() uses os.lstat().
>>
>> Is it a mistake in the PEP?
>
> Ah, you're dead right -- this is basically a bug in the PEP, as
> DirEntry.is_dir() is not like os.path.isdir() in that it is based on
> the entry itself (like lstat), not following the link.
>
> I'll improve the wording here and update the PEP.

Ok, so it means that your example grouping files per type, files and
directories, is also wrong. Or at least, it behaves differently than
os.walk(). You should put symbolic links to directories in the "dirs"
list too.

if entry.is_dir(): # is_dir() checks os.lstat()
dirs.append(entry)
elif entry.is_symlink() and os.path.isdir(entry): # isdir() checks os.stat()
dirs.append(entry)
else:
non_dirs.append(entry)

Victor
Ben Hoyt
2014-07-09 13:12:24 UTC
Permalink
> Ok, so it means that your example grouping files per type, files and
> directories, is also wrong. Or at least, it behaves differently than
> os.walk(). You should put symbolic links to directories in the "dirs"
> list too.
>
> if entry.is_dir(): # is_dir() checks os.lstat()
> dirs.append(entry)
> elif entry.is_symlink() and os.path.isdir(entry): # isdir() checks os.stat()
> dirs.append(entry)
> else:
> non_dirs.append(entry)

Yes, good call. I believe I'm doing this wrong in the scandir.py
os.walk() implementation too -- hence this open issue:
https://github.com/benhoyt/scandir/issues/4

-Ben
Victor Stinner
2014-07-09 15:05:33 UTC
Permalink
2014-07-09 15:12 GMT+02:00 Ben Hoyt <***@gmail.com>:
>> Ok, so it means that your example grouping files per type, files and
>> directories, is also wrong. Or at least, it behaves differently than
>> os.walk(). You should put symbolic links to directories in the "dirs"
>> list too.
>>
>> if entry.is_dir(): # is_dir() checks os.lstat()
>> dirs.append(entry)
>> elif entry.is_symlink() and os.path.isdir(entry): # isdir() checks os.stat()
>> dirs.append(entry)
>> else:
>> non_dirs.append(entry)
>
> Yes, good call. I believe I'm doing this wrong in the scandir.py
> os.walk() implementation too -- hence this open issue:
> https://github.com/benhoyt/scandir/issues/4

The PEP says that DirEntry should mimic pathlib.Path, so I think that
DirEntry.is_dir() should work as os.path.isir(): if the entry is a
symbolic link, you should follow the symlink to get the status of the
linked file with os.stat().

"entry.is_dir() or (entry.is_symlink() and os.path.isdir(entry))"
looks wrong: why would you have to check is_dir() and isdir()?
Duplicating this check is error prone and not convinient.

Pseudo-code:
---
class DirEntry:
def __init__(self, lstat=None, d_type=None):
self._stat = None
self._lstat = lstat
self._d_type = d_type

def stat(self):
if self._stat is None:
self._stat = os.stat(self.full_name)
return self._stat

def lstat(self):
if self._lstat is None:
self._lstat = os.lstat(self.full_name)
return self._lstat

def is_dir(self):
if self._d_type is not None:
if self._d_type == DT_DIR:
return True
if self._d_type != DT_LNK:
return False
else:
lstat = self.lstat()
if stat.S_ISDIR(lstat.st_mode):
return True
if not stat.S_ISLNK(lstat.st_mode):
return False
stat = self.stat()
return stat.S_ISDIR(stat.st_mode)
---

DirEntry would be created with lstat (Windows) or d_type (Linux)
prefilled. is_dir() would only need to call os.stat() once for
symbolic links.

With this code, it becomes even more obvious why is_dir() is a method
and not a property ;-)

Victor
Paul Moore
2014-07-09 15:26:48 UTC
Permalink
On 9 July 2014 16:05, Victor Stinner <***@gmail.com> wrote:
> The PEP says that DirEntry should mimic pathlib.Path, so I think that
> DirEntry.is_dir() should work as os.path.isir(): if the entry is a
> symbolic link, you should follow the symlink to get the status of the
> linked file with os.stat().

Would this not "break" the tree size script being discussed in the
other thread, as it would follow links and include linked directories
in the "size" of the tree?

As a Windows user with only a superficial understanding of how
symlinks should behave, I don't have any intuition as to what the
"right" answer should be. But I would say that the tree size code
we've been debating over there (which recurses if is_dir is true and
adds in st_size otherwise) should do whatever people would expect of a
function with that name, as it's a perfect example of something a
Windows user might write and expect it to work cross-platform. If it
doesn't much of the worrying over making sure scandir's API is
"cross-platform by default" is probably being wasted :-)

(Obviously the walk_tree function could be modified if needed, but
that's missing the point I'm trying to make :-))

Paul
Ben Hoyt
2014-07-09 15:29:21 UTC
Permalink
>> The PEP says that DirEntry should mimic pathlib.Path, so I think that
>> DirEntry.is_dir() should work as os.path.isir(): if the entry is a
>> symbolic link, you should follow the symlink to get the status of the
>> linked file with os.stat().
>
> Would this not "break" the tree size script being discussed in the
> other thread, as it would follow links and include linked directories
> in the "size" of the tree?

Yeah, I agree. Victor -- I don't think the DirEntry is_X() methods (or
attributes) should mimic the link-following os.path.isdir() at all.
You want the type of the entry, not the type of the source.

Otherwise, as Paul says, you are essentially forced to follow links,
and os.walk(followlinks=False), which is the default, can't do the
right thing.

-Ben
Victor Stinner
2014-07-10 00:15:58 UTC
Permalink
2014-07-09 17:29 GMT+02:00 Ben Hoyt <***@gmail.com>:
>> Would this not "break" the tree size script being discussed in the
>> other thread, as it would follow links and include linked directories
>> in the "size" of the tree?

The get_tree_size() function in the PEP would use: "if not
entry.is_symlink() and entry.is_dir():".

Note: First I wrote "if entry.is_dir() and not entry.is_symlink():",
but this syntax is slower on Linux because is_dir() has to call
lstat().

Adding an optional keyword to DirEntry.is_dir() would allow to write
"if entry.is_dir(follow_symlink=False)", but it looks like a micro
optimization and as I said, I prefer to stick to pathlib.Path API
(which was already heavily discussed in its PEP). Anyway, this case is
rare (I explain that below), we should not worry too much about it.

> Yeah, I agree. Victor -- I don't think the DirEntry is_X() methods (or
> attributes) should mimic the link-following os.path.isdir() at all.
> You want the type of the entry, not the type of the source.

On UNIX, a symlink to a directory is expected to behave like a
directory. For example, in a file browser, you should enter in the
linked directory when you click on a symlink to a directory.

There are only a few cases where you want to handle symlinks
differently: archive (ex: tar), compute the size of a directory (ex:
du does not follow symlinks by default, du -L follows them), remove a
directory.

You should do a short poll in the Python stdlib and on the Internet to
check what is the most common check.

Examples of the Python stdlib:

- zipfile: listdir + os.path.isdir
- pkgutil: listdir + os.path.isdir
- unittest.loader: listdir + os.path.isdir and os.path.isfile
- http.server: listdir + os.path.isdir, it also uses os.path.islink: "
Append / for directories or @ for symbolic links "
- idlelib.GrepDialog: listdir + os.path.isdir
- compileall: listdir + os.path.isdir and "os.path.isdir(fullname) and
not os.path.islink(fullname)" <= don't follow symlinks to directories
- shutil (copytree): listdir + os.path.isdir + os.path.islink
- shutil (rmtree): listdir + os.lstat() + stat.S_ISDIR(mode) <= don't
follow symlinks to directories
- mailbox: listdir + os.path.isdir
- tabnanny: listdir + os.path.isdir
- os.walk: listdir + os.path.isdir + os.path.islink <= don't follow
symlinks to directories by default, but the behaviour is configurable
... but symlinks to directories are added to the "dirs" list (not all
symlinks, only symlinks to directories)
- setup.py: listdir + os.path.isfile

In this list of 12 examples, only compileall, shutil.rmtree and
os.walk check if entries are symlinks. compileall starts by checking
"if not os.path.isdir(fullname):" which follows symlinks. os.walk()
starts by checking "if os.path.isdir(name):" which follows symlinks. I
consider that only one case on 12 (8.3%) doesn't follow symlinks.

If entry.is_dir() doesn't follow symlinks, the other 91.7% will need
to be modified to use "if entry.is_dir() or (entry.is_link() and
os.path.is_dir(entry.full_name)):" to keep the same behaviour :-(

> Otherwise, as Paul says, you are essentially forced to follow links,
> and os.walk(followlinks=False), which is the default, can't do the
> right thing.

os.walk() and get_tree_size() are good users of scandir(), but they
are recursive functions. It means that you may handle symlinks
differently, os.walk() gives the choice to follow or not symlinks for
example.

Recursive functions are rare. The most common case is to list files of
a single directory and then filter files depending on various filters
(is a file? is a directory? match the file name? ...). In such use
case, you don't "care" of symlinks (you want to follow them).

Victor
Ethan Furman
2014-07-10 00:38:11 UTC
Permalink
On 07/09/2014 05:15 PM, Victor Stinner wrote:
> 2014-07-09 17:29 GMT+02:00 Ben Hoyt <***@gmail.com>:
>>> Would this not "break" the tree size script being discussed in the
>>> other thread, as it would follow links and include linked directories
>>> in the "size" of the tree?
>
> The get_tree_size() function in the PEP would use: "if not
> entry.is_symlink() and entry.is_dir():".
>
> Note: First I wrote "if entry.is_dir() and not entry.is_symlink():",
> but this syntax is slower on Linux because is_dir() has to call
> lstat().

Wouldn't it only have to call lstat if the entry was, in fact, a link?


> There are only a few cases where you want to handle symlinks
> differently: archive (ex: tar), compute the size of a directory (ex:
> du does not follow symlinks by default, du -L follows them), remove a
> directory.

I agree with Victor here. If the entry is a link I would want to know if it was a link to a directory or a link to a
file. If I care about not following sym links I can check is_symlink() (or whatever it's called).

--
~Ethan~
Victor Stinner
2014-07-10 00:23:17 UTC
Permalink
2014-07-09 17:26 GMT+02:00 Paul Moore <***@gmail.com>:
> On 9 July 2014 16:05, Victor Stinner <***@gmail.com> wrote:
>> The PEP says that DirEntry should mimic pathlib.Path, so I think that
>> DirEntry.is_dir() should work as os.path.isir(): if the entry is a
>> symbolic link, you should follow the symlink to get the status of the
>> linked file with os.stat().
>
> (...)
> As a Windows user with only a superficial understanding of how
> symlinks should behave, (...)

FYI Windows also supports symbolic links since Windows Vista. The
feature is unknown because it is restricted to the administrator
account. Try the "mklink" command in a terminal (cmd.exe) ;-)
http://en.wikipedia.org/wiki/NTFS_symbolic_link

... To be honest, I never created a symlink on Windows. But since it
is supported, you need to know it to write correctly your Windows
code.

(It's unrelated to "LNK" files.)

Victor
Tim Delaney
2014-07-10 00:58:57 UTC
Permalink
On 10 July 2014 10:23, Victor Stinner <***@gmail.com> wrote:

> 2014-07-09 17:26 GMT+02:00 Paul Moore <***@gmail.com>:
> > On 9 July 2014 16:05, Victor Stinner <***@gmail.com> wrote:
> >> The PEP says that DirEntry should mimic pathlib.Path, so I think that
> >> DirEntry.is_dir() should work as os.path.isir(): if the entry is a
> >> symbolic link, you should follow the symlink to get the status of the
> >> linked file with os.stat().
> >
> > (...)
> > As a Windows user with only a superficial understanding of how
> > symlinks should behave, (...)
>
> FYI Windows also supports symbolic links since Windows Vista. The
> feature is unknown because it is restricted to the administrator
> account. Try the "mklink" command in a terminal (cmd.exe) ;-)
> http://en.wikipedia.org/wiki/NTFS_symbolic_link
>
> ... To be honest, I never created a symlink on Windows. But since it
> is supported, you need to know it to write correctly your Windows
> code.
>

Personally, I create them all the time on Windows - mainly via the Link
Shell Extension <
http://schinagl.priv.at/nt/hardlinkshellext/linkshellextension.html>. It's
the easiest way to ensure that my directory structures are as I want them
whilst not worrying about where the files really are e.g. code on SSD,
GB+-sized data files on rusty metal, symlinks makes it look like it's the
same directory structure. Same thing can be done with junctions if you're
only dealing with directories, but symlinks work with files as well.

I work cross-platform, and have a mild preference for option #2 with
similar semantics on all platforms.

Tim Delaney
Paul Moore
2014-07-10 07:04:53 UTC
Permalink
On 10 July 2014 01:23, Victor Stinner <***@gmail.com> wrote:
>> As a Windows user with only a superficial understanding of how
>> symlinks should behave, (...)
>
> FYI Windows also supports symbolic links since Windows Vista. The
> feature is unknown because it is restricted to the administrator
> account. Try the "mklink" command in a terminal (cmd.exe) ;-)
> http://en.wikipedia.org/wiki/NTFS_symbolic_link
>
> ... To be honest, I never created a symlink on Windows. But since it
> is supported, you need to know it to write correctly your Windows
> code.

I know how symlinks *do* behave, and I know how Windows supports them.
What I meant was that, because Windows typically makes little use of
symlinks, I have little or no intuition of what feels natural to
people using an OS where symlinks are common.

As someone (Tim?) pointed out later in the thread,
FindFirstFile/FindNextFile doesn't follow symlinks by default (and nor
do the dirent entries on Unix). So whether or not it's "natural", the
"free" functionality provided by the OS is that of lstat, not that of
stat. Presumably because it's possible to build symlink-following code
on top of non-following code, but not the other way around.

Paul
Tim Delaney
2014-07-10 07:35:19 UTC
Permalink
On 10 July 2014 17:04, Paul Moore <***@gmail.com> wrote:

> On 10 July 2014 01:23, Victor Stinner <***@gmail.com> wrote:
> >> As a Windows user with only a superficial understanding of how
> >> symlinks should behave, (...)
> >
> > FYI Windows also supports symbolic links since Windows Vista. The
> > feature is unknown because it is restricted to the administrator
> > account. Try the "mklink" command in a terminal (cmd.exe) ;-)
> > http://en.wikipedia.org/wiki/NTFS_symbolic_link
> >
> > ... To be honest, I never created a symlink on Windows. But since it
> > is supported, you need to know it to write correctly your Windows
> > code.
>
> I know how symlinks *do* behave, and I know how Windows supports them.
> What I meant was that, because Windows typically makes little use of
> symlinks, I have little or no intuition of what feels natural to
> people using an OS where symlinks are common.
>
> As someone (Tim?) pointed out later in the thread,
> FindFirstFile/FindNextFile doesn't follow symlinks by default (and nor
> do the dirent entries on Unix).


It wasn't me (I didn't even see it - lost in the noise).


> So whether or not it's "natural", the
> "free" functionality provided by the OS is that of lstat, not that of
> stat. Presumably because it's possible to build symlink-following code
> on top of non-following code, but not the other way around.
>

For most uses the most natural thing is to follow symlinks (e.g. opening a
symlink in a text editor should open the target). However, I think not
following symlinks by default is better approach for exactly the reason
Paul has noted above.

Tim Delaney
Victor Stinner
2014-07-10 08:37:19 UTC
Permalink
2014-07-10 9:04 GMT+02:00 Paul Moore <***@gmail.com>:
> As someone (Tim?) pointed out later in the thread,
> FindFirstFile/FindNextFile doesn't follow symlinks by default (and nor
> do the dirent entries on Unix). So whether or not it's "natural", the
> "free" functionality provided by the OS is that of lstat, not that of
> stat. Presumably because it's possible to build symlink-following code
> on top of non-following code, but not the other way around.

DirEntry methods will remain free (no syscall) for directories and
regular files. One extra syscall will be needed only for symlinks,
which are more rare than other file types (for example, you wrote "
Windows typically makes little use of symlinks").

See my pseudo-code:
https://mail.python.org/pipermail/python-dev/2014-July/135439.html

On Windows, _lstat and _stat attributes will be filled directly in the
constructor on Windows for regular files and directories.

Victor
Nick Coghlan
2014-07-10 13:58:57 UTC
Permalink
On 10 Jul 2014 03:39, "Victor Stinner" <***@gmail.com> wrote:
>
> 2014-07-10 9:04 GMT+02:00 Paul Moore <***@gmail.com>:
> > As someone (Tim?) pointed out later in the thread,
> > FindFirstFile/FindNextFile doesn't follow symlinks by default (and nor
> > do the dirent entries on Unix). So whether or not it's "natural", the
> > "free" functionality provided by the OS is that of lstat, not that of
> > stat. Presumably because it's possible to build symlink-following code
> > on top of non-following code, but not the other way around.
>
> DirEntry methods will remain free (no syscall) for directories and
> regular files. One extra syscall will be needed only for symlinks,
> which are more rare than other file types (for example, you wrote "
> Windows typically makes little use of symlinks").

The info we want for scandir is that of the *link itself*. That makes it
easy to implement things like the "followlinks" flag of os.walk. The *far
end* of the link isn't relevant at this level.

The docs just need to be clear that DirEntry objects always match lstat(),
never stat().

Cheers,
Nick.

>
> See my pseudo-code:
> https://mail.python.org/pipermail/python-dev/2014-July/135439.html
>
> On Windows, _lstat and _stat attributes will be filled directly in the
> constructor on Windows for regular files and directories.
>
> Victor
> _______________________________________________
> Python-Dev mailing list
> Python-***@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
Ben Hoyt
2014-07-10 14:19:28 UTC
Permalink
>> DirEntry methods will remain free (no syscall) for directories and
>> regular files. One extra syscall will be needed only for symlinks,
>> which are more rare than other file types (for example, you wrote "
>> Windows typically makes little use of symlinks").
>
> The info we want for scandir is that of the *link itself*. That makes it
> easy to implement things like the "followlinks" flag of os.walk. The *far
> end* of the link isn't relevant at this level.
>
> The docs just need to be clear that DirEntry objects always match lstat(),
> never stat().

Yeah, I agree with this. It makes the function (and documentation and
implementation) quite a lot simpler to understand.

scandir() is a lowish-level function which deals with the directory
entries themselves, and mirrors both Windows FindNextFile and POSIX
readdir() in that. If the user wants follow-links behaviour, they can
easily call os.stat() themselves. If this is clearly documented that
seems much simpler to me (and it also seems implicit to me in the fact
that you're calling is_dir() on the *entry*).

Otherwise we might as well go down the route of -- the objects
returned are just like pathlib.Path(), but with stat() and lstat()
cached on first use.

-Ben
Ethan Furman
2014-07-10 17:53:45 UTC
Permalink
On 07/10/2014 06:58 AM, Nick Coghlan wrote:
>
> The info we want for scandir is that of the *link itself*. That makes it
> easy to implement things like the "followlinks" flag of os.walk. The
> *far end* of the link isn't relevant at this level.

This also mirrors listdir, correct? scandir is simply* returning something smarter than a string.

> The docs just need to be clear that DirEntry objects always match lstat(), never stat().

Agreed.

--
~Ethan~

* As well as being a less resource-intensive generator. :)
Victor Stinner
2014-07-10 00:57:00 UTC
Permalink
Oh, since I'm proposing to add a new stat() method to DirEntry, we can
optimize it. stat() can reuse lstat() result if the file is not a
symlink. It simplifies is_dir().

New pseudo-code:
---
class DirEntry:
def __init__(self, path, name, lstat=None, d_type=None):
self.name = name
self.full_name = os.path.join(path, name)
# lstat is known on Windows
self._lstat = lstat
if lstat is not None and not stat.S_ISLNK(lstat.st_mode):
# On Windows, stat() only calls os.stat() for symlinks
self._stat = lstat
else:
self._stat = None
# d_type is known on UNIX
if d_type != DT_UNKNOWN:
self._d_type = d_type
else:
# DT_UNKNOWN is not a very useful information :-p
self._d_type = None

def stat(self):
if self._stat is None:
self._stat = os.stat(self.full_name)
return self._stat

def lstat(self):
if self._lstat is None:
self._lstat = os.lstat(self.full_name)
if self._stat is None and not stat.S_ISLNK(self._lstat.st_mode):
self._stat = lstat
return self._lstat

def is_dir(self):
if self._d_type is not None:
if self._d_type == DT_DIR:
return True
if self._d_type != DT_LNK:
return False
else:
lstat = self.lstat()
if stat.S_ISDIR(lstat.st_mode):
return True
stat = self.stat() # if lstat() was already called, stat()
will only call os.stat() for symlink
return stat.S_ISDIR(stat.st_mode)
---

The extra caching rules are complex, that's why I suggest to not document them.

In short: is_dir() only needs an extra syscall for symlinks, for other
file types it does not need any syscall.

Victor
Janzert
2014-07-08 15:44:54 UTC
Permalink
On 7/8/2014 9:52 AM, Ben Hoyt wrote:
> DirEntry fields being "static" attribute-only objects
> -----------------------------------------------------
>
> In `this July 2014 python-dev message
> <https://mail.python.org/pipermail/python-dev/2014-July/135303.html>`_,
> Paul Moore suggested a solution that was a "thin wrapper round the OS
> feature", where the ``DirEntry`` object had only static attributes:
> ``name``, ``full_name``, and ``is_X``, with the ``st_X`` attributes
> only present on Windows. The idea was to use this simpler, lower-level
> function as a building block for higher-level functions.
>
> At first there was general agreement that simplifying in this way was
> a good thing. However, there were two problems with this approach.
> First, the assumption is the ``is_dir`` and similar attributes are
> always present on POSIX, which isn't the case (if ``d_type`` is not
> present or is ``DT_UNKNOWN``). Second, it's a much harder-to-use API
> in practice, as even the ``is_dir`` attributes aren't always present
> on POSIX, and would need to be tested with ``hasattr()`` and then
> ``os.stat()`` called if they weren't present.
>

Only exposing what the OS provides for free will make the API too
difficult to use in the common case. But is there a nice way to expand
the API that will allow the user who is trying to avoid extra expense
know what information is already available?

Even if the initial version doesn't have a way to check what information
is there for free, ensuring there is a clean way to add this in the
future would be really nice.

Janzert
Ben Hoyt
2014-07-08 18:05:53 UTC
Permalink
> Only exposing what the OS provides for free will make the API too difficult
> to use in the common case. But is there a nice way to expand the API that
> will allow the user who is trying to avoid extra expense know what
> information is already available?
>
> Even if the initial version doesn't have a way to check what information is
> there for free, ensuring there is a clean way to add this in the future
> would be really nice.

We could easily add ".had_type" and ".had_lstat" properties (not sure
on the names), that would be true if the is_X information and lstat
information was fetched, respectively. Basically both would always be
True on Windows, but on POSIX only had_type would be True d_type is
present and != DT_UNKNOWN.

I don't feel this is actually necessary, but it's not hard to add.

Thoughts?

-Ben
Ethan Furman
2014-07-08 19:02:56 UTC
Permalink
On 07/08/2014 11:05 AM, Ben Hoyt wrote:
>> Only exposing what the OS provides for free will make the API too difficult
>> to use in the common case. But is there a nice way to expand the API that
>> will allow the user who is trying to avoid extra expense know what
>> information is already available?
>>
>> Even if the initial version doesn't have a way to check what information is
>> there for free, ensuring there is a clean way to add this in the future
>> would be really nice.
>
> We could easily add ".had_type" and ".had_lstat" properties (not sure
> on the names), that would be true if the is_X information and lstat
> information was fetched, respectively. Basically both would always be
> True on Windows, but on POSIX only had_type would be True d_type is
> present and != DT_UNKNOWN.
>
> I don't feel this is actually necessary, but it's not hard to add.
>
> Thoughts?

Better to just have the attributes be None if they were not fetched. None is better than hasattr anyway, at least in
the respect of not having to catch exceptions to function properly.

--
~Ethan~
Ben Hoyt
2014-07-08 19:34:26 UTC
Permalink
> Better to just have the attributes be None if they were not fetched. None
> is better than hasattr anyway, at least in the respect of not having to
> catch exceptions to function properly.

The thing is, is_dir() and lstat() are not attributes (for a good
reason). Please read the relevant "Rejected ideas" sections and let us
know what you think. :-)

-Ben
Ethan Furman
2014-07-08 20:22:33 UTC
Permalink
On 07/08/2014 12:34 PM, Ben Hoyt wrote:
>>
>> Better to just have the attributes be None if they were not fetched. None
>> is better than hasattr anyway, at least in the respect of not having to
>> catch exceptions to function properly.
>
> The thing is, is_dir() and lstat() are not attributes (for a good
> reason). Please read the relevant "Rejected ideas" sections and let us
> know what you think. :-)

I did better than that -- I read the whole thing! ;)

-1 on the PEP's implementation.

Just like an attribute does not imply a system call, having a method named 'is_dir' /does/ imply a system call, and not
having one can be just as misleading.

If we have this:

size = 0
for entry in scandir('/some/path'):
size += entry.st_size

- on Windows, this should Just Work (if I have the names correct ;)
- on Posix, etc., this should fail noisily with either an AttributeError
('entry' has no 'st_size') or a TypeError (cannot add None)

and the solution is equally simple:

for entry in scandir('/some/path', stat=True):

- if not Windows, perform a stat call at the same time

Now, of course, we might get errors. I am not a big fan of wrapping everything in try/except, particularly when we
already have a model to follow -- os.walk:

for entry in scandir('/some/path', stat=True, onerror=record_and_skip):

If we don't care if an error crashes the script, leave off onerror.

If we don't need st_size and friends, leave off stat=True.

If we get better performance on Windows instead of Linux, that's okay.

scandir is going into os because it may not behave the same on every platform. Heck, even some non-os modules
(multiprocessing comes to mind) do not behave the same on every platform.

I think caching the attributes for DirEntry is fine, but let's do it as a snapshot of that moment in time, not name now,
and attributes in 30 minutes when we finally get to you because we had a lot of processing/files ahead of you (you being
a DirEntry ;) .

--
~Ethan~
Ethan Furman
2014-07-08 21:05:22 UTC
Permalink
On 07/08/2014 01:22 PM, Ethan Furman wrote:
>
> I think caching the attributes for DirEntry is fine, but let's do it as a snapshot of that moment in time, not name now,
> and attributes in 30 minutes when we finally get to you because we had a lot of processing/files ahead of you (you being
> a DirEntry ;) .

This bit is wrong, I think, since scandir is a generator -- there wouldn't be much time passing between the direntry
call and the stat call in any case. Hopefully my other points still hold.

--
~Ethan~
Ben Hoyt
2014-07-09 01:08:03 UTC
Permalink
> I did better than that -- I read the whole thing! ;)

Thanks. :-)

> -1 on the PEP's implementation.
>
> Just like an attribute does not imply a system call, having a
> method named 'is_dir' /does/ imply a system call, and not
> having one can be just as misleading.

Why does a method imply a system call? os.path.join() and str.lower()
don't make system calls. Isn't it just a matter of clear
documentation? Anyway -- less philosophical discussion below.

> If we have this:
>
> size = 0
> for entry in scandir('/some/path'):
> size += entry.st_size
>
> - on Windows, this should Just Work (if I have the names correct ;)
> - on Posix, etc., this should fail noisily with either an AttributeError
> ('entry' has no 'st_size') or a TypeError (cannot add None)
>
> and the solution is equally simple:
>
> for entry in scandir('/some/path', stat=True):
>
> - if not Windows, perform a stat call at the same time

I'm not totally opposed to this, which is basically a combination of
Nick Coghlan's and Paul Moore's recent proposals mentioned in the PEP.
However, as discussed on python-dev, there are some edge cases it
doesn't handle very well, and it's messier to handle errors (requires
onerror as you mention below).

I presume you're suggesting that is_dir/is_file/is_symlink should be
regular attributes, and accessing them should never do a system call.
But what if the system doesn't support d_type (eg: Solaris) or the
d_type value is DT_UNKNOWN (can happen on Linux, OS X, BSD)? The
options are:

1) scandir() would always call lstat() in the case of missing/unknown
d_type. If so, scandir() is actually more expensive than listdir(),
and as a result it's no longer safe to implement listdir in terms of
scandir:

def listdir(path='.'):
return [e.name for e in scandir(path)]

2) Or would it be better to have another flag like scandir(path,
type=True) to ensure the is_X type info is fetched? This is explicit,
but also getting kind of unwieldly.

3) A third option is for the is_X attributes to be absent in this case
(hasattr tests required, and the user would do the lstat manually).
But as I noted on python-dev recently, you basically always want is_X,
so this leads to unwieldly and code that's twice as long as it needs
to be. See here:
https://mail.python.org/pipermail/python-dev/2014-July/135312.html

4) I gather in your proposal above, scandir will call lstat() if
stat=True? Except where does it put the values? Surely it should
return an existing stat_result object, rather than stuffing everything
onto the DirEntry, or throwing away some values on Linux? In this
case, I'd prefer Nick Coghlan's approach of ensure_lstat and a
.stat_result attribute. However, this still has the "what if d_type is
missing or DT_UNKNOWN" issue.

It seems to me that making is_X() methods handles this exact scenario
-- methods are so you don't have to do the dirty work.

So yes, the real world is messy due to missing is_X values, but I
think it's worth getting this right, and is_X() methods can do this
while keeping the API simple and cross-platform.

> Now, of course, we might get errors. I am not a big fan of wrapping everything in try/except, particularly when we already have a model to follow -- os.walk:

I don't mind the onerror too much if we went with this kind of
approach. It's not quite as nice as a standard try/except around the
method call, but it's definitely workable and has a precedent with
os.walk().

It seems a bit like we're going around in circles here, and I think we
have all the information and options available to us, so I'm going to
SUMMARIZE.

We have a choice before us, a fork in the road. :-) We can choose one
of these options for the scandir API:

1) The current PEP 471 approach. This solves the issue with d_type
being missing or DT_UNKNOWN, it doesn't require onerror, and it's a
really tidy API that doesn't explode with AttributeErrors if you write
code on Windows (without thinking too hard) and then move to Linux. I
think all of these points are important -- the cross-platform one not
the least, because we want to make it easy, even *trivial*, for people
to write cross-platform code.

For reference, here's what get_tree_size() looks like with this
approach, not including error handling with try/except:

def get_tree_size(path):
total = 0
for entry in os.scandir(path):
if entry.is_dir():
total += get_tree_size(entry.full_name)
else:
total += entry.lstat().st_size
return total

2) Nick Coghlan's model of only fetching the lstat value if
ensure_lstat=True, and including an onerror callback for error
handling when scandir calls lstat internally. However, as described,
we'd also need an ensure_type=True option, so that scandir() isn't way
slower than listdir() if you actually don't want the is_X values and
d_type is missing/unknown.

For reference, here's what get_tree_size() looks like with this
approach, not including error handling with onerror:

def get_tree_size(path):
total = 0
for entry in os.scandir(path, ensure_type=True, ensure_lstat=True):
if entry.is_dir:
total += get_tree_size(entry.full_name)
else:
total += entry.lstat_result.st_size
return total

I'm fairly strongly in favour of approach #1, but I wouldn't die if
everyone else thinks the benefits of #2 outweigh the somewhat less
nice API.

Comments and votes, please!

-Ben
Ethan Furman
2014-07-09 01:31:55 UTC
Permalink
On 07/08/2014 06:08 PM, Ben Hoyt wrote:
>>
>> Just like an attribute does not imply a system call, having a
>> method named 'is_dir' /does/ imply a system call, and not
>> having one can be just as misleading.
>
> Why does a method imply a system call? os.path.join() and str.lower()
> don't make system calls. Isn't it just a matter of clear
> documentation? Anyway -- less philosophical discussion below.

In this case because the names are exactly the same as the os versions which /do/ make a system call.


> I presume you're suggesting that is_dir/is_file/is_symlink should be
> regular attributes, and accessing them should never do a system call.
> But what if the system doesn't support d_type (eg: Solaris) or the
> d_type value is DT_UNKNOWN (can happen on Linux, OS X, BSD)? The
> options are:

So if I'm finally understanding the root problem here:

- listdir returns a list of strings, one for each filename and one for
each directory, and keeps no other O/S supplied info.

- os.walk, which uses listdir, then needs to go back to the O/S and
refetch the thrown-away information

- so it's slow.

The solution:

- have scandir /not/ throw away the O/S supplied info

and the new problem:

- not all O/Ses provide the same (or any) extra info about the
directory entries

Have I got that right?

If so, I still like the attribute idea better (surprise!), we just need to revisit the 'ensure_lstat' (or whatever it's
called) parameter: instead of a true/false value, it could have a scale:

- 0 = whatever the O/S gives us

- 1 = at least the is_dir/is_file (whatever the other normal one is),
and if the O/S doesn't give it to us for free than call lstat

- 2 = we want it all -- call lstat if necessary on this platform

After all, the programmer should know up front how much of the extra info will be needed for the work that is trying to
be done.


> We have a choice before us, a fork in the road. :-) We can choose one
> of these options for the scandir API:
>
> 1) The current PEP 471 approach. This solves the issue with d_type
> being missing or DT_UNKNOWN, it doesn't require onerror, and it's a
> really tidy API that doesn't explode with AttributeErrors if you write
> code on Windows (without thinking too hard) and then move to Linux. I
> think all of these points are important -- the cross-platform one not
> the least, because we want to make it easy, even *trivial*, for people
> to write cross-platform code.

Yes, but we don't want a function that sucks equally on all platforms. ;)


> 2) Nick Coghlan's model of only fetching the lstat value if
> ensure_lstat=True, and including an onerror callback for error
> handling when scandir calls lstat internally. However, as described,
> we'd also need an ensure_type=True option, so that scandir() isn't way
> slower than listdir() if you actually don't want the is_X values and
> d_type is missing/unknown.

With the multi-level version of 'ensure_lstat' we do not need an extra 'ensure_type'.

For reference, here's what get_tree_size() looks like with this approach, not including error handling with onerror:

def get_tree_size(path):
total = 0
for entry in os.scandir(path, ensure_lstat=1):
if entry.is_dir:
total += get_tree_size(entry.full_name)
else:
total += entry.lstat_result.st_size
return total

And if we added the onerror here it would be a line fragment, as opposed to the extra four lines (at least) for the
try/except in the first example (which I cut).


Finally:

Thank you for writing scandir, and this PEP. Excellent work.

Oh, and +1 for option 2, slightly modified. :)

--
~Ethan~
Ben Hoyt
2014-07-09 12:48:04 UTC
Permalink
> In this case because the names are exactly the same as the os versions which
> /do/ make a system call.

Fair enough.

> So if I'm finally understanding the root problem here:
>
> - listdir returns a list of strings, one for each filename and one for
> each directory, and keeps no other O/S supplied info.
>
> - os.walk, which uses listdir, then needs to go back to the O/S and
> refetch the thrown-away information
>
> - so it's slow.
> ...
> and the new problem:
>
> - not all O/Ses provide the same (or any) extra info about the
> directory entries
>
> Have I got that right?

Yes, that's exactly right.

> If so, I still like the attribute idea better (surprise!), we just need to
> revisit the 'ensure_lstat' (or whatever it's called) parameter: instead of
> a true/false value, it could have a scale:
>
> - 0 = whatever the O/S gives us
>
> - 1 = at least the is_dir/is_file (whatever the other normal one is),
> and if the O/S doesn't give it to us for free than call lstat
>
> - 2 = we want it all -- call lstat if necessary on this platform
>
> After all, the programmer should know up front how much of the extra info
> will be needed for the work that is trying to be done.

Yeah, I think this is a good idea to make option #2 a bit nicer. I
don't like the magic constants, and using constants like
os.SCANDIR_LSTAT is annoying, so how about using strings? I also
suggest calling the parameter "info" (because it determines what info
is returned), so you'd do scandir(path, info='type') if you need just
the is_X type information.

I also think it's nice to have a way for power users to "just return
what the OS gives us". However, I think making this the default is a
bad idea, as it's just asking for cross-platform bugs (and it's easy
to prevent).

Paul Moore basically agrees with this in his reply yesterday, though I
disagree with him it would be unfriendly to fail hard unless you asked
for the info -- quite the opposite, Linux users would think it very
unfriendly when your code broke because you didn't ask for the info.
:-)

So how about tweaking option #2 a tiny bit more to this:

def scandir(path='.', info=None, onerror=None): ...

* if info is None (the default), only the .name and .full_name
attributes are present
* if info is 'type', scandir ensures the is_dir/is_file/is_symlink
attributes are present and either True or False
* if info is 'lstat', scandir additionally ensures a .lstat is present
and is a full stat_result object
* if info is 'os', scandir returns the attributes the OS provides
(everything on Windows, only is_X -- most of the time -- on POSIX)

* if onerror is not None and errors occur during any internal lstat()
call, onerror(exc) is called with the OSError exception object

Further point -- because the is_dir/is_file/is_symlink attributes are
booleans, it would be very bad for them to be present but None if you
didn't ask for (or the OS didn't return) the type information. Because
then "if entry.is_dir:" would be None and your code would think it
wasn't a directory, when actually you don't know. For this reason, all
attributes should fail with AttributeError if not fetched.

> Thank you for writing scandir, and this PEP. Excellent work.

Thanks!

> Oh, and +1 for option 2, slightly modified. :)

With the above tweaks, I'm getting closer to being 50/50. It's
probably 60% #1 and 40% #2 for me now. :-)

Okay folks -- please respond: option #1 as per the current PEP 471, or
option #2 with Ethan's multi-level thing tweaks as per the above?

-Ben
Paul Moore
2014-07-09 13:12:34 UTC
Permalink
Received: from localhost (HELO mail.python.org) (127.0.0.1)
by albatross.python.org with SMTP; 09 Jul 2014 15:12:36 +0200
Received: from mail-qa0-x232.google.com (unknown
[IPv6:2607:f8b0:400d:c00::232])
(using TLSv1 with cipher ECDHE-RSA-AES128-SHA (128/128 bits))
(No client certificate requested)
by mail.python.org (Postfix) with ESMTPS
for <python-***@python.org>; Wed, 9 Jul 2014 15:12:36 +0200 (CEST)
Received: by mail-qa0-f50.google.com with SMTP id m5so5904074qaj.37
for <python-***@python.org>; Wed, 09 Jul 2014 06:12:34 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
h=mime-version:in-reply-to:references:date:message-id:subject:from:to
:cc:content-type;
bh=L/EQSax8cwwNpDBry4TUPfXtLtknQpwY0raOQ5i5THQ=;
b=U5pxInUcMxC92H6UvL3Q2bnjl82U8OMh4A9sJ0HHttjv6dgPJmLgBK9pd2RunYAqPN
NDtDnApKVrhI5/eN2qyoBnI8J+EejaOZjRAdbyMvVMDEXrITHx98EHIAiCquqPoOHki5
cBPedbEcrstfD+3z0LSRk6G5ccW2aj2xhW+scumUx84+0IKqDeaXp+3ekHL7H//kOiRU
bn5eX6bNQVTO1TUh3aagwJjlVEZsWtjIubq00h+WRd7qKzsUZ9O8EC/67RpKj4rVeSIX
ZS6de4q9r93w489Qtu8w2339LanIG5c82cn03ommmg9zCTnGTHY8zXBhay5u2+dhU0UZ
YsuQ==
X-Received: by 10.224.38.71 with SMTP id a7mr38056938qae.44.1404911554722;
Wed, 09 Jul 2014 06:12:34 -0700 (PDT)
Received: by 10.140.22.229 with HTTP; Wed, 9 Jul 2014 06:12:34 -0700 (PDT)
In-Reply-To: <CAL9jXCHCZNDU6nMuJYddB8PN3k3rQL3NR7PPyWZox32X6+***@mail.gmail.com>
X-BeenThere: python-***@python.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Python core developers <python-dev.python.org>
List-Unsubscribe: <https://mail.python.org/mailman/options/python-dev>,
<mailto:python-dev-***@python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/pipermail/python-dev/>
List-Post: <mailto:python-***@python.org>
List-Help: <mailto:python-dev-***@python.org?subject=help>
List-Subscribe: <https://mail.python.org/mailman/listinfo/python-dev>,
<mailto:python-dev-***@python.org?subject=subscribe>
Errors-To: python-dev-bounces+python-python-dev=***@python.org
Sender: "Python-Dev"
<python-dev-bounces+python-python-dev=***@python.org>
Archived-At: <http://permalink.gmane.org/gmane.comp.python.devel/148578>

On 9 July 2014 13:48, Ben Hoyt <***@gmail.com> wrote:
> Okay folks -- please respond: option #1 as per the current PEP 471, or
> option #2 with Ethan's multi-level thing tweaks as per the above?

I'm probably about 50/50 at the moment. What will swing it for me is
likely error handling, so let's try both approaches with some error
handling:

Rules are that we calculate the total size of all files in a tree (as
returned from lstat), with files that fail to stat being logged and
their size assumed to be 0.

Option 1:

def get_tree_size(path):
total = 0
for entry in os.scandir(path):
try:
isdir = entry.is_dir()
except OSError:
logger.warn("Cannot stat {}".format(entry.full_name))
continue
if entry.is_dir():
total += get_tree_size(entry.full_name)
else:
try:
total += entry.lstat().st_size
except OSError:
logger.warn("Cannot stat {}".format(entry.full_name))
return total

Option 2:
def log_err(exc):
logger.warn("Cannot stat {}".format(exc.filename))

def get_tree_size(path):
total = 0
for entry in os.scandir(path, info='lstat', onerror=log_err):
if entry.is_dir:
total += get_tree_size(entry.full_name)
else:
total += entry.lstat.st_size
return total

On this basis, #2 wins. However, I'm slightly uncomfortable using the
filename attribute of the exception in the logging, as there is
nothing in the docs saying that this will give a full pathname. I'd
hate to see "Unable to stat __init__.py"!!!

So maybe the onerror function should also receive the DirEntry object
- which will only have the name and full_name attributes, but that's
all that is needed.

OK, looks like option #2 is now my preferred option. My gut instinct
still rebels over an API that deliberately throws information away in
the default case, even though there is now an option to ask it to keep
that information, but I see the logic and can learn to live with it.

Paul
Ben Hoyt
2014-07-09 13:22:41 UTC
Permalink
> Option 2:
> def log_err(exc):
> logger.warn("Cannot stat {}".format(exc.filename))
>
> def get_tree_size(path):
> total = 0
> for entry in os.scandir(path, info='lstat', onerror=log_err):
> if entry.is_dir:
> total += get_tree_size(entry.full_name)
> else:
> total += entry.lstat.st_size
> return total
>
> On this basis, #2 wins.

That's a pretty nice comparison, and you're right, onerror handling is
nicer here.

> However, I'm slightly uncomfortable using the
> filename attribute of the exception in the logging, as there is
> nothing in the docs saying that this will give a full pathname. I'd
> hate to see "Unable to stat __init__.py"!!!

Huh, you're right. I think this should be documented in os.walk() too.
I think it should be the full filename (is it currently?).

> So maybe the onerror function should also receive the DirEntry object
> - which will only have the name and full_name attributes, but that's
> all that is needed.

That's an interesting idea -- though enough of a deviation from
os.walk()'s onerror that I'm uncomfortable with it -- I'd rather just
document that the onerror exception .filename is the full path name.

One issue with option #2 that I just realized -- does scandir yield
the entry at all if there's a stat error? It can't really, because the
caller will except the .lstat attribute to be set (assuming he asked
for type='lstat') but it won't be. Is effectively removing these
entries just because the stat failed a problem? I kind of think it is.
If so, is there a way to solve it with option #2?

> OK, looks like option #2 is now my preferred option. My gut instinct
> still rebels over an API that deliberately throws information away in
> the default case, even though there is now an option to ask it to keep
> that information, but I see the logic and can learn to live with it.

In terms of throwing away info "in the default case" -- it's simply a
case of getting what you ask for. :-) Worst case, you'll write your
code and test it, it'll fail hard on any system, you'll fix it
immediately, and then it'll work on any system.

-Ben
Paul Moore
2014-07-09 13:30:32 UTC
Permalink
Received: from localhost (HELO mail.python.org) (127.0.0.1)
by albatross.python.org with SMTP; 09 Jul 2014 15:30:34 +0200
Received: from mail-qg0-x22c.google.com (unknown
[IPv6:2607:f8b0:400d:c04::22c])
(using TLSv1 with cipher ECDHE-RSA-AES128-SHA (128/128 bits))
(No client certificate requested)
by mail.python.org (Postfix) with ESMTPS
for <python-***@python.org>; Wed, 9 Jul 2014 15:30:34 +0200 (CEST)
Received: by mail-qg0-f44.google.com with SMTP id j107so6462906qga.3
for <python-***@python.org>; Wed, 09 Jul 2014 06:30:33 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
h=mime-version:in-reply-to:references:date:message-id:subject:from:to
:cc:content-type;
bh=pBxKjAkwDRAwOUejA3BiiPxHZhrRAyvpm8LVPMk6InE=;
b=UHZh5zzRDXd+BbzJMwDPa5xalCz7xuAjWcCDAYre2BsalrI5X5qnlVNboimlqOs0PF
7XHHgCcmLyAV72Ecq8ag1RTs3z8jTKmZiCt7SxthXAWtN1BKVvJ+XNDxOFcEwREm0x53
HeTUf9XJ4LGNUjx65/c2hIU+eb9GEyT+2MCROzHKD+3Paov8BugJpGdRwjCDL89DBotD
KkVgD5ONn+2r2dmdnjSlLRhPWz6o4xX+lxnz21CFwzqC94KQF9/U43XxMwjMxgWBphGg
Ge/b/p+BU4nPxlRrgR0b7wuDb/SUKXONcReivOe7y3mzZS9aC+PUUx4hF/76skBuEVHP
R1NA==
X-Received: by 10.224.30.137 with SMTP id u9mr42762842qac.57.1404912632961;
Wed, 09 Jul 2014 06:30:32 -0700 (PDT)
Received: by 10.140.22.229 with HTTP; Wed, 9 Jul 2014 06:30:32 -0700 (PDT)
In-Reply-To: <CAL9jXCGFr4sxcyep+Nkjy=iPc5O1EOQ5RG7+Ya_q+***@mail.gmail.com>
X-BeenThere: python-***@python.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Python core developers <python-dev.python.org>
List-Unsubscribe: <https://mail.python.org/mailman/options/python-dev>,
<mailto:python-dev-***@python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/pipermail/python-dev/>
List-Post: <mailto:python-***@python.org>
List-Help: <mailto:python-dev-***@python.org?subject=help>
List-Subscribe: <https://mail.python.org/mailman/listinfo/python-dev>,
<mailto:python-dev-***@python.org?subject=subscribe>
Errors-To: python-dev-bounces+python-python-dev=***@python.org
Sender: "Python-Dev"
<python-dev-bounces+python-python-dev=***@python.org>
Archived-At: <http://permalink.gmane.org/gmane.comp.python.devel/148581>

On 9 July 2014 14:22, Ben Hoyt <***@gmail.com> wrote:
>> So maybe the onerror function should also receive the DirEntry object
>> - which will only have the name and full_name attributes, but that's
>> all that is needed.
>
> That's an interesting idea -- though enough of a deviation from
> os.walk()'s onerror that I'm uncomfortable with it -- I'd rather just
> document that the onerror exception .filename is the full path name.

But the onerror exception will come from the lstat call, so it'll be a
raw OSError (unless scandir modifies it, which may be what you're
thinking of). And if so, aren't we at the mercy of what the OS module
gives us? That's why I said we can't guarantee it. I looked at the
documentation of OSError (in "Built In Exceptions"), and all it says
is "the filename" (unqualified). I'd expect that to be "whatever got
passed to the underlying OS API" - which may well be an absolute
pathname if we're lucky, but who knows? (I'd actually prefer it if
OSError guaranteed a full pathname, but that's a bigger issue...)

Paul
Ethan Furman
2014-07-09 13:41:04 UTC
Permalink
On 07/09/2014 06:22 AM, Ben Hoyt wrote:
>
> One issue with option #2 that I just realized -- does scandir yield the entry at all if there's a stat error? It
> can't really, because the caller will expect the .lstat attribute to be set (assuming he asked for type='lstat') but
> it won't be. Is effectively removing these entries just because the stat failed a problem? I kind of think it is. If
> so, is there a way to solve it with option #2?

Leave it up to the onerror handler. If it returns None, skip yielding the entry, otherwise yield whatever it returned
-- which also means the error handler should be able to set fields on the DirEntry:

def log_err(exc, entry):
logger.warn("Cannot stat {}".format(exc.filename))
entry.lstat.st_size = 0
return True

def get_tree_size(path):
total = 0
for entry in os.scandir(path, info='lstat', onerror=log_err):
if entry.is_dir:
total += get_tree_size(entry.full_name)
else:
total += entry.lstat.st_size
return total

This particular example doesn't benefit much from the addition, but this way we don't have to guess what the programmer
wants or needs to do in the case of failure.

--
~Ethan~
Ethan Furman
2014-07-09 14:41:11 UTC
Permalink
On 07/09/2014 06:41 AM, Ethan Furman wrote:
>
> Leave it up to the onerror handler. If it returns None, skip yielding the entry, otherwise yield whatever it returned
> -- which also means the error handler should be able to set fields on the DirEntry:
>
> def log_err(exc, entry):
> logger.warn("Cannot stat {}".format(exc.filename))
> entry.lstat.st_size = 0
> return True

Blah. Okay, either return the DirEntry (possibly modified), or have the log_err return entry instead of True. (Now
where is that caffeine??)

--
~Ethan~
Ben Hoyt
2014-07-09 15:35:26 UTC
Permalink
>> One issue with option #2 that I just realized -- does scandir yield the
>> entry at all if there's a stat error? It
>> can't really, because the caller will expect the .lstat attribute to be
>> set (assuming he asked for type='lstat') but
>>
>> it won't be. Is effectively removing these entries just because the stat
>> failed a problem? I kind of think it is. If
>> so, is there a way to solve it with option #2?
>
>
> Leave it up to the onerror handler. If it returns None, skip yielding the
> entry, otherwise yield whatever it returned
> -- which also means the error handler should be able to set fields on the
> DirEntry:
>
> def log_err(exc, entry):
> logger.warn("Cannot stat {}".format(exc.filename))
> entry.lstat.st_size = 0
> return True

This is an interesting idea, but it's just getting more and more
complex, and I'm guessing that being able to change the attributes of
DirEntry will make the C implementation more complex.

Also, I'm not sure it's very workable. For log_err above, you'd
actually have to do something like this, right?

def log_err(exc, entry):
logger.warn("Cannot stat {}".format(exc.filename))
entry.lstat = os.stat_result((0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
return entry

Unless there's another simple way around this issue, I'm back to
loving the simplicity of option #1, which avoids this whole question.

-Ben
Ethan Furman
2014-07-09 16:35:04 UTC
Permalink
On 07/09/2014 08:35 AM, Ben Hoyt wrote:
>>> One issue with option #2 that I just realized -- does scandir yield the
>>> entry at all if there's a stat error? It
>>> can't really, because the caller will expect the .lstat attribute to be
>>> set (assuming he asked for type='lstat') but
>>>
>>> it won't be. Is effectively removing these entries just because the stat
>>> failed a problem? I kind of think it is. If
>>> so, is there a way to solve it with option #2?
>>
>>
>> Leave it up to the onerror handler. If it returns None, skip yielding the
>> entry, otherwise yield whatever it returned
>> -- which also means the error handler should be able to set fields on the
>> DirEntry:
>>
>> def log_err(exc, entry):
>> logger.warn("Cannot stat {}".format(exc.filename))
>> entry.lstat.st_size = 0
>> return True
>
> This is an interesting idea, but it's just getting more and more
> complex, and I'm guessing that being able to change the attributes of
> DirEntry will make the C implementation more complex.
>
> Also, I'm not sure it's very workable. For log_err above, you'd
> actually have to do something like this, right?
>
> def log_err(exc, entry):
> logger.warn("Cannot stat {}".format(exc.filename))
> entry.lstat = os.stat_result((0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
> return entry

I would imagine we would provide a helper function:

def stat_result(st_size=0, st_atime=0, st_mtime=0, ...):
return os.stat_result((st_size, st_atime, st_mtime, ...))

and then in onerror

entry.lstat = stat_result()


> Unless there's another simple way around this issue, I'm back to
> loving the simplicity of option #1, which avoids this whole question.

Too simple is just as bad as too complex, and properly handling errors is rarely a simple task. Either we provide a
clean way to deal with errors in the API, or we force every user everywhere to come up with their own system.

Also, just because we provide it doesn't force people to use it, but if we don't provide it then people cannot use it.

To summarize the choice I think we are looking at:

1) We provide a very basic tool that many will have to write wrappers
around to get the desired behavior (choice 1)

2) We provide a more advanced tool that, in many cases, can be used
as-is, and is also fairly easy to extend to handle odd situations
(choice 2)

More specifically, if we go with choice 1 (no built-in error handling, no mutable DirEntry), how would I implement
choice 2? Would I have to write my own CustomDirEntry object?

--
~Ethan~
Paul Moore
2014-07-09 18:04:09 UTC
Permalink
On 9 July 2014 17:35, Ethan Furman <***@stoneleaf.us> wrote:
> More specifically, if we go with choice 1 (no built-in error handling, no
> mutable DirEntry), how would I implement choice 2? Would I have to write my
> own CustomDirEntry object?

Having built-in error handling is, I think, a key point. That's where
#1 really falls down.

But a mutable DirEntry and/or letting onerror manipulate the result is
a lot more than just having a hook for being notified of errors. That
seems to me to be a step too far, in the current context.
Specifically, the tree size example doesn't need it.

Do you have a compelling use case that needs a mutable DirEntry? It
feels like YAGNI to me.

Paul
Ethan Furman
2014-07-09 18:29:50 UTC
Permalink
On 07/09/2014 11:04 AM, Paul Moore wrote:
> On 9 July 2014 17:35, Ethan Furman <***@stoneleaf.us> wrote:
>> More specifically, if we go with choice 1 (no built-in error handling, no
>> mutable DirEntry), how would I implement choice 2? Would I have to write my
>> own CustomDirEntry object?
>
> Having built-in error handling is, I think, a key point. That's where
> #1 really falls down.
>
> But a mutable DirEntry and/or letting onerror manipulate the result is
> a lot more than just having a hook for being notified of errors. That
> seems to me to be a step too far, in the current context.
> Specifically, the tree size example doesn't need it.
>
> Do you have a compelling use case that needs a mutable DirEntry? It
> feels like YAGNI to me.

Not at this point. As I indicated in my reply to your response, as long as we have the onerror machinery now we can
tweak it later if real-world use shows it would be beneficial.

--
~Ethan~
Ben Hoyt
2014-07-09 19:03:20 UTC
Permalink
This is just getting way too complex ... further thoughts below.

>> This is an interesting idea, but it's just getting more and more
>> complex, and I'm guessing that being able to change the attributes of
>> DirEntry will make the C implementation more complex.
>>
>> Also, I'm not sure it's very workable. For log_err above, you'd
>> actually have to do something like this, right?
>>
>> def log_err(exc, entry):
>> logger.warn("Cannot stat {}".format(exc.filename))
>> entry.lstat = os.stat_result((0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
>> return entry
>
>
> I would imagine we would provide a helper function:
>
> def stat_result(st_size=0, st_atime=0, st_mtime=0, ...):
> return os.stat_result((st_size, st_atime, st_mtime, ...))
>
> and then in onerror
>
> entry.lstat = stat_result()
>
>> Unless there's another simple way around this issue, I'm back to
>> loving the simplicity of option #1, which avoids this whole question.
>
>
> Too simple is just as bad as too complex, and properly handling errors is
> rarely a simple task. Either we provide a clean way to deal with errors in
> the API, or we force every user everywhere to come up with their own system.
>
> Also, just because we provide it doesn't force people to use it, but if we
> don't provide it then people cannot use it.

So here's the ways in which option #2 is now more complicated than option #1:

1) it has an additional "info" argument, the values of which have to
be documented ('os', 'type', 'lstat', and what each one means)
2) it has an additional "onerror" argument, the signature of which and
fairly complicated return value is non-obvious and has to be
documented
3) it requires user modification of the DirEntry object, which needs
documentation, and is potentially hard to implement
4) because the DirEntry object now allows modification, you need a
stat_result() helper function to help you build your own stat values

I'm afraid points 3 and 4 here add way too much complexity.

Remind me why all this is better than the PEP 471 approach again? It
handles all of these problems, is very direct, and uses built-in
Python constructs (method calls and try/except error handling).

And it's also simple to document -- much simpler than the above 4
things, which could be a couple of pages in the docs. Here's the doc
required for the PEP 471 approach:

"Note about caching and error handling: The is_X() and lstat()
functions may perform an lstat() on first call if the OS didn't
already fetch this data when reading the directory. So if you need
fine-grained error handling, catch OSError exceptions around these
method calls. After the first call, the is_X() and lstat() functions
cache the value on the DirEntry."

-Ben
Ethan Furman
2014-07-09 19:17:43 UTC
Permalink
On 07/09/2014 12:03 PM, Ben Hoyt wrote:
>
> So here's the ways in which option #2 is now more complicated than option #1:
>
> 1) it has an additional "info" argument, the values of which have to
> be documented ('os', 'type', 'lstat', and what each one means)
> 2) it has an additional "onerror" argument, the signature of which and
> fairly complicated return value is non-obvious and has to be
> documented
> 3) it requires user modification of the DirEntry object, which needs
> documentation, and is potentially hard to implement
> 4) because the DirEntry object now allows modification, you need a
> stat_result() helper function to help you build your own stat values
>
> I'm afraid points 3 and 4 here add way too much complexity.

I'm okay with dropping 3 and 4, and making the return from onerror being simply True to yield the entry, and False/None
to skip it. That should make implementation much easier, and documentation not too strenuous either.

--
~Ethan~
Ben Hoyt
2014-07-09 19:59:39 UTC
Permalink
>> 1) it has an additional "info" argument, the values of which have to
>> be documented ('os', 'type', 'lstat', and what each one means)
>> 2) it has an additional "onerror" argument, the signature of which and
>> fairly complicated return value is non-obvious and has to be
>> documented
>> 3) it requires user modification of the DirEntry object, which needs
>> documentation, and is potentially hard to implement
>> 4) because the DirEntry object now allows modification, you need a
>> stat_result() helper function to help you build your own stat values
>>
>> I'm afraid points 3 and 4 here add way too much complexity.
>
>
> I'm okay with dropping 3 and 4, and making the return from onerror being
> simply True to yield the entry, and False/None to skip it. That should make
> implementation much easier, and documentation not too strenuous either.

That's definitely better in terms of complexity.

Other python-devers, please chime in with your thoughts or votes.

-Ben
Victor Stinner
2014-07-09 20:24:19 UTC
Permalink
2014-07-09 21:59 GMT+02:00 Ben Hoyt <***@gmail.com>:
> Other python-devers, please chime in with your thoughts or votes.

Sorry, I didn't follow the whole discussion. IMO DirEntry must use
methods and you should not expose nor document which infos are already
provided by the OS or not. DirEntry should be a best-effort black-box
object providing an API similar to pathlib.Path. is_dir() may be fast?
fine, but don't say it in the documentation because Python must remain
portable and you should not write code specific to one specific
platform.

is_dir(), is_file(), is_symlink() and lstat() can fail as any other
Python function, no need to specialize them with custom error handler.
If you care, just use a very standard try/except.

I'm also against ensure_lstat=True or ideas like that fetching all
datas transparently in the generator. The behaviour would be too
different depending on the OS, and usually you don't need all
informations. And it raises errors in the generator, which is
something unusual, and difficult to handle (I don't like the onerror
callback).

Example where you may sometimes need is_dir(), but not always
---
for entry in os.scandir(path):
if ignore_entry(entry.name):
# this entry is not interesting, lstat_result is useless here
continue
if entry.is_dir(): # fetch required data if needed
continue
...
---

I don't understand why you are all focused on handling os.stat() and
os.lstat() errors. See for example the os.walk() function which is an
old function (python 2.6!): it doesn't catch erros on isdir(), even if
it has an onerror parameter... It only handles errors on listdir().
IMO errors on os.stat() and os.lstat() are very rare under very
specific conditions. The most common case is that you can get the
status if you can list files.

Victor
Paul Moore
2014-07-09 20:57:57 UTC
Permalink
On 9 July 2014 21:24, Victor Stinner <***@gmail.com> wrote:
> Example where you may sometimes need is_dir(), but not always
> ---
> for entry in os.scandir(path):
> if ignore_entry(entry.name):
> # this entry is not interesting, lstat_result is useless here
> continue
> if entry.is_dir(): # fetch required data if needed
> continue
> ...
> ---

That is an extremely good point, and articulates why I've always been
a bit uncomfortable with the whole ensure_stat idea.

> I don't understand why you are all focused on handling os.stat() and
> os.lstat() errors. See for example the os.walk() function which is an
> old function (python 2.6!): it doesn't catch erros on isdir(), even if
> it has an onerror parameter... It only handles errors on listdir().
> IMO errors on os.stat() and os.lstat() are very rare under very
> specific conditions. The most common case is that you can get the
> status if you can list files.

Personally, I'm only focused on it as a response to others feeling
it's important. I'm on Windows, where there are no extra stat calls,
so all *I* care about is having an API that deals with the use cases
others are concerned about without making it too hard for me to use it
on Windows where I don't have to worry about all this.

If POSIX users come to a consensus that error handling doesn't need
special treatment, I'm more than happy to go back to the PEP version.
(Much as previously happened with the race condition debate).

Paul
Ethan Furman
2014-07-09 21:28:07 UTC
Permalink
On 07/09/2014 01:57 PM, Paul Moore wrote:
> On 9 July 2014 21:24, Victor Stinner wrote:
>>
>> Example where you may sometimes need is_dir(), but not always
>> ---
>> for entry in os.scandir(path):
>> if ignore_entry(entry.name):
>> # this entry is not interesting, lstat_result is useless here
>> continue
>> if entry.is_dir(): # fetch required data if needed
>> continue
>> ...
>
> That is an extremely good point, and articulates why I've always been
> a bit uncomfortable with the whole ensure_stat idea.

On a system which did not supply is_dir automatically I would write that as:

for entry in os.scandir(path): # info defaults to 'os', which is basically None in this case
if ignore_entry(entry.name):
continue
if os.path.isdir(entry.full_name):
# do something interesting

Not hard to read or understand, no time wasted in unnecessary lstat calls.

--
~Ethan~
Ben Hoyt
2014-07-09 21:33:12 UTC
Permalink
> On a system which did not supply is_dir automatically I would write that as:
>
> for entry in os.scandir(path): # info defaults to 'os', which is
> basically None in this case
> if ignore_entry(entry.name):
> continue
> if os.path.isdir(entry.full_name):
> # do something interesting
>
> Not hard to read or understand, no time wasted in unnecessary lstat calls.

No, but how do you know whether you're on "a system which did not
supply is_dir automatically"? The above is not cross-platform, or at
least, not efficient cross-platform, which defeats the whole point of
scandir -- the above is no better than listdir().

-Ben
Ethan Furman
2014-07-09 22:50:28 UTC
Permalink
On 07/09/2014 02:33 PM, Ben Hoyt wrote:
>>
>> On a system which did not supply is_dir automatically I would write that as:
>>
>> for entry in os.scandir(path):
>> if ignore_entry(entry.name):
>> continue
>> if os.path.isdir(entry.full_name):
>> # do something interesting
>>
>> Not hard to read or understand, no time wasted in unnecessary lstat calls.
>
> No, but how do you know whether you're on "a system which did not
> supply is_dir automatically"? The above is not cross-platform, or at
> least, not efficient cross-platform, which defeats the whole point of
> scandir -- the above is no better than listdir().

Hit a directory with 100,000 entries and you'll change your mind. ;)

Okay, so the issue is you /want/ to write an efficient, cross-platform routine...

hrmmm.....

thinking........

Okay, marry the two ideas together:

scandir(path, info=None, onerror=None)
"""
Return a generator that returns one directory entry at a time in a DirEntry object
info: None --> DirEntries will have whatever attributes the O/S provides
'type' --> DirEntries will already have at least the file/dir distinction
'stat' --> DirEntries will also already have stat information
"""

DirEntry.is_dir()
Return True if this is a directory-type entry; may call os.lstat if the cache is empty.

DirEntry.is_file()
Return True if this is a file-type entry; may call os.lstat if the cache is empty.

DirEntry.is_symlink()
Return True if this is a symbolic link; may call os.lstat if the cache is empty.

DirEntry.stat
Return the stat info for this link; may call os.lstat if the cache is empty.


This way both paradigms are supported.

--
~Ethan~
MRAB
2014-07-09 23:22:21 UTC
Permalink
On 2014-07-09 23:50, Ethan Furman wrote:
> On 07/09/2014 02:33 PM, Ben Hoyt wrote:
>>>
>>> On a system which did not supply is_dir automatically I would write that as:
>>>
>>> for entry in os.scandir(path):
>>> if ignore_entry(entry.name):
>>> continue
>>> if os.path.isdir(entry.full_name):
>>> # do something interesting
>>>
>>> Not hard to read or understand, no time wasted in unnecessary lstat calls.
>>
>> No, but how do you know whether you're on "a system which did not
>> supply is_dir automatically"? The above is not cross-platform, or at
>> least, not efficient cross-platform, which defeats the whole point of
>> scandir -- the above is no better than listdir().
>
> Hit a directory with 100,000 entries and you'll change your mind. ;)
>
> Okay, so the issue is you /want/ to write an efficient, cross-platform routine...
>
> hrmmm.....
>
> thinking........
>
> Okay, marry the two ideas together:
>
> scandir(path, info=None, onerror=None)
> """
> Return a generator that returns one directory entry at a time in a DirEntry object

Should that be "that yields one directory entry at a time"?

> info: None --> DirEntries will have whatever attributes the O/S provides
> 'type' --> DirEntries will already have at least the file/dir distinction
> 'stat' --> DirEntries will also already have stat information
> """
>
> DirEntry.is_dir()
> Return True if this is a directory-type entry; may call os.lstat if the cache is empty.
>
> DirEntry.is_file()
> Return True if this is a file-type entry; may call os.lstat if the cache is empty.
>
> DirEntry.is_symlink()
> Return True if this is a symbolic link; may call os.lstat if the cache is empty.
>
> DirEntry.stat
> Return the stat info for this link; may call os.lstat if the cache is empty.
>
Why is "is_dir", et al, functions, but "stat" not a function?

>
> This way both paradigms are supported.
>
Ethan Furman
2014-07-09 23:26:01 UTC
Permalink
On 07/09/2014 04:22 PM, MRAB wrote:
> On 2014-07-09 23:50, Ethan Furman wrote:
>>
>> Okay, marry the two ideas together:
>>
>> scandir(path, info=None, onerror=None)
>> """
>> Return a generator that returns one directory entry at a time in a DirEntry object
>
> Should that be "that yields one directory entry at a time"?

Yes, thanks.

>> info: None --> DirEntries will have whatever attributes the O/S provides
>> 'type' --> DirEntries will already have at least the file/dir distinction
>> 'stat' --> DirEntries will also already have stat information
>> """
>>
>> DirEntry.is_dir()
>> Return True if this is a directory-type entry; may call os.lstat if the cache is empty.
>>
>> DirEntry.is_file()
>> Return True if this is a file-type entry; may call os.lstat if the cache is empty.
>>
>> DirEntry.is_symlink()
>> Return True if this is a symbolic link; may call os.lstat if the cache is empty.
>>
>> DirEntry.stat
>> Return the stat info for this link; may call os.lstat if the cache is empty.
>
> Why is "is_dir", et al, functions, but "stat" not a function?

Good point. Make stat a function as well.

--
~Ethan~
Ethan Furman
2014-07-09 20:44:12 UTC
Permalink
On 07/09/2014 01:24 PM, Victor Stinner wrote:
>
> Sorry, I didn't follow the whole discussion. IMO DirEntry must use
> methods and you should not expose nor document which infos are already
> provided by the OS or not. DirEntry should be a best-effort black-box
> object providing an API similar to pathlib.Path. is_dir() may be fast?
> fine, but don't say it in the documentation because Python must remain
> portable and you should not write code specific to one specific
> platform.

Okay, so using that logic we should head over to the os module and remove:

ctermid, getenv, getegid, geteuid, getgid, getgrouplist, getgroups, getpgid, getpgrp, getpriority, PRIO_PROCESS,
PRIO_PGRP, PRIO_USER, getresuid, getresgid, getuid, initgroups, putenv, setegid, seteuid, setgid, setgroups,
setpriority, setregid, setrusgid, setresuid, setreuid, getsid, setsid, setuid, unsetenv, fchmod, fchown, fdatasync,
fpathconf, fstatvfs, ftruncate, lockf, F_LOCK, F_TLOCK, F_ULOCK, F_TEST, O_DSYNC, O_RSYNC, O_SYNC, O_NDELAY, O_NONBLOCK,
O_NOCTTY, O_SHLOCK, O_EXLOCK, O_CLOEXEC, O_BINARY, O_NOINHERIT, O_SHORT_LIVED, O_TEMPORARY, O_RANDOM, O_SEQUENTIAL,
O_TEXT, ...

Okay, I'm tired of typing, but that list is not even half-way through the os page, and those are all methods or
attributes that are not available on either Windows or Unix or some flavors of Unix.

Oh, and all those upper-case attributes? Yup, documented. And when we don't document it ourselves we often refer
readers to their system documentation because Python does not, in fact, return exactly the same results on all platforms
-- particularly when calling into the OS.

--
~Ethan~
Ben Hoyt
2014-07-09 21:42:07 UTC
Permalink
I really don't understand why you *want* a worse, much less cross-platform API?

> Okay, so using that logic we should head over to the os module and remove:
>
> ctermid, getenv, getegid...
>
> Okay, I'm tired of typing, but that list is not even half-way through the os
> page, and those are all methods or attributes that are not available on
> either Windows or Unix or some flavors of Unix.

True, is this really the precedent we want to *aim for*. listdir() is
cross-platform, and it's relatively easy to make scandir()
cross-platform, so why not?

> Oh, and all those upper-case attributes? Yup, documented. And when we
> don't document it ourselves we often refer readers to their system
> documentation because Python does not, in fact, return exactly the same
> results on all platforms -- particularly when calling into the OS.

But again, why a worse, less cross-platform API when a simple,
cross-platform one is a method call away?

-Ben
Ethan Furman
2014-07-09 22:12:18 UTC
Permalink
On 07/09/2014 02:42 PM, Ben Hoyt wrote:
>>
>> Okay, so using that [no platform specific] logic we should head over to the os module and remove:
>>
>> ctermid, getenv, getegid...
>>
>> Okay, I'm tired of typing, but that list is not even half-way through the os
>> page, and those are all methods or attributes that are not available on
>> either Windows or Unix or some flavors of Unix.
>
> True, is this really the precedent we want to *aim for*. listdir() is
> cross-platform,

and listdir has serious performance issues, which is why you developed scandir.

>> Oh, and all those [snipped] upper-case attributes? Yup, documented. And when we
>> don't document it ourselves we often refer readers to their system
>> documentation because Python does not, in fact, return exactly the same
>> results on all platforms -- particularly when calling into the OS.
>
> But again, why a worse, less cross-platform API when a simple,
> cross-platform one is a method call away?

For the same reason we don't use code that makes threaded behavior better, but kills the single thread application.

If the programmer would rather have consistency on all platforms rather than performance on the one being used,
`info='lstat'` is the option to use.

I like the 'onerror' API better primarily because it gives a single point to deal with the errors. This has at least a
couple advantages:

- less duplication of code: in the tree_size example, the error
handling is duplicated twice

- readablity: with the error handling in a separate routine, one
does not have to jump around the try/except blocks looking for
what happens if there are no errors

--
~Ethan~
Nick Coghlan
2014-07-10 04:02:01 UTC
Permalink
On 9 Jul 2014 17:14, "Ethan Furman" <***@stoneleaf.us> wrote:
>
> On 07/09/2014 02:42 PM, Ben Hoyt wrote:
>>>
>>>
>>> Okay, so using that [no platform specific] logic we should head over to
the os module and remove:
>>>
>>>
>>> ctermid, getenv, getegid...
>>>
>>> Okay, I'm tired of typing, but that list is not even half-way through
the os
>>> page, and those are all methods or attributes that are not available on
>>> either Windows or Unix or some flavors of Unix.
>>
>>
>> True, is this really the precedent we want to *aim for*. listdir() is
>> cross-platform,
>
>
> and listdir has serious performance issues, which is why you developed
scandir.
>
>>> Oh, and all those [snipped] upper-case attributes? Yup, documented.
And when we
>>>
>>> don't document it ourselves we often refer readers to their system
>>> documentation because Python does not, in fact, return exactly the same
>>> results on all platforms -- particularly when calling into the OS.
>>
>>
>> But again, why a worse, less cross-platform API when a simple,
>> cross-platform one is a method call away?
>
>
> For the same reason we don't use code that makes threaded behavior
better, but kills the single thread application.
>
> If the programmer would rather have consistency on all platforms rather
than performance on the one being used, `info='lstat'` is the option to use.
>
> I like the 'onerror' API better primarily because it gives a single point
to deal with the errors. This has at least a couple advantages:
>
> - less duplication of code: in the tree_size example, the error
> handling is duplicated twice
>
> - readablity: with the error handling in a separate routine, one
> does not have to jump around the try/except blocks looking for
> what happens if there are no errors

The "onerror" approach can also deal with readdir failing, which the PEP
currently glosses over.

I'm somewhat inclined towards the current approach in the PEP, but I'd like
to see an explanation of two aspects:

1. How a scandir variant with an 'onerror' option could be implemented
given the version in the PEP

2. How the existing scandir module handles the 'onerror' parameter to its
directory walking function

Regards,
Nick.

>
> --
> ~Ethan~
>
> _______________________________________________
> Python-Dev mailing list
> Python-***@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
Ethan Furman
2014-07-11 03:26:05 UTC
Permalink
On 07/09/2014 09:02 PM, Nick Coghlan wrote:
> On 9 Jul 2014 17:14, "Ethan Furman" wrote:
>>
>> I like the 'onerror' API better primarily because it gives a single
>> point to deal with the errors. [...]
>
> The "onerror" approach can also deal with readdir failing, which the
> PEP currently glosses over.

Do we want this, though? I can see an error handler for individual entries, but if one of the *dir commands fails that
would seem to be fairly catastrophic.

> I'm somewhat inclined towards the current approach in the PEP, but I'd like to see an explanation of two aspects:
>
> 1. How a scandir variant with an 'onerror' option could be implemented given the version in the PEP

Here's a stab at it:

def scandir_error(path, info=None, onerror=None):
for entry in scandir(path):
if info == 'type':
try:
entry.is_dir()
except OSError as exc:
if onerror is None:
raise
if not onerror(exc, entry):
continue
elif info == 'lstat':
try:
entry.lstat()
except OSError as exc:
if onerror is None:
raise
if not onerror(exc, entry):
continue
yield entry

Here it is again with an attempt to deal with opendir/readdir/closedir exceptions:

def scandir_error(path, info=None, onerror=None):
entries = scandir(path)
try:
entry = next(entries)
except StopIteration:
# pass it through
raise
except Exception as exc:
if onerror is None:
raise
if not onerror(exc, 'what else here?'):
# what do we do on False?
# what do we do on True?
else:
for entry in scandir(path):
if info == 'type':
try:
entry.is_dir()
except OSError as exc:
if onerror is None:
raise
if not onerror(exc, entry):
continue
elif info == 'lstat':
try:
entry.lstat()
except OSError as exc:
if onerror is None:
raise
if not onerror(exc, entry):
continue
yield entry


> 2. How the existing scandir module handles the 'onerror' parameter to its directory walking function

Here's the first third of it from the repo:

def walk(top, topdown=True, onerror=None, followlinks=False):
"""Like os.walk(), but faster, as it uses scandir() internally."""
# Determine which are files and which are directories
dirs = []
nondirs = []
try:
for entry in scandir(top):
if entry.is_dir():
dirs.append(entry)
else:
nondirs.append(entry)
except OSError as error:
if onerror is not None:
onerror(error)
return
...

--
~Ethan~
Ben Hoyt
2014-07-11 11:12:59 UTC
Permalink
[replying to python-dev this time]

>> The "onerror" approach can also deal with readdir failing, which the
>> PEP currently glosses over.
>
>
> Do we want this, though? I can see an error handler for individual entries,
> but if one of the *dir commands fails that would seem to be fairly
> catastrophic.

Very much agreed that this isn't necessary for just readdir/FindNext
errors. We've never had this level of detail before -- if listdir()
fails half way through (very unlikely) it just bombs with OSError and
you get no entries at all.

If you really really want this (again very unlikely), you can always
use call next() directly and catch OSError around that call.

-Ben
Nick Coghlan
2014-07-12 15:07:03 UTC
Permalink
On 11 Jul 2014 12:46, "Ben Hoyt" <***@gmail.com> wrote:
>
> [replying to python-dev this time]
>
> >> The "onerror" approach can also deal with readdir failing, which the
> >> PEP currently glosses over.
> >
> >
> > Do we want this, though? I can see an error handler for individual
entries,
> > but if one of the *dir commands fails that would seem to be fairly
> > catastrophic.
>
> Very much agreed that this isn't necessary for just readdir/FindNext
> errors. We've never had this level of detail before -- if listdir()
> fails half way through (very unlikely) it just bombs with OSError and
> you get no entries at all.
>
> If you really really want this (again very unlikely), you can always
> use call next() directly and catch OSError around that call.

Agreed - I think the PEP should point this out explicitly, and show that
the approach it takes offers a lot of flexibility in error handling from
"just let it fail", to a single try/catch around the whole loop, to
try/catch just around the operations that might call lstat(), to try/catch
around the individual iteration steps.

os.walk remains the higher level API that most code should be using, and
that has to retain the current listdir based behaviour (any error = ignore
all entries in that directory) for backwards compatibility reasons.

Cheers,
Nick.

>
> -Ben
> _______________________________________________
> Python-Dev mailing list
> Python-***@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
Ben Hoyt
2014-07-14 00:12:16 UTC
Permalink
>> Very much agreed that this isn't necessary for just readdir/FindNext
>> errors. We've never had this level of detail before -- if listdir()
>> fails half way through (very unlikely) it just bombs with OSError and
>> you get no entries at all.
>>
>> If you really really want this (again very unlikely), you can always
>> use call next() directly and catch OSError around that call.
>
> Agreed - I think the PEP should point this out explicitly, and show that the
> approach it takes offers a lot of flexibility in error handling from "just
> let it fail", to a single try/catch around the whole loop, to try/catch just
> around the operations that might call lstat(), to try/catch around the
> individual iteration steps.

Good point. It'd be good to mention this explicitly in the PEP and
have another example or two of the different levels of errors
handling.

> os.walk remains the higher level API that most code should be using, and
> that has to retain the current listdir based behaviour (any error = ignore
> all entries in that directory) for backwards compatibility reasons.

Yes, definitely.

-Ben
Victor Stinner
2014-07-09 21:38:26 UTC
Permalink
2014-07-09 22:44 GMT+02:00 Ethan Furman <***@stoneleaf.us>:
> On 07/09/2014 01:24 PM, Victor Stinner wrote:
>> Sorry, I didn't follow the whole discussion. IMO DirEntry must use
>> methods and you should not expose nor document which infos are already
>> provided by the OS or not. DirEntry should be a best-effort black-box
>> object providing an API similar to pathlib.Path. is_dir() may be fast?
>> fine, but don't say it in the documentation because Python must remain
>> portable and you should not write code specific to one specific
>> platform.
>
>
> Okay, so using that logic we should head over to the os module and remove: (...)

My comment was specific to the PEP 471, design of the DirEntry class.

Victor
Ethan Furman
2014-07-09 22:15:49 UTC
Permalink
On 07/09/2014 02:38 PM, Victor Stinner wrote:
> 2014-07-09 22:44 GMT+02:00 Ethan Furman:
>> On 07/09/2014 01:24 PM, Victor Stinner wrote:
>>>
>>> [...] Python must remain
>>> portable and you should not write code specific to one specific
>>> platform.
>>
>>
>> Okay, so using that logic we should head over to the os module and remove: (...)
>
> My comment was specific to the PEP 471, design of the DirEntry class.

And my comment was to the point of there being methods/attributes/return values that /do/ vary by platform, and /are/
documented as such. Even stat itself is not the same on Windows as posix.

--
~Ethan~
Nikolaus Rath
2014-07-10 00:25:54 UTC
Permalink
Ben Hoyt <***@gmail.com> writes:
> So here's the ways in which option #2 is now more complicated than option #1:
>
> 1) it has an additional "info" argument, the values of which have to
> be documented ('os', 'type', 'lstat', and what each one means)
> 2) it has an additional "onerror" argument, the signature of which and
> fairly complicated return value is non-obvious and has to be
> documented
> 3) it requires user modification of the DirEntry object, which needs
> documentation, and is potentially hard to implement
> 4) because the DirEntry object now allows modification, you need a
> stat_result() helper function to help you build your own stat values
>
> I'm afraid points 3 and 4 here add way too much complexity.

Points 3 and 4 are not required to go with option #2, option #2 merely
allows to implement points 3 and 4 at some point in the future if it
turns out to be desirable.


Best,
-Nikolaus

--
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«
Paul Moore
2014-07-09 17:10:29 UTC
Permalink
On 9 July 2014 14:22, Ben Hoyt <***@gmail.com> wrote:
> One issue with option #2 that I just realized -- does scandir yield
> the entry at all if there's a stat error? It can't really, because the
> caller will except the .lstat attribute to be set (assuming he asked
> for type='lstat') but it won't be. Is effectively removing these
> entries just because the stat failed a problem? I kind of think it is.
> If so, is there a way to solve it with option #2?

So the issue is that you need to do a stat but it failed. You have
"whatever the OS gave you", but can't get anything more. This is only
an issue on POSIX, where the original OS call doesn't give you
everything, so it's fine, those POSIX people can just learn to cope
with their broken OS, right? :-)

More seriously, why not just return a DirEntry that says it's a file
with a stat entry that's all zeroes? That seems pretty harmless. And
the onerror function will be called, so if it is inappropriate the
application can do something. Maybe it's worth letting onerror return
a boolean that says whether to skip the entry, but that's as far as
I'd bother going.

It's a close call, but I think option #2 still wins (just) for me.

Paul
Ethan Furman
2014-07-09 17:38:38 UTC
Permalink
On 07/09/2014 10:10 AM, Paul Moore wrote:
> On 9 July 2014 14:22, Ben Hoyt <***@gmail.com> wrote:
>> One issue with option #2 that I just realized -- does scandir yield
>> the entry at all if there's a stat error? It can't really, because the
>> caller will except the .lstat attribute to be set (assuming he asked
>> for type='lstat') but it won't be. Is effectively removing these
>> entries just because the stat failed a problem? I kind of think it is.
>> If so, is there a way to solve it with option #2?
>
> So the issue is that you need to do a stat but it failed. You have
> "whatever the OS gave you", but can't get anything more. This is only
> an issue on POSIX, where the original OS call doesn't give you
> everything, so it's fine, those POSIX people can just learn to cope
> with their broken OS, right? :-)

LOL


> More seriously, why not just return a DirEntry that says it's a file
> with a stat entry that's all zeroes? That seems pretty harmless. And
> the onerror function will be called, so if it is inappropriate the
> application can do something. Maybe it's worth letting onerror return
> a boolean that says whether to skip the entry, but that's as far as
> I'd bother going.

I could live with this -- we could enhance it the future fairly easily if we needed to.

--
~Ethan~
Ethan Furman
2014-07-09 13:17:40 UTC
Permalink
On 07/09/2014 05:48 AM, Ben Hoyt wrote:
>
> So how about tweaking option #2 a tiny bit more to this:
>
> def scandir(path='.', info=None, onerror=None): ...
>
> * if info is None (the default), only the .name and .full_name
> attributes are present
> * if info is 'type', scandir ensures the is_dir/is_file/is_symlink
> attributes are present and either True or False
> * if info is 'lstat', scandir additionally ensures a .lstat is present
> and is a full stat_result object
> * if info is 'os', scandir returns the attributes the OS provides
> (everything on Windows, only is_X -- most of the time -- on POSIX)

I would rather have the default for info be 'os': cross-platform is good, but there is no reason to force it on some
poor script that is meant to run on a local machine and will never leave it.


> * if onerror is not None and errors occur during any internal lstat()
> call, onerror(exc) is called with the OSError exception object

As Paul mentioned, 'onerror(exc, DirEntry)' would be better.


> Further point -- because the is_dir/is_file/is_symlink attributes are
> booleans, it would be very bad for them to be present but None if you
> didn't ask for (or the OS didn't return) the type information. Because
> then "if entry.is_dir:" would be None and your code would think it
> wasn't a directory, when actually you don't know. For this reason, all
> attributes should fail with AttributeError if not fetched.

Fair point, and agreed.

--
~Ethan~
Paul Moore
2014-07-09 07:13:10 UTC
Permalink
On 9 July 2014 02:08, Ben Hoyt <***@gmail.com> wrote:
> Comments and votes, please!

+1 on option 1 (current PEP approach) at the moment, but I would like
to see how the error handling would look (suppose the function logs
files that can't be statted, and assumes a size of 0 for them). The
idea of a multi-level ensure_lstat isn't unreasonable, either, and
that helps option 2.

The biggest issue *I* see with option 2 is that people won't remember
to add the ensure_XXX argument, and that will result in more code that
seems to work but fails cross-platform. Unless scandir deliberately
fails if you use an attribute that you haven't "ensured", but that
would be really unfriendly...

Paul
Walter Dörwald
2014-07-09 14:41:44 UTC
Permalink
On 8 Jul 2014, at 15:52, Ben Hoyt wrote:

> Hi folks,
>
> After some very good python-dev feedback on my first version of PEP
> 471, I've updated the PEP to clarify a few things and added various
> "Rejected ideas" subsections. Here's a link to the new version (I've
> also copied the full text below):
>
> http://legacy.python.org/dev/peps/pep-0471/ -- new PEP as HTML
> http://hg.python.org/peps/rev/0da4736c27e8 -- changes
>
> [...]
> Rejected ideas
> ==============
>
> [...]
> Return values being pathlib.Path objects
> ----------------------------------------
>
> With Antoine Pitrou's new standard library ``pathlib`` module, it
> at first seems like a great idea for ``scandir()`` to return instances
> of ``pathlib.Path``. However, ``pathlib.Path``'s ``is_X()`` and
> ``lstat()`` functions are explicitly not cached, whereas ``scandir``
> has to cache them by design, because it's (often) returning values
> from the original directory iteration system call.
>
> And if the ``pathlib.Path`` instances returned by ``scandir`` cached
> lstat values, but the ordinary ``pathlib.Path`` objects explicitly
> don't, that would be more than a little confusing.
>
> Guido van Rossum explicitly rejected ``pathlib.Path`` caching lstat in
> the context of scandir `here
> <https://mail.python.org/pipermail/python-dev/2013-November/130583.html>`_,
> making ``pathlib.Path`` objects a bad choice for scandir return
> values.

Can we at least make sure that attributes of DirEntry that have the same
meaning as attributes of pathlib.Path have the same name?

> [...]

Servus,
Walter
Akira Li
2014-07-10 02:28:09 UTC
Permalink
Ben Hoyt <***@gmail.com> writes:
...
> ``scandir()`` yields a ``DirEntry`` object for each file and directory
> in ``path``. Just like ``listdir``, the ``'.'`` and ``'..'``
> pseudo-directories are skipped, and the entries are yielded in
> system-dependent order. Each ``DirEntry`` object has the following
> attributes and methods:
>
> * ``name``: the entry's filename, relative to the ``path`` argument
> (corresponds to the return values of ``os.listdir``)
>
> * ``full_name``: the entry's full path name -- the equivalent of
> ``os.path.join(path, entry.name)``

I suggest renaming .full_name -> .path

.full_name might be misleading e.g., it implies that .full_name ==
abspath(.full_name) that might be false. The .path name has no such
associations.

The semantics of the the .path attribute is defined by these assertions::

for entry in os.scandir(topdir):
#NOTE: assume os.path.normpath(topdir) is not called to create .path
assert entry.path == os.path.join(topdir, entry.name)
assert entry.name == os.path.basename(entry.path)
assert entry.name == os.path.relpath(entry.path, start=topdir)
assert os.path.dirname(entry.path) == topdir
assert (entry.path != os.path.abspath(entry.path) or
os.path.isabs(topdir)) # it is absolute only if topdir is
assert (entry.path != os.path.realpath(entry.path) or
topdir == os.path.realpath(topdir)) # symlinks are not resolved
assert (entry.path != os.path.normcase(entry.path) or
topdir == os.path.normcase(topdir)) # no case-folding,
# unlike PureWindowsPath


...
> * ``is_dir()``: like ``os.path.isdir()``, but much cheaper -- it never
> requires a system call on Windows, and usually doesn't on POSIX
> systems

I suggest documenting the implicit follow_symlinks parameter for .is_X methods.

Note: lstat == partial(stat, follow_symlinks=False).

In particular, .is_dir() should probably use follow_symlinks=True by
default as suggested by Victor Stinner *if .is_dir() does it on Windows*

MSDN says: GetFileAttributes() does not follow symlinks.

os.path.isdir docs imply follow_symlinks=True: "both islink() and
isdir() can be true for the same path."


...
> Like the other functions in the ``os`` module, ``scandir()`` accepts
> either a bytes or str object for the ``path`` parameter, and returns
> the ``DirEntry.name`` and ``DirEntry.full_name`` attributes with the
> same type as ``path``. However, it is *strongly recommended* to use
> the str type, as this ensures cross-platform support for Unicode
> filenames.

Document when {e.name for e in os.scandir(path)} != set(os.listdir(path))
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

e.g., path can be an open file descriptor in os.listdir(path) since
Python 3.3 but the PEP doesn't mention it explicitly.

It has been discussed already e.g.,
https://mail.python.org/pipermail/python-dev/2014-July/135296.html

PEP 471 should explicitly reject the support for specifying a file
descriptor so that a code that uses os.scandir may assume that
entry.path (.full_name) attribute is always present (no exceptions due
to a failure to read /proc/self/fd/NNN or an error while calling
fcntl(F_GETPATH) or GetFileInformationByHandleEx() -- see
http://stackoverflow.com/q/1188757 ).

Reject explicitly in PEP 471 the support for dir_fd parameter
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

aka the support for paths relative to directory descriptors.

Note: it is a *different* (but related) issue.


...
> Notes on exception handling
> ---------------------------
>
> ``DirEntry.is_X()`` and ``DirEntry.lstat()`` are explicitly methods
> rather than attributes or properties, to make it clear that they may
> not be cheap operations, and they may do a system call. As a result,
> these methods may raise ``OSError``.
>
> For example, ``DirEntry.lstat()`` will always make a system call on
> POSIX-based systems, and the ``DirEntry.is_X()`` methods will make a
> ``stat()`` system call on such systems if ``readdir()`` returns a
> ``d_type`` with a value of ``DT_UNKNOWN``, which can occur under
> certain conditions or on certain file systems.
>
> For this reason, when a user requires fine-grained error handling,
> it's good to catch ``OSError`` around these method calls and then
> handle as appropriate.
>

I suggest documenting that next(os.scandir()) may raise OSError

e.g., on POSIX it may happen due to an OS error in opendir/readdir/closedir

Also, document whether os.scandir() itself may raise OSError (whether
opendir or other OS functions may be called before the first yield).


...
os.scandir() should allow the explicit cleanup
++++++++++++++++++++++++++++++++++++++++++++++

::
with closing(os.scandir()) as entries:
for _ in entries:
break

entries.close() is called that frees the resources if necessary, to
*avoid relying on garbage-collection for managing file descriptors*
(check whether it is consistent with the .close() method from the
generator protocol e.g., it might be already called on the exit from the
loop whether an exception happens or not without requiring the
with-statement (I don't know)). *It should be possible to limit the
resource life-time on non-refcounting Python implementations.*

os.scandir() object may support the context manager protocol explicitly::

with os.scandir() as entries:
for _ in entries:
break

``.__exit__`` method may just call ``.close`` method.


...
> Rejected ideas
> ==============
>
>
> Naming
> ------
>
> The only other real contender for this function's name was
> ``iterdir()``. However, ``iterX()`` functions in Python (mostly found
> in Python 2) tend to be simple iterator equivalents of their
> non-iterator counterparts. For example, ``dict.iterkeys()`` is just an
> iterator version of ``dict.keys()``, but the objects returned are
> identical. In ``scandir()``'s case, however, the return values are
> quite different objects (``DirEntry`` objects vs filename strings), so
> this should probably be reflected by a difference in name -- hence
> ``scandir()``.
>
> See some `relevant discussion on python-dev
> <https://mail.python.org/pipermail/python-dev/2014-June/135228.html>`_.
>

- os.scandir() name is inconsistent with the pathlib module.
pathlib.Path has `.iterdir() method
<https://docs.python.org/3/library/pathlib.html#pathlib.Path.iterdir>`_
that generates Path instances i.e., the argument that iterdir()
should return strings is not valid

- os.scandir() name conflicts with POSIX. POSIX already has `scandir()
function
<http://pubs.opengroup.org/onlinepubs/9699919799/functions/scandir.html>`_
Most functions in the os module are thin-wrappers of their
corresponding POSIX analogs

In principle, POSIX scandir(path, &entries, sel, compar) is emulated
using::

entries = sorted(filter(sel, os.scandir(path)),
key=cmp_to_key(compar))

so that the above code snippet could be provided in the docs. We may
say that os.scandir is a pythonic analog of the POSIX function and
therefore there is no conflict even if os.scandir doesn't use POSIX
scandir function in its implementation. If we can't say it then a
*different name/module should be used to allow adding POSIX-compatible
os.scandir() in the future*.


--
Akira
Loading...