Discussion:
PEP 471 -- os.scandir() function -- a better and faster directory iterator
Ben Hoyt
2014-06-26 22:59:45 UTC
Permalink
Hi Python dev folks,

I've written a PEP proposing a specific os.scandir() API for a
directory iterator that returns the stat-like info from the OS, the
main advantage of which is to speed up os.walk() and similar
operations between 4-20x, depending on your OS and file system. Full
details, background info, and context links are in the PEP, which
Victor Stinner has uploaded at the following URL, and I've also copied
inline below.

http://legacy.python.org/dev/peps/pep-0471/

Would love feedback on the PEP, but also of course on the proposal itself.

-Ben


PEP: 471
Title: os.scandir() function -- a better and faster directory iterator
Version: $Revision$
Last-Modified: $Date$
Author: Ben Hoyt <***@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 30-May-2014
Python-Version: 3.5


Abstract
========

This PEP proposes including a new directory iteration function,
``os.scandir()``, in the standard library. This new function adds
useful functionality and increases the speed of ``os.walk()`` by 2-10
times (depending on the platform and file system) by significantly
reducing the number of times ``stat()`` needs to be called.


Rationale
=========

Python's built-in ``os.walk()`` is significantly slower than it needs
to be, because -- in addition to calling ``os.listdir()`` on each
directory -- it executes the system call ``os.stat()`` or
``GetFileAttributes()`` on each file to determine whether the entry is
a directory or not.

But the underlying system calls -- ``FindFirstFile`` /
``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
already tell you whether the files returned are directories or not, so
no further system calls are needed. In short, you can reduce the
number of system calls from approximately 2N to N, where N is the
total number of files and directories in the tree. (And because
directory trees are usually much wider than they are deep, it's often
much better than this.)
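
The difference between the two call patterns can be sketched as follows. This is only an illustration, and it assumes a Python in which ``os.scandir()`` is available (3.5 or later); the helper names ``subdirs_listdir`` and ``subdirs_scandir`` are made up for this example.

```python
import os


def subdirs_listdir(path):
    """Old pattern: one listdir() call plus one stat() per entry."""
    return sorted(name for name in os.listdir(path)
                  if os.path.isdir(os.path.join(path, name)))


def subdirs_scandir(path):
    """scandir pattern: the is-it-a-directory answer usually comes from
    the directory listing itself, so no per-entry stat() is needed."""
    return sorted(entry.name for entry in os.scandir(path)
                  if entry.is_dir())
```

Both helpers return the same names; only the number of system calls behind them differs.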

In practice, removing all those extra system calls makes ``os.walk()``
about **8-9 times as fast on Windows**, and about **2-3 times as fast
on Linux and Mac OS X**. So we're not talking about micro-
optimizations. See more `benchmarks`_.

.. _`benchmarks`: https://github.com/benhoyt/scandir#benchmarks

Somewhat relatedly, many people (see Python `Issue 11406`_) are also
keen on a version of ``os.listdir()`` that yields filenames as it
iterates instead of returning them as one big list. This improves
memory efficiency for iterating very large directories.

So as well as providing a ``scandir()`` iterator function for calling
directly, Python's existing ``os.walk()`` function could be sped up a
huge amount.

.. _`Issue 11406`: http://bugs.python.org/issue11406


Implementation
==============

The implementation of this proposal was written by Ben Hoyt (initial
version) and Tim Golden (who helped a lot with the C extension
module). It lives on GitHub at `benhoyt/scandir`_.

.. _`benhoyt/scandir`: https://github.com/benhoyt/scandir

Note that this module has been used and tested (see "Use in the wild"
section in this PEP), so it's more than a proof-of-concept. However,
it is marked as beta software and is not extensively battle-tested.
It will need some cleanup and more thorough testing before going into
the standard library, as well as integration into ``posixmodule.c``.



Specifics of proposal
=====================

Specifically, this PEP proposes adding a single function to the ``os``
module in the standard library, ``scandir``, that takes a single,
optional string as its argument::

    scandir(path='.') -> generator of DirEntry objects

Like ``listdir``, ``scandir`` calls the operating system's directory
iteration system calls to get the names of the files in the ``path``
directory, but it's different from ``listdir`` in two ways:

* Instead of bare filename strings, it returns lightweight
  ``DirEntry`` objects that hold the filename string and provide
  simple methods that allow access to the stat-like data the operating
  system returned.

* It returns a generator instead of a list, so that ``scandir`` acts
  as a true iterator instead of returning the full list immediately.

``scandir()`` yields a ``DirEntry`` object for each file and directory
in ``path``. Just like ``listdir``, the ``'.'`` and ``'..'``
pseudo-directories are skipped, and the entries are yielded in
system-dependent order. Each ``DirEntry`` object has the following
attributes and methods:

* ``name``: the entry's filename, relative to ``path`` (corresponds to
  the return values of ``os.listdir``)

* ``is_dir()``: like ``os.path.isdir()``, but requires no system calls
  on most systems (Linux, Windows, OS X)

* ``is_file()``: like ``os.path.isfile()``, but requires no system
  calls on most systems (Linux, Windows, OS X)

* ``is_symlink()``: like ``os.path.islink()``, but requires no system
  calls on most systems (Linux, Windows, OS X)

* ``lstat()``: like ``os.lstat()``, but requires no system calls on
  Windows

The ``DirEntry`` attribute and method names were chosen to be the same
as those in the new ``pathlib`` module for consistency.
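
As a rough sketch of how these attributes and methods might be used together, the following classifies each entry in a directory. It assumes a Python in which ``os.scandir()`` is available (3.5 or later); ``describe`` is a hypothetical helper, not part of the proposal.

```python
import os


def describe(path='.'):
    """Return sorted (name, kind) pairs for the entries in path."""
    result = []
    for entry in os.scandir(path):
        # Check for symlinks first, because is_dir() and is_file()
        # follow symlinks by default in the released API.
        if entry.is_symlink():
            kind = 'symlink'
        elif entry.is_dir():
            kind = 'dir'
        elif entry.is_file():
            kind = 'file'
        else:
            kind = 'other'
        result.append((entry.name, kind))
    return sorted(result)
```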


Notes on caching
----------------

The ``DirEntry`` objects are relatively dumb -- the ``name`` attribute
is obviously always cached, and the ``is_X`` and ``lstat`` methods
cache their values (immediately on Windows via ``FindNextFile``, and
on first use on Linux / OS X via a ``stat`` call) and never refetch
from the system.

For this reason, ``DirEntry`` objects are intended to be used and
thrown away after iteration, not stored in long-lived data structures
with their methods called again and again.

If a user wants to do that (for example, for watching a file's size
change), they'll need to call the regular ``os.lstat()`` or
``os.path.getsize()`` functions which force a new system call each
time.
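
The caching behaviour can be illustrated with a small sketch. It assumes the API as it was eventually released in Python 3.5, where the ``lstat()`` method described above is spelled ``stat(follow_symlinks=False)``; the function name and its parameters are invented for this example.

```python
import os


def cached_and_fresh_sizes(directory, filename, extra=b"more"):
    """Return (size_before, size_from_entry_after_write, size_from_lstat)."""
    # Exhaust the iterator so the directory handle is released promptly.
    entries = {entry.name: entry for entry in os.scandir(directory)}
    entry = entries[filename]
    full_path = os.path.join(directory, filename)

    before = entry.stat(follow_symlinks=False).st_size  # fetched, then cached
    with open(full_path, "ab") as f:
        f.write(extra)                                  # the file grows
    cached = entry.stat(follow_symlinks=False).st_size  # served from the cache
    fresh = os.lstat(full_path).st_size                 # new system call
    return before, cached, fresh
```

The second value stays at the old size because the entry never refetches; only the explicit ``os.lstat()`` call sees the appended bytes.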


Examples
========

Here's a good usage pattern for ``scandir``. This is in fact almost
exactly how the scandir module's faster ``os.walk()`` implementation
uses it::

    dirs = []
    non_dirs = []
    for entry in scandir(path):
        if entry.is_dir():
            dirs.append(entry)
        else:
            non_dirs.append(entry)

The above ``os.walk()``-like code will be significantly faster using
``scandir`` on both Windows and Linux or OS X.
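
To show how that pattern extends to a whole tree, here is a deliberately simplified top-down walk. It is a sketch only, not the scandir module's actual implementation: it assumes ``os.scandir()`` from Python 3.5 or later, treats symlinks to directories as non-directories, and omits the error handling and ``topdown`` / ``followlinks`` options of the real ``os.walk()``. The name ``simple_walk`` is hypothetical.

```python
import os


def simple_walk(top):
    """Yield (dirpath, dirnames, filenames) like a top-down os.walk()."""
    dirs = []
    non_dirs = []
    for entry in os.scandir(top):
        # Simplification: symlinks are never treated as directories.
        if entry.is_dir(follow_symlinks=False):
            dirs.append(entry.name)
        else:
            non_dirs.append(entry.name)
    yield top, dirs, non_dirs
    for name in dirs:
        # Recurse into each real subdirectory.
        yield from simple_walk(os.path.join(top, name))
```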

Or, for getting the total size of files in a directory tree -- showing
use of the ``DirEntry.lstat()`` method::

    def get_tree_size(path):
        """Return total size of files in path and subdirs."""
        size = 0
        for entry in scandir(path):
            if entry.is_dir():
                sub_path = os.path.join(path, entry.name)
                size += get_tree_size(sub_path)
            else:
                size += entry.lstat().st_size
        return size

Note that ``get_tree_size()`` will get a huge speed boost on Windows,
because no extra stat calls are needed, but on Linux and OS X the size
information is not returned by the directory iteration functions, so
this function won't gain anything there.


Support
=======

The scandir module on GitHub has been forked and used quite a bit (see
"Use in the wild" in this PEP), but there's also been a fair bit of
direct support for a scandir-like function from core developers and
others on the python-dev and python-ideas mailing lists. A sampling:

* **Nick Coghlan**, a core Python developer: "I've had the local Red
  Hat release engineering team express their displeasure at having to
  stat every file in a network mounted directory tree for info that is
  present in the dirent structure, so a definite +1 to os.scandir from
  me, so long as it makes that info available."
  [`source1 <http://bugs.python.org/issue11406>`_]

* **Tim Golden**, a core Python developer, supports scandir enough to
  have spent time refactoring and significantly improving scandir's C
  extension module.
  [`source2 <https://github.com/tjguk/scandir>`_]

* **Christian Heimes**, a core Python developer: "+1 for something
  like yielddir()"
  [`source3 <https://mail.python.org/pipermail/python-ideas/2012-November/017772.html>`_]
  and "Indeed! I'd like to see the feature in 3.4 so I can remove my
  own hack from our code base."
  [`source4 <http://bugs.python.org/issue11406>`_]

* **Gregory P. Smith**, a core Python developer: "As 3.4beta1 happens
  tonight, this isn't going to make 3.4 so i'm bumping this to 3.5.
  I really like the proposed design outlined above."
  [`source5 <http://bugs.python.org/issue11406>`_]

* **Guido van Rossum** on the possibility of adding scandir to Python
  3.5 (as it was too late for 3.4): "The ship has likewise sailed for
  adding scandir() (whether to os or pathlib). By all means experiment
  and get it ready for consideration for 3.5, but I don't want to add
  it to 3.4."
  [`source6 <https://mail.python.org/pipermail/python-dev/2013-November/130583.html>`_]

Support for this PEP itself (meta-support?) was given by Nick Coghlan
on python-dev: "A PEP reviewing all this for 3.5 and proposing a
specific os.scandir API would be a good thing."
[`source7 <https://mail.python.org/pipermail/python-dev/2013-November/130588.html>`_]


Use in the wild
===============

To date, ``scandir`` is definitely useful, but has been clearly marked
"beta", so it's uncertain how much use of it there is in the wild. Ben
Hoyt has had several reports from people using it. For example:

* Chris F: "I am processing some pretty large directories and was half
  expecting to have to modify getdents. So thanks for saving me the
  effort." [via personal email]

* bschollnick: "I wanted to let you know about this, since I am using
  Scandir as a building block for this code. Here's a good example of
  scandir making a radical performance improvement over os.listdir."
  [`source8 <https://github.com/benhoyt/scandir/issues/19>`_]

* Avram L: "I'm testing our scandir for a project I'm working on.
  Seems pretty solid, so first thing, just want to say nice work!"
  [via personal email]

Others have `requested a PyPI package`_ for it, which has been
created. See `PyPI package`_.

.. _`requested a PyPI package`: https://github.com/benhoyt/scandir/issues/12
.. _`PyPI package`: https://pypi.python.org/pypi/scandir

GitHub stats don't mean too much, but scandir does have several
watchers, issues, forks, etc. Here's the run-down as of
June 5, 2014:

* Watchers: 17
* Stars: 48
* Forks: 15
* Issues: 2 open, 19 closed

**However, the much larger point is this:** if this PEP is accepted,
``os.walk()`` can easily be reimplemented using ``scandir`` rather
than ``listdir`` and ``stat``, increasing the speed of ``os.walk()``
very significantly. There are thousands of developers, scripts, and
production code that would benefit from this large speedup of
``os.walk()``. For example, on GitHub, there are almost as many uses
of ``os.walk`` (194,000) as there are of ``os.mkdir`` (230,000).


Open issues and optional things
===============================

There are a few open issues or optional additions:


Should scandir be in its own module?
------------------------------------

Should the function be included in the standard library in a new
module, ``scandir.scandir()``, or just as ``os.scandir()`` as
discussed? The preference of this PEP's author (Ben Hoyt) would be
``os.scandir()``, as it's just a single function.


Should there be a way to access the full path?
----------------------------------------------

Should ``DirEntry``'s have a way to get the full path without using
``os.path.join(path, entry.name)``? This is a pretty common pattern,
and it may be useful to add pathlib-like ``str(entry)`` functionality.
This functionality has also been requested in `issue 13`_ on GitHub.

.. _`issue 13`: https://github.com/benhoyt/scandir/issues/13
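
For reference, the join-it-yourself pattern in question looks roughly like this. The sketch assumes ``os.scandir()`` from Python 3.5 or later, and ``iter_full_paths`` is a hypothetical helper rather than anything proposed here.

```python
import os


def iter_full_paths(path='.'):
    """Yield (full_path, entry) pairs, joining path and name explicitly."""
    for entry in os.scandir(path):
        yield os.path.join(path, entry.name), entry
```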


Should it expose Windows wildcard functionality?
------------------------------------------------

Should ``scandir()`` have a way of exposing the wildcard functionality
in the Windows ``FindFirstFile`` / ``FindNextFile`` functions? The
scandir module on GitHub exposes this as a ``windows_wildcard``
keyword argument, allowing Windows power users the option to pass a
custom wildcard to ``FindFirstFile``, which may avoid the need to use
``fnmatch`` or similar on the resulting names. It is named the
unwieldy ``windows_wildcard`` to remind you you're writing power-user,
Windows-only code if you use it.

This boils down to whether ``scandir`` should be about exposing all of
the system's directory iteration features, or simply providing a fast,
simple, cross-platform directory iteration API.

This PEP's author votes for not including ``windows_wildcard`` in the
standard library version, because even though it could be useful in
rare cases (say the Windows Dropbox client?), it'd be too easy to use
it just because you're a Windows developer, and create code that is
not cross-platform.
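
The cross-platform alternative, filtering names with ``fnmatch`` after the fact, might be sketched like this. It assumes ``os.scandir()`` from Python 3.5 or later; ``scandir_matching`` is a made-up name for illustration.

```python
import fnmatch
import os


def scandir_matching(path, pattern):
    """Yield the entries in path whose names match a shell-style pattern."""
    for entry in os.scandir(path):
        # fnmatch applies the platform's case rules, so the same code
        # behaves sensibly on Windows, Linux and OS X.
        if fnmatch.fnmatch(entry.name, pattern):
            yield entry
```

This filters in Python rather than in the OS, so it gives up whatever speed the native wildcard might offer in exchange for portability.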


Possible improvements
=====================

There are many possible improvements one could make to scandir, but
here is a short list of some this PEP's author has in mind:

* scandir could potentially be further sped up by calling ``readdir``
  / ``FindNextFile`` say 50 times per ``Py_BEGIN_ALLOW_THREADS`` block
  so that it stays in the C extension module for longer, and may be
  somewhat faster as a result. This approach hasn't been tested, but
  was suggested on Issue 11406 by Antoine Pitrou.
  [`source9 <http://bugs.python.org/msg130125>`_]


Previous discussion
===================

* `Original thread Ben Hoyt started on python-ideas`_ about speeding
  up ``os.walk()``

* Python `Issue 11406`_, which includes the original proposal for a
  scandir-like function

* `Further thread Ben Hoyt started on python-dev`_ that refined the
  ``scandir()`` API, including Nick Coghlan's suggestion of scandir
  yielding ``DirEntry``-like objects

* `Final thread Ben Hoyt started on python-dev`_ to discuss the
  interaction between scandir and the new ``pathlib`` module

* `Question on StackOverflow`_ about why ``os.walk()`` is slow and
  pointers on how to fix it (this inspired the author of this PEP
  early on)

* `BetterWalk`_, this PEP's author's previous attempt at this, on
  which the scandir code is based

.. _`Original thread Ben Hoyt started on python-ideas`:
   https://mail.python.org/pipermail/python-ideas/2012-November/017770.html
.. _`Further thread Ben Hoyt started on python-dev`:
   https://mail.python.org/pipermail/python-dev/2013-May/126119.html
.. _`Final thread Ben Hoyt started on python-dev`:
   https://mail.python.org/pipermail/python-dev/2013-November/130572.html
.. _`Question on StackOverflow`:
   http://stackoverflow.com/questions/2485719/very-quickly-getting-total-size-of-folder
.. _`BetterWalk`: https://github.com/benhoyt/betterwalk


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:
MRAB
2014-06-26 23:28:20 UTC
Permalink
Post by Ben Hoyt
Hi Python dev folks,
I've written a PEP proposing a specific os.scandir() API for a
directory iterator that returns the stat-like info from the OS, the
main advantage of which is to speed up os.walk() and similar
operations between 4-20x, depending on your OS and file system. Full
details, background info, and context links are in the PEP, which
Victor Stinner has uploaded at the following URL, and I've also
copied inline below.
http://legacy.python.org/dev/peps/pep-0471/
Would love feedback on the PEP, but also of course on the proposal itself.
[snip]
Personally, I'd prefer the name 'iterdir' because it emphasises that
it's an iterator.
Tim Delaney
2014-06-26 23:36:28 UTC
Permalink
Post by MRAB
Personally, I'd prefer the name 'iterdir' because it emphasises that
it's an iterator.
Exactly what I was going to post (with the added note that there's an
obvious symmetry with listdir).

+1 for iterdir rather than scandir

Other than that:

+1 for adding scandir to the stdlib
-1 for windows_wildcard (it would be an attractive nuisance to write
windows-only code)

Tim Delaney
Ethan Furman
2014-06-26 23:43:43 UTC
Permalink
Post by MRAB
Personally, I'd prefer the name 'iterdir' because it emphasises that
it's an iterator.
Exactly what I was going to post (with the added note that there's an obvious symmetry with listdir).
+1 for iterdir rather than scandir
+1 for adding [it] to the stdlib
+1 for all of above

--
~Ethan~
Ben Hoyt
2014-06-27 01:37:50 UTC
Permalink

I don't mind iterdir() and would take it :-), but I'll just say why I
chose the name scandir() -- though it wasn't my suggestion originally:

iterdir() sounds like just an iterator version of listdir(), kinda
like keys() and iterkeys() in Python 2. Whereas in actual fact the
return values are quite different (DirEntry objects vs strings), and
so the name change reflects that difference a little.

I'm also -1 on windows_wildcard. I think it's asking for trouble, and
wouldn't gain much on Windows in most cases anyway.

-Ben
Post by Ethan Furman
Post by Tim Delaney
Post by MRAB
Personally, I'd prefer the name 'iterdir' because it emphasises that
it's an iterator.
Exactly what I was going to post (with the added note that there's an
obvious symmetry with listdir).
+1 for iterdir rather than scandir
+1 for adding [it] to the stdlib
+1 for all of above
--
~Ethan~
_______________________________________________
Python-Dev mailing list
https://mail.python.org/mailman/listinfo/python-dev
https://mail.python.org/mailman/options/python-dev/benhoyt%40gmail.com
MRAB
2014-06-27 01:50:38 UTC
Permalink
Post by Ben Hoyt
I don't mind iterdir() and would take it :-), but I'll just say why I
iterdir() sounds like just an iterator version of listdir(), kinda
like keys() and iterkeys() in Python 2. Whereas in actual fact the
return values are quite different (DirEntry objects vs strings), and
so the name change reflects that difference a little.
[snip]

The re module has 'findall', which returns a list of strings, and
'finditer', which returns an iterator that yields match objects, so
there's a precedent. :-)
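
A quick illustration of that precedent, using only the standard ``re`` module (the sample text and variable names are just for this example):

```python
import re

text = "a1 b22 c333"

# findall() returns a plain list of strings...
strings = re.findall(r"\d+", text)

# ...whereas finditer() yields match objects, which carry extra
# information such as the position of each match.
matches = list(re.finditer(r"\d+", text))
spans = [m.span() for m in matches]
```

So the "iter" variant already returns a richer type than the "all" variant there, much as DirEntry objects are richer than listdir()'s strings.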
Jonas Wielicki
2014-06-27 10:28:27 UTC
Permalink
Post by MRAB
Post by Ben Hoyt
I don't mind iterdir() and would take it :-), but I'll just say why I
iterdir() sounds like just an iterator version of listdir(), kinda
like keys() and iterkeys() in Python 2. Whereas in actual fact the
return values are quite different (DirEntry objects vs strings), and
so the name change reflects that difference a little.
[snip]
The re module has 'findall', which returns a list of strings, and
'finditer', which returns an iterator that yields match objects, so
there's a precedent. :-)
A bad precedent in my opinion though -- I was just recently bitten by
that, and I find it very untypical for python.

regards,
Jonas
Gregory P. Smith
2014-06-27 02:04:16 UTC
Permalink
+1 on getting this in for 3.5.

If the only objection people are having is the stupid paint color of the
name I don't care what it's called! scandir matches the libc API of the
same name. iterdir also makes sense to anyone reading it. Whoever checks
this in can pick one and be done with it. We have other Python APIs with
iter in the name and tend not to be trying to mirror C so much these days
so the iterdir folks do have a valid point.

I'm not a huge fan of the DirEntry object and the method calls on it
instead of simply yielding tuples of (filename,
partially_filled_in_stat_result) but I don't *really* care which is used as
they both work fine and it is trivial to wrap with another generator
expression to turn it into exactly what you want anyways.
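
Something along these lines is what I mean by wrapping it. This is a rough sketch against the API as it later shipped in Python 3.5, where the PEP's lstat() is spelled stat(follow_symlinks=False); note that on POSIX this returns a full stat result via an extra call, not a partially filled one, and scandir_tuples is just a name I made up:

```python
import os


def scandir_tuples(path='.'):
    """Return a generator of (filename, stat_result) pairs."""
    return ((entry.name, entry.stat(follow_symlinks=False))
            for entry in os.scandir(path))
```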

Python not having the ability to operate on large directories means Python
simply cannot be used for common system maintenance tasks. Python being
slow to walk a file system due to unnecessary stat calls (often each an
entire io op. requiring a disk seek!) due to the existing information that
it throws away not being used via listdir is similarly a problem. This
addresses both.

IMNSHO, it is a single function, it belongs in the os module right next to
listdir.

-gps
Post by Ben Hoyt
I don't mind iterdir() and would take it :-), but I'll just say why I
iterdir() sounds like just an iterator version of listdir(), kinda
like keys() and iterkeys() in Python 2. Whereas in actual fact the
return values are quite different (DirEntry objects vs strings), and
so the name change reflects that difference a little.
I'm also -1 on windows_wildcard. I think it's asking for trouble, and
wouldn't gain much on Windows in most cases anyway.
-Ben
Post by Ethan Furman
Post by Tim Delaney
Post by MRAB
Personally, I'd prefer the name 'iterdir' because it emphasises that
it's an iterator.
Exactly what I was going to post (with the added note that there's an
obvious symmetry with listdir).
+1 for iterdir rather than scandir
+1 for adding [it] to the stdlib
+1 for all of above
--
~Ethan~
Steven D'Aprano
2014-06-27 02:21:15 UTC
Permalink
Post by Ben Hoyt
I don't mind iterdir() and would take it :-), but I'll just say why I
iterdir() sounds like just an iterator version of listdir(), kinda
like keys() and iterkeys() in Python 2. Whereas in actual fact the
return values are quite different (DirEntry objects vs strings), and
so the name change reflects that difference a little.
+1

I think that's a good objective reason to prefer scandir, which suits
me, because my subjective opinion is that "iterdir" is an inelegant
and less than attractive name.
--
Steven
Ryan
2014-06-27 01:01:18 UTC
Permalink
+1 for scandir.
-1 for iterdir (scandir sounds fancier).
-99999999 for windows_wildcard.
Post by Tim Delaney
Post by MRAB
Personally, I'd prefer the name 'iterdir' because it emphasises that
it's an iterator.
Exactly what I was going to post (with the added note that there's an
obvious symmetry with listdir).
+1 for iterdir rather than scandir
+1 for adding scandir to the stdlib
-1 for windows_wildcard (it would be an attractive nuisance to write
windows-only code)
Tim Delaney
Chris Barker - NOAA Federal
2014-06-27 00:09:01 UTC
Permalink
Post by Tim Delaney
-1 for windows_wildcard (it would be an attractive nuisance to write
windows-only code)


Could you emulate it on other platforms?

+1 on the rest of it.

-Chris
Paul Sokolovsky
2014-06-27 00:07:46 UTC
Permalink
Hello,

On Thu, 26 Jun 2014 18:59:45 -0400
Post by Ben Hoyt
Hi Python dev folks,
I've written a PEP proposing a specific os.scandir() API for a
directory iterator that returns the stat-like info from the OS, the
main advantage of which is to speed up os.walk() and similar
operations between 4-20x, depending on your OS and file system. Full
details, background info, and context links are in the PEP, which
Victor Stinner has uploaded at the following URL, and I've also copied
inline below.
I noticed obvious inefficiency of os.walk() implemented in terms of
os.listdir() when I worked on "os" module for MicroPython. I essentially
did what your PEP suggests - introduced internal generator function
(ilistdir_ex() in
https://github.com/micropython/micropython-lib/blob/master/os/os/__init__.py#L85
), in terms of which both os.listdir() and os.walk() are implemented.


With my MicroPython hat on, os.scandir() would make things only worse.
With current interface, one can either have inefficient implementation
(like CPython chose) or efficient implementation (like MicroPython
chose) - all transparently. os.scandir() supposedly opens up efficient
implementation for everyone, but at the price of bloating API and
introducing heavy-weight objects to wrap info. PEP calls it
"lightweight DirEntry objects", but that cannot be true, because all
Python objects are heavy-weight, especially those which have methods.

It would be better if os.scandir() was specified to return a struct
(named tuple) compatible with return value of os.stat() (with only
fields relevant to underlying readdir()-like system call). The grounds
for that are obvious: it's already existing data interface in module
"os", which is also based on open standard for operating systems -
POSIX, so if one is to expect something about file attributes, it's
what one can reasonably base expectations on.


But reusing os.stat struct is glaringly not what's proposed. And
it's clear where that comes from - "[DirEntry.]lstat(): like os.lstat(),
but requires no system calls on Windows". Nice, but OS "FooBar" can do
much more than Windows - it has a system call to send a file by email,
right when scanning a directory containing it. So, why not to have
DirEntry.send_by_email(recipient) method? I hear the answer - it's
because CPython strives to support Windows well, while doesn't care
about "FooBar" OS.

And then it again leads to the question I posed several times - where's
line between "CPython" and "Python"? Is it grounded for CPython to add
(or remove) to Python stdlib something which is useful for its users,
but useless or complicating for other Python implementations?
Especially taking into account that there's "win32api" module allowing
Windows users to use all wonders of its API? Especially that os.stat
struct is itself pretty extensible
(https://docs.python.org/3.4/library/os.html#os.stat : "On other Unix
systems (such as FreeBSD), the following attributes may be
available ...", "On Mac OS systems...", - so extra fields can be added
for Windows just the same, if really needed).
Post by Ben Hoyt
http://legacy.python.org/dev/peps/pep-0471/
Would love feedback on the PEP, but also of course on the proposal itself.
-Ben
[]
--
Best regards,
Paul mailto:***@gmail.com
Benjamin Peterson
2014-06-27 00:35:21 UTC
Permalink
Post by Paul Sokolovsky
With my MicroPython hat on, os.scandir() would make things only worse.
With current interface, one can either have inefficient implementation
(like CPython chose) or efficient implementation (like MicroPython
chose) - all transparently. os.scandir() supposedly opens up efficient
implementation for everyone, but at the price of bloating API and
introducing heavy-weight objects to wrap info. PEP calls it
"lightweight DirEntry objects", but that cannot be true, because all
Python objects are heavy-weight, especially those which have methods.
Why do you think methods make an object more heavyweight? namedtuples
have methods.
Paul Sokolovsky
2014-06-27 00:47:08 UTC
Permalink
Hello,

On Thu, 26 Jun 2014 17:35:21 -0700
Post by Benjamin Peterson
Post by Paul Sokolovsky
With my MicroPython hat on, os.scandir() would make things only
worse. With current interface, one can either have inefficient
implementation (like CPython chose) or efficient implementation
(like MicroPython chose) - all transparently. os.scandir()
supposedly opens up efficient implementation for everyone, but at
the price of bloating API and introducing heavy-weight objects to
wrap info. PEP calls it "lightweight DirEntry objects", but that
cannot be true, because all Python objects are heavy-weight,
especially those which have methods.
Why do you think methods make an object more heavyweight?
Because you need to call them. And if the only thing they do is return
object field, call overhead is rather noticeable.
Post by Benjamin Peterson
namedtuples have methods.
Yes, unfortunately. But fortunately, named tuple is a subclass of
tuple, so user caring for efficiency can just use numeric indexing
which existed for os.stat values all the time, blissfully ignoring
cruft which have been accumulating there since 1.5 times.
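
For example, both access styles work on an os.stat() result; the index constants come from the standard "stat" module, and size_both_ways is just an illustrative name:

```python
import os
import stat


def size_both_ways(path):
    """Return the file size via tuple indexing and via attribute access."""
    st = os.stat(path)
    # stat.ST_SIZE is the tuple index of the size field in a stat result.
    return st[stat.ST_SIZE], st.st_size
```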
--
Best regards,
Paul mailto:***@gmail.com
Ben Hoyt
2014-06-27 01:52:43 UTC
Permalink
Post by Paul Sokolovsky
os.listdir() when I worked on "os" module for MicroPython. I essentially
did what your PEP suggests - introduced internal generator function
(ilistdir_ex() in
https://github.com/micropython/micropython-lib/blob/master/os/os/__init__.py#L85
), in terms of which both os.listdir() and os.walk() are implemented.
Nice (though I see the implementation is very *nix specific).
Post by Paul Sokolovsky
With my MicroPython hat on, os.scandir() would make things only worse.
With current interface, one can either have inefficient implementation
(like CPython chose) or efficient implementation (like MicroPython
chose) - all transparently. os.scandir() supposedly opens up efficient
implementation for everyone, but at the price of bloating API and
introducing heavy-weight objects to wrap info. PEP calls it
"lightweight DirEntry objects", but that cannot be true, because all
Python objects are heavy-weight, especially those which have methods.
It's a fair point that os.walk() can be implemented efficiently
without adding a new function and API. However, often you'll want more
info, like the file size, which scandir() can give you via
DirEntry.lstat(), which is free on Windows. So opening up this
efficient API is beneficial.

In CPython, I think the DirEntry objects are as lightweight as
stat_result objects.

I'm an embedded developer by background, so I know the constraints
here, but I really don't think Python's development should be tailored
to fit MicroPython. If os.scandir() is not very efficient on
MicroPython, so be it -- 99% of all desktop/server users will gain
from it.
Post by Paul Sokolovsky
It would be better if os.scandir() was specified to return a struct
(named tuple) compatible with return value of os.stat() (with only
fields relevant to underlying readdir()-like system call). The grounds
for that are obvious: it's already existing data interface in module
"os", which is also based on open standard for operating systems -
POSIX, so if one is to expect something about file attributes, it's
what one can reasonably base expectations on.
Yes, we considered this early on (see the python-ideas and python-dev
threads referenced in the PEP), but decided it wasn't a great API to
overload stat_result further, and have most of the attributes None or
not present on Linux.
Post by Paul Sokolovsky
Especially that os.stat struct is itself pretty extensible
(https://docs.python.org/3.4/library/os.html#os.stat : "On other Unix
systems (such as FreeBSD), the following attributes may be
available ...", "On Mac OS systems...", - so extra fields can be added
for Windows just the same, if really needed).
Yes. Incidentally, I just submitted an (accepted) patch for Python 3.5
that adds the full Win32 file attribute data to stat_result objects on
Windows (see https://docs.python.org/3.5/whatsnew/3.5.html#os).

However, for scandir() to be useful, you also need the name. My
original version of this directory iterator returned two-tuples of
(name, stat_result). But most people didn't like the API, and I don't
really either. You could overload stat_result with a .name attribute
in this case, but it still isn't a nice API to have most of the
attributes None, and then you have to test for that, etc.

So basically we tweaked the API to do what was best, and ended up with
it returning DirEntry objects with is_file() and similar methods.
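
To make the two API shapes concrete, here is a rough sketch under stated assumptions. scan_pairs() is a hypothetical stand-in for the earlier (name, stat_result) design; it is emulated on top of os.scandir() with a full stat call, so it does not reproduce the mostly-None fields discussed above. The second function uses the DirEntry methods available in Python 3.6+.

```python
import os
import stat


def scan_pairs(path):
    """Hypothetical tuple-style API: yield (name, stat_result) pairs."""
    with os.scandir(path) as entries:
        for entry in entries:
            yield entry.name, entry.stat(follow_symlinks=False)


def files_via_pairs(path):
    """The caller has to decode st_mode to learn the entry type."""
    return sorted(name for name, st in scan_pairs(path)
                  if stat.S_ISREG(st.st_mode))


def files_via_entries(path):
    """The caller asks the entry directly."""
    with os.scandir(path) as entries:
        return sorted(entry.name for entry in entries
                      if entry.is_file(follow_symlinks=False))
```

Both functions return the same names; the difference is only in how much the caller needs to know about stat_result.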

Hope that helps give a bit more context. If you haven't read the
relevant python-ideas and python-dev threads, those are interesting
too.

-Ben
Paul Sokolovsky
2014-06-27 11:48:17 UTC
Hello,

On Thu, 26 Jun 2014 21:52:43 -0400
Ben Hoyt <***@gmail.com> wrote:

[]
Post by Ben Hoyt
It's a fair point that os.walk() can be implemented efficiently
without adding a new function and API. However, often you'll want more
info, like the file size, which scandir() can give you via
DirEntry.lstat(), which is free on Windows. So opening up this
efficient API is beneficial.
In CPython, I think the DirEntry objects are as lightweight as
stat_result objects.
I'm an embedded developer by background, so I know the constraints
here, but I really don't think Python's development should be tailored
to fit MicroPython. If os.scandir() is not very efficient on
MicroPython, so be it -- 99% of all desktop/server users will gain
from it.
Surely, tailoring Python to MicroPython's needs is not at all what
I suggest. It was an example of an alternative implementation which
optimized os.walk() without the need for any additional public module APIs.
Conversely, the high-level nature of an API call like os.walk() and the
underspecification of low-level details (like which function is
implemented in terms of which others) allow MicroPython to provide an
optimized implementation even with its resource constraints. So, the power
of high-level interfaces and underspecification should not be
underestimated ;-).

But I don't want to argue that os.scandir() is "not needed", because
that's hardly productive. Something I'd like to prototype in uPy and
ideally take further up to PEP status is adding iterator-based string
methods, and I can pretty much expect a "we lived without it" response,
so I don't want to go the same way regarding the addition of other
iterator-based APIs - it's clear that more iterator/generator-based APIs
are a good direction for Python to evolve in.
Post by Ben Hoyt
Post by Paul Sokolovsky
It would be better if os.scandir() was specified to return a struct
(named tuple) compatible with return value of os.stat() (with only
fields relevant to underlying readdir()-like system call). The
grounds for that are obvious: it's already existing data interface
in module "os", which is also based on open standard for operating
systems - POSIX, so if one is to expect something about file
attributes, it's what one can reasonably base expectations on.
Yes, we considered this early on (see the python-ideas and python-dev
threads referenced in the PEP), but decided it wasn't a great API to
overload stat_result further, and have most of the attributes None or
not present on Linux.
[]
Post by Ben Hoyt
However, for scandir() to be useful, you also need the name. My
original version of this directory iterator returned two-tuples of
(name, stat_result). But most people didn't like the API, and I don't
really either. You could overload stat_result with a .name attribute
in this case, but it still isn't a nice API to have most of the
attributes None, and then you have to test for that, etc.
Yes, returning (name, stat_result) would be my first choice too; I
don't see why someone wouldn't like a pair of two values, each
of obvious type and semantics within the "os" module. Regarding the stat
result, os.stat() provides full information about a file,
and intuitively, one may expect that os.scandir() would provide a subset
of that info, asymptotically approaching what os.stat() may
provide, depending on OS capabilities. So, if a truly OS-independent
interface is wanted to salvage more data from a directory scan, using
the os.stat struct as the data interface is hard to ignore.


But well, if it was rejected already, what can be said? Perhaps, at
least the PEP could be extended to explicitly mention the other approaches
which were discussed and rejected, not just link to a discussion
archive (from experience with reading other PEPs, they often
contain such subsections, so I hope this suggestion is not ungrounded).
Post by Ben Hoyt
So basically we tweaked the API to do what was best, and ended up with
it returning DirEntry objects with is_file() and similar methods.
Hope that helps give a bit more context. If you haven't read the
relevant python-ideas and python-dev threads, those are interesting
too.
-Ben
--
Best regards,
Paul mailto:***@gmail.com
Steven D'Aprano
2014-06-27 02:08:41 UTC
Post by Paul Sokolovsky
With my MicroPython hat on, os.scandir() would make things only worse.
With current interface, one can either have inefficient implementation
(like CPython chose) or efficient implementation (like MicroPython
chose) - all transparently. os.scandir() supposedly opens up efficient
implementation for everyone, but at the price of bloating API and
introducing heavy-weight objects to wrap info.
os.scandir is not part of the Python API, it is not a built-in function.
It is part of the CPython standard library. That means (in my opinion)
that there is an expectation that other Pythons should provide it, but
not an absolute requirement. Especially for the os module, which by
definition is platform-specific. In my opinion that means you have four
options:

1. provide os.scandir, with exactly the same semantics as on CPython;

2. provide os.scandir, but change its semantics to be more lightweight
(e.g. return an ordinary tuple, as you already suggest);

3. don't provide os.scandir at all; or

4. do something different depending on whether the platform is Linux
or an embedded system.

I would consider any of those acceptable for a library feature, but not
for a language feature.


[...]
Post by Paul Sokolovsky
But reusing os.stat struct is glaringly not what's proposed. And
it's clear where that comes from - "[DirEntry.]lstat(): like os.lstat(),
but requires no system calls on Windows". Nice, but OS "FooBar" can do
much more than Windows - it has a system call to send a file by email,
right when scanning a directory containing it. So, why not to have
DirEntry.send_by_email(recipient) method? I hear the answer - it's
because CPython strives to support Windows well, while doesn't care
about "FooBar" OS.
Correct. If there is sufficient demand for FooBar, then CPython may
support it. Until then, FooBarPython can support it, and offer whatever
platform-specific features are needed within its standard library.
Post by Paul Sokolovsky
And then it again leads to the question I posed several times - where's
line between "CPython" and "Python"? Is it grounded for CPython to add
(or remove) to Python stdlib something which is useful for its users,
but useless or complicating for other Python implementations?
I think so. And other implementations are free to do the same thing.

Of course there is an expectation that the standard library of most
implementations will be broadly similar, but not that they will be
identical.

I am surprised that both Jython and IronPython offer a non-functioning
dis module: you can import it successfully, but if there's a way to
actually use it, I haven't found it:


***@orac:~$ jython
Jython 2.5.1+ (Release_2_5_1, Aug 4 2010, 07:18:19)
[OpenJDK Server VM (Sun Microsystems Inc.)] on java1.6.0_27
Type "help", "copyright", "credits" or "license" for more information.
>>> import dis
>>> dis.dis(lambda x: x+1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/share/jython/Lib/dis.py", line 42, in dis
    disassemble(x)
  File "/usr/share/jython/Lib/dis.py", line 64, in disassemble
    linestarts = dict(findlinestarts(co))
  File "/usr/share/jython/Lib/dis.py", line 183, in findlinestarts
    byte_increments = [ord(c) for c in code.co_lnotab[0::2]]
AttributeError: 'tablecode' object has no attribute 'co_lnotab'


IronPython gives a different exception:

***@orac:~$ ipy
IronPython 2.6 Beta 2 DEBUG (2.6.0.20) on .NET 2.0.50727.1433
Type "help", "copyright", "credits" or "license" for more information.
>>> import dis
>>> dis.dis(lambda x: x+1)
Traceback (most recent call last):
TypeError: don't know how to disassemble code objects


It's quite annoying; I would rather they had just removed the
module altogether. Better still would have been to disassemble code
objects to whatever byte code the Java and .Net platforms use. But
there's surely no requirement to disassemble to CPython byte code!
--
Steven
Paul Sokolovsky
2014-06-27 12:13:13 UTC
Hello,

On Fri, 27 Jun 2014 12:08:41 +1000
Post by Steven D'Aprano
Post by Paul Sokolovsky
With my MicroPython hat on, os.scandir() would make things only
worse. With current interface, one can either have inefficient
implementation (like CPython chose) or efficient implementation
(like MicroPython chose) - all transparently. os.scandir()
supposedly opens up efficient implementation for everyone, but at
the price of bloating API and introducing heavy-weight objects to
wrap info.
os.scandir is not part of the Python API, it is not a built-in
function. It is part of the CPython standard library.
Ok, so standard library also has API, and that's the API being
discussed.
Post by Steven D'Aprano
That means (in
my opinion) that there is an expectation that other Pythons should
provide it, but not an absolute requirement. Especially for the os
module, which by definition is platform-specific.
Yes, that's intuitive, but not strict and formal, so is subject to
interpretations. As a developer working on alternative Python
implementation, I'd like to have better understanding of what needs to
be done to be a compliant implementation (in particular, because I need
to pass that info down to the users). So, I was told that
https://docs.python.org/3/reference/index.html describes Python, not
CPython. Next step is figuring out whether
https://docs.python.org/3/library/index.html describes Python or
CPython, and if the latter, how to separate Python's stdlib essence from
extended library CPython provides?
Post by Steven D'Aprano
In my opinion that means you have four
options:
1. provide os.scandir, with exactly the same semantics as on CPython;
2. provide os.scandir, but change its semantics to be more
lightweight (e.g. return an ordinary tuple, as you already suggest);
3. don't provide os.scandir at all; or
4. do something different depending on whether the platform is Linux
or an embedded system.
I would consider any of those acceptable for a library feature, but
not for a language feature.
Good, thanks. If that represents the shared opinion of (C)Python developers
(so, there won't be claims like "MicroPython is not Python because it
doesn't provide os.scandir()" (or hundreds of other missing stdlib
functions ;-) )) that's good enough already.

With that in mind, I wish for any Python implementation to be as
complete and as efficient as possible, and one way to achieve that is
to not add stdlib entities without real need (be it more API calls or
more data types). So, I'm glad to know that os.scandir() has passed
Occam's Razor in this respect and is specified the way it is for the
common good.


[]
--
Best regards,
Paul mailto:***@gmail.com
Glenn Linderman
2014-06-27 02:43:34 UTC
I'm generally +1, with opinions noted below on these two topics.
Post by Ben Hoyt
Should there be a way to access the full path?
----------------------------------------------
Should ``DirEntry``'s have a way to get the full path without using
``os.path.join(path, entry.name)``? This is a pretty common pattern,
and it may be useful to add pathlib-like ``str(entry)`` functionality.
This functionality has also been requested in `issue 13`_ on GitHub.
.. _`issue 13`: https://github.com/benhoyt/scandir/issues/13
+1
Post by Ben Hoyt
Should it expose Windows wildcard functionality?
------------------------------------------------
Should ``scandir()`` have a way of exposing the wildcard functionality
in the Windows ``FindFirstFile`` / ``FindNextFile`` functions? The
scandir module on GitHub exposes this as a ``windows_wildcard``
keyword argument, allowing Windows power users the option to pass a
custom wildcard to ``FindFirstFile``, which may avoid the need to use
``fnmatch`` or similar on the resulting names. It is named the
unwieldy ``windows_wildcard`` to remind you you're writing
power-user, Windows-only code if you use it.
This boils down to whether ``scandir`` should be about exposing all of
the system's directory iteration features, or simply providing a fast,
simple, cross-platform directory iteration API.
This PEP's author votes for not including ``windows_wildcard`` in the
standard library version, because even though it could be useful in
rare cases (say the Windows Dropbox client?), it'd be too easy to use
it just because you're a Windows developer, and create code that is
not cross-platform.
Because another common pattern is to check whether a name matches a
pattern, I think it would be good to have a feature that provides this. I
do that in my own private directory listing extensions, and some command
lines also expose it to the user. Where exposed to the user, I use -p
windows-pattern and -P regexp. My implementation converts the
windows-pattern to a regexp, and then uses common code, but for this
particular API, because the windows_wildcard can be optimized by the
Windows API call used, it would make more sense to pass windows_wildcard
directly to FindFirst on Windows, but on *nix convert it to a regexp.
Both Windows and *nix would call re to process pattern matches except
for the case on Windows of having a Windows pattern passed in. The
alternate parameter could simply be called wildcard, and would be a
regexp. If desired, other flavors of wildcard (bsd_wildcard?) could also
be implemented, but I'm not sure there are any benefits to them, as
there are, as far as I am aware, no optimizations for those patterns in
those systems.
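
A rough sketch of the *nix side of this suggestion, assuming Python 3.6+: the Windows-style pattern is converted to a regexp with fnmatch.translate() and both parameters then share the same filtering code. scan_filtered() and its parameter handling are hypothetical, the FindFirstFile optimization is not modeled, and fnmatch.translate() does not reproduce the Windows wildcard quirks.

```python
import fnmatch
import os
import re


def scan_filtered(path, wildcard=None, windows_wildcard=None):
    """Yield DirEntry objects whose names match the given pattern.

    wildcard is a regexp; windows_wildcard is a glob such as "*.jpg".
    Both names are illustrative only.
    """
    if windows_wildcard is not None:
        # Convert the glob to a regexp; Windows matching ignores case.
        regex = re.compile(fnmatch.translate(windows_wildcard), re.IGNORECASE)
        keep = regex.match
    elif wildcard is not None:
        keep = re.compile(wildcard).search
    else:
        keep = None
    with os.scandir(path) as entries:
        for entry in entries:
            if keep is None or keep(entry.name):
                yield entry
```

For example, scan_filtered(".", windows_wildcard="*.jpg") and scan_filtered(".", wildcard=r"\.jpg$") go through the same loop and differ only in how the pattern is compiled.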
Paul Moore
2014-06-27 06:47:21 UTC
Post by Ben Hoyt
Would love feedback on the PEP, but also of course on the proposal itself.
A solid +1 from me.

Some specific points:

- I'm in favour of it being in the os module. It's more discoverable
there, as well as the other reasons mentioned.
- I prefer scandir as the name, for the reason you gave (the output
isn't the same as an iterator version of listdir)
- I'm mildly against windows_wildcard (even though I'm a windows user)
- You mention the caching behaviour of DirEntry objects. The
limitations should be clearly covered in the final docs, as it's the
sort of thing people will get wrong otherwise.

Paul
Victor Stinner
2014-06-27 07:44:17 UTC
Hi,

You wrote a great PEP Ben, thanks :-) But it's now time for comments!
Post by Ben Hoyt
But the underlying system calls -- ``FindFirstFile`` /
``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
What about FreeBSD, OpenBSD, NetBSD, Solaris, etc. They don't provide readdir?

You should add a link to FindFirstFile doc:
http://msdn.microsoft.com/en-us/library/windows/desktop/aa364418%28v=vs.85%29.aspx

It looks like the WIN32_FIND_DATA has a dwFileAttributes field. So we
should mimic stat_result recent addition: the new
stat_result.file_attributes field. Add DirEntry.file_attributes which
would only be available on Windows.

The Windows structure also contains

FILETIME ftCreationTime;
FILETIME ftLastAccessTime;
FILETIME ftLastWriteTime;
DWORD nFileSizeHigh;
DWORD nFileSizeLow;

It would be nice to expose them as well. I'm no longer surprised that
the exact API is different depending on the OS for functions of the os
module.
Post by Ben Hoyt
* Instead of bare filename strings, it returns lightweight
``DirEntry`` objects that hold the filename string and provide
simple methods that allow access to the stat-like data the operating
system returned.
Does your implementation use a free list to avoid the cost of memory
allocation? A short free list of 10 or maybe just 1 may help. The free
list may be stored directly in the generator object.
Post by Ben Hoyt
``scandir()`` yields a ``DirEntry`` object for each file and directory
in ``path``. Just like ``listdir``, the ``'.'`` and ``'..'``
pseudo-directories are skipped, and the entries are yielded in
system-dependent order. Each ``DirEntry`` object has the following
Does it also support bytes filenames on UNIX?

Python now supports undecodable filenames thanks to the PEP 383
(surrogateescape). I prefer to use the same type for filenames on
Linux and Windows, so Unicode is better. But some users might prefer
bytes for other reasons.
Post by Ben Hoyt
The ``DirEntry`` attribute and method names were chosen to be the same
as those in the new ``pathlib`` module for consistency.
Great! That's exactly what I expected :-) Consistency with other modules.
Post by Ben Hoyt
Notes on caching
----------------
The ``DirEntry`` objects are relatively dumb -- the ``name`` attribute
is obviously always cached, and the ``is_X`` and ``lstat`` methods
cache their values (immediately on Windows via ``FindNextFile``, and
on first use on Linux / OS X via a ``stat`` call) and never refetch
from the system.
For this reason, ``DirEntry`` objects are intended to be used and
thrown away after iteration, not stored in long-lived data structures
and the methods called again and again.
If a user wants to do that (for example, for watching a file's size
change), they'll need to call the regular ``os.lstat()`` or
``os.path.getsize()`` functions which force a new system call each
time.
Crazy idea: would it be possible to "convert" a DirEntry object to a
pathlib.Path object without losing the cache? I guess that
pathlib.Path expects a full stat_result object.
Post by Ben Hoyt
Or, for getting the total size of files in a directory tree -- showing
use of the ``DirEntry.lstat()`` method::

    def get_tree_size(path):
        """Return total size of files in path and subdirs."""
        size = 0
        for entry in os.scandir(path):
            if entry.is_dir():
                sub_path = os.path.join(path, entry.name)
                size += get_tree_size(sub_path)
            else:
                size += entry.lstat().st_size
        return size

Note that ``get_tree_size()`` will get a huge speed boost on Windows,
because no extra stat calls are needed, but on Linux and OS X the size
information is not returned by the directory iteration functions, so
this function won't gain anything there.
I don't understand how you can build a full lstat() result without
really calling stat. I see that WIN32_FIND_DATA contains the size, but
here you call lstat(). If you know that it's not a symlink, you
already know the size, but you still have to call stat() to retrieve
all fields required to build a stat_result no?
Post by Ben Hoyt
Support
=======
The scandir module on GitHub has been forked and used quite a bit (see
"Use in the wild" in this PEP),
Do you plan to continue to maintain your module for Python < 3.5, but
upgrade your module for the final PEP?
Post by Ben Hoyt
Should scandir be in its own module?
------------------------------------
Should the function be included in the standard library in a new
module, ``scandir.scandir()``, or just as ``os.scandir()`` as
discussed? The preference of this PEP's author (Ben Hoyt) would be
``os.scandir()``, as it's just a single function.
Yes, put it in the os module which is already bloated :-)
Post by Ben Hoyt
Should there be a way to access the full path?
----------------------------------------------
Should ``DirEntry``'s have a way to get the full path without using
``os.path.join(path, entry.name)``? This is a pretty common pattern,
and it may be useful to add pathlib-like ``str(entry)`` functionality.
This functionality has also been requested in `issue 13`_ on GitHub.
.. _`issue 13`: https://github.com/benhoyt/scandir/issues/13
I think that it would be very convenient to store the directory name
in the DirEntry. It should be light, it's just a reference.

And provide a fullname() method which would just return
os.path.join(path, entry.name) without trying to resolve path to get
an absolute path.
Post by Ben Hoyt
Should it expose Windows wildcard functionality?
------------------------------------------------
Should ``scandir()`` have a way of exposing the wildcard functionality
in the Windows ``FindFirstFile`` / ``FindNextFile`` functions? The
scandir module on GitHub exposes this as a ``windows_wildcard``
keyword argument, allowing Windows power users the option to pass a
custom wildcard to ``FindFirstFile``, which may avoid the need to use
``fnmatch`` or similar on the resulting names. It is named the
unwieldy ``windows_wildcard`` to remind you you're writing
power-user, Windows-only code if you use it.
This boils down to whether ``scandir`` should be about exposing all of
the system's directory iteration features, or simply providing a fast,
simple, cross-platform directory iteration API.
Would it be hard to implement the wildcard feature on UNIX to compare
the performance of scandir('*.jpg') with and without the wildcard built
into os.scandir?

I implemented it in C for the tracemalloc module (Filter object):
http://hg.python.org/features/tracemalloc

Get the revision 69fd2d766005 and search for match_filename_joker() in
Modules/_tracemalloc.c. The function matches the filename backward
because in most cases, the last letter is enough to reject a filename
(ex: "*.jpg" => reject filenames not ending with "g").

The filename is normalized before matching the pattern: converted to
lowercase and / is replaced with \ on Windows.
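
The backward-matching idea can be sketched in Python as follows. This is only an illustration for simple "*suffix" patterns, not a port of the tracemalloc C code; match_suffix_backward() and its normalize flag are hypothetical names.

```python
def match_suffix_backward(name, pattern, normalize=False):
    """Match name against a '*suffix' pattern, comparing from the end.

    Only patterns with a single leading '*' are handled in this sketch.
    """
    if not pattern.startswith("*") or "*" in pattern[1:] or "?" in pattern:
        raise ValueError("only '*suffix' patterns are supported")
    suffix = pattern[1:]
    if normalize:
        # Mimic the normalization described above: lowercase, '/' -> '\'.
        name = name.lower().replace("/", "\\")
        suffix = suffix.lower().replace("/", "\\")
    if len(name) < len(suffix):
        return False
    # Compare from the last character so most non-matches fail at once.
    for offset in range(1, len(suffix) + 1):
        if name[-offset] != suffix[-offset]:
            return False
    return True
```

In everyday Python code str.endswith() would do the same job; the explicit loop is only there to show the order of comparison.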

It was decided to drop the Filter object to keep the tracemalloc
module as simple as possible. Charles-François was not convinced by
the speedup.

But the tracemalloc case is different because the OS doesn't provide an
API for that.

Victor
Ben Hoyt
2014-06-28 19:48:03 UTC
Post by Victor Stinner
Post by Ben Hoyt
But the underlying system calls -- ``FindFirstFile`` /
``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
What about FreeBSD, OpenBSD, NetBSD, Solaris, etc. They don't provide readdir?
I guess it'd be better to say "Windows" and "Unix-based OSs"
throughout the PEP? Because all of these (including Mac OS X) are
Unix-based.
Post by Victor Stinner
It looks like the WIN32_FIND_DATA has a dwFileAttributes field. So we
should mimic stat_result recent addition: the new
stat_result.file_attributes field. Add DirEntry.file_attributes which
would only be available on Windows.
The Windows structure also contains
FILETIME ftCreationTime;
FILETIME ftLastAccessTime;
FILETIME ftLastWriteTime;
DWORD nFileSizeHigh;
DWORD nFileSizeLow;
It would be nice to expose them as well. I'm no longer surprised that
the exact API is different depending on the OS for functions of the os
module.
I think you've misunderstood how DirEntry.lstat() works on Windows --
it's basically a no-op, as Windows returns the full stat information
with the original FindFirst/FindNext OS calls. This is fairly explicit
in the PEP, but I'm sure I could make it clearer:

DirEntry.lstat(): "like os.lstat(), but requires no system calls on Windows"

So you can already get the dwFileAttributes for free by saying
entry.lstat().st_file_attributes. You can also get all the other
fields you mentioned for free via .lstat() with no additional OS calls
on Windows, for example: entry.lstat().st_size.

Feel free to suggest changes to the PEP or scandir docs if this isn't
clear. Note that is_dir()/is_file()/is_symlink() are free on all
systems, but .lstat() is only free on Windows.
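
As a hedged illustration of the size lookup described above, here is a tree-size function written against os.scandir() as available in Python 3.6+. It assumes entry.stat(follow_symlinks=False) in place of the draft's DirEntry.lstat(), and entry.path in place of joining the names by hand.

```python
import os


def get_tree_size(path):
    """Return the total size in bytes of all files under path."""
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                # Type check: usually answered from the directory read.
                total += get_tree_size(entry.path)
            else:
                # Size: no extra system call on Windows, one stat on POSIX.
                total += entry.stat(follow_symlinks=False).st_size
    return total
```

The code is identical on every platform; only the number of system calls behind entry.stat() differs.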
Post by Victor Stinner
Does your implementation use a free list to avoid the cost of memory
allocation? A short free list of 10 or maybe just 1 may help. The free
list may be stored directly in the generator object.
No, it doesn't. I might add this to the PEP under "possible
improvements". However, I think the speed increase by removing the
extra OS call and/or disk seek is going to be way more than memory
allocation improvements, so I'm not sure this would be worth it.
Post by Victor Stinner
Does it also support bytes filenames on UNIX?
Python now supports undecodable filenames thanks to the PEP 383
(surrogateescape). I prefer to use the same type for filenames on
Linux and Windows, so Unicode is better. But some users might prefer
bytes for other reasons.
I forget exactly now what my scandir module does, but for os.scandir()
I think this should behave exactly like os.listdir() does for
Unicode/bytes filenames.
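
A small sketch of what "behave exactly like os.listdir()" looks like in practice, assuming os.scandir() as available in Python 3.6+: the type of each entry name follows the type of the path argument. entry_names() is a hypothetical helper.

```python
import os


def entry_names(path):
    """Return the sorted entry names under path (str in, str out; bytes in, bytes out)."""
    with os.scandir(path) as entries:
        return sorted(entry.name for entry in entries)
```

Calling entry_names(".") gives str names, while entry_names(b".") gives bytes names, mirroring os.listdir().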
Post by Victor Stinner
Crazy idea: would it be possible to "convert" a DirEntry object to a
pathlib.Path object without losing the cache? I guess that
pathlib.Path expects a full stat_result object.
The main problem is that pathlib.Path objects explicitly don't cache
stat info (and Guido doesn't want them to, for good reason I think).
There was a thread on python-dev about this earlier. I'll add it to a
"Rejected ideas" section.
Post by Victor Stinner
I don't understand how you can build a full lstat() result without
really calling stat. I see that WIN32_FIND_DATA contains the size, but
here you call lstat().
See above.
Post by Victor Stinner
Do you plan to continue to maintain your module for Python < 3.5, but
upgrade your module for the final PEP?
Yes, I intend to maintain the standalone scandir module for 2.6 <=
Python < 3.5, at least for a good while. For integration into the
Python 3.5 stdlib, the implementation will be integrated into
posixmodule.c, of course.
Post by Victor Stinner
Post by Ben Hoyt
Should there be a way to access the full path?
----------------------------------------------
Should ``DirEntry``'s have a way to get the full path without using
``os.path.join(path, entry.name)``? This is a pretty common pattern,
and it may be useful to add pathlib-like ``str(entry)`` functionality.
This functionality has also been requested in `issue 13`_ on GitHub.
.. _`issue 13`: https://github.com/benhoyt/scandir/issues/13
I think that it would be very convenient to store the directory name
in the DirEntry. It should be light, it's just a reference.
And provide a fullname() method which would just return
os.path.join(path, entry.name) without trying to resolve path to get
an absolute path.
Yeah, fair suggestion. I'm still slightly on the fence about this, but
I think an explicit fullname() is a good suggestion. Ideally I think
it'd be better to mimic pathlib.Path.__str__() which is kind of the
equivalent of fullname(). But how does pathlib deal with unicode/bytes
issues if it's the str function which has to return a str object? Or
at least, it'd be very weird if __str__() returned bytes. But I think
it'd need to if you passed bytes into scandir(). Do others have
thoughts?
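
For reference, a minimal sketch of the fullname() idea using os.scandir() as available in Python 3.6+, where the joined path is exposed as the entry.path attribute rather than a method or __str__(). joined_and_path() is a hypothetical helper that pairs the hand-built path with entry.path so the two can be compared.

```python
import os


def joined_and_path(directory):
    """Return sorted (os.path.join(directory, name), entry.path) pairs."""
    with os.scandir(directory) as entries:
        return sorted((os.path.join(directory, entry.name), entry.path)
                      for entry in entries)
```

Because path is a plain attribute, it can be bytes when a bytes directory is passed in, which sidesteps the question of __str__() returning bytes.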
Post by Victor Stinner
Would it be hard to implement the wildcard feature on UNIX to compare
performances of scandir('*.jpg') with and without the wildcard built
in os.scandir?
It's a good idea; the problem with this is that the Windows wildcard
implementation has a bunch of crazy edge cases where *.ext will catch
more things than just a simple regex/glob. This was discussed on
python-dev or python-ideas previously, so I'll dig it up and add to a
Rejected Ideas section. In any case, this could be added later if
there's a way to iron out the Windows quirks.

-Ben
Nick Coghlan
2014-06-29 04:59:19 UTC
Post by Ben Hoyt
Post by Victor Stinner
Post by Ben Hoyt
But the underlying system calls -- ``FindFirstFile`` /
``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
What about FreeBSD, OpenBSD, NetBSD, Solaris, etc. They don't provide readdir?
I guess it'd be better to say "Windows" and "Unix-based OSs"
throughout the PEP? Because all of these (including Mac OS X) are
Unix-based.
*nix and POSIX-based are the two conventions I use.
Post by Ben Hoyt
Post by Victor Stinner
Crazy idea: would it be possible to "convert" a DirEntry object to a
pathlib.Path object without losing the cache? I guess that
pathlib.Path expects a full stat_result object.
The main problem is that pathlib.Path objects explicitly don't cache
stat info (and Guido doesn't want them to, for good reason I think).
There's a thread on python-dev about this earlier. I'll add it to a
"Rejected ideas" section.
The key problem with caches on pathlib.Path objects is that you could
end up with two separate path objects that referred to the same
filesystem location but returned different answers about the
filesystem state because their caches might be stale. DirEntry is
different, as the content is generally *assumed* to be stale
(referring to when the directory was scanned, rather than the current
filesystem state). DirEntry.lstat() on POSIX systems will be an
exception to that general rule (referring to the time of first lookup,
rather than when the directory was scanned, so the answer from lstat()
may be inconsistent with other data stored directly on the DirEntry
object), but one we can probably live with.
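
The staleness described here can be demonstrated with a short sketch, assuming os.scandir() as available in Python 3.6+ and entry.stat() in place of the draft's lstat(). cached_vs_fresh_size() is a hypothetical helper that grows a file after it has been scanned.

```python
import os
import tempfile


def cached_vs_fresh_size():
    """Return (first, cached, fresh) sizes for a file that grows after scanning."""
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "log.txt")
        with open(target, "w") as f:
            f.write("abc")
        with os.scandir(root) as entries:
            entry = next(entries)
        first = entry.stat().st_size   # may trigger the single stat call on POSIX
        with open(target, "a") as f:
            f.write("defgh")
        cached = entry.stat().st_size  # answered from the DirEntry cache
        fresh = os.stat(entry.path).st_size  # always a new system call
        return first, cached, fresh
```

The cached value keeps reporting the size seen at first lookup, while os.stat() reflects the appended data.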

More generally, as part of the pathlib PEP review, we figured out that
a *per-object* cache of filesystem state would be an inherently bad
idea, but a string based *process global* cache might make sense for
modules like walkdir (not part of the stdlib - it's an iterator
pipeline based approach to file tree scanning I wrote a while back,
that currently suffers badly from the performance impact of repeated
stat calls at different stages of the pipeline). We realised this was
getting into a space where application and library specific concerns
are likely to start affecting the caching design, though, so the
current status of standard library level stat caching is "it's not
clear if there's an available approach that would be sufficiently
general purpose to be appropriate for inclusion in the standard
library".

Cheers,
Nick.
--
Nick Coghlan | ***@gmail.com | Brisbane, Australia
Gregory P. Smith
2014-06-29 06:26:24 UTC
Post by Ben Hoyt
Post by Victor Stinner
Post by Ben Hoyt
But the underlying system calls -- ``FindFirstFile`` /
``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
What about FreeBSD, OpenBSD, NetBSD, Solaris, etc. They don't provide readdir?
I guess it'd be better to say "Windows" and "Unix-based OSs"
throughout the PEP? Because all of these (including Mac OS X) are
Unix-based.
No, Just say POSIX.
Post by Ben Hoyt
Post by Victor Stinner
It looks like the WIN32_FIND_DATA has a dwFileAttributes field. So we
should mimic stat_result recent addition: the new
stat_result.file_attributes field. Add DirEntry.file_attributes which
would only be available on Windows.
The Windows structure also contains
FILETIME ftCreationTime;
FILETIME ftLastAccessTime;
FILETIME ftLastWriteTime;
DWORD nFileSizeHigh;
DWORD nFileSizeLow;
It would be nice to expose them as well. I'm no longer surprised that
the exact API is different depending on the OS for functions of the os
module.
I think you've misunderstood how DirEntry.lstat() works on Windows --
it's basically a no-op, as Windows returns the full stat information
with the original FindFirst/FindNext OS calls. This is fairly explicit
in the PEP's entry for DirEntry.lstat(): "like os.lstat(), but requires
no system calls on Windows".
So you can already get the dwFileAttributes for free by saying
entry.lstat().st_file_attributes. You can also get all the other
fields you mentioned for free via .lstat() with no additional OS calls
on Windows, for example: entry.lstat().st_size.
Feel free to suggest changes to the PEP or scandir docs if this isn't
clear. Note that is_dir()/is_file()/is_symlink() are free on all
systems, but .lstat() is only free on Windows.
Post by Victor Stinner
Does your implementation use a free list to avoid the cost of memory
allocation? A short free list of 10 or maybe just 1 may help. The free
list may be stored directly in the generator object.
No, it doesn't. I might add this to the PEP under "possible
improvements". However, I think the speed increase by removing the
extra OS call and/or disk seek is going to be way more than memory
allocation improvements, so I'm not sure this would be worth it.
Post by Victor Stinner
Does it support also bytes filenames on UNIX?
Python now supports undecodable filenames thanks to the PEP 383
(surrogateescape). I prefer to use the same type for filenames on
Linux and Windows, so Unicode is better. But some users might prefer
bytes for other reasons.
I forget exactly now what my scandir module does, but for os.scandir()
I think this should behave exactly like os.listdir() does for
Unicode/bytes filenames.
Post by Victor Stinner
Crazy idea: would it be possible to "convert" a DirEntry object to a
pathlib.Path object without losing the cache? I guess that
pathlib.Path expects a full stat_result object.
The main problem is that pathlib.Path objects explicitly don't cache
stat info (and Guido doesn't want them to, for good reason I think).
There's a thread on python-dev about this earlier. I'll add it to a
"Rejected ideas" section.
Post by Victor Stinner
I don't understand how you can build a full lstat() result without
really calling stat. I see that WIN32_FIND_DATA contains the size, but
here you call lstat().
See above.
Post by Victor Stinner
Do you plan to continue to maintain your module for Python < 3.5, but
upgrade your module for the final PEP?
Yes, I intend to maintain the standalone scandir module for 2.6 <=
Python < 3.5, at least for a good while. For integration into the
Python 3.5 stdlib, the implementation will be integrated into
posixmodule.c, of course.
Post by Victor Stinner
Post by Ben Hoyt
Should there be a way to access the full path?
----------------------------------------------
Should ``DirEntry``'s have a way to get the full path without using
``os.path.join(path, entry.name)``? This is a pretty common pattern,
and it may be useful to add pathlib-like ``str(entry)`` functionality.
This functionality has also been requested in `issue 13`_ on GitHub.
.. _`issue 13`: https://github.com/benhoyt/scandir/issues/13
I think that it would be very convenient to store the directory name
in the DirEntry. It should be light, it's just a reference.
And provide a fullname() method which would just return
os.path.join(path, entry.name) without trying to resolve path to get
an absolute path.
Yeah, fair suggestion. I'm still slightly on the fence about this, but
I think an explicit fullname() is a good suggestion. Ideally I think
it'd be better to mimic pathlib.Path.__str__() which is kind of the
equivalent of fullname(). But how does pathlib deal with unicode/bytes
issues if it's the str function which has to return a str object? Or
at least, it'd be very weird if __str__() returned bytes. But I think
it'd need to if you passed bytes into scandir(). Do others have
thoughts?
Post by Victor Stinner
Would it be hard to implement the wildcard feature on UNIX to compare
performances of scandir('*.jpg') with and without the wildcard built
in os.scandir?
It's a good idea, the problem with this is that the Windows wildcard
implementation has a bunch of crazy edge cases where *.ext will catch
more things than just a simple regex/glob. This was discussed on
python-dev or python-ideas previously, so I'll dig it up and add to a
Rejected Ideas section. In any case, this could be added later if
there's a way to iron out the Windows quirks.
-Ben
_______________________________________________
Python-Dev mailing list
https://mail.python.org/mailman/listinfo/python-dev
https://mail.python.org/mailman/options/python-dev/greg%40krypto.org
Walter Dörwald
2014-06-29 08:23:42 UTC
Permalink
Post by Ben Hoyt
[...]
Post by Victor Stinner
Crazy idea: would it be possible to "convert" a DirEntry object to a
pathlib.Path object without losing the cache? I guess that
pathlib.Path expects a full stat_result object.
The main problem is that pathlib.Path objects explicitly don't cache
stat info (and Guido doesn't want them to, for good reason I think).
There's a thread on python-dev about this earlier. I'll add it to a
"Rejected ideas" section.
However, it would be bad to have two implementations of the concept of
"filename" with different attribute and method names.

The best way to ensure compatible APIs would be if one class was derived
from the other.
Post by Ben Hoyt
[...]
Servus,
Walter
Jonas Wielicki
2014-06-27 10:44:35 UTC
Permalink
Post by Ben Hoyt
Specifics of proposal
=====================
[snip] Each ``DirEntry`` object has the following
[snip]
Notes on caching
----------------
The ``DirEntry`` objects are relatively dumb -- the ``name`` attribute
is obviously always cached, and the ``is_X`` and ``lstat`` methods
cache their values (immediately on Windows via ``FindNextFile``, and
on first use on Linux / OS X via a ``stat`` call) and never refetch
from the system.
I find this behaviour a bit misleading: using methods and having them
return cached results. How much (implementation and/or performance
and/or memory) overhead would be incurred by using property-like access here?
I think this would underline the static nature of the data.

This would break the semantics with respect to pathlib, but they’re only
marginally equal anyways -- and as far as I understand it, pathlib won’t
cache, so I think this has a fair point here.

regards,
jwi
Nick Coghlan
2014-06-27 21:58:50 UTC
Permalink
Post by Jonas Wielicki
Post by Ben Hoyt
Specifics of proposal
=====================
[snip] Each ``DirEntry`` object has the following
[snip]
Notes on caching
----------------
The ``DirEntry`` objects are relatively dumb -- the ``name`` attribute
is obviously always cached, and the ``is_X`` and ``lstat`` methods
cache their values (immediately on Windows via ``FindNextFile``, and
on first use on Linux / OS X via a ``stat`` call) and never refetch
from the system.
I find this behaviour a bit misleading: using methods and having them
return cached results. How much (implementation and/or performance
and/or memory) overhead would be incurred by using property-like access here?
I think this would underline the static nature of the data.
This would break the semantics with respect to pathlib, but they’re only
marginally equal anyways -- and as far as I understand it, pathlib won’t
cache, so I think this has a fair point here.
Indeed - using properties rather than methods may help emphasise the
deliberate *difference* from pathlib in this case (i.e. value when the
result was retrieved from the OS, rather than the value right now). The
main benefit is that switching from using the DirEntry object to a pathlib
Path will require touching all the places where the performance
characteristics switch from "memory access" to "system call". This benefit
is also the main downside, so I'd actually be OK with either decision on
this one.

Other comments:

* +1 on the general idea
* +1 on scandir() over iterdir, since it *isn't* just an iterator version
of listdir
* -1 on including Windows specific globbing support in the API
* -0 on including cross platform globbing support in the initial iteration
of the API (that could be done later as a separate RFE instead)
* +1 on a new section in the PEP covering rejected design options (calling
it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)
* regarding "why not a 2-tuple", we know from experience that operating
systems evolve and we end up wanting to add additional info to this kind of
API. A dedicated DirEntry type lets us adjust the information returned over
time, without breaking backwards compatibility and without resorting to
ugly hacks like those in some of the time and stat APIs (or even our own
codec info APIs)
* it would be nice to see some relative performance numbers for NFS and
CIFS network shares - the additional network round trips can make excessive
stat calls absolutely brutal from a speed perspective when using a network
drive (that's why the stat caching added to the import system in 3.3
dramatically sped up the case of having network drives on sys.path, and why
I thought AJ had a point when he was complaining about the fact we didn't
expose the dirent data from os.listdir)

Regards,
Nick.
Post by Jonas Wielicki
regards,
jwi
Gregory P. Smith
2014-06-28 06:17:55 UTC
Permalink
Post by Nick Coghlan
* -1 on including Windows specific globbing support in the API
* -0 on including cross platform globbing support in the initial iteration
of the API (that could be done later as a separate RFE instead)
Agreed. Globbing or filtering support should not hold this up. If that
part isn't settled, just don't include it and work out what it should be as
a future enhancement.
Post by Nick Coghlan
* +1 on a new section in the PEP covering rejected design options (calling
it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)
+1. IMNSHO, one of the most important parts of PEPs: capturing the entire
decision process to document the "why nots".
Post by Nick Coghlan
* regarding "why not a 2-tuple", we know from experience that operating
systems evolve and we end up wanting to add additional info to this kind of
API. A dedicated DirEntry type lets us adjust the information returned over
time, without breaking backwards compatibility and without resorting to
ugly hacks like those in some of the time and stat APIs (or even our own
codec info APIs)
* it would be nice to see some relative performance numbers for NFS and
CIFS network shares - the additional network round trips can make excessive
stat calls absolutely brutal from a speed perspective when using a network
drive (that's why the stat caching added to the import system in 3.3
dramatically sped up the case of having network drives on sys.path, and why
I thought AJ had a point when he was complaining about the fact we didn't
expose the dirent data from os.listdir)
fwiw, I wouldn't wait for benchmark numbers.

A needless stat call when you've got the information from an earlier API
call is already brutal. It is easy to compute from existing ballparks:
remote file server / cloud access: ~100ms, local spinning disk seek+read:
~10ms. fetch of stat info cached in memory on file server on the local
network: ~500us. You can go down further to local system call overhead
which can vary wildly but should likely be assumed to be at least 10us.

You don't need a benchmark to tell you that adding needless >= 500us-100ms
blocking operations to your program is bad. :)
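
To put rough numbers on those ballparks, here is a small back-of-the-envelope sketch. The latencies are the figures quoted above; the entry count of 10,000 and the helper name `extra_stat_cost` are purely illustrative assumptions, not measurements.

```python
# Back-of-the-envelope cost of one needless stat() per directory entry.
# Latencies are the ballpark figures from the message above; the entry
# count used below is an arbitrary illustrative assumption.
BALLPARK_LATENCY_SECONDS = {
    "remote file server / cloud": 100e-3,
    "local spinning disk seek+read": 10e-3,
    "stat cached on LAN file server": 500e-6,
    "local system call overhead": 10e-6,
}


def extra_stat_cost(entries, latency_seconds):
    """Total seconds spent on one avoidable stat() call per entry."""
    return entries * latency_seconds


for label, latency in BALLPARK_LATENCY_SECONDS.items():
    print(f"{label}: {extra_stat_cost(10_000, latency):.1f} s")
```

Under these assumptions the avoidable cost for 10,000 entries ranges from about a tenth of a second (pure system call overhead) to roughly 1000 seconds (remote access).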

-gps
Nick Coghlan
2014-06-28 09:19:23 UTC
Permalink
Agreed, but walking even a moderately large tree over the network can
really hammer home the point that this offers a significant
performance enhancement as the latency of access increases. I've found
that kind of comparison can be eye-opening for folks that are used to
only operating on local disks (even spinning disks, let alone SSDs)
and/or relatively small trees (distro build trees aren't *that* big,
but they're big enough for this kind of difference in access overhead
to start getting annoying).
Oops, forgot to add - I agree this isn't a blocking issue for the PEP,
it's definitely only in "nice to have" territory.

Cheers,
Nick.
--
Nick Coghlan | ***@gmail.com | Brisbane, Australia
Nick Coghlan
2014-06-28 09:17:12 UTC
Permalink
Post by Gregory P. Smith
Post by Nick Coghlan
* it would be nice to see some relative performance numbers for NFS and
CIFS network shares - the additional network round trips can make excessive
stat calls absolutely brutal from a speed perspective when using a network
drive (that's why the stat caching added to the import system in 3.3
dramatically sped up the case of having network drives on sys.path, and why
I thought AJ had a point when he was complaining about the fact we didn't
expose the dirent data from os.listdir)
fwiw, I wouldn't wait for benchmark numbers.
A needless stat call when you've got the information from an earlier API
call is already brutal. It is easy to compute from existing ballparks: remote
file server / cloud access: ~100ms, local spinning disk seek+read: ~10ms.
fetch of stat info cached in memory on file server on the local network:
~500us. You can go down further to local system call overhead which can
vary wildly but should likely be assumed to be at least 10us.
You don't need a benchmark to tell you that adding needless >= 500us-100ms
blocking operations to your program is bad. :)
Agreed, but walking even a moderately large tree over the network can
really hammer home the point that this offers a significant
performance enhancement as the latency of access increases. I've found
that kind of comparison can be eye-opening for folks that are used to
only operating on local disks (even spinning disks, let alone SSDs)
and/or relatively small trees (distro build trees aren't *that* big,
but they're big enough for this kind of difference in access overhead
to start getting annoying).

Cheers,
Nick.
--
Nick Coghlan | ***@gmail.com | Brisbane, Australia
Ben Hoyt
2014-06-28 19:55:00 UTC
Permalink
Post by Nick Coghlan
Post by Jonas Wielicki
I find this behaviour a bit misleading: using methods and having them
return cached results. How much (implementation and/or performance
and/or memory) overhead would be incurred by using property-like access here?
I think this would underline the static nature of the data.
This would break the semantics with respect to pathlib, but they're only
marginally equal anyways -- and as far as I understand it, pathlib won't
cache, so I think this has a fair point here.
Indeed - using properties rather than methods may help emphasise the
deliberate *difference* from pathlib in this case (i.e. value when the
result was retrieved from the OS, rather than the value right now). The main
benefit is that switching from using the DirEntry object to a pathlib Path
will require touching all the places where the performance characteristics
switch from "memory access" to "system call". This benefit is also the main
downside, so I'd actually be OK with either decision on this one.
The problem with this is that properties "look free", they look just
like attribute access, so you wouldn't normally handle exceptions when
accessing them. But .lstat() and .is_dir() etc may do an OS call, so
if you're needing to be careful with error handling, you may want to
handle errors on them. Hence I think it's best practice to make them
functions().
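
As a rough illustration of why a method call makes the error handling visible, here is a minimal sketch. It assumes the os.scandir() that eventually shipped in Python 3.5+, which may differ in detail from the draft discussed here; the helper name `list_subdirs` is hypothetical.

```python
import os


def list_subdirs(path="."):
    """Return names of subdirectories of `path`, skipping entries whose
    type cannot be determined."""
    subdirs = []
    for entry in os.scandir(path):
        try:
            # A call, not an attribute: it may need a system call on some
            # platforms or file systems, so it can fail like os.stat().
            if entry.is_dir():
                subdirs.append(entry.name)
        except OSError:
            # Handle the failure right where the possible OS call happens.
            continue
    return subdirs
```

With a property, the same `try` block would wrap what looks like a plain attribute read, which is easy to overlook.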

Some of us discussed this on python-dev or python-ideas a while back,
and I think there was general agreement with what I've stated above
and therefore they should be methods. But I'll dig up the links and
add to a Rejected ideas section.
Post by Nick Coghlan
* +1 on a new section in the PEP covering rejected design options (calling
it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)
Great idea. I'll add a bunch of stuff, including the above, to a new
section, Rejected Design Options.
Post by Nick Coghlan
* regarding "why not a 2-tuple", we know from experience that operating
systems evolve and we end up wanting to add additional info to this kind of
API. A dedicated DirEntry type lets us adjust the information returned over
time, without breaking backwards compatibility and without resorting to ugly
hacks like those in some of the time and stat APIs (or even our own codec
info APIs)
Fully agreed.
Post by Nick Coghlan
* it would be nice to see some relative performance numbers for NFS and CIFS
network shares - the additional network round trips can make excessive stat
calls absolutely brutal from a speed perspective when using a network drive
(that's why the stat caching added to the import system in 3.3 dramatically
sped up the case of having network drives on sys.path, and why I thought AJ
had a point when he was complaining about the fact we didn't expose the
dirent data from os.listdir)
Don't know if you saw, but there are actually some benchmarks,
including one over NFS, on the scandir GitHub page:

https://github.com/benhoyt/scandir#benchmarks

os.walk() was 23 times faster with scandir() than the current
listdir() + stat() implementation on the Windows NFS file system I
tried. Pretty good speedup!

-Ben
Nick Coghlan
2014-06-29 05:03:27 UTC
Permalink
Post by Ben Hoyt
Post by Nick Coghlan
Post by Jonas Wielicki
I find this behaviour a bit misleading: using methods and having them
return cached results. How much (implementation and/or performance
and/or memory) overhead would be incurred by using property-like access here?
I think this would underline the static nature of the data.
This would break the semantics with respect to pathlib, but they're only
marginally equal anyways -- and as far as I understand it, pathlib won't
cache, so I think this has a fair point here.
Indeed - using properties rather than methods may help emphasise the
deliberate *difference* from pathlib in this case (i.e. value when the
result was retrieved from the OS, rather than the value right now). The main
benefit is that switching from using the DirEntry object to a pathlib Path
will require touching all the places where the performance characteristics
switch from "memory access" to "system call". This benefit is also the main
downside, so I'd actually be OK with either decision on this one.
The problem with this is that properties "look free", they look just
like attribute access, so you wouldn't normally handle exceptions when
accessing them. But .lstat() and .is_dir() etc may do an OS call, so
if you're needing to be careful with error handling, you may want to
handle errors on them. Hence I think it's best practice to make them
functions().
Some of us discussed this on python-dev or python-ideas a while back,
and I think there was general agreement with what I've stated above
and therefore they should be methods. But I'll dig up the links and
add to a Rejected ideas section.
Yes, only the stuff that *never* needs a system call (regardless of
OS) would be a candidate for handling as a property rather than a
method call. Consistency of access would likely trump that idea
anyway, but it would still be worth ensuring that the PEP is clear on
which values are guaranteed to reflect the state at the time of the
directory scanning and which may imply an additional stat call.
Post by Ben Hoyt
Post by Nick Coghlan
* it would be nice to see some relative performance numbers for NFS and CIFS
network shares - the additional network round trips can make excessive stat
calls absolutely brutal from a speed perspective when using a network drive
(that's why the stat caching added to the import system in 3.3 dramatically
sped up the case of having network drives on sys.path, and why I thought AJ
had a point when he was complaining about the fact we didn't expose the
dirent data from os.listdir)
Don't know if you saw, but there are actually some benchmarks,
including one over NFS, on the scandir GitHub page:
https://github.com/benhoyt/scandir#benchmarks
No, I hadn't seen those - may be worth referencing explicitly from the
PEP (and if there's already a reference... oops!)
Post by Ben Hoyt
os.walk() was 23 times faster with scandir() than the current
listdir() + stat() implementation on the Windows NFS file system I
tried. Pretty good speedup!
Ah, nice!

Cheers,
Nick.
--
Nick Coghlan | ***@gmail.com | Brisbane, Australia
Steven D'Aprano
2014-06-29 10:52:40 UTC
Permalink
[...]
Post by Ben Hoyt
The problem with this is that properties "look free", they look just
like attribute access, so you wouldn't normally handle exceptions when
accessing them. But .lstat() and .is_dir() etc may do an OS call, so
if you're needing to be careful with error handling, you may want to
handle errors on them. Hence I think it's best practice to make them
functions().
I think this one could go either way. Methods look like they actually
re-test the value each time you call it. I can easily see people not
realising that the value is cached and writing code like this toy
example:


# Detect a file change.
t = the_file.lstat().st_mtime
while the_file.lstat().st_mtime == t:
    sleep(0.1)
print("Changed!")


I know that's not the best way to detect file changes, but I'm sure
people will do something like that and not realise that the call to
lstat is cached.
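
The caching effect can be seen in a small self-contained sketch. It assumes the os.scandir() that eventually shipped in Python 3.5+, where the method is named stat() rather than the lstat() of this draft; the helper name `cached_vs_fresh_size` is hypothetical.

```python
import os
import tempfile


def cached_vs_fresh_size():
    """Return (before, cached, fresh) sizes for a file that grows after
    its directory entry has been scanned."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.txt")
        with open(path, "w") as f:
            f.write("abc")                 # 3 bytes

        # list() exhausts the iterator, so the directory handle is closed.
        entry = list(os.scandir(tmp))[0]
        before = entry.stat().st_size      # result is cached on the entry

        with open(path, "a") as f:
            f.write("defgh")               # file grows to 8 bytes

        cached = entry.stat().st_size      # cached value, not re-fetched
        fresh = os.stat(path).st_size      # a real system call
        return before, cached, fresh
```

Here the second `entry.stat()` still reports the size seen earlier, while `os.stat()` reports the current size, which is exactly the situation the polling loop above would never notice.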

Personally, I would prefer a property. If I forget to wrap a call in a
try...except, it will fail hard and I will get an exception. But with a
method call, the failure is silent and I keep getting the cached result.

Speaking of caching, is there a way to freshen the cached values?
--
Steven
Nick Coghlan
2014-06-29 11:08:36 UTC
Permalink
Post by Steven D'Aprano
Speaking of caching, is there a way to freshen the cached values?
Switch to a full Path object instead of relying on the cached DirEntry data.

This is what makes me wary of including lstat, even though Windows
offers it without the extra stat call. Caching behaviour is *really*
hard to make intuitive, especially when it *sometimes* returns data
that looks fresh (as it does on first call on POSIX systems).

Regards,
Nick.
--
Nick Coghlan | ***@gmail.com | Brisbane, Australia
Paul Moore
2014-06-29 11:45:49 UTC
Permalink
Post by Nick Coghlan
This is what makes me wary of including lstat, even though Windows
offers it without the extra stat call. Caching behaviour is *really*
hard to make intuitive, especially when it *sometimes* returns data
that looks fresh (as it does on first call on POSIX systems).
If it matters that much we *could* simply call it cached_lstat(). It's
ugly, but I really don't like the idea of throwing the information
away - after all, the fact that we currently throw data away is why
there's even a need for scandir. Let's not make the same mistake
again...

Paul
Nick Coghlan
2014-06-29 12:28:14 UTC
Permalink
Post by Paul Moore
Post by Nick Coghlan
This is what makes me wary of including lstat, even though Windows
offers it without the extra stat call. Caching behaviour is *really*
hard to make intuitive, especially when it *sometimes* returns data
that looks fresh (as it does on first call on POSIX systems).
If it matters that much we *could* simply call it cached_lstat(). It's
ugly, but I really don't like the idea of throwing the information
away - after all, the fact that we currently throw data away is why
there's even a need for scandir. Let's not make the same mistake
again...
Future-proofing is the reason DirEntry is a full fledged class in the
first place, though.

Effectively communicating the behavioural difference between DirEntry
and pathlib.Path is the main thing that makes me nervous about
adhering too closely to the Path API.

To restate the problem and the alternative proposal, these are the
DirEntry methods under discussion:

is_dir(): like os.path.isdir(), but requires no system calls on at
least POSIX and Windows
is_file(): like os.path.isfile(), but requires no system calls on
at least POSIX and Windows
is_symlink(): like os.path.islink(), but requires no system calls
on at least POSIX and Windows
lstat(): like os.lstat(), but requires no system calls on Windows

For the almost-certain-to-be-cached items, the suggestion is to make
them properties (or just ordinary attributes):

is_dir
is_file
is_symlink

What to do with lstat() is currently less clear, since POSIX directory
scanning doesn't provide that level of detail by default.

The PEP also doesn't currently state whether the is_dir(), is_file()
and is_symlink() results would be updated if a call to lstat()
produced different answers than the original directory scanning
process, which further suggests to me that allowing the stat call to
be delayed on POSIX systems is a potentially problematic and
inherently confusing design. We would have two options:

- update them, meaning calling lstat() may change those results from
being a snapshot of the setting at the time the directory was scanned
- leave them alone, meaning the DirEntry object and the
DirEntry.lstat() result may give different answers

Those both sound ugly to me.

So, here's my alternative proposal: add an "ensure_lstat" flag to
scandir() itself, and don't have *any* methods on DirEntry, only
attributes.

That would make the DirEntry attributes:

is_dir: boolean, always populated
is_file: boolean, always populated
is_symlink: boolean, always populated
lstat_result: stat result, may be None on POSIX systems if
ensure_lstat is False

(I'm not particularly sold on "lstat_result" as the name, but "lstat"
reads as a verb to me, so doesn't sound right as an attribute name)

What this would allow:

- by default, scanning is efficient everywhere, but lstat_result may
be None on POSIX systems
- if you always need the lstat result, setting "ensure_lstat" will
trigger the extra system call implicitly
- if you only sometimes need the stat result, you can call os.lstat()
explicitly when the DirEntry lstat_result attribute is None

Most importantly, *regardless of platform*, the cached stat result (if
not None) would reflect the state of the entry at the time the
directory was scanned, rather than at some arbitrary later point in
time when lstat() was first called on the DirEntry object.

There'd still be a slight window of discrepancy (since the filesystem
state may change between reading the directory entry and making the
lstat() call), but this could be effectively eliminated from the
perspective of the Python code by making the result of the lstat()
call authoritative for the whole DirEntry object.
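
A minimal sketch of how this alternative could behave, built on top of the os.scandir() that eventually shipped in Python 3.5+. The names `SnapshotEntry` and `scandir_snapshot` are hypothetical, and the `ensure_lstat` flag exists only in this sketch, not in any real API.

```python
import os
import stat


class SnapshotEntry:
    """Hypothetical attribute-only entry (not the real os.DirEntry)."""

    # __slots__ keeps per-entry memory small when many are created.
    __slots__ = ("name", "is_dir", "is_file", "is_symlink", "lstat_result")

    def __init__(self, name, is_dir, is_file, is_symlink, lstat_result):
        self.name = name
        self.is_dir = is_dir
        self.is_file = is_file
        self.is_symlink = is_symlink
        self.lstat_result = lstat_result


def scandir_snapshot(path=".", ensure_lstat=False):
    """Yield SnapshotEntry objects; `ensure_lstat` is a hypothetical flag."""
    for entry in os.scandir(path):
        if ensure_lstat:
            # One explicit lstat per entry, made authoritative for all fields.
            st = os.lstat(os.path.join(path, entry.name))
            mode = st.st_mode
            yield SnapshotEntry(entry.name, stat.S_ISDIR(mode),
                                stat.S_ISREG(mode), stat.S_ISLNK(mode), st)
        else:
            # Simplification: lstat_result is None on every platform here,
            # even though the proposal would populate it for free on Windows.
            yield SnapshotEntry(
                entry.name,
                entry.is_dir(follow_symlinks=False),
                entry.is_file(follow_symlinks=False),
                entry.is_symlink(),
                None,
            )
```

Because every field is filled in while iterating, nothing on the entry triggers a system call later, so plain attributes are an honest interface for it.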

Regards,
Nick.

P.S. We'd be generating quite a few of these, so we can use __slots__
to keep the memory overhead to a minimum (that's just a general
comment - it's really irrelevant to the methods-or-attributes
question).
--
Nick Coghlan | ***@gmail.com | Brisbane, Australia
Ethan Furman
2014-06-29 17:02:16 UTC
Permalink
Post by Nick Coghlan
So, here's my alternative proposal: add an "ensure_lstat" flag to
scandir() itself, and don't have *any* methods on DirEntry, only
attributes.
is_dir: boolean, always populated
is_file: boolean, always populated
is_symlink: boolean, always populated
lstat_result: stat result, may be None on POSIX systems if
ensure_lstat is False
(I'm not particularly sold on "lstat_result" as the name, but "lstat"
reads as a verb to me, so doesn't sound right as an attribute name)
+1

--
~Ethan~
Glenn Linderman
2014-06-30 02:33:33 UTC
Permalink
Post by Nick Coghlan
There'd still be a slight window of discrepancy (since the filesystem
state may change between reading the directory entry and making the
lstat() call), but this could be effectively eliminated from the
perspective of the Python code by making the result of the lstat()
call authoritative for the whole DirEntry object.
+1 to this in particular, but this whole refresh of the semantics sounds
better overall.

Finally, for the case where someone does want to keep the DirEntry
around, a .refresh() API could rerun lstat() and update all the data.

And with that (initial data potentially always populated, or None, and
an explicit refresh() API), the data could all be returned as
properties, implying that they aren't fetching new data themselves,
because they wouldn't be.
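
One way such an explicit refresh could look is sketched below. The class name `RefreshableEntry` and its `size` property are hypothetical, and for simplicity the initial data comes from an lstat() in the constructor rather than from the directory scan.

```python
import os
import stat


class RefreshableEntry:
    """Hypothetical entry whose data changes only on an explicit refresh()."""

    __slots__ = ("path", "lstat_result")

    def __init__(self, path):
        self.path = path
        self.lstat_result = None
        self.refresh()

    def refresh(self):
        # The only place where a system call is made.
        self.lstat_result = os.lstat(self.path)

    @property
    def is_dir(self):
        # Pure computation on stored data; never touches the file system.
        return stat.S_ISDIR(self.lstat_result.st_mode)

    @property
    def size(self):
        return self.lstat_result.st_size
```

The properties only read stored data, so they cannot raise OS errors; anything that can fail is concentrated in `refresh()`.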

Glenn
Ben Hoyt
2014-06-30 17:05:54 UTC
Permalink
Post by Nick Coghlan
So, here's my alternative proposal: add an "ensure_lstat" flag to
scandir() itself, and don't have *any* methods on DirEntry, only
attributes.
is_dir: boolean, always populated
is_file: boolean, always populated
is_symlink: boolean, always populated
lstat_result: stat result, may be None on POSIX systems if
ensure_lstat is False
(I'm not particularly sold on "lstat_result" as the name, but "lstat"
reads as a verb to me, so doesn't sound right as an attribute name)
- by default, scanning is efficient everywhere, but lstat_result may
be None on POSIX systems
- if you always need the lstat result, setting "ensure_lstat" will
trigger the extra system call implicitly
- if you only sometimes need the stat result, you can call os.lstat()
explicitly when the DirEntry lstat attribute is None
Most importantly, *regardless of platform*, the cached stat result (if
not None) would reflect the state of the entry at the time the
directory was scanned, rather than at some arbitrary later point in
time when lstat() was first called on the DirEntry object.
There'd still be a slight window of discrepancy (since the filesystem
state may change between reading the directory entry and making the
lstat() call), but this could be effectively eliminated from the
perspective of the Python code by making the result of the lstat()
call authoritative for the whole DirEntry object.
Yeah, I quite like this. It does make the caching more explicit and
consistent. It's slightly annoying that it's less like pathlib.Path
now, but DirEntry was never pathlib.Path anyway, so maybe it doesn't
matter. The differences in naming may highlight the difference in
caching, so maybe it's a good thing.

Two further questions from me:

1) How does error handling work? Now os.stat() will/may be called
during iteration, so in __next__. But it's hard to catch errors because
you don't call __next__ explicitly. Is this a problem? How do other
iterators that make system calls or raise errors handle this?

2) There's still the open question in the PEP of whether to include a
way to access the full path. This is cheap to build, it has to be
built anyway on POSIX systems, and it's quite useful for further
operations on the file. I think the best way to handle this is a
.fullname or .full_name attribute as suggested elsewhere. Thoughts?

-Ben
Tim Delaney
2014-06-30 22:07:23 UTC
Permalink
Post by Ben Hoyt
Post by Nick Coghlan
So, here's my alternative proposal: add an "ensure_lstat" flag to
scandir() itself, and don't have *any* methods on DirEntry, only
attributes.
...
Post by Nick Coghlan
Most importantly, *regardless of platform*, the cached stat result (if
not None) would reflect the state of the entry at the time the
directory was scanned, rather than at some arbitrary later point in
time when lstat() was first called on the DirEntry object.
I'm torn between whether I'd prefer the stat fields to be populated on
Windows if ensure_lstat=False or not. There are good arguments each way,
but overall I'm inclining towards having it consistent with POSIX - don't
populate them unless ensure_lstat=True.

+0 for stat fields to be None on all platforms unless ensure_lstat=True.
Post by Ben Hoyt
Yeah, I quite like this. It does make the caching more explicit and
consistent. It's slightly annoying that it's less like pathlib.Path
now, but DirEntry was never pathlib.Path anyway, so maybe it doesn't
matter. The differences in naming may highlight the difference in
caching, so maybe it's a good thing.
See my comments below on .fullname.
Post by Ben Hoyt
1) How does error handling work? Now os.stat() will/may be called
during iteration, so in __next__. But it's hard to catch errors because
you don't call __next__ explicitly. Is this a problem? How do other
iterators that make system calls or raise errors handle this?
I think it just needs to be documented that iterating may throw the same
exceptions as os.lstat(). It's a little trickier if you don't want the
scope of your exception to be too broad, but you can always wrap the
iteration in a generator to catch and handle the exceptions you care about,
and allow the rest to propagate.

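    import os

    def scandir_accessible(path='.'):
        gen = os.scandir(path)

        while True:
            try:
                yield next(gen)
            except StopIteration:
                # End of directory: finish this generator cleanly.
                return
            except PermissionError:
                # Skip entries that cannot be accessed.
                pass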

Post by Ben Hoyt
2) There's still the open question in the PEP of whether to include a
way to access the full path. This is cheap to build, it has to be
built anyway on POSIX systems, and it's quite useful for further
operations on the file. I think the best way to handle this is a
.fullname or .full_name attribute as suggested elsewhere. Thoughts?
+1 for .fullname. The earlier suggestion to have __str__ return the name is
killed I think by the fact that .fullname could be bytes.

It would be nice if pathlib.Path objects were enhanced to take a DirEntry
and use the .fullname automatically, but you could always call
Path(direntry.fullname).

Tim Delaney
Ethan Furman
2014-06-30 22:38:45 UTC
Permalink
Post by Tim Delaney
Post by Nick Coghlan
So, here's my alternative proposal: add an "ensure_lstat" flag to
scandir() itself, and don't have *any* methods on DirEntry, only
attributes.
...
Most importantly, *regardless of platform*, the cached stat result (if
not None) would reflect the state of the entry at the time the
directory was scanned, rather than at some arbitrary later point in
time when lstat() was first called on the DirEntry object.
I'm torn between whether I'd prefer the stat fields to be populated
on Windows if ensure_lstat=False or not. There are good arguments each
way, but overall I'm inclining towards having it consistent with POSIX
- don't populate them unless ensure_lstat=True.
+0 for stat fields to be None on all platforms unless ensure_lstat=True.
If a Windows user just needs the free info, why should s/he have to pay the price of a full stat call? I see no reason
to hold the Windows side back and not take advantage of what it has available. There are plenty of posix calls that
Windows is not able to use, after all.

--
~Ethan~
Tim Delaney
2014-06-30 23:15:59 UTC
Permalink
Post by Ethan Furman
Post by Tim Delaney
I'm torn between whether I'd prefer the stat fields to be populated
on Windows if ensure_lstat=False or not. There are good arguments each
way, but overall I'm inclining towards having it consistent with POSIX
- don't populate them unless ensure_lstat=True.
+0 for stat fields to be None on all platforms unless ensure_lstat=True.
If a Windows user just needs the free info, why should s/he have to pay
the price of a full stat call? I see no reason to hold the Windows side
back and not take advantage of what it has available. There are plenty of
posix calls that Windows is not able to use, after all.
On Windows ensure_lstat would be either a NOP (if the fields are
always populated), or it simply determines if the fields get populated. No
extra stat call.

On POSIX it's the difference between an extra stat call or not.

Tim Delaney
Ethan Furman
2014-06-30 23:45:18 UTC
Permalink
Post by Tim Delaney
Post by Ethan Furman
Post by Tim Delaney
I'm torn between whether I'd prefer the stat fields to be populated
on Windows if ensure_lstat=False or not. There are good arguments each
way, but overall I'm inclining towards having it consistent with POSIX
- don't populate them unless ensure_lstat=True.
+0 for stat fields to be None on all platforms unless ensure_lstat=True.
If a Windows user just needs the free info, why should s/he have to pay
the price of a full stat call? I see no reason to hold the Windows side
back and not take advantage of what it has available. There are plenty
of posix calls that Windows is not able to use, after all.
On Windows ensure_lstat would be either a NOP (if the fields are
always populated), or it simply determines if the fields get populated.
No extra stat call.
I suppose the exact behavior is still under discussion, as there are only two or three fields one gets "for free" on
Windows (I think...), whereas an os.stat call would get everything available for the platform.
Post by Tim Delaney
On POSIX it's the difference between an extra stat call or not.
Agreed on this part.

Still, no reason to slow down the Windows side by throwing away info unnecessarily -- that's why this PEP exists, after all.

--
~Ethan~
Ben Hoyt
2014-07-01 01:28:00 UTC
Permalink
Post by Ethan Furman
I suppose the exact behavior is still under discussion, as there are only
two or three fields one gets "for free" on Windows (I think...), whereas an
os.stat call would get everything available for the platform.
No, Windows is nice enough to give you all the same stat_result fields
during scandir (via FindFirstFile/FindNextFile) as a regular
os.stat().

-Ben
Ethan Furman
2014-07-01 01:44:57 UTC
Permalink
Post by Ben Hoyt
Post by Ethan Furman
I suppose the exact behavior is still under discussion, as there are only
two or three fields one gets "for free" on Windows (I think...), whereas an
os.stat call would get everything available for the platform.
No, Windows is nice enough to give you all the same stat_result fields
during scandir (via FindFirstFile/FindNextFile) as a regular
os.stat().
Very nice. Even less reason then to throw it away. :)

--
~Ethan~
Terry Reedy
2014-07-01 04:35:24 UTC
Permalink
Post by Ethan Furman
Post by Ben Hoyt
Post by Ethan Furman
I suppose the exact behavior is still under discussion, as there are only
two or three fields one gets "for free" on Windows (I think...), whereas an
os.stat call would get everything available for the platform.
No, Windows is nice enough to give you all the same stat_result fields
during scandir (via FindFirstFile/FindNextFile) as a regular
os.stat().
Very nice. Even less reason then to throw it away. :)
I agree.
--
Terry Jan Reedy
Devin Jeanpierre
2014-06-30 23:25:49 UTC
Permalink
On Mon, Jun 30, 2014 at 3:07 PM, Tim Delaney
Post by Tim Delaney
Post by Nick Coghlan
So, here's my alternative proposal: add an "ensure_lstat" flag to
scandir() itself, and don't have *any* methods on DirEntry, only
attributes.
...
Post by Nick Coghlan
Most importantly, *regardless of platform*, the cached stat result (if
not None) would reflect the state of the entry at the time the
directory was scanned, rather than at some arbitrary later point in
time when lstat() was first called on the DirEntry object.
I'm torn between whether I'd prefer the stat fields to be populated on
Windows if ensure_lstat=False or not. There are good arguments each way, but
overall I'm inclining towards having it consistent with POSIX - don't
populate them unless ensure_lstat=True.
+0 for stat fields to be None on all platforms unless ensure_lstat=True.
This won't work well if lstat info is only needed for some entries. Is
that a common use-case? It was mentioned earlier in the thread.

-- Devin
Glenn Linderman
2014-07-01 02:04:43 UTC
Permalink
Post by Devin Jeanpierre
On Mon, Jun 30, 2014 at 3:07 PM, Tim Delaney
Post by Tim Delaney
Post by Nick Coghlan
So, here's my alternative proposal: add an "ensure_lstat" flag to
scandir() itself, and don't have *any* methods on DirEntry, only
attributes.
...
Post by Nick Coghlan
Most importantly, *regardless of platform*, the cached stat result (if
not None) would reflect the state of the entry at the time the
directory was scanned, rather than at some arbitrary later point in
time when lstat() was first called on the DirEntry object.
I'm torn between whether I'd prefer the stat fields to be populated on
Windows if ensure_lstat=False or not. There are good arguments each way, but
overall I'm inclining towards having it consistent with POSIX - don't
populate them unless ensure_lstat=True.
+0 for stat fields to be None on all platforms unless ensure_lstat=True.
This won't work well if lstat info is only needed for some entries. Is
that a common use-case? It was mentioned earlier in the thread.
If it is, use ensure_lstat=False, and use the proposed (by me)
.refresh() API to update the data for those that need it.
Devin Jeanpierre
2014-07-01 02:17:00 UTC
Permalink
The proposal I was replying to was that:

- There is no .refresh()
- ensure_lstat=False means no OS has populated attributes
- ensure_lstat=True means every OS has populated attributes

Even if we add a .refresh(), the latter two items mean that you can't
avoid doing extra work (either too much on windows, or too much on
linux), if you want only a subset of the files' lstat info.

-- Devin

P.S. your mail client's quoting breaks my mail client (gmail)'s quoting.
Post by Devin Jeanpierre
On Mon, Jun 30, 2014 at 3:07 PM, Tim Delaney
So, here's my alternative proposal: add an "ensure_lstat" flag to
scandir() itself, and don't have *any* methods on DirEntry, only
attributes.
...
Most importantly, *regardless of platform*, the cached stat result (if
not None) would reflect the state of the entry at the time the
directory was scanned, rather than at some arbitrary later point in
time when lstat() was first called on the DirEntry object.
I'm torn between whether I'd prefer the stat fields to be populated on
Windows if ensure_lstat=False or not. There are good arguments each way, but
overall I'm inclining towards having it consistent with POSIX - don't
populate them unless ensure_lstat=True.
+0 for stat fields to be None on all platforms unless ensure_lstat=True.
This won't work well if lstat info is only needed for some entries. Is
that a common use-case? It was mentioned earlier in the thread.
If it is, use ensure_lstat=False, and use the proposed (by me) .refresh()
API to update the data for those that need it.
_______________________________________________
Python-Dev mailing list
https://mail.python.org/mailman/listinfo/python-dev
https://mail.python.org/mailman/options/python-dev/jeanpierreda%40gmail.com
Nick Coghlan
2014-07-01 02:17:44 UTC
Permalink
Post by Devin Jeanpierre
If it is, use ensure_lstat=False, and use the proposed (by me) .refresh()
API to update the data for those that need it.

I'm -1 on a refresh API for DirEntry - just use pathlib in that case.

Cheers,
Nick.
Eric V. Smith
2014-07-01 02:59:33 UTC
Permalink
Post by Glenn Linderman
Post by Glenn Linderman
If it is, use ensure_lstat=False, and use the proposed (by me)
.refresh() API to update the data for those that need it.
I'm -1 on a refresh API for DirEntry - just use pathlib in that case.
I'm not sure refresh() is the best name, but I think a
"get_stat_info_from_direntry_or_call_stat()" (hah!) makes sense. If you
really need the stat info, then you can write simple code like:

    for entry in os.scandir(path):
        mtime = entry.get_stat_info_from_direntry_or_call_stat().st_mtime

And it won't call stat() any more times than needed. Once per file on
Posix, zero times per file on Windows.

Without an API like this, you'll need a check in the application code on
whether or not to call stat().
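
A minimal sketch of that idea, for illustration only: `StatOnDemand`,
`stat_info()` and `lstat_calls` are hypothetical names, not part of the PEP
or of any proposal in this thread. The wrapper returns a stat result that was
supplied up front (as the OS would during a directory scan on Windows) and
otherwise calls os.lstat() once and remembers the answer.

```python
import os


class StatOnDemand:
    """Hypothetical wrapper: reuse a known stat result, else call lstat once."""

    def __init__(self, path, known_stat=None):
        self.path = path
        # Stat data that came "for free" during the scan, if any.
        self._stat = known_stat
        # Counts real os.lstat() calls, purely for illustration.
        self.lstat_calls = 0

    def stat_info(self):
        # Only hit the file system when nothing is cached yet.
        if self._stat is None:
            self._stat = os.lstat(self.path)
            self.lstat_calls += 1
        return self._stat
```

With this shape the caller never has to check the platform: the number of
lstat() calls is at most one per entry, and zero when the data was already
supplied.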

Eric.
Victor Stinner
2014-07-01 06:55:12 UTC
Permalink
Post by Devin Jeanpierre
Post by Tim Delaney
+0 for stat fields to be None on all platforms unless ensure_lstat=True.
This won't work well if lstat info is only needed for some entries. Is
that a common use-case? It was mentioned earlier in the thread.
If it is, use ensure_lstat=False, and use the proposed (by me) .refresh()
API to update the data for those that need it.
We should make DirEntry as simple as possible. In Python, the classic
behaviour is to not define an attribute if it's not available on a
platform. For example, stat().st_file_attributes is only available on
Windows.

I don't like the idea of the ensure_lstat parameter because os.scandir
would have to make two system calls, which makes it harder to guess which
syscall failed (readdir or lstat). If you need lstat on UNIX, write:

    if hasattr(entry, 'lstat_result'):
        size = entry.lstat_result.st_size
    else:
        size = os.lstat(entry.fullname()).st_size

Victor
Jonas Wielicki
2014-06-29 11:12:55 UTC
Permalink
Post by Nick Coghlan
Post by Steven D'Aprano
Speaking of caching, is there a way to freshen the cached values?
Switch to a full Path object instead of relying on the cached DirEntry data.
This is what makes me wary of including lstat, even though Windows
offers it without the extra stat call. Caching behaviour is *really*
hard to make intuitive, especially when it *sometimes* returns data
that looks fresh (as it does on the first call on POSIX systems).
This bugs me too. An idea I had was adding a keyword argument to scandir
which specifies whether stat data should be added to the direntry or not.

If the flag is set to True, this would implicitly call lstat on POSIX
before returning the DirEntry, and use the available data on Windows.

If the flag is set to False, all the fields in the DirEntry will be
None, for consistency, even on Windows.


This is not optimal in cases where the stat information is needed only
for some of the DirEntry objects, but would also reduce the required
logic in the DirEntry object.
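
The flag semantics described above can be sketched roughly as follows. This
is an emulation on top of os.listdir() and os.lstat(), not the proposed
implementation; `scandir_with_flag`, `Entry` and `lstat_result` are
hypothetical names chosen for the example.

```python
import os
from collections import namedtuple

# Hypothetical stand-in for DirEntry: a name plus an optional lstat result.
Entry = namedtuple("Entry", ["name", "lstat_result"])


def scandir_with_flag(path=".", ensure_lstat=False):
    """Yield Entry objects; lstat_result is None unless ensure_lstat is true."""
    for name in os.listdir(path):
        if ensure_lstat:
            # One extra system call per entry, made while scanning.
            result = os.lstat(os.path.join(path, name))
        else:
            # Consistently empty on every platform.
            result = None
        yield Entry(name, result)
```

The caller then sees the same shape on every platform: either every entry
carries stat data captured during the scan, or none does.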

Thoughts?
Post by Nick Coghlan
Regards,
Nick.
Ethan Furman
2014-06-29 17:04:19 UTC
Permalink
Post by Jonas Wielicki
If the flag is set to False, all the fields in the DirEntry will be
None, for consistency, even on Windows.
-1

This consistency is unnecessary.

--
~Ethan~
Jonas Wielicki
2014-06-29 21:04:09 UTC
Permalink
Post by Ethan Furman
Post by Jonas Wielicki
If the flag is set to False, all the fields in the DirEntry will be
None, for consistency, even on Windows.
-1
This consistency is unnecessary.
I’m not sure -- similar to the windows_wildcard option this might be a
temptation to write platform dependent code, although possibly by
accident (i.e. not reading the docs carefully).
Post by Ethan Furman
--
~Ethan~
Akira Li
2014-06-28 13:05:31 UTC
Permalink
Post by Ben Hoyt
Hi Python dev folks,
I've written a PEP proposing a specific os.scandir() API for a
directory iterator that returns the stat-like info from the OS, *the
main advantage of which is to speed up os.walk() and similar
operations between 4-20x, depending on your OS and file system.*
...
http://legacy.python.org/dev/peps/pep-0471/
...
Specifically, this PEP proposes adding a single function to the ``os``
module in the standard library, ``scandir``, that takes a single,
scandir(path='.') -> generator of DirEntry objects
Have you considered adding support for paths relative to directory
descriptors [1] via keyword only dir_fd=None parameter if it may lead to
more efficient implementations on some platforms?

[1]: https://docs.python.org/3.4/library/os.html#dir-fd


--
akira
Chris Angelico
2014-06-28 15:27:44 UTC
Permalink
Post by Akira Li
Have you considered adding support for paths relative to directory
descriptors [1] via keyword only dir_fd=None parameter if it may lead to
more efficient implementations on some platforms?
[1]: https://docs.python.org/3.4/library/os.html#dir-fd
Potentially more efficient and also potentially safer (see 'man
openat')... but an enhancement that can wait, if necessary.

ChrisA
Akira Li
2014-06-29 18:32:53 UTC
Permalink
Post by Chris Angelico
Post by Akira Li
Have you considered adding support for paths relative to directory
descriptors [1] via keyword only dir_fd=None parameter if it may lead to
more efficient implementations on some platforms?
[1]: https://docs.python.org/3.4/library/os.html#dir-fd
Potentially more efficient and also potentially safer (see 'man
openat')... but an enhancement that can wait, if necessary.
Introducing the feature later creates unnecessary incompatibilities.
Either it should be supported from the start, or it should be explicitly
rejected in PEP 471 and something like
`os.scandir(os.open(relative_path, dir_fd=fd))` recommended instead
(assuming `os.scandir in os.supports_fd`, as for `os.listdir()`).

At C level it could be implemented using fdopendir/openat or scandirat.
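
As a rough Python-level sketch of the workaround mentioned above, assuming a
platform and Python version where `os.scandir` is in `os.supports_fd` and
`os.open` is in `os.supports_dir_fd` (the helper name `scandir_names_at` is
made up for this example):

```python
import os


def scandir_names_at(relative_path, dir_fd):
    """Return sorted entry names of relative_path, resolved against dir_fd."""
    if os.open not in os.supports_dir_fd or os.scandir not in os.supports_fd:
        raise NotImplementedError("dir_fd/fd support is unavailable here")
    # Open the target directory relative to the already-open directory.
    fd = os.open(relative_path, os.O_RDONLY, dir_fd=dir_fd)
    try:
        # Scan through the descriptor instead of a path string.
        with os.scandir(fd) as entries:
            return sorted(entry.name for entry in entries)
    finally:
        os.close(fd)
```

The path is resolved relative to the open directory descriptor rather than
the current working directory, which is the safety property 'man openat'
describes.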

Here's the function description using Argument Clinic DSL:

/*[clinic input]

os.scandir

    path : path_t(allow_fd=True, nullable=True) = '.'

        *path* can be specified as either str or bytes. On some
        platforms, *path* may also be specified as an open file
        descriptor; the file descriptor must refer to a directory. If
        this functionality is unavailable, using it raises
        NotImplementedError.

    *

    dir_fd : dir_fd = None

        If not None, it should be a file descriptor open to a
        directory, and *path* should be a relative string; path will
        then be relative to that directory. If *dir_fd* is
        unavailable, using it raises NotImplementedError.

Yield a DirEntry object for each file and directory in *path*.

Just like os.listdir, the '.' and '..' pseudo-directories are skipped,
and the entries are yielded in system-dependent order.

{parameters}
It's an error to use *dir_fd* when specifying *path* as an open file
descriptor.

[clinic start generated code]*/


And corresponding tests (from test_posix:PosixTester), to show the
compatibility with os.listdir argument parsing in detail:

    def test_scandir_default(self):
        # When scandir is called without argument,
        # it's the same as scandir(os.curdir).
        self.assertIn(support.TESTFN, [e.name for e in posix.scandir()])

    def _test_scandir(self, curdir):
        filenames = sorted(e.name for e in posix.scandir(curdir))
        self.assertIn(support.TESTFN, filenames)
        #NOTE: assume listdir, scandir accept the same types on the platform
        self.assertEqual(sorted(posix.listdir(curdir)), filenames)

    def test_scandir(self):
        self._test_scandir(os.curdir)

    def test_scandir_none(self):
        # it's the same as scandir(os.curdir).
        self._test_scandir(None)

    def test_scandir_bytes(self):
        # When scandir is called with a bytes object,
        # the returned entries names are still of type str.
        # Call `os.fsencode(entry.name)` to get bytes
        self.assertIn('a', {'a'})
        self.assertNotIn(b'a', {'a'})
        self._test_scandir(b'.')

    @unittest.skipUnless(posix.scandir in os.supports_fd,
                         "test needs fd support for posix.scandir()")
    def test_scandir_fd_minus_one(self):
        # it's the same as scandir(os.curdir).
        self._test_scandir(-1)

    def test_scandir_float(self):
        # invalid args
        self.assertRaises(TypeError, posix.scandir, -1.0)

    @unittest.skipUnless(posix.scandir in os.supports_fd,
                         "test needs fd support for posix.scandir()")
    def test_scandir_fd(self):
        fd = posix.open(posix.getcwd(), posix.O_RDONLY)
        self.addCleanup(posix.close, fd)
        self._test_scandir(fd)
        self.assertEqual(
            sorted(posix.scandir('.')),
            sorted(posix.scandir(fd)))
        # call 2nd time to test rewind
        self.assertEqual(
            sorted(posix.scandir('.')),
            sorted(posix.scandir(fd)))

    @unittest.skipUnless(posix.scandir in os.supports_dir_fd,
                         "test needs dir_fd support for os.scandir()")
    def test_scandir_dir_fd(self):
        relpath = 'relative_path'
        with support.temp_dir() as parent:
            fullpath = os.path.join(parent, relpath)
            with support.temp_dir(path=fullpath):
                support.create_empty_file(os.path.join(parent, 'a'))
                support.create_empty_file(os.path.join(fullpath, 'b'))
                fd = posix.open(parent, posix.O_RDONLY)
                self.addCleanup(posix.close, fd)
                self.assertEqual(
                    sorted(posix.scandir(relpath, dir_fd=fd)),
                    sorted(posix.scandir(fullpath)))
                # check that fd is still useful
                self.assertEqual(
                    sorted(posix.scandir(relpath, dir_fd=fd)),
                    sorted(posix.scandir(fullpath)))


--
Akira
Janzert
2014-07-01 16:06:58 UTC
Permalink
Post by Ben Hoyt
Rationale
=========
Python's built-in ``os.walk()`` is significantly slower than it needs
to be, because -- in addition to calling ``os.listdir()`` on each
directory -- it executes the system call ``os.stat()`` or
``GetFileAttributes()`` on each file to determine whether the entry is
a directory or not.
But the underlying system calls -- ``FindFirstFile`` /
``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
already tell you whether the files returned are directories or not, so
no further system calls are needed. In short, you can reduce the
number of system calls from approximately 2N to N, where N is the
total number of files and directories in the tree. (And because
directory trees are usually much wider than they are deep, it's often
much better than this.)
One of the major reasons for this seems to be efficiently using
information that is already available from the OS "for free".
Unfortunately it seems that the current API and most of the leading
alternate proposals hide from the user what information is actually
there "free" and what is going to incur an extra cost.

I would prefer an API that simply gives whatever came for free from the
OS and then let the user decide if the extra expense is worth the extra
information. Maybe that stat information was only going to be used for
an informational log that can be skipped if it's going to incur extra
expense?

Janzert
