Discussion:
Remaining decisions on PEP 471 -- os.scandir()
Ben Hoyt
2014-07-14 00:33:16 UTC
Permalink
Hi folks,

Thanks Victor, Nick, Ethan, and others for continued discussion on the
scandir PEP 471 (most recent thread starts at
https://mail.python.org/pipermail/python-dev/2014-July/135377.html).

Just an aside ... I was reminded again recently why scandir() matters:
a scandir user emailed me the other day, saying "I used scandir to
dump the contents of a network dir in under 15 seconds. 13 root dirs,
60,000 files in the structure. This will replace some old VBA code
embedded in a spreadsheet that was taking 15-20 minutes to do the
exact same thing." I asked if he could run scandir's benchmark.py on
his directory tree, and here's what it printed out:

C:\Python34\scandir-master>benchmark.py "\\my\network\directory"
Using fast C version of scandir
Priming the system's cache...
Benchmarking walks on \\my\network\directory, repeat 1/3...
Benchmarking walks on \\my\network\directory, repeat 2/3...
Benchmarking walks on \\my\network\directory, repeat 3/3...
os.walk took 8739.851s, scandir.walk took 129.500s -- 67.5x as fast

That's right -- os.walk() with scandir was almost 70x as fast as the
current version! Admittedly this is a network file system, but that's
still a real and important use case. It really pays not to throw away
information the OS gives you for free. :-)

On the recent python-dev thread, Victor especially made some well
thought out suggestions. It seems to me there's general agreement that
the basic API in PEP 471 is good (with Ethan not a fan at first, but
it seems he's on board after further discussion :-).

That said, I think there's basically one thing remaining to decide:
whether or not to have DirEntry.is_dir() and .is_file() follow
symlinks by default. I think Victor made a pretty good case that:

(a) following links is usually what you want
(b) that's the precedent set by the similar functions os.path.isdir()
and pathlib.Path.is_dir(), so to do otherwise would be confusing
(c) with the non-link-following version, if you wanted to follow links
you'd have to say something like "if (entry.is_symlink() and
os.path.isdir(entry.full_name)) or entry.is_dir()" instead of just "if
entry.is_dir()"
(d) it's error prone to have to do (c), as I found out recently when I
had a bug in my implementation of os.walk() with scandir -- I had a
bug due to getting this exact test wrong

If we go with Victor's link-following .is_dir() and .is_file(), then
we probably need to add his suggestion of a follow_symlinks=False
parameter (defaults to True). Either that or you have to say
"stat.S_ISDIR(entry.lstat().st_mode)" instead, which is a little bit
less nice.

As a KISS enthusiast, I admit I'm still somewhat partial to the
DirEntry methods just returning (non-link following) info about the
*directory entry* itself. However, I can definitely see the
error-proneness of that, and the advantages given the points above. So
I guess I'm on the fence.

Given the above arguments for symlink-following is_dir()/is_file()
methods (have I missed any, Victor?), what do others think?

I'd be very keen to come to a consensus on this, so that I can make
some final updates to the PEP and see about getting it accepted and/or
implemented. :-)

-Ben
Tim Delaney
2014-07-14 00:52:42 UTC
Permalink
On 14 July 2014 10:33, Ben Hoyt <***@gmail.com> wrote:
If we go with Victor's link-following .is_dir() and .is_file(), then
Post by Ben Hoyt
we probably need to add his suggestion of a follow_symlinks=False
parameter (defaults to True). Either that or you have to say
"stat.S_ISDIR(entry.lstat().st_mode)" instead, which is a little bit
less nice.
Absolutely agreed that follow_symlinks is the way to go, disagree on the
default value.
Post by Ben Hoyt
Given the above arguments for symlink-following is_dir()/is_file()
methods (have I missed any, Victor?), what do others think?
I would say whichever way you go, someone will assume the opposite. IMO not
following symlinks by default is safer. If you follow symlinks by default
then everyone has the following issues:

1. Crossing filesystems (including onto network filesystems);

2. Recursive directory structures (symlink to a parent directory);

3. Symlinks to non-existent files/directories;

4. Symlink to an absolutely huge directory somewhere else (very annoying if
you just wanted to do a directory sizer ...).

If follow_symlinks=False by default, only those who opt-in have to deal
with the above.

Tim Delaney
Nick Coghlan
2014-07-14 02:17:33 UTC
Permalink
Post by Tim Delaney
Post by Ben Hoyt
If we go with Victor's link-following .is_dir() and .is_file(), then
we probably need to add his suggestion of a follow_symlinks=False
parameter (defaults to True). Either that or you have to say
"stat.S_ISDIR(entry.lstat().st_mode)" instead, which is a little bit
less nice.
Absolutely agreed that follow_symlinks is the way to go, disagree on the
default value.
Post by Tim Delaney
Post by Ben Hoyt
Given the above arguments for symlink-following is_dir()/is_file()
methods (have I missed any, Victor?), what do others think?
I would say whichever way you go, someone will assume the opposite. IMO
not following symlinks by default is safer. If you follow symlinks by
Post by Tim Delaney
1. Crossing filesystems (including onto network filesystems);
2. Recursive directory structures (symlink to a parent directory);
3. Symlinks to non-existent files/directories;
4. Symlink to an absolutely huge directory somewhere else (very annoying
if you just wanted to do a directory sizer ...).
Post by Tim Delaney
If follow_symlinks=False by default, only those who opt-in have to deal
with the above.

Or the ever popular symlink to "." (or a directory higher in the tree).

I think os.walk() is a good source of inspiration here: call the flag
"followlink" and default it to False.

Cheers,
Nick.
Post by Tim Delaney
Tim Delaney
_______________________________________________
Python-Dev mailing list
https://mail.python.org/mailman/listinfo/python-dev
https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
Tim Delaney
2014-07-14 02:29:12 UTC
Permalink
Post by Nick Coghlan
I think os.walk() is a good source of inspiration here: call the flag
"followlink" and default it to False.
Actually, that's "followlinks", and I'd forgotten that os.walk() defaulted
to not follow - definitely behaviour to match IMO :)

Tim Delaney
Victor Stinner
2014-07-14 08:25:48 UTC
Permalink
Post by Nick Coghlan
Or the ever popular symlink to "." (or a directory higher in the tree).
"." and ".." are explicitly ignored by os.listdir() an os.scandir().
Post by Nick Coghlan
I think os.walk() is a good source of inspiration here: call the flag
"followlink" and default it to False.
IMO the specific function os.walk() is not a good example. It includes
symlinks to directories in the dirs list and then it does not follow
symlink, it is a recursive function and has a followlinks optional
parameter (default: False).

Moreover, in 92% of cases, functions using os.listdir() and
os.path.isdir() *follow* symlinks:
https://mail.python.org/pipermail/python-dev/2014-July/135435.html

Victor
Cameron Simpson
2014-07-16 03:40:00 UTC
Permalink
I was going to stay out of this one...
Post by Victor Stinner
Post by Nick Coghlan
Or the ever popular symlink to "." (or a directory higher in the tree).
"." and ".." are explicitly ignored by os.listdir() an os.scandir().
Post by Nick Coghlan
I think os.walk() is a good source of inspiration here: call the flag
"followlink" and default it to False.
I also think followslinks should be spelt like os.walk, and also default to
False.
Post by Victor Stinner
IMO the specific function os.walk() is not a good example. It includes
symlinks to directories in the dirs list and then it does not follow
symlink,
I agree that is a bad mix.
Post by Victor Stinner
it is a recursive function and has a followlinks optional
parameter (default: False).
Which I think is desirable.
Post by Victor Stinner
Moreover, in 92% of cases, functions using os.listdir() and
https://mail.python.org/pipermail/python-dev/2014-July/135435.html
Sigh.

This is a historic artifact, a convenience, and a side effect of bring symlinks
into UNIX in the first place.

The objective was that symlinks should largely be transparent to users for
naive operation. So the UNIX calls open/cd/listdir all follow symlinks so that
things work transparently and a million C programs do not break.

However, so do chmod/chgrp/chown, for the same reasons and with generally less
desirable effects.

Conversely, the find command, for example, does not follow symlinks and this is
generally a good thing. "ls" is the same. Like os.walk, they are for inspecting
stuff, and shouldn't indirect unless asked.

I think following symlinks, especially for something like os.walk and
os.scandir, should default to False. I DO NOT want to quietly wander to remote
parts of the file space because someone has stuck a symlink somewhere
unfortunate, lurking like a little bomb (or perhaps trapdoor, waiting to suck
me down into an unexpected dark place).

It is also slower to follow symlinks by default.

I am also against flag parameters that default to True, on the whole; they are
a failure of ergonomic design. Leaving off a flag should usually be like
setting it to False. A missing flag is an "off" flag.

For these reasons (and others I have not yet thought through:-) I am voting for
a:

followlinks=False

optional parameter.

If you want to follow links, it is hardly difficult.

Cheers,
Cameron Simpson <***@zip.com.au>

Our job is to make the questions so painful that the only way to make the
pain go away is by thinking. - Fred Friendly
Akira Li
2014-07-14 05:51:24 UTC
Permalink
Post by Tim Delaney
Post by Tim Delaney
Post by Ben Hoyt
If we go with Victor's link-following .is_dir() and .is_file(), then
we probably need to add his suggestion of a follow_symlinks=False
parameter (defaults to True). Either that or you have to say
"stat.S_ISDIR(entry.lstat().st_mode)" instead, which is a little bit
less nice.
Absolutely agreed that follow_symlinks is the way to go, disagree on the
default value.
Post by Tim Delaney
Post by Ben Hoyt
Given the above arguments for symlink-following is_dir()/is_file()
methods (have I missed any, Victor?), what do others think?
I would say whichever way you go, someone will assume the opposite. IMO
not following symlinks by default is safer. If you follow symlinks by
Post by Tim Delaney
1. Crossing filesystems (including onto network filesystems);
2. Recursive directory structures (symlink to a parent directory);
3. Symlinks to non-existent files/directories;
4. Symlink to an absolutely huge directory somewhere else (very annoying
if you just wanted to do a directory sizer ...).
Post by Tim Delaney
If follow_symlinks=False by default, only those who opt-in have to deal
with the above.
Or the ever popular symlink to "." (or a directory higher in the tree).
I think os.walk() is a good source of inspiration here: call the flag
"followlink" and default it to False.
Let's not multiply entities beyond necessity.

There is well-defined *follow_symlinks* parameter
https://docs.python.org/3/library/os.html#follow-symlinks
e.g., os.access, os.chown, os.link, os.stat, os.utime and many other
functions in os module support follow_symlinks parameter, see
os.supports_follow_symlinks.

os.walk is an exception that uses *followlinks*. It might be because it
is an old function e.g., newer os.fwalk uses follow_symlinks.

------------------------------------------------------------

As it has been said: os.path.isdir, pathlib.Path.is_dir in Python
File.directory? in Ruby, System.Directory.doesDirectoryExist in Haskell,
`test -d` in shell do follow symlinks i.e., follow_symlinks=True as
default is more familiar for .is_dir method.

`cd path` in shell, os.chdir(path), `ls path`, os.listdir(path), and
os.scandir(path) itself follow symlinks (even on Windows:
http://bugs.python.org/issue13772 ). GUI file managers such as
`nautilus` also treat symlinks to directories as directories -- you may
click on them to open corresponding directories.

Only *recursive* functions such as os.walk, os.fwalk do not follow
symlinks by default, to avoid symlink loops. Note: the behavior is
consistent with coreutils commands such as `cp` that follows symlinks
for non-recursive actions but e.g., `du` utility that is inherently
recursive doesn't follow symlinks by default.

follow_symlinks=True as default for DirEntry.is_dir method allows to
avoid easy-to-introduce bugs while replacing old
os.listdir/os.path.isdir code or writing a new code using the same
mental model.


--
Akira
Ben Hoyt
2014-07-15 02:48:41 UTC
Permalink
Post by Akira Li
Let's not multiply entities beyond necessity.
There is well-defined *follow_symlinks* parameter
https://docs.python.org/3/library/os.html#follow-symlinks
e.g., os.access, os.chown, os.link, os.stat, os.utime and many other
functions in os module support follow_symlinks parameter, see
os.supports_follow_symlinks.
Huh, interesting. I didn't know os.stat() had a follow_symlinks
parameter -- when False, it's equivalent to lstat. If DirEntry has a
.stat(follow_symlinks=True) method, we don't actually need lstat().
Post by Akira Li
os.walk is an exception that uses *followlinks*. It might be because it
is an old function e.g., newer os.fwalk uses follow_symlinks.
Yes, I'm sure that's correct. Today it'd be called follow_symlinks,
but obviously one can't change os.walk() anymore.
Post by Akira Li
Only *recursive* functions such as os.walk, os.fwalk do not follow
symlinks by default, to avoid symlink loops. [...]
follow_symlinks=True as default for DirEntry.is_dir method allows to
avoid easy-to-introduce bugs while replacing old
os.listdir/os.path.isdir code or writing a new code using the same
mental model.
I think these are good points, especially that of porting existing
listdir()/os.path.isdir() code and avoiding bugs. As I mentioned, I
was really on the fence about the link-following thing, but if it's a
tiny bit harder to implement but it avoids bugs (and I already had a
bug with this when implementing os.walk), that's a worthwhile
trade-off.

In light of that, I propose I update the PEP to basically follow
Victor's model of is_X() and stat() following symlinks by default, and
allowing you to specify follow_symlinks=False if you want something
other than that.
Post by Akira Li
What happens to name and full_name with followlinks=True?
Do they contain the name in the directory (name of the symlink)
or name of the linked file?
I would say they should contain the name and full path of the entry --
the symlink, NOT the linked file. They kind of have to, right,
otherwise they'd have to be method calls that potentially call the
system.

In any case, here's the modified proposal:

scandir(path='.') -> generator of DirEntry objects, which have:

* name: name as per listdir()
* full_name: full path name (not necessarily absolute), equivalent of
os.path.join(path, entry.name)
* is_dir(follow_symlinks=True): like os.path.isdir(entry.full_name),
but free in most cases; cached per entry
* is_file(follow_symlinks=True): like os.path.isfile(entry.full_name),
but free in most cases; cached per entry
* is_symlink(): like os.path.islink(), but free in most cases; cached per entry
* stat(follow_symlinks=True): like os.stat(entry.full_name,
follow_symlinks=follow_symlinks); cached per entry

The above may not be quite perfect, but it's good, and I think there's
been enough bike-shedding on the API. :-)

So please speak now or forever hold your peace. :-) I intend to update
the PEP to reflect this and make a few other clarifications in the
next few days.

-Ben
Ethan Furman
2014-07-15 03:00:51 UTC
Permalink
Received: from localhost (HELO mail.python.org) (127.0.0.1)
by albatross.python.org with SMTP; 15 Jul 2014 05:46:33 +0200
Received: from gateway11.websitewelcome.com (unknown [67.18.71.7])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(No client certificate requested)
by mail.python.org (Postfix) with ESMTPS
for <python-***@python.org>; Tue, 15 Jul 2014 05:46:33 +0200 (CEST)
Received: by gateway11.websitewelcome.com (Postfix, from userid 500)
id 03D4A317631F3; Mon, 14 Jul 2014 22:00:52 -0500 (CDT)
Received: from gator3304.hostgator.com (gator3304.hostgator.com
[192.254.250.168])
by gateway11.websitewelcome.com (Postfix) with ESMTP id D7E93317631B0
for <python-***@python.org>; Mon, 14 Jul 2014 22:00:51 -0500 (CDT)
Received: from [173.12.184.233] (port=41085)
by gator3304.hostgator.com with esmtpsa (TLSv1:DHE-RSA-CAMELLIA256-SHA:256)
(Exim 4.82) (envelope-from <***@stoneleaf.us>) id 1X6sz5-0003n0-Bx
for python-***@python.org; Mon, 14 Jul 2014 22:00:51 -0500
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
rv:16.0) Gecko/20121010 Thunderbird/16.0.1
In-Reply-To: <CAL9jXCH6Rh2r0-_-B7_OmtaBjiWYgnTs=i+***@mail.gmail.com>
X-AntiAbuse: This header was added to track abuse,
please include it with any abuse report
X-AntiAbuse: Primary Hostname - gator3304.hostgator.com
X-AntiAbuse: Original Domain - python.org
X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
X-AntiAbuse: Sender Address Domain - stoneleaf.us
X-BWhitelist: no
X-Source-IP: 173.12.184.233
X-Exim-ID: 1X6sz5-0003n0-Bx
X-Source:
X-Source-Args:
X-Source-Dir:
X-Source-Sender: ([173.12.184.233]) [173.12.184.233]:41085
X-Source-Auth: ethan+stoneleaf.us
X-Email-Count: 1
X-Source-Cap: dG9idWs7dG9idWs7Z2F0b3IzMzA0Lmhvc3RnYXRvci5jb20=
X-BeenThere: python-***@python.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Python core developers <python-dev.python.org>
List-Unsubscribe: <https://mail.python.org/mailman/options/python-dev>,
<mailto:python-dev-***@python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/pipermail/python-dev/>
List-Post: <mailto:python-***@python.org>
List-Help: <mailto:python-dev-***@python.org?subject=help>
List-Subscribe: <https://mail.python.org/mailman/listinfo/python-dev>,
<mailto:python-dev-***@python.org?subject=subscribe>
Errors-To: python-dev-bounces+python-python-dev=***@python.org
Sender: "Python-Dev"
<python-dev-bounces+python-python-dev=***@python.org>
Archived-At: <http://permalink.gmane.org/gmane.comp.python.devel/148695>
Post by Ben Hoyt
* name: name as per listdir()
* full_name: full path name (not necessarily absolute), equivalent of
os.path.join(path, entry.name)
* is_dir(follow_symlinks=True): like os.path.isdir(entry.full_name),
but free in most cases; cached per entry
* is_file(follow_symlinks=True): like os.path.isfile(entry.full_name),
but free in most cases; cached per entry
* is_symlink(): like os.path.islink(), but free in most cases; cached per entry
* stat(follow_symlinks=True): like os.stat(entry.full_name,
follow_symlinks=follow_symlinks); cached per entry
The above may not be quite perfect, but it's good, and I think there's
been enough bike-shedding on the API. :-)
Looks doable. Just make sure the cached entries reflect the 'follow_symlinks' setting -- so a symlink could end up with
both an lstat cached entry and a stat cached entry.

--
~Ethan~
Ben Hoyt
2014-07-15 12:01:16 UTC
Permalink
Post by Ethan Furman
Looks doable. Just make sure the cached entries reflect the
'follow_symlinks' setting -- so a symlink could end up with both an lstat
cached entry and a stat cached entry.
Yes, good point -- basically the functions will use the _stat cache if
follow_symlinks=True, otherwise the _lstat cache. If the entry is not
a symlink (the usual case), they'll be the same value.

-Ben
Victor Stinner
2014-07-15 06:25:52 UTC
Permalink
Post by Ben Hoyt
Post by Victor Stinner
What happens to name and full_name with followlinks=True?
Do they contain the name in the directory (name of the symlink)
or name of the linked file?
I would say they should contain the name and full path of the entry --
the symlink, NOT the linked file. They kind of have to, right,
otherwise they'd have to be method calls that potentially call the
system.
Sorry, I don't remember who but someone proposed to add the follow_symlinks
parameter in scandir() directly. If the parameter is added to methods,
there is no such issue.

I like the compromise of adding an optional follow_symlinks to is_xxx() and
stat() method. No need for .lstat().

Again: remove any garantee about the cache in the definitions of methods,
instead copy the doc from os.path and os. Add a global remark saying that
most methods don't need any syscall in general, except for symlinks (with
follow_symlinks=True).

Victor
Ben Hoyt
2014-07-15 12:05:55 UTC
Permalink
Post by Victor Stinner
Sorry, I don't remember who but someone proposed to add the follow_symlinks
parameter in scandir() directly. If the parameter is added to methods,
there is no such issue.
Yeah, I think having the DirEntry methods do different things
depending on how scandir() was called is a really bad idea. It seems
you're agreeing with this?
Post by Victor Stinner
Again: remove any garantee about the cache in the definitions of methods,
instead copy the doc from os.path and os. Add a global remark saying that
most methods don't need any syscall in general, except for symlinks (with
follow_symlinks=True).
I'm not sure I follow this -- surely it *has* to be documented that
the values of DirEntry.is_X() and DirEntry.stat() are cached per
entry, in contrast to os.path.isX()/os.stat()?

I don't mind a global remark about not needing syscalls, but I do
think it makes sense to make it explicit -- that is_X() almost never
need syscalls, whereas stat() does only on POSIX but is free on
Windows (except for symlinks).

-Ben
Ethan Furman
2014-07-15 16:41:40 UTC
Permalink
Post by Victor Stinner
Again: remove any garantee about the cache in the definitions of methods,
instead copy the doc from os.path and os. Add a global remark saying that
most methods don't need any syscall in general, except for symlinks (with
follow_symlinks=True).
I don't understand what you're saying here. The fact that DirEnrry.is_xxx will use cached values *must* be documented,
or our users will waste huge amounts of time trying to figure out why an unknowingly cached value is no longer matching
the current status.

~Ethan~
Nick Coghlan
2014-07-15 11:24:14 UTC
Permalink
Post by Ben Hoyt
In light of that, I propose I update the PEP to basically follow
Victor's model of is_X() and stat() following symlinks by default, and
allowing you to specify follow_symlinks=False if you want something
other than that.
Post by Victor Stinner
What happens to name and full_name with followlinks=True?
Do they contain the name in the directory (name of the symlink)
or name of the linked file?
I would say they should contain the name and full path of the entry --
the symlink, NOT the linked file. They kind of have to, right,
otherwise they'd have to be method calls that potentially call the
system.
It would be worth explicitly pointing out "os.readlink(entry.full_name)" in
the docs as the way to get the target of a symlink entry.

Alternatively, it may be worth including a readlink() method directly on
the entry objects. (That can easily be added later though, so no need for
it in the initial proposal).
Post by Ben Hoyt
* name: name as per listdir()
* full_name: full path name (not necessarily absolute), equivalent of
os.path.join(path, entry.name)
* is_dir(follow_symlinks=True): like os.path.isdir(entry.full_name),
but free in most cases; cached per entry
* is_file(follow_symlinks=True): like os.path.isfile(entry.full_name),
but free in most cases; cached per entry
* is_symlink(): like os.path.islink(), but free in most cases; cached per entry
* stat(follow_symlinks=True): like os.stat(entry.full_name,
follow_symlinks=follow_symlinks); cached per entry
The above may not be quite perfect, but it's good, and I think there's
been enough bike-shedding on the API. :-)
+1, sounds good to me (and I like having the caching guarantees listed -
helps make it clear how DirEntry differs from pathlib.Path)

Cheers,
Nick.
Post by Ben Hoyt
So please speak now or forever hold your peace. :-) I intend to update
the PEP to reflect this and make a few other clarifications in the
next few days.
-Ben
_______________________________________________
Python-Dev mailing list
https://mail.python.org/mailman/listinfo/python-dev
https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
Ben Hoyt
2014-07-15 12:19:35 UTC
Permalink
This post might be inappropriate. Click to display it.
Paul Moore
2014-07-15 12:31:16 UTC
Permalink
Post by Ben Hoyt
Hmmm, perhaps. You suggest .full_name implies it's the absolute path,
which isn't true. I don't mind .path, but it kind of sounds like "the
Path object associated with this entry". I think "full_name" is fine
-- it's not "abs_name".
Interesting. I hadn't really thought about it, but I might have
assumed full_name was absolute. However, now I see that it's "only as
absolute as the directory argument to scandir is". Having said that, I
don't think that full_name *implies* that, just that it's a possible
mistake people could make. I agree that "path" could be seen as
implying a Path object.

My preference would be to retain the name full_name, but just make it
explicit in the documentation that it is based on the directory name
argument.

Paul
Ethan Furman
2014-07-14 04:52:37 UTC
Permalink
Received: from localhost (HELO mail.python.org) (127.0.0.1)
by albatross.python.org with SMTP; 14 Jul 2014 07:37:59 +0200
Received: from gateway13.websitewelcome.com (unknown [69.93.243.26])
(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
(No client certificate requested)
by mail.python.org (Postfix) with ESMTPS
for <python-***@python.org>; Mon, 14 Jul 2014 07:37:59 +0200 (CEST)
Received: by gateway13.websitewelcome.com (Postfix, from userid 5007)
id 9BEDA4A2D8DEB; Sun, 13 Jul 2014 23:52:37 -0500 (CDT)
Received: from gator3304.hostgator.com (gator3304.hostgator.com
[192.254.250.168])
by gateway13.websitewelcome.com (Postfix) with ESMTP id 7B7DC4A2D8D98
for <python-***@python.org>; Sun, 13 Jul 2014 23:52:37 -0500 (CDT)
Received: from [173.12.184.233] (port=39865)
by gator3304.hostgator.com with esmtpsa (TLSv1:DHE-RSA-CAMELLIA256-SHA:256)
(Exim 4.82) (envelope-from <***@stoneleaf.us>) id 1X6YFg-0005Qz-W5
for python-***@python.org; Sun, 13 Jul 2014 23:52:37 -0500
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
rv:16.0) Gecko/20121010 Thunderbird/16.0.1
In-Reply-To: <***@mail.gmail.com>
X-AntiAbuse: This header was added to track abuse,
please include it with any abuse report
X-AntiAbuse: Primary Hostname - gator3304.hostgator.com
X-AntiAbuse: Original Domain - python.org
X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
X-AntiAbuse: Sender Address Domain - stoneleaf.us
X-BWhitelist: no
X-Source-IP: 173.12.184.233
X-Exim-ID: 1X6YFg-0005Qz-W5
X-Source:
X-Source-Args:
X-Source-Dir:
X-Source-Sender: ([173.12.184.233]) [173.12.184.233]:39865
X-Source-Auth: ethan+stoneleaf.us
X-Email-Count: 1
X-Source-Cap: dG9idWs7dG9idWs7Z2F0b3IzMzA0Lmhvc3RnYXRvci5jb20=
X-BeenThere: python-***@python.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Python core developers <python-dev.python.org>
List-Unsubscribe: <https://mail.python.org/mailman/options/python-dev>,
<mailto:python-dev-***@python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/pipermail/python-dev/>
List-Post: <mailto:python-***@python.org>
List-Help: <mailto:python-dev-***@python.org?subject=help>
List-Subscribe: <https://mail.python.org/mailman/listinfo/python-dev>,
<mailto:python-dev-***@python.org?subject=subscribe>
Errors-To: python-dev-bounces+python-python-dev=***@python.org
Sender: "Python-Dev"
<python-dev-bounces+python-python-dev=***@python.org>
Archived-At: <http://permalink.gmane.org/gmane.comp.python.devel/148665>
Post by Ben Hoyt
On the recent python-dev thread, Victor especially made some well
thought out suggestions. It seems to me there's general agreement that
the basic API in PEP 471 is good (with Ethan not a fan at first, but
it seems he's on board after further discussion :-).
I would still like to have 'info' and 'onerror' added to the basic API, but I agree that having methods and caching on
first lookup is good.
Post by Ben Hoyt
whether or not to have DirEntry.is_dir() and .is_file() follow
symlinks by default.
We should have a flag for that, and default it to False:

scandir(path, *, followlinks=False, info=None, onerror=None)

--
~Ethan~
Victor Stinner
2014-07-14 08:31:00 UTC
Permalink
Post by Ethan Furman
scandir(path, *, followlinks=False, info=None, onerror=None)
What happens to name and full_name with followlinks=True? Do they
contain the name in the directory (name of the symlink) or name of the
linked file?

So it means that is_dir() may or may not follow symlinks depending how
the object was built?

Victor
Victor Stinner
2014-07-14 08:18:31 UTC
Permalink
Post by Ben Hoyt
If we go with Victor's link-following .is_dir() and .is_file(), then
we probably need to add his suggestion of a follow_symlinks=False
parameter (defaults to True). Either that or you have to say
"stat.S_ISDIR(entry.lstat().st_mode)" instead, which is a little bit
less nice.
You forgot one of my argument: we must have exactly the same API than
os.path.is_dir() and pathlib.Path.is_dir(), because it would be very
confusing (source of bugs) to have a different behaviour.

Since these functions don't have any parameter (there is no such
follow_symlink(s) parameter), I'm opposed to the idea of adding such
parameter.

If you really want to add a follow_symlink optional parameter, IMO you
should modify all os.path.is*() functions and all pathlib.Path.is*()
methods to add it there too. Maybe if nobody asked for this feature
before, it's because it's not useful in practice. You can simply test
explicitly is_symlink() before checking is_dir().

Well, let's imagine DirEntry.is_dir() does not follow symlinks. How do
you test is_dir() and follow symlinks?
"stat.S_ISDIR(entry.stat().st_mode)" ? You have to import the stat
module, and use the ugly C macro S_ISDIR().

Victor
Ben Hoyt
2014-07-14 12:27:39 UTC
Permalink
First, just to clarify a couple of points.
Post by Victor Stinner
You forgot one of my argument: we must have exactly the same API than
os.path.is_dir() and pathlib.Path.is_dir(), because it would be very
confusing (source of bugs) to have a different behaviour.
Actually, I specifically included that argument. It's item (b) in the
list in my original message yesterday. :-)
Post by Victor Stinner
Since these functions don't have any parameter (there is no such
follow_symlink(s) parameter), I'm opposed to the idea of adding such
parameter.
If you really want to add a follow_symlink optional parameter, IMO you
should modify all os.path.is*() functions and all pathlib.Path.is*()
methods to add it there too. Maybe if nobody asked for this feature
before, it's because it's not useful in practice. You can simply test
explicitly is_symlink() before checking is_dir().
Yeah, this is fair enough.
Post by Victor Stinner
Well, let's imagine DirEntry.is_dir() does not follow symlinks. How do
you test is_dir() and follow symlinks?
"stat.S_ISDIR(entry.stat().st_mode)" ? You have to import the stat
module, and use the ugly C macro S_ISDIR().
No, you don't actually need stat/S_ISDIR in that case -- if
DirEntry.is_dir() does not follow symlinks, you just say:

entry.is_symlink() and os.path.isdir(entry.full_name)

Or for the full test:

(entry.is_symlink() and os.path.isdir(entry.full_name)) or entry.is_dir()

On the other hand, if DirEntry.is_dir() does follow symlinks per your
proposal, then to do is_dir without following symlinks you need to use
DirEntry. lstat() like so:

stat.S_ISDIR(entry.lstat().st_mode)

So from this perspective it's somewhat nicer to have DirEntry.is_X()
not follow links and use DirEntry.is_symlink() and os.path.isX() to
supplement that if you want to follow links.

I think Victor has a good point re 92% of the stdlib calls that use
listdir and isX do follow links.

However, I think Tim Delaney makes some good points above about the
(not so) safety of scandir following symlinks by default -- symlinks
to network file systems, nonexist files, or huge directory trees. In
that light, this kind of thing should be opt-*in*.

I guess I'm still slightly on the DirEntry-does-not-follow-links side
of the fence, due to the fact that it's a method on the *directory
entry* object, due to simplicity of implementation, and due to Tim
Delaney's "it should be safe by default" point above.

However, we're *almost* bikeshedding at this point, and I think we
just need to pick one way or the other. It's straight forward to
implement one in terms of the other in each case.

-Ben
Antoine Pitrou
2014-07-20 16:50:06 UTC
Permalink
Received: from localhost (HELO mail.python.org) (127.0.0.1)
by albatross.python.org with SMTP; 20 Jul 2014 18:50:17 +0200
Received: from plane.gmane.org (unknown [80.91.229.3])
(using TLSv1 with cipher AES256-SHA (256/256 bits))
(No client certificate requested)
by mail.python.org (Postfix) with ESMTPS
for <python-***@python.org>; Sun, 20 Jul 2014 18:50:17 +0200 (CEST)
Received: from list by plane.gmane.org with local (Exim 4.69)
(envelope-from <python-python-***@m.gmane.org>) id 1X8uJU-0004Q6-CK
for python-***@python.org; Sun, 20 Jul 2014 18:50:16 +0200
Received: from pool-100-2-156-185.nycmny.fios.verizon.net ([100.2.156.185])
by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
id 1AlnuQ-0007hv-00
for <python-***@python.org>; Sun, 20 Jul 2014 18:50:16 +0200
Received: from antoine by pool-100-2-156-185.nycmny.fios.verizon.net with
local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00
for <python-***@python.org>; Sun, 20 Jul 2014 18:50:16 +0200
X-Injected-Via-Gmane: http://gmane.org/
Lines: 15
X-Complaints-To: ***@ger.gmane.org
X-Gmane-NNTP-Posting-Host: pool-100-2-156-185.nycmny.fios.verizon.net
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
rv:24.0) Gecko/20100101 Thunderbird/24.6.0
In-Reply-To: <***@mail.gmail.com>
X-BeenThere: python-***@python.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Python core developers <python-dev.python.org>
List-Unsubscribe: <https://mail.python.org/mailman/options/python-dev>,
<mailto:python-dev-***@python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/pipermail/python-dev/>
List-Post: <mailto:python-***@python.org>
List-Help: <mailto:python-dev-***@python.org?subject=help>
List-Subscribe: <https://mail.python.org/mailman/listinfo/python-dev>,
<mailto:python-dev-***@python.org?subject=subscribe>
Errors-To: python-dev-bounces+python-python-dev=***@python.org
Sender: "Python-Dev"
<python-dev-bounces+python-python-dev=***@python.org>
Archived-At: <http://permalink.gmane.org/gmane.comp.python.devel/148725>



Hi,
Post by Ben Hoyt
Thanks Victor, Nick, Ethan, and others for continued discussion on the
scandir PEP 471 (most recent thread starts at
https://mail.python.org/pipermail/python-dev/2014-July/135377.html).
Have you tried modifying importlib's _bootstrap.py to use scandir()
instead of listdir() + stat()?

Regards

Antoine.
Ben Hoyt
2014-07-20 21:34:19 UTC
Permalink
Have you tried modifying importlib's _bootstrap.py to use scandir() instead
of listdir() + stat()?
No, I haven't -- I'm not familiar with that code. What does
_bootstrap.py do -- does it do a lot of listdir calls and stat-ing of
many files?

-Ben
Brett Cannon
2014-07-20 22:35:48 UTC
Permalink
Oh yes. :) The file Antoine is referring to is the implementation of import.
Post by Ben Hoyt
Post by Antoine Pitrou
Have you tried modifying importlib's _bootstrap.py to use scandir()
instead
Post by Antoine Pitrou
of listdir() + stat()?
No, I haven't -- I'm not familiar with that code. What does
_bootstrap.py do -- does it do a lot of listdir calls and stat-ing of
many files?
-Ben
_______________________________________________
Python-Dev mailing list
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/
brett%40python.org
Antoine Pitrou
2014-07-20 23:45:28 UTC
Permalink
Post by Ben Hoyt
Have you tried modifying importlib's _bootstrap.py to use scandir() instead
of listdir() + stat()?
No, I haven't -- I'm not familiar with that code. What does
_bootstrap.py do -- does it do a lot of listdir calls and stat-ing of
many files?
Quite a bit, although that should be dampened in recent 3.x versions,
thanks to the caching of directory contents.

Even though there is tangible performance improvement from scandir(), it
would be useful to find out if the API fits well.

Regards

Antoine.
Ben Hoyt
2014-07-21 15:32:05 UTC
Permalink
Post by Antoine Pitrou
Even though there is tangible performance improvement from scandir(), it
would be useful to find out if the API fits well.
Got it -- I see where you're coming from now. I'll take a quick look
(hopefully later this week).

-Ben
Victor Stinner
2014-07-21 15:57:12 UTC
Permalink
Hi,
Have you tried modifying importlib's _bootstrap.py to use scandir() instead
of listdir() + stat()?
IMO the current os.scandir() API does not fit importlib requirements.
importlib usually wants fresh data, whereas DirEntry cache cannot be
invalidated. It's probably possible to cache some os.stat() result in
importlib, but it looks like it requires a non trivial refactoring of
the code. I don't know importlib enough to suggest how to change it.

There are many open isssues related to stat() in importlib, I found these ones:

http://bugs.python.org/issue14604
http://bugs.python.org/issue14067
http://bugs.python.org/issue19216

Closed issues:

http://bugs.python.org/issue17330
http://bugs.python.org/issue18810


By the way, DirEntry constructor is not documented in the PEP. Should
we document it? It might be a way to "invalidate the cache":

entry = DirEntry(os.path.dirname(entry.path), entry.name)

Maybe it is an abuse of the API. A clear_cache() method would be less
ugly :-) But maybe Ben Hoyt does not want to promote keeping DirEntry
for a long time?

Another question: should we expose DirEntry type directly in the os
namespace? (os.DirEntry)

Victor
Steve Dower
2014-07-21 16:11:45 UTC
Permalink
Post by Victor Stinner
Post by Antoine Pitrou
Have you tried modifying importlib's _bootstrap.py to use scandir()
instead of listdir() + stat()?
IMO the current os.scandir() API does not fit importlib requirements.
importlib usually wants fresh data, whereas DirEntry cache cannot be
invalidated. It's probably possible to cache some os.stat() result in
importlib, but it looks like it requires a non trivial refactoring of
the code. I don't know importlib enough to suggest how to change it.
The data is completely fresh at the time it is obtained, which is identical to using stat(). There will always be a race-condition between looking and doing, which is why we still use exception handling on actions.
Post by Victor Stinner
By the way, DirEntry constructor is not documented in the PEP. Should
entry = DirEntry(os.path.dirname(entry.path), entry.name)
Maybe it is an abuse of the API. A clear_cache() method would be less
ugly :-) But maybe Ben Hoyt does not want to promote keeping DirEntry
for a long time?
DirEntry is a convenient way to return a tuple without returning a tuple, that's all. If you want up to date info, call os.stat() and pass in the path. This should just be a better (and ideally transparent) substitute for os.listdir() in every single context.

Personally I'd make it a string subclass and put one-shot properties on it (i.e. call/cache stat() on first access where we don't already know the answer), which I think is close enough to where it's landed that I'm happy. (As far as bikeshedding goes, I prefer "_DirEntry" and no docs :) )

Cheers,
Steve
Nick Coghlan
2014-07-21 21:37:09 UTC
Permalink
Post by Steve Dower
Personally I'd make it a string subclass and put one-shot properties on
it (i.e. call/cache stat() on first access where we don't already know the
answer), which I think is close enough to where it's landed that I'm happy.
(As far as bikeshedding goes, I prefer "_DirEntry" and no docs :) )

+1 for "_DirEntry" as the name in the implementation, and documenting its
behaviour under "scandir" rather than as a standalone object.

Only -0 for full documentation as a standalone class, though.

Cheers,
Nick.
Post by Steve Dower
Cheers,
Steve
_______________________________________________
Python-Dev mailing list
https://mail.python.org/mailman/listinfo/python-dev
https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
Ben Hoyt
2014-07-21 16:48:50 UTC
Permalink
Thanks for an initial look into this, Victor.
Post by Victor Stinner
IMO the current os.scandir() API does not fit importlib requirements.
importlib usually wants fresh data, whereas DirEntry cache cannot be
invalidated. It's probably possible to cache some os.stat() result in
importlib, but it looks like it requires a non trivial refactoring of
the code. I don't know importlib enough to suggest how to change it.
Yes, with importlib already doing its own caching (somewhat
complicated, as the open and closed issues show), I get the feeling it
wouldn't be a good fit. Note that I'm not saying we wouldn't use it if
we were implementing importlib from scratch.
Post by Victor Stinner
By the way, DirEntry constructor is not documented in the PEP. Should
I would prefer not to, just to keep things simple. Similar to creating
os.stat_result() objects ... you can kind of do it (see scandir.py),
but it's not recommended or even documented. The entire purpose of
DirEntry objects is so scandir can produce them, not for general use.
Post by Victor Stinner
entry = DirEntry(os.path.dirname(entry.path), entry.name)
Maybe it is an abuse of the API. A clear_cache() method would be less
ugly :-) But maybe Ben Hoyt does not want to promote keeping DirEntry
for a long time?
Another question: should we expose DirEntry type directly in the os
namespace? (os.DirEntry)
Again, I'd rather not expose this. It's quite system-specific (see the
different system versions in scandir.py), and trying to combine this,
make it consistent, and document it would be a bit of a pain, and also
possibly prevent future modifications (because then the parts of the
implementation would be set in stone).

I'm not really opposed to a clear_cache() method -- basically it'd set
_lstat and _stat and _d_type to None internally. However, I'd prefer
to keep it as is, and as the PEP says:

If developers want "refresh" behaviour (for example, for watching a
file's size change), they can simply use pathlib.Path objects, or call
the regular os.stat() or os.path.getsize() functions which get fresh
data from the operating system every call.

-Ben
Victor Stinner
2014-07-21 22:39:26 UTC
Permalink
Post by Ben Hoyt
Post by Victor Stinner
By the way, DirEntry constructor is not documented in the PEP. Should
I would prefer not to, just to keep things simple. Similar to creating
os.stat_result() objects ... you can kind of do it (see scandir.py),
but it's not recommended or even documented. The entire purpose of
DirEntry objects is so scandir can produce them, not for general use.
Post by Victor Stinner
entry = DirEntry(os.path.dirname(entry.path), entry.name)
Maybe it is an abuse of the API. A clear_cache() method would be less
ugly :-) But maybe Ben Hoyt does not want to promote keeping DirEntry
for a long time?
Another question: should we expose DirEntry type directly in the os
namespace? (os.DirEntry)
Again, I'd rather not expose this. It's quite system-specific (see the
different system versions in scandir.py), and trying to combine this,
make it consistent, and document it would be a bit of a pain, and also
possibly prevent future modifications (because then the parts of the
implementation would be set in stone).
We should mimic os.stat() and os.stat_result: os.stat_result symbol
exists in the os namespace, but the type constructor is not
documented. No need for extra protection like not adding the type in
the os module, or adding a "_" prefix to the name.

By the way, it's possible to serialize a stat_result with pickle.

See also my issue "Enhance doc of os.stat_result":
http://bugs.python.org/issue21813
Post by Ben Hoyt
I'm not really opposed to a clear_cache() method -- basically it'd set
_lstat and _stat and _d_type to None internally. However, I'd prefer
to keep it as is, and as the PEP says: (...)
Ok, agreed.

Victor
Ben Hoyt
2014-07-22 02:32:10 UTC
Permalink
Post by Victor Stinner
We should mimic os.stat() and os.stat_result: os.stat_result symbol
exists in the os namespace, but the type constructor is not
documented. No need for extra protection like not adding the type in
the os module, or adding a "_" prefix to the name.
Yeah, that works for me.
Post by Victor Stinner
By the way, it's possible to serialize a stat_result with pickle.
That makes sense, as stat_result is basically just a tuple and a bit
extra. I wonder if it should be possible to pickle DirEntry objects?
I'm thinking possibly not. If so, would it cache the stat or file type
info?

-Ben

Loading...