Fix Unicode-disabled build of Python 2.7

Discussion:

Serhiy Storchaka

2014-06-24 08:22:11 UTC

I submitted a number of patches which fix the currently broken
Unicode-disabled build of Python 2.7 (built with the --disable-unicode
configure option). I suppose this was broken in 2.7 when the C
implementation of the io module was introduced.

http://bugs.python.org/issue21833 -- main patch which fixes the io
module and adds helpers for testing.

http://bugs.python.org/issue21834 -- a lot of minor fixes for tests.

The following issues fix individual modules and their related tests:

http://bugs.python.org/issue21854 -- cookielib
http://bugs.python.org/issue21838 -- ctypes
http://bugs.python.org/issue21855 -- decimal
http://bugs.python.org/issue21839 -- distutils
http://bugs.python.org/issue21843 -- doctest
http://bugs.python.org/issue21851 -- gettext
http://bugs.python.org/issue21844 -- HTMLParser
http://bugs.python.org/issue21850 -- httplib and SimpleHTTPServer
http://bugs.python.org/issue21842 -- IDLE
http://bugs.python.org/issue21853 -- inspect
http://bugs.python.org/issue21848 -- logging
http://bugs.python.org/issue21849 -- multiprocessing
http://bugs.python.org/issue21852 -- optparse
http://bugs.python.org/issue21840 -- os.path
http://bugs.python.org/issue21845 -- plistlib
http://bugs.python.org/issue21836 -- sqlite3
http://bugs.python.org/issue21837 -- tarfile
http://bugs.python.org/issue21835 -- Tkinter
http://bugs.python.org/issue21847 -- xmlrpc
http://bugs.python.org/issue21841 -- xml.sax
http://bugs.python.org/issue21846 -- zipfile

Most fixes are trivial and are only a few lines of code.

Victor Stinner

2014-06-24 08:55:21 UTC

Hi,

I don't know anyone building Python without Unicode. I would prefer to
modify configure to raise an error, and drop #ifdef in the code. (Stop
supporting building Python 2 without Unicode.)

Building Python 2 without Unicode support is not an innocent change.
Python is moving strongly to Unicode: Python 3 uses Unicode by
default. So to me it sounds really weird to work on building Python 2
without Unicode support. It means that you may have "Python 2" and
"Python 2 without Unicode" which are not exactly the same language.
IMO u"unicode" is part of the Python 2 language.

--disable-unicode is an old option added while Python 1.5 was very
slowly moving to Unicode.

I have the same opinion of the --without-thread option (we should stop
supporting it; this option is useless). I worked in the embedded
world, where Python was used for the UI of a TV set-top box. Even
though the hardware was slow and old, Python was compiled with threads
and Unicode. Unicode was mandatory to correctly handle letters with
diacritics, and threads were used to handle networking and D-Bus, for
example.

Victor

_______________________________________________
Python-Dev mailing list
https://mail.python.org/mailman/listinfo/python-dev
https://mail.python.org/mailman/options/python-dev/victor.stinner%40gmail.com

Skip Montanaro

2014-06-24 11:04:41 UTC

I can't see any reason to make a backwards-incompatible change to
Python 2 to only support Unicode. You're bound to break somebody's
setup. Wouldn't it be better to fix bugs as Serhiy has done?

Skip

Antoine Pitrou

2014-06-24 11:47:37 UTC

Post by Skip Montanaro
I can't see any reason to make a backwards-incompatible change to
Python 2 to only support Unicode. You're bound to break somebody's
setup.

Apparently, that setup would already have been broken for years.

Regards

Antoine.

Victor Stinner

2014-06-24 11:50:25 UTC

Post by Skip Montanaro
I can't see any reason to make a backwards-incompatible change to
Python 2 to only support Unicode. You're bound to break somebody's
setup. Wouldn't it be better to fix bugs as Serhiy has done?

According to the long list of issues, I don't think that it's possible
to compile and use Python stdlib when Python is compiled without
Unicode support. So I'm not sure that we can say that it's a
backward-incompatible change.

Who is somebody? Who compiles Python without Unicode support? Which
version of Python?

With Python 2.6, ./configure --disable-unicode fails with:
"checking what type to use for unicode... configure: error: invalid
value for --enable-unicode. Use either ucs2 or ucs4 (lowercase)."

So I'm not sure that anyone used this option recently.

The configure script was fixed 2 years ago in Python 2.7 (2 years
after the release of Python 2.7.0):
http://hg.python.org/cpython/rev/d7aff4423172
http://bugs.python.org/issue21833

"./configure --disable-unicode" works on Python 2.5.6: unicode type
doesn't exist, and u'abc' is a bytes string.

It works with Python 2.7.7+ too.

Victor

Serhiy Storchaka

2014-06-24 12:10:07 UTC

Post by Victor Stinner

Python has about 300 modules, and my patches fix about 30 of them (only
8 of which cause compile errors). And that's almost all. What is left is
pickle, json, etree, email and the unicode-specific modules (codecs,
unicodedata and encodings). Apart from pickle, I'm not sure that the
others can be fixed.

The fact that only a small fraction of modules need fixes means that
Python without unicode support can be pretty usable.

The main problem was with testing itself. The test suite depends on
tempfile, which now uses io.open, which didn't work without unicode
support (at least since 2.7).

Benjamin Peterson

2014-06-24 16:06:10 UTC

If Serhiy wants to spend his time supporting this arcane feature, he can
do that. It doesn't really seem worth risking regressions to do this,
though.

Ned Deily

2014-06-24 19:54:29 UTC

Post by Benjamin Peterson
If Serhiy wants to spend his time supporting this arcane feature, he can
do that. It doesn't really seem worth risking regressions to do this,
though.

That's why I'm concerned about applying these 20+ patches that touch
many parts of the code base. I don't have any objection to the "arcane
feature" per se and I appreciate the obvious effort that Serhiy put into
the patches but, at this stage of the life of Python 2, our overriding
concern should be stability. That's really why most users of Python 2.7
continue to use it. As I see it, maintenance mode is a promise from us
to our users that we will try our best, in general, to only make changes
that fix serious problems, either due to bugs in Python itself or
changes in the external world (new OS releases, etc). We don't
automatically fix all bugs. Any time we make a change, we're making an
engineering decision with cost-benefit tradeoffs. The more lines of
code changed, the greater the risk that we introduce new bugs;
inadvertently adding regressions has been an issue over a number of the
2.7.x releases, including the most recent one. The cost-benefit of this
set of changes seems to me to be:

Costs:
- Code changes in many modules:
    - careful review -> additional work for multiple core developers
    - careful testing on all platforms including this option that we
      don't currently test at all, AFAIK -> added work for platform experts
    - risk of regressions not caught prior to release, at worst requiring
      another early followup release -> added work for release team,
      third-party packagers, users
    - possibly making backporting of other issues more difficult due to
      merge conflicts
    - possible invalidation of waiting-for-review patches forcing patch
      refreshes and retests -> added work for potential contributors
    - possible invalidation of user local patches -> added work for users
- may encourage use of an apparently little-used feature that has no
  equivalent in Python 3, another incentive to stay with Py2?

Benefit:
- Fixes documented feature that may be of benefit to users of Python in
  applications with very limited memory available, although there aren't
  any open issues from users requesting this (AFAIK). No benefit to the
  overwhelming majority of Python users, who only use Unicode-enabled
  builds.

That just doesn't seem like a good trade-off to me. I'll certainly
abide by the release manager's decision but I think we all need to be
thinking more about these kinds of cost-benefit tradeoffs and recognize
that there are often non-obvious costs of making changes, costs that can
affect our entire community. Yes, we are committed to maintaining
Python 2.7 for multiple years but that doesn't mean we have to fix every
open issue or even most open issues. Any or all of the above costs may
apply to any changes we make. For many of our users, the best
maintenance policy for Python 2.7 would be the least change possible.

--
Ned Deily,
***@acm.org

Ethan Furman

2014-06-24 20:10:48 UTC

Post by Ned Deily
Yes, we are committed to maintaining
Python 2.7 for multiple years but that doesn't mean we have to fix every
open issue or even most open issues. Any or all of the above costs may
apply to any changes we make. For many of our users, the best
maintenance policy for Python 2.7 would be the least change possible.

+1

We need to keep 2.7 running, but we don't need to kill ourselves doing it. If a bug has been there for a while, the
affected users are probably working around it by now. ;)

--
~Ethan~

Nick Coghlan

2014-06-24 23:15:27 UTC

Post by Ethan Furman
+1
We need to keep 2.7 running, but we don't need to kill ourselves doing
it. If a bug has been there for a while, the affected users are probably
working around it by now. ;)

Aye, in this case, I'm in the "officially deprecate the feature" camp.
Don't actively try to break it further, just slap a warning in the docs to
say it is no longer a supported configuration.

In my own personal case, I not only wasn't aware that there was still an
option to turn off the Unicode support, but I also wouldn't really class a
build with it turned off as still being Python. As Jim noted, there are
quite a lot of APIs that don't make sense if there's no Unicode type
available.

Cheers,
Nick.


Skip Montanaro

2014-06-25 12:20:49 UTC

Post by Nick Coghlan
Aye, in this case, I'm in the "officially deprecate the feature" camp.

Definitely preferable to the suggestion to remove the configure flag.

Skip

Serhiy Storchaka

2014-06-25 12:58:02 UTC

Post by Ned Deily
- Fixes documented feature that may be of benefit to users of Python in
applications with very limited memory available, although there aren't
any open issues from users requesting this (AFAIK). No benefit to the
overwhelming majority of Python users, who only use Unicode-enabled
builds.

Another benefit: the patches exposed several bugs in the code (mainly
errors in backporting from 3.x).

Victor Stinner

2014-06-25 13:29:01 UTC

Post by Serhiy Storchaka

Other benefit: patches exposed several bugs in code (mainly errors in
backporting from 3.x).

Oh, interesting. Do you have examples of such bugs?

Victor

Serhiy Storchaka

2014-06-25 14:00:42 UTC

Post by Victor Stinner

Post by Serhiy Storchaka
Other benefit: patches exposed several bugs in code (mainly errors in
backporting from 3.x).

Oh, interesting. Do you have examples of such bugs?

In posixpath, the branches for unicode and str should be reversed.
In multiprocessing, .encode('utf-8') is applied to an already utf-8
encoded str (this is a unicode string in Python 3). And there is a
similar error in at least one other place. The tests for bytearray
actually test bytes, not bytearray. That is what I remember.
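
To illustrate the multiprocessing case: in Python 2, calling .encode() on a str first decodes it implicitly with the default codec (ASCII unless reconfigured) and only then encodes. The sketch below spells that implicit step out explicitly so that it also runs on Python 3; the helper name py2_str_encode is made up for this illustration and is not part of any patch.

```python
def py2_str_encode(raw, encoding="utf-8"):
    """Model Python 2's str.encode(): implicit ASCII decode, then encode.

    Hypothetical helper for illustration only.
    """
    return raw.decode("ascii").encode(encoding)


# Pure ASCII bytes survive the round trip unchanged.
ascii_only = b"spam"
ascii_result = py2_str_encode(ascii_only)

# Bytes that are already UTF-8 encoded contain non-ASCII values,
# so the implicit ASCII decode step fails.
already_utf8 = u"caf\u00e9".encode("utf-8")
try:
    py2_str_encode(already_utf8)
except UnicodeDecodeError as exc:
    error = exc
else:
    error = None
```

This is why such a bug can go unnoticed for a long time: it only shows up once non-ASCII data reaches the affected call.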

Nick Coghlan

2014-06-25 23:28:35 UTC

Post by Serhiy Storchaka

Post by Victor Stinner

Post by Serhiy Storchaka
Other benefit: patches exposed several bugs in code (mainly errors in
backporting from 3.x).

Oh, interesting. Do you have examples of such bugs?

In posixpath branches for unicode and str should be reversed.
In multiprocessing .encode('utf-8') is applied on utf-8 encoded str (this
is unicode string in Python 3). And there is similar error in at least one
other place. Tests for bytearray actually test bytes, not bytearray. That
is what I remember.

OK, *that* sounds like an excellent reason to keep the Unicode disabled
builds functional, and make sure they stay that way with a buildbot: to
help make sure we're not accidentally running afoul of the implicit
interoperability between str and unicode when backporting fixes from Python
3.

Helping to ensure correct handling of str values makes this capability
something of benefit to *all* Python 2 users, not just those that turn off
the Unicode support. It also makes it a potentially useful testing tool
when assessing str/unicode handling in general.

Regards,
Nick.

Serhiy Storchaka

2014-06-26 07:15:06 UTC

Post by Nick Coghlan
OK, *that* sounds like an excellent reason to keep the Unicode disabled
builds functional, and make sure they stay that way with a buildbot: to
help make sure we're not accidentally running afoul of the implicit
interoperability between str and unicode when backporting fixes from
Python 3.
Helping to ensure correct handling of str values makes this capability
something of benefit to *all* Python 2 users, not just those that turn
off the Unicode support. It also makes it a potentially useful testing
tool when assessing str/unicode handling in general.

Would you like to review some of the patches?

Antoine Pitrou

2014-06-26 11:04:53 UTC

Hmmm... From my perspective, trying to enforce unicode-disabled builds
will only lower the (already low) chance that I may want to write /
backport bug fixes for 2.7.

For the same reason, I agree with Victor that we should ditch the
threading-disabled builds. It's too much of a hassle for no actual,
practical benefit. People who want a threadless unicodeless Python can
install Python 1.5.2 for all I care.

Regards

Antoine.

Chris Angelico

2014-06-26 12:49:40 UTC

Post by Antoine Pitrou
For the same reason, I agree with Victor that we should ditch the
threading-disabled builds. It's too much of a hassle for no actual,
practical benefit. People who want a threadless unicodeless Python can
install Python 1.5.2 for all I care.

Or some other implementation of Python. It's looking like micropython
will be permanently supporting a non-Unicode build (although I stepped
away from the project after a strong disagreement over what would and
would not make sense, and haven't been following it since). If someone
wants a Python that doesn't have stuff that the core CPython devs
treat as essential, s/he probably wants something like uPy anyway.

ChrisA

Paul Sokolovsky

2014-06-28 10:58:54 UTC

Hello,

On Thu, 26 Jun 2014 22:49:40 +1000

Post by Chris Angelico

Or some other implementation of Python. It's looking like micropython
will be permanently supporting a non-Unicode build

Yes.

Post by Chris Angelico
(although I stepped
away from the project after a strong disagreement over what would and
would not make sense, and haven't been following it since).

Your patches with my further additions were finally merged. Unicode
strings still cannot be enabled by default due to
https://github.com/micropython/micropython/issues/726 . Any help with
reviewing/testing what's currently available is welcome.

Post by Chris Angelico
If someone
wants a Python that doesn't have stuff that the core CPython devs
treat as essential, s/he probably wants something like uPy anyway.

I hinted at it during previous discussions of MicroPython, and would
like to say it again: MicroPython has already embraced a lot of ideas
rejected by CPython, like GC-only operation (which alone is not
something to be proud of, but can you start up and do something in a 2K
heap?) or tagged pointers
(https://mail.python.org/pipermail/python-dev/2004-July/046139.html).
So it should be a good vehicle for trying any unorthodox ideas(*) or
implementations.

* MicroPython already implements intra-module constants for example.

--
Best regards,
Paul mailto:***@gmail.com

Victor Stinner

2014-06-27 23:51:44 UTC

By the way, adding a buildbot for testing Python without thread
support is not enough. The buildbot has been broken for more than
a month and nobody noticed :-p

http://buildbot.python.org/all/builders/AMD64%20Fedora%20without%20threads%203.x/

Ok, I noticed, but I consider that I have spent too much time on this
minor use case. I prefer to leave such tasks to someone else :-)

Victor

Berker Peksağ

2014-06-30 00:08:24 UTC

On Sat, Jun 28, 2014 at 2:51 AM, Victor Stinner

Post by Victor Stinner

By the way, adding a buildbot for testing Python without thread
support is not enough. The buildbot is currently broken since more
than one month and nobody noticed :-p

I opened http://bugs.python.org/issue21755 a couple of weeks ago to
fix the test.

--Berker

Post by Victor Stinner
http://buildbot.python.org/all/builders/AMD64%20Fedora%20without%20threads%203.x/
Ok, I noticed, but I consider that I spent too much time on this minor
use case. I prefer to leave such task to someone else :-)
Victor

Terry Reedy

2014-06-24 14:24:01 UTC

This list and more to follow suggests that --disable-unicode was
somewhat broken long before 2.7 and the introduction of _io.

Post by Serhiy Storchaka
http://bugs.python.org/issue21854 -- cookielib
http://bugs.python.org/issue21838 -- ctypes
http://bugs.python.org/issue21855 -- decimal
http://bugs.python.org/issue21839 -- distutils
http://bugs.python.org/issue21843 -- doctest
http://bugs.python.org/issue21851 -- gettext
http://bugs.python.org/issue21844 -- HTMLParser
http://bugs.python.org/issue21850 -- httplib and SimpleHTTPServer
http://bugs.python.org/issue21842 -- IDLE
http://bugs.python.org/issue21853 -- inspect
http://bugs.python.org/issue21848 -- logging
http://bugs.python.org/issue21849 -- multiprocessing
http://bugs.python.org/issue21852 -- optparse
http://bugs.python.org/issue21840 -- os.path
http://bugs.python.org/issue21845 -- plistlib
http://bugs.python.org/issue21836 -- sqlite3
http://bugs.python.org/issue21837 -- tarfile
http://bugs.python.org/issue21835 -- Tkinter
http://bugs.python.org/issue21847 -- xmlrpc
http://bugs.python.org/issue21841 -- xml.sax
http://bugs.python.org/issue21846 -- zipfile
Most fixes are trivial and are only several lines of a code.

--
Terry Jan Reedy

Jim J. Jewett

2014-06-24 21:03:27 UTC

It has frequently been broken. Without a buildbot, it will continue
to break. I have given at least a quick look at all your proposed
changes; most are fixes to test code, such as skip decorators.

People checked in tests without the right guards because it did work
on their own builds, and on all stable buildbots. That will probably
continue to happen unless/until a --disable-unicode buildbot is added.
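
As an illustration of the kind of guard meant here, a skip decorator for unicode-dependent tests might look roughly like the sketch below. The names HAVE_UNICODE, requires_unicode and DecodeTests are chosen for this example and are assumptions, not necessarily what the patches use. On Python 3 the name unicode does not exist either, so the guarded test is skipped there as well.

```python
import unittest

# Probe for the unicode type; on a --disable-unicode build of Python 2
# (and on Python 3) the name is not defined.
try:
    unicode
    HAVE_UNICODE = True
except NameError:
    HAVE_UNICODE = False

# Illustrative decorator name (an assumption for this sketch).
requires_unicode = unittest.skipUnless(HAVE_UNICODE, "no unicode support")


class DecodeTests(unittest.TestCase):
    @requires_unicode
    def test_decode_returns_unicode(self):
        # Only meaningful when the unicode type exists.
        self.assertIsInstance(b"abc".decode("ascii"), unicode)

    def test_bytes_concatenation(self):
        # Does not need unicode, so it runs on every build.
        self.assertEqual(b"ab" + b"c", b"abc")


def run_tests():
    """Run the example tests and return the TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DecodeTests)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

A test without such a guard passes on every Unicode-enabled build, which is why the missing guard is invisible until someone runs the suite on a build without unicode.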

It would be good to fix the tests (and actual library issues).
Unfortunately, some of the specifically proposed changes (such as
defining and using _unicode instead of unicode within python code)
look to me as though they would trigger problems in the normal build
(where the unicode object *does* exist, but would no longer be used).
Other changes, such as the use of \x escapes, appear correct, but make
the tests harder to read -- and might end up removing a test for
correct unicode functionality across different spellings.

Even if we assume that the tests are fine, and I'm just an idiot who
misread them, the fact that there is any confusion means that these
particular changes may be tricky enough to be a bad tradeoff for 2.7.

It *might* work if you could make a more focused change. For example,
instead of leaving the 'unicode' name unbound, provide an object that
simply returns false for isinstance and raises a UnicodeError for any
other method call. Even *this* might be too aggressive for 2.7, but the
fact that it would only appear in the --disable-unicode builds, and
would make them more similar to the regular build are points in its
favor.
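
A rough sketch of what such a placeholder object could look like follows. This is purely hypothetical: the names _MissingUnicodeType and unicode_placeholder are invented here, and nothing like this exists in CPython. The class is created by calling the metaclass directly so that the same code runs on Python 2 and Python 3.

```python
class _MissingUnicodeType(type):
    """Hypothetical metaclass for a stand-in 'unicode' object."""

    def __instancecheck__(cls, instance):
        # isinstance(x, placeholder) is always False.
        return False

    def __call__(cls, *args, **kwargs):
        # Trying to construct a "unicode" value fails loudly.
        raise UnicodeError("this build has no Unicode support")

    def __getattr__(cls, name):
        # Let missing special attributes behave normally.
        if name.startswith("__") and name.endswith("__"):
            raise AttributeError(name)
        # Any other attribute access (e.g. a method lookup) fails.
        raise UnicodeError("this build has no Unicode support")


# Build the placeholder class by calling the metaclass directly.
unicode_placeholder = _MissingUnicodeType("unicode", (object,), {})


def raises_unicode_error(func):
    """Return True if calling func() raises UnicodeError."""
    try:
        func()
    except UnicodeError:
        return True
    return False
```

With this shape, code such as isinstance(value, unicode_placeholder) keeps working and simply takes the non-unicode branch, while any attempt to actually use the type raises immediately.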

Before doing that, though, please document what the --disable-unicode
mode is actually *supposed* to do when interacting with byte-streams
that a standard defines as UTF-8. (For example, are the changes to
_xml_dumps and _xml_loads at
http://bugs.python.org/file35758/multiprocessing.patch
correct, or do those functions assume they get bytes as input, or
should the functions raise an exception any time they are called?)

-jJ

--

If there are still threading problems with my replies, please
email me with details, so that I can try to resolve them. -jJ

Serhiy Storchaka

2014-06-25 12:55:35 UTC

Post by Jim J. Jewett
It would be good to fix the tests (and actual library issues).
Unfortunately, some of the specifically proposed changes (such as
defining and using _unicode instead of unicode within python code)
look to me as though they would trigger problems in the normal build
(where the unicode object *does* exist, but would no longer be used).

This idiom is recommended by MvL [1] and widely used (19 times in the
source code).

[1] http://bugs.python.org/issue8767#msg159473
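
For readers unfamiliar with it, one common shape of such an idiom is sketched below. The describe helper is invented for this illustration. When the unicode type exists, _unicode is simply an alias for it, so the normal build still uses the real type; when it does not exist, a dummy class is substituted that no value is ever an instance of.

```python
try:
    # Unicode-enabled Python 2 build: alias the real type.
    _unicode = unicode
except NameError:
    # No unicode type available: substitute a dummy class so that
    # isinstance() checks remain valid and always return False.
    class _unicode(object):
        pass


def describe(value):
    """Hypothetical helper showing how the alias is used."""
    if isinstance(value, _unicode):
        return "unicode text"
    return "something else"
```

The point of the alias is that module code can keep writing isinstance(value, _unicode) without caring which kind of build it is running on.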

Post by Jim J. Jewett
Other changes, such as the use of \x escapes, appear correct, but make
the tests harder to read -- and might end up removing a test for
correct unicode functionality across different spellings.
Even if we assume that the tests are fine, and I'm just an idiot who
misread them, the fact that there is any confusion means that these
particular changes may be tricky enough to be a bad tradeoff for 2.7.
It *might* work if you could make a more focused change. For example,
instead of leaving the 'unicode' name unbound, provide an object that
simply returns false for isinstance and raises a UnicodeError for any
other method call. Even *this* might be too aggressive for 2.7, but the
fact that it would only appear in the --disable-unicode builds, and
would make them more similar to the regular build are points in its
favor.

No, the existing code uses a different approach. "unicode" doesn't
exist, while the encode/decode methods exist but are useless. If my
memory doesn't fail me, there is even a special explanatory comment
about this historical decision somewhere. This decision was made many
years ago.

Post by Jim J. Jewett
Before doing that, though, please document what the --disable-unicode
mode is actually *supposed* to do when interacting with byte-streams
that a standard defines as UTF-8. (For example, are the changes to
_xml_dumps and _xml_loads at
http://bugs.python.org/file35758/multiprocessing.patch
correct, or do those functions assume they get bytes as input, or
should the functions raise an exception any time they are called?)

Looking more carefully, I see that there is a bug in the unicode-enabled
build (wrong backporting from 3.x). In 2.x xmlrpclib.dumps produces an
already utf-8 encoded string, while in 3.x xmlrpc.client.dumps produces
a unicode string. multiprocessing should fail with non-ascii str or
unicode.

A side benefit of my patches is that they expose existing errors in the
unicode-enabled build.