On 2/14/06, Thomas Wouters <***@xs4all.net> wrote:
> On Mon, Feb 13, 2006 at 03:44:27PM -0800, Guido van Rossum wrote:
> > But adding an encoding doesn't help. The str.encode() method always
> > assumes that the string itself is ASCII-encoded, and that's not good
> > enough:
> > >>> "abc".encode("latin-1")
> > 'abc'
> > >>> "abc".decode("latin-1")
> > u'abc'
> > >>> "abc\xf0".decode("latin-1")
> > u'abc\xf0'
> > >>> "abc\xf0".encode("latin-1")
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in ?
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 3: ordinal not in range(128)
(Note that I've since been convinced that bytes(s) where type(s) ==
str should just return a bytes object containing the same bytes as s,
regardless of encoding. So basically you're preaching to the choir
now. The only remaining question is what, if anything, to do with an
encoding argument when the first argument is of type str...)
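
To make the quoted traceback concrete, here is a minimal sketch, in Python 3 syntax, of the two steps that str.encode() appears to perform on a byte string in 2.x: an implicit decode with the default encoding, then the requested encode. The helper name py2_str_encode and its "default" parameter are hypothetical, made up for illustration, and bytes stands in for the 2.x str type.

```python
# Sketch only: py2_str_encode is a hypothetical helper, not a real API.
# It emulates the implicit decode that 2.x performs before encoding a str.

def py2_str_encode(raw: bytes, encoding: str, default: str = "ascii") -> bytes:
    # Step 1: the implicit decode, using the default (system) encoding.
    text = raw.decode(default)
    # Step 2: the encode that the caller actually asked for.
    return text.encode(encoding)


# Pure ASCII survives both steps unchanged.
print(py2_str_encode(b"abc", "latin-1"))

# A non-ASCII byte fails in step 1, before "latin-1" is ever consulted.
try:
    py2_str_encode(b"abc\xf0", "latin-1")
except UnicodeDecodeError as exc:
    print("implicit decode failed:", exc.reason)
```

The error names the 'ascii' codec even though the caller asked for latin-1, because the failure happens in the hidden first step.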
> These comments disturb me. I never really understood why (byte) strings grew
> the 'encode' method, since 8-bit strings *are already encoded*, by their
> very nature. I mean, I understand it's useful because Python does
> non-unicode encodings like 'hex', but I don't really understand *why*. The
> benefits don't seem to outweigh the cost (but that's hindsight.)
It may also have something to do with Jython compatibility (which has
str and unicode being the same thing) or 3.0 future-proofing.
> Directly encoding a (byte) string into a unicode encoding is mostly useless,
> as you've shown. The only use-case I can think of is translating ASCII into,
> for instance, EBCDIC. Encoding anything into an ASCII superset is a no-op,
> unless the system encoding isn't 'ascii' (and that's pretty rare, and not
> something a Python programmer should depend on.) On the other hand, the fact
> that (byte) strings have an 'encode' method creates a lot of confusion in
> unicode-newbies, and causes programs to break only when input is non-ASCII.
> And non-ASCII input just happens too often and too unpredictably in
> 'real-world' code, and not enough in European programmers' tests ;P
Oh, there are lots of ways that non-ASCII input can break code, you
don't have to invoke encode() on str objects to get that effect. :/
> Unicode objects and strings are not the same thing. We shouldn't treat them
> as the same thing.
Well, in 3.0 they *will* be the same thing, and in Jython they already are.
> They share an interface (like lists and tuples do), and
> if you only use that interface, treating them as the same kind of object is
> mostly ok. They actually share *less* of an interface than lists and tuples,
> though, as comparing strings to unicode objects can raise an exception,
> whereas comparing lists to tuples is not expected to.
No, it causes silent surprises since [1,2,3] != (1,2,3).
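
A short sketch of that silent surprise. The variable names are illustrative only.

```python
# Lists and tuples with equal elements never compare equal, and no error is raised.
as_list = [1, 2, 3]
as_tuple = (1, 2, 3)

# Same elements, different types: the comparison is quietly False.
print(as_list == as_tuple)

# An explicit conversion states the intent and compares equal.
print(tuple(as_list) == as_tuple)

# Membership tests inherit the same behaviour.
print(as_tuple in [as_list])
```

Nothing here raises an exception, so the mismatch only shows up as a wrong result further down the line.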
> For anything less
> trivial than indexing, slicing and most of the string methods, and anything
> whatsoever involving non-ASCII (or, rather, non-system-encoding), unicode
> objects and strings *must* be treated separately. For instance, there is no
> correct way to do:
> unless you know the type of 's'. If it's unicode, you want u"\x80" instead
> of "\x80". If it's not unicode, splitting "\x80" may not even be sensible,
> but you wouldn't know from looking at the code -- maybe it expects a
> specific encoding (or encoding family), maybe not. As soon as you deal with
> unicode, you need to really understand the concept, and too many programmers
> don't. And it's very hard to tell from someone's comments whether they fail
> to understand or just get some of the terminology wrong; that's why Guido's
> comments about 'encoding a byte string' and 'what if the file encoding is
> Unicode' scare me. The unicode/string mixup almost makes me wish Python
> was statically typed.
I'm mostly trying to reflect various broken mental models that users
may have. Believe me, my own confusion is nothing compared to the
confusion that occurs in less gifted users. :-)
The only use case for mixing ASCII and Unicode that I *wanted* to work
right was the mixing of pure ASCII strings (typically literals) with
Unicode data. And that works.
Where things unfortunately fall flat is when you start reading data
from files or interactive input and it gives you some encoded str
object instead of a Unicode object. Our mistake was that we didn't
foresee this clearly enough. Perhaps open(filename).read(), where the
file contains non-ASCII bytes, should have been changed to either
return a Unicode string (if an encoding can somehow be guessed), or
raise an exception, rather than returning an str object in some
unknown (and usually unknowable) encoding.
I hope to fix that in 3.0 too, BTW.
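
A sketch of the behaviour described above, written against the open() of present-day Python 3 rather than anything that existed when this was posted. The temporary file and the variable names are assumptions made up for the example.

```python
import os
import tempfile

# Write raw Latin-1 bytes, as some legacy tool might have done.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"abc\xf0")

try:
    # Binary mode: bytes come back with no encoding attached.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode with a known encoding: Unicode text comes back.
    with open(path, encoding="latin-1") as f:
        text = f.read()

    # Text mode with a wrong guess: an exception instead of silent garbage.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        failed = False
    except UnicodeDecodeError:
        failed = True
finally:
    os.remove(path)

print(repr(raw), repr(text), failed)
```

The caller either states the encoding and gets text, or asks for bytes explicitly. There is no third case that hands back an encoded string of unknown origin.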
> So please, please, please don't make the mistake of 'doing something' with
> the 'encoding' argument to 'bytes(s, encoding)' when 's' is a (byte) string.
> It wouldn't actually be usable except for the same things as 'str.encode':
> to convert from ASCII to non-ASCII-supersets, or to convert to non-unicode
> encodings (such as 'hex'.) You can achieve those two by doing, e.g.,
> 'bytes(s.encode('hex'))' if you really want to. Ignoring the encoding
> (rather than raising an exception) would also allow code to be trivially
> portable between Python 2.x and Py3K, when "" is actually a unicode object.
> Not that I'm happy with ignoring anything, but not ignoring would be a bigger
> crime here.
I'm beginning to see that this is a pretty reasonable interpretation.
> Oh, and while on the subject, I'm not convinced going all-unicode in Py3K is
> a good idea either, but maybe I should save that discussion for PyCon. I'm
> not thinking "why do we need unicode" anymore (which I did two years ago ;)
> but I *am* thinking it'll be a big step for 90% of the programmers if they
> have to grasp unicode and encodings to be able to even do 'raw_input()'
> sensibly. I know I spend an inordinate amount of time trying to explain the
> basics on #python on irc.freenode.net already.
I'm actually hoping that by having all strings be Unicode we'd
*reduce* the amount of confusion. The key (see above where I admitted
this as our biggest Unicode mistake) is to make sure that the
encoding/decoding is built into all I/O operations.
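
One way to picture encoding/decoding living inside the I/O layer is a text wrapper around a byte stream. This is a sketch using io.TextIOWrapper from current Python 3, with an in-memory buffer standing in for a real file. The choice of UTF-8 is an assumption for the example.

```python
import io

# The byte stream underneath; in real code this would be a file or socket.
buffer = io.BytesIO()

# The wrapper owns the encoding, so callers only ever handle text.
writer = io.TextIOWrapper(buffer, encoding="utf-8", newline="")
writer.write("abc\xf0")
writer.flush()
encoded = buffer.getvalue()

# Reading back through a wrapper decodes at the same boundary.
reader = io.TextIOWrapper(io.BytesIO(encoded), encoding="utf-8", newline="")
decoded = reader.read()

print(repr(encoded), repr(decoded))
```

Application code between the two wrappers never sees encoded bytes, which is the point of putting the conversion at the I/O boundary.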
--Guido van Rossum (home page: http://www.python.org/~guido/)