Discussion:
Internal representation of strings and Micropython
Steven D'Aprano
2014-06-04 01:17:18 UTC
Permalink
There is a discussion over at MicroPython about the internal
representation of Unicode strings. Micropython is aimed at embedded
devices, and so minimizing memory use is important, possibly even
more important than performance.

(I'm not speaking on their behalf, just commenting as an interested
outsider.)

At the moment, their Unicode support is patchy. They are talking about
either:

* Having a build-time option to restrict all strings to ASCII-only.

(I think what they mean by that is that strings will be like Python 2
strings, ASCII-plus-arbitrary-bytes, not actually ASCII.)

* Implementing Unicode internally as UTF-8, and giving up O(1)
indexing operations.

https://github.com/micropython/micropython/issues/657


Would either of these trade-offs be acceptable while still claiming
"Python 3.4 compatibility"?

My own feeling is that O(1) string indexing operations are a quality of
implementation issue, not a deal breaker to call it a Python. I can't
see any requirement in the docs that str[n] must take O(1) time, but
perhaps I have missed something.




--
Steven
Donald Stufft
2014-06-04 01:46:22 UTC
Permalink
I think UTF8 is the best option.

> On Jun 3, 2014, at 9:17 PM, Steven D'Aprano <***@pearwood.info> wrote:
>
> There is a discussion over at MicroPython about the internal
> representation of Unicode strings. Micropython is aimed at embedded
> devices, and so minimizing memory use is important, possibly even
> more important than performance.
>
> (I'm not speaking on their behalf, just commenting as an interested
> outsider.)
>
> At the moment, their Unicode support is patchy. They are talking about
> either:
>
> * Having a build-time option to restrict all strings to ASCII-only.
>
> (I think what they mean by that is that strings will be like Python 2
> strings, ASCII-plus-arbitrary-bytes, not actually ASCII.)
>
> * Implementing Unicode internally as UTF-8, and giving up O(1)
> indexing operations.
>
> https://github.com/micropython/micropython/issues/657
>
>
> Would either of these trade-offs be acceptable while still claiming
> "Python 3.4 compatibility"?
>
> My own feeling is that O(1) string indexing operations are a quality of
> implementation issue, not a deal breaker to call it a Python. I can't
> see any requirement in the docs that str[n] must take O(1) time, but
> perhaps I have missed something.
>
>
>
>
> --
> Steven
> _______________________________________________
> Python-Dev mailing list
> Python-***@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: https://mail.python.org/mailman/options/python-dev/donald%40stufft.io
Kristján Valur Jónsson
2014-06-04 11:15:34 UTC
Permalink
For those that haven't seen this:

http://www.utf8everywhere.org/

> -----Original Message-----
> From: Python-Dev [mailto:python-dev-
> bounces+kristjan=***@python.org] On Behalf Of Donald Stufft
> Sent: 4. júní 2014 01:46
> To: Steven D'Aprano
> Cc: python-***@python.org
> Subject: Re: [Python-Dev] Internal representation of strings and
> Micropython
>
> I think UTF8 is the best option.
>
Chris Angelico
2014-06-04 02:32:12 UTC
Permalink
On Wed, Jun 4, 2014 at 11:17 AM, Steven D'Aprano <***@pearwood.info> wrote:
> * Having a build-time option to restrict all strings to ASCII-only.
>
> (I think what they mean by that is that strings will be like Python 2
> strings, ASCII-plus-arbitrary-bytes, not actually ASCII.)

What I was actually suggesting along those lines was that the str type
still be notionally a Unicode string, but that any codepoints >127
would either raise an exception or blow an assertion, and all the code
to handle multibyte representations would be compiled out. So there'd
still be a difference between strings of text and streams of bytes,
but all encoding and decoding to/from ASCII-compatible encodings would
just point to the same bytes in RAM.

Risk: Someone would implement that with assertions, then compile with
assertions disabled, test only with ASCII, and have lurking bugs.

ChrisA
Guido van Rossum
2014-06-04 05:23:07 UTC
Permalink
On Tue, Jun 3, 2014 at 7:32 PM, Chris Angelico <***@gmail.com> wrote:

> On Wed, Jun 4, 2014 at 11:17 AM, Steven D'Aprano <***@pearwood.info>
> wrote:
> > * Having a build-time option to restrict all strings to ASCII-only.
> >
> > (I think what they mean by that is that strings will be like Python 2
> > strings, ASCII-plus-arbitrary-bytes, not actually ASCII.)
>
> What I was actually suggesting along those lines was that the str type
> still be notionally a Unicode string, but that any codepoints >127
> would either raise an exception or blow an assertion, and all the code
> to handle multibyte representations would be compiled out.


That would be a pretty lousy option.

So there'd
> still be a difference between strings of text and streams of bytes,
> but all encoding and decoding to/from ASCII-compatible encodings would
> just point to the same bytes in RAM.
>

I suppose this is why you propose to reject 128-255?


> Risk: Someone would implement that with assertions, then compile with
> assertions disabled, test only with ASCII, and have lurking bugs.
>

Never mind disabling assertions -- even with enabled assertions you'd have
to expect most Python programs to fail with non-ASCII input.

Then again the UTF-8 option would be pretty devastating too for anything
manipulating strings (especially since many Python APIs are defined using
indexes, e.g. the re module).

Why not support variable-width strings like CPython 3.4?

--
--Guido van Rossum (python.org/~guido)
Chris Angelico
2014-06-04 07:03:22 UTC
Permalink
On Wed, Jun 4, 2014 at 3:23 PM, Guido van Rossum <***@python.org> wrote:
> On Tue, Jun 3, 2014 at 7:32 PM, Chris Angelico <***@gmail.com> wrote:
>>
>> On Wed, Jun 4, 2014 at 11:17 AM, Steven D'Aprano <***@pearwood.info>
>> wrote:
>> > * Having a build-time option to restrict all strings to ASCII-only.
>> >
>> > (I think what they mean by that is that strings will be like Python 2
>> > strings, ASCII-plus-arbitrary-bytes, not actually ASCII.)
>>
>> What I was actually suggesting along those lines was that the str type
>> still be notionally a Unicode string, but that any codepoints >127
>> would either raise an exception or blow an assertion, and all the code
>> to handle multibyte representations would be compiled out.
>
>
> That would be a pretty lousy option.
>
>> So there'd
>> still be a difference between strings of text and streams of bytes,
>> but all encoding and decoding to/from ASCII-compatible encodings would
>> just point to the same bytes in RAM.
>
> I suppose this is why you propose to reject 128-255?

Correct. It would allow small devices to guarantee that strings are
compact (MicroPython is aimed primarily at an embedded controller),
guarantee identity transformations in several common encodings (and
maybe this sort of build wouldn't ship with any non-ASCII-compat
encodings at all), and never demonstrate behaviour different from
CPython's except by explicitly failing.

>> Risk: Someone would implement that with assertions, then compile with
>> assertions disabled, test only with ASCII, and have lurking bugs.
>
>
> Never mind disabling assertions -- even with enabled assertions you'd have
> to expect most Python programs to fail with non-ASCII input.

Right, which is why I don't like the idea. But you don't need
non-ASCII characters to blink an LED or turn a servo, and there is
significant resistance to the notion that appending a non-ASCII
character to a long ASCII-only string requires the whole string to be
copied and doubled in size (lots of heap space used).

> Then again the UTF-8 option would be pretty devastating too for anything
> manipulating strings (especially since many Python APIs are defined using
> indexes, e.g. the re module).

That's what I thought, too, but a quick poll on python-list suggests
that indexing isn't nearly as common as I had thought it to be. On a
smallish device, you won't have megabytes of string to index, so even
O(N) indexing can't get pathological. (This would be an acknowledged
limitation of micropython as a Unix Python - "it's designed for small
programs, and it's performance-optimized for small programs, so it
might get pathologically slow on certain large data manipulations".)

> Why not support variable-width strings like CPython 3.4?

That was my first recommendation, and in fact I started writing code
to implement parts of PEP 393, with a view to basically doing it the
same way in both Pythons. But discussion on the tracker issue showed a
certain amount of hostility toward the potential expansion of strings,
particularly in the worst-case example of appending a single SMP
character onto a long ASCII string.

ChrisA
Paul Sokolovsky
2014-06-04 11:12:31 UTC
Permalink
Hello,

On Wed, 4 Jun 2014 17:03:22 +1000
Chris Angelico <***@gmail.com> wrote:

[]

> > Why not support variable-width strings like CPython 3.4?
>
> That was my first recommendation, and in fact I started writing code
> to implement parts of PEP 393, with a view to basically doing it the
> same way in both Pythons. But discussion on the tracker issue showed a
> certain amount of hostility toward the potential expansion of strings,
> particularly in the worst-case example of appending a single SMP
> character onto a long ASCII string.

An alternative view is that the discussion on the tracker showed Python
developers' mind-fixation on implementing something the way CPython does
it. And I didn't yet go to that argument, but in the end, MicroPython
does not try to rewrite CPython or compete with it. So, having few
choices with pros and cons leading approximately to the tie among them,
it's the least productive to make the same choice as CPython did.

Even having "rule of thumb" of choosing not-a-CPython way would be more
productive than having the same rule of thumb for blindly choosing
CPython way. (Of course, actually it should be technical discussion
based on the target requirements, like we hopefully did, with strong
arguments against using something else but the de-facto standard
transfer encoding for Unicode).


>
> ChrisA

--
Best regards,
Paul mailto:***@gmail.com
Chris Angelico
2014-06-04 11:17:12 UTC
Permalink
On Wed, Jun 4, 2014 at 9:12 PM, Paul Sokolovsky <***@gmail.com> wrote:
> An alternative view is that the discussion on the tracker showed Python
> developers' mind-fixation on implementing something the way CPython does
> it. And I didn't yet go to that argument, but in the end, MicroPython
> does not try to rewrite CPython or compete with it. So, having few
> choices with pros and cons leading approximately to the tie among them,
> it's the least productive to make the same choice as CPython did.

I'm not a CPython dev, nor a Python dev, and I don't think any of the
big names of CPython or Python has showed up on that tracker as yet.
But why is "be different from CPython" such a valuable choice? CPython
works. It's had many hours of dev time put into it. Problems have been
identified and avoided. Throwing that out means throwing away a
freely-given shoulder to stand on, in an Isaac Newton way.

http://www.joelonsoftware.com/articles/fog0000000069.html

ChrisA
Daniel Holth
2014-06-04 11:35:28 UTC
Permalink
Can of worms, opened.
On Jun 4, 2014 7:20 AM, "Chris Angelico" <***@gmail.com> wrote:

> On Wed, Jun 4, 2014 at 9:12 PM, Paul Sokolovsky <***@gmail.com> wrote:
> > An alternative view is that the discussion on the tracker showed Python
> > developers' mind-fixation on implementing something the way CPython does
> > it. And I didn't yet go to that argument, but in the end, MicroPython
> > does not try to rewrite CPython or compete with it. So, having few
> > choices with pros and cons leading approximately to the tie among them,
> > it's the least productive to make the same choice as CPython did.
>
> I'm not a CPython dev, nor a Python dev, and I don't think any of the
> big names of CPython or Python has showed up on that tracker as yet.
> But why is "be different from CPython" such a valuable choice? CPython
> works. It's had many hours of dev time put into it. Problems have been
> identified and avoided. Throwing that out means throwing away a
> freely-given shoulder to stand on, in an Isaac Newton way.
>
> http://www.joelonsoftware.com/articles/fog0000000069.html
>
> ChrisA
> _______________________________________________
> Python-Dev mailing list
> Python-***@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/dholth%40gmail.com
>
Paul Sokolovsky
2014-06-04 12:18:01 UTC
Permalink
Hello,

On Wed, 4 Jun 2014 21:17:12 +1000
Chris Angelico <***@gmail.com> wrote:

> On Wed, Jun 4, 2014 at 9:12 PM, Paul Sokolovsky <***@gmail.com>
> wrote:
> > An alternative view is that the discussion on the tracker showed
> > Python developers' mind-fixation on implementing something the way
> > CPython does it. And I didn't yet go to that argument, but in the
> > end, MicroPython does not try to rewrite CPython or compete with
> > it. So, having few choices with pros and cons leading approximately
> > to the tie among them, it's the least productive to make the same
> > choice as CPython did.
>
> I'm not a CPython dev, nor a Python dev, and I don't think any of the
> big names of CPython or Python has showed up on that tracker as yet.
> But why is "be different from CPython" such a valuable choice? CPython
> works. It's had many hours of dev time put into it.

Exactly, CPython (already) exists, and it works, so people can just use
it. MicroPython's aim is to go where CPython didn't, and couldn't, go.
For that, it's got to be different, or it literally won't fit there,
like CPython doesn't.

[]

--
Best regards,
Paul mailto:***@gmail.com
Serhiy Storchaka
2014-06-04 14:17:29 UTC
Permalink
04.06.14 10:03, Chris Angelico написав(ла):
> Right, which is why I don't like the idea. But you don't need
> non-ASCII characters to blink an LED or turn a servo, and there is
> significant resistance to the notion that appending a non-ASCII
> character to a long ASCII-only string requires the whole string to be
> copied and doubled in size (lots of heap space used).

But you need non-ASCII characters to display a title of MP3 track.
Chris Angelico
2014-06-04 14:26:10 UTC
Permalink
On Thu, Jun 5, 2014 at 12:17 AM, Serhiy Storchaka <***@gmail.com> wrote:
> 04.06.14 10:03, Chris Angelico написав(ла):
>
>> Right, which is why I don't like the idea. But you don't need
>> non-ASCII characters to blink an LED or turn a servo, and there is
>> significant resistance to the notion that appending a non-ASCII
>> character to a long ASCII-only string requires the whole string to be
>> copied and doubled in size (lots of heap space used).
>
>
> But you need non-ASCII characters to display a title of MP3 track.

Agreed. IMO, any Python, no matter how micro, needs full Unicode
support; but there is resistance from uPy's devs.

ChrisA
Paul Sokolovsky
2014-06-04 14:49:30 UTC
Permalink
Hello,

On Thu, 5 Jun 2014 00:26:10 +1000
Chris Angelico <***@gmail.com> wrote:

> On Thu, Jun 5, 2014 at 12:17 AM, Serhiy Storchaka
> <***@gmail.com> wrote:
> > 04.06.14 10:03, Chris Angelico написав(ла):
> >
> >> Right, which is why I don't like the idea. But you don't need
> >> non-ASCII characters to blink an LED or turn a servo, and there is
> >> significant resistance to the notion that appending a non-ASCII
> >> character to a long ASCII-only string requires the whole string to
> >> be copied and doubled in size (lots of heap space used).
> >
> >
> > But you need non-ASCII characters to display a title of MP3 track.

Yes, but to display a title, you don't need to do codepoint access at
random - you need to either take a block of memory (length in bytes) and
do something with it (pass to a C function, transfer over some bus,
etc.), or *iterate in order* over codepoints in a string. All these
operations are as efficient (O-notation) for UTF-8 as for UTF-32.

Some operations are not going to be as fast, so - oops - avoid doing
them without good reason. And kindly drop expectations that doing
arbitrary operations on *Unicode* are as efficient as you imagined.
(Note the *Unicode* in general, not particular flavor of which you got
used to, up to thinking it's the one and only "right" flavor.)

> Agreed. IMO, any Python, no matter how micro, needs full Unicode
> support; but there is resistance from uPy's devs.

FUD ;-).

>
> ChrisA

--
Best regards,
Paul mailto:***@gmail.com
Chris Angelico
2014-06-04 15:00:52 UTC
Permalink
On Thu, Jun 5, 2014 at 12:49 AM, Paul Sokolovsky <***@gmail.com> wrote:
>> > But you need non-ASCII characters to display a title of MP3 track.
>
> Yes, but to display a title, you don't need to do codepoint access at
> random - you need to either take a block of memory (length in bytes) and
> do something with it (pass to a C function, transfer over some bus,
> etc.), or *iterate in order* over codepoints in a string. All these
> operations are as efficient (O-notation) for UTF-8 as for UTF-32.

Suppose you have a long title, and you need to abbreviate it by
dropping out words (delimited by whitespace), such that you keep the
first word (always) and the last (if possible) and as many as possible
in between. How are you going to write that? With PEP 393 or UTF-32
strings, you can simply record the index of every whitespace you find,
count off lengths, and decide what to keep and what to ellipsize.

> Some operations are not going to be as fast, so - oops - avoid doing
> them without good reason. And kindly drop expectations that doing
> arbitrary operations on *Unicode* are as efficient as you imagined.
> (Note the *Unicode* in general, not particular flavor of which you got
> used to, up to thinking it's the one and only "right" flavor.)

Not sure what you mean by flavors of Unicode. Unicode is a mapping of
codepoints to characters, not an in-memory representation. And I've
been working with Python 3.3 since before it came out, and with Pike
(which has a very similar model) for longer, and in both of them, I
casually perform operations on Unicode strings in the same way that I
used to perform operations on REXX strings (which were eight-bit in
the current system codepage - 437 for us). I do expect those
operations to be efficient, and I get what I expect.

Maybe they won't be in uPy, but that would be a limitation of uPy, not
a fundamental problem with Unicode.

ChrisA
Steve Dower
2014-06-04 15:32:25 UTC
Permalink
Received: from localhost (HELO mail.python.org) (127.0.0.1)
by albatross.python.org with SMTP; 04 Jun 2014 17:32:36 +0200
Received: from na01-bn1-obe.outbound.protection.outlook.com (unknown
[207.46.163.189]) (using TLSv1 with cipher AES128-SHA (128/128 bits))
(No client certificate requested)
by mail.python.org (Postfix) with ESMTPS
for <python-***@python.org>; Wed, 4 Jun 2014 17:32:36 +0200 (CEST)
Received: from BLUPR03MB391.namprd03.prod.outlook.com (10.141.78.21) by
BLUPR03MB002.namprd03.prod.outlook.com (10.255.208.36) with Microsoft SMTP
Server (TLS) id 15.0.954.9; Wed, 4 Jun 2014 15:32:27 +0000
Received: from BLUPR03MB389.namprd03.prod.outlook.com (10.141.78.11) by
BLUPR03MB391.namprd03.prod.outlook.com (10.141.78.21) with Microsoft SMTP
Server (TLS) id 15.0.954.9; Wed, 4 Jun 2014 15:32:26 +0000
Received: from BLUPR03MB389.namprd03.prod.outlook.com ([10.141.78.11]) by
BLUPR03MB389.namprd03.prod.outlook.com ([10.141.78.11]) with mapi id
15.00.0954.000; Wed, 4 Jun 2014 15:32:26 +0000
Thread-Topic: [Python-Dev] Internal representation of strings and Micropython
Thread-Index: AQHPf5LaJ/VbLGhP9k2hZIYnN6PWj5tgOzsAgAAvwYCAABwCAIAAeUuAgAACbQCAAAaFAIAAAy0AgAAHhFAIn-Reply-To: <CAPTjJmp4Q8-Q6´BXB_mupv+***@mail.gmail.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
x-originating-ip: [2001:4898:80e8:ee31::3]
x-microsoft-antispam: BL:0; ACTION:Default; RISK:Low; SCL:0; SPMLVL:NotSpam;
PCL:0; RULEID:
x-forefront-prvs: 0232B30BBC
x-forefront-antispam-report: SFV:NSPM;
SFS:(6009001)(428001)(189002)(199002)(51704005)(24454002)(74502001)(21056001)(99396002)(99286001)(101416001)(86362001)(92566001)(83072002)(31966008)(77096999)(54356999)(76176999)(50986999)(74662001)(86612001)(83322001)(93886002)(80022001)(85852003)(2656002)(81342001)(33646001)(4396001)(81542001)(74316001)(64706001)(76482001)(77982001)(76576001)(87936001)(46102001)(20776003)(79102001)(24736002);
DIR:OUT; SFP:; SCL:1; SRVR:BLUPR03MB391;
H:BLUPR03MB389.namprd03.prod.outlook.com; FPR:; MLV:sfv; PTR:InfoNoRecords;
MX:1; A:1; LANG:en;
received-spf: None (: microsoft.com does not designate permitted sender hosts)
authentication-results: spf=none (sender IP is )
smtp.mailfrom=***@microsoft.com;
X-Microsoft-Antispam: BL:0; ACTION:Default; RISK:Low; SCL:0; SPMLVL:NotSpam;
PCL:0; RULEID:
X-OriginatorOrg: microsoft.onmicrosoft.com
X-BeenThere: python-***@python.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Python core developers <python-dev.python.org>
List-Unsubscribe: <https://mail.python.org/mailman/options/python-dev>,
<mailto:python-dev-***@python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/pipermail/python-dev/>
List-Post: <mailto:python-***@python.org>
List-Help: <mailto:python-dev-***@python.org?subject=help>
List-Subscribe: <https://mail.python.org/mailman/listinfo/python-dev>,
<mailto:python-dev-***@python.org?subject=subscribe>
Errors-To: python-dev-bounces+python-python-dev=***@python.org
Sender: "Python-Dev"
<python-dev-bounces+python-python-dev=***@python.org>
Archived-At: <http://permalink.gmane.org/gmane.comp.python.devel/147932>

Steven D'Aprano wrote:
> The language semantics says that a string is an array of code points. Every
> index relates to a single code point, no code point extends over two or more
> indexes.
> There's a 1:1 relationship between code points and indexes. How is direct
> indexing "likely to be incorrect"?

We're discussing the behaviour under a different (hypothetical) design decision than a 1:1 relationship between code points and indexes, so arguing from that stance doesn't make much sense.

> e.g.
>
> s = "---ÿ---"
> offset = s.index('ÿ')
> assert s[offset] == 'ÿ'
>
> That cannot fail with Python's semantics.

Agreed, and it shouldn't (I was actually referring to the optimization being incorrect for the goal, not the language semantics). What you'd probably find is that sizeof('ÿ') == sizeof(s[offset]) == 2, which may be surprising, but is also correct.

But what are you trying to achieve (why are you writing this code)? All this example really shows is that you're only using indexing for trivial purposes.

Chris's example of an actual case where it may look like a good idea to use indexing for optimization makes this more obvious IMHO:

Chris Angelico wrote:
> Suppose you have a long title, and you need to abbreviate it by dropping out
> words (delimited by whitespace), such that you keep the first word (always) and
> the last (if possible) and as many as possible in between. How are you going to
> write that? With PEP 393 or UTF-32 strings, you can simply record the index of
> every whitespace you find, count off lengths, and decide what to keep and what
> to ellipsize.

"Recording the index" is where the optimization comes in. With a variable-length encoding - heck, even with a fixed-length one - I'd just use str.split(' ') (or re.split('\\s', string), depending on how much I care about the type of delimiter) and manipulate the list.

If copying into a separate list is a problem (memory-wise), re.finditer('\\S+', string) also provides the same behaviour and gives me the sliced string, so there's no need to index for anything.

The downside is that it isn't as easy to teach as the 1:1 relationship, and currently it doesn't perform as well *in CPython*. But if MicroPython is focusing on size over speed, I don't see any reason why they shouldn't permit different performance characteristics and require a slightly different approach to highly-optimized coding.

In any case, this is an interesting discussion with a genuine effect on the Python interpreter ecosystem. Jython and IronPython already have different string implementations from CPython - having official (and hopefully flexible) guidance on deviations from the reference implementation would I think help other implementations provide even more value, which is only a good thing for Python.

Cheers,
Steve
Mark Lawrence
2014-06-04 15:52:26 UTC
Permalink
On 04/06/2014 16:32, Steve Dower wrote:
>
> If copying into a separate list is a problem (memory-wise), re.finditer('\\S+', string) also provides the same behaviour and gives me the sliced string, so there's no need to index for anything.
>

Out of idle curiosity is there anything that stops MicroPython, or any
other implementation for that matter, from providing views of a string
rather than copying every time? IIRC memoryviews in CPython rely on the
buffer protocol at the C API level, so since strings don't support this
protocol you can't take a memoryview of them. Could this actually be
implemented in the future, is the underlying C code just too
complicated, or what?

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

---
This email is free from viruses and malware because avast! Antivirus protection is active.
http://www.avast.com
Steven D'Aprano
2014-06-04 20:10:40 UTC
Permalink
On Wed, Jun 04, 2014 at 03:32:25PM +0000, Steve Dower wrote:
> Steven D'Aprano wrote:
> > The language semantics says that a string is an array of code points. Every
> > index relates to a single code point, no code point extends over two or more
> > indexes.
> > There's a 1:1 relationship between code points and indexes. How is direct
> > indexing "likely to be incorrect"?
>
> We're discussing the behaviour under a different (hypothetical) design
> decision than a 1:1 relationship between code points and indexes, so
> arguing from that stance doesn't make much sense.

I'm open to different implementations. I earlier even suggested that the
choice of O(1) indexing versus O(N) indexing was a quality of
implementation issue, not a make-or-break issue for whether something
can call itself Python (or even 99% compatible with Python").

But I don't believe that exposing that implementation at the Python
level is valid: regardless of whether it is efficient or not, I should
be able to write code like this:

a = [mystring[i] for i in range(len(mystring))]
b = list(mystring)
assert a == b

That is not the case if you expose the underlying byte-level
implementation at the Python level, and treat strings as an array of
*bytes*. Paul seems to want to do this, or at least he wants Python 4
to do this. I think it is *completely* inappropriate to do so.

I *think* you may agree with me, (correct me if I'm wrong) because you
go on to agree with me:

> > e.g.
> >
> > s = "---ÿ---"
> > offset = s.index('ÿ')
> > assert s[offset] == 'ÿ'
> >
> > That cannot fail with Python's semantics.
>
> Agreed, and it shouldn't

but I'm not actually sure.


> (I was actually referring to the optimization
> being incorrect for the goal, not the language semantics). What you'd
> probably find is that sizeof('ÿ') == sizeof(s[offset]) == 2, which may
> be surprising, but is also correct.

You don't seem to be taking about sys.getsizeof, so I guess you're
talking about something at the C level (or other underlying
implementation), ignoring the object overhead. I don't know why you
think I'd find that surprising -- one cannot fit 0x10FFFF Unicode code
points in a single byte, so whether you use UTF-32, UTF-16, UTF-8,
Python 3.3's FSR or some other implementation, at least some code points
are going to use more than one byte.


> But what are you trying to achieve (why are you writing this code)?
> All this example really shows is that you're only using indexing for
> trivial purposes.

I'm trying to understand what point you are trying to make, because I'm
afraid I don't quite get it.


[...]
> If copying into a separate list is a problem (memory-wise),
> re.finditer('\\S+', string) also provides the same behaviour and gives
> me the sliced string, so there's no need to index for anything.

finditer returns a bunch of MatchObjects, which give you the indexes
of the found substring. Whether you do it yourself, or get the re
module to do it, you're indexing somewhere.


> The downside is that it isn't as easy to teach as the 1:1
> relationship, and currently it doesn't perform as well *in CPython*.
> But if MicroPython is focusing on size over speed, I don't see any
> reason why they shouldn't permit different performance characteristics
> and require a slightly different approach to highly-optimized coding.

I don't have a problem with different implementations, so long as that
implementation isn't exposed at the Python level with changes of
semantics such as breaking the promise that a string is an array of code
points, not of bytes.

> In any case, this is an interesting discussion with a genuine effect
> on the Python interpreter ecosystem. Jython and IronPython already
> have different string implementations from CPython - having official
> (and hopefully flexible) guidance on deviations from the reference
> implementation would I think help other implementations provide even
> more value, which is only a good thing for Python.

Yes, agreed.



--
Steven
Paul Sokolovsky
2014-06-04 15:53:52 UTC
Permalink
Hello,

On Thu, 5 Jun 2014 01:00:52 +1000
Chris Angelico <***@gmail.com> wrote:

> On Thu, Jun 5, 2014 at 12:49 AM, Paul Sokolovsky <***@gmail.com>
> wrote:
> >> > But you need non-ASCII characters to display a title of MP3
> >> > track.
> >
> > Yes, but to display a title, you don't need to do codepoint access
> > at random - you need to either take a block of memory (length in
> > bytes) and do something with it (pass to a C function, transfer
> > over some bus, etc.), or *iterate in order* over codepoints in a
> > string. All these operations are as efficient (O-notation) for
> > UTF-8 as for UTF-32.
>
> Suppose you have a long title, and you need to abbreviate it by
> dropping out words (delimited by whitespace), such that you keep the
> first word (always) and the last (if possible) and as many as possible
> in between. How are you going to write that? With PEP 393 or UTF-32
> strings, you can simply record the index of every whitespace you find,
> count off lengths, and decide what to keep and what to ellipsize.

I'll submit angry bugreport along the lines of "WWWHAT, it's 3.5 and
there's still no str.isplit()??!!11", then do it with re.finditer()
(while submitting another report on inconsistent naming scheme).

[]

--
Best regards,
Paul mailto:***@gmail.com
Serhiy Storchaka
2014-06-04 17:35:06 UTC
Permalink
04.06.14 17:49, Paul Sokolovsky написав(ла):
> On Thu, 5 Jun 2014 00:26:10 +1000
> Chris Angelico <***@gmail.com> wrote:
>> On Thu, Jun 5, 2014 at 12:17 AM, Serhiy Storchaka
>> <***@gmail.com> wrote:
>>> 04.06.14 10:03, Chris Angelico написав(ла):
>>>> Right, which is why I don't like the idea. But you don't need
>>>> non-ASCII characters to blink an LED or turn a servo, and there is
>>>> significant resistance to the notion that appending a non-ASCII
>>>> character to a long ASCII-only string requires the whole string to
>>>> be copied and doubled in size (lots of heap space used).
>>> But you need non-ASCII characters to display a title of MP3 track.
>
> Yes, but to display a title, you don't need to do codepoint access at
> random - you need to either take a block of memory (length in bytes) and
> do something with it (pass to a C function, transfer over some bus,
> etc.), or *iterate in order* over codepoints in a string. All these
> operations are as efficient (O-notation) for UTF-8 as for UTF-32.

Several previous comments discuss first option, ASCII-only strings.
Paul Sokolovsky
2014-06-04 10:53:14 UTC
Permalink
Hello,

On Tue, 3 Jun 2014 22:23:07 -0700
Guido van Rossum <***@python.org> wrote:

[]
> Never mind disabling assertions -- even with enabled assertions you'd
> have to expect most Python programs to fail with non-ASCII input.
>
> Then again the UTF-8 option would be pretty devastating too for
> anything manipulating strings (especially since many Python APIs are
> defined using indexes, e.g. the re module).

If the Unicode is slow (*), then obvious choice is not using Unicode
when not needed. Too bad that's a bit hard in Python3, as it enforces
Unicode everywhere, and dealing with efficient strings requires
prefixing them with funny characters like "b", etc.

* If Unicode if slow because it causes heap to bloat and go swap, the
choice is still the same.

>
> Why not support variable-width strings like CPython 3.4?

Because, like good deal of community, we hope that Python4 will get
back to reality, and strings will be efficient (both for processing and
storage) by default, and niche and marginal "Unicode string" type will
be used explicitly (using funny prefixes, etc.), only when really
needed.


Ah, all these not so funny geek jokes about internals of language
implementation, hope they didn't make somebody's day dull!

>
> --
> --Guido van Rossum (python.org/~guido)



--
Best regards,
Paul mailto:***@gmail.com
Mark Lawrence
2014-06-04 13:29:51 UTC
Permalink
On 04/06/2014 11:53, Paul Sokolovsky wrote:
> Hello,
>
> On Tue, 3 Jun 2014 22:23:07 -0700
> Guido van Rossum <***@python.org> wrote:
>
> []
>> Never mind disabling assertions -- even with enabled assertions you'd
>> have to expect most Python programs to fail with non-ASCII input.
>>
>> Then again the UTF-8 option would be pretty devastating too for
>> anything manipulating strings (especially since many Python APIs are
>> defined using indexes, e.g. the re module).
>
> If the Unicode is slow (*), then obvious choice is not using Unicode
> when not needed. Too bad that's a bit hard in Python3, as it enforces
> Unicode everywhere, and dealing with efficient strings requires
> prefixing them with funny characters like "b", etc.
>
> * If Unicode if slow because it causes heap to bloat and go swap, the
> choice is still the same.

Where is your evidence that (presumably) CPython unicode is slow? What
is your response to this message
http://bugs.python.org/issue16061#msg171413 from the bug tracker?

>
>>
>> Why not support variable-width strings like CPython 3.4?
>
> Because, like good deal of community, we hope that Python4 will get
> back to reality, and strings will be efficient (both for processing and
> storage) by default, and niche and marginal "Unicode string" type will
> be used explicitly (using funny prefixes, etc.), only when really
> needed.

Where is your evidence that supports the above claim?

>
>
> Ah, all these not so funny geek jokes about internals of language
> implementation, hope they didn't make somebody's day dull!
>
>>
>> --
>> --Guido van Rossum (python.org/~guido)
>
>
>


--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

---
This email is free from viruses and malware because avast! Antivirus protection is active.
http://www.avast.com
Chris Angelico
2014-06-04 10:51:36 UTC
Permalink
On Wed, Jun 4, 2014 at 8:38 PM, Paul Sokolovsky <***@gmail.com> wrote:
> That's another reason why people don't like Unicode enforced upon them
> - all the talk about supporting all languages and scripts is demagogy
> and hypocrisy, given a choice, Unicode zealots would rather limit
> people to Latin script then give up on their arbitrarily chosen,
> one-among-thousands,
> soon-to-be-replaced-by-apples'-and-microsofts'-"exciting-new" encoding.

Wrong. I use and recommend Unicode, with UTF-8 for transmission, and I
do not ever want to limit people to Latin-1 or any other such subset.
Even though English is the only language I speak, I am *frequently*
using non-ASCII characters (eg when I discuss mathematics on a MUD),
and if I could be absolutely sure that everyone in the conversation
correctly comprehended Unicode, I could do this with a lot more
confidence. Unfortunately, the server I use just passes bytes in and
out, and some clients assume CP-1252, others assume Latin-1, and
others (including my Gypsum) try UTF-8 first and fall back on an
eight-bit encoding (currently CP-1252 because of the first group). But
in an ideal world, server and clients would all speak Unicode
everywhere, and transmit and receive UTF-8. This is not hypocrisy,
this is the way to work reliably.

> Once again, my claim is what MicroPython implements now is more correct
> - in a sense wider than technical - handling. We don't provide Unicode
> encoding support, because it's highly bloated, but let people use any
> encoding they like. That comes at some price, like length of strings in
> characters are not know to runtime, only in bytes, but quite a lot of
> applications can be written by having just that.

The current implementation is flat-out lying, actually. It claims that
it's storing Unicode codepoints (as per the Python spec) while
actually storing bytes, and then it transmits those bytes to the
console etc as-is. This is a bug. It needs to be fixed. The only
question is, what form will the fix take? Will it be PEP 393's
flexible fixed-width representation? UTF-8? UTF-16 (I hope not!)? A
hybrid of Latin-1 where possible and UTF-8 otherwise? But something
has to be done.

ChrisA
Chris Angelico
2014-06-04 10:53:46 UTC
Permalink
On Wed, Jun 4, 2014 at 8:38 PM, Paul Sokolovsky <***@gmail.com> wrote:
> And I'm saying that not to discourage Unicode addition to MicroPython,
> but to hint that "force-force" approach implemented by CPython3 and
> causing rage and split in the community is not appreciated.

FWIW, it's Python 3 (the language) and not CPython 3.x (the
implementation) that specifies Unicode strings in this way. I don't
know why it has to cause a split in the community; this is the one way
to make sure *everyone's* strings work perfectly, rather than having
ASCII strings work fine and others start tripping over problems in
various APIs.

ChrisA
Paul Sokolovsky
2014-06-04 11:49:33 UTC
Permalink
Hello,

On Wed, 4 Jun 2014 20:53:46 +1000
Chris Angelico <***@gmail.com> wrote:

> On Wed, Jun 4, 2014 at 8:38 PM, Paul Sokolovsky <***@gmail.com>
> wrote:
> > And I'm saying that not to discourage Unicode addition to
> > MicroPython, but to hint that "force-force" approach implemented by
> > CPython3 and causing rage and split in the community is not
> > appreciated.
>
> FWIW, it's Python 3 (the language) and not CPython 3.x (the
> implementation) that specifies Unicode strings in this way.

Yeah, but it's CPython what dictates how language evolves (some people
even think that it dictates how language should be implemented!), so all
good parts belong to Python3, and all bad parts - to CPython3,
right? ;-)

> I don't
> know why it has to cause a split in the community; this is the one way
> to make sure *everyone's* strings work perfectly, rather than having
> ASCII strings work fine and others start tripping over problems in
> various APIs.

It did cause split in the community, that's the fact, that's why
Python2 and Python3 are at the respective positions. Anyway, I'm not
interested in participating in that split, I did not yet uttered my
opinion on that publicly enough, so I seized a chance to drop some
witty remarks, but I don't want to start yet another Unicode flame.



So, let's please be back to Unicode storage representation in
MicroPython. So, https://github.com/micropython/micropython/issues/657
discussed technical aspects, in a recent mail on this list I expressed
my opinion why following CPython way is not productive (for development
satisfaction and evolution of Python community, to be explicit).

Final argument I would have is that you certainly can implement Unicode
support the PEP393 way - it would be enormous help and would be gladly
accepted. The question, how useful it will be for MicroPython. It
certainly will be useful to report passing of testsuites. But will it
be *really* used?

For microcontroller board, it might be too heavy (put simple, with it,
people will be able to do less (== heap running out sooner)), than
without it, so one may expect it to be disabled by default. Then POSIX
port is there surely not to let people replace "python" command
with "micropython" and run Django, but to let people develop and debug
their apps with more comfort than on embedded board. So, it should
behave close to MCU version, and would follow with MCU choice
re: Unicode.

That's actually the reason why I keep up this discussion - not for the
sake of argument or to bash Python3's Unicode choices. With recent
MicroPython announcement, we surely looked for more people to
contribute to its development. But then we (or at least I can speak for
myself), would like to make sure that these contribution are actually
the most useful ones (for both MicroPython, and Python community in
general, which gets more choices, rather than just getting N% smaller
CPython rewrite).

So, you're not sure how O(N) string indexing will work? But MicroPython
offers a great opportunity to try! And it's something new and exciting,
which surely will be useful (== will save people memory), not just
something old and boring ;-).


>
> ChrisA


--
Best regards,
Paul mailto:***@gmail.com
Daniel Holth
2014-06-04 12:17:16 UTC
Permalink
Received: from localhost (HELO mail.python.org) (127.0.0.1)
by albatross.python.org with SMTP; 04 Jun 2014 14:17:22 +0200
Received: from mail-wg0-x22a.google.com (unknown
[IPv6:2a00:1450:400c:c00::22a])
(using TLSv1 with cipher ECDHE-RSA-AES128-SHA (128/128 bits))
(No client certificate requested)
by mail.python.org (Postfix) with ESMTPS
for <python-***@python.org>; Wed, 4 Jun 2014 14:17:22 +0200 (CEST)
Received: by mail-wg0-f42.google.com with SMTP id y10so8159955wgg.1
for <python-***@python.org>; Wed, 04 Jun 2014 05:17:16 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
h=mime-version:in-reply-to:references:date:message-id:subject:from:to
:cc:content-type;
bh=nBLeMBOukPnGhX6AsCeHiZDfkwBMfClboNh1RbVGNOM=;
b=TC1NcbP/ihxmoSUYEP8MmVedfyWsCFjwRg3/5mNzpXEVwS33I6UDkPaPFytAvU1zcc
wdc9LP2+TzHKj1NBTfTfuBouFTiPDB982umzXps+BDJohm89DyZOOZsb0AlOReaHsMUi
KzKf3TUDbfTNoJrOoT++BaC/S/UCR2oR5h7ZsjWwhtDWnieuguvsqJ3FWDZ9KxGE/u4d
tHDBPRpw1gMYhX2R3xkXRgO3aY+By40WsPqx61uuCjzq5ggFtMUcUvJItGOzyay2gEr3
nwbnxwsu2TXIisEUxTOLXAA4gLcXus5CkyUIu+v3A96eebXsJt+o2RTzRnZHdCWtM4D8
HhbA==
X-Received: by 10.194.92.176 with SMTP id cn16mr69440057wjb.43.1401884236695;
Wed, 04 Jun 2014 05:17:16 -0700 (PDT)
Received: by 10.194.202.230 with HTTP; Wed, 4 Jun 2014 05:17:16 -0700 (PDT)
In-Reply-To: <***@x34f>
X-BeenThere: python-***@python.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Python core developers <python-dev.python.org>
List-Unsubscribe: <https://mail.python.org/mailman/options/python-dev>,
<mailto:python-dev-***@python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/pipermail/python-dev/>
List-Post: <mailto:python-***@python.org>
List-Help: <mailto:python-dev-***@python.org?subject=help>
List-Subscribe: <https://mail.python.org/mailman/listinfo/python-dev>,
<mailto:python-dev-***@python.org?subject=subscribe>
Errors-To: python-dev-bounces+python-python-dev=***@python.org
Sender: "Python-Dev"
<python-dev-bounces+python-python-dev=***@python.org>
Archived-At: <http://permalink.gmane.org/gmane.comp.python.devel/147916>

If we're voting I think representing Unicode internally in micropython
as utf-8 with O(N) indexing is a great idea, partly because I'm not
sure indexing into strings is a good idea - lots of Unicode code
points don't make sense by themselves; see also grapheme clusters. It
would probably work great.

On Wed, Jun 4, 2014 at 7:49 AM, Paul Sokolovsky <***@gmail.com> wrote:
> Hello,
>
> On Wed, 4 Jun 2014 20:53:46 +1000
> Chris Angelico <***@gmail.com> wrote:
>
>> On Wed, Jun 4, 2014 at 8:38 PM, Paul Sokolovsky <***@gmail.com>
>> wrote:
>> > And I'm saying that not to discourage Unicode addition to
>> > MicroPython, but to hint that "force-force" approach implemented by
>> > CPython3 and causing rage and split in the community is not
>> > appreciated.
>>
>> FWIW, it's Python 3 (the language) and not CPython 3.x (the
>> implementation) that specifies Unicode strings in this way.
>
> Yeah, but it's CPython what dictates how language evolves (some people
> even think that it dictates how language should be implemented!), so all
> good parts belong to Python3, and all bad parts - to CPython3,
> right? ;-)
>
>> I don't
>> know why it has to cause a split in the community; this is the one way
>> to make sure *everyone's* strings work perfectly, rather than having
>> ASCII strings work fine and others start tripping over problems in
>> various APIs.
>
> It did cause split in the community, that's the fact, that's why
> Python2 and Python3 are at the respective positions. Anyway, I'm not
> interested in participating in that split, I did not yet uttered my
> opinion on that publicly enough, so I seized a chance to drop some
> witty remarks, but I don't want to start yet another Unicode flame.
>
>
>
> So, let's please be back to Unicode storage representation in
> MicroPython. So, https://github.com/micropython/micropython/issues/657
> discussed technical aspects, in a recent mail on this list I expressed
> my opinion why following CPython way is not productive (for development
> satisfaction and evolution of Python community, to be explicit).
>
> Final argument I would have is that you certainly can implement Unicode
> support the PEP393 way - it would be enormous help and would be gladly
> accepted. The question, how useful it will be for MicroPython. It
> certainly will be useful to report passing of testsuites. But will it
> be *really* used?
>
> For microcontroller board, it might be too heavy (put simple, with it,
> people will be able to do less (== heap running out sooner)), than
> without it, so one may expect it to be disabled by default. Then POSIX
> port is there surely not to let people replace "python" command
> with "micropython" and run Django, but to let people develop and debug
> their apps with more comfort than on embedded board. So, it should
> behave close to MCU version, and would follow with MCU choice
> re: Unicode.
>
> That's actually the reason why I keep up this discussion - not for the
> sake of argument or to bash Python3's Unicode choices. With recent
> MicroPython announcement, we surely looked for more people to
> contribute to its development. But then we (or at least I can speak for
> myself), would like to make sure that these contribution are actually
> the most useful ones (for both MicroPython, and Python community in
> general, which gets more choices, rather than just getting N% smaller
> CPython rewrite).
>
> So, you're not sure how O(N) string indexing will work? But MicroPython
> offers a great opportunity to try! And it's something new and exciting,
> which surely will be useful (== will save people memory), not just
> something old and boring ;-).
>
>
>>
>> ChrisA
>
>
> --
> Best regards,
> Paul mailto:***@gmail.com
> _______________________________________________
> Python-Dev mailing list
> Python-***@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: https://mail.python.org/mailman/options/python-dev/dholth%40gmail.com
Steve Dower
2014-06-04 13:14:04 UTC
Permalink
I'm agree with Daniel. Directly indexing into text suggests an attempted optimization that is likely to be incorrect for a set of strings. Splitting, regex, concatenation and formatting are really the main operations that matter, and MicroPython can optimize their implementation of these easily enough for O(N) indexing.

Cheers,
Steve

Top-posted from my Windows Phone
________________________________
From: Daniel Holth<mailto:***@gmail.com>
Sent: ý6/ý4/ý2014 5:17
To: Paul Sokolovsky<mailto:***@gmail.com>
Cc: python-dev<mailto:python-***@python.org>
Subject: Re: [Python-Dev] Internal representation of strings and Micropython

If we're voting I think representing Unicode internally in micropython
as utf-8 with O(N) indexing is a great idea, partly because I'm not
sure indexing into strings is a good idea - lots of Unicode code
points don't make sense by themselves; see also grapheme clusters. It
would probably work great.

On Wed, Jun 4, 2014 at 7:49 AM, Paul Sokolovsky <***@gmail.com> wrote:
> Hello,
>
> On Wed, 4 Jun 2014 20:53:46 +1000
> Chris Angelico <***@gmail.com> wrote:
>
>> On Wed, Jun 4, 2014 at 8:38 PM, Paul Sokolovsky <***@gmail.com>
>> wrote:
>> > And I'm saying that not to discourage Unicode addition to
>> > MicroPython, but to hint that "force-force" approach implemented by
>> > CPython3 and causing rage and split in the community is not
>> > appreciated.
>>
>> FWIW, it's Python 3 (the language) and not CPython 3.x (the
>> implementation) that specifies Unicode strings in this way.
>
> Yeah, but it's CPython what dictates how language evolves (some people
> even think that it dictates how language should be implemented!), so all
> good parts belong to Python3, and all bad parts - to CPython3,
> right? ;-)
>
>> I don't
>> know why it has to cause a split in the community; this is the one way
>> to make sure *everyone's* strings work perfectly, rather than having
>> ASCII strings work fine and others start tripping over problems in
>> various APIs.
>
> It did cause split in the community, that's the fact, that's why
> Python2 and Python3 are at the respective positions. Anyway, I'm not
> interested in participating in that split, I did not yet uttered my
> opinion on that publicly enough, so I seized a chance to drop some
> witty remarks, but I don't want to start yet another Unicode flame.
>
>
>
> So, let's please be back to Unicode storage representation in
> MicroPython. So, https://github.com/micropython/micropython/issues/657
> discussed technical aspects, in a recent mail on this list I expressed
> my opinion why following CPython way is not productive (for development
> satisfaction and evolution of Python community, to be explicit).
>
> Final argument I would have is that you certainly can implement Unicode
> support the PEP393 way - it would be enormous help and would be gladly
> accepted. The question, how useful it will be for MicroPython. It
> certainly will be useful to report passing of testsuites. But will it
> be *really* used?
>
> For microcontroller board, it might be too heavy (put simple, with it,
> people will be able to do less (== heap running out sooner)), than
> without it, so one may expect it to be disabled by default. Then POSIX
> port is there surely not to let people replace "python" command
> with "micropython" and run Django, but to let people develop and debug
> their apps with more comfort than on embedded board. So, it should
> behave close to MCU version, and would follow with MCU choice
> re: Unicode.
>
> That's actually the reason why I keep up this discussion - not for the
> sake of argument or to bash Python3's Unicode choices. With recent
> MicroPython announcement, we surely looked for more people to
> contribute to its development. But then we (or at least I can speak for
> myself), would like to make sure that these contribution are actually
> the most useful ones (for both MicroPython, and Python community in
> general, which gets more choices, rather than just getting N% smaller
> CPython rewrite).
>
> So, you're not sure how O(N) string indexing will work? But MicroPython
> offers a great opportunity to try! And it's something new and exciting,
> which surely will be useful (== will save people memory), not just
> something old and boring ;-).
>
>
>>
>> ChrisA
>
>
> --
> Best regards,
> Paul mailto:***@gmail.com
> _______________________________________________
> Python-Dev mailing list
> Python-***@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: https://mail.python.org/mailman/options/python-dev/dholth%40gmail.com
Steven D'Aprano
2014-06-04 14:12:45 UTC
Permalink
On Wed, Jun 04, 2014 at 01:14:04PM +0000, Steve Dower wrote:
> I'm agree with Daniel. Directly indexing into text suggests an
> attempted optimization that is likely to be incorrect for a set of
> strings.

I'm afraid I don't understand this argument. The language semantics says
that a string is an array of code points. Every index relates to a
single code point, no code point extends over two or more indexes.
There's a 1:1 relationship between code points and indexes. How is
direct indexing "likely to be incorrect"?

e.g.

s = "---ÿ---"
offset = s.index('ÿ')
assert s[offset] == 'ÿ'

That cannot fail with Python's semantics.

[Aside: it does fail in Python 2, showing that the idea that "strings
are bytes" is fatally broken. Fortunately Python has moved beyond that.]


> Splitting, regex, concatenation and formatting are really the
> main operations that matter, and MicroPython can optimize their
> implementation of these easily enough for O(N) indexing.

Really? Well, it will be a nice experiment. Fortunately MicroPython runs
under Linux as well as on embedded systems (a clever decision, by the
way) so I look forward to seeing how their internal-utf8 implementation
stacks up against CPython's FSR implementation.

Out of curiosity, when the FSR was proposed, did anyone consider an
internal UTF-8 representation? If so, why was it rejected?




--
Steven
Daniel Holth
2014-06-04 15:31:17 UTC
Permalink
On Wed, Jun 4, 2014 at 10:12 AM, Steven D'Aprano <***@pearwood.info> wrote:
> On Wed, Jun 04, 2014 at 01:14:04PM +0000, Steve Dower wrote:
>> I'm agree with Daniel. Directly indexing into text suggests an
>> attempted optimization that is likely to be incorrect for a set of
>> strings.
>
> I'm afraid I don't understand this argument. The language semantics says
> that a string is an array of code points. Every index relates to a
> single code point, no code point extends over two or more indexes.
> There's a 1:1 relationship between code points and indexes. How is
> direct indexing "likely to be incorrect"?

"Useful" is probably a better word. When you get into the complicated
languages and you want to know how wide something is, and you might
have y with two dots on it as one code point or two and left-to-right
and right-to-left indicators and who knows what else... then looking
at individual code points only works sometimes. I get the slicing
idea.

I like the idea that encoding to utf-8 would be the fastest thing you
can do with a string. You could consider doing regexps in that domain,
and other implementation specific optimizations in exactly the same
way that any Python implementation has them.

None of this would make it harder to move a servo.
Glenn Linderman
2014-06-04 20:50:42 UTC
Permalink
On 6/4/2014 6:14 AM, Steve Dower wrote:
> I'm agree with Daniel. Directly indexing into text suggests an
> attempted optimization that is likely to be incorrect for a set of
> strings. Splitting, regex, concatenation and formatting are really the
> main operations that matter, and MicroPython can optimize their
> implementation of these easily enough for O(N) indexing.
>
> Cheers,
> Steve
>
> Top-posted from my Windows Phone
> ------------------------------------------------------------------------
> From: Daniel Holth <mailto:***@gmail.com>
> Sent: ‎6/‎4/‎2014 5:17
> To: Paul Sokolovsky <mailto:***@gmail.com>
> Cc: python-dev <mailto:python-***@python.org>
> Subject: Re: [Python-Dev] Internal representation of strings and
> Micropython
>
> If we're voting I think representing Unicode internally in micropython
> as utf-8 with O(N) indexing is a great idea, partly because I'm not
> sure indexing into strings is a good idea - lots of Unicode code
> points don't make sense by themselves; see also grapheme clusters. It
> would probably work great.

I think native UTF-8 support is the most promising route for a
micropython Unicode support.

It would be an interesting proof-of-concept to implement an alternative
CPython with PEP-393 replaced by UTF-8 internally... doing conversions
for APIs that require a different encoding, but always maintaining and
computing with the UTF-8 representation.

1) The first proof-of-concept implementation should implement codepoint
indexing as a O(N) operation, searching from the beginning of the string
for the Nth codepoint.

Other Proof-of-concept implementation could implement a codepoint
boundary cache, there could be a variety of caching algorithms.

2) (Least space efficient) An array that could be indexed by codepoint
position and result in byte position. (This would use more space than a
UTF-32 representation!)

3) (Most space efficient) One cached entry, that caches the last
codepoint/byte position referenced. UTF-8 is able to be traversed in
either direction, so "next/previous" codepoint access would be
relatively fast (and such are very common operations, even when indexing
notation is used: "for ix in range( len( str_x )): func( str_x[ ix ])".)

4) (Fixed size caches) N entries, one for the last codepoint, and
others at Codepoint_Length/N intervals. N could be tunable.

5) (Fixed size caches) Like 4, plus an extra entry like 3.

6) (Variable size caches) Like 2, but only indexing every Nth code
point. N could be tunable.

7) (Variable size caches) Like 6, plus an extra entry like 3.

8) (Content specific variable size caches) Index each codepoint that is
a different byte size than the previous codepoint, allowing indexing to
be used in the intervals. Worst case size is like 2, best case size is a
single entry for the end, when all code points are represented by the
same number of bytes.

9) (Content specific variable size caches) Like 8, only cache entries
could indicate fixed or variable size characters in the next interval,
with a scheme like 4 or 6 used to prevent one interval from covering the
whole string.

Other hybrid schemes may present themselves as useful once experience is
gained with some of these. It might be surprising how few algorithms
need more than algorithm 3 to get reasonable performance.

Glenn
Chris Angelico
2014-06-04 21:28:00 UTC
Permalink
On Thu, Jun 5, 2014 at 6:50 AM, Glenn Linderman <v+***@g.nevcal.com> wrote:
> 8) (Content specific variable size caches) Index each codepoint that is a
> different byte size than the previous codepoint, allowing indexing to be
> used in the intervals. Worst case size is like 2, best case size is a single
> entry for the end, when all code points are represented by the same number
> of bytes.

Conceptually interesting, and I'd love to know how well that'd perform
in real-world usage. Would do very nicely on blocks of text that are
all from the same range of codepoints, but if you intersperse high and
low codepoints it'll be like 2 but with significantly more complicated
lookups (imagine a "name=value\nname=value\n" stream where the names
and values are all in the same language - you'll have a lot of
transitions).

Chrisa
Glenn Linderman
2014-06-04 21:57:36 UTC
Permalink
On 6/4/2014 2:28 PM, Chris Angelico wrote:
> On Thu, Jun 5, 2014 at 6:50 AM, Glenn Linderman <v+***@g.nevcal.com> wrote:
>> 8) (Content specific variable size caches) Index each codepoint that is a
>> different byte size than the previous codepoint, allowing indexing to be
>> used in the intervals. Worst case size is like 2, best case size is a single
>> entry for the end, when all code points are represented by the same number
>> of bytes.
> Conceptually interesting, and I'd love to know how well that'd perform
> in real-world usage.

So would I :)

> Would do very nicely on blocks of text that are
> all from the same range of codepoints, but if you intersperse high and
> low codepoints it'll be like 2 but with significantly more complicated
> lookups (imagine a "name=value\nname=value\n" stream where the names
> and values are all in the same language - you'll have a lot of
> transitions).

Lookup is binary search on code point index or a search for same in some
tree structure, I would think.

"like 2 but ..." well, the data structure would be bigger than for 2,
but your example shows 4-5 high codepoints per low codepoint (for some
languages).

I did just think of another refinement to this technique (my list was
not intended to be all-inclusive... just a bunch of variations I thought
of then).

10) (Content specific variable size caches) Like 8, but the last
character in a run is allowed (but not required) to be a different
number of bytes than prior characters, because the offset calculation
will still work for the first character of a different size.

So #10 would halve the size of your imagined stream that intersperses
one low-byte charater with each sequence of high-byte characters.
Stephen J. Turnbull
2014-06-05 07:00:01 UTC
Permalink
Glenn Linderman writes:

> 3) (Most space efficient) One cached entry, that caches the last
> codepoint/byte position referenced. UTF-8 is able to be traversed in
> either direction, so "next/previous" codepoint access would be
> relatively fast (and such are very common operations, even when indexing
> notation is used: "for ix in range( len( str_x )): func( str_x[ ix ])".)

Been there, tried that (Emacsen). Either it's a YAGNI (moving forward
or backward over UTF-8 by characters short distances is plenty fast,
especially if you've got a lot of ASCII you can move by words for
somewhat longer distances), or it's not good enough. There *may* be a
sweet spot, but it's definitely smaller than the one on Sharapova's
racket.

> 4) (Fixed size caches) N entries, one for the last codepoint, and
> others at Codepoint_Length/N intervals. N could be tunable.

To achieve space saving, cache has to be quite small, and the bigger
your integers, the smaller it gets. A naive implementation on 64-bit
machine would give you 16 bytes/cache entry. Using a non-native size
will be a space win, but needs care in implementation. Initializing
the cache is very expensive for small strings, so you need conditional
and maybe lazy initialization (for large strings).

By the way, there's also

10) Keep counts of the leading and trailing number of ASCII
(one-octet) characters. This is often a *huge* win; it's quite
common to encounter documents where size - lc - tc = 2 (ie,
there's only one two-octet character in the document).

11) Keep a list (or tree) of most-recently-accessed positions.

Despite my negative experience with multibyte encodings in Emacsen,
I'm persuaded by the arguments that there probably aren't all that
many places in core Python where indexing is used in an essential way,
so MicroPython itself can probably optimize those "behind the
scenes". Application programmers in the embedded context may be
expected to be deal with the need to avoid random access algorithms
and use iterators and generators to accomplish most tasks.
Serhiy Storchaka
2014-06-05 07:54:03 UTC
Permalink
04.06.14 23:50, Glenn Linderman написав(ла):
> 3) (Most space efficient) One cached entry, that caches the last
> codepoint/byte position referenced. UTF-8 is able to be traversed in
> either direction, so "next/previous" codepoint access would be
> relatively fast (and such are very common operations, even when indexing
> notation is used: "for ix in range( len( str_x )): func( str_x[ ix ])".)

Great idea! It should cover most real-word cases. Note that we can scan
UTF-8 string left-to-right and right-to-left.
Paul Sokolovsky
2014-06-04 10:38:57 UTC
Permalink
Hello,

On Wed, 4 Jun 2014 12:32:12 +1000
Chris Angelico <***@gmail.com> wrote:

> On Wed, Jun 4, 2014 at 11:17 AM, Steven D'Aprano
> <***@pearwood.info> wrote:
> > * Having a build-time option to restrict all strings to ASCII-only.
> >
> > (I think what they mean by that is that strings will be like
> > Python 2 strings, ASCII-plus-arbitrary-bytes, not actually ASCII.)
>
> What I was actually suggesting along those lines was that the str type
> still be notionally a Unicode string, but that any codepoints >127
> would either raise an exception or blow an assertion,

That's another reason why people don't like Unicode enforced upon them
- all the talk about supporting all languages and scripts is demagogy
and hypocrisy, given a choice, Unicode zealots would rather limit
people to Latin script then give up on their arbitrarily chosen,
one-among-thousands,
soon-to-be-replaced-by-apples'-and-microsofts'-"exciting-new" encoding.

Once again, my claim is what MicroPython implements now is more correct
- in a sense wider than technical - handling. We don't provide Unicode
encoding support, because it's highly bloated, but let people use any
encoding they like. That comes at some price, like length of strings in
characters are not know to runtime, only in bytes, but quite a lot of
applications can be written by having just that.

And I'm saying that not to discourage Unicode addition to MicroPython,
but to hint that "force-force" approach implemented by CPython3 and
causing rage and split in the community is not appreciated.

> and all the code
> to handle multibyte representations would be compiled out. So there'd
> still be a difference between strings of text and streams of bytes,
> but all encoding and decoding to/from ASCII-compatible encodings would
> just point to the same bytes in RAM.
>
> Risk: Someone would implement that with assertions, then compile with
> assertions disabled, test only with ASCII, and have lurking bugs.
>
> ChrisA


--
Best regards,
Paul mailto:***@gmail.com
Steven D'Aprano
2014-06-04 14:40:53 UTC
Permalink
On Wed, Jun 04, 2014 at 01:38:57PM +0300, Paul Sokolovsky wrote:

> That's another reason why people don't like Unicode enforced upon them

Enforcing design and language decisions is the job of the programming
language. You might as well complain that Python forces C doubles as the
floating point type, or that it forces Bignums as the integer type, or
that it forces significant indentation, or "class" as a keyword. Or that
C forces you to use braces and manage your own memory. That's the
purpose of the language, to make those decisions as to what features to
provide and what not to provide.


> - all the talk about supporting all languages and scripts is demagogy
> and hypocrisy, given a choice, Unicode zealots would rather limit
> people to Latin script

I have no words to describe how ridiculous this accusation is.


> then give up on their arbitrarily chosen, one-among-thousands,
> soon-to-be-replaced-by-apples'-and-microsofts'-"exciting-new" encoding.


> Once again, my claim is what MicroPython implements now is more correct
> - in a sense wider than technical - handling. We don't provide Unicode
> encoding support, because it's highly bloated, but let people use any
> encoding they like. That comes at some price, like length of strings in
> characters are not know to runtime, only in bytes

What's does uPy return for the length of '∞'? If the answer is anything
but 1, that's a bug.


--
Steven
Nick Coghlan
2014-06-04 05:17:00 UTC
Permalink
On 4 June 2014 11:17, Steven D'Aprano <***@pearwood.info> wrote:
> My own feeling is that O(1) string indexing operations are a quality of
> implementation issue, not a deal breaker to call it a Python.

If string indexing & iteration is still presented to the user as "an
array of code points", it should still avoid the bugs that plagued
both Python 2 narrow builds and direct use of UTF-8 encoded Py2
strings.

If they don't try to offer C API compatibility, it should be feasible
to do it that way. If they *do* try to offer C API compatibility, they
may have a problem.

> I can't
> see any requirement in the docs that str[n] must take O(1) time, but
> perhaps I have missed something.

There's a general expectation that indexing will be O(1) because all
the builtin containers that support that syntax use it for O(1) lookup
operations.

Cheers,
Nick.

--
Nick Coghlan | ***@gmail.com | Brisbane, Australia
Chris Angelico
2014-06-04 06:51:12 UTC
Permalink
On Wed, Jun 4, 2014 at 3:17 PM, Nick Coghlan <***@gmail.com> wrote:
> On 4 June 2014 11:17, Steven D'Aprano <***@pearwood.info> wrote:
>> My own feeling is that O(1) string indexing operations are a quality of
>> implementation issue, not a deal breaker to call it a Python.
>
> If string indexing & iteration is still presented to the user as "an
> array of code points", it should still avoid the bugs that plagued
> both Python 2 narrow builds and direct use of UTF-8 encoded Py2
> strings.

It would. The downsides of a UTF-8 representation would be slower
iteration and much slower (O(N)) indexing/slicing.

ChrisA
dw+
2014-06-04 05:39:04 UTC
Permalink
On Wed, Jun 04, 2014 at 03:17:00PM +1000, Nick Coghlan wrote:

> There's a general expectation that indexing will be O(1) because all
> the builtin containers that support that syntax use it for O(1) lookup
> operations.

Depending on your definition of built in, there is at least one standard
library container that does not - collections.deque.

Given the specialized kinds of application this Python implementation is
targetted at, it seems UTF-8 is ideal considering the huge memory
savings resulting from the compressed representation, and the reduced
likelihood of there being any real need for serious text processing on
the device.

It is also unlikely to find software or libraries like Django or
Werkzeug running on a microcontroller, more likely all the Python code
would be custom, in which case, replacing string indexing with
iteration, or temporary conversion to a list is easily done.

In this context, while a fixed-width encoding may be the correct choice
it would also likely be the wrong choice.


David
Stephen J. Turnbull
2014-06-04 09:36:20 UTC
Permalink
dw+python-***@hmmz.org writes:

> Given the specialized kinds of application this Python
> implementation is targetted at, it seems UTF-8 is ideal considering
> the huge memory savings resulting from the compressed
> representation,

I think you really need to check what the applications are in detail.
UTF-8 costs about 35% more storage for Japanese, and even more for
Chinese, than does UTF-16. So if you might be using a lot of Asian
localized strings, it might even be worth implementing PEP-393 to get
the best of both worlds for most strings.
Juraj Sukop
2014-06-04 09:53:43 UTC
Permalink
On Wed, Jun 4, 2014 at 11:36 AM, Stephen J. Turnbull <***@xemacs.org>
wrote:

>
> I think you really need to check what the applications are in detail.
> UTF-8 costs about 35% more storage for Japanese, and even more for
> Chinese, than does UTF-16.


"UTF-8 can be smaller even for Asian languages, e.g.: front page of
Wikipedia Japan: 83 kB in UTF-8, 144 kB in UTF-16"
From http://www.lua.org/wshop12/Ierusalimschy.pdf (p. 12)
Nick Coghlan
2014-06-04 13:33:01 UTC
Permalink
On 4 June 2014 15:39, <dw+python-***@hmmz.org> wrote:
> On Wed, Jun 04, 2014 at 03:17:00PM +1000, Nick Coghlan wrote:
>
>> There's a general expectation that indexing will be O(1) because all
>> the builtin containers that support that syntax use it for O(1) lookup
>> operations.
>
> Depending on your definition of built in, there is at least one standard
> library container that does not - collections.deque.
>
> Given the specialized kinds of application this Python implementation is
> targetted at, it seems UTF-8 is ideal considering the huge memory
> savings resulting from the compressed representation, and the reduced
> likelihood of there being any real need for serious text processing on
> the device.

Right - I wasn't clear that I think storing text internally as UTF-8
sounds fine for MicroPython. Anything where the O(N) nature of
indexing by code point matters probably won't be run in that
environment anyway.

Cheers,
Nick.

--
Nick Coghlan | ***@gmail.com | Brisbane, Australia
MRAB
2014-06-04 16:52:17 UTC
Permalink
On 2014-06-04 14:33, Nick Coghlan wrote:
> On 4 June 2014 15:39, <dw+python-***@hmmz.org> wrote:
>> On Wed, Jun 04, 2014 at 03:17:00PM +1000, Nick Coghlan wrote:
>>
>>> There's a general expectation that indexing will be O(1) because
>>> all the builtin containers that support that syntax use it for
>>> O(1) lookup operations.
>>
>> Depending on your definition of built in, there is at least one
>> standard library container that does not - collections.deque.
>>
>> Given the specialized kinds of application this Python
>> implementation is targetted at, it seems UTF-8 is ideal considering
>> the huge memory savings resulting from the compressed
>> representation, and the reduced likelihood of there being any real
>> need for serious text processing on the device.
>
> Right - I wasn't clear that I think storing text internally as UTF-8
> sounds fine for MicroPython. Anything where the O(N) nature of
> indexing by code point matters probably won't be run in that
> environment anyway.
>
In order to avoid indexing, you could use some kind of 'cursor' class to
step forwards and backwards along strings. The cursor could include
both the codepoint index and the byte index.
Serhiy Storchaka
2014-06-04 17:11:11 UTC
Permalink
04.06.14 19:52, MRAB написав(ла):
> In order to avoid indexing, you could use some kind of 'cursor' class to
> step forwards and backwards along strings. The cursor could include
> both the codepoint index and the byte index.

So you need different string library and different regular expression
library.
m***@v.loewis.de
2014-06-04 07:02:13 UTC
Permalink
Zitat von Steven D'Aprano <***@pearwood.info>:

> * Having a build-time option to restrict all strings to ASCII-only.
>
> (I think what they mean by that is that strings will be like Python 2
> strings, ASCII-plus-arbitrary-bytes, not actually ASCII.)

An ASCII-plus-arbitrary-bytes type called "str" would prevent claiming
"Python 3.4 compatibility" for sure.

Restricting strings to ASCII (as Chris apparently actually suggested)
would allow to claim compatibility with a stretch: existing Python
code might not run on such an implementation. However, since a lot
of existing Python code wouldn't run on MicroPython, anyway, one
might claim to implement a Python 3.4 subset.

> * Implementing Unicode internally as UTF-8, and giving up O(1)
> indexing operations.
>
> Would either of these trade-offs be acceptable while still claiming
> "Python 3.4 compatibility"?
>
> My own feeling is that O(1) string indexing operations are a quality of
> implementation issue, not a deal breaker to call it a Python. I can't
> see any requirement in the docs that str[n] must take O(1) time, but
> perhaps I have missed something.

I agree. It's an open question whether such an implementation would be
practical, both in terms of existing Python code, and in terms of existing
C extension modules that people might want to port to MicroPython.

There are more things to consider for the internal implementation,
in particular how the string length is implemented. Several alternatives
exist:
1. store the UTF-8 length (i.e. memory size)
2. store the number of code points (i.e. Python len())
3. store both
4. store neither, but use null termination instead

Variant 3 is most run-time efficient, but could easily use 8 bytes
just for the length, which could outweigh the storage of the actual
data. Variants 1 and 2 lose on some operations (1 loses on computing
len(), 2 loses on string concatenation). 3 would add the restriction
of not allowing U+0000 in a string (which would be reasonable IMO),
and make all length computations inefficient. However, it wouldn't
be worse than standard C.

Regards,
Martin
Chris Angelico
2014-06-04 07:06:25 UTC
Permalink
On Wed, Jun 4, 2014 at 5:02 PM, <***@v.loewis.de> wrote:
> There are more things to consider for the internal implementation,
> in particular how the string length is implemented. Several alternatives
> exist:
> 1. store the UTF-8 length (i.e. memory size)
> 2. store the number of code points (i.e. Python len())
> 3. store both
> 4. store neither, but use null termination instead
>
> Variant 3 is most run-time efficient, but could easily use 8 bytes
> just for the length, which could outweigh the storage of the actual
> data. Variants 1 and 2 lose on some operations (1 loses on computing
> len(), 2 loses on string concatenation). 3 would add the restriction
> of not allowing U+0000 in a string (which would be reasonable IMO),
> and make all length computations inefficient. However, it wouldn't
> be worse than standard C.

The current implementation stores a 16-bit length, which is both the
memory size and the len(). As far as I can see, the memory size is
never needed, so I'd just go for option 2; string concatenation is
already known to be one of those operations that can be slow if you do
it badly, and an optimized str.join() would cover the recommended
use-case.

ChrisA
Jeff Allen
2014-06-04 07:41:12 UTC
Permalink
Jython uses UTF-16 internally -- probably the only sensible choice in a
Python that can call Java. Indexing is O(N), fundamentally. By
"fundamentally", I mean for those strings that have not yet noticed that
they contain no supplementary (>0xffff) characters.

I've toyed with making this O(1) universally. Like Steven, I understand
this to be a freedom afforded to implementers, rather than an issue of
conformity.

Jeff Allen

On 04/06/2014 02:17, Steven D'Aprano wrote:
> There is a discussion over at MicroPython about the internal
> representation of Unicode strings.
...
> My own feeling is that O(1) string indexing operations are a quality of
> implementation issue, not a deal breaker to call it a Python. I can't
> see any requirement in the docs that str[n] must take O(1) time, but
> perhaps I have missed something.
>
INADA Naoki
2014-06-04 16:45:51 UTC
Permalink
For Jython and IronPython, UTF-16 may be best internal encoding.

Recent languages (Swiffy, Golang, Rust) chose UTF-8 as internal encoding.
Using utf-8 is simple and efficient. For example, no need for utf-8
copy of the string when writing to file
and serializing to JSON.

When implementing Python using these languages, UTF-8 will be best
internal encoding.

To allow Python implementations other than CPython can use UTF-8 or
UTF-16 as internal encoding efficiently,
I think adding internal position based API is the best solution.

>>> s = "\U00100000x"
>>> len(s)
2
>>> s[1:]
'x'
>>> s.find('x')
1
>>> # s.isize() # Internal length. 5 for utf-8, 3 for utf-16
>>> # s.ifind('x') # Internal position, 4 for utf-8, 2 for utf-16
>>> # s.islice(s.ifind('x')) => 'x'


(I like design of golang and Rust. I hope CPython uses utf-8 as
internal encoding in the future.
But this is off-topic.)


On Wed, Jun 4, 2014 at 4:41 PM, Jeff Allen <***@farowl.co.uk> wrote:
> Jython uses UTF-16 internally -- probably the only sensible choice in a
> Python that can call Java. Indexing is O(N), fundamentally. By
> "fundamentally", I mean for those strings that have not yet noticed that
> they contain no supplementary (>0xffff) characters.
>
> I've toyed with making this O(1) universally. Like Steven, I understand this
> to be a freedom afforded to implementers, rather than an issue of
> conformity.
>
> Jeff Allen
>
>
> On 04/06/2014 02:17, Steven D'Aprano wrote:
>>
>> There is a discussion over at MicroPython about the internal
>> representation of Unicode strings.
>
> ...
>
>> My own feeling is that O(1) string indexing operations are a quality of
>> implementation issue, not a deal breaker to call it a Python. I can't
>> see any requirement in the docs that str[n] must take O(1) time, but
>> perhaps I have missed something.
>>
>
> _______________________________________________
> Python-Dev mailing list
> Python-***@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/songofacandy%40gmail.com



--
INADA Naoki <***@gmail.com>
Terry Reedy
2014-06-04 21:19:29 UTC
Permalink
On 6/4/2014 3:41 AM, Jeff Allen wrote:
> Jython uses UTF-16 internally -- probably the only sensible choice in a
> Python that can call Java. Indexing is O(N), fundamentally. By
> "fundamentally", I mean for those strings that have not yet noticed that
> they contain no supplementary (>0xffff) characters.
>
> I've toyed with making this O(1) universally. Like Steven, I understand
> this to be a freedom afforded to implementers, rather than an issue of
> conformity.
>
> Jeff Allen
>
> On 04/06/2014 02:17, Steven D'Aprano wrote:
>> There is a discussion over at MicroPython about the internal
>> representation of Unicode strings.
> ...
>> My own feeling is that O(1) string indexing operations are a quality of
>> implementation issue, not a deal breaker to call it a Python. I can't
>> see any requirement in the docs that str[n] must take O(1) time, but
>> perhaps I have missed something.
>>
>


--
Terry Jan Reedy
Terry Reedy
2014-06-04 21:21:20 UTC
Permalink
On 6/4/2014 3:41 AM, Jeff Allen wrote:
> Jython uses UTF-16 internally -- probably the only sensible choice in a
> Python that can call Java. Indexing is O(N), fundamentally. By
> "fundamentally", I mean for those strings that have not yet noticed that
> they contain no supplementary (>0xffff) characters.

Indexing can be made O(log(k)) where k is the number of astral chars,
and is usually small.

--
Terry Jan Reedy
Serhiy Storchaka
2014-06-04 22:54:42 UTC
Permalink
05.06.14 00:21, Terry Reedy написав(ла):
> On 6/4/2014 3:41 AM, Jeff Allen wrote:
>> Jython uses UTF-16 internally -- probably the only sensible choice in a
>> Python that can call Java. Indexing is O(N), fundamentally. By
>> "fundamentally", I mean for those strings that have not yet noticed that
>> they contain no supplementary (>0xffff) characters.
>
> Indexing can be made O(log(k)) where k is the number of astral chars,
> and is usually small.

I like your idea and think it would be great if Jython will implement
it. Unfortunately it is too late to do this in CPython.
Terry Reedy
2014-06-05 02:25:03 UTC
Permalink
On 6/4/2014 6:54 PM, Serhiy Storchaka wrote:
> 05.06.14 00:21, Terry Reedy написав(ла):
>> On 6/4/2014 3:41 AM, Jeff Allen wrote:
>>> Jython uses UTF-16 internally -- probably the only sensible choice in a
>>> Python that can call Java. Indexing is O(N), fundamentally. By
>>> "fundamentally", I mean for those strings that have not yet noticed that
>>> they contain no supplementary (>0xffff) characters.
>>
>> Indexing can be made O(log(k)) where k is the number of astral chars,
>> and is usually small.
>
> I like your idea and think it would be great if Jython will implement
> it.

A proof of concept implementation in Python that handles both indexing
and slicing is on the tracker. It is simpler than I initially expected.

> Unfortunately it is too late to do this in CPython.

I mentioned it as an alternative during the '393 discussion. I more than
half agree that the FSR is the better choice for CPython, which had no
particular attachment to UTF-16 in the way that I think Jython, for
instance, does.

--
Terry Jan Reedy
Serhiy Storchaka
2014-06-05 08:08:21 UTC
Permalink
05.06.14 05:25, Terry Reedy написав(ла):
> I mentioned it as an alternative during the '393 discussion. I more than
> half agree that the FSR is the better choice for CPython, which had no
> particular attachment to UTF-16 in the way that I think Jython, for
> instance, does.

Yes, I remember. I thing that hybrid FSR-UTF16 (like FSR, but UTF-16 is
used instead of UCS4) is the better choice for CPython. I suppose that
with populating emoticons and other icon characters in nearest 5 or 10
years, even English text will often contain astral characters. And
spending 4 bytes per character if long text contains one astral
character looks too prodigally.
Stephen J. Turnbull
2014-06-05 10:00:01 UTC
Permalink
Serhiy Storchaka writes:

> Yes, I remember. I thing that hybrid FSR-UTF16 (like FSR, but UTF-16 is
> used instead of UCS4) is the better choice for CPython. I suppose that
> with populating emoticons and other icon characters in nearest 5 or 10
> years, even English text will often contain astral characters. And
> spending 4 bytes per character if long text contains one astral
> character looks too prodigally.

Why use something that complex if you don't have to? For the use case
you have in mind, just map them into private space. If you really
want to be aggressive, use surrogate space, too (anything that cares
what a scalar represents should be trapping on non-scalars, catch that
exception and look up the char -- dangerous, though, because such
exceptions are probably all over the place).
Serhiy Storchaka
2014-06-04 13:39:46 UTC
Permalink
04.06.14 04:17, Steven D'Aprano написав(ла):
> Would either of these trade-offs be acceptable while still claiming
> "Python 3.4 compatibility"?
>
> My own feeling is that O(1) string indexing operations are a quality of
> implementation issue, not a deal breaker to call it a Python. I can't
> see any requirement in the docs that str[n] must take O(1) time, but
> perhaps I have missed something.

I think than breaking O(1) expectation for indexing makes the
implementation significant incompatible with Python. Virtually all
string operations in Python operates with indices.

O(1) indexing operations can be kept with minimal memory requirements if
implement Unicode internally as modified UTF-8 plus optional array of
offsets for every, say, 32th character (which even can be compressed to
an array of 16-bit or 32-bit integers).
Daniel Holth
2014-06-04 14:01:12 UTC
Permalink
MicroPython is going to be significantly incompatible with Python
anyway. But you should be able to run your mp code on regular Python.

On Wed, Jun 4, 2014 at 9:39 AM, Serhiy Storchaka <***@gmail.com> wrote:
> 04.06.14 04:17, Steven D'Aprano написав(ла):
>
>> Would either of these trade-offs be acceptable while still claiming
>> "Python 3.4 compatibility"?
>>
>> My own feeling is that O(1) string indexing operations are a quality of
>> implementation issue, not a deal breaker to call it a Python. I can't
>> see any requirement in the docs that str[n] must take O(1) time, but
>> perhaps I have missed something.
>
>
> I think than breaking O(1) expectation for indexing makes the implementation
> significant incompatible with Python. Virtually all string operations in
> Python operates with indices.
>
> O(1) indexing operations can be kept with minimal memory requirements if
> implement Unicode internally as modified UTF-8 plus optional array of
> offsets for every, say, 32th character (which even can be compressed to an
> array of 16-bit or 32-bit integers).
>
>
> _______________________________________________
> Python-Dev mailing list
> Python-***@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/dholth%40gmail.com
Paul Moore
2014-06-04 14:02:48 UTC
Permalink
On 4 June 2014 14:39, Serhiy Storchaka <***@gmail.com> wrote:
> I think than breaking O(1) expectation for indexing makes the implementation
> significant incompatible with Python. Virtually all string operations in
> Python operates with indices.

I don't use indexing on strings except in rare situations. Sure I use
lots of operations that may well use indexing *internally* but that's
the point. MicroPython can optimise those operations without needing
to guarantee O(1) indexing, and I'd be fine with that.

Paul
Serhiy Storchaka
2014-06-04 14:40:14 UTC
Permalink
04.06.14 17:02, Paul Moore написав(ла):
> On 4 June 2014 14:39, Serhiy Storchaka <***@gmail.com> wrote:
>> I think than breaking O(1) expectation for indexing makes the implementation
>> significant incompatible with Python. Virtually all string operations in
>> Python operates with indices.
>
> I don't use indexing on strings except in rare situations. Sure I use
> lots of operations that may well use indexing *internally* but that's
> the point. MicroPython can optimise those operations without needing
> to guarantee O(1) indexing, and I'd be fine with that.

Any non-trivial text parsing uses indices or regular expressions (and
regular expressions themself use indices internally).

It would be interesting to collect a statistic about how many indexing
operations happened during the life of a string in typical (Micro)Python
program.
Paul Sokolovsky
2014-06-04 15:38:31 UTC
Permalink
Hello,

On Wed, 04 Jun 2014 17:40:14 +0300
Serhiy Storchaka <***@gmail.com> wrote:

> 04.06.14 17:02, Paul Moore написав(ла):
> > On 4 June 2014 14:39, Serhiy Storchaka <***@gmail.com> wrote:
> >> I think than breaking O(1) expectation for indexing makes the
> >> implementation significant incompatible with Python. Virtually all
> >> string operations in Python operates with indices.
> >
> > I don't use indexing on strings except in rare situations. Sure I
> > use lots of operations that may well use indexing *internally* but
> > that's the point. MicroPython can optimise those operations without
> > needing to guarantee O(1) indexing, and I'd be fine with that.
>
> Any non-trivial text parsing uses indices or regular expressions (and
> regular expressions themself use indices internally).

I keep hearing this stuff, and unfortunately so far don't have enough
time to collect all that stuff and provide detailed response. So,
here's spur of the moment response - hopefully we're in the same
context so it is easy to understand.

So, gentlemen, you keep mixing up character-by-character random access
to string and taking substrings of a string.

Character-by-character random access imply that you would need to scan
thru (possibly almost) all chars in a string. That's O(N) (N-length of
string). With varlength encoding (taking O(N) to index arbitrary char),
there's thus concern that this would be O(N^2) op.

But show me real-world case for that. Common usecase is scanning string
left-to-right, that should be done using iterator and thus O(N).
Right-to-left scanning would be order(s) of magnitude less frequent, as
and also handled by iterator.

What's next? You're doing some funky anagrams and need to swap each 2
adjacent chars? Sorry, naive implementation will be slow. If you're in
serious anagram business, you'll need to code C extension. No, wait!
Instead you should learn Python better. You should run a string
windowing iterator which will return adjacent pair and swap those
constant-len strings.

More cases anyone? Implementing DES and doing arbitrary permutations?
Kindly drop doing that on strings, do it on bytes or lists.

Hopefully, the idea is clear - if you *scan* thru string using indexes
in *random* order, you're doing weird thing and *want* weird
performance. Doing stuff is s[0] ot s[-1] - there's finite (and small)
number of such operation per strings.



Now about taking substrings of strings (which in Python often expressed
by slice indexing). Well, this is quite different from scanning each
character of a strings. Just like s[0]/s[-1] this usually happens
finite number of times for a particular string, independent of its
length, i.e. O(1) times (ex, you take a string and split it in 3
parts), or maybe number of substrings is not bound-fixed, but has
different growth order, O(M) (for example, you split string in tokens,
tokens can be long, but there're usually external limits on how many
it's sensible to have on one line).

So, again, you're not going to get quadric time unless you're unlucky
or sloppy. And just again, you should brush up your Python skills and
use regex functions shich return iterators to get your parsed tokens,
etc.

(To clarify the obvious - "you" here is abstract pronoun, not referring
to respectable Python developers who actually made it possible to write
efficient Python programs).


So, hopefully the point is conveyed - you can write inefficient Python
programs. CPython goes out of the way to hide many inefficiencies (using
unbelievably bloated heap usage - from uPy's point of view, which
starts up in 2K heap). You just shouldn't write inefficient programs,
voila. But if you want, you can keep writing inefficient programs, they
just will be inefficient. Peace.

> It would be interesting to collect a statistic about how many
> indexing operations happened during the life of a string in typical
> (Micro)Python program.

Yup.

--
Best regards,
Paul mailto:***@gmail.com
Steve Dower
2014-06-04 15:51:38 UTC
Permalink
Paul Sokolovsky wrote:
> You just shouldn't write inefficient programs, voila. But if you want, you can keep writing inefficient programs, they just will be inefficient. Peace.

Can I nominate this for QOTD? :)

Cheers,
Steve
Serhiy Storchaka
2014-06-04 16:49:18 UTC
Permalink
04.06.14 18:38, Paul Sokolovsky написав(ла):
>> Any non-trivial text parsing uses indices or regular expressions (and
>> regular expressions themself use indices internally).
>
> I keep hearing this stuff, and unfortunately so far don't have enough
> time to collect all that stuff and provide detailed response. So,
> here's spur of the moment response - hopefully we're in the same
> context so it is easy to understand.
>
> So, gentlemen, you keep mixing up character-by-character random access
> to string and taking substrings of a string.
>
> Character-by-character random access imply that you would need to scan
> thru (possibly almost) all chars in a string. That's O(N) (N-length of
> string). With varlength encoding (taking O(N) to index arbitrary char),
> there's thus concern that this would be O(N^2) op.
>
> But show me real-world case for that. Common usecase is scanning string
> left-to-right, that should be done using iterator and thus O(N).
> Right-to-left scanning would be order(s) of magnitude less frequent, as
> and also handled by iterator.

html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize don't
use iterators. They use indices, str.find and/or regular expressions.
Common use case is quickly find substring starting from current position
using str.find or re.search, process found token, advance position and
repeat.
Paul Sokolovsky
2014-06-04 17:05:20 UTC
Permalink
Hello,

On Wed, 04 Jun 2014 19:49:18 +0300
Serhiy Storchaka <***@gmail.com> wrote:

[]
> > But show me real-world case for that. Common usecase is scanning
> > string left-to-right, that should be done using iterator and thus
> > O(N). Right-to-left scanning would be order(s) of magnitude less
> > frequent, as and also handled by iterator.
>
> html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize
> don't use iterators. They use indices, str.find and/or regular
> expressions. Common use case is quickly find substring starting from
> current position using str.find or re.search, process found token,
> advance position and repeat.

That's sad, I agree.


--
Best regards,
Paul mailto:***@gmail.com
Serhiy Storchaka
2014-06-04 17:52:14 UTC
Permalink
04.06.14 20:05, Paul Sokolovsky написав(ла):
> On Wed, 04 Jun 2014 19:49:18 +0300
> Serhiy Storchaka <***@gmail.com> wrote:
>> html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize
>> don't use iterators. They use indices, str.find and/or regular
>> expressions. Common use case is quickly find substring starting from
>> current position using str.find or re.search, process found token,
>> advance position and repeat.
>
> That's sad, I agree.

Other languages (Go, Rust) can be happy without O(1) indexing of
strings. All string and regex operations work with iterators or cursors,
and I believe this approach is not significant worse than implementing
strings as O(1)-indexable arrays of characters (for some applications it
can be worse, for other it can be better). But Python is different
language, it has different operations for strings and different idioms.
A language which doesn't support O(1) indexing is not Python, it is only
Python-like language.
Paul Sokolovsky
2014-06-04 18:29:29 UTC
Permalink
Hello,

On Wed, 04 Jun 2014 20:52:14 +0300
Serhiy Storchaka <***@gmail.com> wrote:

[]
> > That's sad, I agree.
>
> Other languages (Go, Rust) can be happy without O(1) indexing of
> strings. All string and regex operations work with iterators or
> cursors, and I believe this approach is not significant worse than
> implementing strings as O(1)-indexable arrays of characters (for some
> applications it can be worse, for other it can be better). But Python
> is different language, it has different operations for strings and
> different idioms. A language which doesn't support O(1) indexing is
> not Python, it is only Python-like language.

Sorry, but that's just your personal opinion, not shared by other
developers, as this thread showed. And let's not pretend we live in
happy-ever world of Python 1.5.2 which doesn't need anything more
because it's perfect as it is. Somebody added all those iterators and
iterator-returning functions to Pythons. And then the problem Python
has is a typical "last mile" problem, that iterators were not applied
completely everywhere. There's little choice but to move in that
direction, though.

What you call "idioms", other people call "sloppy programming
practices". There's common suggestion how to be at peace with Python's
indentation for those who find it a problem - "get over it". Well,
somehow it itches to say same for people who think that Python3 should
be used the same way as Python1: Get over the fact that Python is no
longer little funny language being laughed at by Perl crowd for being
order of magnitude slower at processing text files. While you still can
do little funny tricks we all love Python for, it now also offers
framework to do it right, and it makes little sense saying that doing it
little funny way is the definitive trait of Python.

(And for me it's easy to be such categorical - the only way I could
subscribe to idea of running Python on an MCU and not be laughable is by
trusting Python to provide framework for being efficient. I quit
working on another language because I have trusted that iterator,
generator, buffer protocols are not little funny things but thoroughly
engineered efficient concepts, and I don't feel betrayed.)


--
Best regards,
Paul mailto:***@gmail.com
Greg Ewing
2014-06-05 00:08:21 UTC
Permalink
Serhiy Storchaka wrote:
> A language which doesn't support O(1) indexing is not Python, it is only
> Python-like language.

That's debatable, but even if it's true, I don't think
there's anything wrong with MicroPython being only a
"Python-like language". As has been pointed out, fitting
Python onto a small device is always going to necessitate
some compromises.

--
Greg
Paul Sokolovsky
2014-06-05 01:19:13 UTC
Permalink
Hello,

On Thu, 05 Jun 2014 12:08:21 +1200
Greg Ewing <***@canterbury.ac.nz> wrote:

> Serhiy Storchaka wrote:
> > A language which doesn't support O(1) indexing is not Python, it is
> > only Python-like language.
>
> That's debatable, but even if it's true, I don't think
> there's anything wrong with MicroPython being only a
> "Python-like language". As has been pointed out, fitting
> Python onto a small device is always going to necessitate
> some compromises.

Thanks. I mentioned in another mail that we exactly trying to develop a
minimalistic, but Python implementation, not Python-like language.

What is "Python-like" for me. The other most well-know, and mature (as
in "started quite some time ago") "small Python" implementation is
PyMite aka Python-on-a-chip
https://code.google.com/p/python-on-a-chip/ . It implements good deal
of Python2 language. It doesn't implement exception handling
(try/except). Can a Python be without exception handling? For me,
the clear answer is "no".

Please put that in perspective when alarming over O(1) indexing of
inherently problematic niche datatype. (Again, it's not my or
MicroPython's fault that it was forced as standard string type. Maybe
if CPython seriously considered now-standard UTF-8 encoding, results
of what is "str" type might be different. But CPython has gigabytes of
heap to spare, and for MicroPython, every half-bit is precious).


>
> --
> Greg
> _______________________________________________
> Python-Dev mailing list
> Python-***@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/pmiscml%40gmail.com



--
Best regards,
Paul mailto:***@gmail.com
Stephen J. Turnbull
2014-06-05 07:54:11 UTC
Permalink
Paul Sokolovsky writes:

> Please put that in perspective when alarming over O(1) indexing of
> inherently problematic niche datatype. (Again, it's not my or
> MicroPython's fault that it was forced as standard string type. Maybe
> if CPython seriously considered now-standard UTF-8 encoding, results
> of what is "str" type might be different. But CPython has gigabytes of
> heap to spare, and for MicroPython, every half-bit is precious).

Would you please stop trolling? The reasons for adopting Unicode as a
separate data type were good and sufficient in 2000, and they remain
so today, even if you have been fortunate enough not to burn yourself
on character-byte conflation yet.

What matters to you is that str (unicode) is an opaque type -- there
is no specification of the internal representation in the language
reference, and in fact several different ones coexist happily across
existing Python implementations -- and you're free to use a UTF-8
implementation if that suits the applications you expect for
MicroPython.

PEP 393 exists, of course, and specifies the current internal
representation for CPython 3. But I don't see anything in it that
suggests it's mandated for any other implementation.
Paul Sokolovsky
2014-06-05 11:25:28 UTC
Permalink
Hello,

On Thu, 05 Jun 2014 16:54:11 +0900
"Stephen J. Turnbull" <***@xemacs.org> wrote:

> Paul Sokolovsky writes:
>
> > Please put that in perspective when alarming over O(1) indexing of
> > inherently problematic niche datatype. (Again, it's not my or
> > MicroPython's fault that it was forced as standard string type.
> > Maybe if CPython seriously considered now-standard UTF-8 encoding,
> > results of what is "str" type might be different. But CPython has
> > gigabytes of heap to spare, and for MicroPython, every half-bit is
> > precious).
>
> Would you please stop trolling? The reasons for adopting Unicode as a
> separate data type were good and sufficient in 2000, and they remain

If it was kept at "separate data type" bay, there wouldn't be any
problem. But it was made "one and only string type", and all strife
started then.

And there going to be "trolling" as long as Python developers and
decision-makers will ignore (troll?) outcry from the community (again, I
was surprised and not surprised to see ~50% of traffic on python-list
touches Unicode issues).

Well, I understand the plan - hoping that people will "get over this".
And I'm personally happy to stay away from this "trolling", but any
discussion related to Unicode goes in circles and returns to feeling
that Unicode at the central role as put there by Python3 is misplaced.

Then for me, it's just a matter of job security and personal future - I
don't want to spend rest of my days as a javascript (or other idiotic
language) monkey. And the message is clear in the air
(http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/ and
elsewhere): if Python strings are now in Go, and in Python itself are
now Java strings, all causing strife, why not go cruising around and see
what's up, instead of staying strong, and growing bigger, community.

> so today, even if you have been fortunate enough not to burn yourself
> on character-byte conflation yet.
>
> What matters to you is that str (unicode) is an opaque type -- there
> is no specification of the internal representation in the language
> reference, and in fact several different ones coexist happily across
> existing Python implementations -- and you're free to use a UTF-8
> implementation if that suits the applications you expect for
> MicroPython.
>
> PEP 393 exists, of course, and specifies the current internal
> representation for CPython 3. But I don't see anything in it that
> suggests it's mandated for any other implementation.

I knew all this before very well. What's strange is that other
developers don't know, or treat seriously, all of the above. That's why
gentleman who kindly was interested in adding Unicode support to
MicroPython started with the idea of dragging in CPython implementation.
And the only effect persuasion that it's not necessarily the best
solution had, was that he started to feel that he's being manipulated
into writing something ugly, instead of the bright idea he had.

That's why another gentleman reduces it to: "O(1) on string indexing or
not a Python!".

And that's why another gentleman, who agrees to UTF-8 arguments, still
gives an excuse
(https://mail.python.org/pipermail/python-dev/2014-June/134727.html):
"In this context, while a fixed-width encoding may be the correct
choice it would also likely be the wrong choice."


In this regard, I'm glad to participate in mind-resetting discussion.
So, let's reiterate - there's nothing like "the best", "the only right",
"the only correct", "righter than", "more correct than" in CPython's
implementation of Unicode storage. It is *arbitrary*. Well, sure, it's
not arbitrary, but based on requirements, and these requirements match
CPython's (implied) usage model well enough. But among all possible
sets of requirements, CPython's requirements are no more valid that
other possible. And other set of requirement fairly clearly lead to
situation where CPython implementation is rejected as not correct for
those requirements at all.



--
Best regards,
Paul mailto:***@gmail.com
Nick Coghlan
2014-06-05 11:43:16 UTC
Permalink
Received: from localhost (HELO mail.python.org) (127.0.0.1)
by albatross.python.org with SMTP; 05 Jun 2014 13:43:18 +0200
Received: from mail-qg0-x232.google.com (unknown
[IPv6:2607:f8b0:400d:c04::232])
(using TLSv1 with cipher ECDHE-RSA-AES128-SHA (128/128 bits))
(No client certificate requested)
by mail.python.org (Postfix) with ESMTPS
for <python-***@python.org>; Thu, 5 Jun 2014 13:43:17 +0200 (CEST)
Received: by mail-qg0-f50.google.com with SMTP id z60so1317731qgd.9
for <python-***@python.org>; Thu, 05 Jun 2014 04:43:16 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
h=mime-version:in-reply-to:references:date:message-id:subject:from:to
:cc:content-type;
bh=sInq+AT9GjaI7r3qY8lgnrR+lAK+TBCJLP7sYfMxcCI=;
b=ATTWgJIvgd7zmjrSQqS4qeRRzB+m5yAXOEsejI8DY80CjERn/kzhN8WGblEjictQbb
RAOklDweDLpOD8MjIzZyJnv4qyRBs2nsXLw/ZE6vSzNChlCzctoCN1pz8N4ATGBrsHZO
67HK2gJBJRn3c0bDG7dQ3uYRx7JJg9Fa+Wh0e+oYEdrmJCFDUJi6n0ri0vPHDACNBTPF
W4g/oUrHRueaqim/pPOgmwdcsGtbU7Mfrqd0T+SuER5O+HEnc1OxWKViqJpvRM/qDofB
OZwe1vyeTGiuG7xKhLbVG4jNgLmQ5R+xpIvUn0VNfS4SX79kzFUpY82LwowNXhqrPdKo
LP1A==
X-Received: by 10.140.97.197 with SMTP id m63mr74960827qge.15.1401968596838;
Thu, 05 Jun 2014 04:43:16 -0700 (PDT)
Received: by 10.224.183.210 with HTTP; Thu, 5 Jun 2014 04:43:16 -0700 (PDT)
In-Reply-To: <***@x34f>
X-BeenThere: python-***@python.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Python core developers <python-dev.python.org>
List-Unsubscribe: <https://mail.python.org/mailman/options/python-dev>,
<mailto:python-dev-***@python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/pipermail/python-dev/>
List-Post: <mailto:python-***@python.org>
List-Help: <mailto:python-dev-***@python.org?subject=help>
List-Subscribe: <https://mail.python.org/mailman/listinfo/python-dev>,
<mailto:python-dev-***@python.org?subject=subscribe>
Errors-To: python-dev-bounces+python-python-dev=***@python.org
Sender: "Python-Dev"
<python-dev-bounces+python-python-dev=***@python.org>
Archived-At: <http://permalink.gmane.org/gmane.comp.python.devel/147986>

On 5 June 2014 21:25, Paul Sokolovsky <***@gmail.com> wrote:
> Well, I understand the plan - hoping that people will "get over this".
> And I'm personally happy to stay away from this "trolling", but any
> discussion related to Unicode goes in circles and returns to feeling
> that Unicode at the central role as put there by Python3 is misplaced.

Many of the challenges network programmers face in Python 3 are around
binary data being more inconvenient to work with than it needs to be,
not the fact we decentralised boundary code by offering a strict
binary/text separation as the default mode of operation. Aside from
some of the POSIX locale handling issues on Linux, many of the
concerns are with the usability of bytes and bytearray, not with str -
that's why binary interpolation is coming back in 3.5, and there will
likely be other usability tweaks for those types as well.

More on that at
http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#what-actually-changed-in-the-text-model-between-python-2-and-python-3

Cheers,
Nick.

--
Nick Coghlan | ***@gmail.com | Brisbane, Australia
Paul Sokolovsky
2014-06-05 12:01:21 UTC
Permalink
Hello,

On Thu, 5 Jun 2014 21:43:16 +1000
Nick Coghlan <***@gmail.com> wrote:

> On 5 June 2014 21:25, Paul Sokolovsky <***@gmail.com> wrote:
> > Well, I understand the plan - hoping that people will "get over
> > this". And I'm personally happy to stay away from this "trolling",
> > but any discussion related to Unicode goes in circles and returns
> > to feeling that Unicode at the central role as put there by Python3
> > is misplaced.
>
> Many of the challenges network programmers face in Python 3 are around
> binary data being more inconvenient to work with than it needs to be,
> not the fact we decentralised boundary code by offering a strict
> binary/text separation as the default mode of operation.

Just to clarify - (many) other gentlemen and I (in that order, I'm not
taking a lead), don't call to go back to Python2 behavior with implicit
conversion between byte-oriented strings and Unicode, etc. They just
point out that perhaps Python3 went too far with Unicode cause by making
it the default string type. Strict separation is surely mostly good
thing (I can sigh that it leads to Java-like dichotomical bloat for all
I/O classes, but well, I was able to put up with that in MicroPython
already).

> Aside from
> some of the POSIX locale handling issues on Linux, many of the
> concerns are with the usability of bytes and bytearray, not with str -
> that's why binary interpolation is coming back in 3.5, and there will
> likely be other usability tweaks for those types as well.

All these changes are what let me dream on and speculate on
possibility that Python4 could offer an encoding-neutral string type
(which means based on bytes), while move unicode back to an explicit
type to be used explicitly only when needed (bloated frameworks like
Django can force users to it anyway, but that will be forcing on
framework level, not on language level, against which people rebel.)
People can dream, right?


Thanks,
Paul mailto:***@gmail.com
Nick Coghlan
2014-06-05 12:20:04 UTC
Permalink
On 5 June 2014 22:01, Paul Sokolovsky <***@gmail.com> wrote:
>> Aside from
>> some of the POSIX locale handling issues on Linux, many of the
>> concerns are with the usability of bytes and bytearray, not with str -
>> that's why binary interpolation is coming back in 3.5, and there will
>> likely be other usability tweaks for those types as well.
>
> All these changes are what let me dream on and speculate on
> possibility that Python4 could offer an encoding-neutral string type
> (which means based on bytes), while move unicode back to an explicit
> type to be used explicitly only when needed (bloated frameworks like
> Django can force users to it anyway, but that will be forcing on
> framework level, not on language level, against which people rebel.)
> People can dream, right?

If you don't model strings as arrays of code points, or at least
assume a particular universal encoding (like UTF-8), you have to give
up string concatenation in order to tolerate arbitrary encodings -
otherwise you end up with unintelligible data that nobody can decode
because it switches encodings without notice. That's a viable model if
your OS guarantees it (Mac OS X does, for example, so Python 3 assumes
UTF-8 for all OS interfaces there), but Linux currently has no such
guarantee - many runtimes just decide they don't care, and assume
UTF-8 anyway (Python 3 may even join them some day, due to the
problems caused by trusting the locale encoding to be correct, but the
startup code will need non-trivial changes for that to happen - the
C.UTF-8 locale may even become widespread before we get there).

Cheers,
Nick.

--
Nick Coghlan | ***@gmail.com | Brisbane, Australia
Paul Sokolovsky
2014-06-05 12:37:08 UTC
Permalink
Hello,

On Thu, 5 Jun 2014 22:20:04 +1000
Nick Coghlan <***@gmail.com> wrote:

[]
> problems caused by trusting the locale encoding to be correct, but the
> startup code will need non-trivial changes for that to happen - the
> C.UTF-8 locale may even become widespread before we get there).

... And until those golden times come, it would be nice if Python did
not force its perfect world model, which unfortunately is not based on
surrounding reality, and let users solve their encoding problems
themselves - when they need, because again, one can go quite a long way
without dealing with encodings at all. Whereas now Python3 forces users
to deal with encoding almost universally, but forcing a particular for
all strings (which is again, doesn't correspond to the state of
surrounding reality). I already hear response that it's good that users
taught to deal with encoding, that will make them write correct
programs, but that's a bit far away from the original aim of making it
write "correct" programs easy and pleasant. (And definition of
"correct" vary.)

But all that is just an opinion.

>
> Cheers,
> Nick.
>
> --
> Nick Coghlan | ***@gmail.com | Brisbane, Australia



--
Best regards,
Paul mailto:***@gmail.com
Nick Coghlan
2014-06-05 13:15:54 UTC
Permalink
On 5 June 2014 22:37, Paul Sokolovsky <***@gmail.com> wrote:
> On Thu, 5 Jun 2014 22:20:04 +1000
> Nick Coghlan <***@gmail.com> wrote:
>> problems caused by trusting the locale encoding to be correct, but the
>> startup code will need non-trivial changes for that to happen - the
>> C.UTF-8 locale may even become widespread before we get there).
>
> ... And until those golden times come, it would be nice if Python did
> not force its perfect world model, which unfortunately is not based on
> surrounding reality, and let users solve their encoding problems
> themselves - when they need, because again, one can go quite a long way
> without dealing with encodings at all. Whereas now Python3 forces users
> to deal with encoding almost universally, but forcing a particular for
> all strings (which is again, doesn't correspond to the state of
> surrounding reality). I already hear response that it's good that users
> taught to deal with encoding, that will make them write correct
> programs, but that's a bit far away from the original aim of making it
> write "correct" programs easy and pleasant. (And definition of
> "correct" vary.)

As I've said before in other contexts, find me Windows, Mac OS X and
JVM developers, or educators and scientists that are as concerned by
the text model changes as folks that are primarily focused on Linux
system (including network) programming, and I'll be more willing to
concede the point.

Windows, Mac OS X, and the JVM are all opinionated about the text
encodings to be used at platform boundaries (using UTF-16, UTF-8 and
UTF-16, respectively). By contrast, Linux (or, more accurately, POSIX)
says "well, it's configurable, but we won't provide a reliable
mechanism for finding out what the encoding is. So either guess as
best you can based on the info the OS *does* provide, assume UTF-8,
assume 'some ASCII compatible encoding', or don't do anything that
requires knowing the encoding of the data being exchanged with the OS,
like, say, displaying file names to users or accepting arbitrary text
as input, transforming it in a content aware fashion, and echoing it
back in a console application".

None of those options are perfectly good choices. 6(ish) years ago, we
chose the first option, because it has the best chance of working
properly on Linux systems that use ASCII incompatible encodings like
ShiftJIS, ISO-2022, and various other East Asian codecs. For normal
user space programming, Linux is pretty reliable when it comes to
ensuring the locale encoding is set to something sensible, but the
price we currently pay for that decision is interoperability issues
with things like daemons not receiving any configuration settings and
hence falling back the POSIX locale and ssh environment forwarding
moving a clients encoding settings to a session on a server with
different settings. I still consider it preferable to impose
inconveniences like that based on use case (situations where Linux
systems don't provide sensible encoding settings) than geographic
region (locales where ASCII incompatible encodings are likely to still
be in common use).

If I (or someone else) ever find the time to implement PEP 432 (or
something like it) to address some of the limitations of the
interpreter startup sequence that currently make it difficult to avoid
relying on the POSIX locale encoding on Linux, then we'll be in a
position to reassess that decision based on the increased adoption of
UTF-8 by Linux distributions in recent years. As the major community
Linux distributions complete the migration of their system utilities
to Python 3, we'll get to see if they decide it's better to make their
locale settings more reliable, or help make it easier for Python 3 to
ignore them when they're wrong.

Cheers,
Nick.

--
Nick Coghlan | ***@gmail.com | Brisbane, Australia
Paul Moore
2014-06-05 15:59:51 UTC
Permalink
On 5 June 2014 14:15, Nick Coghlan <***@gmail.com> wrote:
> As I've said before in other contexts, find me Windows, Mac OS X and
> JVM developers, or educators and scientists that are as concerned by
> the text model changes as folks that are primarily focused on Linux
> system (including network) programming, and I'll be more willing to
> concede the point.

There is once again a strong selection bias in this discussion, by its
very nature. People who like the new model don't have anything to
complain about, and so are not heard.

Just to support Nick's point, I for one find the Python 3 text model a
huge benefit, both in practical terms of making my programs more
robust, and educationally, as I have a far better understanding of
encodings and their issues than I ever did under Python 2. Whenever a
discussion like this occurs, I find it hard not to resent the people
arguing that the new model should be taken away from me and replaced
with a form of the old error-prone (for me) approach - as if it was in
my best interests.

Internal details don't bother me - using UTF8 and having indexing be
potentially O(N) is of little relevance. But make me work with a
string type that *doesn't* abstract a string as a sequence of Unicode
code points and I'll get very upset.

Paul
Tim Delaney
2014-06-05 12:21:30 UTC
Permalink
On 5 June 2014 22:01, Paul Sokolovsky <***@gmail.com> wrote:

>
> All these changes are what let me dream on and speculate on
> possibility that Python4 could offer an encoding-neutral string type
> (which means based on bytes)
>

To me, an "encoding neutral string type" means roughly "characters are
atomic", and the best representation we have for a "character" is a Unicode
code point. Through any interface that provides "characters" each
individual "character" (code point) is indivisible.

To me, Python 3 has exactly an "encoding-neutral string type". It also has
a bytes type that is is just that - bytes which can represent anything at
all.It might be the UTF-8 representation of a string, but you have the
freedom to manipulate it however you like - including making it no longer
valid UTF-8.

Whilst I think O(1) indexing of strings is important, I don't think it's as
important as the property that "characters" are indivisible and would be
quite happy for MicroPython to use UTF-8 as the underlying string
representation (or some more clever thing, several ideas in this thread) so
long as:

1. It maintains a string type that presents code points as indivisible
elements;

2. The performance consequences of using UTF-8 are documented, as well as
any optimisations, tricks, etc that are used to overcome those consequences
(and what impact if any they would have if code written for MicroPython was
run in CPython).

Cheers,

Tim Delaney
Stefan Krah
2014-06-05 12:10:54 UTC
Permalink
Paul Sokolovsky <***@gmail.com> wrote:
> In this regard, I'm glad to participate in mind-resetting discussion.
> So, let's reiterate - there's nothing like "the best", "the only right",
> "the only correct", "righter than", "more correct than" in CPython's
> implementation of Unicode storage. It is *arbitrary*. Well, sure, it's
> not arbitrary, but based on requirements, and these requirements match
> CPython's (implied) usage model well enough. But among all possible
> sets of requirements, CPython's requirements are no more valid that
> other possible. And other set of requirement fairly clearly lead to
> situation where CPython implementation is rejected as not correct for
> those requirements at all.

Several core-devs have said that using UTF-8 for MicroPython is perfectly okay.
I also think it's the right choice and I hope that you guys come up with a very
efficient implementation.


Stefan Krah
Nick Coghlan
2014-06-05 12:38:13 UTC
Permalink
On 5 June 2014 22:10, Stefan Krah <***@bytereef.org> wrote:
> Paul Sokolovsky <***@gmail.com> wrote:
>> In this regard, I'm glad to participate in mind-resetting discussion.
>> So, let's reiterate - there's nothing like "the best", "the only right",
>> "the only correct", "righter than", "more correct than" in CPython's
>> implementation of Unicode storage. It is *arbitrary*. Well, sure, it's
>> not arbitrary, but based on requirements, and these requirements match
>> CPython's (implied) usage model well enough. But among all possible
>> sets of requirements, CPython's requirements are no more valid that
>> other possible. And other set of requirement fairly clearly lead to
>> situation where CPython implementation is rejected as not correct for
>> those requirements at all.
>
> Several core-devs have said that using UTF-8 for MicroPython is perfectly okay.
> I also think it's the right choice and I hope that you guys come up with a very
> efficient implementation.

Based on this discussion , I've also posted a draft patch aimed at
clarifying the relevant aspects of the data model section of the
language reference (http://bugs.python.org/issue21667).

Cheers,
Nick.

--
Nick Coghlan | ***@gmail.com | Brisbane, Australia
Nick Coghlan
2014-06-05 11:32:19 UTC
Permalink
On 5 June 2014 17:54, Stephen J. Turnbull <***@xemacs.org> wrote:
> What matters to you is that str (unicode) is an opaque type -- there
> is no specification of the internal representation in the language
> reference, and in fact several different ones coexist happily across
> existing Python implementations -- and you're free to use a UTF-8
> implementation if that suits the applications you expect for
> MicroPython.

However, as others have noted in the thread, the critical thing is to
*not* let that internal implementation detail leak into the Python
level string behaviour. That's what happened with narrow builds of
Python 2 and pre-PEP-393 releases of Python 3 (effectively using
UTF-16 internally), and it was the cause of a sufficiently large
number of bugs that the Linux distributions tend to instead accept the
memory cost of using wide builds (4 bytes for all code points) for
affected versions.

Preserving the "the Python 3 str type is an immutable array of code
points" semantics matters significantly more than whether or not
indexing by code point is O(1). The various caching tricks suggested
in this thread (especially "leading ASCII characters", "trailing ASCII
characters" and "position & index of last lookup") could keep the
typical lookup performance well below O(N).

> PEP 393 exists, of course, and specifies the current internal
> representation for CPython 3. But I don't see anything in it that
> suggests it's mandated for any other implementation.

CPython is constrained by C API compatibility requirements, as well as
implementation constraints due to the amount of internal code that
would need to be rewritten to handle a variable width encoding as the
canonical internal representation (since the problems with Python 2
narrow builds mean we already know variable width encodings aren't
handled correctly by the current code).

Implementations that share code with CPython, or try to mimic the C
API especially closely, may face similar restrictions. Outside that, I
think we're better off if alternative implementations are free to
experiment with different internal string representations.

Cheers,
Nick.

--
Nick Coghlan | ***@gmail.com | Brisbane, Australia
Serhiy Storchaka
2014-06-05 07:26:19 UTC
Permalink
05.06.14 03:08, Greg Ewing написав(ла):
> Serhiy Storchaka wrote:
>> A language which doesn't support O(1) indexing is not Python, it is
>> only Python-like language.
>
> That's debatable, but even if it's true, I don't think
> there's anything wrong with MicroPython being only a
> "Python-like language". As has been pointed out, fitting
> Python onto a small device is always going to necessitate
> some compromises.

Agree, there's anything wrong. I think that even limiting integers to 32
or 64 bits is acceptable compromise for Python-like language targeted to
small devices. But programming on such language requires different
techniques and habits.
Greg Ewing
2014-06-05 00:03:17 UTC
Permalink
Serhiy Storchaka wrote:
> html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize don't
> use iterators. They use indices, str.find and/or regular expressions.
> Common use case is quickly find substring starting from current position
> using str.find or re.search, process found token, advance position and
> repeat.

For that kind of thing, you don't need an actual character
index, just some way of referring to a place in a string.

Instead of an integer, str.find() etc. could return a
StringPosition, which would be an opaque reference to a
particular point in a particular string. You would be
able to pass StringPositions to indexing and slicing
operations to get fast indexing into the string that
they were derived from.

StringPositions could support the following operations:

StringPosition + int --> StringPosition
StringPosition - int --> StringPosition
StringPosition - StringPosition --> int

These would be computed by counting characters forwards
or backwards in the string, which would be slower than
int arithmetic but still faster than counting from the
beginning of the string every time.

In other contexts, StringPositions would coerce to ints
(maybe being an int subclass?) allowing them to be used
in any existing algorithm that slices strings using ints.

--
Greg
Glenn Linderman
2014-06-05 00:08:33 UTC
Permalink
On 6/4/2014 5:03 PM, Greg Ewing wrote:
> Serhiy Storchaka wrote:
>> html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize
>> don't use iterators. They use indices, str.find and/or regular
>> expressions. Common use case is quickly find substring starting from
>> current position using str.find or re.search, process found token,
>> advance position and repeat.
>
> For that kind of thing, you don't need an actual character
> index, just some way of referring to a place in a string.

I think you meant codepoint index, rather than character index.

>
> Instead of an integer, str.find() etc. could return a
> StringPosition, which would be an opaque reference to a
> particular point in a particular string. You would be
> able to pass StringPositions to indexing and slicing
> operations to get fast indexing into the string that
> they were derived from.
>
> StringPositions could support the following operations:
>
> StringPosition + int --> StringPosition
> StringPosition - int --> StringPosition
> StringPosition - StringPosition --> int
>
> These would be computed by counting characters forwards
> or backwards in the string, which would be slower than
> int arithmetic but still faster than counting from the
> beginning of the string every time.
>
> In other contexts, StringPositions would coerce to ints
> (maybe being an int subclass?) allowing them to be used
> in any existing algorithm that slices strings using ints.
>
This starts to diverge from Python codepoint indexing via integers.
Calculating or caching the codepoint index to byte offset as part of the
str implementation stays compatible with Python. Introducing
StringPosition makes a Python-like language. Or so it seems to me.
Glenn Linderman
2014-06-05 00:13:37 UTC
Permalink
On 6/4/2014 5:08 PM, Glenn Linderman wrote:
> On 6/4/2014 5:03 PM, Greg Ewing wrote:
>> Serhiy Storchaka wrote:
>>> html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize
>>> don't use iterators. They use indices, str.find and/or regular
>>> expressions. Common use case is quickly find substring starting from
>>> current position using str.find or re.search, process found token,
>>> advance position and repeat.
>>
>> For that kind of thing, you don't need an actual character
>> index, just some way of referring to a place in a string.
>
> I think you meant codepoint index, rather than character index.
>
>>
>> Instead of an integer, str.find() etc. could return a
>> StringPosition, which would be an opaque reference to a
>> particular point in a particular string. You would be
>> able to pass StringPositions to indexing and slicing
>> operations to get fast indexing into the string that
>> they were derived from.
>>
>> StringPositions could support the following operations:
>>
>> StringPosition + int --> StringPosition
>> StringPosition - int --> StringPosition
>> StringPosition - StringPosition --> int
>>
>> These would be computed by counting characters forwards
>> or backwards in the string, which would be slower than
>> int arithmetic but still faster than counting from the
>> beginning of the string every time.
>>
>> In other contexts, StringPositions would coerce to ints
>> (maybe being an int subclass?) allowing them to be used
>> in any existing algorithm that slices strings using ints.
>>
> This starts to diverge from Python codepoint indexing via integers.
> Calculating or caching the codepoint index to byte offset as part of
> the str implementation stays compatible with Python. Introducing
> StringPosition makes a Python-like language. Or so it seems to me.

Another thought is that StringPosition only works (quickly, at least),
as you point out, for the string that they were derived from... so
algorithms that walk two strings at a time cannot use the same
StringPosition to do so... yep, this is quite divergent from CPython and
Python.
Greg Ewing
2014-06-05 00:57:16 UTC
Permalink
Glenn Linderman wrote:
>
> so algorithms that walk two strings at a time cannot use the same
> StringPosition to do so... yep, this is quite divergent from CPython and
> Python.

They can, it's just that at most one of the indexing
operations would be fast; the StringPosition would
devolve into an int for the other one.

Such an algorithm would be of dubious correctness
anyway, since as you pointed out, codepoints and
characters are not quite the same thing. A codepoint
index in one string doesn't necessarily count off
the same number of characters in another string.
So to be safe, you should really walk each string
individually.

--
Greg
Greg Ewing
2014-06-05 00:52:03 UTC
Permalink
Glenn Linderman wrote:
>
>> For that kind of thing, you don't need an actual character
>> index, just some way of referring to a place in a string.
>
> I think you meant codepoint index, rather than character index.

Probably, but what I said is true either way.

> This starts to diverge from Python codepoint indexing via integers.

That's true, although most programs would have to go
out of their way to tell the difference, especially if
StringPosition were a subclass of int.

I agree that cacheing indexes would be more transparent,
though.

--
Greg
Paul Sokolovsky
2014-06-05 01:01:38 UTC
Permalink
Hello,

On Thu, 05 Jun 2014 12:03:17 +1200
Greg Ewing <***@canterbury.ac.nz> wrote:

> Serhiy Storchaka wrote:
> > html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize
> > don't use iterators. They use indices, str.find and/or regular
> > expressions. Common use case is quickly find substring starting
> > from current position using str.find or re.search, process found
> > token, advance position and repeat.
>
> For that kind of thing, you don't need an actual character
> index, just some way of referring to a place in a string.
>
> Instead of an integer, str.find() etc. could return a
> StringPosition,

That's more brave then I had in mind, but definitely shows what
alternative implementation have in store to fight back if some
perfomance problems are actually detected. My own thoughts were, for
example, as response to people who (quoting) "slice strings for living"
is some form of "extended slicing" like str[(0, 4, 6, 8, 15)].

But I really think that providing iterator interface for common string
operations would cover most of real-world cases, and will be actually
beneficial for Python language in general.

>
> --
> Greg


--
Best regards,
Paul mailto:***@gmail.com
Chris Angelico
2014-06-05 01:17:04 UTC
Permalink
On Thu, Jun 5, 2014 at 10:03 AM, Greg Ewing <***@canterbury.ac.nz> wrote:
> StringPositions could support the following operations:
>
> StringPosition + int --> StringPosition
> StringPosition - int --> StringPosition
> StringPosition - StringPosition --> int
>
> These would be computed by counting characters forwards
> or backwards in the string, which would be slower than
> int arithmetic but still faster than counting from the
> beginning of the string every time.

The SP would have to keep track of which string it's associated with,
which might make for some surprising retentions of large strings.
(Imagine returning what you think is an integer, but actually turns
out to be a SP, and you're trying to work out why your program is
eating up so much more memory than it should. This int-like object is
so much more than an int.)

ChrisA
Serhiy Storchaka
2014-06-05 07:39:18 UTC
Permalink
05.06.14 03:03, Greg Ewing написав(ла):
> Serhiy Storchaka wrote:
>> html.HTMLParser, json.JSONDecoder, re.compile, tokenize.tokenize don't
>> use iterators. They use indices, str.find and/or regular expressions.
>> Common use case is quickly find substring starting from current
>> position using str.find or re.search, process found token, advance
>> position and repeat.
>
> For that kind of thing, you don't need an actual character
> index, just some way of referring to a place in a string.

Of course. But _existing_ Python interfaces all work with indices. And
it is too late to change this, this train was gone 20 years ago.

There is no need in yet one way to do string operations. One obvious way
is enough.
Stephen J. Turnbull
2014-06-04 17:57:39 UTC
Permalink
Serhiy Storchaka writes:

> It would be interesting to collect a statistic about how many indexing
> operations happened during the life of a string in typical (Micro)Python
> program.

Probably irrelevant (I doubt anybody is going to be writing
programmers' editors in MicroPython), but by far the most frequently
called functions in XEmacs are byte_to_char_index and its inverse.
Guido van Rossum
2014-06-04 18:25:51 UTC
Permalink
This thread has devolved into a flame war. I think we should trust the
Micropython implementers (whoever they are -- are they participating here?)
to know their users and let them do what feels right to them. We should
just ask them not to claim full compatibility with any particular Python
version -- that seems the most contentious point. Realistically, most
Python code that works on Python 3.4 won't work on Micropython (for various
reasons, not just the string behavior) and neither does it need to.

--
--Guido van Rossum (python.org/~guido)
Paul Sokolovsky
2014-06-04 21:14:32 UTC
Permalink
Hello,

On Wed, 4 Jun 2014 11:25:51 -0700
Guido van Rossum <***@python.org> wrote:

> This thread has devolved into a flame war. I think we should trust the
> Micropython implementers (whoever they are -- are they participating
> here?)

I'm a regular contributor. I'm not sure if the author, Damien George,
is on the list. In either case, he's a nice guy who prefer to do
development rather than participate in flame wars ;-). And for the
record, all opinions expressed are solely mine, and not official
position of MicroPython project.

> to know their users and let them do what feels right to them.
> We should just ask them not to claim full compatibility with any
> particular Python version -- that seems the most contentious point.

"Full" compatibility is never claimed, and understanding it as such is
optimistic, "between the lines" reading of some users. All of:
announcement posted on python-list (which prompted current inflow of
MicroPython-related discussions), README at
https://github.com/micropython/micropython , and detailed differences
doc https://github.com/micropython/micropython/wiki/Differences make it
clear there's no talk about "full" compatibility, and only specific
compatibility (and incompatibility) points are claimed.

That said, and unlike previous attempts to develop a small Python
implementations (which of course existed), we're striving to be exactly
a Python language implementation, not a Python-like language
implementation. As there's no formal, implementation-independent
language spec, what constitutes a compatible language implementation is
subject to opinions, and we welcome and appreciate independent review,
like this thread did.

> Realistically, most Python code that works on Python 3.4 won't work
> on Micropython (for various reasons, not just the string behavior)
> and neither does it need to.

That's true. However, as was said, we're striving to provide a
compatible implementation, and compatibility claims must be validated.
While we have simple "in-house" testsuite, more serious compatibility
validation requires running a testsuite for reference implementation
(CPython), and that's gradually being approached.

>
> --
> --Guido van Rossum (python.org/~guido)



--
Best regards,
Paul mailto:***@gmail.com
R. David Murray
2014-06-04 21:54:08 UTC
Permalink
On Thu, 05 Jun 2014 00:14:32 +0300, Paul Sokolovsky <***@gmail.com> wrote:
> That said, and unlike previous attempts to develop a small Python
> implementations (which of course existed), we're striving to be exactly
> a Python language implementation, not a Python-like language
> implementation. As there's no formal, implementation-independent
> language spec, what constitutes a compatible language implementation is
> subject to opinions, and we welcome and appreciate independent review,
> like this thread did.

The language reference is also the language specification. I don't
know what you mean by 'formal', so presumably it doesn't qualify
:) That said, if there are places that are not correctly marked as
implementation specific, those are bugs in the reference and should
be fixed. There almost certainly are still such bugs, and I suspect
MicroPython can help us fix them, just as PyPy did/does.

--David
Terry Reedy
2014-06-04 22:04:52 UTC
Permalink
On 6/4/2014 5:14 PM, Paul Sokolovsky wrote:

> That said, and unlike previous attempts to develop a small Python
> implementations (which of course existed), we're striving to be exactly
> a Python language implementation, not a Python-like language
> implementation. As there's no formal, implementation-independent
> language spec, what constitutes a compatible language implementation is
> subject to opinions, and we welcome and appreciate independent review,
> like this thread did.
>
>> Realistically, most Python code that works on Python 3.4 won't work
>> on Micropython (for various reasons, not just the string behavior)
>> and neither does it need to.
>
> That's true. However, as was said, we're striving to provide a
> compatible implementation, and compatibility claims must be validated.
> While we have simple "in-house" testsuite, more serious compatibility
> validation requires running a testsuite for reference implementation
> (CPython), and that's gradually being approached.

I would call what you are doing a 'Python 3.n subset, with limitations',
where n should be a specific number, which I would urge should be at
least 3, if not 4 ('yield from'). To me, that would mean that every
Micropython program (that does not use a clearly non-Python addon like
inline assembly) would run the same* on CPython 3.n. Conversely, a
Python 3.n program should either run the same* on MicroPython as
CPython, or raise. What most to avoid is giving different* answers.

*'same' does not include timing differences or normal float variations
or bug fixes in MicroPython not in CPython.

As for unicode: I would see ascii-only (very limited codepoints) or bare
utf-8 (limited speed == expanded time) as possibly fitting the
definition above. Just be clear what the limitations are. And accept
that there will be people who do not bother to read the limitations and
then complain when they bang into them.

PS. You do not seem to be aware of how well the current PEP393
implementation works. If you are going to write any more about it, I
suggest you run Tools/Stringbench/stringbench.py for timings.

--
Terry Jan Reedy
Paul Sokolovsky
2014-06-04 22:52:53 UTC
Permalink
Hello,

On Wed, 04 Jun 2014 18:04:52 -0400
Terry Reedy <***@udel.edu> wrote:

> On 6/4/2014 5:14 PM, Paul Sokolovsky wrote:
>
> > That said, and unlike previous attempts to develop a small Python
> > implementations (which of course existed), we're striving to be
> > exactly a Python language implementation, not a Python-like language
> > implementation. As there's no formal, implementation-independent
> > language spec, what constitutes a compatible language
> > implementation is subject to opinions, and we welcome and
> > appreciate independent review, like this thread did.
> >
> >> Realistically, most Python code that works on Python 3.4 won't work
> >> on Micropython (for various reasons, not just the string behavior)
> >> and neither does it need to.
> >
> > That's true. However, as was said, we're striving to provide a
> > compatible implementation, and compatibility claims must be
> > validated. While we have simple "in-house" testsuite, more serious
> > compatibility validation requires running a testsuite for reference
> > implementation (CPython), and that's gradually being approached.
>
> I would call what you are doing a 'Python 3.n subset, with

Thanks, that's what we call it ourselves in the docs linked in the
original message, and use n=4. Note that being a subset is not a design
requirement, but there's higher-priority requirement of staying lean,
so realistically uPy will always stay a subset.

> limitations', where n should be a specific number, which I would urge
> should be at least 3, if not 4 ('yield from'). To me, that would mean
> that every Micropython program (that does not use a clearly
> non-Python addon like inline assembly) would run the same* on CPython
> 3.n. Conversely, a Python 3.n program should either run the same* on
> MicroPython as CPython, or raise. What most to avoid is giving
> different* answers.

That's nice aim, to implement which we don't have enough resources, so
would appreciate any help from interested parties.

> *'same' does not include timing differences or normal float
> variations or bug fixes in MicroPython not in CPython.
>
> As for unicode: I would see ascii-only (very limited codepoints) or
> bare utf-8 (limited speed == expanded time) as possibly fitting the
> definition above. Just be clear what the limitations are. And accept
> that there will be people who do not bother to read the limitations
> and then complain when they bang into them.
>
> PS. You do not seem to be aware of how well the current PEP393
> implementation works. If you are going to write any more about it, I
> suggest you run Tools/Stringbench/stringbench.py for timings.

"Well" is subjective (or should be defined formally based on the
requirements). With my MicroPython hat on, an implementation which
receives a string, transcodes it, leading to bigger size, just to
immediately transcode back and send out - is awful, environment
unfriendly implementation ;-).


--
Best regards,
Paul mailto:***@gmail.com
Chris Angelico
2014-06-04 23:05:33 UTC
Permalink
On Thu, Jun 5, 2014 at 8:52 AM, Paul Sokolovsky <***@gmail.com> wrote:
> "Well" is subjective (or should be defined formally based on the
> requirements). With my MicroPython hat on, an implementation which
> receives a string, transcodes it, leading to bigger size, just to
> immediately transcode back and send out - is awful, environment
> unfriendly implementation ;-).

Be careful of confusing correctness and performance, though. The
transcoding you describe is inefficient, but (presumably) correct;
something that's fast but wrong is straight-up buggy. You can always
fix inefficiency in a later release, but buggy behaviour sometimes is
relied on (which is why ECMAScript still exposes UTF-16 to scripts,
and why Windows window messages have a WPARAM and an LPARAM, and why
Python's threading module has duplicate names for a lot of functions,
because it's just not worth changing). I'd be much more comfortable
releasing something where "everything works fine, but if you use
astral characters in your strings, memory usage blows out by a factor
of four" (or "... the len() function takes O(N) time") than one where
"everything works fine as long as you use BMP only, but SMP characters
result in tests failing".

ChrisA
Terry Reedy
2014-06-05 02:15:30 UTC
Permalink
On 6/4/2014 6:52 PM, Paul Sokolovsky wrote:

> "Well" is subjective (or should be defined formally based on the
> requirements). With my MicroPython hat on, an implementation which
> receives a string, transcodes it, leading to bigger size, just to
> immediately transcode back and send out - is awful, environment
> unfriendly implementation ;-).

I am not sure what you concretely mean by 'receive a string', but I
think you are again batting at a strawman. If you mean 'read from a
file', and all you want to do is read bytes from and write bytes to
external 'files', then there is obviously no need to transcode and
neither Python 2 or 3 make you do so.

--
Terry Jan Reedy
Paul Sokolovsky
2014-06-05 10:10:39 UTC
Permalink
Hello,

On Wed, 04 Jun 2014 22:15:30 -0400
Terry Reedy <***@udel.edu> wrote:

> On 6/4/2014 6:52 PM, Paul Sokolovsky wrote:
>
> > "Well" is subjective (or should be defined formally based on the
> > requirements). With my MicroPython hat on, an implementation which
> > receives a string, transcodes it, leading to bigger size, just to
> > immediately transcode back and send out - is awful, environment
> > unfriendly implementation ;-).
>
> I am not sure what you concretely mean by 'receive a string', but I

I (surely) mean an abstract input (as an Input/Output aka I/O)
operation.

> think you are again batting at a strawman. If you mean 'read from a
> file', and all you want to do is read bytes from and write bytes to
> external 'files', then there is obviously no need to transcode and
> neither Python 2 or 3 make you do so.

But most files, network protocols are text-based, and I (and many other
people) don't want to artificially use "binary data" type for them,
with all attached funny things, like "b" prefix. And then Python2
indeed doesn't transcode anything, and Python3 does, without being
asked, and for no good purpose, because in most cases, Input data will
be Output as-is (maybe in byte-boundary-split chunks).

So, it all goes in rounds - ignoring the forced-Unicode problem (after a
week of subscription to python-list, half of traffic there appear to be
dedicated to Unicode-related flames) on python-dev behalf is not
going to help (Python community).

[]



--
Best regards,
Paul mailto:***@gmail.com
Serhiy Storchaka
2014-06-04 22:43:59 UTC
Permalink
05.06.14 01:04, Terry Reedy написав(ла):
> PS. You do not seem to be aware of how well the current PEP393
> implementation works. If you are going to write any more about it, I
> suggest you run Tools/Stringbench/stringbench.py for timings.

AFAIK stringbench is ASCII-only, so it likely is compatible with current
and any future MicroPython implementations, but unlikely will expose
non-ASCII limitations or performance.
Eric Snow
2014-06-04 22:12:23 UTC
Permalink
On Wed, Jun 4, 2014 at 3:14 PM, Paul Sokolovsky <***@gmail.com> wrote:
> That said, and unlike previous attempts to develop a small Python
> implementations (which of course existed), we're striving to be exactly
> a Python language implementation, not a Python-like language
> implementation. As there's no formal, implementation-independent
> language spec, what constitutes a compatible language implementation is
> subject to opinions, and we welcome and appreciate independent review,
> like this thread did.

Actually, there is a "formal, implementation-independent language spec":

https://docs.python.org/3/reference/

>
>> Realistically, most Python code that works on Python 3.4 won't work
>> on Micropython (for various reasons, not just the string behavior)
>> and neither does it need to.
>
> That's true. However, as was said, we're striving to provide a
> compatible implementation, and compatibility claims must be validated.
> While we have simple "in-house" testsuite, more serious compatibility
> validation requires running a testsuite for reference implementation
> (CPython), and that's gradually being approached.

To a large extent the test suite in
http://hg.python.org/cpython/file/default/Lib/test effectively
validates (full) compliance with the corresponding release (change
"default" to the release branch of your choice). With that goal, no
small effort has been made to mark implementation-specific tests as
such. So uPy could consider using the test suite (and explicitly skip
the tests for features that uPy doesn't support).

-eric
Paul Sokolovsky
2014-06-04 23:11:10 UTC
Permalink
Hello,

On Wed, 4 Jun 2014 16:12:23 -0600
Eric Snow <***@gmail.com> wrote:

> On Wed, Jun 4, 2014 at 3:14 PM, Paul Sokolovsky <***@gmail.com>
> wrote:
> > That said, and unlike previous attempts to develop a small Python
> > implementations (which of course existed), we're striving to be
> > exactly a Python language implementation, not a Python-like language
> > implementation. As there's no formal, implementation-independent
> > language spec, what constitutes a compatible language
> > implementation is subject to opinions, and we welcome and
> > appreciate independent review, like this thread did.
>
> Actually, there is a "formal, implementation-independent language
> spec":
>
> https://docs.python.org/3/reference/

Opening that link in browser, pressing Ctrl+F and pasting your quote
gives zero hits, so it's not exactly what you claim it to be. It's also
pretty far from being formal (unambiguous, covering all choices, etc.)
and comprehensive. Also, please point me at "conformance" section.

That said, all of us Pythoneers treat it as the best formal reference
available, no news here.

> >> Realistically, most Python code that works on Python 3.4 won't work
> >> on Micropython (for various reasons, not just the string behavior)
> >> and neither does it need to.
> >
> > That's true. However, as was said, we're striving to provide a
> > compatible implementation, and compatibility claims must be
> > validated. While we have simple "in-house" testsuite, more serious
> > compatibility validation requires running a testsuite for reference
> > implementation (CPython), and that's gradually being approached.
>
> To a large extent the test suite in
> http://hg.python.org/cpython/file/default/Lib/test effectively
> validates (full) compliance with the corresponding release (change
> "default" to the release branch of your choice). With that goal, no
> small effort has been made to mark implementation-specific tests as
> such. So uPy could consider using the test suite (and explicitly skip
> the tests for features that uPy doesn't support).

That's exactly what we do, per the previous paragraph. And we face a
lot of questionable tests, just like you say. Shameless plug: if anyone
interested to run existing code on MicroPython, please help us with
CPython testsuite! ;-)

>
> -eric



--
Best regards,
Paul mailto:***@gmail.com
Eric Snow
2014-06-05 00:01:23 UTC
Permalink
On Wed, Jun 4, 2014 at 5:11 PM, Paul Sokolovsky <***@gmail.com> wrote:
> On Wed, 4 Jun 2014 16:12:23 -0600
> Eric Snow <***@gmail.com> wrote:
>> Actually, there is a "formal, implementation-independent language
>> spec":
>>
>> https://docs.python.org/3/reference/
>
> Opening that link in browser, pressing Ctrl+F and pasting your quote
> gives zero hits, so it's not exactly what you claim it to be. It's also
> pretty far from being formal (unambiguous, covering all choices, etc.)
> and comprehensive. Also, please point me at "conformance" section.
>
> That said, all of us Pythoneers treat it as the best formal reference
> available, no news here.

It's not just the best formal reference. It's the official
specification. I agree it is not so "formal" as other language
specifications and it does not enumerate every facet of the language.
However, underspecified parts are worth improving (as we've done with
the import system portion in the last few years). Incidentally, the
efforts of other Python implementors have often resulted in such
improvements to the language reference. Those improvements typically
come as a result of questions to this very list. :) That's
essentially what this email thread is!

-eric
Steven D'Aprano
2014-06-05 13:23:12 UTC
Permalink
On Wed, Jun 04, 2014 at 11:17:18AM +1000, Steven D'Aprano wrote:
> There is a discussion over at MicroPython about the internal
> representation of Unicode strings. Micropython is aimed at embedded
> devices, and so minimizing memory use is important, possibly even
> more important than performance.
[...]

Wow! I'm amazed at the response here, since I expected it would have
received a fairly brief "Yes" or "No" response, not this long thread.
Here is a summary (as best as I am able) of a few points which I think
are important:

(1) I asked if it would be okay for MicroPython to *optionally* use
nominally Unicode strings limited to ASCII. Pretty much the only
response to this as been Guido saying "That would be a pretty lousy
option", and since nobody has really defended the suggestion, I think we
can assume that it's off the table.

(2) I asked if it would be okay for µPy to use an UTF-8 implementation
even though it would lead to O(N) indexing operations instead of O(1).
There's been some opposition to this, including Guido's:

Then again the UTF-8 option would be pretty devastating
too for anything manipulating strings (especially since
many Python APIs are defined using indexes, e.g. the re
module).

but unless Guido wants to say different, I think the consensus is that
a UTF-8 implementation is allowed, even at the cost of O(N) indexing
operations. Saving memory -- assuming that it does save memory, which I
think is an assumption and not proven -- over time is allowed.

(3) It seems to me that there's been a lot of theorizing about what
implementation will be obviously more efficient. Folks, how about some
benchmarks before making claims about code efficiency? :-)

(4) Similarly, there have been many suggestions more suited in my
opinion to python-ideas, or even python-list, for ways to implement O(1)
indexing on top of UTF-8. Some of them involve per-string mutable state
(e.g. the last index seen), or complicated int sub-classes that need to
know what string they come from. Remember your Zen please:

Simple is better than complex.
Complex is better than complicated.
...
If the implementation is hard to explain, it's a bad idea.

(5) I'm not convinced that UTF-8 internally is *necessarily* more
efficient, but look forward to seeing the result of benchmarks. The
rationale of internal UTF-8 is that the use of any other encoding
internally will be inefficient since those strings will need to be
transcoded to UTF-8 before they can be written or printed, so keeping
them as UTF-8 in the first place saves the transcoding step. Well, yes,
but many strings may never be written out:

print(prefix + s[1:].strip().lower().center(80) + suffix)

creates five strings that are never written out and one that is. So if
the internal encoding of strings is more efficient than UTF-8, and most
of them never need transcoding to UTF-8, a non-UTF-8 internal format
might be a nett win. So I'm looking forward to seeing the results of
µPy's experiments with it.

Thanks to all who have commented.



--
Steven
Loading...