Discussion:
[ruby-core:20483] encoding of symbols
Dave Thomas
2008-12-11 17:15:56 UTC
If I have a source file like this:

#encoding: utf-8
a = "cat"
b = "∂og"

Then a and b will both have the encoding UTF-8

If I have

#encoding: utf-8
a = /cat/
b = /∂og/

then b will be UTF-8, but a will have the encoding US-ASCII. It was
explained to me that this was for performance reasons.

But then look at

#encoding: utf-8
a = :cat
b = :∂og

I was surprised to see that a has US-ASCII encoding. Now that strings
and symbols are converging, shouldn't both a and b be encoded UTF-8,
so that symbols and strings behave the same way?
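
To make the three cases concrete, each literal's encoding can be checked
directly (a minimal script; output as observed on 1.9 at the time):

# encoding: utf-8
p "cat".encoding  # => #<Encoding:UTF-8>    strings take the source encoding
p "∂og".encoding  # => #<Encoding:UTF-8>
p /cat/.encoding  # => #<Encoding:US-ASCII> ASCII-only regexp stays US-ASCII
p :cat.encoding   # => #<Encoding:US-ASCII> ASCII-only symbol, likewise
p :∂og.encoding   # => #<Encoding:UTF-8>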


Dave
Charles Oliver Nutter
2008-12-11 17:38:53 UTC
Post by Dave Thomas
I was surprised to see that a has US-ASCII encoding. Now that strings
and symbols are converging, shouldn't both a and b be encoded UTF-8, so
that symbols and strings behave the same way?
Strings and symbols are converging?

- Charlie
Dave Thomas
2008-12-11 18:34:38 UTC
Post by Charles Oliver Nutter
Post by Dave Thomas
I was surprised to see that a has US-ASCII encoding. Now that
strings and symbols are converging, shouldn't both a and b be
encoded UTF-8, so that symbols and strings behave the same way?
Strings and symbols are converging?
:cat[2] => "t"

So the interpreter implies that symbols have a string representation
(as opposed, for example, to an object id representation, in which
case an aref would return a bit, as it does for fixnums).

And if indexing into a symbol returns a string, then I'd expect the
same encoding rules to apply to the symbol as to the string.

Right now, we have the strange situation that

"cat".to_sym.to_s.encoding != "cat".encoding


Dave
Michael Selig
2008-12-11 23:51:06 UTC
Post by Dave Thomas
Right now, we have the strange situation that
"cat".to_sym.to_s.encoding != "cat".encoding
Yes, this seems to be an inconsistency, though in practice I don't think
it causes any problems.
I seem to recall that a few months ago the parser "optimized" strings to
US-ASCII when the src encoding was UTF-8 (or any other ascii-compatible
encoding), but this behaviour changed at some point. Perhaps this
inconsistency is a remnant of that?

I would also like to point out a couple of other inconsistencies with
symbols:

1) "p" seems to do the wrong thing with symbols encodings, yet inspect is
OK:

As a string:
p "\u0639abc" => "عabc"
p "\u0639abc".force_encoding("BINARY") => "\xD8\xB9abc"

As a symbol:
p "\u0639abc".to_sym => :عabc
p "\u0639abc".force_encoding("BINARY").to_sym => :عabc

Using inspect:
"\u0639abc".to_sym.inspect => ":عabc"
"\u0639abc".force_encoding("BINARY").to_sym.inspect => ":\xD8\xB9abc"

The annoying thing about this is that when you use "p", two symbols with
different encodings can look the same even though they are actually
different ids.


2) Symbol#== rdoc says "If sym and obj are exactly the same symbol,
returns true. Otherwise, compares them as strings."
I don't think this is right:
p :cat == "cat" => false

It works like this in 1.8 also. I think this is just a documentation
error, and the "Otherwise, compares them as strings" should be dropped.

Cheers
Mike
Yukihiro Matsumoto
2008-12-11 23:43:42 UTC
Hi,

In message "Re: [ruby-core:20484] Re: encoding of symbols"
on Fri, 12 Dec 2008 02:38:53 +0900, Charles Oliver Nutter <***@sun.com> writes:

|Strings and symbols are converging?

No, during the development of 1.9 I experimented with making Symbol a
subclass of String, but I considered it a mistake and reverted it.

matz.
Yukihiro Matsumoto
2008-12-11 23:45:03 UTC
Hi,

In message "Re: [ruby-core:20483] encoding of symbols"
on Fri, 12 Dec 2008 02:15:56 +0900, Dave Thomas <***@pragprog.com> writes:

|#encoding: utf-8
|a = :cat
|b = :∂og
|
|I was surprised to see that a has US-ASCII encoding. Now that strings
|and symbols are converging, shouldn't both a and b be encoded UTF-8,
|so that symbols and strings behave the same way?

You are right about the encoding of symbols. I will fix it, unless there's
some reason I'm forgetting right now.

matz.
Yukihiro Matsumoto
2008-12-12 00:32:16 UTC
Hi,

In message "Re: [ruby-core:20494] Re: encoding of symbols"
on Fri, 12 Dec 2008 08:45:03 +0900, Yukihiro Matsumoto <***@ruby-lang.org> writes:

||I was surprised to see that a has US-ASCII encoding. Now that strings
||and symbols are converging, shouldn't both a and b be encoded UTF-8,
||so that symbols and strings behave the same way?
|
|You are right about the encoding of symbols. I will fix it, unless there's
|some reason I'm forgetting right now.

It was very easy to implement, but when I tried, I found a different
inconsistency:

# encoding: utf-8
p :a.encoding # => #<Encoding:UTF-8>
p :p.encoding # => #<Encoding:US-ASCII>

This means a symbol would take the encoding of the file in which it
first appears; in this case the symbol :a first appears in a file with
UTF-8 encoding, whereas :p first appears as the name of a built-in
method.

So rather than making symbols have somewhat unpredictable encodings, I'd
rather keep them as they are now, despite the inconsistency with string
encoding.

matz.
Charles Oliver Nutter
2008-12-13 11:33:13 UTC
Post by Yukihiro Matsumoto
It was very easy to implement, but when I tried, I found a different
inconsistency:
# encoding: utf-8
p :a.encoding # => #<Encoding:UTF-8>
p :p.encoding # => #<Encoding:US-ASCII>
This means a symbol would take the encoding of the file in which it
first appears; in this case the symbol :a first appears in a file with
UTF-8 encoding, whereas :p first appears as the name of a built-in
method.
So rather than making symbols have somewhat unpredictable encodings,
I'd rather keep them as they are now, despite the inconsistency with
string encoding.
Very good point; symbols are not necessarily created in the file where
you use their literal form, and therefore need to have a single encoding
everywhere. I concur.

- Charlie
Brian Candler
2008-12-13 14:01:44 UTC
Post by Charles Oliver Nutter
Very good point; symbols are not necessarily created in the file where
you use their literal form, and therefore need to have a single encoding
everywhere. I concur.
Unless :p<UTF-8> and :p<US-ASCII> could somehow be the "same" symbol (that
is, send() would find the same method)

Aside: is there a page somewhere which documents in detail the semantics of
ruby 1.9's Strings and encodings? For example, what are the semantics of
comparing strings with different encodings? Are they compared byte-by-byte,
or character-by-character as unicode codepoints, or some other way? It
doesn't seem to make a difference here:

irb(main):001:0> a = "abc"
=> "abc"
irb(main):002:0> b = a.dup
=> "abc"
irb(main):003:0> a.encoding
=> #<Encoding:US-ASCII>
irb(main):004:0> b.force_encoding("UTF-8")
=> "abc"
irb(main):005:0> a == b
=> true
irb(main):006:0> b.force_encoding("BINARY")
=> "abc"
irb(main):007:0> a == b
=> true

But it does here:

irb(main):018:0> a = "aß"
=> "aß"
irb(main):019:0> b = a.dup
=> "aß"
irb(main):020:0> a.encoding
=> #<Encoding:UTF-8>
irb(main):021:0> b.force_encoding("BINARY")
=> "a\xC3\x9F"
irb(main):022:0> a == b
=> false

What if I give the "same" character but from a different encoding?

irb(main):001:0> a = "aß"
=> "aß"
irb(main):002:0> b = "a\xdf"
=> "a\xDF"
irb(main):003:0> b.force_encoding("ISO-8859-1")
=> "a�"
irb(main):004:0> a == b
=> false

(I think that's right - both are codepoint 223)

Furthermore, what if I use a String as a key to a hash? It seems the
encoding *is* taken into consideration:

irb(main):025:0> a = "aß"
=> "aß"
irb(main):026:0> h = {a => 99}
=> {"aß"=>99}
irb(main):027:0> b = a.dup
=> "aß"
irb(main):028:0> h[b]
=> 99
irb(main):029:0> b.force_encoding("BINARY")
=> "a\xC3\x9F"
irb(main):030:0> h[b]
=> nil

But they go onto the same hash chain:

irb(main):031:0> a.hash
=> 565426832
irb(main):032:0> b.hash
=> 565426832

What does 'inspect' do when the string has a particular encoding? And what
does irb do when outputting a string whose encoding is different to that of
the terminal?

Not understanding these rules makes me very uncomfortable.

ri documentation seems to be pretty silent on these points:

-------------------------------------------------------------- String#==
str == obj => true or false

From Ruby 1.9.1
------------------------------------------------------------------------
Equality---If _obj_ is not a +String+, returns +false+. Otherwise,
returns +true+ if _str_ +<=>+ _obj_ returns zero.


------------------------------------------------------------- String#<=>
str <=> other_str => -1, 0, +1

From Ruby 1.9.1
------------------------------------------------------------------------
Comparison---Returns -1 if _other_str_ is less than, 0 if
_other_str_ is equal to, and +1 if _other_str_ is greater than
_str_. If the strings are of different lengths, and the strings are
equal when compared up to the shortest length, then the longer
string is considered greater than the shorter one. In older
versions of Ruby, setting +$=+ allowed case-insensitive
comparisons; this is now deprecated in favor of using
+String#casecmp+.

+<=>+ is the basis for the methods +<+, +<=+, +>+, +>=+, and
+between?+, included from module +Comparable+. The method
+String#==+ does not use +Comparable#==+.

"abcdef" <=> "abcde" #=> 1
"abcdef" <=> "abcdef" #=> 0
"abcdef" <=> "abcdefg" #=> -1
"abcdef" <=> "ABCDEF" #=> 1


As I say, if there's some more detailed documentation, please could you
point me in the right direction.

Thanks,

Brian.
James Gray
2008-12-13 15:48:56 UTC
Post by Brian Candler
Aside: is there a page somewhere which documents in detail the
semantics of ruby 1.9's Strings and encodings?
I've been working on putting something like this together on my blog:

http://blog.grayproductions.net/articles/understanding_m17n

I started with explaining encodings in general and then moved into
what Ruby 1.8 can do.

My next stop is a complete tour of the 1.9 encoding landscape, but I
haven't got there yet. In truth, I've been stalling a bit just to
make sure I cover what ends up being the truth when it ships…

James Edward Gray II
Michael Selig
2008-12-14 00:57:55 UTC
Post by Brian Candler
For example, what are the semantics of
comparing strings with different encodings? Are they compared
byte-by-byte,
or character-by-character as unicode codepoints, or some other way?
Yes, I agree this needs to be documented a lot better than it is at the
moment.
I also think that some of the behaviour is a little "unexpected" :) though
this is only in unusual cases.

From my testing:
- String operations are done using the bytes in the strings - they are not
converted to codepoints internally
- String equality comparisons seem to be simply done on a byte-by-byte
basis, without regard to the encoding
- *However* other operations are not simply byte-by-byte. They are done
character-by-character, but without converting to codepoints - eg: a 3
byte character is kept as 3 bytes. For example this means that when
operating on a variable-length encoding, simple operations like indexing
can be inefficient, as Ruby may have to scan through the string from the
start. However Ruby does try to optimize this where possible.
- There is also a concept of "compatible encodings". Given 2 encodings e1
& e2, e1 is compatible with e2 if the representation of every character in
e1 is the same as in e2. This implies that e2 must be a "bigger" encoding
than e1 - ie: e2 is a superset of e1. Typically we are mainly talking
about US-ASCII here, which is compatible with most other character sets
that are either all single-byte (eg: all the ISO-8859 sets) or are
variable-length multi-byte (eg: UTF-8).
- When operating on encodings e1 & e2, if e1 is compatible with e2, then
Ruby treats both strings as being in encoding e2.
- String#> and String#< are a bit weird. Normally they are just done on a
byte-by-byte basis, UNLESS the strings are the same and have incompatible
encodings, in which case they always seem to return FALSE. (I have to
check this - it may be more complicated than this).
- When operating on incompatible encodings, *normally* non-comparison
operations (including regexp matches) raise an "Encoding Compatibility
Error".
- However there appears to be an exception to this: if operating on 2
incompatible encodings AND US-ASCII is compatible with both, AND both
strings are US-ASCII strings, then the operation appears to proceed,
treating both as US-ASCII. For example "abc" as an ISO-8859-1 and "abc" as
UTF-8. I guess this is Ruby being "forgiving". (Personally I am not sure
if this is good or bad). The encoding of the result (for example of a
string concatenation) seems to be one of the 2 original encodings - I
haven't figured out the logic to this yet :)
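
A quick way to probe these rules is Encoding.compatible?, which returns
the encoding a concatenation would produce, or nil if the strings are
incompatible (a minimal sketch; output as observed on 1.9):

# encoding: utf-8
a = "abc"                                 # ASCII-only
u = "aß"                                  # UTF-8, non-ASCII
l = "a\xDF".force_encoding("ISO-8859-1")  # same characters, different bytes

p Encoding.compatible?(a, u)  # => #<Encoding:UTF-8>  ASCII-only is compatible
p Encoding.compatible?(u, l)  # => nil                incompatible
begin
  u + l
rescue Encoding::CompatibilityError => e
  p e  # non-comparison operations raise on incompatible encodings
end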

James - feel free to use any of the above to add to your excellent M17N
summary.

Cheers
Mike
James Gray
2008-12-14 03:58:37 UTC
Post by Michael Selig
James - feel free to use any of the above to add to your excellent
M17N summary.
Wow. I definitely will.

Your attention to detail remains impressive.

James Edward Gray II
Daniel Luz
2008-12-14 06:26:10 UTC
Post by Michael Selig
- String equality comparisons seem to be simply done on a byte-by-byte
basis, without regard to the encoding
Am I misinterpreting something here?

u = "café".encode("utf-8")
b = u.dup.force_encoding("binary")
i = u.dup.force_encoding("iso-8859-1")
u == b # => false
b == i # => false
u == i # => false
u.eql?(b) # => false
Post by Michael Selig
- There is also a concept of "compatible encodings". Given 2 encodings e1 &
e2, e1 is compatible with e2 if the representation of every character in e1
is the same as in e2. This implies that e2 must be a "bigger" encoding than
e1 - ie: e2 is a superset of e1. Typically we are mainly talking about
US-ASCII here, which is compatible with most other character sets that are
either all single-byte (eg: all the ISO-8859 sets) or are variable-length
multi-byte (eg: UTF-8).
- When operating on encodings e1 & e2, if e1 is compatible with e2, then
Ruby treats both strings as being in encoding e2.
I only knew of ASCII-compatibility. Are there other cases? ISO-8859-1
and Windows-1252 (a superset) at least are not compatible:

i = "café".encode("iso-8859-1")
w = "café".encode("windows-1252")
i == w # => false
i + w # Encoding::CompatibilityError
w + i # Encoding::CompatibilityError


On Sat, Dec 13, 2008 at 12:01, Brian Candler <***@pobox.com> wrote:
(...)
Post by Brian Candler
irb(main):031:0> a.hash
=> 565426832
irb(main):032:0> b.hash
=> 565426832
This one's interesting. I guess avoiding collisions would be a Good
Thing, but we still must maintain ASCII compatibility, and we don't
always know the ascii_only state of a String. Computing it when
computing the hash of a String does not sound like a bad idea to me,
but if there are more complex encoding compatibility combinations,
then this whole idea starts to get pretty hard.

Anyway, keeping the hash as it is now should have, I hope, very few
collisions in the Real World™. Most applications will remain in
single-encoding land, and even multilingual ones should hardly need to
store the very same byte sequence in multiple encodings as keys in a
single Hash.
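
For what it's worth, the hash/eql? split is easy to see (a minimal
sketch):

# encoding: utf-8
a = "aß"
b = a.dup.force_encoding("BINARY")
p a.hash == b.hash  # => true   hash is computed from the bytes alone
p a.eql?(b)         # => false  eql? also considers the encoding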

--
Daniel
Michael Selig
2008-12-14 07:31:55 UTC
Post by Daniel Luz
Post by Michael Selig
- String equality comparisons seem to be simply done on a byte-by-byte
basis, without regard to the encoding
Am I misinterpreting something here?
u = "café".encode("utf-8")
b = u.dup.force_encoding("binary")
i = u.dup.force_encoding("iso-8859-1")
u == b # => false
b == i # => false
u == i # => false
u.eql?(b) # => false
Sorry, you are quite right.
Equality is false if the encodings are not compatible. If they are
compatible, it is done on a byte-by-byte basis.
Post by Daniel Luz
I only knew of ASCII-compatibility. Are there other cases? ISO-8859-1
i = "café".encode("iso-8859-1")
w = "café".encode("windows-1252")
i == w # => false
i + w # Encoding::CompatibilityError
w + i # Encoding::CompatibilityError
I think this might be a bug.

Cheers
Mike.
Martin Duerst
2008-12-14 09:12:43 UTC
Post by Michael Selig
Post by Daniel Luz
I only knew of ASCII-compatibility. Are there other cases? ISO-8859-1
i = "caf$B%F%%(B".encode("iso-8859-1")
w = "caf$B%F%%(B".encode("windows-1252")
i == w # => false
i + w # Encoding::CompatibilityError
w + i # Encoding::CompatibilityError
I think this might be a bug.
In Ruby, as in most other practical uses that I know of,
iso-8859-1 (as well as all the other iso-8859-x) includes
control codes in the range 0x80-0x9F. ISO-8859-1 and
Windows-1252 would be compatible if you ignored these
control codes, but we can't just ignore them.
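
The difference shows up directly with the byte 0x80, which is a control
code in ISO-8859-1 but the euro sign in Windows-1252 (a small sketch):

p "\x80".force_encoding("ISO-8859-1").encode("UTF-8")    # => "\u0080", a control code
p "\x80".force_encoding("Windows-1252").encode("UTF-8")  # => "€"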

Regards, Martin.


#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Michael Selig
2008-12-14 07:11:29 UTC
On Sun, 14 Dec 2008 11:57:55 +1100, Michael Selig
- String#> and String#< are a bit weird. Normally they are just done on
a byte-by-byte basis, UNLESS the strings are the same and have
incompatible encodings, in which case they always seem to return FALSE.
(I have to check this - it may be more complicated than this).
Actually I just checked this, and this is wrong, sorry. I ended up looking
at the source code of rb_str_cmp() in string.c, and here is what I think
it does:
- it does a byte-by-byte comparison. Assuming the strings are different,
Ruby returns what you would expect based on this.
- if the strings are byte for byte identical, but they have incompatible
encodings and at least one of the strings contains a non-ASCII character,
then it seems that the result is determined by the ordering of the
encodings, based on ruby's "encoding index" - an internal ordering of the
available encodings. Maybe I have got this wrong - it doesn't make a lot
of sense to me!
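
In other words (a minimal sketch of the behaviour just described):

# encoding: utf-8
u = "aß"                            # UTF-8
b = u.dup.force_encoding("BINARY")  # identical bytes, ASCII-8BIT
p u == b   # => false    same bytes, but incompatible encodings
p u <=> b  # => non-zero; the sign comes from the internal encoding index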

Cheers
Mike
Yukihiro Matsumoto
2008-12-15 09:12:59 UTC
Hi,

In message "Re: [ruby-core:20545] Re: 1.9 character encoding (was: encoding of symbols)"
on Sun, 14 Dec 2008 16:11:29 +0900, "Michael Selig" <***@fs.com.au> writes:

|- it does a byte-by-byte comparison. Assuming the strings are different,
|Ruby returns what you would expect based on this.
|- if the strings are byte for byte identical, but they have incompatible
|encodings and at least one of the strings contains a non-ASCII character,
|then it seems that the result is determined by the ordering of the
|encodings, based on ruby's "encoding index" - an internal ordering of the
|available encodings. Maybe I have got this wrong - it doesn't make a lot
|of sense to me!

You're right. When we have two strings with an identical byte sequence
but different encodings, we have to tell that they are different. The
comparison result does not matter much, so I used the encoding index.
Is there any alternative choice that makes sense?

matz.
Michael Selig
2008-12-15 10:59:54 UTC
On Mon, 15 Dec 2008 20:12:59 +1100, Yukihiro Matsumoto
Post by Yukihiro Matsumoto
You're right. When we have two strings with an identical byte sequence
but different encodings, we have to tell that they are different. The
comparison result does not matter much, so I used the encoding index.
Is there any alternative choice that makes sense?
It probably doesn't make sense to try to order 2 strings of incompatible
encoding, so what you have done is probably as good as anything else.
The only real alternative is to raise an Encoding Compatibility error,
but that is not a good idea either, I think, because I believe you would
want
s1 == s2
to return false rather than raise an error on incompatible encodings. So
if you consider String#<=> as the "base" for all the string comparison
methods (whether implemented that way or not), then to be consistent with
"==" it would have to return a value for all possible encodings of s1 &
s2, compatible or not, which implies that String#>, < etc. must all
return a value also.

By the way, I think I phrased my description of String method
implementations badly. I meant to say that Strings are stored as the bytes
of their representation in their encoding, not as an array of codepoints.
There *are* some methods which must convert characters to codepoints for
their implementation, but this happens "on the fly". Many common String
methods (eg: concatenate) operate directly on the bytes without the need
to convert to codepoints. I should also have pointed out that Ruby goes to
a lot of trouble to optimize methods operating on single-byte character
strings in order to keep their performance good.
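
For example (a small sketch): the string is stored as UTF-8 bytes, and
codepoints or character counts are derived from them on demand.

# encoding: utf-8
s = "aß"
p s.bytes.to_a    # => [97, 195, 159]  the stored representation
p s.unpack("U*")  # => [97, 223]       codepoints computed on the fly
p s.length        # => 2               characters, found by scanning the bytes
p s.bytesize      # => 3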

Mike
d***@aanet.com.au
2008-12-17 23:43:45 UTC
Post by Yukihiro Matsumoto
You're right. When we have two strings with an identical byte sequence
but different encodings, we have to tell that they are different. The
comparison result does not matter much, so I used the encoding index.
Is there any alternative choice that makes sense?
i think there is an alternative :) in fact, i think it should be the only
option: doing it by character, not byte. if we have a string ("cat" for
example) in two different, incompatible encodings (say latin-1 and ucs4),
they *should* match because they are the *same* thing: "cat" and "cat"

why should we care how the characters are represented in the computer?
there is already too much trouble caused by all the different encodings in
the world. the encodings should disappear from view and we should only see
the characters, unless we explicitly ask for the byte values

and i think this would be simple to achieve too. we first add a function
to do actual conversions between two encodings based on character, not
just reinterpreting the byte values. so c in latin-1 (0x63) would become c
in utf-32 (0x00000063). it could have lists of which encodings are
supersets of other encodings (based on byte values, so latin-1 is a
superset of us-ascii because every us-ascii string is already a valid
latin-1 string, but utf-8 is not a superset of shift jis even though utf-8
can represent every shift jis character (i think :)). it would then know
it doesn't have to do any actual conversion when switching from a subset
to a superset. the string would already be valid in the superset. if a
character is found when converting that can't be represented in the new
encoding then an existing Encoding::CompatibilityError or a new
ConversionError would be raised

then comparison could be changed so that it converts one of the strings
into the other's encoding if it needs to before comparing the strings.
something like:

def ==()
  # begin as == currently does
  ...

  # do the new conversion stuff
  if left.encoding != right.encoding and left.encoding != "binary" and
     right.encoding != "binary" and
     not superset(left.encoding, right.encoding)
    try
      # there could be a ranking of the encodings, and this could
      # determine which encoding to convert to rather than just
      # picking the left one every time
      new_right = right.convert(left.encoding)
    rescue Encoding::CompatibilityError
      return false
    end
  end

  # do the comparison as we do currently, byte-by-byte
end

you'll notice that it neatly catches the exception and returns false,
because obviously the strings won't match in that case. most importantly,
there are no extra problems for the user to worry about when comparing
strings. they will not have to catch exceptions, or anything like that

also, notice that it checks if either string has a binary encoding. this
basically means that there is no encoding, so effectively the two strings
are already in the same encoding and a byte-by-byte comparison is
automatically done. this gives us the current behaviour:

if str_one.encode("binary") == str_two # would match based on byte, rather
than character, as ruby does currently

with that simple change we've achieved default comparison based on
character, and comparison based on byte if asked for, so we get the best
of both worlds :)

concatenation could be extended too. if one string is a superset of the
other then no actual conversion needs to be done and the resulting string
would already be in the superset. if not, either always convert the right
string to the left string and raise an exception when we encounter a char
that can't be converted (which is no worse than what already happens when
concatenating differently encoded strings), or we could pick the encoding
that is a character-based superset (say shift jis to utf-8), or failing
that ruby could just pick a default superior encoding that handles every
character found in every other encoding. utf-8 would fit this role
(correct?) so we could fall back to it and then the concatenation would
*never* fail

it doesn't matter that ruby may pick a seemingly random encoding for the
resulting string because, anywhere else this new string is used, ruby
would just handle it intelligently

you could even use "binary" here again:

str_one + str_two.encode("binary")

this would act as if you wrote "str_one +
str_two.encode(str_one.encoding)" and effectively do the concatenation
based on bytes rather than characters

with all this in place, you could write say a program that works with a
lot of japanese documents, some older ones that are in shift jis, some
newer ones in utf-8, some in other japanese-friendly encodings, and then
you could just work with them without worrying at all: mix them; match
them; write them to disk; grab a string from one file and search through
every other file, properly matching any instances of the string, no matter
the encoding. you would hardly ever have to think about the encodings, if
at all. it would all just work! god, i'm getting excited just
describing this :) should this not be what we strive for? a system where
we just think of the characters, not about how our computer decides to
write them. i think this is much more what ruby is about

i might have a go at coding this myself to see how it works in practice if
anyone else is interested. does anyone know of a document that describes
yarv's internals, as i'm not familiar with it yet? or more importantly any
criticisms or anything that my brain missed that may destroy this idea?
Michael Selig
2008-12-18 00:44:28 UTC
Hi,

***@aanet.com.au wrote:

I don't mean to shoot you down in flames, but a lot of thought and effort
has gone into Ruby's encoding support. Ruby could have followed the Python
route of converting everything to Unicode, but that was rejected for various
good reasons. Automatic transcoding to solve issues of incompatible
encodings was also rejected because it causes a number of problems; in
particular, I believe that transcoding isn't necessarily accurate, because
for example there may be multiple or ambiguous representations of the same
character.

What *was* introduced is the concept of a "default_internal" encoding,
which, if used by the programmer, causes I/O and other interfaces to
transcode to the internal encoding on input and do the opposite on output.
Typically the default_internal encoding, if used, is UTF-8, and in this
case the programmer has to accept that, when doing I/O to a file in a
different encoding, the transcoding *may* cause data loss.
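
For example, something like this enables that model ("legacy.txt" is
just a hypothetical ISO-8859-1 file):

# encoding: utf-8
Encoding.default_internal = Encoding::UTF_8

File.open("legacy.txt", "r:iso-8859-1") do |f|
  line = f.gets
  p line.encoding  # => #<Encoding:UTF-8>, transcoded on the way in
end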
Post by d***@aanet.com.au
we first add a function
to do actual conversions between two encodings based on character, not
just reinterpreting the byte values. so c in latin-1 (0x63) would become c
in utf-32 (0x00000063).
String#encode does this I believe
Post by d***@aanet.com.au
it could have lists of which encodings are
supersets of other encodings
Unfortunately it turns out that the only encoding we can reliably state
is a subset of the other (ASCII-compatible) encodings is US-ASCII, and
Ruby knows about this and optimizes for it.
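
That special case is visible directly (a minimal sketch):

# encoding: utf-8
i = "abc".encode("ISO-8859-1")
u = "abc"
p i == u            # => true, both strings are ASCII-only
p (i + u).encoding  # => one of the two encodings; no error is raised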

Cheers
Mike
d***@aanet.com.au
2008-12-18 02:09:35 UTC
Post by Michael Selig
I don't mean to shoot you down in flames, but a lot of thought and effort
has gone into Ruby's encoding support. Ruby could have followed the Python
route of converting everything to Unicode, but that was rejected for various
good reasons. Automatic transcoding to solve issues of incompatible
encodings was also rejected because it causes a number of problems; in
particular, I believe that transcoding isn't necessarily accurate, because
for example there may be multiple or ambiguous representations of the same
character.
What *was* introduced is the concept of a "default_internal" encoding,
which, if used by the programmer, causes I/O and other interfaces to
transcode to the internal encoding on input and do the opposite on output.
Typically the default_internal encoding, if used, is UTF-8, and in this
case the programmer has to accept that, when doing I/O to a file in a
different encoding, the transcoding *may* cause data loss.
haha. that's fine :) i expected and asked for criticism. they're just
ideas you're criticising. no harm in that

you seem to be misunderstanding the main idea and focusing on the "perhaps
we could even go so far as to convert to a default superior encoding if
needed during concatenation" part. that was secondary and isn't
necessary to the success of the idea

also you say in the first paragraph that ruby rejected the idea of
following python by converting everything to unicode, yet acknowledge in
the second paragraph that ruby does, in fact, do this very thing using the
concept of the default internal encoding; it just doesn't wave it in the
programmer's face and is voluntary. is this not partly contradictory?

the data loss when the strings leave ruby would happen anyway, whether the
programmer chose to work in a better encoding within ruby, or whether it
happened automatically under my proposal but files had to be written in a
lesser encoding, or whether they chose to stay within the restrictions of
the lesser encoding the whole time. there would be no true data loss, just
a loss of the benefits gained by working in the better encoding. if the
output encoding is restricted, that is a problem independent of what ruby
does or doesn't do. within ruby itself there would be no information loss,
nor any unnecessary errors raised, and that is the important thing
Post by Michael Selig
Post by d***@aanet.com.au
we first add a function
to do actual conversions between two encodings based on character, not
just reinterpreting the byte values. so c in latin-1 (0x63) would become c
in utf-32 (0x00000063).
String#encode does this I believe
this was just an example. what about if a string had the japanese
character ka in shift jis and was being converted to utf-8. the value
would be entirely different and encode() is not capable of doing this, is
it?
Post by Michael Selig
Post by d***@aanet.com.au
it could have lists of which encodings are
supersets of other encodings
Unfortunately it turns out that the only encoding that we can reliably state
is a subset of any other encoding is US-ASCII, and Ruby knows about this and
optimizes for it.
well, wikipedia seems to suggest jis x 0201 is a subset of shift jis (i
was also thinking, falsely, that latin-1 is a subset of utf-8), but this
doesn't really matter. it is only an optimisation and the success of my
proposal doesn't rest on it

i have a feeling i probably won't get anywhere with this, sadly :) ruby
may have too much momentum. what does everyone else think?
daz
2008-12-18 03:53:34 UTC
Post by d***@aanet.com.au
i have a feeling i probably won't get anywhere with this, sadly :) ruby
may have too much momentum. what does everyone else think?
I think all this encoding stuff belongs in a co-processor on the
motherboard. ;) :D

It pains me to think that there are so many groups around that are
having to work on this same dilemma. Respect to them.

Good posts, Daniel.


daz
d***@aanet.com.au
2008-12-18 05:27:22 UTC
Post by daz
Post by d***@aanet.com.au
i have a feeling i probably won't get anywhere with this, sadly :) ruby
may have too much momentum. what does everyone else think?
I think all this encoding stuff belongs in a co-processor on the
motherboard. ;) :D
It pains me to think that there are so many groups around that are
having to work on this same dilemma. Respect to them.
Good posts, Daniel.
haha. good idea. that's exactly where it should all go ;) cheers mate
Martin Duerst
2008-12-19 08:18:47 UTC
Post by d***@aanet.com.au
Post by Michael Selig
Post by d***@aanet.com.au
we first add a function
to do actual conversions between two encodings based on character, not
just reinterpreting the byte values. so c in latin-1 (0x63) would become c
in utf-32 (0x00000063).
String#encode does this I believe
this was just an example. what about if a string had the japanese
character ka in shift jis and was being converted to utf-8. the value
would be entirely different and encode() is not capable of doing this, is
it?
Have you ever tried? It's perfectly capable of doing this!
Why did you think it's not?
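
For example ("か", HIRAGANA LETTER KA, is 0x82 0xA9 in Shift_JIS and
0xE3 0x81 0x8B in UTF-8):

# encoding: utf-8
sjis = "か".encode("Shift_JIS")
p sjis.bytes.map { |b| "%02X" % b }  # => ["82", "A9"]
utf8 = sjis.encode("UTF-8")
p utf8.bytes.map { |b| "%02X" % b }  # => ["E3", "81", "8B"]
p utf8 == "か"                       # => true: a real conversion, not a relabel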

Regards, Martin.


#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
d***@aanet.com.au
2008-12-22 01:32:47 UTC
Post by Martin Duerst
Post by d***@aanet.com.au
Post by Michael Selig
Post by d***@aanet.com.au
we first add a function
to do actual conversions between two encodings based on character, not
just reinterpreting the byte values. so c in latin-1 (0x63) would become
c in utf-32 (0x00000063).
String#encode does this I believe
this was just an example. what about if a string had the japanese
character ka in shift jis and was being converted to utf-8. the value
would be entirely different and encode() is not capable of doing this, is
it?
Have you ever tried? It's perfectly capable to do this!
Why did you think it's not?
well, it seems that i don't really know what i'm talking about, so i
might just shut up now... :)

d***@aanet.com.au
2008-12-18 03:48:35 UTC
Post by d***@aanet.com.au
# do the new conversion stuff
if left.encoding != right.encoding and left.encoding != "binary" and
   right.encoding != "binary" and
   not superset(left.encoding, right.encoding)
  try
    # there could be a ranking of the encodings, and this could
    # determine which encoding to convert to rather than just
    # picking the left one every time
    new_right = right.convert(left.encoding)
  rescue Encoding::CompatibilityError
    return false
  end
and i've just realised i wrote try instead of begin. haha. sorry about
that :)
Yukihiro Matsumoto
2008-12-18 07:07:56 UTC
Hi,

In message "Re: [ruby-core:20619] Re: 1.9 character encoding (was: encoding of symbols)"
on Thu, 18 Dec 2008 08:43:45 +0900, ***@aanet.com.au writes:

|> Is there any alternative choice that makes sense?
|
|i think there is an alternative :) in fact, i think it should be the only
|option: doing it by character, not byte. if we have a string ("cat" for
|example) in two different, incompatible encodings (say latin-1 and ucs4),
|they *should* match because they are the *same* thing: "cat" and "cat"
|
|why should we care how the characters are represented in the computer?
|there is already too much trouble caused by all the different encodings in
|the world. the encodings should disappear from view and we should only see
|the characters, unless we explicitly ask for the byte values

I am against transcoding before comparison. Application models fall
into one of the following:

* treat a single encoding only, say iso-8859-1 (single encoding model)
* convert everything to Unicode at I/O (universal encoding model)
* mix various encodings per string (multiple encoding model)

Most of the applications use the former two, and mixed-encoding
comparison happens only in the last one. So I don't think we need to
make comparison more complex than it is now.

matz.
Daniel Cavanagh
2008-12-18 09:22:28 UTC
Post by Yukihiro Matsumoto
I am against transcoding before comparison. Application models
* treat a single encoding only, say iso-8859-1 (single encoding model)
* convert everything to Unicode at I/O (universal encoding model)
* mix various encodings per string (multiple encoding model)
Most of the applications use the former two, and mixed-encoding
comparison happens only in the last one. So I don't think we need to
make comparison more complex than it is now.
comparison would barely become any more complex than it is now. the
extra code would be pretty much what i wrote in my original email. the
only complex part is the conversion function, and that would be
immensely useful anyway. are there any plans to introduce a simple,
native function like this? (ie. something very unlike iconv ;)

but if you're against late conversion (ie. at comparison and
concatenation) and you're against early conversion (ie. at I/O), i
don't think this is going to happen :)

this is still annoying me, though. for instance, why do i have to use
/u with regexes to match against unicode strings? why do i have to
specify the type of the regex at all? why do i have to add in checks
before i can be sure i can concatenate properly? it's awkward and
it shouldn't be. it's very un-ruby like
Brian Candler
2008-12-18 09:59:57 UTC
Post by Daniel Cavanagh
and you're against early conversion (ie. at I/O)
That's what ruby-1.9 does out-of-the-box. e.g. if you set the external
encoding to ISO-8859-1, and the internal encoding to UTF-8, then your
program will see the source as if it were a stream of UTF-8.
Post by Daniel Cavanagh
why do i have to use /u
with regexes to match against unicode strings.
Only in Ruby 1.8. Or do you have a counter-example? Here in irb19 from
1.9.1-preview2:

irb(main):001:0> foo = "aßb"
=> "aßb"
irb(main):002:0> foo =~ /b/
=> 2
irb(main):003:0> foo =~ /ß/
=> 1
Post by Daniel Cavanagh
why do i have to specify
the type of the regex at all?
Only in 1.8
Post by Daniel Cavanagh
why do i have to add in checks before i
have to be sure i can concatenate properly?
Not sure what you mean. If you have decided to read in one string in UTF-8,
and a different string in Shift-JIS, and you really insist on concatenating
them, then clearly you'll end up with a binary muddle. But what else do you
expect? If you wrote

s = str1 + str2

how does Ruby know whether s should take the encoding of str1, or str2?

But if you convert all your data to one encoding at input time, so internal
processing is consistent, then the problem goes away.

At least, this is how I understand it at the moment :-)

Brian.
Daniel Cavanagh
2008-12-18 10:50:29 UTC
Post by Brian Candler
Post by Daniel Cavanagh
and you're against early conversion (ie. at I/O)
That's what ruby-1.9 does out-of-the-box. e.g. if you set the external
encoding to ISO-8859-1, and the internal encoding to UTF-8, then your
program will see the source as if it were a stream of UTF-8.
michael selig said "Ruby could have followed the Python route of
converting everything to Unicode, but that was rejected for various
good reasons. Automatic transcoding to solve issues of incompatible
encodings was also rejected because it causes a number of problems; in
particular, I believe that transcoding isn't necessarily accurate,
because for example there may be multiple or ambiguous representations
of the same character."

i was taking that to mean that ruby will not be doing automatic
conversion to one encoding. perhaps he just meant by default and that
the option is there if wanted
Post by Brian Candler
Post by Daniel Cavanagh
why do i have to use /u
with regexes to match against unicode strings.
Only in Ruby 1.8. Or do you have a counter-example? Here in irb19 from
irb(main):001:0> foo = "aßb"
=> "aßb"
irb(main):002:0> foo =~ /b/
=> 2
irb(main):003:0> foo =~ /ß/
=> 1
Post by Daniel Cavanagh
why do i have to specify
the type of the regex at all?
Only in 1.8
oops. sorry. for some reason i was under the impression that this was
still necessary. i don't know why
Post by Brian Candler
Post by Daniel Cavanagh
why do i have to add in checks before i
have to be sure i can concatenate properly?
Not sure what you mean. If you have decided to read in one string in UTF-8,
and a different string in Shift-JIS, and you really insist on
concatenating
them, then clearly you'll end up with a binary muddle. But what else do you
s = str1 + str2
how does Ruby know whether s should take the encoding of str1, or str2?
But if you convert all your data to one encoding at input time, so internal
processing is consistent, then the problem goes away.
At least, this is how I understand it at the moment :-)
if it's possible to convert everything to one encoding at input time,
why is it not possible to do so at concatenation time? if neither
encoding can represent both strings, ruby could pick a different,
superior encoding. there should theoretically be fewer problems,
because ruby could always decide on a suitable encoding, whereas the
programmer could pick a bad default encoding for input conversion
(although why anyone would pick anything other than utf-8 is beyond
me). also, it's possible that a converted string would have ended up
never being used with strings of a different encoding, so it would
have been a waste to convert it. doing it at comparison time would
have saved this unnecessary conversion
James Gray
2008-12-18 15:26:55 UTC
Post by Daniel Cavanagh
Post by Brian Candler
Post by Daniel Cavanagh
and you're against early conversion (ie. at I/O)
That's what ruby-1.9 does out-of-the-box. e.g. if you set the
external
encoding to ISO-8859-1, and the internal encoding to UTF-8, then your
program will see the source as if it were a stream of UTF-8.
michael selig said "Ruby could have followed the Python route of
converting everything to Unicode, but that was rejected for various
good reasons. Automatic transcoding to solve issues of incompatible
encodings was also rejected because it causes a number of problems; in
particular, I believe that transcoding isn't necessarily accurate,
because for example there may be multiple or ambiguous representations
of the same character."
i was taking that to mean that ruby will not be doing automatic
conversion to one encoding. perhaps he just meant by default and
that the option is there if wanted
And his very next paragraph in the message you are quoting from was:

What *was* introduced is the concept of a "default_internal"
encoding, which, if used by the programmer, causes I/O and
other interfaces to transcode to the internal encoding on
input and do the opposite on output.

James Edward Gray II
Martin Duerst
2008-12-19 07:22:40 UTC
Post by Daniel Cavanagh
Post by Yukihiro Matsumoto
I am against transcoding before comparison. Application models
* treat a single encoding only, say iso-8859-1 (single encoding model)
* convert everything to Unicode at I/O (universal encoding model)
* mix various encodings per string (multiple encoding model)
Most of the applications use the former two, and mixed-encoding
comparison happens only in the last one. So I don't think we need to
make comparison more complex than it is now.
comparison would barely become any more complex than it is now. the
extra code would be pretty much what i wrote in my original email. the
only complex part is the conversion function, and that would be
immensely useful anyway. are there any plans to introduce a simple,
native function like this? (ie. something very unlike iconv ;)
There is already native conversion (String#encode).
Can you tell me what you think isn't simple enough with
String#encode?
Post by Daniel Cavanagh
but if you're against late conversion (ie. at comparison and
concatenation) and you're against early conversion (ie. at I/O), i
don't think this is going to happen :)
Did Matz say he's against early conversion? I think the only
thing he says is that different people have different needs.

Doing early conversion may be very fine in some cases,
but not in others.

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Brian Candler
2008-12-18 09:24:59 UTC
Post by d***@aanet.com.au
concatenation could be extended too. if one string is a superset of the
other then no actual conversion needs to be done
Concatenation raises interesting issues. For example:

data = "".force_encoding("UTF-8")
while chunk = file.read(1024)
data << chunk
end
# what is data.encoding ?

Here each chunk of 1024 bytes may have split multibyte characters at start
or end. However it's OK to concatenate them, and as long as the file is read
to the end, the result would be valid UTF-8.

Ruby's current behaviour is to do the concatenation bytewise, but downgrades
the encoding to binary when concatenating binary onto the end of UTF-8 (and
File#read returns binary)

irb(main):001:0> data = "".force_encoding("UTF-8")
=> ""
irb(main):002:0> data.encoding
=> #<Encoding:UTF-8>
irb(main):003:0> data << "\x61"
=> "a"
irb(main):004:0> data.encoding
=> #<Encoding:UTF-8>
irb(main):005:0> data << "\xc3"
=> "a\xC3"
irb(main):006:0> data << "\x9f"
=> "a\xC3\x9F"
irb(main):007:0> data.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):008:0> data.force_encoding("UTF-8")
=> "aß"
Daniel Cavanagh
2008-12-18 11:00:02 UTC
On Thu, Dec 18, 2008 at 08:43:45AM +0900,
Post by d***@aanet.com.au
concatenation could be extended too. if one string is a superset of the
other then no actual conversion needs to be done
data = "".force_encoding("UTF-8")
while chunk = file.read(1024)
data << chunk
end
# what is data.encoding ?
Here each chunk of 1024 bytes may have split multibyte characters at start
or end. However it's OK to concatenate them, and as long as the file is read
to the end, the result would be valid UTF-8.
Ruby's current behaviour is to do the concatenation bytewise, but downgrades
the encoding to binary when concatenating binary onto the end of UTF-8 (and
File#read returns binary)
irb(main):001:0> data = "".force_encoding("UTF-8")
=> ""
irb(main):002:0> data.encoding
=> #<Encoding:UTF-8>
irb(main):003:0> data << "\x61"
=> "a"
irb(main):004:0> data.encoding
=> #<Encoding:UTF-8>
irb(main):005:0> data << "\xc3"
=> "a\xC3"
irb(main):006:0> data << "\x9f"
=> "a\xC3\x9F"
irb(main):007:0> data.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):008:0> data.force_encoding("UTF-8")
=> "aß"
well we know how to solve that don't we? make read() read characters
not bytes ;)

honestly, that seems to be the only proper solution. it makes no sense to
work with characters everywhere but then read only bytes. reading only
bytes should set the string's encoding to binary, and only when the
programmer is sure the string is valid utf-8 should he change the
encoding. the other options seem to be to continue doing what you
describe above (which is less than desirable) or to raise an exception,
which would be annoying to have to check for but possibly better than
the current solution. or maybe not...
James Gray
2008-12-18 15:33:45 UTC
Post by Daniel Cavanagh
well we know how to solve that don't we? make read() read characters
not bytes ;)
honestly, that seems to be the only proper solution. it makes no sense
to work with characters everywhere but then read only bytes. reading
only bytes should set the string's encoding to binary, and only when
the programmer is sure the string is valid utf-8 should he change
the encoding.
You can certainly do that. CSV does:

#
# Builds a String in <tt>@encoding</tt>. All +chunks+ will be transcoded
# to that encoding.
#
def encode_str(*chunks)
  chunks.map { |chunk| chunk.encode(@encoding.name) }.join
end

#
# Reads at least +bytes+ from <tt>@io</tt>, but will read up to 10 bytes
# ahead if needed to ensure the data read is valid in the encoding of
# that data. This should ensure that it is safe to use regular
# expressions on the read data, unless it is actually a broken encoding.
# The read data will be returned in <tt>@encoding</tt>.
#
def read_to_char(bytes)
  return "" if @io.eof?
  data = @io.read(bytes)
  begin
    encoded = encode_str(data)
    raise unless encoded.valid_encoding?
    return encoded
  rescue # encoding error or my invalid data raise
    if @io.eof? or data.size >= bytes + 10
      return data
    else
      data += @io.read(1) until data.valid_encoding? or @io.eof? or
                                data.size >= bytes + 10
      retry
    end
  end
end
Post by Daniel Cavanagh
the other options seem to be continue to do what you describe above
(which is less than desirable) or raise an exception, which would be
annoying to have to check for but possibly better than the current
solution. or maybe not...
Exceptions will be raised if you try to do something like match a
regular expression against data with a broken encoding.

James Edward Gray II
d***@aanet.com.au
2008-12-18 23:41:29 UTC
Post by James Gray
You can certainly do that. CSV does:
(...)
that doesn't read based on characters though. i meant read(1) would read
one whole character, no matter the number of bytes. this reads based on
bytes and just makes sure that there are no broken characters. and does
everyone who wants to work with exotic characters (ie, everyone but
english speakers nowadays, i would imagine) have to put this code into
their applications every time? shouldn't ruby just do it for them? but
this is a separate issue from the one i raised so perhaps we shouldn't go
into that just yet :)
Bill Kelly
2008-12-19 02:09:48 UTC
Post by d***@aanet.com.au
that doesn't read based on characters though. i meant read(1) would read
one whole character, no matter the number of bytes.
What about reading streams such as network sockets?

And/or nonblocking reads?



Regards,

Bill
Michael Selig
2008-12-19 02:41:29 UTC
Hi,
Post by Daniel Cavanagh
i was taking that to mean that ruby will not be doing automatic
conversion
to one encoding. perhaps he just meant by default and that the option is
there if wanted
Yep, that's what I meant.
I understand that the people who deal with Japanese & Chinese encodings
have certain requirements that could break if Ruby did automatic
transcoding. The compromise was that no transcoding happens by default
(which keeps these people and those doing non-m17n applications happy),
and to set "default_internal" to enable transcoding on I/O for those
people who are doing m17n.
Post by Daniel Cavanagh
that doesn't read based on characters though. i meant read(1) would read
one whole character, no matter the number of bytes.
Yes, IO#read is byte-oriented.
If you want to do character input, use IO#getc & gets. Note that gets has
a "limit" parameter which, although it's a number of bytes, never splits
a multi-byte character.
To my knowledge there is no way of reading N characters, other than
looping thru "getc" or "each_char". Perhaps there should be. It is
probably too late to change the "limit" in IO#gets to mean characters.
Perhaps a character count on IO#getc might be an idea?
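
In the meantime the getc loop is short enough (a sketch; read_chars is
just a hypothetical helper name):

# Reads up to n characters (not bytes) from io.
def read_chars(io, n)
  chars = ""
  n.times do
    c = io.getc or break  # IO#getc returns one character in 1.9, nil at EOF
    chars << c
  end
  chars
end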

Cheers,
Mike.
James Gray
2008-12-18 15:22:43 UTC
Post by Brian Candler
Ruby's current behaviour is to do the concatenation bytewise, but
downgrades the encoding to binary when concatenating binary onto the
end of UTF-8 (and File#read returns binary)
Yeah, it basically has a lowest common denominator encoding when
concatenating:

$ ri_dev -T Encoding::compatible?
-------------------------------------------------- Encoding::compatible?
Encoding.compatible?(str1, str2) => enc or nil

From Ruby 1.9.0
------------------------------------------------------------------------
Checks the compatibility of two strings. If they are compatible,
meaning concatenatable, returns the encoding which the concatenated
string will have. If they are not compatible, nil is returned.

Encoding.compatible?("\xa1".force_encoding("iso-8859-1"), "b")
=> #<Encoding:ISO-8859-1>

Encoding.compatible?(
"\xa1".force_encoding("iso-8859-1"),
"\xa1\xa1".force_encoding("euc-jp"))
=> nil

James Edward Gray II
Yukihiro Matsumoto
2008-12-18 18:44:09 UTC
Hi,

In message "Re: [ruby-core:20637] Re: 1.9 character encoding (was: encoding of symbols)"
on Thu, 18 Dec 2008 18:24:59 +0900, Brian Candler <***@pobox.com> writes:

|Concatenation raises interesting issues. For example:
|
| data = "".force_encoding("UTF-8")
| while chunk = file.read(1024)
| data << chunk
| end
| # what is data.encoding ?

UTF-8 + ASCII-8BIT makes ASCII-8BIT. Binary wins.

|irb(main):001:0> data = "".force_encoding("UTF-8")
|=> ""
|irb(main):002:0> data.encoding
|=> #<Encoding:UTF-8>
|irb(main):003:0> data << "\x61"
|=> "a"
|irb(main):004:0> data.encoding
|=> #<Encoding:UTF-8>
|irb(main):005:0> data << "\xc3"
|=> "a\xC3"
|irb(main):006:0> data << "\x9f"
|=> "a\xC3\x9F"
|irb(main):007:0> data.encoding
|=> #<Encoding:ASCII-8BIT>
|irb(main):008:0> data.force_encoding("UTF-8")
|=> "aß"

In this case, your source encoding (the encoding of literals) seems to
be US-ASCII, but the Ruby tokenizer automagically upgrades string
literals containing 8-bit characters to ASCII-8BIT, so according to the
above rule, the resulting encoding should be ASCII-8BIT. If your source
encoding is, say, UTF-8, you will get a UTF-8 result.

matz.
Martin Duerst
2008-12-19 07:22:40 UTC
Permalink
Post by Brian Candler
Post by d***@aanet.com.au
concatenation could be extended too. if one string is a superset of the
other then no actual conversion needs to be done
data = "".force_encoding("UTF-8")
while chunk = file.read(1024)
data << chunk
end
# what is data.encoding ?
Here each chunk of 1024 bytes may have split multibyte characters at start
or end. However it's OK to concatenate them, and as long as the file is read
to the end, the result would be valid UTF-8.
As File#read works on bytes, the above is conceptually wrong.
It should be fixed to:

data = ""
while chunk = file.read(1024)
data << chunk
end
data.force_encoding("UTF-8")
# what is data.encoding ?

Another way to do it is to just read from the file and tell Ruby that
you want the result to be UTF-8.
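For example (a sketch; "data.txt" is hypothetical):

  data = File.open("data.txt", "r:UTF-8") { |f| f.read }
  data.encoding   # => #<Encoding:UTF-8>

Note that read with no length argument returns text in the external
encoding; read(n) is the byte-oriented form.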

With Ruby 1.9, we get a lot more power for internationalization/
multilingualization. But we also have to learn how to use that
power, the same way we had to learn (or are still learning :-)
other powerful Ruby features.

Regards, Martin.



#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Brian Candler
2008-12-15 11:11:33 UTC
Permalink
Post by Michael Selig
Post by Brian Candler
For example, what are the semantics of
comparing strings with different encodings? Are they compared
byte-by-byte,
or character-by-character as unicode codepoints, or some other way?
Yes, I agree this needs to be documented a lot better than it is at the
moment.
I also think that some of the behaviour is a little "unexpected" :)
though this is only in unusual cases.
Thank you for your detailed explanation.

The other thing that concerns me most is how much more 'magic' behaviour
there is that I need to know about, and what I need to do to turn it off
when I am dealing with binary data. For me, DTRT means Leave My Binary
Data Alone :-)

e.g. I read that File.open now has an :encoding=>... option. This is not
documented in ri, apart from showing there is an [opt] parameter.

Does the :encoding default to binary, or to the encoding of the source file,
or the terminal from which it is run, or to something else? Does it try to
be clever, e.g. taste the Unicode BOM?? For me that would be the "wrong"
thing. For me, it seems to be doing the wrong thing by default:

irb(main):004:0> File.open("/bin/sh").gets.encoding
=> #<Encoding:UTF-8>

Ditto for sockets.

Ditto for Net::HTTP - e.g. does it try to use the Content-Type ... charset
header? If not now, might it do so in future??
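One workaround is to be explicit and never rely on the default, e.g. (a
sketch):

  File.open("/bin/sh", "r:BINARY").gets.encoding
  # => #<Encoding:ASCII-8BIT>

but that only helps where I remember to do it.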

At the moment I am worried about the semantics of the most basic low-level
operations. For example, if I read some bytes from file A, and compare it to
a string literal B, I want to be sure they will compare equal if they are
the same sequence of bytes. By the sound of it, this means I have to declare
:encoding=>"BINARY" everywhere, or at least be confident that everything I
do has this as a default (and will remain so going forward); if I forget
one, I may introduce subtle bugs into my program.
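For instance (a sketch; "anim.gif" is a hypothetical file):

  chunk = File.open("anim.gif", "rb") { |f| f.read(4) }  # ASCII-8BIT
  chunk == "GIF8"   # => true: same bytes, and US-ASCII vs ASCII-8BIT
                    #    counts as compatible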

Even BINARY seems to be a second-class alias:

irb(main):010:0> a.force_encoding("BINARY")
=> "a\xC3\x9F"
irb(main):011:0> a.encoding
=> #<Encoding:ASCII-8BIT>

Ruby is telling me that all data is text, whereas I believe the opposite is
true :-)

Regards,

Brian.
Brian Candler
2008-12-15 11:32:48 UTC
Permalink
Post by Brian Candler
irb(main):004:0> File.open("/bin/sh").gets.encoding
=> #<Encoding:UTF-8>
Sorry, that wasn't a good example. Using #read instead:

irb(main):001:0> File.open("/bin/sh").read(16).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):002:0> File.open("/etc/passwd").read(16).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):003:0> File.open("/bin/sh").gets.encoding
=> #<Encoding:UTF-8>
irb(main):004:0> File.open("/etc/passwd").gets.encoding
=> #<Encoding:UTF-8>

So I guess: the encoding declared in open is used for #gets, but #read
always returns ASCII-8BIT.

I see also that inline literals have variable encoding:

irb(main):029:0> "\x61\x62\x63".encoding
=> #<Encoding:US-ASCII>
irb(main):028:0> "\x61\xc3\x9f".encoding
=> #<Encoding:ASCII-8BIT>
From what I've heard so far, these will compare equal to data from my file
as long as US-ASCII and ASCII-8BIT are treated as 'compatible' encodings. It
still bothers me though :-)

#getc returns a single "character" (which can be more than one byte) and
tags it as UTF-8, although it will happily consume and return invalid UTF-8
too.

irb(main):025:0> File.open("tst","w") { |f| f.write "a\xc3" }
=> 2
irb(main):026:0> f = File.open("tst")
=> #<File:tst>
irb(main):027:0> 2.times { c = f.getc; puts "#{c.inspect} => #{c.encoding}" }
"a" => UTF-8
"\xC3" => UTF-8
=> 2

irb(main):033:0> File.open("tst","w") { |f| f.write "a\xc3\x9f" }
=> 3
irb(main):034:0> f = File.open("tst")
=> #<File:tst>
irb(main):035:0> 2.times { c = f.getc; puts "#{c.inspect} => #{c.encoding}" }
"a" => UTF-8
"ß" => UTF-8
=> 2

Regards,

Brian.
Yukihiro Matsumoto
2008-12-15 14:51:15 UTC
Permalink
Hi,

In message "Re: [ruby-core:20567] Re: 1.9 character encoding (was: encoding of symbols)"
on Mon, 15 Dec 2008 20:32:48 +0900, Brian Candler <***@pobox.com> writes:

|Sorry, that wasn't a good example. Using #read instead:
|
|irb(main):001:0> File.open("/bin/sh").read(16).encoding
|=> #<Encoding:ASCII-8BIT>
|irb(main):002:0> File.open("/etc/passwd").read(16).encoding
|=> #<Encoding:ASCII-8BIT>
|irb(main):003:0> File.open("/bin/sh").gets.encoding
|=> #<Encoding:UTF-8>
|irb(main):004:0> File.open("/etc/passwd").gets.encoding
|=> #<Encoding:UTF-8>
|
|So I guess: the encoding declared in open is used for #gets, but #read
|always returns ASCII-8BIT.

Yes, read specifies data length in bytes, so its return value should
be binary (ASCII-8BIT).

|I see also that inline literals have variable encoding:
|
|irb(main):029:0> "\x61\x62\x63".encoding
|=> #<Encoding:US-ASCII>
|irb(main):028:0> "\x61\xc3\x9f".encoding
|=> #<Encoding:ASCII-8BIT>

This is old behavior. Now string literals are always in their source
encoding. Try a newer version.

matz.
Brian Candler
2008-12-15 20:33:48 UTC
Permalink
Post by Yukihiro Matsumoto
|irb(main):029:0> "\x61\x62\x63".encoding
|=> #<Encoding:US-ASCII>
|irb(main):028:0> "\x61\xc3\x9f".encoding
|=> #<Encoding:ASCII-8BIT>
This is old behavior. Now string literals are always in their source
encoding. Try newer version.
That was 1.9.1-preview2. I have just built from trunk, and I get the same:

irb(main):001:0> RUBY_REVISION
=> 20768
irb(main):002:0> "\x61\x62\x63".encoding
=> #<Encoding:US-ASCII>
irb(main):003:0> "\x61\xc3\x9f".encoding
=> #<Encoding:ASCII-8BIT>
irb(main):004:0> __ENCODING__
=> #<Encoding:US-ASCII>


And:

$ cat ert.rb
p __ENCODING__
p Encoding.default_external
p Object.constants.grep(/RUBY/).map { |c| [c,Object.const_get(c)] }
p "\x61\x62\x63".encoding
p "\x61\xc3\x9f".encoding

$ ruby19 ert.rb
#<Encoding:US-ASCII>
#<Encoding:UTF-8>
[[:RUBY_VERSION, "1.9.1"], [:RUBY_RELEASE_DATE, "2008-12-16"], [:RUBY_PLATFORM, "i686-linux"], [:RUBY_PATCHLEVEL, 5000], [:RUBY_REVISION, 20768], [:RUBY_DESCRIPTION, "ruby 1.9.1 (2008-12-16 revision 20768) [i686-linux]"], [:RUBY_COPYRIGHT, "ruby - Copyright (C) 1993-2008 Yukihiro Matsumoto"], [:RUBY_ENGINE, "ruby"]]
#<Encoding:US-ASCII>
#<Encoding:ASCII-8BIT>

Well, I guess I can make sense of this:
- source encoding is US-ASCII, presumably by default
- if I embed hex escapes, the literal is promoted to ASCII-8BIT
- if I embed a UTF-8 codepoint directly in a literal, it raises an
encoding error
- external encoding is UTF-8, presumably from environment

However I can add

# Encoding: binary

to the top of my source to get consistent encoding of literals.
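A quick check (a sketch):

  # Encoding: binary
  p __ENCODING__              # => #<Encoding:ASCII-8BIT>
  p "\x61\xc3\x9f".encoding   # => #<Encoding:ASCII-8BIT>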

B.
Yukihiro Matsumoto
2008-12-15 23:48:23 UTC
Permalink
Hi,

In message "Re: [ruby-core:20589] Re: 1.9 character encoding (was: encoding of symbols)"
on Tue, 16 Dec 2008 05:33:48 +0900, Brian Candler <***@pobox.com> writes:
|
|On Mon, Dec 15, 2008 at 11:51:15PM +0900, Yukihiro Matsumoto wrote:
|> |irb(main):029:0> "\x61\x62\x63".encoding
|> |=> #<Encoding:US-ASCII>
|> |irb(main):028:0> "\x61\xc3\x9f".encoding
|> |=> #<Encoding:ASCII-8BIT>
|>
|> This is old behavior. Now string literals are always in their source
|> encoding. Try newer version.
|
|That was 1.9.1-preview2. I have just built from trunk, and I get the same:

Oops, sorry. This is special treatment for US-ASCII, which is
restricted to 7-bit characters.

|Well, I guess I can make sense of this:
|- source encoding is US-ASCII, presumably by default
|- if I embed hex escapes, the literal is promoted to ASCII-8BIT
|- if I embed a UTF-8 codepoint directly in a literal, it raises an
| encoding error

But you can embed \uXXXX escapes in string literals, and you get
UTF-8 strings.
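For example:

  s = "\u00df"
  s            # => "ß"
  s.encoding   # => #<Encoding:UTF-8>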

matz.
James Gray
2008-12-15 14:00:31 UTC
Permalink
Post by Brian Candler
irb(main):010:0> a.force_encoding("BINARY")
=> "a\xC3\x9F"
irb(main):011:0> a.encoding
=> #<Encoding:ASCII-8BIT>
Ruby is telling me that all data is text, whereas I believe the
opposite is true :-)
BINARY is an alias for ASCII-8BIT. The latter name is just a Rubyism
to suggest that even with BINARY data, we sometimes want to pick out
the ASCII with a Regexp or the like. The example typically given is
the signature of an image file, like PNG.
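For instance, checking a PNG signature in binary data (a sketch;
"image.png" is a hypothetical file):

  head = File.open("image.png", "rb") { |f| f.read(8) }
  head =~ /\A\x89PNG\r\n\x1a\n/n   # => 0 for a real PNG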

James Edward Gray II
Dave Thomas
2008-12-13 14:49:17 UTC
Permalink
Post by Charles Oliver Nutter
Very good point; symbols are not necessarily created in the file
where you use their literal form, and therefore need to have a
single encoding everywhere. I concur.
Maybe, to avoid confusion, symbols should be constrained to be
US-ASCII then. Otherwise we have a strange situation where the order in
which files are required affects the internal encoding of a symbol,
and that seems wrong.

In theory, we say "the value of :fred is the same wherever it
occurs." But as it stands now, that's no longer true: the value of a
symbol depends on the order in which files are required. :olé encoded
with iso-8859-1 is a different symbol to :olé encoded with utf-8.

It's analogous to the situation where we could set the default base
for the interpretation of numeric literals. If that base is local to a
source file, then the behavior is well defined. But if instead we say
that the base of a number depends on the base that was in effect the
first time that literal was encountered, then chaos would ensue.

Clearly with symbols the situation is not quite as immediate, but the
principle is the same.

Basically, we're saying that the value of :olé.succ is no longer well
defined.



Dave
Martin Duerst
2008-12-19 05:27:42 UTC
Permalink
Post by Charles Oliver Nutter
Very good point; symbols are not necessarily created in the file
where you use their literal form, and therefore need to have a
single encoding everywhere. I concur.
Maybe, to avoid confusion, symbols should be constrained to be US-ASCII then.
For libraries and other general stuff, that makes sense.
It doesn't make sense for more specialized stuff.
Otherwise we have a strange situation where the order in
which files are required affects the internal encoding of a symbol,
and that seems wrong.
In theory, we say "the value of :fred is the same wherever it
occurs." But as it stands now, that's no longer true: the value of a
symbol depends on the order in which files are required.
I don't understand why anything would depend on the order in which
files are required. Each source file has to know its encoding,
therefore the order should be irrelevant.
:olé encoded with iso-8859-1 is a different symbol to :olé encoded
with utf-8.
It definitely will be different.


Regards, Martin.


#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp
Martin Duerst
2008-12-14 07:48:00 UTC
Permalink
Post by Yukihiro Matsumoto
It was very easy to implement it, but when I tried, I found different
# encoding: utf-8
p :a.encoding # => #<Encoding:UTF-8>
p :p.encoding # => #<Encoding:US-ASCII>
means a symbol would have an encoding of which the symbol first
appears, in this case symbol :a first appears on a file with UTF-8
encoding, whereas :p appears first for a name to built-in method.
So rather making symbols to have somewhat unpredictable encoding, I'd
rather keep them as they are now, despite the inconsistency with
string encoding.
Very good point; symbols are not necessarily created in the file where you use their literal form, and therefore need to have a single encoding everywhere. I concur.
Please note that, as far as I understand, ASCII-only symbols will be
in <Encoding:US-ASCII> and will therefore match across files with
different non-ASCII encodings, but symbols including non-ASCII
characters will be in different encodings, and therefore won't match
between files of different encodings.
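For example (a sketch; three hypothetical files, each saved in the
encoding its magic comment declares):

  # --- one.rb ---
  # encoding: utf-8
  SYM_ONE = :olé

  # --- two.rb ---
  # encoding: iso-8859-1
  SYM_TWO = :olé   # same characters, stored as different bytes

  # --- main.rb ---
  require './one'
  require './two'
  p SYM_ONE == SYM_TWO   # => false: two distinct symbols
  # (an ASCII-only symbol like :cat would be US-ASCII in both
  # files, and would match)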

Regards, Martin.



#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp