Post by Yukihiro Matsumoto
You're right. When we have two strings with identical byte sequences
but different encodings, we have to tell that they are different. The
comparison result does not matter much, so I used the encoding index.
Is there any alternative choice that makes sense?
i think there is an alternative :) in fact, i think it should be the only
option: doing it by character, not byte. if we have a string ("cat" for
example) in two different, incompatible encodings (say latin-1 and ucs4),
they *should* match because they are the *same* thing: "cat" and "cat"
why should we care how the characters are represented in the computer?
there is already too much trouble caused by all the different encodings in
the world. the encodings should disappear from view and we should only see
the characters, unless we explicitly ask for the byte values
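to make that concrete, here's roughly the behaviour i'm after (the
results marked below are hypothetical, not what current ruby gives you):

a = "cat".encode("ISO-8859-1")  # latin-1 bytes: 63 61 74
b = "cat".encode("UTF-32BE")    # utf-32 bytes: 00000063 00000061 00000074
a == b  # => true under this proposal (current ruby says false)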
and i think this would be simple to achieve too. we first add a function
to do actual conversions between two encodings based on character, not
just reinterpreting the byte values. so c in latin-1 (0x63) would become c
in utf-32 (0x00000063). it could have lists of which encodings are
supersets of other encodings (based on byte values, so latin-1 is a
superset of us-ascii because every us-ascii string is already a valid
latin-1 string, but utf-8 is not a superset of shift jis even though utf-8
can represent every shift jis character (i think :)). it would then know
it doesn't have to do any actual conversion when switching from a subset
to a superset: the string would already be valid in the superset. if,
during an actual conversion, a character is found that can't be
represented in the new encoding, then an existing
Encoding::CompatibilityError or a new ConversionError would be raised
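something like this, as a rough sketch (SUPERSETS, superset? and
convert_chars are made-up names for the pieces just described):

SUPERSETS = {
  "ISO-8859-1" => ["US-ASCII"],  # every us-ascii string is already valid latin-1
  "UTF-8"      => ["US-ASCII"],  # utf-8 is a byte-level superset of us-ascii too
}

def superset?(sup, sub)
  sup == sub || SUPERSETS.fetch(sup, []).include?(sub)
end

def convert_chars(str, enc)
  return str if superset?(enc, str.encoding.name)  # already valid, no work needed
  str.encode(enc)  # character-based conversion; raises if some character
                   # can't be represented in the target encoding
end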
then comparison could be changed so that it converts one of the strings
into the other's encoding if it needs to before comparing the strings.
something like:
def ==(other)
  # ... begin as == currently does ...
  # then do the new conversion step
  if encoding != other.encoding && encoding != "binary" &&
     other.encoding != "binary" && !superset?(encoding, other.encoding)
    begin
      # there could be a ranking of the encodings, and that could
      # determine which encoding to convert to, rather than always
      # picking the left (receiver's) one
      other = other.convert(encoding)
    rescue Encoding::CompatibilityError
      return false
    end
  end
  # do the comparison as we currently do, byte by byte
end
you'll notice that it neatly catches the exception and returns false,
because obviously the strings won't match in that case. most importantly,
there are no extra problems for the user to worry about when comparing
strings. they will not have to catch exceptions, or anything like that
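from the user's side it would just look like this (hypothetical results
again; current ruby would say false here):

latin = "caf\xE9".force_encoding("ISO-8859-1")  # "café" in latin-1
utf8  = "café".encode("UTF-8")
latin == utf8  # => true under this proposal, and no rescue needed anywhere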
also, notice that it checks if either string has a binary encoding. this
basically means that there is no encoding, so effectively the two strings
are already in the same encoding and a byte-by-byte comparison is
automatically done. this gives us the current behaviour:
if str_one.encode("binary") == str_two # would match based on byte, rather
than character, as ruby does currently
with that simple change we've achieved default comparison based on
character, and comparison based on byte if asked for, so we get the best
of both worlds :)
concatenation could be extended too. if one string is a superset of the
other then no actual conversion needs to be done and the resulting string
would already be in the superset. if not, we could either always convert
the right string to the left string's encoding, raising an exception when
we encounter a char that can't be converted (which is no worse than what
already happens when
concatenating differently encoded strings) or we could pick the encoding
that is a character-based superset (say shift jis to utf-8), or failing
that ruby could just pick a default superior encoding that handles every
character found in every other encoding. utf-8 would fit this role
(correct?) so we could fall back to it and then the concatenation would
*never* fail
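here's a sketch of that rule, reusing the made-up superset? helper from
before (and assuming utf-8 really can hold every character, as hoped):

def concat_chars(left, right)
  if superset?(left.encoding.name, right.encoding.name)
    left + right.dup.force_encoding(left.encoding)  # bytes already valid as-is
  elsif superset?(right.encoding.name, left.encoding.name)
    left.dup.force_encoding(right.encoding) + right
  else
    # no subset relation: fall back to a default encoding that can
    # represent everything, so the concatenation never fails
    left.encode("UTF-8") + right.encode("UTF-8")
  end
end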
it doesn't matter that ruby may pick a seemingly random encoding for the
resulting string because, anywhere else this new string is used, ruby
would just handle it intelligently
you could even use "binary" here again:
str_one + str_two.encode("binary")
this would act as if you wrote "str_one + str_two.encode(str_one.encoding)",
effectively doing the concatenation based on bytes rather than characters
with all this in place, you could write, say, a program that works with a
lot of japanese documents, some older ones that are in shift jis, some
newer ones in utf-8, some in other japanese-friendly encodings, and then
you could just work with them without worrying at all: mix them; match
them; write them to disk; grab a string from one file and search through
every other file, properly matching any instances of the string, no matter
the encoding. you would hardly ever have to think about the encodings, if
at all. it would all just work! god, i'm getting excited just
describing this :) should this not be what we strive for? a system where
we just think of the characters, not about how our computer decides to
write them. i think this is much more what ruby is about
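the day-to-day code i'm imagining is something like this (hypothetical,
since include? across incompatible encodings raises in current ruby):

needle = File.read("old_doc.txt", encoding: "Shift_JIS")[0, 20]
Dir.glob("docs/*.txt").each do |path|
  text = File.read(path)              # whatever encoding each file is in
  puts path if text.include?(needle)  # character-based match, any encoding
end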
i might have a go at coding this myself to see how it works in practice if
anyone else is interested. does anyone know of a document that describes
yarv's internals, as i'm not familiar with it yet? or more importantly any
criticisms or anything that my brain missed that may destroy this idea?