The community technical support mailing list was retired in 2010 and replaced with a professional technical support team. For assistance, please contact pre-sales technical support by email at sales@march-hare.com.
Hmm, I do not agree, but maybe we simply have a communication problem because I'm not a native English speaker. Let me try to point out what I meant:

There are four variations of UTF-16 files: UTF-16 with a BOM (one is BE, one is LE), UTF-16BE and UTF-16LE. The first two contain a BOM, which signals the byte order; the other two don't. They don't need one because they are explicitly called UTF-16BE and UTF-16LE. There are no strict rules for when to use a BOM; it depends on the protocol that uses the text stream. For example, Microsoft declared that txt files must have a BOM. On the other hand, the use of a BOM can be forbidden (see http://www.unicode.org/faq/utf_bom.html#28). To give an example from my work: we save the content of a database in CVS. In the DB the Unicode string has no BOM, and when we save it to a file and put it into CVS we don't add a BOM. All four variations are fully legal.

What cvsnt seems to do during commit is to cut off the first two bytes, where it expects to find the BOM. And during checkout/update, it adds "0xFF 0xFE" in front of the stream. What does this mean for the four variations?

1) UTF-16 (BE with BOM)
   Input file:  0xFF 0xFE 0x54 0x00 0x68 0x00 0x69 0x00 0x73 0x00 => "This"
   Output file: 0xFF 0xFE 0x54 0x00 0x68 0x00 0x69 0x00 0x73 0x00 => "This"

2) UTF-16 (LE with BOM)
   Input file:  0xFE 0xFF 0x00 0x54 0x00 0x68 0x00 0x69 0x00 0x73 => "This"
   Output file: 0xFF 0xFE 0x00 0x54 0x00 0x68 0x00 0x69 0x00 0x73 => ----

3) UTF-16BE
   Input file:  0x54 0x00 0x68 0x00 0x69 0x00 0x73 0x00 => "This"
   Output file: 0xFF 0xFE 0x68 0x00 0x69 0x00 0x73 0x00 => "his"

4) UTF-16LE
   Input file:  0x00 0x54 0x00 0x68 0x00 0x69 0x00 0x73 => "This"
   Output file: 0xFF 0xFE 0x00 0x68 0x00 0x69 0x00 0x73 => ----

In case 1) everything is ok.
In case 2) the BOM says it is a BE file, but the content is LE => damage
In case 3) lost the first character (two bytes) of content, but added a BOM, which might be undesired => damage
In case 4) lost the first character (two bytes), but added a BE BOM to an LE stream => damage

So of the 4 variations only one was intact after commit/update.

To make one point clear: in my opinion it is 100 percent ok to support only BE UTF-16, but it should be documented more precisely that this is the only format.

In particular, consider the following: in my experience it is very difficult to delete something using cvs. cvs works very defensively, which is a very, very fine thing. Whenever something is about to change, cvs makes backup files. To be consistent with this, I propose to defuse the behaviour described above by rejecting the commit of "Unicode" files when they don't start with 0xFF 0xFE.

Hope that clears the fog
Olaf

Tony Hoyle wrote:
> Olaf Groeger wrote:
>
>>
>> But be aware that this must be UTF-16 BE including BOM (0xff 0xfe). All
>> other UTF-16 (LE and/or no BOM) will be silently damaged.
>>
> LE isn't common on intel systems (in fact it's basically unheard of). The
> file is still a perfectly valid Unicode file - the BOM is part of the
> standard, precisely to avoid the problems distinguishing between LE and
> BE.
>
> If you want the exact file use binary mode... you lose merging though.
>
> Tony
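The commit/checkout behaviour described in the message above can be reproduced with a small simulation. This is only a sketch under stated assumptions: it is not cvsnt code, and the functions `commit`, `checkout`, `round_trip` and `guarded_commit` are hypothetical names that model what the message reports (strip the first two bytes on commit, prepend 0xFF 0xFE on checkout) plus the proposed rejection rule. One caveat when comparing with Python's codecs: they follow the Unicode standard's naming, in which the mark 0xFF 0xFE denotes little-endian, so the labels BE and LE in the message are the reverse of Python's `utf-16-be` and `utf-16-le`.

```python
# Hypothetical model of the behaviour reported in the message; not cvsnt code.
BOM = b"\xff\xfe"  # the two bytes reported to be written on checkout/update


def commit(data: bytes) -> bytes:
    """Assumed commit behaviour: drop the first two bytes unconditionally."""
    return data[2:]


def checkout(stored: bytes) -> bytes:
    """Assumed checkout behaviour: prepend 0xFF 0xFE to the stored stream."""
    return BOM + stored


def round_trip(data: bytes) -> bytes:
    """Commit a file and check it out again."""
    return checkout(commit(data))


def guarded_commit(data: bytes) -> bytes:
    """Proposed defensive variant: reject files that lack the expected mark."""
    if not data.startswith(BOM):
        raise ValueError("file does not start with 0xFF 0xFE; commit rejected")
    return commit(data)


# The four input files from the message, numbered as in the message.
CASES = {
    1: bytes.fromhex("fffe 5400 6800 6900 7300"),
    2: bytes.fromhex("feff 0054 0068 0069 0073"),
    3: bytes.fromhex("5400 6800 6900 7300"),
    4: bytes.fromhex("0054 0068 0069 0073"),
}

if __name__ == "__main__":
    for number, original in CASES.items():
        result = round_trip(original)
        status = "intact" if result == original else "damaged"
        print(f"case {number}: {original.hex()} -> {result.hex()} ({status})")
```

With the guard in place, cases 2 to 4 would be refused at commit time instead of being altered silently, which is the defensive behaviour the message argues for.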