2015 Apr 12, 10:39 2012 Jun 11, 2:36
A leaf directory in a whole tree of files that map character set byte values to Unicode code points. This one holds Microsoft character set byte mappings, but other
vendors are in there too.
technical unicode charset
2010 Aug 13, 11:47
Other character sets for HTTP headers: "By default, message header field parameters in Hypertext Transfer Protocol (HTTP) messages cannot carry characters outside the ISO-8859-1 character set. RFC
2231 defines an encoding mechanism for use in Multipurpose Internet Mail Extensions (MIME) headers. This document specifies an encoding suitable for use in HTTP header fields that is compatible with
a profile of the encoding defined in RFC 2231."
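The quoted abstract matches RFC 5987, whose extended parameter syntax is `name*=charset'language'percent-encoded-value`. A minimal sketch of producing such a parameter follows; the function name `ext_parameter` is made up for illustration, and the sketch assumes UTF-8 as the charset and does no validation of the parameter name or language tag.

```python
from urllib.parse import quote

# Punctuation that the RFC's attr-char rule allows unescaped.
# Letters and digits are always left alone by quote().
ATTR_CHAR_PUNCT = "!#$&+-.^_`|~"

def ext_parameter(name: str, value: str, language: str = "") -> str:
    """Build an extended header parameter (hypothetical helper, UTF-8 only)."""
    # Everything outside attr-char is percent-encoded as UTF-8 bytes.
    encoded = quote(value, safe=ATTR_CHAR_PUNCT, encoding="utf-8")
    return f"{name}*=UTF-8'{language}'{encoded}"

print(ext_parameter("title", "€ exchange rates"))
print(ext_parameter("title", "£ rates", "en"))
```

The empty language field between the two apostrophes is allowed; the apostrophes themselves are always present.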
rfc language localization charset http technical reference http-header
2010 Jan 8, 2:08
A Flickr dev talks about image metadata: the various forms, which to prefer, and how to guess at their character encodings.
unicode charset flickr photo image exif programming reference xmp technical
2009 Apr 23, 1:35
"This e-mail is an attempt to give a relatively concise yet reasonably complete overview of non-Unicode character sets and encodings for 'Chinese characters', excluding those which are not supported
by at least one of the four browsers IE, Safari, Firefox and Opera (henceforth 'all browsers'), and tentatively avoiding technical details which are out of scope for HTML5 unless they are important
to gain a general understanding of the relevant issues."
html html5 iso-2022 charset encoding character unicode cjk
2009 Mar 6, 5:16
I've found while debugging networking in IE that it's often useful to quickly tell if a string is encoded in UTF-8. You can check for the Byte Order Mark (EF BB BF in UTF-8), but I rarely see the BOM on
UTF-8 strings. Instead I apply a quick and dirty UTF-8 test that takes advantage of the well-formed UTF-8 restrictions.
UTF-8 is more restrictive than other multibyte character encoding forms (see Windows supported character sets or IANA's list of character sets). In Big5, for example, sticking together any two bytes is more likely than not to give a valid byte sequence. And unlike
other multibyte character encodings, UTF-8 bytes may be taken out of context and one can still know that a byte is a single byte character, the starting byte of a three byte sequence, etc.
The full rules for well-formed UTF-8 are a little too complicated for me to commit to memory. Instead I've got my own simpler (this is the quick part) set of rules that will be mostly correct (this
is the dirty part). For as many bytes in the string as you care to examine, check the most significant hex digit (the high nibble) of the byte:
- F: This is byte 1 of a 4 byte encoded codepoint and must be followed by 3 trail bytes.
- E: This is byte 1 of a 3 byte encoded codepoint and must be followed by 2 trail bytes.
- C..D: This is byte 1 of a 2 byte encoded codepoint and must be followed by 1 trail byte.
- 8..B: This is a trail byte.
- 0..7: This is a single byte encoded codepoint.
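As a rough sketch, the five rules above translate into a few lines of Python. The function name `quick_utf8_check` is made up for illustration, and the sketch assumes the whole string is examined; if you only look at a prefix, a sequence cut off at the end should not count against it.

```python
def quick_utf8_check(data: bytes) -> bool:
    """Quick and dirty test: classify each byte by its high nibble only."""
    trail_needed = 0  # trail bytes still owed to the current lead byte
    for byte in data:
        nibble = byte >> 4
        if 0x8 <= nibble <= 0xB:      # 8..B: trail byte
            if trail_needed == 0:
                return False          # trail byte with no lead byte before it
            trail_needed -= 1
            continue
        if trail_needed:
            return False              # sequence cut short by a non-trail byte
        if nibble == 0xF:             # F: lead of a 4 byte sequence
            trail_needed = 3
        elif nibble == 0xE:           # E: lead of a 3 byte sequence
            trail_needed = 2
        elif nibble >= 0xC:           # C..D: lead of a 2 byte sequence
            trail_needed = 1
        # 0..7: single byte codepoint, nothing owed
    return trail_needed == 0          # assumes the full string was examined

print(quick_utf8_check("héllo €".encode("utf-8")))  # genuine UTF-8
print(quick_utf8_check(b"caf\xe9"))                 # Latin-1 bytes
print(quick_utf8_check(b"\xc0\x80"))                # passes, though not well-formed
```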
The simpler rules can produce false positives in some cases: that is, they'll say a string is UTF-8 when in fact it might not be. For example, they accept C0, C1, and F5..FF as lead bytes, which well-formed UTF-8 never uses. But they won't produce false negatives. The following is the table from the Unicode spec that actually describes well-formed UTF-8.
Code Points | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
U+0000..U+007F | 00..7F | | | |
U+0080..U+07FF | C2..DF | 80..BF | | |
U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
U+D000..U+D7FF | ED | 80..9F | 80..BF | |
U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
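For comparison, here is a sketch of a strict check driven directly by the table above. Each entry pairs a first-byte range with the allowed ranges for the bytes that follow. The names `WELL_FORMED` and `is_well_formed_utf8` are made up for illustration; in practice `bytes.decode("utf-8")` in strict mode enforces the same rules.

```python
# (first byte range, allowed ranges for each following byte), one row per table line
WELL_FORMED = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEC), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xED, 0xED), [(0x80, 0x9F), (0x80, 0xBF)]),
    ((0xEE, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]

def is_well_formed_utf8(data: bytes) -> bool:
    i = 0
    while i < len(data):
        # Find the table row whose first-byte range contains this byte.
        for (lo, hi), trails in WELL_FORMED:
            if lo <= data[i] <= hi:
                break
        else:
            return False  # C0, C1, F5..FF, or a stray trail byte
        following = data[i + 1 : i + 1 + len(trails)]
        if len(following) != len(trails):
            return False  # sequence truncated at the end of the data
        for b, (tlo, thi) in zip(following, trails):
            if not tlo <= b <= thi:
                return False
        i += 1 + len(trails)
    return True

print(is_well_formed_utf8("héllo € 😀".encode("utf-8")))
print(is_well_formed_utf8(b"\xc0\x80"))      # overlong form, rejected
print(is_well_formed_utf8(b"\xed\xa0\x80"))  # surrogate range, rejected
```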
test technical unicode boring charset utf8 encoding
2008 Oct 1, 1:08
A weekly summary of the goings-on in the WHATWG, usually on the topic of squabbles in HTML5, esp. what to do about the alt attribute in the img tag. Interesting stuff on charsets.
development software whatwg html5 html specification feed rss user-agent w3c
2008 Mar 18, 11:21
End-of-line handling in XML. Spoiler: an XML processor should normalize most newline character sequences to 0xA.
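A minimal sketch of that normalization, assuming the XML 1.0 rules only: the pair CR LF and any lone CR both become a single LF. XML 1.1 also covers NEL (U+0085) and LINE SEPARATOR (U+2028), which this sketch leaves out. The function name `normalize_xml_newlines` is made up for illustration.

```python
import re

def normalize_xml_newlines(text: str) -> str:
    """Apply XML 1.0 style end-of-line normalization to already-decoded text."""
    # "\r\n?" matches CR LF as one unit, or a CR that is not followed by LF.
    return re.sub(r"\r\n?", "\n", text)

print(repr(normalize_xml_newlines("a\r\nb\rc\nd")))
```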
xml spec standard w3c unicode charset newline end-of-line
2008 Mar 8, 11:44
"This memo defines extensions to the RFC 2045 media type and RFC 2183 disposition parameter value mechanisms to provide ... a means to specify parameter values in character sets other than
US-ASCII..."
http http-header rfc standard reference ietf mime encoding charset language content-disposition
2008 Mar 8, 11:43
"I was not able to find universal settings to do this task, but it looks like Mozilla based browsers accepts utf-8 encoded headers and headers Encoded Word Extensions from RFC 2231. Internet explorer
accepts utf-8 filenames only when 1. the data are URL e
http http-header charset ascii utf8 mozilla ie browser content-disposition
2007 Nov 7, 4:28
Out-of-date W3C document containing stats on the frequency of use of various charsets in HTML pages (in 1997).
charset encoding i18n language reference w3c statistics
2007 Oct 19, 4:10
FTA: 'This letter was sent to a Russian student by her French friend, who manually wrote the address that he received by e-mail. His e-mail client, unfortunately, was not set up correctly to display
Cyrillic characters, so they were substituted with diacr
encoding charset unicode language humor article
2007 Jan 31, 4:56
IETF's standard for ISO-2022-JP, which defines a character encoding that wraps other Japanese character encodings.
codepage encoding windows programming iso-2022 charset japanese ietf rfc
2007 Jan 31, 4:34
ISO 2022 defines a character encoding that wraps other character encodings.
codepage encoding windows programming iso-2022 charset japanese
2006 Dec 3, 12:28
I've updated Encode-O-Matic again. This is a tool I'm working on to convert between various Internet related encodings such as character sets, HTML encoding, URI encoding, base64, and IDN. In this update I've put it all into an installer. I'm using Nullsoft's installer generator to produce the installer. I've added a Base Conversion converter to convert between arbitrary bases and a Reverse converter that reverses the input by character, byte, or strings with arbitrary delimiters.
installer encodeomatic project charset nullsoft encoding
2005 Oct 30, 5:11
How to enter any specific Unicode character using the numeric keypad in Windows
unicode codepage charset language
2005 Jul 28, 4:37
List of character sets available for use in such things as HTML documents
reference codepage html language charset