charset - Dave's Blog


Tweet from David_Risney

2015 Apr 12, 10:39
Does 'charset=utf8' work anywhere? Or do other browsers fallback to UTF-8 just giving the appearance? @ericlaw 

Unicode Character Set Mappings

2012 Jun 11, 2:36

A leaf directory in a whole set of files that map from character set byte value to Unicode code point.  This one is a set of Microsoft character set byte mappings, but there are other vendors in there too.

PermalinkCommentstechnical unicode charset

RFC 5987 - Character Set and Language Encoding for Hypertext Transfer Protocol (HTTP) Header Field Parameters

2010 Aug 13, 11:47

Other character sets for HTTP headers: "By default, message header field parameters in Hypertext Transfer Protocol (HTTP) messages cannot carry characters outside the ISO-8859-1 character set. RFC 2231 defines an encoding mechanism for use in Multipurpose Internet Mail Extensions (MIME) headers. This document specifies an encoding suitable for use in HTTP header fields that is compatible with a profile of the encoding defined in RFC 2231."

PermalinkCommentsrfc language localization charset http technical reference http-header
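
As a rough illustration of the encoding that RFC 5987 describes, here is a minimal Python sketch that builds an extended parameter value of the form charset'language'percent-encoded-text. The function name encode_ext_value is made up for this example, and the set of characters left unescaped is my reading of the RFC's attr-char rule, so treat it as an assumption rather than a reference implementation.

```python
from urllib.parse import quote

# Characters (besides letters and digits) that may appear unescaped.
# Assumption: this mirrors the attr-char rule in RFC 5987.
ATTR_CHAR_SAFE = "!#$&+-.^_`|~"


def encode_ext_value(value: str, language: str = "") -> str:
    """Hypothetical helper: build an ext-value such as UTF-8''%E2%82%AC%20rates."""
    # Everything outside the safe set is percent-encoded as UTF-8 bytes.
    encoded = quote(value, safe=ATTR_CHAR_SAFE, encoding="utf-8")
    return "UTF-8'" + language + "'" + encoded


if __name__ == "__main__":
    # The parameter name gets a trailing "*" to mark the extended syntax.
    print("title*=" + encode_ext_value("€ rates"))
```

The language tag between the two apostrophes is optional, which is why it defaults to an empty string here.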

Code: Flickr Developer Blog » A Chinese puzzle: Unicode and EXIF metadata parsing

2010 Jan 8, 2:08

A Flickr developer discusses image metadata: its various forms, which to prefer, and how to guess at their character encodings.

PermalinkCommentsunicode charset flickr photo image exif programming reference xmp technical

[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009 Apr 23, 1:35

"This e-mail is an attempt to give a relatively concise yet reasonably complete overview of non-Unicode character sets and encodings for 'Chinese characters', excluding those which are not supported by at least one of the four browsers IE, Safari, Firefox and Opera (henceforth 'all browsers'), and tentatively avoiding technical details which are out of scope for HTML5 unless they are important to gain a general understanding of the relevant issues."

PermalinkCommentshtml html5 iso-2022 charset encoding character unicode cjk

The 'Is It UTF-8?' Quick and Dirty Test

2009 Mar 6, 5:16

I've found while debugging networking in IE that it's often useful to quickly tell if a string is encoded in UTF-8. You can check for the Byte Order Mark (EF BB BF in UTF-8), but I rarely see the BOM on UTF-8 strings. Instead I apply a quick and dirty UTF-8 test that takes advantage of the well-formed UTF-8 restrictions.

Other multibyte character encodings (see Windows supported character sets or IANA's list of character sets), such as Big5, are permissive: sticking together any two bytes is more likely than not to give a valid byte sequence. UTF-8 is more restrictive. And unlike those encodings, UTF-8 bytes may be taken out of context and one can still tell whether a byte is a single byte character, the starting byte of a three byte sequence, and so on.

The full rules for well-formed UTF-8 are a little too complicated for me to commit to memory. Instead I've got my own simpler (this is the quick part) set of rules that will be mostly correct (this is the dirty part). For as many bytes in the string as you care to examine, check the most significant hex digit of the byte:

F: This is byte 1 of a 4 byte encoded codepoint and must be followed by 3 trail bytes.
E: This is byte 1 of a 3 byte encoded codepoint and must be followed by 2 trail bytes.
C or D: This is byte 1 of a 2 byte encoded codepoint and must be followed by 1 trail byte.
8 through B: This is a trail byte.
0 through 7: This is a single byte encoded codepoint.
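
The quick rules above can be sketched in a few lines of Python. This is only an illustration, not code from the original post: the function name looks_like_utf8 is made up, and it assumes the leading hex digits are F, E, C or D, 8 through B, and 0 through 7 for the five cases, which is what the Unicode table of well-formed UTF-8 implies.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Quick and dirty check using only the high hex digit of each byte."""
    i, n = 0, len(data)
    while i < n:
        high = data[i] >> 4  # most significant hex digit
        if high <= 0x7:
            trail = 0        # single byte codepoint
        elif high in (0xC, 0xD):
            trail = 1        # byte 1 of a 2 byte sequence
        elif high == 0xE:
            trail = 2        # byte 1 of a 3 byte sequence
        elif high == 0xF:
            trail = 3        # byte 1 of a 4 byte sequence
        else:
            return False     # 8 through B: trail byte with no lead byte
        # Every expected trail byte must be present and start with 8 through B.
        for j in range(i + 1, i + 1 + trail):
            if j >= n or not (0x8 <= data[j] >> 4 <= 0xB):
                return False
        i += 1 + trail
    return True


if __name__ == "__main__":
    print(looks_like_utf8("héllo €".encode("utf-8")))  # real UTF-8
    print(looks_like_utf8("été".encode("latin-1")))    # Latin-1 bytes
    print(looks_like_utf8(b"\xc0\x80"))                # overlong form
```

The first call reports True and the second False. The third also reports True even though C0 80 is not well-formed UTF-8, which is exactly the kind of false positive the quick rules allow.
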
The simpler rules can produce false positives in some cases: that is, they'll say a string is UTF-8 when in fact it might not be. But they won't produce false negatives. The following table from the Unicode specification describes well-formed UTF-8 precisely.
Code Points          1st Byte  2nd Byte  3rd Byte  4th Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF    80..BF
U+0800..U+0FFF       E0        A0..BF    80..BF
U+1000..U+CFFF       E1..EC    80..BF    80..BF
U+D000..U+D7FF       ED        80..9F    80..BF
U+E000..U+FFFF       EE..EF    80..BF    80..BF
U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF
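
For comparison with the quick test, here is a sketch of a strict check that transcribes the table row by row. The names _RULES and is_well_formed_utf8 are made up for this example. In every row only the first and second bytes vary, so each rule stores those two ranges plus the sequence length, and any third or fourth byte is checked against 80..BF.

```python
# One entry per table row:
# (first byte low, first byte high, second byte low, second byte high, length)
_RULES = [
    (0x00, 0x7F, None, None, 1),
    (0xC2, 0xDF, 0x80, 0xBF, 2),
    (0xE0, 0xE0, 0xA0, 0xBF, 3),
    (0xE1, 0xEC, 0x80, 0xBF, 3),
    (0xED, 0xED, 0x80, 0x9F, 3),
    (0xEE, 0xEF, 0x80, 0xBF, 3),
    (0xF0, 0xF0, 0x90, 0xBF, 4),
    (0xF1, 0xF3, 0x80, 0xBF, 4),
    (0xF4, 0xF4, 0x80, 0x8F, 4),
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Strict check that follows the table of well-formed UTF-8 sequences."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        for lo, hi, s_lo, s_hi, length in _RULES:
            if lo <= lead <= hi:
                break
        else:
            return False  # 80..C1 and F5..FF never start a sequence
        if i + length > n:
            return False  # sequence is cut off at the end of the data
        if length >= 2 and not (s_lo <= data[i + 1] <= s_hi):
            return False  # second byte outside the range for this row
        for j in range(i + 2, i + length):
            if not (0x80 <= data[j] <= 0xBF):
                return False  # third and fourth bytes are always 80..BF
        i += length
    return True


if __name__ == "__main__":
    print(is_well_formed_utf8("€ and 𝄞".encode("utf-8")))
    print(is_well_formed_utf8(b"\xc0\x80"))
```

Unlike the quick rules, this rejects overlong forms such as C0 80, encoded surrogates such as ED A0 80, and values past U+10FFFF such as F4 90 80 80.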

PermalinkCommentstest technical unicode boring charset utf8 encoding


2008 Oct 1, 1:08

A weekly summary of the goings-on in the WHATWG, usually on the topic of squabbles in HTML5, especially what to do about the alt attribute in the img tag. Interesting stuff on charsets.

PermalinkCommentsdevelopment software whatwg html5 html specification feed rss user-agent w3c

Extensible Markup Language (XML) 1.1 (Second Edition)

2008 Mar 18, 11:21

End-of-line handling in XML. Spoiler: an XML processor should normalize most newline character sequences to 0xA.

PermalinkCommentsxml spec standard w3c unicode charset newline end-of-line
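
A small Python sketch of that normalization, assuming the sequences to translate are CR LF, CR NEL (0x85), a lone NEL, U+2028, and a lone CR, which is my reading of the end-of-line section of XML 1.1. The function name normalize_xml11_newlines is made up for this example.

```python
import re

# Two-character sequences come first so they collapse to one 0xA, not two.
_XML11_EOL = re.compile("\r\n|\r\x85|\x85|\u2028|\r")


def normalize_xml11_newlines(text: str) -> str:
    """Replace each recognized line-end sequence with a single 0xA."""
    return _XML11_EOL.sub("\n", text)


if __name__ == "__main__":
    print(repr(normalize_xml11_newlines("a\r\nb\rc\x85d")))
```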

RFC 2231 MIME Parameter Value and Encoded Word Extensions: Character Sets, Languages, and Continuations

2008 Mar 8, 11:44

"This memo defines extensions to the RFC 2045 media type and RFC 2183 disposition parameter value mechanisms to provide ... a means to specify parameter values in character sets other than US-ASCII..."

PermalinkCommentshttp http-header rfc standard reference ietf mime encoding charset language content-disposition

HTTP headers and non-asci characters (Content-Disposition, filename, attachment) Article

2008 Mar 8, 11:43

"I was not able to find universal settings to do this task, but it looks like Mozilla based browsers accepts utf-8 encoded headers and headers Encoded Word Extensions from RFC 2231. Internet explorer accepts utf-8 filenames only when 1. the data are URL e...

PermalinkCommentshttp http-header charset ascii utf8 mozilla ie browser content-disposition

i18n: languages, countries and character sets

2007 Nov 7, 4:28

Out-of-date W3C document containing statistics on how frequently various charsets were used in HTML pages (in 1997).

PermalinkCommentscharset encoding i18n language reference w3c statistics

Worse Than Failure - Character encoding WTF

2007 Oct 19, 4:10

FTA: 'This letter was sent to a Russian student by her French friend, who manually wrote the address that he received by e-mail. His e-mail client, unfortunately, was not set up correctly to display Cyrillic characters, so they were substituted with diacr...

PermalinkCommentsencoding charset unicode language humor article

RFC 1468 - Japanese Character Encoding for Internet Messages

2007 Jan 31, 4:56

The IETF RFC for ISO-2022-JP, a character encoding that wraps other Japanese character encodings.

PermalinkCommentscodepage encoding windows programming iso-2022 charset japanese ietf rfc

ISO/IEC 2022 - Wikipedia, the free encyclopedia

2007 Jan 31, 4:34

ISO 2022 defines a character encoding that wraps other character encodings.

PermalinkCommentscodepage encoding windows programming iso-2022 charset japanese

Encode-O-Matic Update

2006 Dec 3, 12:28

I've updated Encode-O-Matic again. This is a tool I'm working on to convert between various Internet-related encodings such as character sets, HTML encoding, URI encoding, base64, and IDN. In this update I've put it all into an installer, which I produce with Nullsoft's installer generator. I've also added a Base Conversion converter that converts between arbitrary bases and a Reverse converter that reverses the input by character, byte, or strings with arbitrary delimiters.

PermalinkCommentsinstaller encodeomatic project charset nullsoft encoding

W3C I18N Tutorial: Character sets & encodings in XHTML, HTML and CSS

2006 Apr 7, 5:24

PermalinkCommentscharset html mime programming reference tutorial utf8 unicode w3c web xml encoding language css

Unicode Han Database

2006 Feb 23, 4:57

PermalinkCommentschinese language unicode han charset

RFC 1557 (rfc1557) - Korean Character Encoding for Internet Messages

2005 Dec 16, 11:15

PermalinkCommentslanguage codepage reference rfc internet mime korean charset


2005 Oct 30, 5:11

How to enter any specific Unicode character using the numeric keypad in Windows.

PermalinkCommentsunicode codepage charset language

IANA Character Sets List

2005 Jul 28, 4:37

List of character sets available for use in such things as HTML documents.

PermalinkCommentsreference codepage html language charset
Creative Commons License. Some rights reserved.