utf-8 - Dave's Blog

Search
My timeline on Mastodon

Tweet from David_Risney

2015 Apr 12, 10:39
Does 'charset=utf8' work anywhere? Or do other browsers fallback to UTF-8 just giving the appearance? @ericlaw http://wp.me/p60i9o-r 
PermalinkComments

Indicating Character Encoding and Language for HTTP Header Field Parameters

2011 Nov 24, 7:45

From the document: ‘Appendix B. Implementation Report: The encoding defined in this document currently is used for two different HTTP header fields: “Content-Disposition”, defined in [RFC6266], and “Link”, defined in [RFC5988]. As the encoding is a profile/clarification of the one defined in [RFC2231] in 1997, many user agents already supported it for use in “Content-Disposition” when [RFC5987] got published.

Since the publication of [RFC5987], two more popular desktop user agents have added support for this encoding; see http://purl.org/
   NET/http/content-disposition-tests#encoding-2231-char for details. At this time, only one major desktop user agent (Safari) does not support it.

Note that the implementation in Internet Explorer 9 does not support the ISO-8859-1 encoding; this document revision acknowledges that UTF-8 is sufficient for expressing all code points, and removes the requirement to support ISO-8859-1.’

Yay for UTF-8!

PermalinkCommentstechnical http http-headers ie9 internationalization utf-8 encoding

Infrared Paint Link Roundup

2009 May 29, 2:50

I like the idea of QR codes, encoding URLs and placing them on real world objects, but the QR codes themselves are kind of ugly. To make them less obvious I thought I could spray QR codes on to an object with an infrared reflective paint and shine infrared light on the QR codes, since most cameras, for instance the camera in my G1 phone, pick up infrared that our eyes do not.

In my search for infrared paint I've found a seller of IR ink (via programming forum) and an Infrared Paint Recipe (via IR FAQ).

In looking for this paint I've found that it comes up a lot in relation to the military for things like paint markers that are visible at night with proper equipment, and paint that absorbs IR light to make vehicles less obvious to night vision goggles. Even though the first reflects infrared light and the second absorbs it websites end up refering to both as infrared paint which made it difficult to search.

Additionally I found links to some other geeky infrared projects:

PermalinkCommentsir paint technical ir infrared qr qr code

URLs are tough - Anne's Weblog

2009 Apr 7, 1:30I really dislike how IE deals with non-US-ASCII in URLs. I should write up a post on what exactly IE does with non-US-ASCII characters in URLs. "Just like IRIs the URL is mapped to a URI using UTF-8. Except for the query component of the URL (the bit after the question mark). Here for legacy reasons the encoding of the document is used instead. Except if the encoding of the document is UTF-16, in which case UTF-8 is used. Effectively, using non-ASCII characters in URLs in documents not encoded as UTF-8 or UTF-16 will give you surprising results, to say the least. Yay for browsers!"PermalinkCommentshttp encoding html5 url uri unicode iri

The 'Is It UTF-8?' Quick and Dirty Test

2009 Mar 6, 5:16

I've found while debugging networking in IE its often useful to quickly tell if a string is encoded in UTF-8. You can check for the Byte Order Mark (EF BB BF in UTF-8) but, I rarely see the BOM on UTF-8 strings. Instead I apply a quick and dirty UTF-8 test that takes advantage of the well-formed UTF-8 restrictions.

Unlike other multibyte character encoding forms (see Windows supported character sets or IANA's list of character sets), for example Big5, where sticking together any two bytes is more likely than not to give a valid byte sequence, UTF-8 is more restrictive. And unlike other multibyte character encodings, UTF-8 bytes may be taken out of context and one can still know that its a single byte character, the starting byte of a three byte sequence, etc.

The full rules for well-formed UTF-8 are a little too complicated for me to commit to memory. Instead I've got my own simpler (this is the quick part) set of rules that will be mostly correct (this is the dirty part). For as many bytes in the string as you care to examine, check the most significant digit of the byte:

F:
This is byte 1 of a 4 byte encoded codepoint and must be followed by 3 trail bytes.
E:
This is byte 1 of a 3 byte encoded codepoint and must be followed by 2 trail bytes.
C..D:
This is byte 1 of a 2 byte encoded codepoint and must be followed by 1 trail byte.
8..B:
This is a trail byte.
0..7:
This is a single byte encoded codepoint.
The simpler rules can produce false positives in some cases: that is, they'll say a string is UTF-8 when in fact it might not be. But it won't produce false negatives. The following is table from the Unicode spec. that actually describes well-formed UTF-8.
Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
U+0000..U+007F 00..7F
U+0080..U+07FF C2..DF 80..BF
U+0800..U+0FFF E0 A0..BF 80..BF
U+1000..U+CFFF E1..EC 80..BF 80..BF
U+D000..U+D7FF ED 80..9F 80..BF
U+E000..U+FFFF EE..EF 80..BF 80..BF
U+10000..U+3FFFF F0 90..BF 80..BF 80..BF
U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
U+100000..U+10FFFF F4 80..8F 80..BF 80..BF

PermalinkCommentstest technical unicode boring charset utf8 encoding

HTTP headers and non-asci characters (Content-Disposition, filename, attachment) Article

2008 Mar 8, 11:43"I was not able to find universal settings to do this task, but it looks like Mozilla based browsers accepts utf-8 encoded headers and headers Encoded Word Extensions from RFC 2231. Internet explorer accepts utf-8 filenames only when 1. the data are URL ePermalinkCommentshttp http-header charset ascii utf8 mozilla ie browser content-disposition

Excercise Bike and Tacoma Screw Products

2008 Jan 13, 11:07

Sarah and I got an exercise bike on sale and when attempting to put it together found that it was missing a bag of about ten different screws. The manufacturer website said we could order a replacement bag for thirty dollars (!!) but since the instructions listed the various kinds of screws we needed I figured we could just go to a hardware store and buy them.

We started at Home Depot because I didn't know better. The screws are all listed in metric sizes which is apparently uncommon and a helpful senior worker forwarded us to McLendons whose stock was better but we were again redirected this time to Tacoma Screw Products.

Tacoma Screw Products is great! See them for your hardware needs first! The store has a back area with every kind of screw ever. I felt a little out of place as as all the customers looked like contractors. The employee who helped me explained the various options I had in screws as the bike instructions weren't as explicit as they could have been. In the end I bought all my screws for only one dollar (much better than $30!) and they all fit correctly.

PermalinkCommentsscrew bike personal tacoma screw products nontechnical

Second Life Translator

2007 Jul 4, 10:58Hackdiary
I really enjoy reading Matt Biddulph's blog hackdiary. An entry some time ago talked about his Second Life flickr screen which is a screen in Second Life that displays images from flickr.com based on viewers suggested tags. I'm a novice to the Second Life scripting API and so it was from this blog post I became aware of the llHTTPRequest. This is like the XMLHttpRequest for Second Life code in that it lets you make HTTP requests. I decided that I too could do something cool with this.

Translator
I decided to make a translator object that a Second Life user would wear that would translate anything said near them. The details aren't too surprising: The translator object keeps an owner modifiable list of translation instructions each consisting of who to listen to, the language they speak, who to tell the translation to, and into what language to translate. When the translator hears someone, it runs through its list of translation instructions and when it finds a match for the speaker uses the llHTTPRequest to send off what was said to Google translate. When the result comes back the translator simply says the response.

Issues
Unfortunately, the llHTTPRequest limits the response size to 2K and no translation site I can find has the translated text in the first 2K. There's a flag HTTP_BODY_MAXLENGTH provided but it defaults to 2K and you can't change its value. So I decided to setup a PHP script on my site to act as a translating proxy and parse the translated text out of the HTML response from Google translate. Through experimentation I found that their site can take parameters text and langpair queries in the query like so: http://translate.google.com/translate_t?text=car%20moi%20m%C3%AAme%20j%27en%20rit&langpair=fr|en. On the topic of non US-ASCII characters (which is important for a translator) I found that llHTTPRequest encodes non US-ASCII characters as percent-encoded UTF-8 when constructing the request URI. However, when Google translate takes parameters off the URI it only seems to interpret it as percent-encoded UTF-8 when the user-agent is IE's. So after changing my PHP script to use IE7's user-agent non US-ASCII character input worked.

In Use
Actually using it in practice is rather difficult. Between typos, slang, abbreviations, and the current state of the free online translators its very difficult to carry on a conversation. Additionally, I don't really like talking to random people on Second Life anyway. So... not too useful.PermalinkCommentspersonal translate second-life technical translator sl code google php llhttprequest

History of UTF-8

2007 Mar 30, 1:27The origins of the UTF-8 Unicode encoding.PermalinkCommentshistory encoding unicode ken-thompson utf8

ishida >> utilities: Unicode code converter

2007 Feb 14, 3:09Richard Ishida's website that converts between encodings of UTF-8, UTF-16, and HTML NCRs.PermalinkCommentsunicode conversion html encoding javascript software tool utf8 richard-ishida

FAQ - UTF-8, UTF-16, UTF-32 & BOM

2006 Jun 21, 12:22Unicode Byte Order Mark FAQPermalinkCommentsunicode encoding bom language

Unicode Encoding Forms

2005 Aug 4, 11:46Specification for encoding forms of Unicode (UTF-8, UTF-16, UTF-32...)PermalinkCommentsunicode utf8 specification reference language

RFC 3629 (rfc3629) - UTF-8, a transformation format of ISO 10646

2005 Mar 28, 10:31PermalinkCommentsrfc reference development unicode
Older Entries Creative Commons License Some rights reserved.