utf8 - Dave's Blog

Search
My timeline on Mastodon

Tweet from David_Risney

2015 Apr 12, 10:39
Does 'charset=utf8' work anywhere? Or do other browsers fallback to UTF-8 just giving the appearance? @ericlaw http://wp.me/p60i9o-r 
PermalinkComments

URI functions in Windows Store Applications

2013 Jul 25, 1:00

Summary

The Modern SDK contains some URI related functionality as do libraries available in particular projection languages. Unfortunately, collectively these APIs do not cover all scenarios in all languages. Specifically, JavaScript and C++ have no URI building APIs, and C++ additionally has no percent-encoding/decoding APIs.
WinRT (JS and C++)
JS Only
C++ Only
.NET Only
Parse
 
Build
Normalize
Equality
 
 
Relative resolution
Encode data for including in URI property
Decode data extracted from URI property
Build Query
Parse Query
The Windows.Foudnation.Uri type is not projected into .NET modern applications. Instead those applications use System.Uri and the platform ensures that it is correctly converted back and forth between Windows.Foundation.Uri as appropriate. Accordingly the column marked WinRT above is applicable to JS and C++ modern applications but not .NET modern applications. The only entries above applicable to .NET are the .NET Only column and the WwwFormUrlDecoder in the bottom left which is available to .NET.

Scenarios

Parse

This functionality is provided by the WinRT API Windows.Foundation.Uri in C++ and JS, and by System.Uri in .NET.
Parsing a URI pulls it apart into its basic components without decoding or otherwise modifying the contents.
var uri = new Windows.Foundation.Uri("http://example.com/path%20segment1/path%20segment2?key1=value1&key2=value2");
console.log(uri.path);// /path%20segment1/path%20segment2

WsDecodeUrl (C++)

WsDecodeUrl is not suitable for general purpose URI parsing.  Use Windows.Foundation.Uri instead.

Build (C#)

URI building is only available in C# via System.UriBuilder.
URI building is the inverse of URI parsing: URI building allows the developer to specify the value of basic components of a URI and the API assembles them into a URI. 
To work around the lack of a URI building API developers will likely concatenate strings to form their URIs.  This can lead to injection bugs if they don’t validate or encode their input properly, but if based on trusted or known input is unlikely to have issues.
            Uri originalUri = new Uri("http://example.com/path1/?query");
            UriBuilder uriBuilder = new UriBuilder(originalUri);
            uriBuilder.Path = "/path2/";
            Uri newUri = uriBuilder.Uri; // http://example.com/path2/?query

WsEncodeUrl (C++)

WsEncodeUrl, in addition to building a URI from components also does some encoding.  It encodes non-US-ASCII characters as UTF8, the percent, and a subset of gen-delims based on the URI property: all :/?#[]@ are percent-encoded except :/@ in the path and :/?@ in query and fragment.
Accordingly, WsEncodeUrl is not suitable for general purpose URI building.  It is acceptable to use in the following cases:
- You’re building a URI out of non-encoded URI properties and don’t care about the difference between encoded and decoded characters.  For instance you’re the only one consuming the URI and you uniformly decode URI properties when consuming – for instance using WsDecodeUrl to consume the URI.
- You’re building a URI with URI properties that don’t contain any of the characters that WsEncodeUrl encodes.

Normalize

This functionality is provided by the WinRT API Windows.Foundation.Uri in C++ and JS and by System.Uri in .NET.  Normalization is applied during construction of the Uri object.
URI normalization is the application of URI normalization rules (including DNS normalization, IDN normalization, percent-encoding normalization, etc.) to the input URI.
        var normalizedUri = new Windows.Foundation.Uri("HTTP://EXAMPLE.COM/p%61th foo/");
        console.log(normalizedUri.absoluteUri); // http://example.com/path%20foo/
This is modulo Win8 812823 in which the Windows.Foundation.Uri.AbsoluteUri property returns a normalized IRI not a normalized URI.  This bug does not affect System.Uri.AbsoluteUri which returns a normalized URI.

Equality

This functionality is provided by the WinRT API Windows.Foundation.Uri in C++ and JS and by System.Uri in .NET. 
URI equality determines if two URIs are equal or not necessarily equal.
            var uri1 = new Windows.Foundation.Uri("HTTP://EXAMPLE.COM/p%61th foo/"),
                uri2 = new Windows.Foundation.Uri("http://example.com/path%20foo/");
            console.log(uri1.equals(uri2)); // true

Relative resolution

This functionality is provided by the WinRT API Windows.Foundation.Uri in C++ and JS and by System.Uri in .NET 
Relative resolution is a function that given an absolute URI A and a relative URI B, produces a new absolute URI C.  C is the combination of A and B in which the basic components specified in B override or combine with those in A under rules specified in RFC 3986.
        var baseUri = new Windows.Foundation.Uri("http://example.com/index.html"),
            relativeUri = "/path?query#fragment",
            absoluteUri = baseUri.combineUri(relativeUri);
        console.log(baseUri.absoluteUri);       // http://example.com/index.html
        console.log(absoluteUri.absoluteUri);   // http://example.com/path?query#fragment

Encode data for including in URI property

This functionality is available in JavaScript via encodeURIComponent and in C# via System.Uri.EscapeDataString. Although the two methods mentioned above will suffice for this purpose, they do not perform exactly the same operation.
Additionally we now have Windows.Foundation.Uri.EscapeComponent in WinRT, which is available in JavaScript and C++ (not C# since it doesn’t have access to Windows.Foundation.Uri).  This is also slightly different from the previously mentioned mechanisms but works best for this purpose.
Encoding data for inclusion in a URI property is necessary when constructing a URI from data.  In all the above cases the developer is dealing with a URI or substrings of a URI and so the strings are all encoded as appropriate. For instance, in the parsing example the path contains “path%20segment1” and not “path segment1”.  To construct a URI one must first construct the basic components of the URI which involves encoding the data.  For example, if one wanted to include “path segment / example” in the path of a URI, one must percent-encode the ‘ ‘ since it is not allowed in a URI, as well as the ‘/’ since although it is allowed, it is a delimiter and won’t be interpreted as data unless encoded.
If a developer does not have this API provided they can write it themselves.  Percent-encoding methods appear simple to write, but the difficult part is getting the set of characters to encode correct, as well as handling non-US-ASCII characters.
        var uri = new Windows.Foundation.Uri("http://example.com" +
            "/" + Windows.Foundation.Uri.escapeComponent("path segment / example") +
            "?key=" + Windows.Foundation.Uri.escapeComponent("=&?#"));
        console.log(uri.absoluteUri); // http://example.com/path%20segment%20%2F%20example?key=%3D%26%3F%23

WsEncodeUrl (C++)

In addition to building a URI from components, WsEncodeUrl also percent-encodes some characters.  However the API is not recommend for this scenario given the particular set of characters that are encoded and the convoluted nature in which a developer would have to use this API in order to use it for this purpose.
There are no general purpose scenarios for which the characters WsEncodeUrl encodes make sense: encode the %, encode a subset of gen-delims but not also encode the sub-delims.  For instance this could not replace encodeURIComponent in a C++ version of the following code snippet since if ‘value’ contained ‘&’ or ‘=’ (both sub-delims) they wouldn’t be encoded and would be confused for delimiters in the name value pairs in the query:
"http://example.com/?key=" + Windows.Foundation.Uri.escapeComponent(value)
Since WsEncodeUrl produces a string URI, to obtain the property they want to encode they’d need to parse the resulting URI.  WsDecodeUrl won’t work because it decodes the property but Windows.Foundation.Uri doesn’t decode.  Accordingly the developer could run their string through WsEncodeUrl then Windows.Foundation.Uri to extract the property.

Decode data extracted from URI property

This functionality is available in JavaScript via decodeURIComponent and in C# via System.Uri.UnescapeDataString. Although the two methods mentioned above will suffice for this purpose, they do not perform exactly the same operation.
Additionally we now also have Windows.Foundation.Uri.UnescapeComponent in WinRT, which is available in JavaScript and C++ (not C# since it doesn’t have access to Windows.Foundation.Uri).  This is also slightly different from the previously mentioned mechanisms but works best for this purpose.
Decoding is necessary when extracting data from a parsed URI property.  For example, if a URI query contains a series of name and value pairs delimited by ‘=’ between names and values, and by ‘&’ between pairs, one must first parse the query into name and value entries and then decode the values.  It is necessary to make this an extra step separate from parsing the URI property so that sub-delimiters (in this case ‘&’ and ‘=’) that are encoded will be interpreted as data, and those that are decoded will be interpreted as delimiters.
If a developer does not have this API provided they can write it themselves.  Percent-decoding methods appear simple to write, but have some tricky parts including correctly handling non-US-ASCII, and remembering not to decode .
In the following example, note that if unescapeComponent were called first, the encoded ‘&’ and ‘=’ would be decoded and interfere with the parsing of the name value pairs in the query.
            var uri = new Windows.Foundation.Uri("http://example.com/?foo=bar&array=%5B%27%E3%84%93%27%2C%27%26%27%2C%27%3D%27%2C%27%23%27%5D");
            uri.query.substr(1).split("&").forEach(
                function (keyValueString) {
                    var keyValue = keyValueString.split("=");
                    console.log(Windows.Foundation.Uri.unescapeComponent(keyValue[0]) + ": " + Windows.Foundation.Uri.unescapeComponent(keyValue[1]));
                    // foo: bar
                    // array: ['','&','=','#']
                });

WsDecodeUrl (C++)

Since WsDecodeUrl decodes all percent-encoded octets it could be used for general purpose percent-decoding but it takes a URI so would require the dev to construct a stub URI around the string they want to decode.  For example they could prefix “http:///#” to their string, run it through WsDecodeUrl and then extract the fragment property.  It is convoluted but will work correctly.

Parse Query

The query of a URI is often encoded as application/x-www-form-urlencoded which is percent-encoded name value pairs delimited by ‘&’ between pairs and ‘=’ between corresponding names and values.
In WinRT we have a class to parse this form of encoding using Windows.Foundation.WwwFormUrlDecoder.  The queryParsed property on the Windows.Foundation.Uri class is of this type and created with the query of its Uri:
    var uri = Windows.Foundation.Uri("http://example.com/?foo=bar&array=%5B%27%E3%84%93%27%2C%27%26%27%2C%27%3D%27%2C%27%23%27%5D");
    uri.queryParsed.forEach(
        function (pair) {
            console.log("name: " + pair.name + ", value: " + pair.value);
            // name: foo, value: bar
            // name: array, value: ['','&','=','#']
        });
    console.log(uri.queryParsed.getFirstValueByName("array")); // ['','&','=','#']
The QueryParsed property is only on Windows.Foundation.Uri and not System.Uri and accordingly is not available in .NET.  However the Windows.Foundation.WwwFormUrlDecoder class is available in C# and can be used manually:
            Uri uri = new Uri("http://example.com/?foo=bar&array=%5B%27%E3%84%93%27%2C%27%26%27%2C%27%3D%27%2C%27%23%27%5D");
            WwwFormUrlDecoder decoder = new WwwFormUrlDecoder(uri.Query);
            foreach (IWwwFormUrlDecoderEntry entry in decoder)
            {
                System.Diagnostics.Debug.WriteLine("name: " + entry.Name + ", value: " + entry.Value);
                // name: foo, value: bar
                // name: array, value: ['','&','=','#']
            }
 

Build Query

To build a query of name value pairs encoded as application/x-www-form-urlencoded there is no WinRT API to do this directly.  Instead a developer must do this manually making use of the code described in “Encode data for including in URI property”.
In terms of public releases, this property is only in the RC and later builds.
For example in JavaScript a developer may write:
            var uri = new Windows.Foundation.Uri("http://example.com/"),
                query = "?" + Windows.Foundation.Uri.escapeComponent("array") + "=" + Windows.Foundation.Uri.escapeComponent("['','&','=','#']");
 
            console.log(uri.combine(new Windows.Foundation.Uri(query)).absoluteUri); // http://example.com/?array=%5B'%E3%84%93'%2C'%26'%2C'%3D'%2C'%23'%5D
 
PermalinkCommentsc# c++ javascript technical uri windows windows-runtime windows-store

Encode-O-Matic Update: Compression, Hex View, Quick Show Output

2010 Mar 9, 9:08

I've just put up an update for Encode-O-Matic with the following improvements:

PermalinkCommentstechnical encodeomatic project

Official Google Blog: Unicode nearing 50% of the web

2010 Jan 28, 4:20Graph of encodings used by documents on the web. Unicode based encodings are thankfully on the rise.PermalinkCommentsunicode encoding web internationalization localization utf8 text html technical

The 'Is It UTF-8?' Quick and Dirty Test

2009 Mar 6, 5:16

I've found while debugging networking in IE its often useful to quickly tell if a string is encoded in UTF-8. You can check for the Byte Order Mark (EF BB BF in UTF-8) but, I rarely see the BOM on UTF-8 strings. Instead I apply a quick and dirty UTF-8 test that takes advantage of the well-formed UTF-8 restrictions.

Unlike other multibyte character encoding forms (see Windows supported character sets or IANA's list of character sets), for example Big5, where sticking together any two bytes is more likely than not to give a valid byte sequence, UTF-8 is more restrictive. And unlike other multibyte character encodings, UTF-8 bytes may be taken out of context and one can still know that its a single byte character, the starting byte of a three byte sequence, etc.

The full rules for well-formed UTF-8 are a little too complicated for me to commit to memory. Instead I've got my own simpler (this is the quick part) set of rules that will be mostly correct (this is the dirty part). For as many bytes in the string as you care to examine, check the most significant digit of the byte:

F:
This is byte 1 of a 4 byte encoded codepoint and must be followed by 3 trail bytes.
E:
This is byte 1 of a 3 byte encoded codepoint and must be followed by 2 trail bytes.
C..D:
This is byte 1 of a 2 byte encoded codepoint and must be followed by 1 trail byte.
8..B:
This is a trail byte.
0..7:
This is a single byte encoded codepoint.
The simpler rules can produce false positives in some cases: that is, they'll say a string is UTF-8 when in fact it might not be. But it won't produce false negatives. The following is table from the Unicode spec. that actually describes well-formed UTF-8.
Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
U+0000..U+007F 00..7F
U+0080..U+07FF C2..DF 80..BF
U+0800..U+0FFF E0 A0..BF 80..BF
U+1000..U+CFFF E1..EC 80..BF 80..BF
U+D000..U+D7FF ED 80..9F 80..BF
U+E000..U+FFFF EE..EF 80..BF 80..BF
U+10000..U+3FFFF F0 90..BF 80..BF 80..BF
U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
U+100000..U+10FFFF F4 80..8F 80..BF 80..BF

PermalinkCommentstest technical unicode boring charset utf8 encoding

G1 Android Phone

2008 Nov 9, 11:29

T-Mobile G1 Wallpapers by romainguy
I finally replaced my old regular cell-phone which was literally being held together by a rubber band with a fancy new G1, my first Internet accessible phone.

I had to call the T-Mobile support line to get data added to my plan and the person helping me was disconcertingly friendly. She asked about my weekend plans and so I felt compelled to ask her the same. Her plans involved replacing her video card so she could get back to World of Warcraft and do I enjoy computer gaming? I couldn't tell if she was genuine or if she was signing me up for magazines.

I was with Sarah in her new car, trying out the phone's GPS functionality via Google Maps while she drove. I switched to Street View and happened to find my car. It was a weird feeling, kind of like those Google conspiracy videos.

The phone runs Google's open source OS and I really enjoy the application API. Its all in Java and URIs and mime-types are sort of basics. Rather than invoking the builtin item picker control directly you invoke an 'intent' specifying the URI of your list of items, a mime-type describing the type of items in the list, and an action 'PICK' and whatever is registered as the picker on the system pops up and lets the user pick from that list. The same goes if you want to 'EDIT' an image, or 'VIEW' an mp3.

I wanted to replace the Google search box gadget that appears on the home screen with my own search box widget that uses OpenSearch descriptors but apparently in the current API you can't make home screen gadgets without changing parts of the OS. My other desired application is something to replace this GPS photo tracker device by recording my location to a file and an additional program on my computer to apply those locations to photos.

PermalinkCommentstmobile personal api phone technical g1 android google

Finished Paper Mario Games

2008 May 12, 4:05
Super Paper MarioPaper Mario: The Thousand-Year DoorPaper Mario Title Screen

Sarah and I have finished playing through the games "Paper Mario", "Paper Mario: The Thousand-Year Door", and "Super Paper Mario" last week (including the various Pits of 100 Trials). We played them all on the Wii, because even though Super Paper Mario was the only one released explicitly for that platform, Wii maintains compatibility with Game Cube games such as Thousand-Year Door and Paper Mario although originally released for the Nintendo 64 is now available as a pay for download game on the Wii's Virtual Console. So, yay for Nintendo!

I think my favorite of the three is Thousand-Year Door mostly because of the RPG attack system. In Thousand-Year Door and Paper Mario when you come into contact with an enemy you go into an RPG style attack system where you take turns selecting actions. In Super Paper Mario you still have hit points and such, but you don't go into a turn based RPG style attack system, rather you do the regular Mario jumping on bad guys thing (or hitting them with a mallet etc...). Thousand-Year Door and Paper Mario are very similar in terms of game play but Thousand-Year Door looks very pretty and has made improvements to how your party-mates are handled in battle (they have HP and can fall as you would expect) and there's an audience that cheers you on during your battles.

Even if the gameplay sucked the humor throughout the series might be tempting enough. Mario's clothing and mustache are mocked throughout and standard RPG expectations are subverted. I hate to describe any of these moments for fear of ruining anything but, for instance, an optional and very difficult enemy who may only be killed after hours of work only results in one experience point, or a very intimidating enemy who you imagine you'll have to fight actually challenges you to a quiz.

Despite how I personally rank them, all the games are great and I'd recommend any of them.

PermalinkCommentsmario videogame paper mario nontechnical

Official Google Blog: Moving to Unicode 5.1

2008 May 7, 4:24Woo Unicode! "For the first time, we found that Unicode was the most frequent encoding found on web pages, overtaking both ASCII and Western European encodings"PermalinkCommentsgoogle encoding i18n utf8 unicode ascii web

HTTP headers and non-asci characters (Content-Disposition, filename, attachment) Article

2008 Mar 8, 11:43"I was not able to find universal settings to do this task, but it looks like Mozilla based browsers accepts utf-8 encoded headers and headers Encoded Word Extensions from RFC 2231. Internet explorer accepts utf-8 filenames only when 1. the data are URL ePermalinkCommentshttp http-header charset ascii utf8 mozilla ie browser content-disposition

Juanita Beach Visit and Map

2008 Mar 7, 3:26

Don't Feed the Ducks SignTwo weekends ago it was actually sunny and kind of warm so Sarah and I went down to Spud Fish and Chips and Juanita Beach Park. We ate fish and chips on the dock. I took a few pictures and this time actually put some geographical information on Flickr so now I've got a map of my tiny fish and chips journey. On the map click on the floating marks to view the associated photos.

Flickr provides access to the geo data associated with your photos via GeoRSS feeds. And Google Maps displays GeoRSS feed content on their maps allowing you even to edit the data but doesn't appear to let you easily export the GeoRSS. Live Maps does the inverse, allowing you to create and export GeoRSS data but not import it. I'd like both please. Oh well.

PermalinkCommentsmap photo personal fish-and-chips juanita-beach

Non-ASCII characters in URI attribute values

2007 Aug 23, 6:17From the HTML standard, what to do if you find a URI with a non-US-ASCII character.PermalinkCommentsuri standard reference unicode utf8 encoding ascii iri

HTML 4.01 - Encoding URIs with international characters

2007 Jul 11, 10:36How to encode URIs that contain international characters as specified by the HTML spec.PermalinkCommentshtml uri encoding iri unicode utf8

History of UTF-8

2007 Mar 30, 1:27The origins of the UTF-8 Unicode encoding.PermalinkCommentshistory encoding unicode ken-thompson utf8

UniView 5.0 (en) (© Richard Ishida)

2007 Feb 14, 3:10Richard Ishida's tool to display blocks of Unicode characters.PermalinkCommentsunicode tools i18n tool utf8 uniview encoding richard-ishida

ishida >> utilities: Unicode code converter

2007 Feb 14, 3:09Richard Ishida's website that converts between encodings of UTF-8, UTF-16, and HTML NCRs.PermalinkCommentsunicode conversion html encoding javascript software tool utf8 richard-ishida

W3C I18N Tutorial: Character sets & encodings in XHTML, HTML and CSS

2006 Apr 7, 5:24PermalinkCommentscharset html mime programming reference tutorial utf8 unicode w3c web xml encoding language css

RFC 3987 (rfc3987) - Internationalized Resource Identifiers (IRIs)

2006 Jan 25, 5:38PermalinkCommentsrfc iri uri unicode utf8 reference internet

Unicode Character Search

2005 Oct 25, 3:41Tool that displays information on Unicode Code Points.PermalinkCommentsdevelopment reference search tools unicode utf8 web language

Unicode Encoding Forms

2005 Aug 4, 11:46Specification for encoding forms of Unicode (UTF-8, UTF-16, UTF-32...)PermalinkCommentsunicode utf8 specification reference language

Convert Unicode to UTF8

2005 Jul 28, 11:20PermalinkCommentsunicode utf8 tools language web
Older Entries Creative Commons License Some rights reserved.