us-ascii - Dave's Blog

Search
My timeline on Mastodon

URI functions in Windows Store Applications

2013 Jul 25, 1:00PermalinkCommentsc# c++ javascript technical uri windows windows-runtime windows-store

URLs are tough - Anne's Weblog

2009 Apr 7, 1:30I really dislike how IE deals with non-US-ASCII in URLs. I should write up a post on what exactly IE does with non-US-ASCII characters in URLs. "Just like IRIs the URL is mapped to a URI using UTF-8. Except for the query component of the URL (the bit after the question mark). Here for legacy reasons the encoding of the document is used instead. Except if the encoding of the document is UTF-16, in which case UTF-8 is used. Effectively, using non-ASCII characters in URLs in documents not encoded as UTF-8 or UTF-16 will give you surprising results, to say the least. Yay for browsers!"PermalinkCommentshttp encoding html5 url uri unicode iri

Subst Allows Non-Letter Drive Letters

2009 Mar 4, 2:39

I knew that the command line tool subst would create virtual drives that map to existing directories but I didn't know that subst lets you name the virtual drives with characters that aren't US-ASCII letters. For instance you can run 'subst 4: C:\windows' and then 'more 4:\win.ini' to dump C:\windows\win.ini. This also works for non-US-ASCII characters like, "C" (aka U+FF23, Fullwidth Latin Capital Letter C), which when displayed by cmd.exe via some best fit style character conversions looks just like the regular US-ASCII 'C'. None of Explorer, IE, or the common file dialogs allow the use of these odd virtual drives -- just cmd.exe, so I'm not sure how this would ever be useful but I thought it was odd and I wanted to share.

PermalinkCommentscli technical boring subst windows

URI Fragment Info Roundup

2008 Apr 21, 11:53

['Neverending story' by Alexandre Duret-Lutz. A framed photo of books with the droste effect applied. Licensed under creative commons.]Information about URI Fragments, the portion of URIs that follow the '#' at the end and that are used to navigate within a document, is scattered throughout various documents which I usually have to hunt down. Instead I'll link to them all here.

Definitions. Fragments are defined in the URI RFC which states that they're used to identify a secondary resource that is related to the primary resource identified by the URI as a subset of the primary, a view of the primary, or some other resource described by the primary. The interpretation of a fragment is based on the mime type of the primary resource. Tim Berners-Lee notes that determining fragment meaning from mime type is a problem because a single URI may contain a single fragment, however over HTTP a single URI can result in the same logical resource represented in different mime types. So there's one fragment but multiple mime types and so multiple interpretations of the one fragment. The URI RFC says that if an author has a single resource available in multiple mime types then the author must ensure that the various representations of a single resource must all resolve fragments to the same logical secondary resource. Depending on which mime types you're dealing with this is either not easy or not possible.

HTTP. In HTTP when URIs are used, the fragment is not included. The General Syntax section of the HTTP standard says it uses the definitions of 'URI-reference' (which includes the fragment), 'absoluteURI', and 'relativeURI' (which don't include the fragment) from the URI RFC. However, the 'URI-reference' term doesn't actually appear in the BNF for the protocol. Accordingly the headers like 'Request-URI', 'Content-Location', 'Location', and 'Referer' which include URIs are defined with 'absoluteURI' or 'relativeURI' and don't include the fragment. This is in keeping with the original fragment definition which says that the fragment is used as a view of the original resource and consequently only needed for resolution on the client. Additionally, the URI RFC explicitly notes that not including the fragment is a privacy feature such that page authors won't be able to stop clients from viewing whatever fragments the client chooses. This seems like an odd claim given that if the author wanted to selectively restrict access to portions of documents there are other options for them like breaking out the parts of a single resource to which the author wishes to restrict access into separate resources.

HTML. In HTML, the HTML mime type RFC defines HTML's fragment use which consists of fragments referring to elements with a corresponding 'id' attribute or one of a particular set of elements with a corresponding 'name' attribute. The HTML spec discusses fragment use additionally noting that the names and ids must be unique in the document and that they must consist of only US-ASCII characters. The ID and NAME attributes are further restricted in section 6 to only consist of alphanumerics, the hyphen, period, colon, and underscore. This is a subset of the characters allowed in the URI fragment so no encoding is discussed since technically its not needed. However, practically speaking, browsers like FireFox and Internet Explorer allow for names and ids containing characters outside of the defined set including characters that must be percent-encoded to appear in a URI fragment. The interpretation of percent-encoded characters in fragments for HTML documents is not consistent across browsers (or in some cases within the same browser) especially for the percent-encoded percent.

Text. Text/plain recently got a fragment definition that allows fragments to refer to particular lines or characters within a text document. The scheme no longer includes regular expressions, which disappointed me at first, but in retrospect is probably good idea for increasing the adoption of this fragment scheme and for avoiding the potential for ubiquitous DoS via regex. One of the authors also notes this on his blog. I look forward to the day when this scheme is widely implemented.

XML. XML has the XPointer framework to define its fragment structure as noted by the XML mime type definition. XPointer consists of a general scheme that contains subschemes that identify a subset of an XML document. Its too bad such a thing wasn't adopted for URI fragments in general to solve the problem of a single resource with multiple mime type representations. I wrote more about XPointer when I worked on hacking XPointer into IE.

SVG and MPEG. Through the Media Fragments Working Group I found a couple more fragment scheme definitions. SVG's fragment scheme is defined in the SVG documentation and looks similar to XML's. MPEG has one defined but I could only find it as an ISO document "Text of ISO/IEC FCD 21000-17 MPEG-12 FID" and not as an RFC which is a little disturbing.

AJAX. AJAX websites have used fragments as an escape hatch for two issues that I've seen. The first is getting a unique URL for versions of a page that are produced on the client by script. The fragment may be changed by script without forcing the page to reload. This goes outside the rules of the standards by using HTML fragments in a fashion not called out by the HTML spec. but it does seem to be inline with the spirit of the fragment in that it is a subview of the original resource and interpretted client side. The other hack-ier use of the fragment in AJAX is for cross domain communication. The basic idea is that different frames or windows may not communicate in normal fashions if they have different domains but they can view each other's URLs and accordingly can change their own fragments in order to send a message out to those who know where to look. IMO this is not inline with the spirit of the fragment but is rather a cool hack.

PermalinkCommentsxml text ajax technical url boring uri fragment rfc

RFC 2231 MIME Parameter Value and Encoded Word Extensions: Character Sets, Languages, and Continuations

2008 Mar 8, 11:44"This memo defines extensions to the RFC 2045 media type and RFC 2183 disposition parameter value mechanisms to provide ... a means to specify parameter values in character sets other than US-ASCII..."PermalinkCommentshttp http-header rfc standard reference ietf mime encoding charset language content-disposition

ICANN | On Its Way: One of the Biggest Changes to the Internet

2007 Oct 11, 12:11ICANN plans to support non-US-ASCII top level domain names. I wonder how broken web browser's security measures are about to become.PermalinkCommentsidn dns domain internet uri icann news tld

Non-ASCII characters in URI attribute values

2007 Aug 23, 6:17From the HTML standard, what to do if you find a URI with a non-US-ASCII character.PermalinkCommentsuri standard reference unicode utf8 encoding ascii iri

Second Life Translator

2007 Jul 4, 10:58Hackdiary
I really enjoy reading Matt Biddulph's blog hackdiary. An entry some time ago talked about his Second Life flickr screen which is a screen in Second Life that displays images from flickr.com based on viewers suggested tags. I'm a novice to the Second Life scripting API and so it was from this blog post I became aware of the llHTTPRequest. This is like the XMLHttpRequest for Second Life code in that it lets you make HTTP requests. I decided that I too could do something cool with this.

Translator
I decided to make a translator object that a Second Life user would wear that would translate anything said near them. The details aren't too surprising: The translator object keeps an owner modifiable list of translation instructions each consisting of who to listen to, the language they speak, who to tell the translation to, and into what language to translate. When the translator hears someone, it runs through its list of translation instructions and when it finds a match for the speaker uses the llHTTPRequest to send off what was said to Google translate. When the result comes back the translator simply says the response.

Issues
Unfortunately, the llHTTPRequest limits the response size to 2K and no translation site I can find has the translated text in the first 2K. There's a flag HTTP_BODY_MAXLENGTH provided but it defaults to 2K and you can't change its value. So I decided to setup a PHP script on my site to act as a translating proxy and parse the translated text out of the HTML response from Google translate. Through experimentation I found that their site can take parameters text and langpair queries in the query like so: http://translate.google.com/translate_t?text=car%20moi%20m%C3%AAme%20j%27en%20rit&langpair=fr|en. On the topic of non US-ASCII characters (which is important for a translator) I found that llHTTPRequest encodes non US-ASCII characters as percent-encoded UTF-8 when constructing the request URI. However, when Google translate takes parameters off the URI it only seems to interpret it as percent-encoded UTF-8 when the user-agent is IE's. So after changing my PHP script to use IE7's user-agent non US-ASCII character input worked.

In Use
Actually using it in practice is rather difficult. Between typos, slang, abbreviations, and the current state of the free online translators its very difficult to carry on a conversation. Additionally, I don't really like talking to random people on Second Life anyway. So... not too useful.PermalinkCommentspersonal translate second-life technical translator sl code google php llhttprequest

Sorting It All Out : ASCII? no questions; I tell UNICODE lies

2007 Mar 13, 3:45Michael Kaplan answers my question about MultiByteToWideChar's flexible interpretation of US-ASCII.PermalinkCommentsencoding michael-kaplan us-ascii ascii unicode windows
Older Entries Creative Commons License Some rights reserved.