I'm a big fan of the concept of registerProtocolHandler in HTML5 and in Firefox 3, but not quite the implementation. From a high level, it allows web apps to register themselves as handlers of a URL scheme so that, for (the canonical) example, Gmail can register for the mailto URL scheme. I like the concept:
registerProtocolHandler("info:lccn/{lccnID}", "http://www.librarything.com/search_works.php?q={lccnID}", "LibraryThing LCCN")
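The call above uses `{lccnID}` template variables rather than the single `%s` placeholder of the HTML5 API. As a rough sketch of how a browser might resolve a URL against such a registration, the following matches a URL against the scheme template and fills the captured values into the handler template. The function names `template_to_regex` and `resolve` are hypothetical, and matching each variable greedily to the end of the URL is an assumption of this sketch.

```python
import re
from urllib.parse import quote

def template_to_regex(template):
    """Turn 'info:lccn/{lccnID}' into a regex with one named group per {variable}."""
    # re.split with a capture group alternates literal text and variable names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)          # literal text
        else:
            pattern += "(?P<%s>.+)" % part      # template variable
    return re.compile("^" + pattern + "$")

def resolve(url, scheme_template, handler_template):
    """Return the handler URL for `url`, or None if `url` does not match."""
    match = template_to_regex(scheme_template).match(url)
    if match is None:
        return None
    result = handler_template
    for name, value in match.groupdict().items():
        # Percent-encode the captured value before placing it in the query.
        result = result.replace("{%s}" % name, quote(value, safe=""))
    return result

handler = resolve(
    "info:lccn/2002022641",
    "info:lccn/{lccnID}",
    "http://www.librarything.com/search_works.php?q={lccnID}",
)
```

The LCCN value used here is only a placeholder for illustration.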
I've found while debugging networking in IE that it's often useful to quickly tell if a string is encoded in UTF-8. You can check for the Byte Order Mark (EF BB BF in UTF-8), but I rarely see the BOM on UTF-8 strings. Instead I apply a quick and dirty UTF-8 test that takes advantage of the well-formed UTF-8 restrictions.
Unlike other multibyte character encoding forms (see Windows supported character sets or IANA's list of character sets), such as Big5, where sticking together any two bytes is more likely than not to give a valid byte sequence, UTF-8 is more restrictive. And unlike those encodings, a UTF-8 byte may be taken out of context and one can still tell whether it's a single-byte character, the starting byte of a three-byte sequence, etc.
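The full rules for well-formed UTF-8 are a little too complicated for me to commit to memory. Instead I've got my own simpler (this is the quick part) set of rules that will be mostly correct (this is the dirty part). For as many bytes in the string as you care to examine, check the most significant hex digit of the byte:
0..7: a single-byte character.
8..B: a trail byte, which must follow a lead byte or another trail byte.
C..D: the lead byte of a two-byte sequence, which must be followed by one trail byte.
E: the lead byte of a three-byte sequence, which must be followed by two trail bytes.
F: the lead byte of a four-byte sequence, which must be followed by three trail bytes.
For comparison, the full rules for well-formed UTF-8: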
Code Points | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
---|---|---|---|---|
U+0000..U+007F | 00..7F | | | |
U+0080..U+07FF | C2..DF | 80..BF | | |
U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
U+D000..U+D7FF | ED | 80..9F | 80..BF | |
U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
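The quick and dirty test can be sketched roughly as follows. It classifies each byte by its most significant hex digit, with the categories read off the lead-byte column of the table above, and only checks that lead bytes are followed by the right number of trail bytes. The function names are hypothetical, and the choice to tolerate a sequence cut off at the end of the examined bytes is an assumption of this sketch.

```python
def trail_bytes_expected(byte):
    """Classify a byte by its most significant hex digit.

    Returns the number of trail bytes that should follow a lead byte,
    or -1 if the byte is itself a trail byte.
    """
    high = byte >> 4
    if high <= 0x7:
        return 0      # 0..7: single-byte character
    if high <= 0xB:
        return -1     # 8..B: trail byte
    if high <= 0xD:
        return 1      # C..D: lead byte of a two-byte sequence
    if high == 0xE:
        return 2      # E: lead byte of a three-byte sequence
    return 3          # F: lead byte of a four-byte sequence

def looks_like_utf8(data, limit=None):
    """Quick and dirty check over the first `limit` bytes of `data`."""
    pending = 0  # trail bytes still owed to the current lead byte
    for byte in data[:limit]:
        kind = trail_bytes_expected(byte)
        if pending:
            if kind != -1:
                return False  # expected a trail byte here
            pending -= 1
        elif kind == -1:
            return False      # trail byte with no lead byte before it
        else:
            pending = kind
    return True  # a sequence cut off by `limit` is tolerated
```

The dirty part shows up with input like `b"\xc0\x80"`: the sketch accepts it, while the full table rejects it because C0 is not a valid first byte.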
The text/plain fragment documented in RFC 5147 and described on Erik Wilde's blog piqued my interest and, like the XML fragment, I wanted to see if I could implement it in IE. In this case there's no XSLT for me to edit so, like my text/plain word wrap bookmarklet, I've implemented it as a bookmarklet. This is only a partial implementation as it doesn't implement the integrity checks.
Check out my text/plain fragment bookmarklet.
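To give a feel for what such a partial implementation has to do, here is a minimal sketch that selects the part of a text identified by a `line=` or `char=` fragment. It is not the bookmarklet's code. Positions count the gaps between lines or characters, so `line=10,20` selects lines 11 through 20. The function name `select` is hypothetical, integrity checks are ignored just as described above, and counting a CRLF pair as one character is an assumption of this sketch.

```python
import re

# Matches line=<pos>, line=<start>,<end>, line=<start>, and line=,<end>
# (likewise for char=), followed by optional integrity checks after ';'.
FRAGMENT_RE = re.compile(r"^(line|char)=(\d*)(?:(,)(\d*))?(?:;.*)?$")

def select(text, fragment):
    """Return the part of `text` identified by a text/plain fragment."""
    match = FRAGMENT_RE.match(fragment)
    if match is None or (not match.group(2) and not match.group(4)):
        raise ValueError("unsupported fragment: %r" % fragment)
    scheme, start, comma, end = match.groups()
    if scheme == "line":
        # Keep each line ending attached to its line so slices join cleanly.
        units = re.findall(r"[^\r\n]*(?:\r\n|\r|\n)|[^\r\n]+", text)
    else:
        # Assumption: a CRLF pair counts as a single character.
        units = re.findall(r"\r\n|.", text, re.DOTALL)
    first = int(start) if start else 0
    if comma is None:
        last = first                      # a lone position selects nothing
    else:
        last = int(end) if end else len(units)
    return "".join(units[first:last])

sample = "one\ntwo\nthree\nfour\n"
middle = select(sample, "line=1,3")       # the second and third lines
```

A lone position such as `line=2` identifies a point between lines rather than a range, so the sketch returns an empty string for it.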
Windows allows for application protocols in which, through the registry, you specify a URL scheme and a command line to have that URL passed to your application. It's an easy way to hook a web browser up to your application. Anyone can read the doc above and then walk through the registry and pick out the application protocols, but from that info alone you can't tell what the application expects these URLs to look like. I did a bit of research on some of the application protocols I've seen, which is listed below. Good places to look for information on URI schemes: Wikipedia URI scheme, and ESW Wiki UriSchemes.
Scheme | Name | Notes |
---|---|---|
search-ms | Windows Search Protocol | The search-ms application protocol is a convention for querying the Windows Search index. The protocol enables applications, like Microsoft Windows Explorer, to query the index with parameter-value arguments, including property arguments, previously saved searches, Advanced Query Syntax, Natural Query Syntax, and language code identifiers (LCIDs) for both the Indexer and the query itself. See the MSDN docs for search-ms for more info. Example: search-ms:query=food |
Explorer.AssocProtocol.search-ms | | |
OneNote | OneNote Protocol | From the OneNote help: /hyperlink "pagetarget" - Starts OneNote and opens the page specified by the pagetarget parameter. To obtain the hyperlink for any page in a OneNote notebook, right-click its page tab and then click Copy Hyperlink to this Page. Example: onenote:///\\GUMMO\Users\davris\Documents\OneNote%20Notebooks\OneNote%202007%20Guide\Getting%20Started%20with%20OneNote.one#section-id={692F45F5-A42A-415B-8C0D-39A10E88A30F}&end |
callto | Callto Protocol | ESW Wiki Info on callto. Skype callto info. NetMeeting callto info. Example: callto://+12125551234 |
itpc | iTunes Podcast | Tells iTunes to subscribe to an indicated podcast. iTunes documentation. Command: C:\Program Files\iTunes\iTunes.exe /url "%1" Example: itpc:http://www.npr.org/rss/podcast.php?id=35 |
iTunes.AssocProtocol.itpc | | |
pcast | | |
iTunes.AssocProtocol.pcast | | |
Magnet | Magnet URI | Magnet URL scheme described by Wikipedia. Magnet URLs identify a resource by a hash of that resource so that when used in P2P scenarios no central authority is necessary to create URIs for a resource. |
mailto | Mail Protocol | RFC 2368 - Mailto URL Scheme. Mailto Syntax. Opens mail programs with a new message with some parameters filled in, such as the to, from, subject, and body. Example: mailto:?to=david.risney@gmail.com&subject=test&body=Test of mailto syntax |
WindowsMail.Url.Mailto | | |
MMS | mms Protocol | MSDN describes associated protocols. Wikipedia describes MMS. Command: "C:\Program Files\Windows Media Player\wmplayer.exe" "%L" Also appears to be related to MMS cellphone messages: MMS IETF Draft. |
WMP11.AssocProtocol.MMS | | |
secondlife | [SecondLife] | Opens SecondLife to the specified location, user, etc. SecondLife Wiki description of the URL scheme. Command: "C:\Program Files\SecondLife\SecondLife.exe" -set SystemLanguage en-us -url "%1" Example: secondlife://ahern/128/128/128 |
skype | Skype Protocol | Open Skype to call a user or phone number. Skype's documentation. Wikipedia summary of skype URL scheme. Command: "C:\Program Files\Skype\Phone\Skype.exe" "/uri:%l" Example: skype:+14035551111?call |
skype-plugin | Skype Plugin Protocol Handler | Something to do with adding plugins to Skype? Maybe. Command: "C:\Program Files\Skype\Plugin Manager\skypePM.exe" "/uri:%1" |
svn | SVN Protocol | Opens TortoiseSVN to browse the repository URL specified in the URL. Command: C:\Program Files\TortoiseSVN\bin\TortoiseProc.exe /command:repobrowser /path:"%1" |
svn+ssh | | |
tsvn | | |
webcal | Webcal Protocol | Wikipedia describes webcal URL scheme. Webcal URL scheme description. A URL that starts with webcal:// points to an Internet location that contains a calendar in iCalendar format. Command: "C:\Program Files\Windows Calendar\wincal.exe" /webcal "%1" Example: webcal://www.lightstalkers.org/LS.ics |
WindowsCalendar.UrlWebcal.1 | | |
zune | Zune Protocol | Provides access to some Zune operations such as podcast subscription (via Zune Insider). Command: "c:\Program Files\Zune\Zune.exe" -link:"%1" Example: zune://subscribe/?name=http://feeds.feedburner.com/wallstrip. |
feed | Outlook Add RSS Feed | Identify a resource that is a feed such as Atom or RSS. Implemented by Outlook to add the indicated feed to Outlook. Feed URI scheme pre-draft document. Command: "C:\PROGRA~2\MICROS~1\Office12\OUTLOOK.EXE" /share "%1" |
im | IM Protocol | RFC 3860 IM URI scheme description. Like mailto but for instant messaging clients. Registered by Office Communicator but I was unable to get it to work as described in RFC 3860. Command: "C:\Program Files (x86)\Microsoft Office Communicator\Communicator.exe" "%1" |
tel | Tel Protocol | RFC 5341 - tel URI scheme IANA assignment. RFC 3966 - tel URI scheme description. Call phone numbers via the tel URI scheme. Implemented by Office Communicator. Command: "C:\Program Files (x86)\Microsoft Office Communicator\Communicator.exe" "%1" |
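As a simplified illustration of what happens with the command lines in the table, the sketch below picks a command template by URL scheme and substitutes the URL for the `%1`, `%L`, or `%l` placeholder. The `HANDLERS` dictionary is a hypothetical stand-in for the registry, filled with three entries copied from the table, and `command_for` is a made-up name. No escaping or quoting of the URL is attempted, which a real launcher would have to think about.

```python
import re
from urllib.parse import urlsplit

# Hypothetical stand-in for the registry: scheme -> command line template.
HANDLERS = {
    "skype": r'"C:\Program Files\Skype\Phone\Skype.exe" "/uri:%l"',
    "itpc": r'C:\Program Files\iTunes\iTunes.exe /url "%1"',
    "mms": r'"C:\Program Files\Windows Media Player\wmplayer.exe" "%L"',
}

def command_for(url):
    """Return the command line that would be launched for `url`, or None."""
    scheme = urlsplit(url).scheme          # urlsplit lowercases the scheme
    template = HANDLERS.get(scheme)
    if template is None:
        return None
    # A function replacement avoids backslash processing of the URL text.
    return re.sub(r"%[1Ll]", lambda _match: url, template)

skype_command = command_for("skype:+14035551111?call")
```

Because the whole URL, scheme included, is passed along, each application still has to parse its own URL format, which is why the table records example URLs per scheme.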
As noted previously, my page consists of the aggregation of my various feeds and in working on that code recently it was again brought to my attention that everyone has different ways of representing tag metadata in feeds. I made up a list of how my various feed sources represent tags and list that data here so that it might help others in the future.
Source | Feed Type | Tag Markup Scheme | One Tag Per Element | Tag Scheme URI | Human / Machine Names | Example Markup |
---|---|---|---|---|---|---|
LiveJournal | Atom | atom:category | yes | no | no | (source) |
LiveJournal | RSS 2.0 | rss2:category | yes | no | no | technical (source) |
WordPress | RSS 2.0 | rss2:category | yes | no | no | (source) |
Delicious | RSS 1.0 | dc:subject | no | no | no | photosynth photos 3d tool (source) |
Delicious | RSS 2.0 | rss2:category | yes | yes | no | domain="http://delicious.com/SequelGuy/"> (source) |
Flickr | Atom | atom:category | yes | yes | no | term="seattle" (source) |
Flickr | RSS 2.0 | media:category | no | yes | no | scheme="urn:flickr:tags"> (source) |
YouTube | RSS 2.0 | media:category | no | no | no | label="Tags"> (source) |
LibraryThing | RSS 2.0 | No explicit tag metadata. | no | no | no | n/a (source) |
Tag Markup Scheme | Notes | Example |
---|---|---|
Atom Category, atom:category, xmlns:atom="http://www.w3.org/2005/Atom" | | term="catName" |
RSS 2.0 category, rss2:category, empty namespace | | domain="tag:deletethis.net,2008:tagscheme"> |
Yahoo Media RSS Module category, media:category, xmlns:media="http://search.yahoo.com/mrss/" | | scheme="http://dmoz.org" |
Dublin Core subject, dc:subject, xmlns:dc="http://purl.org/dc/elements/1.1/" | | humor |
Update 2009-9-14: Added WordPress to the Tag Markup table and namespaces to the Tag Markup Scheme table.
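For anyone aggregating feeds in the same way, a rough sketch of pulling tags out of all four markup schemes might look like the following. The function name `extract_tags` is hypothetical. Splitting dc:subject and media:category content on whitespace is an assumption based on the "One Tag Per Element" column above, and may not hold for every source.

```python
import xml.etree.ElementTree as ET

# ElementTree spells namespaced tags as "{namespace-uri}localname".
ATOM = "{http://www.w3.org/2005/Atom}"
MEDIA = "{http://search.yahoo.com/mrss/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def extract_tags(feed_xml):
    """Collect tag names from a feed, in document order."""
    tags = []
    for element in ET.fromstring(feed_xml).iter():
        if element.tag == ATOM + "category":
            # Atom keeps the tag in the term attribute, one tag per element.
            term = element.get("term")
            if term:
                tags.append(term)
        elif element.tag == "category":
            # RSS 2.0 category: empty namespace, tag is the element text.
            if element.text and element.text.strip():
                tags.append(element.text.strip())
        elif element.tag in (MEDIA + "category", DC + "subject"):
            # Assumption: several tags share one element, space separated.
            tags.extend((element.text or "").split())
    return tags

rss_sample = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel><item>
<category domain="tag:example.com,2008:tagscheme">technical</category>
<dc:subject>photos 3d tool</dc:subject>
</item></channel></rss>"""
rss_tags = extract_tags(rss_sample)
```

The sample feed is made up for the example and mixes two schemes that would not normally appear together.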
Information about URI fragments, the portion of a URI that follows the '#' at the end and that is used to navigate within a document, is scattered throughout various documents which I usually have to hunt down. Instead I'll link to them all here.
Definitions. Fragments are defined in the URI RFC, which states that they're used to identify a secondary resource that is related to the primary resource identified by the URI as a subset of the primary, a view of the primary, or some other resource described by the primary. The interpretation of a fragment is based on the MIME type of the primary resource. Tim Berners-Lee notes that determining fragment meaning from MIME type is a problem: a single URI may contain only a single fragment, but over HTTP a single URI can result in the same logical resource represented in different MIME types. So there's one fragment but multiple MIME types, and so multiple interpretations of the one fragment. The URI RFC says that if an author has a single resource available in multiple MIME types, then the author must ensure that the various representations all resolve fragments to the same logical secondary resource. Depending on which MIME types you're dealing with, this is either not easy or not possible.
HTTP. In HTTP when URIs are used, the fragment is not included. The General Syntax section of the HTTP standard says it uses the definitions of 'URI-reference' (which includes the fragment), 'absoluteURI', and 'relativeURI' (which don't include the fragment) from the URI RFC. However, the 'URI-reference' term doesn't actually appear in the BNF for the protocol. Accordingly the headers like 'Request-URI', 'Content-Location', 'Location', and 'Referer' which include URIs are defined with 'absoluteURI' or 'relativeURI' and don't include the fragment. This is in keeping with the original fragment definition which says that the fragment is used as a view of the original resource and consequently only needed for resolution on the client. Additionally, the URI RFC explicitly notes that not including the fragment is a privacy feature such that page authors won't be able to stop clients from viewing whatever fragments the client chooses. This seems like an odd claim given that if the author wanted to selectively restrict access to portions of documents there are other options for them like breaking out the parts of a single resource to which the author wishes to restrict access into separate resources.
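To make the client-side nature of the fragment concrete, here is a small sketch of how a client might split a URI into the part that goes on the wire and the part it keeps for itself. The function name `request_parts` and the example.com URI are made up for illustration.

```python
from urllib.parse import urldefrag, urlsplit

def request_parts(uri):
    """Split a URI into (host, request target, fragment kept on the client)."""
    without_fragment, fragment = urldefrag(uri)
    parts = urlsplit(without_fragment)
    target = parts.path or "/"             # an empty path is requested as "/"
    if parts.query:
        target += "?" + parts.query
    return parts.netloc, target, fragment

host, target, fragment = request_parts("http://example.com/doc.html?x=1#section2")
```

Only `host` and `target` would appear in the request; `fragment` is applied by the client once the response arrives.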
HTML. In HTML, the HTML mime type RFC defines HTML's fragment use, which consists of fragments referring to elements with a corresponding 'id' attribute or one of a particular set of elements with a corresponding 'name' attribute. The HTML spec discusses fragment use, additionally noting that the names and ids must be unique in the document and that they must consist of only US-ASCII characters. The ID and NAME attributes are further restricted in section 6 to only consist of alphanumerics, the hyphen, period, colon, and underscore. This is a subset of the characters allowed in the URI fragment so no encoding is discussed since technically it's not needed. However, practically speaking, browsers like Firefox and Internet Explorer allow for names and ids containing characters outside of the defined set, including characters that must be percent-encoded to appear in a URI fragment. The interpretation of percent-encoded characters in fragments for HTML documents is not consistent across browsers (or in some cases within the same browser), especially for the percent-encoded percent.
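A minimal sketch of resolving a fragment against an HTML document, under some loud assumptions: the fragment is percent-decoded exactly once before matching (which, as noted, is not what every browser does), and of the elements allowed to carry 'name' only `a` is checked. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import unquote

class AnchorFinder(HTMLParser):
    """Remember the first element whose id or name matches the target."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.found = None  # (tag, matching attribute) of the first match

    def handle_starttag(self, tag, attrs):
        if self.found is not None:
            return
        attributes = dict(attrs)
        if attributes.get("id") == self.target:
            self.found = (tag, "id")
        elif tag == "a" and attributes.get("name") == self.target:
            self.found = (tag, "name")

def resolve_fragment(html, fragment):
    """Return (tag, attribute) for the element the fragment refers to, or None."""
    finder = AnchorFinder(unquote(fragment))  # assumption: decode exactly once
    finder.feed(html)
    finder.close()
    return finder.found

page = '<p id="intro">Hello</p><a name="a b">anchor</a>'
match = resolve_fragment(page, "a%20b")
```

Decoding once means a fragment of `%2525` would look for an id of `%25`, which is exactly the kind of case where browsers disagree.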
Text. Text/plain recently got a fragment definition that allows fragments to refer to particular lines or characters within a text document. The scheme no longer includes regular expressions, which disappointed me at first, but in retrospect is probably a good idea for increasing the adoption of this fragment scheme and for avoiding the potential for ubiquitous DoS via regex. One of the authors also notes this on his blog. I look forward to the day when this scheme is widely implemented.
XML. XML has the XPointer framework to define its fragment structure as noted by the XML mime type definition. XPointer consists of a general scheme that contains subschemes that identify a subset of an XML document. It's too bad such a thing wasn't adopted for URI fragments in general to solve the problem of a single resource with multiple mime type representations. I wrote more about XPointer when I worked on hacking XPointer into IE.
SVG and MPEG. Through the Media Fragments Working Group I found a couple more fragment scheme definitions. SVG's fragment scheme is defined in the SVG documentation and looks similar to XML's. MPEG has one defined, but I could only find it as an ISO document, "Text of ISO/IEC FCD 21000-17 MPEG-21 FID", and not as an RFC, which is a little disturbing.
AJAX. AJAX websites have used fragments as an escape hatch for two issues that I've seen. The first is getting a unique URL for versions of a page that are produced on the client by script. The fragment may be changed by script without forcing the page to reload. This goes outside the rules of the standards by using HTML fragments in a fashion not called out by the HTML spec, but it does seem to be in line with the spirit of the fragment in that it is a subview of the original resource and interpreted client side. The other, hackier use of the fragment in AJAX is for cross-domain communication. The basic idea is that different frames or windows may not communicate in normal fashions if they have different domains, but they can view each other's URLs and accordingly can change their own fragments in order to send a message out to those who know where to look. IMO this is not in line with the spirit of the fragment but is rather a cool hack.