character page 3 - Dave's Blog


URLs are tough - Anne's Weblog

2009 Apr 7, 1:30

I really dislike how IE deals with non-US-ASCII in URLs. I should write up a post on what exactly IE does with non-US-ASCII characters in URLs. "Just like IRIs the URL is mapped to a URI using UTF-8. Except for the query component of the URL (the bit after the question mark). Here for legacy reasons the encoding of the document is used instead. Except if the encoding of the document is UTF-16, in which case UTF-8 is used. Effectively, using non-ASCII characters in URLs in documents not encoded as UTF-8 or UTF-16 will give you surprising results, to say the least. Yay for browsers!"

Tags: http encoding html5 url uri unicode iri
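
To make the rule in that quote concrete, here is a minimal Python sketch. The helper name `url_to_uri` is made up for illustration, it assumes the host is already plain ASCII, and it models only the rule as quoted, not what IE or any other browser actually does.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def url_to_uri(url: str, document_encoding: str) -> str:
    """Percent-encode non-ASCII characters following the quoted rule.

    Hypothetical helper: path and fragment always use UTF-8, the query
    uses the document's encoding unless that encoding is UTF-16.
    """
    parts = urlsplit(url)

    # Per the quote: UTF-16 documents fall back to UTF-8 for the query.
    normalized = document_encoding.lower().replace("_", "-")
    query_encoding = "utf-8" if normalized.startswith("utf-16") else document_encoding

    # "%" stays safe so already-escaped sequences are not escaped twice.
    path = quote(parts.path, safe="/%", encoding="utf-8")
    query = quote(parts.query, safe="=&%", encoding=query_encoding, errors="replace")
    fragment = quote(parts.fragment, safe="%", encoding="utf-8")

    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


if __name__ == "__main__":
    # Same character, two different escapes in one URI.
    print(url_to_uri("http://example.com/\u00e9?q=\u00e9", "iso-8859-1"))
```

In an ISO-8859-1 document the same "é" ends up as %C3%A9 in the path and %E9 in the query, which is the surprising result the quote is complaining about.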

Platonic Ideals in Anathem and The Atrocity Archives

2009 Apr 7, 11:58
[The Atrocity Archives][The Jennifer Morgue][Anathem]

This past week I finished Anathem and despite the intimidating physical size of the book (difficult to take and read on the bus) I became very engrossed and was able to finish it in several orders of magnitude less time than what I spent on the Baroque Cycle. Whereas reading the Baroque Cycle you can imagine Neal Stephenson sifting through giant economic tomes (or at least that's where my mind went whenever the characters began to explain macro-economics to one another), in Anathem you can see Neal Stephenson staying up late poring over philosophy of mathematics. When not exploring philosophy, Anathem has an appropriate amount of humor, love interests, nuclear bombs, etc. as you might hope from reading Snow Crash or Diamond Age. I thoroughly enjoyed Anathem.

On the topic of made up words: I get made up words for made up things, but there's already a name for cell-phone in English: it's "cell-phone". The narrator notes that the book has been translated into English so I guess I'll blame the fictional translator. Anyway, I wasn't bothered by the made up words nearly as much as some folk. It's a good thing I'm long out of college because I can easily imagine confusing the names of actual concepts and people with those from the book, like Hemn space for Hamming distance. Towards the beginning, the description of slines and the post-post-apocalyptic setting reminded me briefly of Idiocracy.

Recently, I've been reading everything of Charles Stross that I can, including, about a month ago, The Jennifer Morgue from the surprisingly awesome amalgamation genre of spy thriller and Lovecraft horror. It's the second in a series set in a universe in which magic exists as a form of mathematics and follows Bob Howard, programmer/hacker, cube dweller, and begrudging spy, who works for a government agency tasked to suppress this knowledge and protect the world from its use. For a taste, try a short story from the series that's freely available on Tor's website, Down on the Farm.

Coincidentally, both Anathem and the Bob Howard series take an interest in the world of Platonic ideals. In the case of Anathem (without spoiling anything) the universe of Platonic ideals, under a different name of course, is debated by the characters to be either just a concept or an actual separate universe and later becomes the underpinning of major events in the book. In the Bob Howard series, magic is applied mathematics that through particular proofs or computations awakens/disturbs/provokes unnamed horrors in the universe of Platonic ideals to produce some desired effect in Bob's universe.

Tags: atrocity archives neal stephenson jennifer morgue plato bob howard anathem

Sorting it all Out : What do you get when you combine a base character with a buttload of diacritics?

2009 Mar 6, 11:47

"Anyway, I decided to take the letter a and put as many different diacritics on it as I could." Michael Kaplan sticks like 80 diacritics on the letter 'a'. Awesome.

Tags: encoding unicode diacritic language letter michael-kaplan

The 'Is It UTF-8?' Quick and Dirty Test

2009 Mar 6, 5:16

I've found while debugging networking in IE it's often useful to quickly tell if a string is encoded in UTF-8. You can check for the Byte Order Mark (EF BB BF in UTF-8), but I rarely see the BOM on UTF-8 strings. Instead I apply a quick and dirty UTF-8 test that takes advantage of the well-formed UTF-8 restrictions.

Unlike other multibyte character encoding forms (see Windows supported character sets or IANA's list of character sets), for example Big5, where sticking together any two bytes is more likely than not to give a valid byte sequence, UTF-8 is more restrictive. And unlike other multibyte character encodings, a UTF-8 byte may be taken out of context and one can still tell whether it's a single byte character, the starting byte of a three byte sequence, etc.

The full rules for well-formed UTF-8 are a little too complicated for me to commit to memory. Instead I've got my own simpler (this is the quick part) set of rules that will be mostly correct (this is the dirty part). For as many bytes in the string as you care to examine, check the most significant hex digit of the byte:

F:
This is byte 1 of a 4 byte encoded codepoint and must be followed by 3 trail bytes.
E:
This is byte 1 of a 3 byte encoded codepoint and must be followed by 2 trail bytes.
C..D:
This is byte 1 of a 2 byte encoded codepoint and must be followed by 1 trail byte.
8..B:
This is a trail byte.
0..7:
This is a single byte encoded codepoint.
The simpler rules can produce false positives in some cases: that is, they'll say a string is UTF-8 when in fact it might not be. But they won't produce false negatives. The following table from the Unicode spec describes exactly what well-formed UTF-8 is.
Code Points           1st Byte   2nd Byte   3rd Byte   4th Byte
U+0000..U+007F        00..7F
U+0080..U+07FF        C2..DF     80..BF
U+0800..U+0FFF        E0         A0..BF     80..BF
U+1000..U+CFFF        E1..EC     80..BF     80..BF
U+D000..U+D7FF        ED         80..9F     80..BF
U+E000..U+FFFF        EE..EF     80..BF     80..BF
U+10000..U+3FFFF      F0         90..BF     80..BF     80..BF
U+40000..U+FFFFF      F1..F3     80..BF     80..BF     80..BF
U+100000..U+10FFFF    F4         80..8F     80..BF     80..BF
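
Here is a rough Python sketch of the quick and dirty rules above, next to a strict check for comparison. The function names are made up for this example, and the sketch treats a string that is cut off in the middle of a sequence as a failure, which is a choice of mine rather than part of the rules.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Quick and dirty test: look only at the high hex digit of each byte."""
    i = 0
    while i < len(data):
        high = data[i] >> 4
        if high <= 0x7:
            trail_count = 0      # single byte encoded codepoint
        elif high in (0xC, 0xD):
            trail_count = 1      # byte 1 of a 2 byte sequence
        elif high == 0xE:
            trail_count = 2      # byte 1 of a 3 byte sequence
        elif high == 0xF:
            trail_count = 3      # byte 1 of a 4 byte sequence
        else:
            return False         # 8..B: a trail byte with no lead byte

        trail = data[i + 1:i + 1 + trail_count]
        if len(trail) != trail_count:
            return False         # sequence cut off (assumption: treat as failure)
        if any(not 0x8 <= (b >> 4) <= 0xB for b in trail):
            return False         # expected a trail byte, got something else
        i += 1 + trail_count
    return True


def is_strict_utf8(data: bytes) -> bool:
    """Strict test: let Python's decoder apply the full well-formedness rules."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for sample in ("h\u00e9llo \u20ac".encode("utf-8"), b"\xe9t\xe9", b"\xc0\x80"):
        print(sample, looks_like_utf8(sample), is_strict_utf8(sample))
```

The last sample, C0 80, is one of the false positives: it passes the quick test but the table rules it out because C0 is never a valid first byte.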

Tags: test technical unicode boring charset utf8 encoding

Subst Allows Non-Letter Drive Letters

2009 Mar 4, 2:39

I knew that the command line tool subst would create virtual drives that map to existing directories but I didn't know that subst lets you name the virtual drives with characters that aren't US-ASCII letters. For instance you can run 'subst 4: C:\windows' and then 'more 4:\win.ini' to dump C:\windows\win.ini. This also works for non-US-ASCII characters like "Ｃ" (aka U+FF23, Fullwidth Latin Capital Letter C), which when displayed by cmd.exe via some best fit style character conversions looks just like the regular US-ASCII 'C'. None of Explorer, IE, or the common file dialogs allow the use of these odd virtual drives, just cmd.exe, so I'm not sure how this would ever be useful but I thought it was odd and I wanted to share.

Tags: cli technical boring subst windows

MyFonts Blog - Blog Archive - Introducing WhatTheFont for iPhone!

2009 Feb 11, 10:05

"With the iPhone version of WhatTheFont you can use the phone's built-in camera to photograph the text in question (or choose an existing image from your photo albums)... After confirming which characters are used in the image, the app provides a list of possible matching fonts."

Tags: font iphone camera typography

Party Movies Recommended by Netflix

2008 Sep 18, 10:31
[Poster for 24 Hour Party People][Poster for Human Traffic][Poster for The Boys and Girls Guide to Getting Down]

Netflix has recommended three party movies over my time with Netflix and if you're OK with movies featuring sex, drugs, rock&roll (or techno) as almost the main character then I can recommend at least The Boys and Girls Guide to Getting Down.

24 Hour Party People is based on the true story of Tony Wilson, journalist, band manager, and club owner (not all at once) around the rise of punk and new wave in England. Like many true-story based movies it starts off strong and very interesting but gets very slow at the end like the writers got bored and just started copying the actual events. Unless you have some interest in the history of music in the 80s in Manchester I don't recommend this movie.

Human Traffic is fun and funny, following a group of friends going out for a night of clubbing and partying. I had to get over seeing John Simm not as The Master from Doctor Who but rather as a partying youth. It felt like it was geared towards viewers who were on something, given the totally odd techno musical interludes with the characters dancing for no apparent reason. Otherwise the movie was good.

The Boys and Girls Guide to Getting Down is done in the style of an old educational movie on the topic of clubbing and partying. It sounds like a premise that would get old but they do a good job. While demonstrating drinking and driving they have scientists push a mouse around in a toy convertible. Enough said. It was funny and I recommend it.

Tags: party movie netflix

Braid Recommendation

2008 Aug 14, 9:38

[Braid screen shot. By gamerscoreblog]

I recently finished Braid, the Xbox Live game, and a comparison with Portal is helpful. From a screen shot Braid looks like a normal 2D platformer, but that's like looking at a screen shot of Portal and saying it's a first person shooter. While the scaffolding of the game-play may sort of fall into that category, the games are actually about exploring the character's ability and solving puzzles. In Portal the ability is bending space and in Braid it's bending time. However, whereas in Portal there is one space bending mechanism, the portal gun, Braid's protagonist explores several different time bending techniques including, most prominently, reversing time, but also time dilation, multiple time-lines, and other odd things.

Similar to the difference in game-play, while Portal has a strict simplicity to its visual style, Braid is much more ornate, like you're playing in an oil painting. Without seeing video of the game, or playing the demo (which is available for free on Xbox Live), it's difficult to convey, but it is quite lovely and the animation adds quite a bit. Both games too are rather short, leaving you just a bit hungry for more, and have an interesting plot and an ending that I'd hate to spoil, although Braid replaces Portal's humor with melancholy. If you enjoyed Portal and Twelve Monkeys then I'd recommend Braid.

Tags: braid game videogame portal nontechnical

VERTIGO

2008 Aug 14, 9:25

"When a savage creature known only as the Adversary conquered the fabled lands of legends and fairy tales, all of the infamous inhabitants of folklore were forced into exile. Disguised among the normal citizens of modern-day New York, these magical characters have created their own peaceful and secret society within an exclusive luxury apartment building called Fabletown. But when Snow White's party-girl sister, Rose Red, is apparently murdered, it is up to Fabletown's sheriff, a reformed and pardoned Big Bad Wolf, to determine if the killer is Bluebeard, Rose's ex-lover and notorious wife killer, or Jack, her current live-in boyfriend and former beanstalk-climber."

Tags: comic read download free via:boingboing fiction

philosecurity - Blog Archive - Guerilla Public Service

2008 Aug 11, 3:58

Fellow kindly fixes spelling error on trailer mounted electronic roadway message signs. Pulls up, connects keyboard, reads password off the side of the enclosure, etc. "Not far from my house is one of those temporary trailer-mounted variable message signs, which for the past several weeks has been advising motorists that ..." I always wondered what it would take on those signs. And if all the passwords are four characters long...

Tags: security hack howto sign humor

Tor.com / Science fiction and fantasy / Blog posts / The Singularity Problem and Non-Problem

2008 Jul 22, 1:43

Just for this quote: "Since then, the Singularity has come to be an object of almost religious faith in some quarters. In The Cassini Division, Ken MacLeod has a character call it "the Rapture for nerds," and that's just how I see it."

Tags: singularity scifi blog tor quote

How to Build a Universe That Doesn't Fall Apart Two Days Later

2008 Jul 3, 1:32

"Finally he cut the tape entirely, whereupon the world disappeared. However, it also disappeared for the other characters in the story... which makes no sense, if you think about it." That's what I thought when I read that story.

Tags: article essay fiction scifi philip-k-dick via:mach3

Kids in the Hall Live in Seattle

2008 May 17, 7:58

Sarah and I saw the Kids in the Hall "Live As We'll Ever Be" Tour in the WaMu theater in Seattle this past Thursday. I'd only ever seen their television show so it was cool to see them live. I thought that them being in a live format on stage would make the show significantly different, but other than having a bad seat and not being able to see very well, and the Kids sometimes ad-libbing or breaking character, it was like watching their show. It consisted of mostly new material with some returning characters like the Chicken Lady, Buddy Cole, the head crusher, etc. Their Facebook page has two videos that they played during the show.

I've been using the best Kids in the Hall fansite with an archive of searchable transcripts since high school. But nowadays, what with all the newfangled video websites, I can link right to some of my favorite sketches from the show. Like the Inexperienced Cannibal.


And the meta-sketch The Raise.

Tags: kids in the hall humor seattle nontechnical

Finally finished Baroque Cycle Novels

2008 May 2, 10:20
[The cover of Cryptonomicon][The cover of Quicksilver][The cover of The Confusion][The cover of The System of the World]

I've finally finished the Baroque Cycle, a historical fiction series set in the 17th and 18th centuries by Neal Stephenson whose work I always enjoy. There were often delays where I'd forget about the books until I had to take a plane somewhere, or get discouraged reading about the characters' thoughts on economics, or have difficulty finding the next volume, or become more engrossed in other books, projects or video games, and leave the Baroque Cycle books untouched for many months at a time. Consequently, my reading of this series has, I'm ashamed to say, spanned years. After finishing some books which I enjoy I end up hungry for just a bit more to read. For this series I don't need a bit more to read, I'm done with that, but I do want a badge or maybe a medal. Or barring that, college credit in European History and Macro Economics. I can recommend this series to anyone who has enjoyed Neal Stephenson's other work and has a few years of free time to kill.

Tags: history neal stephenson baroque cycle book nontechnical

URI Fragment Info Roundup

2008 Apr 21, 11:53

['Neverending story' by Alexandre Duret-Lutz. A framed photo of books with the droste effect applied. Licensed under creative commons.]

Information about URI fragments, the portion of a URI that follows the '#' at the end and that is used to navigate within a document, is scattered throughout various documents which I usually have to hunt down. Instead I'll link to them all here.

Definitions. Fragments are defined in the URI RFC which states that they're used to identify a secondary resource that is related to the primary resource identified by the URI as a subset of the primary, a view of the primary, or some other resource described by the primary. The interpretation of a fragment is based on the mime type of the primary resource. Tim Berners-Lee notes that determining fragment meaning from mime type is a problem because a single URI may contain only a single fragment, yet over HTTP a single URI can result in the same logical resource represented in different mime types. So there's one fragment but multiple mime types and so multiple interpretations of the one fragment. The URI RFC says that if an author has a single resource available in multiple mime types then the author must ensure that the various representations all resolve fragments to the same logical secondary resource. Depending on which mime types you're dealing with this is either not easy or not possible.

HTTP. In HTTP when URIs are used, the fragment is not included. The General Syntax section of the HTTP standard says it uses the definitions of 'URI-reference' (which includes the fragment), 'absoluteURI', and 'relativeURI' (which don't include the fragment) from the URI RFC. However, the 'URI-reference' term doesn't actually appear in the BNF for the protocol. Accordingly the headers like 'Request-URI', 'Content-Location', 'Location', and 'Referer' which include URIs are defined with 'absoluteURI' or 'relativeURI' and don't include the fragment. This is in keeping with the original fragment definition which says that the fragment is used as a view of the original resource and consequently only needed for resolution on the client. Additionally, the URI RFC explicitly notes that not including the fragment is a privacy feature such that page authors won't be able to stop clients from viewing whatever fragments the client chooses. This seems like an odd claim given that if the author wanted to selectively restrict access to portions of documents there are other options for them like breaking out the parts of a single resource to which the author wishes to restrict access into separate resources.
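
As a small illustration of the fragment staying on the client, here is a Python sketch that splits a URI into the part that would go out in an HTTP request line and the fragment that is kept back. The helper name `request_target` and the example URI are made up for illustration, and this is only a sketch of the idea, not how any particular client is written.

```python
from urllib.parse import urldefrag, urlsplit


def request_target(uri: str) -> tuple[str, str]:
    """Return (what goes in the request line, fragment kept on the client)."""
    without_fragment, fragment = urldefrag(uri)
    parts = urlsplit(without_fragment)

    # The request line carries the path and query only; an empty path is "/".
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target, fragment


if __name__ == "__main__":
    target, fragment = request_target("http://example.com/doc.html?x=1#section-2")
    print("GET", target, "HTTP/1.1")   # the server never sees the fragment
    print("client-side fragment:", fragment)
```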

HTML. In HTML, the HTML mime type RFC defines HTML's fragment use which consists of fragments referring to elements with a corresponding 'id' attribute or one of a particular set of elements with a corresponding 'name' attribute. The HTML spec discusses fragment use additionally noting that the names and ids must be unique in the document and that they must consist of only US-ASCII characters. The ID and NAME attributes are further restricted in section 6 to only consist of alphanumerics, the hyphen, period, colon, and underscore. This is a subset of the characters allowed in the URI fragment so no encoding is discussed since technically it's not needed. However, practically speaking, browsers like Firefox and Internet Explorer allow for names and ids containing characters outside of the defined set including characters that must be percent-encoded to appear in a URI fragment. The interpretation of percent-encoded characters in fragments for HTML documents is not consistent across browsers (or in some cases within the same browser) especially for the percent-encoded percent.

Text. Text/plain recently got a fragment definition that allows fragments to refer to particular lines or characters within a text document. The scheme no longer includes regular expressions, which disappointed me at first, but in retrospect is probably a good idea for increasing the adoption of this fragment scheme and for avoiding the potential for ubiquitous DoS via regex. One of the authors also notes this on his blog. I look forward to the day when this scheme is widely implemented.
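
To give a feel for the text/plain scheme, here is a rough Python sketch of resolving a `line=` or `char=` fragment against a string. It follows my reading of RFC 5147, where positions count the gaps between lines or characters starting at 0, so it should be taken as an approximation. The name `select_text` is made up, and the optional integrity checks after a ';' are simply ignored.

```python
import re

# scheme=position, scheme=start,end, scheme=start, or scheme=,end
_TEXT_FRAGMENT = re.compile(r"^(char|line)=(\d*)(?:(,)(\d*))?$")
# A line is any run of text up to and including CRLF, CR, or LF.
_LINES = re.compile(r"[^\r\n]*(?:\r\n|\r|\n)|[^\r\n]+")


def select_text(text: str, fragment: str) -> str:
    """Return the part of text picked out by a text/plain fragment."""
    scheme_part = fragment.split(";", 1)[0]      # drop integrity checks
    match = _TEXT_FRAGMENT.match(scheme_part)
    if not match or (match.group(2) == "" and not match.group(4)):
        raise ValueError(f"not a text/plain fragment: {fragment!r}")
    kind, start, comma, end = match.groups()

    units = _LINES.findall(text) if kind == "line" else text
    lo = int(start) if start else 0
    if comma is None:
        hi = lo                                  # a bare position is a point
    else:
        hi = int(end) if end else len(units)

    selected = units[lo:hi]
    return "".join(selected) if kind == "line" else selected


if __name__ == "__main__":
    doc = "zero\none\ntwo\nthree\n"
    print(repr(select_text(doc, "line=1,3")))
    print(repr(select_text(doc, "char=0,4")))
```

Because positions sit between units, the ranges map directly onto Python slices, which keeps the sketch short.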

XML. XML has the XPointer framework to define its fragment structure as noted by the XML mime type definition. XPointer consists of a general scheme that contains subschemes that identify a subset of an XML document. It's too bad such a thing wasn't adopted for URI fragments in general to solve the problem of a single resource with multiple mime type representations. I wrote more about XPointer when I worked on hacking XPointer into IE.

SVG and MPEG. Through the Media Fragments Working Group I found a couple more fragment scheme definitions. SVG's fragment scheme is defined in the SVG documentation and looks similar to XML's. MPEG has one defined but I could only find it as an ISO document "Text of ISO/IEC FCD 21000-17 MPEG-12 FID" and not as an RFC which is a little disturbing.

AJAX. AJAX websites have used fragments as an escape hatch for two issues that I've seen. The first is getting a unique URL for versions of a page that are produced on the client by script. The fragment may be changed by script without forcing the page to reload. This goes outside the rules of the standards by using HTML fragments in a fashion not called out by the HTML spec, but it does seem to be in line with the spirit of the fragment in that it is a subview of the original resource and interpreted client side. The other, hack-ier use of the fragment in AJAX is for cross domain communication. The basic idea is that different frames or windows may not communicate in normal fashions if they have different domains but they can view each other's URLs and accordingly can change their own fragments in order to send a message out to those who know where to look. IMO this is not in line with the spirit of the fragment but is rather a cool hack.

Tags: xml text ajax technical url boring uri fragment rfc

Encoding methods in C#

2008 Apr 12, 10:38

For Encode-O-Matic, my encoding tool written in C#, I had to figure out the appropriate DllImport declarations to use the IDN Win32 functions, which was a pain. To spare others that pain, here are the two files CharacterSetEncoding.cs and NationalLanguageSupportUtilities.cs that declare the DllImports for IdnToUnicode, IdnToAscii, NormalizeString, MultiByteToWideChar, and WideCharToMultiByte.

Tags: encodeomatic boring csharp widechartomultibyte idn tool dllimport

The Forbes Fictional 15 - Forbes.com

2008 Apr 11, 3:22

Forbes ranks the 15 richest fictional characters: "The characters that make up this year's edition of the Forbes Fictional 15, our annual listing of fiction's richest, boast an aggregate net worth of $137 billion."

Tags: forbes humor fiction article comic

Extensible Markup Language (XML) 1.1 (Second Edition)

2008 Mar 18, 11:21

End-of-line handling in XML. Spoiler: an XML processor should normalize most newline character sequences to 0xA.

Tags: xml spec standard w3c unicode charset newline end-of-line
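
For the curious, the normalization fits in one regular expression. This is a minimal Python sketch based on my reading of the end-of-line handling section of XML 1.1 (CR LF, CR NEL, a lone NEL, LINE SEPARATOR, and a lone CR all become a single LF). The function name is made up for this example.

```python
import re

# Two-character sequences come first so a CR is not replaced on its own
# when it is really the start of CR LF or CR NEL.
_XML11_EOL = re.compile("\r\n|\r\x85|\x85|\u2028|\r")


def normalize_xml11_newlines(text: str) -> str:
    """Translate the XML 1.1 line-end sequences to a single U+000A."""
    return _XML11_EOL.sub("\n", text)


if __name__ == "__main__":
    raw = "a\r\nb\rc\r\x85d\x85e\u2028f\ng"
    print(repr(normalize_xml11_newlines(raw)))
```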

RFC 2231 MIME Parameter Value and Encoded Word Extensions: Character Sets, Languages, and Continuations

2008 Mar 8, 11:44

"This memo defines extensions to the RFC 2045 media type and RFC 2183 disposition parameter value mechanisms to provide ... a means to specify parameter values in character sets other than US-ASCII..."

Tags: http http-header rfc standard reference ietf mime encoding charset language content-disposition
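
As a sketch of what the extended parameter values look like, here is some Python that builds and parses the `name*=charset'language'percent-encoded` form from RFC 2231. The helper names are invented for this example, and continuations (`name*0*`, `name*1*`, and so on) are left out entirely.

```python
from urllib.parse import quote, unquote


def encode_ext_param(name: str, value: str, charset: str = "utf-8", language: str = "") -> str:
    """Build name*=charset'language'percent-encoded-value."""
    # safe="" escapes everything except ASCII letters, digits, and "_.-~".
    encoded = quote(value, safe="", encoding=charset)
    return f"{name}*={charset}'{language}'{encoded}"


def decode_ext_value(ext_value: str) -> tuple[str, str]:
    """Split charset'language'encoded and return (decoded value, language)."""
    charset, language, encoded = ext_value.split("'", 2)
    return unquote(encoded, encoding=charset), language


if __name__ == "__main__":
    param = encode_ext_param("filename", "\u20ac rates.txt")
    print("Content-Disposition: attachment; " + param)
    print(decode_ext_value(param.split("=", 1)[1]))
```

Whether a given browser accepts a header like this is a separate question from whether it is well-formed.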

HTTP headers and non-asci characters (Content-Disposition, filename, attachment) Article

2008 Mar 8, 11:43

"I was not able to find universal settings to do this task, but it looks like Mozilla based browsers accepts utf-8 encoded headers and headers Encoded Word Extensions from RFC 2231. Internet explorer accepts utf-8 filenames only when 1. the data are URL e

Tags: http http-header charset ascii utf8 mozilla ie browser content-disposition
Creative Commons License. Some rights reserved.