unicode page 2 - Dave's Blog

Search
My timeline on Mastodon

Unicode Character Set Mappings

2012 Jun 11, 2:36

A leaf directory in a whole set of files that map from character set byte value to Unicode code point.  This one is a set of Microsoft character set byte mappings, but there are other vendors in there too.

PermalinkCommentstechnical unicode charset

ShapeCatcher

2011 Nov 14, 12:37

Draw ShapeCatcher a symbol and ShapeCatcher shows you the characters in Unicode that look similar.  Try a smiley face.  (via Eric Lawrence)

PermalinkCommentstechnical unicode via:ericlaw

Completion of IANA Selection of IDNA Prefix

2010 Dec 8, 6:44Description of how they picked 'xn--' as the ACE prefix for IDN. Shockingly elaborate =)PermalinkCommentsidn technical ace encoding unicode rfc ietf

Buzz by Mark Pilgrim from Buzz

2010 Mar 11, 3:33"The headers and captions on http://diveintohtml5.org/ use an open source font called "Essays 1743." The creator of that font was looking for a tutorial on HTML5, came across my site, and was pleasantly surprised to see his own work on prominent display. He now wants to update his font to include stylistically appropriate Unicode arrows, which I will then use with my captions.

The internet is awesome. It's so wonderfully intertwingled."PermalinkCommentshtml html5 mark-pilgrim font technical

Official Google Blog: Unicode nearing 50% of the web

2010 Jan 28, 4:20Graph of encodings used by documents on the web. Unicode based encodings are thankfully on the rise.PermalinkCommentsunicode encoding web internationalization localization utf8 text html technical

Code: Flickr Developer Blog ยป A Chinese puzzle: Unicode and EXIF metadata parsing

2010 Jan 8, 2:08Flickr dev talks image metadata the various forms which to prefer and how to guess at their character encodings.PermalinkCommentsunicode charset flickr photo image exif programming reference xmp technical

Sorting it all Out : UCS-2 to UTF-16, Part 11: Turning it up to Eleven!

2009 Nov 16, 7:40Part 11 (of 11?) in a series on why you shouldn't freak out about the transition from UCS2 to UTF16. The answer is: you're already doing it wrong.PermalinkCommentstechnical ucs2 unicode utf16 text windows programming language michael-kaplan

PowerShell Scanning Script

2009 Jun 27, 3:42

I've hooked up the printer/scanner to the Media Center PC since I leave that on all the time anyway so we can have a networked printer. I wanted to hook up the scanner in a somewhat similar fashion but I didn't want to install HP's software (other than the drivers of course). So I've written my own script for scanning in PowerShell that does the following:

  1. Scans using the Windows Image Acquisition APIs via COM
  2. Runs OCR on the image using Microsoft Office Document Imaging via COM (which may already be on your PC if you have Office installed)
  3. Converts the image to JPEG using .NET Image APIs
  4. Stores the OCR text into the EXIF comment field using .NET Image APIs (which means Windows Search can index the image by the text in the image)
  5. Moves the image to the public share

Here's the actual code from my scan.ps1 file:

param([Switch] $ShowProgress, [switch] $OpenCompletedResult)

$filePathTemplate = "C:\users\public\pictures\scanned\scan {0} {1}.{2}";
$time = get-date -uformat "%Y-%m-%d";

[void]([reflection.assembly]::loadfile( "C:\Windows\Microsoft.NET\Framework\v2.0.50727\System.Drawing.dll"))

$deviceManager = new-object -ComObject WIA.DeviceManager
$device = $deviceManager.DeviceInfos.Item(1).Connect();

foreach ($item in $device.Items) {
        $fileIdx = 0;
        while (test-path ($filePathTemplate -f $time,$fileIdx,"*")) {
                [void](++$fileIdx);
        }

        if ($ShowProgress) { "Scanning..." }

        $image = $item.Transfer();
        $fileName = ($filePathTemplate -f $time,$fileIdx,$image.FileExtension);
        $image.SaveFile($fileName);
        clear-variable image

        if ($ShowProgress) { "Running OCR..." }

        $modiDocument = new-object -comobject modi.document;
        $modiDocument.Create($fileName);
        $modiDocument.OCR();
        if ($modiDocument.Images.Count -gt 0) {
                $ocrText = $modiDocument.Images.Item(0).Layout.Text.ToString().Trim();
                $modiDocument.Close();
                clear-variable modiDocument

                if (!($ocrText.Equals(""))) {
                        $fileAsImage = New-Object -TypeName system.drawing.bitmap -ArgumentList $fileName
                        if (!($fileName.EndsWith(".jpg") -or $fileName.EndsWith(".jpeg"))) {
                                if ($ShowProgress) { "Converting to JPEG..." }

                                $newFileName = ($filePathTemplate -f $time,$fileIdx,"jpg");
                                $fileAsImage.Save($newFileName, [System.Drawing.Imaging.ImageFormat]::Jpeg);
                                $fileAsImage.Dispose();
                                del $fileName;

                                $fileAsImage = New-Object -TypeName system.drawing.bitmap -ArgumentList $newFileName 
                                $fileName = $newFileName
                        }

                        if ($ShowProgress) { "Saving OCR Text..." }

                        $property = $fileAsImage.PropertyItems[0];
                        $property.Id = 40092;
                        $property.Type = 1;
                        $property.Value = [system.text.encoding]::Unicode.GetBytes($ocrText);
                        $property.Len = $property.Value.Count;
                        $fileAsImage.SetPropertyItem($property);
                        $fileAsImage.Save(($fileName + ".new"));
                        $fileAsImage.Dispose();
                        del $fileName;
                        ren ($fileName + ".new") $fileName
                }
        }
        else {
                $modiDocument.Close();
                clear-variable modiDocument
        }

        if ($ShowProgress) { "Done." }

        if ($OpenCompletedResult) {
                . $fileName;
        }
        else {
                $result = dir $fileName;
                $result | add-member -membertype noteproperty -name OCRText -value $ocrText
                $result
        }
}

I ran into a few issues:

PermalinkCommentstechnical scanner ocr .net modi powershell office wia

OpenSearchDescriptionToHTML Tool

2009 Jun 10, 3:36

I've made an OpenSearchDescriptionToHTML XSLT that given an OpenSearch description file produces HTML that describes that file, lets you install it, or search with it. For example, here's a Google OpenSearch description that uses my OpenSearchDescriptionToHTML XSLT.

I had just created an OpenSearch description for WolframAlpha at work and was going about the process of adding another install link to my search provider page so that I could install it. Thinking about it, I realized I could apply an XSLT to the OpenSearch description XML to produce the HTML automatically so I wouldn't have to modify additional documents everytime I create and want to install a new OpenSearch description. While I was in there writing the XSLT I figure why not let the user try out searching with the OpenSearch description file too. And lastly I made the XSLT apply to itself to produce HTML describing its own usage.

Incidentally, I added WolframAlpha at work to replace my FileInfo search provider for the purposes of searching for information about particular Unicode characters. For instance, look at WolframAlpha's lovely output for this search for "Bopomofo zh".

PermalinkCommentstechnical xml wolframalpha opensearchdescriptiontohtml xslt opensearch

Penny Arcade! - Further Education

2009 Apr 27, 3:47FYI: the official title of the 'backslash' is 'reverse-solidus', according to Unicode, ISO 10646, etc. Much cooler name IMO.PermalinkCommentspunctuation comic humor penny-arcade reverse-solidus back-slash

[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009 Apr 23, 1:35"This e-mail is an attempt to give a relatively concise yet reasonably complete overview of non-Unicode character sets and encodings for 'Chinese characters', excluding those which are not supported by at least one of the four browsers IE, Safari, Firefox and Opera (henceforth 'all browsers'), and tentatively avoiding technical details which are out of scope for HTML5 unless they are important to gain a general understanding of the relevant issues."PermalinkCommentshtml html5 iso-2022 charset encoding character unicode cjk

URLs are tough - Anne's Weblog

2009 Apr 7, 1:30I really dislike how IE deals with non-US-ASCII in URLs. I should write up a post on what exactly IE does with non-US-ASCII characters in URLs. "Just like IRIs the URL is mapped to a URI using UTF-8. Except for the query component of the URL (the bit after the question mark). Here for legacy reasons the encoding of the document is used instead. Except if the encoding of the document is UTF-16, in which case UTF-8 is used. Effectively, using non-ASCII characters in URLs in documents not encoded as UTF-8 or UTF-16 will give you surprising results, to say the least. Yay for browsers!"PermalinkCommentshttp encoding html5 url uri unicode iri

Sorting it all Out : What do you get when you combine a base character with a buttload of diacritics?

2009 Mar 6, 11:47"Anyway, I decided to take the letter a and put as many different diacritics on it as I could." Micahel Kaplan sticks like 80 diacritics on the letter 'a'. Awesome.PermalinkCommentsencoding unicode diacritic language letter michael-kaplan

The 'Is It UTF-8?' Quick and Dirty Test

2009 Mar 6, 5:16

I've found while debugging networking in IE its often useful to quickly tell if a string is encoded in UTF-8. You can check for the Byte Order Mark (EF BB BF in UTF-8) but, I rarely see the BOM on UTF-8 strings. Instead I apply a quick and dirty UTF-8 test that takes advantage of the well-formed UTF-8 restrictions.

Unlike other multibyte character encoding forms (see Windows supported character sets or IANA's list of character sets), for example Big5, where sticking together any two bytes is more likely than not to give a valid byte sequence, UTF-8 is more restrictive. And unlike other multibyte character encodings, UTF-8 bytes may be taken out of context and one can still know that its a single byte character, the starting byte of a three byte sequence, etc.

The full rules for well-formed UTF-8 are a little too complicated for me to commit to memory. Instead I've got my own simpler (this is the quick part) set of rules that will be mostly correct (this is the dirty part). For as many bytes in the string as you care to examine, check the most significant digit of the byte:

F:
This is byte 1 of a 4 byte encoded codepoint and must be followed by 3 trail bytes.
E:
This is byte 1 of a 3 byte encoded codepoint and must be followed by 2 trail bytes.
C..D:
This is byte 1 of a 2 byte encoded codepoint and must be followed by 1 trail byte.
8..B:
This is a trail byte.
0..7:
This is a single byte encoded codepoint.
The simpler rules can produce false positives in some cases: that is, they'll say a string is UTF-8 when in fact it might not be. But it won't produce false negatives. The following is table from the Unicode spec. that actually describes well-formed UTF-8.
Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
U+0000..U+007F 00..7F
U+0080..U+07FF C2..DF 80..BF
U+0800..U+0FFF E0 A0..BF 80..BF
U+1000..U+CFFF E1..EC 80..BF 80..BF
U+D000..U+D7FF ED 80..9F 80..BF
U+E000..U+FFFF EE..EF 80..BF 80..BF
U+10000..U+3FFFF F0 90..BF 80..BF 80..BF
U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
U+100000..U+10FFFF F4 80..8F 80..BF 80..BF

PermalinkCommentstest technical unicode boring charset utf8 encoding

Subst Allows Non-Letter Drive Letters

2009 Mar 4, 2:39

I knew that the command line tool subst would create virtual drives that map to existing directories but I didn't know that subst lets you name the virtual drives with characters that aren't US-ASCII letters. For instance you can run 'subst 4: C:\windows' and then 'more 4:\win.ini' to dump C:\windows\win.ini. This also works for non-US-ASCII characters like, "C" (aka U+FF23, Fullwidth Latin Capital Letter C), which when displayed by cmd.exe via some best fit style character conversions looks just like the regular US-ASCII 'C'. None of Explorer, IE, or the common file dialogs allow the use of these odd virtual drives -- just cmd.exe, so I'm not sure how this would ever be useful but I thought it was odd and I wanted to share.

PermalinkCommentscli technical boring subst windows

Official Google Blog: Moving to Unicode 5.1

2008 May 7, 4:24Woo Unicode! "For the first time, we found that Unicode was the most frequent encoding found on web pages, overtaking both ASCII and Western European encodings"PermalinkCommentsgoogle encoding i18n utf8 unicode ascii web

Encoding methods in C#

2008 Apr 12, 10:38

For Encode-O-Matic, my encoding tool written in C#, I had to figure out the appropriate DllImport declarations to use IDN Win32 functions which was a pain. To spare others that pain here's the two files CharacterSetEncoding.cs and NationalLanguageSupportUtilities.cs that declare the DllImports for IdnToUnicode, IdnToAscii, NormalizeString, MultiByteToWideChar, and WideCharToMultiByte.

PermalinkCommentsencodeomatic boring csharp widechartomultibyte idn tool dllimport

Extensible Markup Language (XML) 1.1 (Second Edition)

2008 Mar 18, 11:21End-of-line handling in XML. Spoiler: XML processor should normalize most newline character sequences to 0xA.PermalinkCommentsxml spec standard w3c unicode charset newline end-of-line

Worse Than Failure - Character encoding WTF

2007 Oct 19, 4:10FTA: 'This letter was sent to a Russian student by her French friend, who manually wrote the address that he received by e-mail. His e-mail client, unfortunately, was not set up correctly to display Cyrillic characters, so they were substituted with diacrPermalinkCommentsencoding charset unicode language humor article

Non-ASCII characters in URI attribute values

2007 Aug 23, 6:17From the HTML standard, what to do if you find a URI with a non-US-ASCII character.PermalinkCommentsuri standard reference unicode utf8 encoding ascii iri
Older EntriesNewer Entries Creative Commons License Some rights reserved.