encoding page 2 - Dave's Blog

Encode-O-Matic: Guess Encoding

2010 Apr 4, 2:02

I've just updated Encode-O-Matic with a Guess Input Encoding feature. When you start Encode-O-Matic or when you use the 'Guess Input Encoding' menu item from the 'Tools' menu, Encode-O-Matic will try out various combinations of encodings and guess at which set seems to apply to your input. For instance, given the following text, Encode-O-Matic will correctly guess that it is percent-encoded, base64-encoded, deflate-compressed text:

S%2BWqUEhLLMoFUulFpXnZQLogMa%2BkmCuPqxzILk%2FMyeHK4QIA
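
For the curious, here's roughly what that decode chain looks like in C#. This is a simplified sketch for illustration, not Encode-O-Matic's actual code, and it assumes the decompressed bytes are UTF-8 text:

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class GuessDemo
{
    static void Main()
    {
        string input = "S%2BWqUEhLLMoFUulFpXnZQLogMa%2BkmCuPqxzILk%2FMyeHK4QIA";

        // 1. Percent (URL) decode: %2B -> '+', %2F -> '/'
        string base64 = Uri.UnescapeDataString(input);

        // 2. Base64 decode to raw bytes
        byte[] compressed = Convert.FromBase64String(base64);

        // 3. Deflate decompress and read the result as (assumed UTF-8) text
        using (var deflate = new DeflateStream(new MemoryStream(compressed), CompressionMode.Decompress))
        using (var reader = new StreamReader(deflate, Encoding.UTF8))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}
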
It should work fairly well for simple things but I did pick 'Guess' for the name of the feature to intentionally lower expectations. It doesn't currently apply to character encodings but that may be something to consider in the future.

PermalinkCommentstechnical encodeomatic tool encoding

Encode-O-Matic Update: Compression, Hex View, Quick Show Output

2010 Mar 9, 9:08

I've just put up an update for Encode-O-Matic with the following improvements:

PermalinkCommentstechnical encodeomatic project

No, you can’t do that with H.264 « Digital Diary of Ben Schwartz

2010 Feb 4, 2:01

On the crappy licensing of the H.264 and MPEG codecs in popular video encoding software.

PermalinkCommentsvideo encoding codec patent legal law apple microsoft theora h.264 technical

Official Google Blog: Unicode nearing 50% of the web

2010 Jan 28, 4:20

Graph of encodings used by documents on the web. Unicode-based encodings are thankfully on the rise.

PermalinkCommentsunicode encoding web internationalization localization utf8 text html technical

Code: Flickr Developer Blog » A Chinese puzzle: Unicode and EXIF metadata parsing

2010 Jan 8, 2:08

Flickr dev talks about image metadata: its various forms, which to prefer, and how to guess at their character encodings.

PermalinkCommentsunicode charset flickr photo image exif programming reference xmp technical

English Shellcode

2009 Nov 27, 6:10"What follows is a brief description of the method we have developed for encoding arbitrary shellcode as English text. This English shellcode is completely self-contained, i.e., it does not require an external loader, and executes as valid IA32 code."PermalinkCommentssecurity polyglot intel paper research programming hack obfuscation english language technical system:filetype:pdf system:media:document

Coding Horror: The Paper Data Storage Option

2009 Aug 3, 11:06"But how efficient is the alphabet at encoding information on a page?"PermalinkCommentsvia:ericlaw humor paper storage encoding

PowerShell Scanning Script

2009 Jun 27, 3:42

I've hooked up the printer/scanner to the Media Center PC since I leave that on all the time anyway, so we can have a networked printer. I wanted to hook up the scanner in a somewhat similar fashion, but I didn't want to install HP's software (other than the drivers, of course). So I've written my own script for scanning in PowerShell that does the following:

  1. Scans using the Windows Image Acquisition APIs via COM
  2. Runs OCR on the image using Microsoft Office Document Imaging via COM (which may already be on your PC if you have Office installed)
  3. Converts the image to JPEG using .NET Image APIs
  4. Stores the OCR text into the EXIF comment field using .NET Image APIs (which means Windows Search can index the image by the text in the image)
  5. Moves the image to the public share

Here's the actual code from my scan.ps1 file:

param([Switch] $ShowProgress, [switch] $OpenCompletedResult)

$filePathTemplate = "C:\users\public\pictures\scanned\scan {0} {1}.{2}";
$time = get-date -uformat "%Y-%m-%d";

# Load System.Drawing for the JPEG conversion and EXIF property work below
[void]([reflection.assembly]::loadfile( "C:\Windows\Microsoft.NET\Framework\v2.0.50727\System.Drawing.dll"))

# Step 1: connect to the scanner via the Windows Image Acquisition (WIA) COM API
$deviceManager = new-object -ComObject WIA.DeviceManager
$device = $deviceManager.DeviceInfos.Item(1).Connect();

foreach ($item in $device.Items) {
        $fileIdx = 0;
        while (test-path ($filePathTemplate -f $time,$fileIdx,"*")) {
                [void](++$fileIdx);
        }

        if ($ShowProgress) { "Scanning..." }

        $image = $item.Transfer();
        $fileName = ($filePathTemplate -f $time,$fileIdx,$image.FileExtension);
        $image.SaveFile($fileName);
        clear-variable image

        if ($ShowProgress) { "Running OCR..." }

        # Step 2: run OCR via the Microsoft Office Document Imaging (MODI) COM API
        $modiDocument = new-object -comobject modi.document;
        $modiDocument.Create($fileName);
        $modiDocument.OCR();
        if ($modiDocument.Images.Count -gt 0) {
                $ocrText = $modiDocument.Images.Item(0).Layout.Text.ToString().Trim();
                $modiDocument.Close();
                clear-variable modiDocument

                if (!($ocrText.Equals(""))) {
                        $fileAsImage = New-Object -TypeName system.drawing.bitmap -ArgumentList $fileName
                        # Step 3: convert to JPEG if the scanner didn't already produce one
                        if (!($fileName.EndsWith(".jpg") -or $fileName.EndsWith(".jpeg"))) {
                                if ($ShowProgress) { "Converting to JPEG..." }

                                $newFileName = ($filePathTemplate -f $time,$fileIdx,"jpg");
                                $fileAsImage.Save($newFileName, [System.Drawing.Imaging.ImageFormat]::Jpeg);
                                $fileAsImage.Dispose();
                                del $fileName;

                                $fileAsImage = New-Object -TypeName system.drawing.bitmap -ArgumentList $newFileName 
                                $fileName = $newFileName
                        }

                        if ($ShowProgress) { "Saving OCR Text..." }

                        # Step 4: write the OCR text into the image's comment property
                        # (id 40092, a byte array of UTF-16 text) so Windows Search can index it
                        $property = $fileAsImage.PropertyItems[0];
                        $property.Id = 40092;
                        $property.Type = 1;
                        $property.Value = [system.text.encoding]::Unicode.GetBytes($ocrText);
                        $property.Len = $property.Value.Count;
                        $fileAsImage.SetPropertyItem($property);
                        $fileAsImage.Save(($fileName + ".new"));
                        $fileAsImage.Dispose();
                        del $fileName;
                        ren ($fileName + ".new") $fileName
                }
        }
        else {
                $modiDocument.Close();
                clear-variable modiDocument
        }

        if ($ShowProgress) { "Done." }

        # Either open the completed scan or emit a file object annotated with the OCR text
        if ($OpenCompletedResult) {
                . $fileName;
        }
        else {
                $result = dir $fileName;
                $result | add-member -membertype noteproperty -name OCRText -value $ocrText
                $result
        }
}

I ran into a few issues:

PermalinkCommentstechnical scanner ocr .net modi powershell office wia

Infrared Paint Link Roundup

2009 May 29, 2:50

I like the idea of QR codes, encoding URLs and placing them on real-world objects, but the QR codes themselves are kind of ugly. To make them less obvious, I thought I could spray QR codes onto an object with an infrared-reflective paint and shine infrared light on the QR codes, since most cameras, for instance the camera in my G1 phone, pick up infrared that our eyes do not.

In my search for infrared paint I've found a seller of IR ink (via programming forum) and an Infrared Paint Recipe (via IR FAQ).

In looking for this paint I've found that it comes up a lot in relation to the military for things like paint markers that are visible at night with proper equipment, and paint that absorbs IR light to make vehicles less obvious to night vision goggles. Even though the first reflects infrared light and the second absorbs it, websites end up referring to both as infrared paint, which made it difficult to search.

Additionally I found links to some other geeky infrared projects:

PermalinkCommentsir paint technical ir infrared qr qr code

[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009 Apr 23, 1:35"This e-mail is an attempt to give a relatively concise yet reasonably complete overview of non-Unicode character sets and encodings for 'Chinese characters', excluding those which are not supported by at least one of the four browsers IE, Safari, Firefox and Opera (henceforth 'all browsers'), and tentatively avoiding technical details which are out of scope for HTML5 unless they are important to gain a general understanding of the relevant issues."PermalinkCommentshtml html5 iso-2022 charset encoding character unicode cjk

URLs are tough - Anne's Weblog

2009 Apr 7, 1:30

I really dislike how IE deals with non-US-ASCII in URLs. I should write up a post on what exactly IE does with non-US-ASCII characters in URLs. "Just like IRIs the URL is mapped to a URI using UTF-8. Except for the query component of the URL (the bit after the question mark). Here for legacy reasons the encoding of the document is used instead. Except if the encoding of the document is UTF-16, in which case UTF-8 is used. Effectively, using non-ASCII characters in URLs in documents not encoded as UTF-8 or UTF-16 will give you surprising results, to say the least. Yay for browsers!"

PermalinkCommentshttp encoding html5 url uri unicode iri

Sorting it all Out : What do you get when you combine a base character with a buttload of diacritics?

2009 Mar 6, 11:47"Anyway, I decided to take the letter a and put as many different diacritics on it as I could." Micahel Kaplan sticks like 80 diacritics on the letter 'a'. Awesome.PermalinkCommentsencoding unicode diacritic language letter michael-kaplan

The 'Is It UTF-8?' Quick and Dirty Test

2009 Mar 6, 5:16

I've found while debugging networking in IE it's often useful to quickly tell if a string is encoded in UTF-8. You can check for the Byte Order Mark (EF BB BF in UTF-8), but I rarely see the BOM on UTF-8 strings. Instead I apply a quick and dirty UTF-8 test that takes advantage of the well-formed UTF-8 restrictions.

Unlike other multibyte character encoding forms (see Windows supported character sets or IANA's list of character sets), for example Big5, where sticking together any two bytes is more likely than not to give a valid byte sequence, UTF-8 is more restrictive. And unlike other multibyte character encodings, UTF-8 bytes may be taken out of context and one can still know that it's a single-byte character, the starting byte of a three-byte sequence, etc.

The full rules for well-formed UTF-8 are a little too complicated for me to commit to memory. Instead I've got my own simpler (this is the quick part) set of rules that will be mostly correct (this is the dirty part). For as many bytes in the string as you care to examine, check the most significant hex digit of the byte:

F:     This is byte 1 of a 4-byte encoded codepoint and must be followed by 3 trail bytes.
E:     This is byte 1 of a 3-byte encoded codepoint and must be followed by 2 trail bytes.
C..D:  This is byte 1 of a 2-byte encoded codepoint and must be followed by 1 trail byte.
8..B:  This is a trail byte.
0..7:  This is a single-byte encoded codepoint.
The simpler rules can produce false positives in some cases: that is, they'll say a string is UTF-8 when in fact it might not be. But they won't produce false negatives. The following table from the Unicode spec describes actual well-formed UTF-8:
Code Points          1st Byte  2nd Byte  3rd Byte  4th Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF    80..BF
U+0800..U+0FFF       E0        A0..BF    80..BF
U+1000..U+CFFF       E1..EC    80..BF    80..BF
U+D000..U+D7FF       ED        80..9F    80..BF
U+E000..U+FFFF       EE..EF    80..BF    80..BF
U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF
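
As a rough sketch (my own illustration, not code from any browser), the simplified rules above translate into C# along these lines:

static class Utf8Check
{
    // Quick and dirty UTF-8 test using the simplified per-nibble rules above.
    // It can produce false positives but should not produce false negatives.
    public static bool LooksLikeUtf8(byte[] bytes)
    {
        for (int i = 0; i < bytes.Length; )
        {
            int nibble = bytes[i] >> 4;              // most significant hex digit of the byte
            int trailBytes;
            if (nibble <= 0x7)      trailBytes = 0;  // 0..7: single byte codepoint
            else if (nibble == 0xE) trailBytes = 2;  // E: byte 1 of a 3 byte codepoint
            else if (nibble == 0xF) trailBytes = 3;  // F: byte 1 of a 4 byte codepoint
            else if (nibble >= 0xC) trailBytes = 1;  // C..D: byte 1 of a 2 byte codepoint
            else return false;                       // 8..B: a trail byte with no lead byte

            for (int j = 1; j <= trailBytes; j++)
            {
                // Each expected trail byte must exist and have a most significant nibble of 8..B.
                if (i + j >= bytes.Length || (bytes[i + j] >> 4) < 0x8 || (bytes[i + j] >> 4) > 0xB)
                    return false;
            }
            i += 1 + trailBytes;
        }
        return true;
    }
}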

PermalinkCommentstest technical unicode boring charset utf8 encoding

GIVE [dive into mark]

2009 Jan 26, 2:12

Mark Pilgrim's series of articles and slides from a corresponding talk on video encoding.

PermalinkCommentsmark-pilgrim video encoding audio reference codec

Official Google Blog: Moving to Unicode 5.1

2008 May 7, 4:24

Woo Unicode! "For the first time, we found that Unicode was the most frequent encoding found on web pages, overtaking both ASCII and Western European encodings"

PermalinkCommentsgoogle encoding i18n utf8 unicode ascii web

URI Fragment Info Roundup

2008 Apr 21, 11:53

['Neverending story' by Alexandre Duret-Lutz. A framed photo of books with the droste effect applied. Licensed under creative commons.]

Information about URI fragments, the portion of a URI that follows the '#' at the end and is used to navigate within a document, is scattered throughout various documents which I usually have to hunt down. Instead I'll link to them all here.

Definitions. Fragments are defined in the URI RFC which states that they're used to identify a secondary resource that is related to the primary resource identified by the URI as a subset of the primary, a view of the primary, or some other resource described by the primary. The interpretation of a fragment is based on the mime type of the primary resource. Tim Berners-Lee notes that determining fragment meaning from mime type is a problem because a single URI may contain a single fragment; however, over HTTP a single URI can result in the same logical resource represented in different mime types. So there's one fragment but multiple mime types and so multiple interpretations of the one fragment. The URI RFC says that if an author has a single resource available in multiple mime types then the author must ensure that the various representations all resolve fragments to the same logical secondary resource. Depending on which mime types you're dealing with, this is either not easy or not possible.

HTTP. In HTTP, when URIs are used, the fragment is not included. The General Syntax section of the HTTP standard says it uses the definitions of 'URI-reference' (which includes the fragment), 'absoluteURI', and 'relativeURI' (which don't include the fragment) from the URI RFC. However, the 'URI-reference' term doesn't actually appear in the BNF for the protocol. Accordingly, the headers like 'Request-URI', 'Content-Location', 'Location', and 'Referer' which include URIs are defined with 'absoluteURI' or 'relativeURI' and don't include the fragment. This is in keeping with the original fragment definition which says that the fragment is used as a view of the original resource and consequently only needed for resolution on the client. Additionally, the URI RFC explicitly notes that not including the fragment is a privacy feature such that page authors won't be able to stop clients from viewing whatever fragments the client chooses. This seems like an odd claim: if the author wanted to selectively restrict access to portions of a document, there are other options, like breaking the restricted parts out into separate resources.

HTML. In HTML, the HTML mime type RFC defines HTML's fragment use which consists of fragments referring to elements with a corresponding 'id' attribute or one of a particular set of elements with a corresponding 'name' attribute. The HTML spec discusses fragment use additionally noting that the names and ids must be unique in the document and that they must consist of only US-ASCII characters. The ID and NAME attributes are further restricted in section 6 to only consist of alphanumerics, the hyphen, period, colon, and underscore. This is a subset of the characters allowed in the URI fragment so no encoding is discussed since technically it's not needed. However, practically speaking, browsers like Firefox and Internet Explorer allow for names and ids containing characters outside of the defined set, including characters that must be percent-encoded to appear in a URI fragment. The interpretation of percent-encoded characters in fragments for HTML documents is not consistent across browsers (or in some cases within the same browser), especially for the percent-encoded percent.

Text. Text/plain recently got a fragment definition that allows fragments to refer to particular lines or characters within a text document. The scheme no longer includes regular expressions, which disappointed me at first, but in retrospect is probably a good idea for increasing the adoption of this fragment scheme and for avoiding the potential for ubiquitous DoS via regex. One of the authors also notes this on his blog. I look forward to the day when this scheme is widely implemented.

XML. XML has the XPointer framework to define its fragment structure as noted by the XML mime type definition. XPointer consists of a general scheme that contains subschemes that identify a subset of an XML document. It's too bad such a thing wasn't adopted for URI fragments in general to solve the problem of a single resource with multiple mime type representations. I wrote more about XPointer when I worked on hacking XPointer into IE.

SVG and MPEG. Through the Media Fragments Working Group I found a couple more fragment scheme definitions. SVG's fragment scheme is defined in the SVG documentation and looks similar to XML's. MPEG has one defined, but I could only find it as an ISO document, "Text of ISO/IEC FCD 21000-17 MPEG-12 FID", and not as an RFC, which is a little disturbing.

AJAX. AJAX websites have used fragments as an escape hatch for two issues that I've seen. The first is getting a unique URL for versions of a page that are produced on the client by script. The fragment may be changed by script without forcing the page to reload. This goes outside the rules of the standards by using HTML fragments in a fashion not called out by the HTML spec., but it does seem to be in line with the spirit of the fragment in that it is a subview of the original resource and interpreted client side. The other, hackier use of the fragment in AJAX is for cross-domain communication. The basic idea is that different frames or windows may not communicate in normal fashions if they have different domains, but they can view each other's URLs and accordingly can change their own fragments in order to send a message out to those who know where to look. IMO this is not in line with the spirit of the fragment but is rather a cool hack.

PermalinkCommentsxml text ajax technical url boring uri fragment rfc

Encoding methods in C#

2008 Apr 12, 10:38

For Encode-O-Matic, my encoding tool written in C#, I had to figure out the appropriate DllImport declarations to use the IDN Win32 functions, which was a pain. To spare others that pain, here are the two files CharacterSetEncoding.cs and NationalLanguageSupportUtilities.cs that declare the DllImports for IdnToUnicode, IdnToAscii, NormalizeString, MultiByteToWideChar, and WideCharToMultiByte.
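
For example, a stripped-down declaration for just IdnToAscii might look like the following. This is an illustrative sketch (the class name here is made up); the two linked files contain the complete, tested declarations:

using System.Runtime.InteropServices;
using System.Text;

static class IdnNative
{
    // IdnToAscii lives in Normaliz.dll and converts a Unicode domain name
    // to its ASCII (Punycode) form.
    [DllImport("Normaliz.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    public static extern int IdnToAscii(
        uint dwFlags,
        string lpUnicodeCharStr,
        int cchUnicodeChar,
        StringBuilder lpASCIICharStr,
        int cchASCIIChar);
}

As with most Win32 string APIs, calling it with a zero-length output buffer returns the required buffer size, so you typically call it twice.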

PermalinkCommentsencodeomatic boring csharp widechartomultibyte idn tool dllimport

RFC 2231 MIME Parameter Value and Encoded Word Extensions: Character Sets, Languages, and Continuations

2008 Mar 8, 11:44"This memo defines extensions to the RFC 2045 media type and RFC 2183 disposition parameter value mechanisms to provide ... a means to specify parameter values in character sets other than US-ASCII..."PermalinkCommentshttp http-header rfc standard reference ietf mime encoding charset language content-disposition

i18n: languages, countries and character sets

2007 Nov 7, 4:28

Out of date W3C document containing stats on frequency of use of various charsets in HTML pages (in 1997).

PermalinkCommentscharset encoding i18n language reference w3c statistics

RFC 3548 The Base16, Base32, and Base64 Data Encodings

2007 Nov 5, 4:33

Syntax of base64.

PermalinkCommentsbase64 encoding syntax reference rfc standard