I've just updated Encode-O-Matic with a Guess Input Encoding feature. When you start Encode-O-Matic or when you use the 'Guess Input Encoding' menu item from the 'Tools' menu, Encode-O-Matic will try out various combinations of encodings and guess at which set seem to apply to your input. For instance given the following text, Encode-O-Matic will correctly guess that it is percent encoded, base64 encoded, deflate compressed text:
S%2BWqUEhLLMoFUulFpXnZQLogMa%2BkmCuPqxzILk%2FMyeHK4QIA
It should work fairly well for simple things but I did pick 'Guess' for the name of the feature to intentionally lower
expectations. It doesn't currently apply to character encodings but that may be something to consider in the future.I've just put up an update for Encode-O-Matic with the following improvements:
I've hooked up the printer/scanner to the Media Center PC since I leave that on all the time anyway so we can have a networked printer. I wanted to hook up the scanner in a somewhat similar fashion but I didn't want to install HP's software (other than the drivers of course). So I've written my own script for scanning in PowerShell that does the following:
Here's the actual code from my scan.ps1 file:
param([Switch] $ShowProgress, [switch] $OpenCompletedResult)
$filePathTemplate = "C:\users\public\pictures\scanned\scan {0} {1}.{2}";
$time = get-date -uformat "%Y-%m-%d";
[void]([reflection.assembly]::loadfile( "C:\Windows\Microsoft.NET\Framework\v2.0.50727\System.Drawing.dll"))
$deviceManager = new-object -ComObject WIA.DeviceManager
$device = $deviceManager.DeviceInfos.Item(1).Connect();
foreach ($item in $device.Items) {
$fileIdx = 0;
while (test-path ($filePathTemplate -f $time,$fileIdx,"*")) {
[void](++$fileIdx);
}
if ($ShowProgress) { "Scanning..." }
$image = $item.Transfer();
$fileName = ($filePathTemplate -f $time,$fileIdx,$image.FileExtension);
$image.SaveFile($fileName);
clear-variable image
if ($ShowProgress) { "Running OCR..." }
$modiDocument = new-object -comobject modi.document;
$modiDocument.Create($fileName);
$modiDocument.OCR();
if ($modiDocument.Images.Count -gt 0) {
$ocrText = $modiDocument.Images.Item(0).Layout.Text.ToString().Trim();
$modiDocument.Close();
clear-variable modiDocument
if (!($ocrText.Equals(""))) {
$fileAsImage = New-Object -TypeName system.drawing.bitmap -ArgumentList $fileName
if (!($fileName.EndsWith(".jpg") -or $fileName.EndsWith(".jpeg"))) {
if ($ShowProgress) { "Converting to JPEG..." }
$newFileName = ($filePathTemplate -f $time,$fileIdx,"jpg");
$fileAsImage.Save($newFileName, [System.Drawing.Imaging.ImageFormat]::Jpeg);
$fileAsImage.Dispose();
del $fileName;
$fileAsImage = New-Object -TypeName system.drawing.bitmap -ArgumentList $newFileName
$fileName = $newFileName
}
if ($ShowProgress) { "Saving OCR Text..." }
$property = $fileAsImage.PropertyItems[0];
$property.Id = 40092;
$property.Type = 1;
$property.Value = [system.text.encoding]::Unicode.GetBytes($ocrText);
$property.Len = $property.Value.Count;
$fileAsImage.SetPropertyItem($property);
$fileAsImage.Save(($fileName + ".new"));
$fileAsImage.Dispose();
del $fileName;
ren ($fileName + ".new") $fileName
}
}
else {
$modiDocument.Close();
clear-variable modiDocument
}
if ($ShowProgress) { "Done." }
if ($OpenCompletedResult) {
. $fileName;
}
else {
$result = dir $fileName;
$result | add-member -membertype noteproperty -name OCRText -value $ocrText
$result
}
}
I ran into a few issues:
I like the idea of QR codes, encoding URLs and placing them on real world objects, but the QR codes themselves are kind of ugly. To make them less obvious I thought I could spray QR codes on to an object with an infrared reflective paint and shine infrared light on the QR codes, since most cameras, for instance the camera in my G1 phone, pick up infrared that our eyes do not.
In my search for infrared paint I've found a seller of IR ink (via programming forum) and an Infrared Paint Recipe (via IR FAQ).
In looking for this paint I've found that it comes up a lot in relation to the military for things like paint markers that are visible at night with proper equipment, and paint that absorbs IR light to make vehicles less obvious to night vision goggles. Even though the first reflects infrared light and the second absorbs it websites end up refering to both as infrared paint which made it difficult to search.
Additionally I found links to some other geeky infrared projects:
I've found while debugging networking in IE its often useful to quickly tell if a string is encoded in UTF-8. You can check for the Byte Order Mark (EF BB BF in UTF-8) but, I rarely see the BOM on UTF-8 strings. Instead I apply a quick and dirty UTF-8 test that takes advantage of the well-formed UTF-8 restrictions.
Unlike other multibyte character encoding forms (see Windows supported character sets or IANA's list of character sets), for example Big5, where sticking together any two bytes is more likely than not to give a valid byte sequence, UTF-8 is more restrictive. And unlike other multibyte character encodings, UTF-8 bytes may be taken out of context and one can still know that its a single byte character, the starting byte of a three byte sequence, etc.
The full rules for well-formed UTF-8 are a little too complicated for me to commit to memory. Instead I've got my own simpler (this is the quick part) set of rules that will be mostly correct (this is the dirty part). For as many bytes in the string as you care to examine, check the most significant digit of the byte:
Code Points | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
---|---|---|---|---|
U+0000..U+007F | 00..7F | |||
U+0080..U+07FF | C2..DF | 80..BF | ||
U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
U+D000..U+D7FF | ED | 80..9F | 80..BF | |
U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
Information about URI Fragments, the portion of URIs that follow the '#' at the end and that are used to navigate within a document, is scattered throughout various documents which I usually have to hunt down. Instead I'll link to them all here.
Definitions. Fragments are defined in the URI RFC which states that they're used to identify a secondary resource that is related to the primary resource identified by the URI as a subset of the primary, a view of the primary, or some other resource described by the primary. The interpretation of a fragment is based on the mime type of the primary resource. Tim Berners-Lee notes that determining fragment meaning from mime type is a problem because a single URI may contain a single fragment, however over HTTP a single URI can result in the same logical resource represented in different mime types. So there's one fragment but multiple mime types and so multiple interpretations of the one fragment. The URI RFC says that if an author has a single resource available in multiple mime types then the author must ensure that the various representations of a single resource must all resolve fragments to the same logical secondary resource. Depending on which mime types you're dealing with this is either not easy or not possible.
HTTP. In HTTP when URIs are used, the fragment is not included. The General Syntax section of the HTTP standard says it uses the definitions of 'URI-reference' (which includes the fragment), 'absoluteURI', and 'relativeURI' (which don't include the fragment) from the URI RFC. However, the 'URI-reference' term doesn't actually appear in the BNF for the protocol. Accordingly the headers like 'Request-URI', 'Content-Location', 'Location', and 'Referer' which include URIs are defined with 'absoluteURI' or 'relativeURI' and don't include the fragment. This is in keeping with the original fragment definition which says that the fragment is used as a view of the original resource and consequently only needed for resolution on the client. Additionally, the URI RFC explicitly notes that not including the fragment is a privacy feature such that page authors won't be able to stop clients from viewing whatever fragments the client chooses. This seems like an odd claim given that if the author wanted to selectively restrict access to portions of documents there are other options for them like breaking out the parts of a single resource to which the author wishes to restrict access into separate resources.
HTML. In HTML, the HTML mime type RFC defines HTML's fragment use which consists of fragments referring to elements with a corresponding 'id' attribute or one of a particular set of elements with a corresponding 'name' attribute. The HTML spec discusses fragment use additionally noting that the names and ids must be unique in the document and that they must consist of only US-ASCII characters. The ID and NAME attributes are further restricted in section 6 to only consist of alphanumerics, the hyphen, period, colon, and underscore. This is a subset of the characters allowed in the URI fragment so no encoding is discussed since technically its not needed. However, practically speaking, browsers like FireFox and Internet Explorer allow for names and ids containing characters outside of the defined set including characters that must be percent-encoded to appear in a URI fragment. The interpretation of percent-encoded characters in fragments for HTML documents is not consistent across browsers (or in some cases within the same browser) especially for the percent-encoded percent.
Text. Text/plain recently got a fragment definition that allows fragments to refer to particular lines or characters within a text document. The scheme no longer includes regular expressions, which disappointed me at first, but in retrospect is probably good idea for increasing the adoption of this fragment scheme and for avoiding the potential for ubiquitous DoS via regex. One of the authors also notes this on his blog. I look forward to the day when this scheme is widely implemented.
XML. XML has the XPointer framework to define its fragment structure as noted by the XML mime type definition. XPointer consists of a general scheme that contains subschemes that identify a subset of an XML document. Its too bad such a thing wasn't adopted for URI fragments in general to solve the problem of a single resource with multiple mime type representations. I wrote more about XPointer when I worked on hacking XPointer into IE.
SVG and MPEG. Through the Media Fragments Working Group I found a couple more fragment scheme definitions. SVG's fragment scheme is defined in the SVG documentation and looks similar to XML's. MPEG has one defined but I could only find it as an ISO document "Text of ISO/IEC FCD 21000-17 MPEG-12 FID" and not as an RFC which is a little disturbing.
AJAX. AJAX websites have used fragments as an escape hatch for two issues that I've seen. The first is getting a unique URL for versions of a page that are produced on the client by script. The fragment may be changed by script without forcing the page to reload. This goes outside the rules of the standards by using HTML fragments in a fashion not called out by the HTML spec. but it does seem to be inline with the spirit of the fragment in that it is a subview of the original resource and interpretted client side. The other hack-ier use of the fragment in AJAX is for cross domain communication. The basic idea is that different frames or windows may not communicate in normal fashions if they have different domains but they can view each other's URLs and accordingly can change their own fragments in order to send a message out to those who know where to look. IMO this is not inline with the spirit of the fragment but is rather a cool hack.
For Encode-O-Matic, my encoding tool written in C#, I had to figure out the appropriate DllImport declarations to use IDN Win32 functions which was a pain. To spare others that pain here's the two files CharacterSetEncoding.cs and NationalLanguageSupportUtilities.cs that declare the DllImports for IdnToUnicode, IdnToAscii, NormalizeString, MultiByteToWideChar, and WideCharToMultiByte.