2010 Feb 4, 2:01
On the crappy licensing of the H.264 and MPEG codecs in popular video encoding software.
video encoding codec patent legal law apple microsoft theora h.264 technical
2010 Jan 28, 4:20
Graph of encodings used by documents on the web. Unicode based encodings are thankfully on the rise.
unicode encoding web internationalization localization utf8 text html technical
2010 Jan 20, 8:28
GZip vs Deflate execution speeds. Deflate found to be much faster in particular cases and about the same in the rest.
gzip deflate performance technical http compression programming development blog
2010 Jan 8, 2:08
Flickr dev talks image metadata: the various forms, which to prefer, and how to guess at their character encodings.
unicode charset flickr photo image exif programming reference xmp technical
2009 Dec 15, 2:01
"Jeff Atwood (Coding Horror fame) was in for a horror when he realized that his server crashed and his data was gone and due to some reason, the backup mechanism was not working. ... So what should
Jeff do now? Since Coding horror is a high traffic blog, I think there is a way to get back at least some of the images." Reconstruct the HTML from Google's cache, change the HTTP server to tell
the client it has the correct cached image for all the images, add script to the HTML to grab the images and send them back. Awesome idea. Of course now I want to set up Fiddler to swap in random
images...
via:ericlaw jeff-atwood backup web http cache image javascript technical
2009 Nov 27, 6:10
"What follows is a brief description of the method we have developed for encoding arbitrary shellcode as English text. This English shellcode is completely self-contained, i.e., it does not require
an external loader, and executes as valid IA32 code."
security polyglot intel paper research programming hack obfuscation english language technical system:filetype:pdf system:media:document
2009 Sep 30, 5:12
Bjarne Stroustrup answers the age-old style questions like "int *p or int* p?" and "const int a or int const a?"
reference c++ faq style coding programming bjarne-stroustrup technical
2009 Sep 3, 7:17
"This specification defines a lossless compressed data format that compresses data using a combination of the LZ77 algorithm and Huffman coding." Also see RFC 1950 zlib, a wrapper compression format
that can use deflate, and RFC 1952 gzip, a compressed file format that can use deflate.
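To make the relationship between the three formats concrete, here is a minimal Python sketch (Python and its standard zlib module are my choice for illustration; nothing here comes from the RFCs themselves beyond the format names). It compresses the same bytes as a bare deflate stream, a zlib stream, and a gzip stream by changing only the wbits parameter. The compress helper and the sample data are made up for the example.

```python
import zlib

def compress(data: bytes, wbits: int) -> bytes:
    """Compress data with zlib; wbits selects the container around deflate."""
    compressor = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=wbits)
    return compressor.compress(data) + compressor.flush()

# Sample input invented for this sketch.
data = b"hello hello hello hello " * 20

raw_deflate = compress(data, -15)  # bare deflate stream, no header or checksum
zlib_stream = compress(data, 15)   # zlib wrapper: small header plus Adler-32 trailer
gzip_stream = compress(data, 31)   # gzip wrapper: larger header plus CRC-32 and length trailer

# The wrappers only add bytes around the deflate payload.
print(len(raw_deflate), len(zlib_stream), len(gzip_stream))

# Each stream round-trips when decompressed with the matching wbits value.
assert zlib.decompress(raw_deflate, wbits=-15) == data
assert zlib.decompress(zlib_stream, wbits=15) == data
assert zlib.decompress(gzip_stream, wbits=31) == data
```

The gzip stream starts with the magic bytes 1F 8B, which is a quick way to tell the two wrappers apart when looking at HTTP response bodies.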
technical rfc ietf compression http deflate gzip zlib
2009 Aug 3, 11:06
"But how efficient is the alphabet at encoding information on a page?"
via:ericlaw humor paper storage encoding
2009 Jun 27, 3:42
I've hooked up the printer/scanner to the Media Center PC since I leave that on all the time anyway so we can have a networked printer. I wanted to hook up the scanner in a somewhat similar fashion
but I didn't want to install HP's software (other than the drivers of course). So I've written my own script for scanning in PowerShell that does the following:
- Scans using the Windows Image Acquisition APIs via COM
- Runs OCR on the image using Microsoft Office Document Imaging via COM (which may already be on your PC if you have Office installed)
- Converts the image to JPEG using .NET Image APIs
- Stores the OCR text into the EXIF comment field using .NET Image APIs (which means Windows Search can index the image by the text in the image)
- Moves the image to the public share
Here's the actual code from my scan.ps1 file:
param([Switch] $ShowProgress, [switch] $OpenCompletedResult)

# Output path template: {0} = date, {1} = index, {2} = file extension.
$filePathTemplate = "C:\users\public\pictures\scanned\scan {0} {1}.{2}";
$time = get-date -uformat "%Y-%m-%d";

[void]([reflection.assembly]::loadfile( "C:\Windows\Microsoft.NET\Framework\v2.0.50727\System.Drawing.dll"))

# Connect to the first WIA device (WIA collections are 1-based).
$deviceManager = new-object -ComObject WIA.DeviceManager
$device = $deviceManager.DeviceInfos.Item(1).Connect();

foreach ($item in $device.Items) {
    # Pick the first index that isn't already used by a file from today.
    $fileIdx = 0;
    while (test-path ($filePathTemplate -f $time,$fileIdx,"*")) {
        [void](++$fileIdx);
    }

    if ($ShowProgress) { "Scanning..." }

    $image = $item.Transfer();
    $fileName = ($filePathTemplate -f $time,$fileIdx,$image.FileExtension);
    $image.SaveFile($fileName);
    clear-variable image

    if ($ShowProgress) { "Running OCR..." }

    $ocrText = "";  # reset so text from a previous item is never reused
    $modiDocument = new-object -comobject modi.document;
    $modiDocument.Create($fileName);
    $modiDocument.OCR();
    if ($modiDocument.Images.Count -gt 0) {
        $ocrText = $modiDocument.Images.Item(0).Layout.Text.ToString().Trim();
        $modiDocument.Close();
        clear-variable modiDocument

        if (!($ocrText.Equals(""))) {
            $fileAsImage = New-Object -TypeName system.drawing.bitmap -ArgumentList $fileName
            if (!($fileName.EndsWith(".jpg") -or $fileName.EndsWith(".jpeg"))) {
                if ($ShowProgress) { "Converting to JPEG..." }

                $newFileName = ($filePathTemplate -f $time,$fileIdx,"jpg");
                $fileAsImage.Save($newFileName, [System.Drawing.Imaging.ImageFormat]::Jpeg);
                # Dispose releases the lock on the original file so it can be deleted.
                $fileAsImage.Dispose();
                del $fileName;
                $fileAsImage = New-Object -TypeName system.drawing.bitmap -ArgumentList $newFileName
                $fileName = $newFileName
            }

            if ($ShowProgress) { "Saving OCR Text..." }

            # PropertyItem has no public constructor, so reuse an existing one as a template.
            # This assumes the image already has at least one property.
            $property = $fileAsImage.PropertyItems[0];
            $property.Id = 40092;   # 0x9C9C, the XPComment tag
            $property.Type = 1;     # byte array
            $property.Value = [system.text.encoding]::Unicode.GetBytes($ocrText);
            $property.Len = $property.Value.Count;
            $fileAsImage.SetPropertyItem($property);
            # Save under a temporary name, release the file, then swap it into place.
            $fileAsImage.Save(($fileName + ".new"));
            $fileAsImage.Dispose();
            del $fileName;
            ren ($fileName + ".new") $fileName
        }
    }
    else {
        $modiDocument.Close();
        clear-variable modiDocument
    }

    if ($ShowProgress) { "Done." }

    if ($OpenCompletedResult) {
        # Open the result with its default application.
        . $fileName;
    }
    else {
        # Otherwise emit the file object with the OCR text attached as a property.
        $result = dir $fileName;
        $result | add-member -membertype noteproperty -name OCRText -value $ocrText
        $result
    }
}
I ran into a few issues:
- MODI doesn't seem to be in the Office 2010 Technical Preview I installed first. Installing Office 2007 fixed that.
- The MODI.Document class, at least via PowerShell, can't be instantiated in a 64-bit environment. To run the script on my 64-bit OS I had to start PowerShell from the 32-bit cmd.exe
(C:\windows\syswow64\cmd.exe).
- I was planning to hook up my script to the scanner's 'Scan' button, but HP didn't get the button
working for their Vista driver. Their workaround is "don't do that!".
- You must call Image.Dispose() to get .NET to release its reference to the corresponding image file.
- In trying to figure out how to store the text in the file's comment, I ran into a dead end trying to find the corresponding setter for GetDetailsOf, which folks like James O'Neil use in PowerShell for interesting ends.
technical scanner ocr .net modi powershell office wia
2009 Jun 20, 9:43
How to use the WIA APIs in C#. WIA is the Windows API for getting images from scanners and cameras. And, as I found out, if you want to use the API in PowerShell try
'$deviceManager = new-object -ComObject WIA.DeviceManager'
video scanner api wia csharp howto programming camera image photo .net webcam technical
2009 May 29, 2:50
I like the idea of QR codes, encoding URLs and placing them on real world
objects, but the QR codes themselves are kind of ugly. To make them less obvious I thought I could spray QR codes on to an object with an infrared reflective paint and shine infrared light on the
QR codes, since most cameras, for instance the camera in my G1 phone, pick up infrared that our eyes do not.
In my search for infrared paint I've found a seller of IR ink (via programming forum) and an Infrared Paint Recipe (via IR FAQ).
In looking for this paint I've found that it comes up a lot in relation to the military for things like paint markers that are visible at
night with proper equipment, and paint that absorbs IR light to make vehicles less obvious to night vision goggles. Even though the first
reflects infrared light and the second absorbs it, websites end up referring to both as infrared paint, which made it difficult to search.
Additionally I found links to some other geeky infrared projects:
ir paint technical ir infrared qr qr code
2009 Apr 23, 1:35
"This e-mail is an attempt to give a relatively concise yet reasonably complete overview of non-Unicode character sets and encodings for 'Chinese characters', excluding those which are not supported
by at least one of the four browsers IE, Safari, Firefox and Opera (henceforth 'all browsers'), and tentatively avoiding technical details which are out of scope for HTML5 unless they are important
to gain a general understanding of the relevant issues."
html html5 iso-2022 charset encoding character unicode cjk
2009 Apr 7, 1:30
I really dislike how IE deals with non-US-ASCII in URLs. I should write up a post on what exactly IE does with non-US-ASCII characters in URLs. "Just like IRIs the URL is mapped to a URI using UTF-8.
Except for the query component of the URL (the bit after the question mark). Here for legacy reasons the encoding of the document is used instead. Except if the encoding of the document is UTF-16, in
which case UTF-8 is used. Effectively, using non-ASCII characters in URLs in documents not encoded as UTF-8 or UTF-16 will give you surprising results, to say the least. Yay for browsers!"
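The rule in that quote can be sketched in a few lines of Python. This is only an illustration of the quoted description, not of what IE or any other browser actually does: the to_uri function, its parameters, and the sample URL are made up for the example, and using errors="replace" for characters the document encoding can't represent is an arbitrary choice of mine.

```python
from urllib.parse import quote

def to_uri(path: str, query: str, document_encoding: str) -> str:
    """Percent-encode a URL following the rule described in the quote (hypothetical helper)."""
    # The path is always mapped using UTF-8.
    encoded_path = quote(path, safe="/", encoding="utf-8")
    # The query uses the document's encoding, except that UTF-16 documents use UTF-8.
    query_encoding = document_encoding
    if document_encoding.lower().startswith("utf-16"):
        query_encoding = "utf-8"
    # errors="replace" is an assumption of this sketch for unencodable characters.
    encoded_query = quote(query, safe="=&", encoding=query_encoding, errors="replace")
    return encoded_path + "?" + encoded_query

# "\u00e9" is the letter e with an acute accent.
print(to_uri("/caf\u00e9", "q=caf\u00e9", "utf-8"))         # /caf%C3%A9?q=caf%C3%A9
print(to_uri("/caf\u00e9", "q=caf\u00e9", "windows-1252"))  # /caf%C3%A9?q=caf%E9
print(to_uri("/caf\u00e9", "q=caf\u00e9", "utf-16"))        # /caf%C3%A9?q=caf%C3%A9
```

The middle case is the surprising one: the same character ends up as two different byte sequences within a single URL.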
http encoding html5 url uri unicode iri
2009 Mar 6, 11:47
"Anyway, I decided to take the letter a and put as many different diacritics on it as I could." Michael Kaplan sticks like 80 diacritics on the letter 'a'. Awesome.
encoding unicode diacritic language letter michael-kaplan
2009 Mar 6, 5:16
I've found while debugging networking in IE that it's often useful to quickly tell if a string is encoded in UTF-8. You can check for the Byte Order Mark (EF BB BF in UTF-8), but I rarely see the BOM on
UTF-8 strings. Instead I apply a quick and dirty UTF-8 test that takes advantage of the well-formed UTF-8 restrictions.
UTF-8 is more restrictive than other multibyte character encoding forms (see Windows supported character sets or IANA's list of character sets). In Big5, for example, sticking together any two bytes is more likely than not to give a valid byte sequence. And unlike
other multibyte character encodings, a UTF-8 byte may be taken out of context and one can still tell whether it's a single byte character, the starting byte of a three byte sequence, etc.
The full rules for well-formed UTF-8 are a little too complicated for me to commit to memory. Instead I've got my own simpler (this is the quick part) set of rules that will be mostly correct (this
is the dirty part). For as many bytes in the string as you care to examine, check the most significant hex digit of the byte:
- F: This is byte 1 of a 4 byte encoded codepoint and must be followed by 3 trail bytes.
- E: This is byte 1 of a 3 byte encoded codepoint and must be followed by 2 trail bytes.
- C..D: This is byte 1 of a 2 byte encoded codepoint and must be followed by 1 trail byte.
- 8..B: This is a trail byte.
- 0..7: This is a single byte encoded codepoint.
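The rules above can be sketched as a short Python function. This is just one way to write them down: the function name is made up, and treating a string that ends in the middle of a sequence as "not UTF-8" is my assumption (you may want to relax that if you only examine a prefix of the bytes).

```python
def looks_like_utf8(data: bytes) -> bool:
    """Quick and dirty check: classify each byte by its most significant hex digit."""
    trail_bytes_needed = 0
    for byte in data:
        high_digit = byte >> 4
        if trail_bytes_needed:
            # Inside a multibyte sequence only trail bytes (8..B) are allowed.
            if not 0x8 <= high_digit <= 0xB:
                return False
            trail_bytes_needed -= 1
        elif high_digit == 0xF:
            trail_bytes_needed = 3
        elif high_digit == 0xE:
            trail_bytes_needed = 2
        elif high_digit in (0xC, 0xD):
            trail_bytes_needed = 1
        elif high_digit > 0x7:
            return False  # trail byte (8..B) with no lead byte before it
        # Otherwise 0..7: a single byte encoded codepoint, nothing to track.
    # Assumption of this sketch: ending mid-sequence counts as not UTF-8.
    return trail_bytes_needed == 0

print(looks_like_utf8(b"caf\xc3\xa9"))       # True: C3 is a lead byte, A9 its trail byte
print(looks_like_utf8(b"caf\xe9 au lait"))   # False: E9 is not followed by two trail bytes
```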
The simpler rules can produce false positives in some cases: that is, they'll say a string is UTF-8 when in fact it might not be. But they won't produce false negatives. The following is the table
from the Unicode spec that actually describes well-formed UTF-8.
Code Points        | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte
U+0000..U+007F     | 00..7F   |          |          |
U+0080..U+07FF     | C2..DF   | 80..BF   |          |
U+0800..U+0FFF     | E0       | A0..BF   | 80..BF   |
U+1000..U+CFFF     | E1..EC   | 80..BF   | 80..BF   |
U+D000..U+D7FF     | ED       | 80..9F   | 80..BF   |
U+E000..U+FFFF     | EE..EF   | 80..BF   | 80..BF   |
U+10000..U+3FFFF   | F0       | 90..BF   | 80..BF   | 80..BF
U+40000..U+FFFFF   | F1..F3   | 80..BF   | 80..BF   | 80..BF
U+100000..U+10FFFF | F4       | 80..8F   | 80..BF   | 80..BF
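For comparison with the quick rules, the table above can be transcribed into a strict check. This is a minimal Python sketch; the names WELL_FORMED_ROWS and is_well_formed_utf8 are made up for the example, and in practice Python's own bytes.decode("utf-8") already enforces the same restrictions.

```python
# One entry per table row: (range of the 1st byte, ranges of the bytes that follow).
WELL_FORMED_ROWS = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEC), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xED, 0xED), [(0x80, 0x9F), (0x80, 0xBF)]),
    ((0xEE, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]

def is_well_formed_utf8(data: bytes) -> bool:
    """Return True only if every byte sequence in data matches a row of the table."""
    position = 0
    while position < len(data):
        for (first_low, first_high), following in WELL_FORMED_ROWS:
            if first_low <= data[position] <= first_high:
                break
        else:
            return False  # no row allows this value as a 1st byte
        rest = data[position + 1 : position + 1 + len(following)]
        if len(rest) != len(following):
            return False  # the data ends in the middle of a sequence
        for byte, (low, high) in zip(rest, following):
            if not low <= byte <= high:
                return False
        position += 1 + len(following)
    return True

# C0 80 passes the quick hex digit rules but is rejected here,
# because C0 never appears as a 1st byte in the table.
print(is_well_formed_utf8(b"\xc0\x80"))     # False
print(is_well_formed_utf8(b"caf\xc3\xa9"))  # True
```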
test technical unicode boring charset utf8 encoding
2009 Jan 26, 2:12
Mark Pilgrim's series of articles and slides from a corresponding talk on video encoding.
mark-pilgrim video encoding audio reference codec
2008 Oct 13, 2:40
Watch out for too good to be true washing services (or free network traffic anonymization): "The laundry would then send out "color coded" special discount tickets, to the effect of "get two loads
for the price of one," etc. The color coding was matched to specific streets and thus when someone brought in their laundry, it was easy to determine the general location from which a city map was
coded. While the laundry was indeed being washed, pressed and dry cleaned, it had one additional cycle -- every garment, sheet, glove, pair of pants, was first sent through an analyzer, located in
the basement, that checked for bomb-making residue." From the comment section of Schneier on Security on this topic: "Yet another example of how inexpensive, reliable home washers and dryers help
terrorists. When will we learn?"
security history laundromat ira terrorism bomb
2008 Oct 5, 9:17
Sarah asked me if I knew of a syntax highlighter for the QuickBase formula language which she uses at work. I couldn't find one but thought it might be fun to make a QuickBase Formula syntax highlighter based on the QuickBase help's
description of the formula syntax. Thankfully the language is relatively simple since my skills with ANTLR, the parser generator, are rusty now and I've only
used it previously for personal projects (like Javaish, the ridiculous Java based shell idea I had).
With the help of some great ANTLR examples and an ANTLR cheat
sheet I was able to come up with the grammar that parses the QuickBase Formula syntax and prints out the same formula marked up with HTML SPAN tags and various CSS classes. ANTLR produces the
parser in Java which I wrapped up in an applet, put in a jar, and embedded in an HTML page. The script in that page runs user input through the applet's parser and sticks the output at the bottom
of the page with appropriate CSS rules to highlight and print the formula in a pretty fashion.
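To give a feel for the output side of this (formula in, SPAN-wrapped HTML out), here is a toy Python sketch. It is not the ANTLR grammar described above and not the real QuickBase formula syntax: it swaps the generated parser for a single regular expression, and the token classes, CSS class names, and sample formula are all made up for illustration.

```python
import html
import re

# Hypothetical token classes for a toy formula language; not the real QuickBase grammar.
TOKEN_PATTERN = re.compile(
    r'(?P<string>"[^"]*")'
    r"|(?P<field>\[[^\]]*\])"
    r"|(?P<number>\d+(?:\.\d+)?)"
    r"|(?P<name>[A-Za-z_]\w*)"
    r"|(?P<other>.)",
    re.DOTALL,
)

def highlight(formula: str) -> str:
    """Return the formula as HTML with each recognized token wrapped in a SPAN."""
    pieces = []
    for match in TOKEN_PATTERN.finditer(formula):
        kind = match.lastgroup
        text = html.escape(match.group())
        if kind == "other":
            pieces.append(text)  # punctuation and whitespace pass through unstyled
        else:
            pieces.append(f'<span class="{kind}">{text}</span>')
    return "".join(pieces)

print(highlight('If([Status]="Done", 1, 0)'))
```

CSS rules keyed on the class names (for example a color for span.string) would then do the actual highlighting, as in the page described above.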
What I learned:
- I didn't realize that Java applets are easy to use via script in an HTML page. In the JavaScript I
can simply refer to publicly exposed methods on the applet and run JavaScript strings through them. It makes for a great combination: do the heavy coding in Java and do the UI in HTML. I may end up
doing this again in the future.
- I love ANTLRWorks, the ANTLR IDE, that didn't exist the last time I used ANTLR. It tells you about issues with your grammar as you create it,
lets you easily debug the grammar running it forwards and backwards, display parse trees, and other useful things.
java technical programming quickbase language antlr antlrworks
2008 May 7, 4:24
Woo Unicode! "For the first time, we found that Unicode was the most frequent encoding found on web pages, overtaking both ASCII and Western European encodings"
google encoding i18n utf8 unicode ascii web