2009 Sep 10, 8:22
Geoff Nunberg investigates metadata issues in Google Books, and the Google Books team manager responds in the comments. Apparently metadata is bad everywhere, and the problem is not new to the Web,
user-generated content, or tagging. It's like finding Feynman lectures categorized as Death Metal on Napster back in the day.
language google library metadata catalog
2009 Sep 3, 7:17
"This specification defines a lossless compressed data format that compresses data using a combination of the LZ77 algorithm and Huffman coding." Also see RFC 1950 zlib, a wrapper compression format
that can use deflate, and RFC 1952 gzip, a compressed file format that can use deflate.
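To make the relationship between the three formats concrete, here is a minimal Python sketch using the standard zlib module. The helper name deflate and the sample payload are made up for illustration; the wbits values (-15 for raw deflate, 15 for the zlib wrapper, 31 for the gzip wrapper) are the ones the Python zlib documentation describes.

```python
import zlib


def deflate(data: bytes, wbits: int) -> bytes:
    """Compress data with deflate, using the container selected by wbits."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return compressor.compress(data) + compressor.flush()


# Made-up sample payload; anything repetitive compresses well.
payload = b"metadata is bad everywhere. " * 50

raw = deflate(payload, -15)          # bare deflate stream (RFC 1951)
zlib_wrapped = deflate(payload, 15)  # zlib header + Adler-32 trailer (RFC 1950)
gzip_wrapped = deflate(payload, 31)  # gzip header + CRC-32 trailer (RFC 1952)

# Each container decompresses back to the same bytes when told which one it is.
for blob, wbits in ((raw, -15), (zlib_wrapped, 15), (gzip_wrapped, 31)):
    assert zlib.decompress(blob, wbits) == payload

print(len(payload), len(raw), len(zlib_wrapped), len(gzip_wrapped))
```

The wrappers only add a header and a checksum around the deflate stream, which is why the raw form comes out smallest.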
technical rfc ietf compression http deflate gzip zlib
2009 Sep 1, 4:25
"Each unit has a stable URI, making it possible to link to it from your own domain models in a reliable way. For each unit, the ontology defines some useful metadata including abbreviation, a link to
DBpedia and a categorization of units into groups, such as length units."
semanticweb via:connolly web unit conversion uri technical
2009 Aug 31, 4:22
"This document is intended to describe a HTTP Archive format that should be used when exporting data from Firebug Net panel. The current version of the format isn't finalized and is open for further
proposals."
http fiddler debug format firebug technical via:mnot
2009 Aug 12, 8:08
"In a formal academic paper, every claim is referenced to another academic paper... This convention gives us an opportunity to study how ideas spread, and myths grow, because in theory you could
trace who references what, and how, to see an entire belief system evolve from the original data."
science meme research health medicine ben-goldacre network graph
2009 Aug 12, 5:02
The W3C File API makes it to its first published working draft. I like the use of data URLs, but I don't like the new filedata URLs.
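For anyone who hasn't played with data URLs, here is a rough Python sketch of what one looks like, following the data: scheme from RFC 2397. The function names make_data_url and parse_data_url are hypothetical helpers for this illustration and are not part of the File API draft.

```python
import base64
from urllib.parse import unquote_to_bytes


def make_data_url(payload: bytes, mime: str = "application/octet-stream") -> str:
    """Embed the bytes directly in a base64 data: URL."""
    encoded = base64.b64encode(payload).decode("ascii")
    return "data:{};base64,{}".format(mime, encoded)


def parse_data_url(url: str):
    """Return (media type, bytes) for a data: URL, base64 or percent-encoded."""
    if not url.startswith("data:"):
        raise ValueError("not a data URL")
    header, _, body = url[len("data:"):].partition(",")
    if header.endswith(";base64"):
        return header[:-len(";base64")], base64.b64decode(body)
    return header, unquote_to_bytes(body)


url = make_data_url(b"hello", "text/plain")
print(url)
print(parse_data_url(url))
```

The appeal is that the URL carries the file's content itself, so nothing else has to resolve it.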
html5 w3c file upload script url data-scheme technical
2009 Aug 11, 5:21
"Michael Niggel took a look at Journey Under the Sea, and mapped out all possible paths. It turns out that death and unfavorable endings are in fact much more likely than the rest."
visualization via:ethan_t_hein literature fiction if interactive flowchart infographics chooseyourownadventure
2009 Aug 5, 7:57
"Ten times smaller than barcodes, Bokodes’ low-cost optical design can be read from as far as 4 meters away, much farther than barcodes, by taking an out-of-focus photo with any off-the-shelf
camera." I'd love for stuff like this to catch on. However, compared to QR codes and other barcodes, these are much more difficult to produce, in that you can't just print them out, and they require a change in
photography technique (the photo must be out of focus) rather than just analyzing any photograph of a barcode. The two seem to be solving slightly different problems.
qrcode qr barcode camera information design bokode augmented-reality technical
2009 Aug 3, 11:06
"But how efficient is the alphabet at encoding information on a page?"
via:ericlaw humor paper storage encoding
2009 Jul 31, 6:09
An interactive touchable table to help you browse and select fonts.
art visualization design font typography surface table touchscreen video
2009 Jul 27, 5:34
"This specification provides an API used to prompt the user with a file selection dialogue and obtain the data contained in files on the user's file system."
web w3c api upload script dom technical
2009 Jul 19, 11:44
I've redone my blog's layout to remind myself how terrible CSS is -- err I mean to play
with the more advanced features of CSS 2.1 which are all now available in IE8. As part of the new layout I've included my Delicious links by default but at a smaller size and I've replaced the
navigation list options with Technical, Personal and Everything as I've heard from folks that that would actually be useful. Besides the layout I've also updated the back-end, switching from my
handmade PHP+XSLT+RSS/Atom monster to a slightly less horrible PHP+DB solution. As a result everything should be much much faster including search which, incidentally, is so much easier to
implement outside of XSLT.
blog database redesign xslt mysql homepage
2009 Jul 17, 4:36
"For Windows 7, we’ve added support for Federated Search using OpenSearch v1.1 and worked to make the experience a seamless one." Explorer in Win7 supports OpenSearch descriptions (that use RSS).
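As a rough idea of what such a description contains, here is a Python sketch that builds a minimal OpenSearch 1.1 description document whose results come back as RSS. The element names and namespace come from the OpenSearch 1.1 spec; the example.com URL, the names passed in, and the helper build_description are placeholders, and anything Windows 7 needs beyond the basic spec is not covered here.

```python
import xml.etree.ElementTree as ET

OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"


def build_description(short_name: str, description: str, template: str) -> str:
    """Build a minimal OpenSearch description whose results are an RSS feed."""
    ET.register_namespace("", OPENSEARCH_NS)
    root = ET.Element("{%s}OpenSearchDescription" % OPENSEARCH_NS)
    ET.SubElement(root, "{%s}ShortName" % OPENSEARCH_NS).text = short_name
    ET.SubElement(root, "{%s}Description" % OPENSEARCH_NS).text = description
    # {searchTerms} is replaced by the client with the user's query.
    ET.SubElement(
        root,
        "{%s}Url" % OPENSEARCH_NS,
        type="application/rss+xml",
        template=template,
    )
    return ET.tostring(root, encoding="unicode")


# Placeholder service URL, for illustration only.
doc = build_description(
    "Example Search",
    "Searches an example site",
    "http://example.com/search?q={searchTerms}&format=rss",
)
print(doc)
```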
opensearch search windows win7 technical
2009 Jul 1, 6:21
"The QR code, used to store and decode small bits of data via printed symbol, received an artistic rendering by SET as part of its campaign for Marc by Marc Jacobs." I like the idea, although in this
case it's not very subtle or different from a regular QR code IMHO. Also, I was surprised that my phone could still read the QR code in this form.
qr qrcode marketing art internet mobile technical
2009 Jun 27, 3:42
I've hooked up the printer/scanner to the Media Center PC since I leave that on all the time anyway so we can have a networked printer. I wanted to hook up the scanner in a somewhat similar fashion
but I didn't want to install HP's software (other than the drivers of course). So I've written my own script for scanning in PowerShell that does the following:
- Scans using the Windows Image Acquisition APIs via COM
- Runs OCR on the image using Microsoft Office Document Imaging via COM (which may already be on your PC if you have Office installed)
- Converts the image to JPEG using .NET Image APIs
- Stores the OCR text into the EXIF comment field using .NET Image APIs (which means Windows Search can index the image by the text in the image)
- Moves the image to the public share
Here's the actual code from my scan.ps1 file:
param([Switch] $ShowProgress, [switch] $OpenCompletedResult)
# Output path template: {0} = date, {1} = per-day index, {2} = file extension.
$filePathTemplate = "C:\users\public\pictures\scanned\scan {0} {1}.{2}";
$time = get-date -uformat "%Y-%m-%d";
# Load System.Drawing for the image conversion and metadata APIs.
[void]([reflection.assembly]::loadfile( "C:\Windows\Microsoft.NET\Framework\v2.0.50727\System.Drawing.dll"))
# Connect to the first WIA device (WIA collections are 1-based).
$deviceManager = new-object -ComObject WIA.DeviceManager
$device = $deviceManager.DeviceInfos.Item(1).Connect();
foreach ($item in $device.Items) {
    # Reset per item so OCR text from a previous scan is never reused.
    $ocrText = "";
    # Find the first index not yet used today, regardless of extension.
    $fileIdx = 0;
    while (test-path ($filePathTemplate -f $time,$fileIdx,"*")) {
        [void](++$fileIdx);
    }
    if ($ShowProgress) { "Scanning..." }
    $image = $item.Transfer();
    $fileName = ($filePathTemplate -f $time,$fileIdx,$image.FileExtension);
    $image.SaveFile($fileName);
    clear-variable image
    if ($ShowProgress) { "Running OCR..." }
    $modiDocument = new-object -comobject modi.document;
    $modiDocument.Create($fileName);
    $modiDocument.OCR();
    if ($modiDocument.Images.Count -gt 0) {
        # Only the text of the first page is kept.
        $ocrText = $modiDocument.Images.Item(0).Layout.Text.ToString().Trim();
        $modiDocument.Close();
        clear-variable modiDocument
        if (!($ocrText.Equals(""))) {
            $fileAsImage = New-Object -TypeName system.drawing.bitmap -ArgumentList $fileName
            if (!($fileName.EndsWith(".jpg") -or $fileName.EndsWith(".jpeg"))) {
                if ($ShowProgress) { "Converting to JPEG..." }
                $newFileName = ($filePathTemplate -f $time,$fileIdx,"jpg");
                $fileAsImage.Save($newFileName, [System.Drawing.Imaging.ImageFormat]::Jpeg);
                # Dispose to release the file before deleting the original.
                $fileAsImage.Dispose();
                del $fileName;
                $fileAsImage = New-Object -TypeName system.drawing.bitmap -ArgumentList $newFileName
                $fileName = $newFileName
            }
            if ($ShowProgress) { "Saving OCR Text..." }
            # Reuse an existing property item as a template (assumes the image has at least one).
            # 40092 is the comment property ID; type 1 is a byte array holding UTF-16 text.
            $property = $fileAsImage.PropertyItems[0];
            $property.Id = 40092;
            $property.Type = 1;
            $property.Value = [system.text.encoding]::Unicode.GetBytes($ocrText);
            $property.Len = $property.Value.Count;
            $fileAsImage.SetPropertyItem($property);
            # Save under a temporary name, then swap it in for the original.
            $fileAsImage.Save(($fileName + ".new"));
            $fileAsImage.Dispose();
            del $fileName;
            ren ($fileName + ".new") $fileName
        }
    }
    else {
        $modiDocument.Close();
        clear-variable modiDocument
    }
    if ($ShowProgress) { "Done." }
    if ($OpenCompletedResult) {
        # Open the scan in its default viewer.
        invoke-item $fileName;
    }
    else {
        # Emit the file object with the OCR text attached as a note property.
        $result = dir $fileName;
        $result | add-member -membertype noteproperty -name OCRText -value $ocrText
        $result
    }
}
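The property written in the script (ID 40092, type 1) is a plain byte array holding the text as UTF-16LE, which is what [system.text.encoding]::Unicode produces. As far as I know, 40092 (0x9C9C) is the tag usually called XPComment; that name is my identification and does not appear in the script. Here is a small Python sketch of just the encoding step, with made-up helper names, in case you want to read or write the same field from another tool:

```python
XP_COMMENT_TAG = 0x9C9C  # 40092, the property ID used in the script


def encode_xp_comment(text: str) -> bytes:
    """Encode text the way the script does: UTF-16LE bytes, no BOM."""
    return text.encode("utf-16-le")


def decode_xp_comment(raw: bytes) -> str:
    """Decode the byte array back to text, dropping any trailing NULs."""
    return raw.decode("utf-16-le").rstrip("\x00")


raw = encode_xp_comment("Scanned receipt")
print(XP_COMMENT_TAG, len(raw), decode_xp_comment(raw))
```

Some writers append a two-byte NUL terminator to this field, which is why the decoder strips trailing NULs.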
I ran into a few issues:
- MODI doesn't seem to be in the Office 2010 Technical Preview I installed first. Installing Office 2007 fixed that.
- The MODI.Document class, at least via PowerShell, can't be instantiated in a 64-bit environment. To run the script on my 64-bit OS I had to start PowerShell from the 32-bit cmd.exe (C:\windows\syswow64\cmd.exe).
- I was planning to hook up my script to the scanner's 'Scan' button, but HP didn't get the button working for their Vista driver. Their workaround is "don't do that!".
- You must call Image.Dispose() to get .NET to release its reference to the corresponding image file.
- In trying to figure out how to store the text in the file's comment, I ran into a dead end trying to find the corresponding setter for GetDetailsOf, which folks like James O'Neil use in PowerShell for interesting ends.
technical scanner ocr .net modi powershell office wia
2009 Jun 8, 3:58
"Everyone can file bugs against HTML5, including you. To be clear, that something is filed in the W3C bug database does not mean it is likely it will be included."
html5 blog bug html w3c
2009 May 26, 11:28
"But Data.gov is different. It is primarily for machines, not people, at least as a first step. It is a catalog of various sets of data from government agencies. And the idea is to offer the data in
one of several standardized formats, ranging from a simple text file that can be read by a spreadsheet program to the XML format widely used these days for the exchange of information between Web
services. Other data is presented in formats that are meant to feed into mapping programs."
data nytimes xml government
2009 Apr 23, 4:46
Some lovely data visualizations. Is their Crimespotting visualization supposed to look like the map interface from GTA3SA? "Since 2001, Stamen has developed a reputation for beautiful and
technologically sophisticated projects in a diverse range of commercial and cultural settings."
blog web art visualization information interactive interface portfolio mashup
2009 Apr 20, 3:14
This site collects user-generated reports on (mostly) spam phone numbers. They have a RESTful API to get at that data too! I'm looking for more like this.
api phone spam search reference telemarketing telephone lookup
2009 Apr 10, 9:48
A while ago I promised to say how an xsltproc Meddler script would be useful. The general answer is that
it's useful for hooking up a client application that wants data from the web in a particular XML format when the data is available on the web but in another XML format. The specific case for this
post is a Flickr Search service that includes IE8 Visual Search Suggestions. IE8
wants the Visual Search Suggestions XML format, and Flickr gives out search data in its Flickr web API XML format.
So I wrote an XSLT to convert from Flickr Search XML to Visual Suggestions XML and used my xsltproc Meddler script to actually
apply this XSLT.
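The shape of that conversion can be sketched without XSLT. The Python below is an illustration and not the XSLT from this post. It assumes the Flickr REST response carries photo elements with id, owner, secret, server, farm and title attributes, that thumbnails follow the farm/server/id_secret_s.jpg URL pattern, and that the Visual Search Suggestions format uses SearchSuggestion, Query, Section, Item, Text, Image and Url elements in the namespace shown. All of those are from memory and should be checked against the Flickr and IE8 documentation. The sample data is made up.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for IE8 Visual Search Suggestions.
SUGGEST_NS = "http://schemas.microsoft.com/Search/2008/suggestions"

# Made-up sample shaped like a Flickr photo search response.
SAMPLE_FLICKR_XML = """<rsp stat="ok">
  <photos page="1" pages="1" perpage="2" total="2">
    <photo id="100" owner="42@N00" secret="abc" server="2" farm="1" title="Cat one"/>
    <photo id="101" owner="43@N00" secret="def" server="3" farm="1" title="Cat two"/>
  </photos>
</rsp>"""


def flickr_to_suggestions(flickr_xml: str, query: str) -> str:
    """Map each Flickr photo to one suggestion item with text, thumbnail, and link."""
    rsp = ET.fromstring(flickr_xml)
    ET.register_namespace("", SUGGEST_NS)

    def tag(name: str) -> str:
        return "{%s}%s" % (SUGGEST_NS, name)

    root = ET.Element(tag("SearchSuggestion"))
    ET.SubElement(root, tag("Query")).text = query
    section = ET.SubElement(root, tag("Section"))
    for photo in rsp.iter("photo"):
        attrs = photo.attrib
        item = ET.SubElement(section, tag("Item"))
        ET.SubElement(item, tag("Text")).text = attrs.get("title", "")
        # Assumed thumbnail URL pattern (75x75 square).
        thumb = "http://farm{farm}.static.flickr.com/{server}/{id}_{secret}_s.jpg".format(**attrs)
        ET.SubElement(item, tag("Image"), source=thumb,
                      alt=attrs.get("title", ""), width="75", height="75")
        ET.SubElement(item, tag("Url")).text = (
            "http://www.flickr.com/photos/{owner}/{id}".format(**attrs))
    return ET.tostring(root, encoding="unicode")


print(flickr_to_suggestions(SAMPLE_FLICKR_XML, "cats"))
```

An XSLT does the same mapping declaratively, which is what lets a generic xsltproc wrapper sit between the two services without any service-specific code.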
After getting this all working I've placed the result in two places: (1) I've updated the xsltproc Meddler script to include this XSLT and an
XML file to install it as a search provider, although you'll need to edit the XML to include your own Flickr API key. (2) I've created a service for this so you can just install the Flickr search provider if you're interested in having the functionality and don't care about the implementation. I've also added
accelerator preview support to the search provider to show the Flickr slideshow, which I think looks snazzy.
Doing a quick search for this, it looks like there's at least one other such implementation, but mine has the distinction of being done through XSLT, which I provide, with XML namespaces updated to work
with the released version of IE8, and I made it so you know it's good.
meddler xml ie8 xslt flickr technical boring search suggestions