data page 34 - Dave's Blog

Language Log » Google Books: A Metadata Train Wreck

2009 Sep 10, 8:22

Geoff Nunberg investigates metadata issues in Google Books, and the Google Books team manager responds in the comments. Apparently bad metadata is everywhere and is not an issue new to the Web, user-generated content, or tagging. It's like finding Feynman lectures categorized as Death Metal on Napster back in the day.

Tags: language google library metadata catalog

RFC 1951 - DEFLATE Compressed Data Format Specification version 1.3

2009 Sep 3, 7:17

"This specification defines a lossless compressed data format that compresses data using a combination of the LZ77 algorithm and Huffman coding." Also see RFC 1950 zlib, a wrapper compression format that can use deflate, and RFC 1952 gzip, a compressed file format that can use deflate.

Tags: technical rfc ietf compression http deflate gzip zlib
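
To make the relationship between the three RFCs concrete, here is a minimal sketch using Python's standard zlib module. The function names are my own for illustration; the only library feature relied on is the wbits parameter, which selects a raw DEFLATE stream, a zlib wrapper, or a gzip wrapper around the same compressed data.

```python
import zlib

# wbits selects the container around the DEFLATE data:
#   -15 -> raw DEFLATE stream (RFC 1951), no header or checksum
#    15 -> zlib wrapper (RFC 1950): 2-byte header + Adler-32 trailer
#    31 -> gzip wrapper (RFC 1952): gzip header + CRC-32 and size trailer
WBITS = {"deflate": -15, "zlib": 15, "gzip": 31}


def compress_three_ways(data: bytes) -> dict:
    """Compress the same bytes as raw DEFLATE, zlib, and gzip streams."""
    streams = {}
    for name, wbits in WBITS.items():
        compressor = zlib.compressobj(wbits=wbits)
        streams[name] = compressor.compress(data) + compressor.flush()
    return streams


def decompress(name: str, stream: bytes) -> bytes:
    """Decompress a stream produced by compress_three_ways."""
    return zlib.decompress(stream, wbits=WBITS[name])


if __name__ == "__main__":
    sample = b"Feynman lectures " * 20
    for name, stream in compress_three_ways(sample).items():
        assert decompress(name, stream) == sample
        # Show the size and the first two bytes of each container.
        print(name, len(stream), stream[:2].hex())
```

The gzip stream starts with the magic bytes 1f 8b, and the zlib stream is the raw DEFLATE stream plus six bytes of wrapper.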

Composing the Semantic Web: Units ontology with SPIN support published

2009 Sep 1, 4:25

"Each unit has a stable URI, making it possible to link to it from your own domain models in a reliable way. For each unit, the ontology defines some useful metadata including abbreviation, a link to DBpedia and a categorization of units into groups, such as length units."

Tags: semanticweb via:connolly web unit conversion uri technical

HTTP Tracing - Export Format - Firebug Working Group | Google Groups

2009 Aug 31, 4:22

"This document is intended to describe a HTTP Archive format that should be used when exporting data from Firebug Net panel. The current version of the format isn't finalized and is open for further proposals."

Tags: http fiddler debug format firebug technical via:mnot

How myths are made – Bad Science

2009 Aug 12, 8:08

"In a formal academic paper, every claim is referenced to another academic paper... This convention gives us an opportunity to study how ideas spread, and myths grow, because in theory you could trace who references what, and how, to see an entire belief system evolve from the original data."

Tags: science meme research health medicine ben-goldacre network graph

Ajaxian » W3C publish first working draft of File API

2009 Aug 12, 5:02

The W3C File API makes it to its first published working draft. I like the use of data URLs but don't like the new filedata URLs.

Tags: html5 w3c file upload script url data-scheme technical

Choose Your Own Adventure – Most Likely You’ll Die | FlowingData

2009 Aug 11, 5:21

"Michael Niggel took a look at Journey Under the Sea, and mapped out all possible paths. It turns out that death and unfavorable endings are in fact much more likely than the rest."

Tags: visualization via:ethan_t_hein literature fiction if interactive flowchart infographics chooseyourownadventure

The Future of Data Tags: Bokodes | Brain Pickings

2009 Aug 5, 7:57

"Ten times smaller than barcodes, Bokodes’ low-cost optical design can be read from as far as 4 meters away, much farther than barcodes, by taking an out-of-focus photo with any off-the-shelf camera." I'd love for stuff like this to catch on. However, compared to QR codes and other barcodes, these are much more difficult to produce: you can't just print them out, and they require a change in photography technique (the photo must be out of focus) rather than just analyzing any photograph of a barcode. They seem to be solving slightly different problems.

Tags: qrcode qr barcode camera information design bokode augmented-reality technical

Coding Horror: The Paper Data Storage Option

2009 Aug 3, 11:06

"But how efficient is the alphabet at encoding information on a page?"

Tags: via:ericlaw humor paper storage encoding

fontplore // an interactive application designed for searching and exploring font databases

2009 Jul 31, 6:09

An interactive touchable table to help you browse and select fonts.

Tags: art visualization design font typography surface table touchscreen video

W3C File Upload API

2009 Jul 27, 5:34

"This specification provides an API used to prompt the user with a file selection dialogue and obtain the data contained in files on the user's file system."

Tags: web w3c api upload script dom technical

Blog Layout and Implementation Improvements

2009 Jul 19, 11:44

Monticello, home of Thomas Jefferson, Charlottesville, Va. (LOC) from Flickr Commons

I've redone my blog's layout to remind myself how terrible CSS is -- err, I mean to play with the more advanced features of CSS 2.1, which are all now available in IE8. As part of the new layout I've included my Delicious links by default but at a smaller size, and I've replaced the navigation list options with Technical, Personal and Everything, as I've heard from folks that that would actually be useful. Besides the layout I've also updated the back-end, switching from my handmade PHP+XSLT+RSS/Atom monster to a slightly less horrible PHP+DB solution. As a result everything should be much, much faster, including search, which, incidentally, is so much easier to implement outside of XSLT.

Tags: blog database redisgn xslt mysql homepage

Engineering Windows 7 : Federating Windows Search with Enterprise Data Sources

2009 Jul 17, 4:36

"For Windows 7, we’ve added support for Federated Search using OpenSearch v1.1 and worked to make the experience a seamless one." Explorer in Win7 supports OpenSearch descriptions (that use RSS).

Tags: opensearch search windows win7 technical

Hand Drawn QR Code for Marc Jacobs - PSFK.com

2009 Jul 1, 6:21

"The QR code, used to store and decode small bits of data via printed symbol, received an artistic rendering by SET as part of its campaign for Marc by Marc Jacobs." I like the idea, although in this case it's not very subtle or different from a regular QR code IMHO. Also, I was surprised that my phone could still read the QR code in this form.

Tags: qr qrcode marketing art internet mobile technical

PowerShell Scanning Script

2009 Jun 27, 3:42

I've hooked up the printer/scanner to the Media Center PC, since I leave that on all the time anyway, so that we have a networked printer. I wanted to hook up the scanner in a similar fashion, but I didn't want to install HP's software (other than the drivers, of course). So I've written my own scanning script in PowerShell that does the following:

  1. Scans using the Windows Image Acquisition APIs via COM
  2. Runs OCR on the image using Microsoft Office Document Imaging via COM (which may already be on your PC if you have Office installed)
  3. Converts the image to JPEG using .NET Image APIs
  4. Stores the OCR text into the EXIF comment field using .NET Image APIs (which means Windows Search can index the image by the text in the image)
  5. Moves the image to the public share

Here's the actual code from my scan.ps1 file:

param([Switch] $ShowProgress, [switch] $OpenCompletedResult)

$filePathTemplate = "C:\users\public\pictures\scanned\scan {0} {1}.{2}";
$time = get-date -uformat "%Y-%m-%d";

[void]([reflection.assembly]::loadfile( "C:\Windows\Microsoft.NET\Framework\v2.0.50727\System.Drawing.dll"))

$deviceManager = new-object -ComObject WIA.DeviceManager
$device = $deviceManager.DeviceInfos.Item(1).Connect();

foreach ($item in $device.Items) {
        # Reset per item so a stale value from a previous item is never reported.
        $ocrText = "";
        $fileIdx = 0;
        while (test-path ($filePathTemplate -f $time,$fileIdx,"*")) {
                [void](++$fileIdx);
        }

        if ($ShowProgress) { "Scanning..." }

        $image = $item.Transfer();
        $fileName = ($filePathTemplate -f $time,$fileIdx,$image.FileExtension);
        $image.SaveFile($fileName);
        clear-variable image

        if ($ShowProgress) { "Running OCR..." }

        $modiDocument = new-object -comobject modi.document;
        $modiDocument.Create($fileName);
        $modiDocument.OCR();
        if ($modiDocument.Images.Count -gt 0) {
                $ocrText = $modiDocument.Images.Item(0).Layout.Text.ToString().Trim();
                $modiDocument.Close();
                clear-variable modiDocument

                if (!($ocrText.Equals(""))) {
                        $fileAsImage = New-Object -TypeName system.drawing.bitmap -ArgumentList $fileName
                        if (!($fileName.EndsWith(".jpg") -or $fileName.EndsWith(".jpeg"))) {
                                if ($ShowProgress) { "Converting to JPEG..." }

                                $newFileName = ($filePathTemplate -f $time,$fileIdx,"jpg");
                                $fileAsImage.Save($newFileName, [System.Drawing.Imaging.ImageFormat]::Jpeg);
                                $fileAsImage.Dispose();
                                del $fileName;

                                $fileAsImage = New-Object -TypeName system.drawing.bitmap -ArgumentList $newFileName 
                                $fileName = $newFileName
                        }

                        if ($ShowProgress) { "Saving OCR Text..." }

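                        # Reuse an existing PropertyItem as a template (this assumes the image has at least one).
                        $property = $fileAsImage.PropertyItems[0];
                        # 40092 (0x9C9C) is the Windows XPComment tag; type 1 is a byte array.
                        $property.Id = 40092;
                        $property.Type = 1;
                        # The comment is stored as UTF-16LE bytes.
                        $property.Value = [system.text.encoding]::Unicode.GetBytes($ocrText);
                        $property.Len = $property.Value.Count;
                        $fileAsImage.SetPropertyItem($property);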
                        $fileAsImage.Save(($fileName + ".new"));
                        $fileAsImage.Dispose();
                        del $fileName;
                        ren ($fileName + ".new") $fileName
                }
        }
        else {
                $modiDocument.Close();
                clear-variable modiDocument
        }

        if ($ShowProgress) { "Done." }

        if ($OpenCompletedResult) {
                invoke-item $fileName; # open the result in its associated application
        }
        else {
                $result = dir $fileName;
                $result | add-member -membertype noteproperty -name OCRText -value $ocrText
                $result
        }
}
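
The comment-writing step is the least obvious part of the script, so here is a minimal sketch of just the encoding, written in Python rather than PowerShell so it can be run anywhere. The function names and the dictionary layout are hypothetical; the tag id, type, and UTF-16LE encoding mirror what the script sets on the PropertyItem.

```python
XP_COMMENT_TAG = 0x9C9C  # 40092, the property id the script sets
EXIF_TYPE_BYTE = 1       # the property type the script sets (byte array)


def encode_xp_comment(text: str) -> dict:
    """Build the fields the script assigns to the PropertyItem.

    The value is UTF-16LE, matching [System.Text.Encoding]::Unicode.GetBytes.
    """
    value = text.encode("utf-16-le")
    return {
        "id": XP_COMMENT_TAG,
        "type": EXIF_TYPE_BYTE,
        "len": len(value),
        "value": value,
    }


def decode_xp_comment(value: bytes) -> str:
    """Turn the stored bytes back into text, dropping any trailing NULs."""
    return value.decode("utf-16-le").rstrip("\x00")


if __name__ == "__main__":
    prop = encode_xp_comment("Invoice 2009")
    print(prop["id"], prop["len"], decode_xp_comment(prop["value"]))
```

Each character in the Basic Multilingual Plane takes two bytes, so the length field is twice the character count for plain text like this.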

I ran into a few issues:

Tags: technical scanner ocr .net modi powershell office wia

Misattribution - Anne's Weblog

2009 Jun 8, 3:58

"Everyone can file bugs against HTML5, including you. To be clear, that something is filed in the W3C bug database does not mean it is likely it will be included."

Tags: html5 blog bug html w3c

Data.gov: Unlocking the Federal Filing Cabinets - Bits Blog - NYTimes.com

2009 May 26, 11:28

"But Data.gov is different. It is primarily for machines, not people, at least as a first step. It is a catalog of various sets of data from government agencies. And the idea is to offer the data in one of several standardized formats, ranging from a simple text file that can be read by a spreadsheet program to the XML format widely used these days for the exchange of information between Web services. Other data is presented in formats that are meant to feed into mapping programs."

Tags: data nytimes xml government

stamen design | big ideas worth pursuing

2009 Apr 23, 4:46

Some lovely data visualizations. Is their Crimespotting visualization supposed to look like the map interface from GTA3SA? "Since 2001, Stamen has developed a reputation for beautiful and technologically sophisticated projects in a diverse range of commercial and cultural settings."

Tags: blog web art visualization information interactive interface portfolio mashup

whocalled.us

2009 Apr 20, 3:14

This site collects user-generated reports on (mostly) spam phone numbers. They have a RESTful API to get at that data too! I'm looking for more like this.

Tags: api phone spam search reference telemarketing telephone lookup

Flickr Visual Search in IE8

2009 Apr 10, 9:48

A while ago I promised to say how an xsltproc Meddler script would be useful. The general answer is that it's useful for hooking up a client application that wants data from the web in a particular XML format when the data is available on the web but in another XML format. The specific case for this post is a Flickr Search service that includes IE8 Visual Search Suggestions. IE8 wants the Visual Search Suggestions XML format, and Flickr gives out search data in its Flickr web API XML format.

So I wrote an XSLT to convert from Flickr Search XML to Visual Suggestions XML and used my xsltproc Meddler script to actually apply this XSLT.
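
For a rough idea of what that conversion involves, here is a minimal sketch in Python rather than XSLT. It is not the stylesheet itself. The suggestions namespace, the element names, the Flickr photo attributes, and both URL patterns are assumptions written from memory of the two formats and should be checked against their documentation; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for the IE8 XML Search Suggestions format.
SUGGEST_NS = "http://schemas.microsoft.com/Search/2008/suggestions"


def flickr_to_suggestions(flickr_xml: str, query: str) -> str:
    """Convert a Flickr-style <rsp><photos><photo .../> response into
    a SearchSuggestion document with one Item per photo."""
    rsp = ET.fromstring(flickr_xml)
    # Serialize the suggestions namespace as the default namespace.
    ET.register_namespace("", SUGGEST_NS)

    def tag(name: str) -> str:
        return "{%s}%s" % (SUGGEST_NS, name)

    root = ET.Element(tag("SearchSuggestion"))
    ET.SubElement(root, tag("Query")).text = query
    section = ET.SubElement(root, tag("Section"))

    for photo in rsp.iter("photo"):
        item = ET.SubElement(section, tag("Item"))
        ET.SubElement(item, tag("Text")).text = photo.get("title", "")
        # Assumed URL patterns for the square thumbnail and the photo page.
        thumb = "http://farm%s.static.flickr.com/%s/%s_%s_s.jpg" % (
            photo.get("farm"), photo.get("server"),
            photo.get("id"), photo.get("secret"))
        page = "http://www.flickr.com/photos/%s/%s" % (
            photo.get("owner"), photo.get("id"))
        ET.SubElement(item, tag("Image"), source=thumb,
                      alt=photo.get("title", ""), width="75", height="75")
        ET.SubElement(item, tag("Url")).text = page

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = ('<rsp stat="ok"><photos>'
              '<photo id="1" owner="o" secret="s" server="2" farm="3" title="Cat"/>'
              '</photos></rsp>')
    print(flickr_to_suggestions(sample, "cat"))
```

The XSLT version does the same mapping declaratively: one template matches each photo element and emits the corresponding Item.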

After getting this all working I've placed the result in two places: (1) I've updated the xsltproc Meddler script to include this XSLT and an XML file to install it as a search provider, although you'll need to edit the XML to include your own Flickr API key. (2) I've created a service for this so you can just install the Flickr search provider if you're interested in having the functionality and don't care about the implementation. I've also added Accelerator preview support to the search provider to show the Flickr slideshow, which I think looks snazzy.

Doing a quick search for this, it looks like there's at least one other such implementation, but mine has the distinction of being done through XSLT, which I provide; of using updated XML namespaces that work with the released version of IE8; and of being made by me, so you know it's good.

Tags: meddler xml ie8 xslt flickr technical boring search suggestions

Creative Commons License. Some rights reserved.