language page 3 - Dave's Blog

Search
My timeline on Mastodon

Language Log - Send a private message to

2009 Mar 16, 4:23The underwhelming answer to the question of "What are the commonest five-word sequences on the Web?"PermalinkCommentslanguagelog culture internet web research language english

LDC Catalog - Web 1T 5-gram Version 1

2009 Mar 16, 4:22"This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. We expect this data will be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses." 6 DVDs for only $150 with licensing restri... ok nm.PermalinkCommentslanguage google statistics database text

Sorting it all Out : What do you get when you combine a base character with a buttload of diacritics?

2009 Mar 6, 11:47"Anyway, I decided to take the letter a and put as many different diacritics on it as I could." Micahel Kaplan sticks like 80 diacritics on the letter 'a'. Awesome.PermalinkCommentsencoding unicode diacritic language letter michael-kaplan

Cursebird: What the f#@! is everyone swearing about?

2009 Feb 10, 6:34Real time stats on folks cursing on Twitter. Shows percentage change in usage by curse word.PermalinkCommentstwitter humor language swearing mashup

Tab Expansion in PowerShell

2008 Nov 18, 6:38

PowerShell gives us a real CLI for Windows based around .Net stuff. I don't like the creation of a new shell language but I suppose it makes sense given that they want something C# like but not C# exactly since that's much to verbose and strict for a CLI. One of the functions you can override is the TabExpansion function which is used when you tab complete commands. I really like this and so I've added on to the standard implementation to support replacing a variable name with its value, tab completion of available commands, previous command history, and drive names (there not restricted to just one letter in PS).

Learning the new language was a bit of a chore but MSDN helped. A couple of things to note, a statement that has a return value that you don't do anything with is implicitly the return value for the current function. That's why there's no explicit return's in my TabExpansion function. Also, if you're TabExpansion function fails or returns nothing then the builtin TabExpansion function runs which does just filenames. This is why you can see that the standard TabExpansion function doesn't handle normal filenames: it does extra stuff (like method and property completion on variables that represent .Net objects) but if there's no fancy extra stuff to be done it lets the builtin one take a crack.

Here's my TabExpansion function. Probably has bugs, so watch out!


function EscapePath([string] $path, [string] $original)
{
    if ($path.Contains(' ') -and !$original.Contains(' '))
    {
        '"'   $path   '"';
    }
    else
    {
        $path;
    }
}

function PathRelativeTo($pathDest, $pathCurrent)
{
    if ($pathDest.PSParentPath.ToString().EndsWith($pathCurrent.Path))
    {
        '.\'   $pathDest.name;
    }
    else
    {
        $pathDest.FullName;
    }
}

#  This is the default function to use for tab expansion. It handles simple
# member expansion on variables, variable name expansion and parameter completion
# on commands. It doesn't understand strings so strings containing ; | ( or { may
# cause expansion to fail.

function TabExpansion($line, $lastWord)
{
    switch -regex ($lastWord)
    {
         # Handle property and method expansion...
         '(^.*)(\$(\w|\.) )\.(\w*)$' {
             $method = [Management.Automation.PSMemberTypes] `
                 'Method,CodeMethod,ScriptMethod,ParameterizedProperty'
             $base = $matches[1]
             $expression = $matches[2]
             Invoke-Expression ('$val='   $expression)
             $pat = $matches[4]   '*'
             Get-Member -inputobject $val $pat | sort membertype,name |
                 where { $_.name -notmatch '^[gs]et_'} |
                 foreach {
                     if ($_.MemberType -band $method)
                     {
                         # Return a method...
                         $base   $expression   '.'   $_.name   '('
                     }
                     else {
                         # Return a property...
                         $base   $expression   '.'   $_.name
                     }
                 }
             break;
          }

         # Handle variable name expansion...
         '(^.*\$)([\w\:]*)$' {
             $prefix = $matches[1]
             $varName = $matches[2]
             foreach ($v in Get-Childitem ('variable:'   $varName   '*'))
             {
                 if ($v.name -eq $varName)
                 {
                     $v.value
                 }
                 else
                 {
                    $prefix   $v.name
                 }
             }
             break;
         }

         # Do completion on parameters...
         '^-([\w0-9]*)' {
             $pat = $matches[1]   '*'

             # extract the command name from the string
             # first split the string into statements and pipeline elements
             # This doesn't handle strings however.
             $cmdlet = [regex]::Split($line, '[|;]')[-1]

             #  Extract the trailing unclosed block e.g. ls | foreach { cp
             if ($cmdlet -match '\{([^\{\}]*)$')
             {
                 $cmdlet = $matches[1]
             }

             # Extract the longest unclosed parenthetical expression...
             if ($cmdlet -match '\(([^()]*)$')
             {
                 $cmdlet = $matches[1]
             }

             # take the first space separated token of the remaining string
             # as the command to look up. Trim any leading or trailing spaces
             # so you don't get leading empty elements.
             $cmdlet = $cmdlet.Trim().Split()[0]

             # now get the info object for it...
             $cmdlet = @(Get-Command -type 'cmdlet,alias' $cmdlet)[0]

             # loop resolving aliases...
             while ($cmdlet.CommandType -eq 'alias') {
                 $cmdlet = @(Get-Command -type 'cmdlet,alias' $cmdlet.Definition)[0]
             }

             # expand the parameter sets and emit the matching elements
             foreach ($n in $cmdlet.ParameterSets | Select-Object -expand parameters)
             {
                 $n = $n.name
                 if ($n -like $pat) { '-'   $n }
             }
             break;
         }

         default {
             $varNameStar = $lastWord   '*';

             foreach ($n in @(Get-Childitem $varNameStar))
             {
                 $name = PathRelativeTo ($n) ($PWD);

                 if ($n.PSIsContainer)
                 {
                     EscapePath ($name   '\') ($lastWord);
                 }
                 else
                 {
                     EscapePath ($name) ($lastWord);
                 }
             }

             if (!$varNameStar.Contains('\'))
             {
                foreach ($n in @(Get-Command $varNameStar))
                {
                    if ($n.CommandType.ToString().Equals('Application'))
                    {
                       foreach ($ext in @((cat Env:PathExt).Split(';')))
                       {
                          if ($n.Path.ToString().ToLower().EndsWith(($ext).ToString().ToLower()))
                          {
                              EscapePath($n.Path) ($lastWord);
                          }
                       }
                    }
                    else
                    {
                        EscapePath($n.Name) ($lastWord);
                    }
                }

                foreach ($n in @(Get-psdrive $varNameStar))
                {
                    EscapePath($n.name   ":") ($lastWord);
                }
             }

             foreach ($n in @(Get-History))
             {
                 if ($n.CommandLine.StartsWith($line) -and $n.CommandLine -ne $line)
                 {
                     $lastWord   $n.CommandLine.Substring($line.Length);
                 }
             }

             # Add the original string to the end of the expansion list.
             $lastWord;

             break;
         }
    }
}

PermalinkCommentscli technical tabexpansion powershell

Sam Kass's Blog

2008 Nov 17, 8:21"The organization is now five years old, and our membership is larger than ever. It is estimated that one out of every four people on Earth is now a devoted member. The secret language has been perfected-- it allows us to talk freely, and sounds just like normal small talk. Also, we have an order of special agents who are particularly dangerous, and are all disguised as normal people. Our goal is the annihilation of all paranoid people."PermalinkCommentshumor club paranoia sam-kass

Investigation of a Few Application Protocols (Updated)

2008 Oct 25, 6:51

Windows allows for application protocols in which, through the registry, you specify a URL scheme and a command line to have that URL passed to your application. Its an easy way to hook a webbrowser up to your application. Anyone can read the doc above and then walk through the registry and pick out the application protocols but just from that info you can't tell what the application expects these URLs to look like. I did a bit of research on some of the application protocols I've seen which is listed below. Good places to look for information on URI schemes: Wikipedia URI scheme, and ESW Wiki UriSchemes.

Some Application Protocols and associated documentation.
Scheme Name Notes
search-ms Windows Search Protocol The search-ms application protocol is a convention for querying the Windows Search index. The protocol enables applications, like Microsoft Windows Explorer, to query the index with parameter-value arguments, including property arguments, previously saved searches, Advanced Query Syntax, Natural Query Syntax, and language code identifiers (LCIDs) for both the Indexer and the query itself. See the MSDN docs for search-ms for more info.
Example: search-ms:query=food
Explorer.AssocProtocol.search-ms
OneNote OneNote Protocol From the OneNote help: /hyperlink "pagetarget" - Starts OneNote and opens the page specified by the pagetarget parameter. To obtain the hyperlink for any page in a OneNote notebook, right-click its page tab and then click Copy Hyperlink to this Page.
Example: onenote:///\\GUMMO\Users\davris\Documents\OneNote%20Notebooks\OneNote%202007%20Guide\Getting%20Started%20with%20OneNote.one#section-id={692F45F5-A42A-415B-8C0D-39A10E88A30F}&end
callto Callto Protocol ESW Wiki Info on callto
Skype callto info
NetMeeting callto info
Example: callto://+12125551234
itpc iTunes Podcast Tells iTunes to subscribe to an indicated podcast. iTunes documentation.
C:\Program Files\iTunes\iTunes.exe /url "%1"
Example: itpc:http://www.npr.org/rss/podcast.php?id=35
iTunes.AssocProtocol.itpc
pcast
iTunes.AssocProtocol.pcast
Magnet Magnet URI Magnet URL scheme described by Wikipedia. Magnet URLs identify a resource by a hash of that resource so that when used in P2P scenarios no central authority is necessary to create URIs for a resource.
mailto Mail Protocol RFC 2368 - Mailto URL Scheme.
Mailto Syntax
Opens mail programs with new message with some parameters filled in, such as the to, from, subject, and body.
Example: mailto:?to=david.risney@gmail.com&subject=test&body=Test of mailto syntax
WindowsMail.Url.Mailto
MMS mms Protocol MSDN describes associated protocols.
Wikipedia describes MMS.
"C:\Program Files\Windows Media Player\wmplayer.exe" "%L"
Also appears to be related to MMS cellphone messages: MMS IETF Draft.
WMP11.AssocProtocol.MMS
secondlife [SecondLife] Opens SecondLife to the specified location, user, etc.
SecondLife Wiki description of the URL scheme.
"C:\Program Files\SecondLife\SecondLife.exe" -set SystemLanguage en-us -url "%1"
Example: secondlife://ahern/128/128/128
skype Skype Protocol Open Skype to call a user or phone number.
Skype's documentation
Wikipedia summary of skype URL scheme
"C:\Program Files\Skype\Phone\Skype.exe" "/uri:%l"
Example: skype:+14035551111?call
skype-plugin Skype Plugin Protocol Handler Something to do with adding plugins to skype? Maybe.
"C:\Program Files\Skype\Plugin Manager\skypePM.exe" "/uri:%1"
svn SVN Protocol Opens TortoiseSVN to browse the repository URL specified in the URL.
C:\Program Files\TortoiseSVN\bin\TortoiseProc.exe /command:repobrowser /path:"%1"
svn+ssh
tsvn
webcal Webcal Protocol Wikipedia describes webcal URL scheme.
Webcal URL scheme description.
A URL that starts with webcal:// points to an Internet location that contains a calendar in iCalendar format.
"C:\Program Files\Windows Calendar\wincal.exe" /webcal "%1"
Example: webcal://www.lightstalkers.org/LS.ics
WindowsCalendar.UrlWebcal.1
zune Zune Protocol Provides access to some Zune operations such as podcast subscription (via Zune Insider).
"c:\Program Files\Zune\Zune.exe" -link:"%1"
Example: zune://subscribe/?name=http://feeds.feedburner.com/wallstrip.
feed Outlook Add RSS Feed Identify a resource that is a feed such as Atom or RSS. Implemented by Outlook to add the indicated feed to Outlook.
Feed URI scheme pre-draft document
"C:\PROGRA~2\MICROS~1\Office12\OUTLOOK.EXE" /share "%1"
im IM Protocol RFC 3860 IM URI scheme description
Like mailto but for instant messaging clients.
Registered by Office Communicator but I was unable to get it to work as described in RFC 3860.
"C:\Program Files (x86)\Microsoft Office Communicator\Communicator.exe" "%1"
tel Tel Protocol RFC 5341 - tel URI scheme IANA assignment
RFC 3966 - tel URI scheme description
Call phone numbers via the tel URI scheme. Implemented by Office Communicator.
"C:\Program Files (x86)\Microsoft Office Communicator\Communicator.exe" "%1"
(Updated 2008-10-27: Added feed, im, and tel from Office Communicator)PermalinkCommentstechnical application protocol shell url windows

Language Log - Congress plans bailout for grammar epidemic

2008 Oct 23, 2:18I had no idea lingual prescriptivists vs descriptivists were split in a partisan manner: '... The Secretary [of the Department of Education] released a report that includes dire warnings of impending doom...The cause of this immanent catastrophe is, of course, those pesky linguists, the libertarian destroyers of good usage who claim that, well, anything goes. According to the report, "the language problem has now reached the crisis level and we are now experiencing a severe epidemic of bad grammar that will affect the very fiber of our nation." The Secretary added, "an alarming number of children are suffering from the bad advice given by those socialist, left-wing, atheistic linguists and we just gotta do something about it."'PermalinkCommentshumor language politics grammar

Language Log - Nerdview

2008 Oct 23, 10:34Geoffrey K. Pullum of Language Log defines 'nerdview': "It is a simple problem that afflicts us all: people with any kind of technical knowledge of a domain tend to get hopelessly (and unwittingly) stuck in a frame of reference that relates to their view of the issue, and their trade's technical parlance, not that of the ordinary humans with whom they so signally fail to engage... The phenomenon - we could call it nerdview - is widespread." Woo, go year-month-day, go!PermalinkCommentsnerdview language date programming nerd writing

Home - tuProlog

2008 Oct 11, 12:44GPL Java Prolog library. "tuProlog is a Java-based light-weight Prolog for Internet applications and infrastructures."PermalinkCommentsdevelopment java language jvm prolog tuprolog logic programming

List of language inventors

2008 Oct 10, 1:43A blog comment included the phrase 'hard-core conlangers' which at first glance sounds dirty, then based on the context I thought it was made up, but of course Wikipedia has the actual answer: "A conlanger ... is person who invents conlangs (constructed languages)."PermalinkCommentslanguage klingon nerd wikipedia conlang

QuickBase Formula Pretty Printer and Syntax Highlighter

2008 Oct 5, 9:17

Sarah asked me if I knew of a syntax highlighter for the QuickBase formula language which she uses at work. I couldn't find one but thought it might be fun to make a QuickBase Formula syntax highlighter based on the QuickBase help's description of the formula syntax. Thankfully the language is relatively simple since my skills with ANTLR, the parser generator, are rusty now and I've only used it previously for personal projects (like Javaish, the ridiculous Java based shell idea I had).

With the help of some great ANTLR examples and an ANTLR cheat sheet I was able to come up with the grammar that parses the QuickBase Formula syntax and prints out the same formula marked up with HTML SPAN tags and various CSS classes. ANTLR produces the parser in Java which I wrapped up in an applet, put in a jar, and embedded in an HTML page. The script in that page runs user input through the applet's parser and sticks the output at the bottom of the page with appropriate CSS rules to highlight and print the formula in a pretty fashion.

What I learned:

PermalinkCommentsjava technical programming quickbase language antlr antlrworks

Disemvowelment and Reemvowelment Tools

2008 Oct 3, 5:29I thought the disemvowelment of trolls was a pretty funny punishment -- much better than simply removing the comment: "Disemvowelment is - obviously enough - the act of removing the vowels from a passage of text, as well as a pun on the word 'disembowelling'. A number of blogs and websites do this to offensive text which has been placed in their 'comments' section. ... This site exists because I couldn't resists the challenge of trying to re-emvowel disemvowelled text. This is a challenging task, as the disemvowelled word 'dg' may well have been 'dog', but also 'dig', 'dug', 'doge', diego' and so on. I have a first cut of this functionality at the re-emvowel link at the side of the page. A more advanced version is in progress."PermalinkCommentstool disemvowelment web comment forum troll language

ANTLRWorks: The ANTLR GUI Development Environment

2008 Oct 2, 9:37Cool graphical ANTLR IDE! They didn't have this the last time I used ANTLR. "ANTLRWorks is a novel grammar development environment for ANTLR v3 grammars written by Jean Bovet (with suggested use cases from Terence Parr). It combines an excellent grammar-aware editor with an interpreter for rapid prototyping and a language-agnostic debugger for isolating grammar errors. ANTLRWorks helps eliminate grammar nondeterminisms, one of the most difficult problems for beginners and experts alike, by highlighting nondeterministic paths in the syntax diagram associated with a grammar."PermalinkCommentsantlr ide graph grammar tool free download development opensource java

ANTLR Cheat Sheet - ANTLR 3 - ANTLR Project

2008 Oct 2, 9:26Cheat sheet on ANTLR's syntax. ANTLR's another language parser generator.PermalinkCommentsantlr cheat parser language grammar opensource java software syntax quickreference

QuickBase Help - Formulas in QuickBase

2008 Oct 2, 9:24Sarah uses QuickBase formulas at work and this is the language's description. Looking at making a syntax highlighter.PermalinkCommentsquickbase language reference help

Business & Technology | Jobs with real authority: working on Microsoft's spell-checker | Seattle Times Newspaper

2008 Sep 30, 11:05Article on the team that owns the Office spell-checker: 'But, the team asked itself, should "calender" be flagged, or squiggled - have the red squiggly underline that indicates a misspelling? Yes, because letting it go through as correct "more often masks the really common spelling error that people make for calendar."' I didn't even realize they had written calender rather than calendar in the articlePermalinkCommentsmicrosoft office spell-check language

Deriving a Non-Recursive Fibonacci Function Using Linear Algebra

2008 Aug 20, 10:51

In my Intro to Algorithms course in college the Fibonacci sequence was used as the example algorithm to which various types of algorithm creation methods were applied. As the course went on we made better and better performing algorithms to find the nth Fibonacci number. In another course we were told about a matrix that when multiplied successively produced Fibonacci numbers. In my linear algebra courses I realized I could diagonalize the matrix to find a non-recursive Fibonacci function. To my surprise this worked and I found a function.
The Nth Fibonacci value is (1 + sqrt(5))^N - (1 - sqrt(5))^N all over sqrt(5) * 2^N
Looking online I found that of course this same function was already well known. Mostly I was irritated that after all the algorithms we created for faster and faster Fibonacci functions we were never told about a constant time function like this.

I recently found my paper depicting this and thought it would be a good thing to use to try out MathML, a markup language for displaying math. I went to the MathML implementations page and installed a plugin for IE to display MathML and then began writing up my paper in MathML. I wrote the MathML by hand and must say that's not how its intended to be created. The language is very verbose and it took me a long time to get the page of equations transcribed.

MathML has presentation elements and content elements that can be used separately or together. I stuck to content elements and while it looked great in IE with my extension when I tried it in FireFox which has builtin MathML support it didn't render. As it turns out FireFox doesn't support MathML content elements. I had already finished creating this page by hand and wasn't about to switch to content elements. Also, in order to get IE to render a MathML document, the document needs directives at the top for specific IE extensions which is a pain. Thankfully, the W3C has a MathML cross platform stylesheet. You just include this XSL at the top of your XHTML page and it turns content elements into appropriate presentation elements, and inserts all the known IE extension goo required for you. So now my page can look lovely and all the ickiness to get it to render is contained in the W3C's XSL.

PermalinkCommentstechnical mathml fibonacci math

Wondermark by David Malki ! - 434: In which there is Taunting

2008 Aug 19, 11:02In which Grumpy Grammar Gus is taunted with poor grammar.PermalinkCommentscomic language humor geek english grammar via:languagelog

BBC NEWS | Entertainment | Your American accents

2008 Jul 23, 5:19"Everyone can do an American accent... at least everyone thinks they can. After the BBC's Stephen Robb took a lesson from one of the movie industry's top accent coaches, we asked readers to record their best US accents."PermalinkCommentsbbc audio accent language english
Older EntriesNewer Entries Creative Commons License Some rights reserved.