Tag Hierarchy Design Notes


  1. Tag Hierarchy Tool
    1. Design
    2. Algorithm
  2. TH Input XML
  3. TH Output XML

Tag Hierarchy Tool

The Tag Hierarchy Tool, thtool, takes an XML file in the form described in TH Input XML and produces an output file in the form described in TH Output XML. The input file contains a set of tags, a set of items, and the associations between items and tags. The output has the same set of tags and items as the input, but describes the parent and child relationships between items and tags so as to form a directed acyclic graph (DAG). Items in the output DAG never have children. Every tag that is a parent of an item in the input is a parent of that item in the output, either directly or indirectly. If two tags have the same set of items as children, the tags are merged into one tag. If the item children of tag foo are a proper subset of the item children of tag bar, then tag foo is made a child of tag bar and the item children of tag foo are removed from the child set of tag bar.

THTool Code Design

I've got the following classes and interfaces.
         CommandLine (X)
         Input (X)
            - DebugInput (X)
            - THXMLInputDOM (X)
            - THXMLInputSAX ( )
            - DeliciousInput ( )
         Processor (/)
         Output (X)
            - DebugOutput (X)
            - THXMLOutput (X)

THTool Algorithm

THTool takes a flat list of items with associated tags and produces a hierarchy of items and tags. The input is a graph whose nodes are items and tags. In the input, a tag is a parent of an item if that item has been tagged with that tag, and edges run from parent tags to child items. The output is also a graph, with the same item and tag nodes. To build it, give each tag the set of items that it tags. In the output, a tag is a parent of an item if it is a parent of that item in the input and it has no child tags that are parents of that item. In the output, tag one is a parent of tag two if tag one's item set is a proper superset of tag two's item set (tags with equal item sets are merged) and tag one has no child tags that are parents of tag two.
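
The rules above can be sketched in Python. This is a minimal illustration, not the THTool code: the function name build_hierarchy and the item_tags mapping (item name to list of tag IDs) are hypothetical, a merged tag is represented as a tuple of the merged tag IDs, and tags that tag no items are left out because these notes do not say where they belong.

```python
from collections import defaultdict


def build_hierarchy(item_tags):
    """Return (tag_children, item_children) for a flat item -> tags mapping.

    Hypothetical helper: keys of both results are tuples of merged tag IDs.
    """
    # Give each tag the set of items that it tags.
    tag_items = defaultdict(set)
    for item, tags in item_tags.items():
        for tag in tags:
            tag_items[tag].add(item)

    # Merge tags that have exactly the same item set into one node.
    merged = defaultdict(list)
    for tag, items in tag_items.items():
        merged[frozenset(items)].append(tag)
    nodes = {tuple(sorted(tags)): items for items, tags in merged.items()}

    tag_children = {}
    item_children = {}
    for node, items in nodes.items():
        # Every node whose item set is a proper subset sits somewhere below.
        below = [n for n in nodes if nodes[n] < items]
        # It is a direct child unless another node below lies in between.
        tag_children[node] = sorted(
            d for d in below if not any(nodes[d] < nodes[m] for m in below)
        )
        # An item stays directly under this node only if no node below tags it.
        covered = set().union(*(nodes[d] for d in below))
        item_children[node] = sorted(items - covered)
    return tag_children, item_children


# Made-up sample data, for illustration only.
sample = {
    "apple": ["food", "edible", "fruit"],
    "banana": ["food", "edible", "fruit", "yellow"],
    "bread": ["food", "edible"],
}
tag_children, item_children = build_hierarchy(sample)
```

In this sample, food and edible tag the same three items and so merge into one node. fruit nests under that node, yellow nests under fruit, and each item remains only under its most specific tag: bread under the merged node, apple under fruit, banana under yellow.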

TH Input XML

The Tag Hierarchy input is an XML document that describes a flat set of tags and items.

Set Element

The root element is the set element. The namespace for everything in the document is <tag:deletethis.net,2007:ns/taghierarchy>. The set element may contain zero or more tag elements followed by zero or more item elements.

Tag Element

Each tag is required to have an ID attribute that is unique within the scope of the document. Tag IDs are compared for equality with a case-sensitive string comparison. Each tag may contain one content element.

Item Element

Each item has zero or more link elements and may contain one content element.

Link Element

Link elements have two attributes: rel and targetid. The targetid attribute value is the ID of a previously defined tag. The rel attribute value defines the relationship between the current item and the tag named in targetid. The only valid rel value is 'parent', meaning that the referenced tag tags this item.

Content Element

The content of the content element will be carried through to the output as the content element of the corresponding item or tag in the output.
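
As a rough illustration of the input format, the sketch below parses a small document with Python's xml.etree.ElementTree. Several details are assumptions, not taken from these notes: the lowercase element names (set, tag, item, link, content), the spelling id for the tag ID attribute, the sample data, and the helper name parse_input. Items have no identifier in the format as described, so they are returned by position. Content elements are ignored here.

```python
import xml.etree.ElementTree as ET

NS = "tag:deletethis.net,2007:ns/taghierarchy"

# Made-up sample document; element and attribute spellings are assumed.
SAMPLE = """\
<set xmlns="tag:deletethis.net,2007:ns/taghierarchy">
  <tag id="fruit"><content>Fruit</content></tag>
  <tag id="yellow"/>
  <item>
    <link rel="parent" targetid="fruit"/>
    <link rel="parent" targetid="yellow"/>
    <content>banana</content>
  </item>
  <item>
    <link rel="parent" targetid="fruit"/>
    <content>apple</content>
  </item>
</set>
"""


def qname(local):
    """Build the namespace-qualified name that ElementTree uses."""
    return f"{{{NS}}}{local}"


def parse_input(xml_text):
    """Return (tag_ids, items); items is a list of parent tag ID lists."""
    root = ET.fromstring(xml_text)
    if root.tag != qname("set"):
        raise ValueError("root element must be the set element")

    tag_ids = []
    for tag in root.findall(qname("tag")):
        tag_id = tag.get("id")
        # IDs are required, unique, and compared case-sensitively.
        if tag_id is None or tag_id in tag_ids:
            raise ValueError(f"missing or duplicate tag ID: {tag_id!r}")
        tag_ids.append(tag_id)

    items = []
    for item in root.findall(qname("item")):
        parents = []
        for link in item.findall(qname("link")):
            if link.get("rel") != "parent":
                raise ValueError("the only valid input rel value is 'parent'")
            target = link.get("targetid")
            if target not in tag_ids:
                raise ValueError(f"link to undefined tag: {target!r}")
            parents.append(target)
        items.append(parents)
    return tag_ids, items


tag_ids, items = parse_input(SAMPLE)
```

The sketch does not check that all tag elements come before the item elements; a stricter reader would also verify that ordering.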


TH Output XML

The output is defined in terms of the input defined above, with the following differences. The tag element may contain link elements. The tag element may contain any number of content elements. In addition to 'parent', the link element's rel attribute value may be 'child' or 'equal', where the name value is the name of a tag or item. All parent and child relationships are specified explicitly in the output.
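
The following Python sketch shows one way the explicit parent and child links between tags might be written out. It is heavily hedged: the helper name build_output, the tag_children mapping, the lowercase element names, and the id attribute spelling are all assumptions. It reuses targetid for output links, which these notes do not confirm, and it leaves out 'equal' links and links to items because the notes do not specify how they are expressed.

```python
import xml.etree.ElementTree as ET

NS = "tag:deletethis.net,2007:ns/taghierarchy"

# Serialize the namespace as the default namespace (no prefix).
ET.register_namespace("", NS)


def build_output(tag_children):
    """Build a set element from a hypothetical tag ID -> child tag IDs map."""
    # Invert the mapping so each relationship is written in both directions.
    tag_parents = {}
    for parent, children in tag_children.items():
        for child in children:
            tag_parents.setdefault(child, []).append(parent)

    root = ET.Element(f"{{{NS}}}set")
    for tag_id in sorted(set(tag_children) | set(tag_parents)):
        tag = ET.SubElement(root, f"{{{NS}}}tag", {"id": tag_id})
        for rel, targets in (
            ("parent", tag_parents.get(tag_id, [])),
            ("child", tag_children.get(tag_id, [])),
        ):
            for target in targets:
                # Assumed attribute names: rel and targetid, as in the input.
                ET.SubElement(
                    tag, f"{{{NS}}}link", {"rel": rel, "targetid": target}
                )
    return root


# Made-up sample hierarchy: food > fruit > yellow.
root = build_output({"food": ["fruit"], "fruit": ["yellow"]})
xml_text = ET.tostring(root, encoding="unicode")
```

Writing each relationship from both ends means a reader of the output never has to invert links to find a tag's parents or children.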


© David Risney. All Rights Reserved.