InfoVis 2003 Contest - TreeJuxtaposer Entry
James Slack, Tamara Munzner
{jslack,tmm}@cs.ubc.ca
University of British Columbia
Francois Guimbretiere
francois@cs.umd.edu
University of Maryland
August 1, 2003
General Tasks
Pair Wise Comparisons of Trees
- Topological changes
- Did anything change, in general, or in a subtree. Were there small
changes or major changes?
- There were major changes from mammalia_A to mammalia_B. mammalia_B has
quite a few more nodes than mammalia_A and there are many types of differences
between them.
- There were small changes from hcil.logs_A through hcil.logs_D. The changes
from week to week were due to isolated additions or deletions and each change
can be inspected individually using TreeJuxtaposer's browsing capabilities.
- There are major differences in the structure between phylo_A and phylo_B.
The similarities are in the leaf nodes, and some matching subtrees (and
some almost matching ones) also exist. All similarities are easy to see in this
uncrowded tree, and marking a subtree makes it clear that its leaf nodes
can appear anywhere in the other tree.
- What nodes were added, deleted?
- The nodes that were added (from mammalia_A to mammalia_B) are all the leaf
nodes in mammalia_B that are red (marked as different; a leaf node is
different if it has no corresponding leaf node in the other tree).
The nodes that were deleted are all the leaf nodes in mammalia_A that are red.
Some red leaf nodes may exist in the other tree as non-leaf nodes, because
TreeJuxtaposer treats a leaf node as different from any non-leaf node.
- A few nodes are added between hcil.logs_A and
hcil.logs_D. Since all 4 hcil subtrees are browsable simultaneously, it is
visually clear which nodes have been added and which have been deleted.
Observe the counterpoint directory for deletions and the iv03contest directory
for additions.
- No importantly named nodes are added or deleted in the phylo_A
and phylo_B trees.
- Did any node or subtrees "move" in the tree. Can you characterize those movements?
- We know at least one subtree moved in the mammalia tree
(see the specific results in the Classification section)
but the TreeJuxtaposer system is unable to automatically characterize
these movements. Users can quickly explore the dataset by marking a
subtree on one side and seeing whether its components are widely
separated on the other side. Examples where the components are
separated but close to each other are not true movements, but simply
the result of the many additions and deletions. With a search of less
than ten minutes, we were able to find a few examples: "octodontidae",
"peramelemorphia."
- No nodes were found to have moved between successive hcil logs as all
activity was either insertions or deletions.
- Most nodes in the phylo_A and phylo_B trees moved, but a few nodes
managed to remain together and match between the two trees.
- Attribute value changes
- Global impression: did things change a lot or not?
- What nodes or subtrees changed the most
- Did the value of attribute XYZ for this node increase or decrease?
In absolute terms, or relatively to other siblings or other nodes
- Since TreeJuxtaposer is not designed to handle attributes at this time,
the attribute section of the contest will not be considered.
General Visualization of Trees
- Topology
- Overall characteristics: How large is the tree? How many levels deep? What is the deepest branch? Does the depth vary between subtrees or not?
- The size of the tree can be determined by the number of leaf nodes since they
are allocated a portion of vertical screen relative to the number of nodes
in the tree. Using the keyboard arrow keys is useful for determining how dense
the nodes are at the leaf level. The depth of the tree cannot be determined
as easily. The deepest branch is also a question that
cannot be answered visually. Since the trees are right-aligned, the depth
of subtrees is also not visually obvious.
- Path: What is the path of this node
- The path of a node to the root is simply found using the left arrow key,
which visits the parent of the current node. Also, if the nodes are
fully-qualified in their internal structure (i.e. node names appear as
relative to their parents but nodes are searched for using a "directory-like"
structure from the root of the entire tree) the Find panel shows the fully
qualified name. Here is a small example using homo sapiens:
- Local relatives: what are the children, siblings or cousins of this node?
- The children of the node can be scrolled through with the keyboard. The
right arrow will select the first child and pressing the down arrow will then
cycle through the children. Growing the node with the keyboard or mouse is
a much better alternative and all children that can be displayed will be
visible, instead of the keyboard method which only highlights one node at a
time. Cousins are most easily found with the keyboard method though, as when
the last child is selected, the next arrow down (or up, depending on the
direction) will select the closest cousin.
- Filtering by level: e.g. show me only the first level, or show only 3 levels down, or removes all the leaves
- TreeJuxtaposer cannot filter by rank or level.
- Which branch contains the largest number of nodes? Which branch has the largest fan-out?
- The largest fan-out (the largest number of leaf nodes) is clearly visible
either by selecting the root of an interesting subtree with a User Group
or by hovering over it. User Groups are more visible since they color the
interesting nodes, but hovering is much quicker, at the expense of having
to rely on the gray box that disappears when the mouse focuses on a different node.
The largest number of nodes in a subtree is not quantitatively visible in
TreeJuxtaposer, but the density of tree edges in a region corresponds to
a larger number of nodes in a subtree and can be used to judge the relative
size of a subtree.
- Attribute based:
Those tasks only occur when leaf nodes have attributes that can be aggregated at the parent level
- Find nodes with high values of a numerical attribute X ? (relative query)
- Find nodes with given value of a numerical attribute X ? (absolute query)
- Find nodes with value Y of categorical attribute X
- What value of a categorical attribute occurs more often?
e.g. Are there more farm animals or pets
- Find nodes with certain values of two or more attributes
e.g. what video file is used the most
- Since TreeJuxtaposer is unable to assess attributes of nodes, these
tasks are not applicable for our system.
- Some topology queries fall under attribute-dependent queries because all trees/subtrees have at least one attribute: their number of nodes!
- Number of nodes in a tree, or subtree?
e.g. How many animal?, how many mammals?
- The number of animal species is equal to the number of leaf nodes aligned
on the right side of TreeJuxtaposer. The actual number of leaf nodes is not
displayed but the total number of named nodes is available in the Find panel
with a fully qualified naming structure. The number of mammals can be
determined in the same way (or the number of dolphins, etc).
- Comparison of branches of the tree (i.e. subtrees with most nodes)?
e.g. Is there more mammals or fish?
- Determining if there are more mammals or fish is also easy to do with the
Find panel and fully qualified names. By highlighting all nodes that start
with "///animal/mammal" the number of
mammals can be found. If you wanted to find the mammal subtree to perform
this operation and typed "mammal" into the Find panel for
classif_B, there are "mammal-nest beetles"
which are not mammals. Since there are very few non-mammals with "mammal" in
their name, it is easy to deselect the non-mammals from the
Find panel and find the mammal subtree. This method is useful for marking
and observing the size of two subtrees (say mammals and fish) visually.
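The difference between searching by a fully qualified prefix and searching by a bare substring can be sketched as follows. This is a minimal illustration under stated assumptions: the list of names is invented (only the "///animal/mammal" prefix and the "mammal-nest beetle" label come from the text), and `count_under` is a hypothetical helper, not part of TreeJuxtaposer.

```python
def count_under(names, prefix):
    """Count fully qualified names at or below the given path prefix."""
    root = prefix.rstrip("/")
    return sum(1 for n in names if n == root or n.startswith(root + "/"))


# Invented fully qualified names for illustration only.
names = [
    "///animal",
    "///animal/mammal",
    "///animal/mammal/dolphin",
    "///animal/mammal/horse",
    "///animal/fish",
    "///animal/fish/seahorse",
    "///animal/insect/mammal-nest beetle",
]

print(count_under(names, "///animal/mammal"))  # 3
print(count_under(names, "///animal/fish"))    # 2
print(sum("mammal" in n for n in names))       # 4 (includes the beetle)
```

The prefix search counts only the mammal subtree, while the bare substring search also picks up the beetle, which is the kind of false hit that has to be deselected by hand.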
- Largest fanout
e.g. What is the largest group of animals with same lineage?
- The largest fan-out (the largest number of leaf nodes) is clearly visible
either by selecting the root of an interesting subtree with a User Group
or by hovering over it. User Groups are more visible since they color the
interesting nodes, but hovering is much quicker, at the expense of having
to rely on the gray box that disappears when the mouse focuses on a different node.
The keyboard arrow keys (up and down) can be used to cycle through sibling or
cousin nodes.
- Known items
- Which node(s) has a label containing this string?
e.g. find "giraffe" in a tree of animals
- Use the Find panel and type in giraffe. All giraffes are now highlighted
with the Found group. Use the Groups panel to resize the Found group. Here
is an example from classif_B:
- Locate a node knowing its path.
- Finding a node with a known path can be done in the Find panel, or
by browsing through the tree. The method used would depend on exactly
what about the path is known. If the full path is known, then browsing through
the tree from subtree to subtree may be faster since you wouldn't have to
type the entire path into the Find panel. However, because of limited screen
real estate and the limited number of visible node labels, finding a path element
on a bushy tree is hard, so time may be saved by simply typing it into the
Find panel as a hint to which area to grow.
- Go back to a node you have visited before.
- There is no undo feature in TreeJuxtaposer, but if you know that you would
probably like to return to this node after exploring other parts of the tree,
then marking the node with a User Group would be a good idea, if you don't run
out of User Groups.
- Labeling: Review all the labels in a subtree
- All the labels in a subtree can be extracted through the Find panel. If
a name is entered into the Find panel, the results are limited to the
nodes that match the entry. The labels of a subtree can then be examined
and appear highlighted in the tree visualization for mouse manipulation if there aren't
too many.
- Browsing:
Explore the tree by performing a series of up and downs in the tree
e.g. you are looking for a cute animal... so you look into mammals, then primates, then gorillas, and chimpanzees, but you realize that are not that cute, so you go to felines, to tigers and cheetahs, but now remember that pandas are your favorites and you go there.
- These actions are easily performed with the mouse interface to resize
subtrees to find interesting paths to leaf nodes. Starting with the animals
tree classif_B (with common names), grow the vertebrates and then the mammals,
then find primates and then find gorillas and chimpanzees in the great
apes subtree. Finding cats is a little trickier: starting with the mouse on
primates, press the up arrow until carnivores is highlighted. Grow the
carnivores selection with the keyboard until it is large enough to see
the cats subtree. Use the mouse to resize the cats subtree until it's large
enough to see cheetah and tiger.
- Managing the analysis
- Marking nodes of interest
- Up to 4 User Groups can be used to mark nodes of interest. The
granularity of marking can either be Node or Subtree and multiple Node/Subtrees
can be marked in the same group. A node may belong to multiple groups
simultaneously, and the groups are given drawing order relative to when they
were last selected; the last group selected will draw over previous user groups
if they both want to draw the same edge.
- Removing special anomalies.
- Saving visualization settings for future reference.
- Keeping the history of your analysis, reviewing it and replaying it with
different parameters.
- TreeJuxtaposer can't modify the tree, and doesn't support saving or history.
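The User Group drawing order described above (the most recently selected group draws over the others on a shared edge) can be modeled with a small ordered list. This is a sketch of the described behavior only; the class name, method names, and group labels are hypothetical and do not reflect TreeJuxtaposer's internals.

```python
class GroupOrder:
    """Tracks which group was selected last; later selections draw on top."""

    def __init__(self, names):
        self.order = list(names)  # earliest-selected first

    def select(self, name):
        """Move a group to the end of the order, making it the topmost."""
        self.order.remove(name)
        self.order.append(name)

    def top_group(self, wanting):
        """Return the group that wins an edge wanted by the given groups."""
        for name in reversed(self.order):
            if name in wanting:
                return name
        return None


groups = GroupOrder(["A", "B", "C", "D"])  # up to 4 User Groups
groups.select("A")
print(groups.top_group({"A", "C"}))  # A
groups.select("C")
print(groups.top_group({"A", "C"}))  # C
```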
Phylogenies
Application Specific Tasks
- The higher-level problem is to find the best way to map the similarities
between the two trees topologies, which would indicate co-evolution, and,
maybe, the point(s) where the two proteins were not co-evolving.
Actions:
- Load trees phylo_A_ABC_03-02-01.nh and phylo_B_IM__03-02-01.nh
Comments about phylogeny comparison:
- All of the leaves match. The leaves in phylo_A are all in phylo_B and
vice versa.
- Some leaf nodes have identical names within the same tree.
TreeJuxtaposer assumes that every leaf has a 1-to-1 relationship with a
similar leaf in the other tree, but it can only assign these matches
automatically; a different leaf assignment between trees might have produced
a different tree comparison result.
- A subtree of 5 leaf nodes almost matches. The subtree has a structural
difference in only one child subtree. The blue marked region in the picture
shows this.
- A larger subtree of 7 leaf nodes matches. The green marked region
in the picture shows this.
- An even larger subtree of 8 nodes nearly matches. The subtree has a
larger difference in 3 internal nodes but may be useful. The cyan marked
region in the picture shows this.
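The ambiguity caused by duplicate leaf names, noted in the comments above, can be detected before comparison with a simple count. This is a minimal sketch; the leaf labels are placeholders and `duplicate_leaves` is a hypothetical helper, not a TreeJuxtaposer function.

```python
from collections import Counter


def duplicate_leaves(leaf_labels):
    """Return the labels that occur more than once, with their counts."""
    counts = Counter(leaf_labels)
    return {label: n for label, n in counts.items() if n > 1}


# Placeholder labels, not taken from phylo_A or phylo_B.
leaves = ["p1", "p2", "p2", "p3", "p3", "p3"]
print(duplicate_leaves(leaves))  # {'p2': 2, 'p3': 3}
```

Any label reported here has more than one candidate partner in the other tree, so an automatic 1-to-1 assignment has to pick among them arbitrarily.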
- Lower-level tasks involve interacting with the tree matching process to solve
inconsistencies that can arise.
- Still lower-level tasks involve displaying the trees (with or without taking
into account the branch length, i.e. the length of the links), showing the
relationships and differences from a computed or interactively constructed
mapping, and providing ways to permute links and nodes to verify hypotheses
interactively.
Comments about interaction low-level tasks:
- The difference marking is provided by the automatic best-corresponding node
algorithm in TreeJuxtaposer. Navigating through with mouse-over
highlighting and marking subtrees with User Groups allows the user to
recognize further similarities in the tree.
- TreeJuxtaposer does not have the functionality required to interact with
the nodes at a low level since the matching process used is automatic.
- TreeJuxtaposer also is not designed to modify the structure of the
input given.
Classifications
Application Specific Tasks
- Comparing the two classification datasets:
- To what extent are the differences in the classifications due to
differences in how animals are thought to be related?
- Are there other kinds of differences and can you explain them?
Actions:
- Load trees mammalia_[AB]_03-04-16.nh (latin, not fully qualified)
Comments about mammalia comparison:
- The differences in the classifications are mostly due to additions
to the tree (from A to B), deletions from the tree, or slight modifications
such as splitting (a leaf node in A becomes a subtree with two children in B).
- Additions and deletions on the leaves can be quantified by examination
of the "redness" of the leaf level as the leaves are equally spaced at that
level and therefore the percentage of red (red marking a node that is
different) indicates the percentage of added nodes relative to the other tree.
- If a large subtree, for example rodentia, is highlighted in A with blue,
the nodes in B that are highlighted are all in the rodentia subtree.
Furthermore, if the rodentia subtree in B is now highlighted with green,
there are no nodes highlighted in either classification tree that are outside
of the rodentia group. This is far from being complete but investigation of
the mammalia trees shows mostly differences in the leaf level nodes.
- Some differences such as the movement of pitheciidae (marked in
green) from primates in classif_A to cebidae (new world monkeys,
marked in blue) in classif_B can be found through exploration but
there is no easy way for TreeJuxtaposer to automatically highlight or
count these types of differences. The subtree marking capability does
speed up the exploration process, as explained in the movement
characterization answer in the General Pairwise section above.
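The "redness" estimate mentioned in the comments above amounts to the share of leaves in one tree that have no same-named leaf in the other. A minimal sketch of that arithmetic follows; the leaf lists are invented and `percent_unmatched` is a hypothetical helper, so the numbers say nothing about the real mammalia trees.

```python
def percent_unmatched(leaves_here, leaves_other):
    """Percentage of leaves in one tree with no same-named leaf in the other."""
    other = set(leaves_other)
    unmatched = [name for name in leaves_here if name not in other]
    return 100.0 * len(unmatched) / len(leaves_here)


# Invented leaf lists for illustration only.
a = ["l1", "l2", "l3", "l4"]
b = ["l1", "l2", "l3", "l5", "l6", "l7", "l8", "l9"]

print(percent_unmatched(b, a))  # 62.5 (share of B's leaves that were added)
print(percent_unmatched(a, b))  # 25.0 (share of A's leaves that were deleted)
```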
- Considering one dataset or the other
- Can you say in how many different subtrees a particular common name (such as "dolphin" or "horse") is used? How closely are these animals related?
- Are common names a good guide to understanding relationships?
Actions:
- Load tree classif_A_03-04-16.nh (common, fully qualified)
- Search for "dolphin": Find panel finds 53 leaf and non-leaf dolphins
Comments about "dolphin" search:
- "myzomela adolphinae": probably not named with respect to common dolphins
- many dolphins in "marine dolphins" hierarchy
- Search for "horse": Find panel finds 47 leaf and non-leaf horses
Comments about "horse" search:
- In addition to mammalian horses, "horse" appears in many different
subtrees across different
parts of the classification tree (arthropods, insects, seahorses, snails, etc)
- The animal species with "horse" in their names are not closely related
at all
- Several "horse" groups exist whose members do not have "horse" in their
species names but are related to horses at a higher rank.
Comments about common name searches:
- Common names are not a good guide to understanding relationships. Common
names lack structure and do not have the same hierarchical classification
as their latin equivalents.
- Common names may have historical or geographical influences, and one
classification may even look different from an identical classification tree
if a naming convention is not adhered to; the trees provided are not well
suited to finding differences when common names are used. See for example how
mammalia_A uses the label "vancouver island marmot" while mammalia_B
uses "vancouver marmot", which is another name for "marmota vancouverensis".
- Some common names may be simple and included in other common names (e.g.
"horse" occurs in "seahorse"). TreeJuxtaposer Find can be used to ignore
or focus in on sections of species, but it requires some user input in the
search window.
- For names such as "dolphin" that are not expected to occur frequently
across very different species, it was interesting to see non-mammals occur
(especially non-porpoises; fully qualified names show them clearly:
a mollusk, 2 bony fishes, and some kind of perching bird), which may either have
dolphin-like properties or have "dolphin" in their name by chance.
- Although common names are very useful for providing recognizable
names when a layperson browses a single tree, they dramatically
impede comparison.
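The substring effect discussed above ("horse" inside "seahorse", "dolphin" inside "adolphinae") can be illustrated by contrasting a plain substring search with a whole-word search. This is a sketch only: the whole-word filter is an assumption introduced for comparison, not a feature the text attributes to TreeJuxtaposer, and "river dolphin" is an invented label.

```python
import re


def substring_hits(labels, term):
    """Labels containing the term anywhere, as a plain substring."""
    return [label for label in labels if term in label.lower()]


def whole_word_hits(labels, term):
    """Labels containing the term as a separate word."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [label for label in labels if pattern.search(label)]


labels = ["river dolphin", "myzomela adolphinae", "seahorse", "horse"]

print(substring_hits(labels, "dolphin"))   # ['river dolphin', 'myzomela adolphinae']
print(whole_word_hits(labels, "dolphin"))  # ['river dolphin']
print(substring_hits(labels, "horse"))     # ['seahorse', 'horse']
print(whole_word_hits(labels, "horse"))    # ['horse']
```

A whole-word filter has its own cost: it would also drop plural forms such as "dolphins", so some user judgment is needed either way.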
- How many species or subspecies are named after biologists named "Townsend"?
Note that the answer will be different if you are looking at common names
versus Latin names. Can you look at the pattern of names to deduce where in
the world they might have done research? On what kinds of animals?
Actions:
- Load tree classif_A_03-04-16.nh (latin)
- Search for "townsend": Find panel finds 51 leaf and non-leaf townsend nodes
- Start new TreeJuxtaposer with classif_A (common)
- Search for "townsend": Find panel finds 45 leaf and non-leaf townsend nodes
Some latin names appear in common trees since if a node has no common name,
the latin name is used as a label.
Comments about "Townsend" searches:
- The names returned in the search do not show a pattern that can be used
to deduce where in the world Townsend (or the Townsends, if there was in
fact more than one Townsend naming animals) might have done research.
The common names give a range of geographic locations with chipmunks, shrimp,
and bats.
- The kinds of animals the search returns span quite a range: the search
highlights are distributed throughout the classification tree.
- Some scientific names are maddeningly similar. For example, Spirulida and Spirurida are two nodes in two different subtrees. A user types in the wrong one. What kind of feedback does your tool provide to alert the user quickly? Do the names have the same rank? Is the typed name in the expected part of the tree?
Actions:
- Load tree classif_A_03-04-16.nh (latin)
- Assume you are searching for Spirurida, which is in the Nemata phylum:
you are knowledgeable about worms, know that Spirurida is a type of nemata,
and want to see the hierarchy under Spirurida.
- Enter "Spirulida" in Find box
- Grow Found nodes and notice that the wrong section grows and no
worms appear.
- You read what you typed into the search box and realize
the mistake and correct it.
- The feedback from TreeJuxtaposer is that the found nodes
did not grow the expected subtree. Since TreeJuxtaposer doesn't store the
rank as an attribute, determining if both names have the same rank is not
possible within the system. The typed name was not in the expected part of
the subtree that we chose to highlight, which would be an excellent indication
of user error or at least a warning to examine what was found by the Find panel.
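The "wrong part of the tree" signal described above can be made explicit when fully qualified names are available: check whether the expected ancestor appears in the path of the node that was found. This is a sketch under stated assumptions; `under_expected` is a hypothetical helper, and the intermediate path elements are placeholders, not the real lineage from classif_A.

```python
def under_expected(qualified_name, expected_ancestor):
    """True if expected_ancestor appears as a path element above the node."""
    parts = [p for p in qualified_name.split("/") if p]
    ancestors = [p.lower() for p in parts[:-1]]
    return expected_ancestor.lower() in ancestors


# Placeholder paths: only "nemata", "spirulida" and "spirurida" come from
# the text; the other path elements are invented.
found = "///placeholder_phylum/spirulida"
wanted = "///nemata/placeholder_class/spirurida"

print(under_expected(found, "nemata"))   # False
print(under_expected(wanted, "nemata"))  # True
```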
- For the top five subtrees with the most nodes -- are they likely to have a
parent of a particular rank? Or does this happen in many ranks? Can you
comment on how useful "rank" is?
- We are unable to comment on rank since rank is an attribute that
the TreeJuxtaposer system does not handle.
File System and Usage Logs
Application Specific Tasks
- Questions about the file system itself
- Where are the big directories?
- Can you see different patterns in those files? e.g. can you make out the
difference between personal pages, class pages and research project pages,
were there a lot of pages created recently, in which part of the file system?
Actions:
- Load tree logs_A_03-02-01.nh
Comments about finding big directories:
- Big directories are immediately visible from the layout since the
vertical space consumed by directories indicates how many total leaves are in
the subdirectory structure. It's obvious from the tree logs_A that
"users" and "class" are the biggest directories linked to the root
of the tree. Finding the biggest directory in any subtree can be done in this
way, as long as no ancestor nodes of the subtree were previously grown or
shrunk.
- Finding directories with the biggest number of immediate
leaves (files) is more difficult with TreeJuxtaposer. Since the leaves
are right-aligned and children are ordered alphabetically, the leaves for
a particular node are interspersed between the non-leaf children of the
node, making accurate estimations of the number of immediate files
in a directory hard.
- Personal pages are found in 2 locations: in the users subdirectory such
as "///users/hollings" and each user also has a users subdirectory directly
attached to the root such as "///usershollings". The contents of these
directories are different ("///usersshankar" has more leaves than
"///users/shankar" but "///users/building" has more leaves than
"///usersbuilding") but not much can be said about why the directory
structure is set up this way without attributes.
- The personal pages make up more than 50% of the total number of leaf
nodes in the system. Of the 76547 nodes, personal pages make up 42877 nodes:
20480 of which are in the "///users/hollings" type personal pages and 22397 in
the "///usershollings" type personal pages. The totals are displayed by the
Find panel but not displayed on the visualization as found nodes since there
are too many nodes that would be highlighted to be useful.
- Class pages are found in the class subtree which breaks the years 1997-2003
into fall, spring and summer terms, each of which contains cmsc course pages.
- There are many fewer research pages ("///projects") than there are
personal or class pages. The largest directory in "///projects" is hcil.
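The size comparisons above rest on counting leaves (files) under each directory, and the personal-page share follows from the totals reported by the Find panel. A minimal sketch of both is below; the file paths are invented and `leaves_per_top_dir` is a hypothetical helper, while the three totals are the ones quoted in the text.

```python
from collections import Counter


def leaves_per_top_dir(file_paths):
    """Count files (leaves) under each directory attached to the root."""
    counts = Counter()
    for path in file_paths:
        parts = [p for p in path.split("/") if p]
        if len(parts) > 1:  # skip files sitting directly at the root
            counts[parts[0]] += 1
    return counts


# Invented file paths for illustration only.
paths = [
    "///users/hollings/index.html",
    "///users/hollings/pubs/a.pdf",
    "///users/shankar/index.html",
    "///class/fall2002/cmsc434-0101/index.html",
    "///projects/hcil/index.html",
]
counts = leaves_per_top_dir(paths)
print(counts.most_common(1))  # [('users', 3)]

# Totals quoted in the text above.
personal = 20480 + 22397
total = 76547
print(personal)                          # 42877
print(round(100 * personal / total, 1))  # 56.0
```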
- Are the newer directories bigger than the older projects?
- When was the page giving directions to the department last updated?
Actions:
- Load trees logs_[AB]_03-02-01.nh
Comments about comparing different trees:
- TreeJuxtaposer isn't able to determine the age of a directory unless the
directory was added between the times at which data was collected. The size,
in total number of files, of the projects subtree is quite a bit smaller than
the users directory. Furthermore, user "hollings" has about as many files as
the entire projects directory combined. Using the Find panel,
"///users/hollings" has 7194 nodes (leaves and internal nodes) and
"///projects" has 8447 nodes.
- Finding the page giving directions to the department cannot be done
with TreeJuxtaposer since this would require the attribute describing the
file contents (extracted from the <title> tag).
- Personal pages show the most diverse and sporadic differences. There
appear to be many people who either added/deleted/moved files in their personal
directories or they didn't do anything in their directory that week.
- Class pages show a pattern of difference that is regular and expected. The
only differences are between leaves in fall2002 and spring2003 subdirectories.
- Closer examination of the fall2002 differences shows that some files were
deleted in the projects directory of cmsc434-0101.
- Examination of the
changes in spring2003 show that cmsc838p has changed, and the changes
were one delete ("design/openimpl.pdf") and several additions in multiple
subdirectories.
- Spring2003 has several additional subdirectories,
possibly reflecting these courses beginning. These courses include: cmsc102,
cmsc106, cmsc412-201, cmsc417, cmsc433, cmsc733, and the cmsc434 directory has
been further populated.
- There are very few changes in the project pages in this time period. The
only leaf modifications are in the "jazz-chat" directory, where some files
have been added. These changes ripple up the tree to the root; the ripples
do not reflect the entire structure changing.
- Questions about usage
- Which are the popular web pages?
- Are there some labs more popular than others?
- Which areas are getting more popular? less popular?
- Are new pages more popular than old pages?
- Which old pages are popular?
- What proportion of the pages are never used? seldom used?
- We cannot comment on usage since TreeJuxtaposer does not handle usage attributes.
- Using the HCIL subtree
- Additionally, examination of the HCIL subtree can be done with all 4 logs
loaded (a 4-way comparison)
Actions:
- Load trees hcil.logs_[ABCD]_03-02-01.nh
Comments about comparing different trees:
- The counterpoint directory changes. Growing the directory, it's clear that
the directory changes only between logs_C and logs_D.
- The iv03contest directory is added between logs_B and logs_C. Between
logs_C and logs_D, the directory is further populated with contest information
and the datasets (all except logs_D, of course).
- spacetree and timesearcher also show some additions between logs_B and
logs_C.
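The 4-way observations above (when a directory first appears, and between which logs it changes) can be expressed as simple set comparisons over successive snapshots. This is only a sketch: the snapshot contents are invented to mirror the pattern described, and both helper functions are hypothetical, not part of TreeJuxtaposer.

```python
def first_seen(snapshots, path):
    """Return the name of the first snapshot containing path, or None."""
    for name, paths in snapshots:
        if path in paths:
            return name
    return None


def changed_between(snapshots, prefix):
    """List consecutive snapshot pairs whose paths under prefix differ."""
    changes = []
    for (n1, p1), (n2, p2) in zip(snapshots, snapshots[1:]):
        s1 = {p for p in p1 if p.startswith(prefix)}
        s2 = {p for p in p2 if p.startswith(prefix)}
        if s1 != s2:
            changes.append((n1, n2))
    return changes


# Invented file names; only the directory names come from the text.
snapshots = [
    ("logs_A", {"/counterpoint/a"}),
    ("logs_B", {"/counterpoint/a"}),
    ("logs_C", {"/counterpoint/a", "/iv03contest/readme"}),
    ("logs_D", {"/iv03contest/readme", "/iv03contest/data"}),
]

print(first_seen(snapshots, "/iv03contest/readme"))  # logs_C
print(changed_between(snapshots, "/counterpoint"))   # [('logs_C', 'logs_D')]
print(changed_between(snapshots, "/iv03contest"))
# [('logs_B', 'logs_C'), ('logs_C', 'logs_D')]
```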