Automated Tag Clustering: Improving search and exploration in the tag space
Authors:
- Grigory Begelman, Technion - Israel Institute of Technology Computer Science Dpt
- Philipp Keller, local.ch ag
- Frank Smadja, RawSugar
This HTML document is based on our paper and the presentation we gave at www2006.
Summary
The use of clustering techniques enhances the user experience and thus the success of collaborative tagging services. Clustering techniques could improve the user experience of current tagging services. This document
- describes the current limitations of tagging services
- gives an overview of existing approaches
- describes the algorithms for tag clustering
- shows experimental results and a variety of conclusions
Motivation
The success of tagging services like Flickr (part of Yahoo!) del.icio.us (part of Yahoo!) and technorati has shown that tagging is a great collaboration tool. Tagging seems to be the natural way for people to classify objects as well as an attractive way to discover new material. Tagging services provide users with a repository of tagged resources (a.k.a tagspace) that can be searched and explored in different ways. More and more people use at least one tagging service and enjoy them as discovery tools. Indeed, tagging is simple, it does not require a lot of thinking and it is very useful to find the tagged objects later. People tag pictures, videos, and other resources with a couple of keywords to easily retrieve them in a later stage. However, looking for information in the tag space has a number of hard limitations.
The difficulty comes from the fact that several people usually use different tags for the same document. In fact, even a single user’s tagging practice may vary over time. Usually, this variability is compensated by looking at many users’ tags; which is only possible when the page has been tagged many times. However, for less popular pages the problem remains. Currently tagging services still provide a relatively marginal value for information discovery and we claim that with the use of clustering techniques this can be greatly improved. We first discuss the main limitations of the current tagging services.
Limited Search
Let us imagine that you would like to tag the picture in «Figure sushi». It is a picture of a piece of sushi called nigiri (hand formed) sushi as opposed to other types of sushi like maki, futomaki or temaki sushi. A person not aware of this classification of sushi would tag this picture with any combination of the following tags: food, fish, raw fish, rice, Japanese. However a more expert person would use: nigiri sushi, or toro. Without delving on the psychological aspects of tagging (see A cognitive analysis of tagging by Rashmi Sinha for a good discussion on the subject) nor on the nuances of sushi; we clearly see that people think and tag differently. This creates a noisy tagspace and thus makes it harder to find material tagged by other people.

Figure sushi:
How would you tag this?
In summary, if you tag the above picture as toro, people searching for sushi or food will not find it. This type of problem is rooted in the language, words are often related and do not stand in isolation. Such relations among words are called lexical relations. We refer the reader to WordNet for a thorough treatment of semantic and syntagmatic relations among words.
Our point is that, without accounting for lexical relations, searching in a tag space in which many people of various background collaborate is bound to be very limited.
Limited Exploration
The whole promise of collaborative tagging is that by exploring the tag space you can discover a lot of useful information you would not find with traditional search engines. When your information need is not well defined, the idea that you can explore and see what other people tagged with certain tags is very attractive. We believe that tagging will be able to reach a very wide audience only when exploration techniques will be effective.
Currently, there are two main ways the tag space can be explored: using search/refine, and using some kind of tag space visualization such as a tag cloud.
Say you are looking for a restaurant in your area, searching on del.icio.us for restaurants, your results look like the image in «Figure delicious restaurant search».

Figure delicious restaurant search:
Searching for a restaurant with del.icio.us
You definitely get a few useful links at the top, Yelp is a very useful site to find restaurants. However, if you want to explore and see what types of restaurants are available in what locations or in what kind of price range, the tag list on the right is not useful. At this point, it is better to continue the exploration at yelp for example. This is very similar to what you get by searching for restaurants with Google. The first links are very relevant and are probably directories or hubs that contain information on restaurants. The true exploration will be pursued at one (or several) of these hubs. We believe that searching is only the first step in exploring, and the user wants to continue exploring in a way they would do in a directory like Open Directory or with shopping sites. This is only possible if tags are grouped in clusters.
The other way to explore the tag space is to look at popular pages or tags, for example, a tag cloud in which the size of the tags is proportional to their popularity (see tag cloud on delicious or flickr) Although a great visualization paradigm, we believe that with today’s tagclouds it is hard to find more than one or two tags to click on. Tags are not grouped, there is too much information, so that you find lot of related tags scattered on the tag cloud. One or two popular topics and all their related tags tend to dominate the whole cloud. For example, looking at the del.icio.us tagcloud, one would mostly see tags related to web design and technologies. This is because these topics are overwhelmingly more frequent than anything else. There are some 4,500 links tagged with chocolate and some 61,000 links tagged with food. However, these hardly show up on the tag clouds or the popular pages. We claim that accounting for tag clusters by, for example, showing five semantically more cohesive tag clouds is much more informative.
The key in building an effective exploration space seems to be able to group and show related items and to explain how the items are related. In hierarchical classification systems like dmoz it is easy to present related items, namely the parent, siblings and children items. However, in tagging spaces, such relations don’t exist. Some tagging systems present lists like «this tag often occurs together with the following tags» (related tag list in del.icio.us) or «this item is tagged x, here are other items tagged x». This information is too raw to build an exploration space upon.
We claim that if we could automatically and dynamically cluster tags without putting more burden on the user, we could provide a much stronger service. Searching, subscribing and exploring would be much more effective.
Existing Approaches
There is a lot of relevant work to discuss and we will briefly mention some here. First, we should mention the taxonomy projects such as Open Directory (dmoz.org) and Yahoo! Directories who in fact recognized the issue of tagging even before it existed. Their solution was flawed, however, because they put the burden on the tagger which in their case was either some Yahoo! Employee or a volunteering librarian. See Hierarchy versus Facets versus Tags for a discussion on the topic. Along these lines are the shopping sites (shopping.com, Yahoo! Shopping, Froogle) who use somehow semi-automated techniques for tagging and are based on controlled vocabulary.

Figure rawsugar restaurant search:
Searching for restaurants on RawSugar
With RawSugar taggers can specify tag hierarchies in their own accounts (saying that sushi is a subtag of food for example). The system uses these hierarchies to provide a strong exploration and search experience. «Figure rawsugar restaurant search» is extracted from RawSugar’s global search page for restaurant. The tags are grouped according to the overall hierarchies defined by RawSugar users and thus present a more powerful exploration space.
Flickr has Flickr clusters, which, provided a popular tag, give related tags grouped into clusters. For example, looking at the clusters for the word Jaguar, we see that the clusters neatly fall into several semantic categories of Jaguars: animal, car and plane. The hereby presented guidepost is what makes the difference. Clustering makes it possible to present a guidepost, to provide the means that allow the user to explore the information space. In addition, Flickr also has an interestingness exploration technique which they define as a factor of several parameters including the pageviews, the comments left by users, the specific users, etc.
Rashmi Sinha has published a number of entries on tagging and clustering. Bielenberg and Zachera mention tag clustering, see also their projects group.us and clusty. Recently, Paul Heymann and Hector Garcia-Molina published a paper called Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems that takes a different approach on improving search and exploration in the tag space.
One should also mention a growing number of tag visualization techniques in various stages of development that are currently available on the Web:
- Newzingo
- HubLog: Graph del.icio.us related tags
- Many-to-Many: Visualizing the collective brain
- Tagnautica - Explore the Flickr Tag Space
- Revealicious
- TagCloud
Proposed cluster algorithm
We hope we convinced you that clustering is a valid solution to some of the problems in tagging applications. We further assume that clustering makes sense and the question simply lies in the question how to cluster and how to use this data in our tag applications. The rest of this document is a possible answer to «how to cluster». Hopefully this is a good reference for anyone who wants to build such a cluster algorithm.
It must be pointed out that the following approach is just one of many, our algorithm is based on tag-relations and deals with problems of graph theory. There are other ways to cluster tags such as hyperdimensional data analysis that could be further investigated.
1. Getting the data

Figure building tag-relations:
building tag-relation graphs from delicious’ rss feed
You first have to choose a source of information to get data about tag relations. One of our source was del.icio.us’ recent bookmarks RSS feed. The following figures and results are based on this data sample.
In «Figure building tag-relations» a certain url of the data feed is tagged with socialsoftware, conference, tagging and folksonomy. These tags are built into a little tag relation graph: Each tag relates to all the other tags in the tag bundle.
When you sum up all these small tag relation graphs you usually end up with a big, fat graph, accompanied with a few smaller graphs. It surely depends on the tag data you have. The goal is, that in the end you’d have many smaller graphs. We’d say clusters should have sizes from 4 to 20 tags each.

Figure summed tag-relations:
sum up all tag-relation data and you end up with a big graph
In «Figure summed tag-relations», a small portion of the whole cake is depicted. The number 80 between web2.0 and semantic means that of all the processed metadata, 80 times the tag “web2.0” appeared together with tag “semantic”.
2. Choosing a similarity measure

Similarity measures:
a set of common similarity measures
To build the cluster you need to first compute similarity between the tags. There are several proposed similarity measures. «Why can’t we take the numbers in Figure “summed tag relations”?» The problem would be that you’d favor popular tags. Tag web2.0 nowadays is so popular and is combined wildly with anything. In fact this tag is so overused that if you look at tag bookmarks in the del.icio.us dataset, the most used cotag is web2.0 (based on the box “related tags” on the right, 2006-08-25). Basing tag similarity on these numbers often doesn’t make sense at all. The similarity measure should be chosen so the popularity of a tag doesn’t affect the set of a tags related tags. Don’t cut the long tail. The success of blogs is driven by the importance of the long tail. We all know that it is crucial to support the niches. Tagging applications should empower the long tail too. If you just sort by popularity, you’d loose all those niches. So don’t take similarity measure “matching” in Figure «Similarity measures».
Similarity measures are based on probabilities. E.g. the dice similarity of tag web2.0 and tag semantic is computed by 2 times the probability that both tags appear in a post (|A∩B|) divided by the sum of their independent probability (|A| + |B|). That means that you need to have numbers of coocurrences of tags as well as numbers of relative occurrences of a single tag. That is, in addition to just building the coocurrence graph in «Figure building tag-relations» you also need to count how many times a given tag appears in your samples’ posts. From these numbers you can compute their relative frequency, that is their probability.

Same graph with dice numbers:
Please note that similarity measure have to be symmetric, that means that the following equation must be fulfilled: similarity(A,B) = similarity(B,A). We not yet have conclusions about which similarity measure to take for which data. To stick with the dice number: in Figure «Same graph with dice numbers» the dice numbers are computed for the graph «summed tag-relations».
3. Clustering
After the computation of the similarity weights, there is still this one big graph. Actually the tag connections are still the same as before. The algorithm then has to cluster this graphs into smaller graphs:

Cluster the graph:
cut the big tag-relation graph into a number of clusters
The trick lies in choosing the right points to cut the graph. Here the similarity numbers are taken in account. That is: Cut where the graph edges’ weights are small.
This is actually a common task in data mining, there are several algorithms to fulfill this task. It is crucial that the algorithm is not fixed on the number of clusters as you certainly don’t know how many clusters there should be. That means, algorithms that purely rely on k-means drop out.
We propose using clustering based on techniques similar to spectral clustering based on a “modularity function” Q which measures the quality of a particular clustering of nodes in a graph.
The task of clustering a graph actually is not that easy (lot of hairy math) and we propose using already existing libraries such as metis.
If the clustering succeeds you should end up with some nice clusters as in Figure «Cluster the graph», including the shopping cluster containing the tags fashion, clothes and gifts. This proves that there are also women who use this tagging services.. :-)
Experimental results
Let’s move on to some experimental results to show you that the algorithm not only works in theory but also in practice. The first paragraph depicts results done on clustering RawSugar data (the results can be accessed on the RawSugar lab page), the two following paragraphs are results/anomalies from experiments on the del.icio.usdata.
Clustering “per tag”
Clustering the RawSugar tag data we faced the problem that there is simply too much data. So, instead of clustering the whole tag space we took some tags frequent enough in the tag space and looked at their related tags. Then we clustered these related tags. Below are clusters done on tag health and sports.
Query tag: health:
- shopping, research
- nutrition, food, diet
- fitness, workout, running
- article, science
- life, lifehack, product,howto, gtd, reference, tip
- esport, sport
Query tag: sports:
- hockey, nhl
- baseball, mlb, triple
- basketball, nba, nbdl, wnba
- football, nfl
- alcohol, beer, tv, food, bar
- computer game, action game, free game
Definition vs. Usage

cluster around google:
definition of terms vs. their usage
While clustering our sample of the del.icio.us tag space we found some anomalies our graph algorithm came up with. The first type of anomalies can be found in the Figure «cluster around google»
- The terms china and censorship clearly don’t relate to the term google. The cause for this lies in the hype around the story of Googles’ censorship in china. We suppose if you’d search for china in the per-user bookmark sets of del.icio.us, you’d often get only links pointing to discussions about the censorship story. Even though the dataset used for this experiments consisted of data collected during a timespan of almost a year we observed a few of this “anomalies”. On the other side this probably is how the community thinks at a given point in time. It is not that absurd to think of the google censorship problem when you are online and you are searching for china.
- Tag internet: You all learned in school that Internet not equals world wide web. Here it even appears next to search. On the other hand this shows the power of tagging. The structure arises from how people think, not from the exact definition of the terms.
This anomaly points out that clustering must be done time sensitive and community sensitive. There won’t be a tag clustering that is valid for all time and for all communities:
- time sensitivemeans that the clustering algorithm must be rerun periodically. With changing clusters over time some challenges arise:
- As we may use the clustering data for computing other meta data (for instance assess the interests of a user on the basis of his tag usage) the computed metadata becomes invalid after each clustering. If we find out which clusters correspond to each other on two points in time, the computed metadata could be linked to the corresponding cluster in the new clustering and therefore doesn’t have to be recomputed. The matching of two clusters in two clusterings shouldn’t be that hard: if the size of the two clusters’ intersection exceeds a certain threshold the two clusters are considered as corresponding. We suppose observing clusters over time could be a fairly interesting topic, we suppose that splitting and merging of clusters can be observed.
- The user is confronted with ever changing categorization he didn’t directly initialize. It will be a challenge to minimize this disruption experience.
- community sensitive means that clustering results shouldn’t be used across different services, e.g. early adopters oriented sites as del.icio.us will have different clusterings than academic sites as connotea. On bigger sites, there should be different clusterings for people with different mindsets. Even one user can have clashing terminologies. Although this is caused by the ambiguity of tagging per se, it shows the limits of automated tag clustering.
Ambiguity

“raw cluster” language/library:
“language” and “library” are ambiguous tags
Although sites as del.icio.us are used by people of similar mind set, there is already a certain level of ambiguity. In Figure «”raw cluster” language/library» the tags language and library are ambiguous (please note that this figure depicts a intermediate state in the clustering. This isn’t how the cluster looked like in the end). language could mean a natural language or a programming language, library could mean a programming library or a book library. As briefly mentioned above, in such cases there should be different clusterings for distinct communities within one service. A community recognition would help, communities would have to be detected with the help of clustering. We haven’t investigated into clustering users, there is still much to be researched.
Conclusion
Apart from the problems already mentioned we dealt with issues like tag spam (tagging data that wasn’t caused by humans) and a overwhelming mass of data (problem with processing time). The solutions to these problems are left out in this document and may be described in future documents or blog entries.
We are eager to show that clustering isn’t only theory but has very practical applications. Therefore we are still in the process of building proof of concept applications. If you are interested you can follow RawSugars’ or Philipp’s blog. They are likely to release some further results and prototypes in the future.
You may want to participate in the discussion of this article.