Statistics is a broad mathematical discipline which studies ways to collect, summarize and draw conclusions from data. [Wikipedia]
Statistics help us draw conclusions from data. In a way, this whole tagging thing just popped up, and now we are trying to figure out what is really happening. I think statistics can help us understand tags.
When I set up my performance test system I wanted to know the metrics of del.icio.us, so I tried to extrapolate from some hand-collected data, but it didn’t turn out that well.
After that I started collecting post data from del.icio.us, and I am happy to announce that I’ve set up a fully automated site with del.icio.us statistics (my hands can rest now). It shows trends for the number of posts per day as well as the number of tags per post.
The stats are based on data I extract from the most recent posts feed, which I grab 6 times an hour (I’m trying not to be evil: no screen scraping, no grabbing every minute). I miss a big portion of the posts (actually I record only about 10% of the data), but I guess the stats are precise enough to draw some conclusions.
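To make the sampling idea concrete, here is a minimal sketch of how a posts-per-day figure could be extrapolated from a partial sample. The function name, the `coverage` value of 0.10 (taken from the rough 10% above) and the sample data are illustrative assumptions, not the code that runs behind the stats site.

```python
from collections import Counter
from datetime import datetime


def estimate_posts_per_day(posts, coverage=0.10):
    """Scale the distinct posts seen per day by an assumed coverage.

    posts: iterable of (post_id, iso_timestamp) pairs collected from the feed.
    coverage: assumed fraction of all posts that the polling captures.
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    seen = set()
    per_day = Counter()
    for post_id, timestamp in posts:
        if post_id in seen:  # the same post can appear in two consecutive polls
            continue
        seen.add(post_id)
        day = datetime.fromisoformat(timestamp).date().isoformat()
        per_day[day] += 1
    # Extrapolate: observed count divided by the fraction we believe we capture.
    return {day: round(count / coverage) for day, count in per_day.items()}


# Made-up sample, only to show the mechanics.
sample = [
    ("a1", "2006-01-10T09:00:00"),
    ("a2", "2006-01-10T09:10:00"),
    ("a2", "2006-01-10T09:10:00"),  # duplicate from an overlapping poll
    ("b1", "2006-01-11T14:30:00"),
]
estimates = estimate_posts_per_day(sample)
print(estimates)
```

The estimate is only as good as the assumed coverage: if the real share of captured posts drifts over the day, the extrapolated totals drift with it.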
I’m fond of del.icio.us (as you may know), and when I’m fond of a website I feel the urge to know how many people are using it and whether the service is attracting folks or scaring them away. I need to know what’s up. Especially now that del.icio.us has been acquired by Yahoo, you may ask: “do people stay?”
Anyway, that’s not the only reason for the stats. When I set up the performance tests I wanted real numbers. I also asked on the del.icio.us mailing list; the same question had been asked there a few times, but never answered.
Now, my stats don’t answer every question. If you’re asking yourself “how many inserts does my tag system have to cope with if it gets really big?”, these will help you. But I cannot provide any query stats. Maybe Alexa can give you some query trends (maybe subtracting my numbers from Alexa’s would give you the query stats?).
From the stats you can see the two downtimes of delicious since August.
You can also see that the recent growth of del.icio.us only started in December. I think it has to do with the more polished look and feel (changed in mid-November) as well as with the new Firefox plugin, which gives the service a more professional touch. This growth is a thank-you to Joshua and his team.
Then, take a look at the “tag hump” at 10 tags per post:
My first quick investigations show that this is caused by - you guessed it - tag spammers.
I found two spammers who constantly post bookmarks with 10 tags (look out, the first link has Chinese characters in it; my Firefox slowed down big time). This shows that stats can help find anomalies such as spam.
I also thought that the lazy sheep bookmarklet might cause such humps, but by default lazy sheep’s posts have a maximum of 6 tags. There’s no irregularity at “6”, so I guess lazy sheep doesn’t have a big influence (a fact I’m quite happy with).
I think it will be interesting to observe these tag graphs when the bookmark posting interface changes. I believe the interface plays a big role in how people tag, and this sort of graph could prove that.
I may publish statistics on the estimated number of users (currently tracked: 100k) and the number of bookmarks (currently tracked: 500k), but I’m not yet sure how to compute numbers that are accurate.
I plan to come up with a few other del.icio.us services such as tag clusters, but I’m not yet sure whether that project will ever be finished, so I’ve decided to put up the stats so you’ll have at least this. :-)
Hold on, that’s too much del.icio.us for me
Uh, all this talk about del.icio.us is too much [Otis]
Yeah, you are right. The point is that these stats can be computed for any tagging-powered web service that serves a “most recent posts” feed. If you’re interested in stats for a different service, or you want to do del.icio.us stats on your own, just leave a comment. If there are enough requests, I’ll comment and refactor the code and publish it under the LGPL.
Comparing to other services
Del.icio.us vs. Yahoo MyWeb 2.0
Dorrian Porter has tracked the number of posts of Yahoo’s MyWeb2.0:
Newly saved pages have averaged between 10,000 to 20,000 per week
These numbers are per week. Del.icio.us averages about 55,000 posts per day! This means that right now the database at del.icio.us grows about 20 times as fast as that of Yahoo’s MyWeb 2.0, even measured against the upper figure. That leaves no question as to why they have acquired del.icio.us.
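For anyone who wants to check the comparison, the arithmetic is spelled out below. It uses only the figures quoted above (about 55,000 posts per day for del.icio.us, 10,000 to 20,000 saved pages per week for MyWeb 2.0); the variable names are mine.

```python
DELICIOUS_POSTS_PER_DAY = 55_000
MYWEB_PER_WEEK_LOW = 10_000
MYWEB_PER_WEEK_HIGH = 20_000

# Bring both services to the same unit: posts per week.
delicious_per_week = DELICIOUS_POSTS_PER_DAY * 7

# Ratio against the upper and lower end of the MyWeb range.
ratio_vs_high = delicious_per_week / MYWEB_PER_WEEK_HIGH
ratio_vs_low = delicious_per_week / MYWEB_PER_WEEK_LOW

print(delicious_per_week, ratio_vs_high, ratio_vs_low)
```

So the "about 20 times" figure corresponds to MyWeb's best weeks; against its slower weeks the gap is closer to 40 times.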