Sunday, November 15, 2009

Running the iTunes genre tagger script with OS X Automator

Due to public demand, here's a little recipe for running the last post's mp3 tagger without using the command line on OS X:
  • Open Automator
  • Start a new "Application" project
  • Drag the "Run Shell Script" action into the right workflow panel, set the "pass input" drop-down to "as arguments" and edit the script to (see screenshot below):
for f in "$@"
do
    /opt/local/bin/python /Users/michaelmarth/Development/Code/mp3tagger/ -d "$f"
done

(you will have to adapt the paths to your local setup)
  • Save the application and happily start dropping mp3 folders onto the application's icon.

Tuesday, November 03, 2009

Python script to set genre in iTunes with tags

Now that I have started to seriously use iTunes I figured it might be nice to have the genre tag set in a meaningful way. Since I have a reasonably large collection of mp3s, doing that manually was out of the question, so I wrote a Python script to do it. There seems to be a large demand for such functionality (at least I found a lot of questions on how to automatically set the genre tag), so maybe someone else finds the script useful. It is pasted below.

General Strategy

The basic idea is to use Last.fm's tags for genre tagging. In iTunes the genre tag is IMO best used when it only contains one single genre, i.e. something like "Electronica", not something like "Electronica / Dance". On the other hand, dropping all but one tag would lose a lot of information, so I decided to use the groupings tag for the additional information that is contained in the list of tags an artist has on Last.fm. In the example above that would be something like "Electronica, Dance, 80s, German". That way it is simple to use iTunes' Smart Playlist feature to create playlists of all, say, dance music. This approach is probably not suitable for classical music.

The ID3 field that is exposed in iTunes' UI as "grouping" is defined in the ID3v2 spec as:
The 'Content group description' frame is used if the sound belongs to a larger category of sounds/music. For example, classical music is often sorted in different musical sections (e.g. "Piano Concerto", "Weather - Hurricane").
So, the strategy I described above seems to be kind of in line with the spec. In general, it is a good idea to have a look at the ID3v2 spec if you consider dabbling with mp3 tags.

Practical Considerations

If one just took an artist's highest-rated Last.fm tag for the genre, one would end up with pretty inconsistent genre tags (think "hip-hop", "hip hop", and "hiphop"). Therefore, I chose to use a fixed set of values for the genre. In a previous version of ID3 the list of possible genres was fixed. While this is clearly a terrible idea to start with, it came in handy in this case: I used this list as the fixed set of genres.
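The normalization this requires can be sketched in a few lines. The genre list below is just an excerpt for illustration (the full fixed ID3v1 list is available as mutagen's TCON.GENRES):

```python
import re

# Excerpt of the fixed ID3v1 genre list (the full list lives in
# mutagen.id3.TCON.GENRES); inlined here so the sketch is self-contained.
GENRES = ["Blues", "Classic Rock", "Electronic", "Hip-Hop", "Jazz", "Rock"]

def canonical_genre(raw_tag):
    """Map a free-form tag like 'hip hop' onto one entry of the fixed list."""
    key = re.sub(r"[^a-z0-9]", "", raw_tag.lower())  # "hip hop" -> "hiphop"
    for genre in GENRES:
        if re.sub(r"[^a-z0-9]", "", genre.lower()) == key:
            return genre
    return None  # no genre from the fixed list matched

print(canonical_genre("hip hop"))  # Hip-Hop
print(canonical_genre("HipHop"))   # Hip-Hop
```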

The second practical consideration was which tags to include. In Last.fm parlance each artist tag comes with a weight (values from 0 to 100). Selecting only the tags with a weight larger than 50 worked out fine for me (usually I had 1-5 tags per artist).
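The selection rule is straightforward; here is a sketch with made-up tag data in the (name, weight) shape described above:

```python
def grouping_tags(top_tags, threshold=50):
    """top_tags: (tag_name, weight) pairs for one artist; weight is 0-100.
    Returns the comma-separated groupings string for the ID3 grouping field."""
    kept = [name.title() for name, weight in top_tags if int(weight) >= threshold]
    return ", ".join(kept)

# hypothetical artist tags
print(grouping_tags([("electronica", 98), ("dance", 60), ("seen live", 12)]))
# Electronica, Dance
```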

A third thing you might want to be aware of: if you programmatically change tags in an mp3 iTunes will not pick up these changes automatically. A simple way of letting it know: select the "Get Info" command on these items. This will trigger a reload of the new tag values.


To run the script you will need the Python libraries mutagen and pylast installed. Run it with the option
-d directory_with_mp3s
The script will walk this directory and modify all mp3s it finds. Also, you will need a Last.fm API key; set API_KEY and API_SECRET accordingly in the script.

#!/usr/bin/env python
# encoding: utf-8
"""
Created by Michael Marth on 2009-11-02.
Copyright (c) 2009 All rights reserved.
"""

import sys
import getopt
import os
import os.path
import pylast
from mutagen.id3 import ID3, TCON, TIT1

help_message = '''
Adds ID3 tags to mp3 files for genre and groupings. Tag values are retrieved from Last.FM. Usage:
-d mp3_directory
'''

class Usage(Exception):
    def __init__(self, msg):
        self.msg = msg

all_genres = TCON.GENRES
genre_cache = {}
groupings_cache = {}
API_KEY = "your key here"
API_SECRET = "your secret here"
network = pylast.get_lastfm_network(api_key = API_KEY, api_secret = API_SECRET)

def artist_to_genre(artist):
    if genre_cache.has_key(artist):
        return genre_cache[artist]
    tags = network.get_artist(artist).get_top_tags()
    for tag in tags:
        if tag[0].name.title() in all_genres:
            genre_cache[artist] = tag[0].name.title()
            print "%20s %s" % (artist, tag[0].name.title())
            return tag[0].name.title()
    genre_cache[artist] = None
    return None

def artist_to_groupings(artist):
    if groupings_cache.has_key(artist):
        return groupings_cache[artist]
    tags = network.get_artist(artist).get_top_tags()
    relevant_tags = []
    for tag in tags:
        if int(tag[1]) >= 50:
            relevant_tags.append(tag[0].name.title())
    groupings = ", ".join(relevant_tags)
    groupings_cache[artist] = groupings
    print "%20s %s" % (artist, groupings)
    return groupings

def walk_mp3s():
    for root, dirs, files in os.walk('.'):
        for name in files:
            if name.endswith(".mp3"):
                audio = ID3(os.path.join(root, name))
                artist = audio["TPE1"]
                genre = artist_to_genre(artist[0])
                grouping = artist_to_groupings(artist[0])
                if genre != None:
                    audio["TCON"] = TCON(encoding=3, text=genre)
                if grouping != None:
                    audio["TIT1"] = TIT1(encoding=3, text=grouping)
                # write the modified frames back to the file
                audio.save()

def main(argv=None):
    if argv is None:
        argv = sys.argv
    try:
        try:
            opts, args = getopt.getopt(argv[1:], "ho:vd:", ["help", "output="])
        except getopt.error, msg:
            raise Usage(msg)

        # option processing
        for option, value in opts:
            if option == "-v":
                verbose = True
            if option in ("-h", "--help"):
                raise Usage(help_message)
            if option in ("-o", "--output"):
                output = value
            if option == "-d":
                try:
                    os.chdir(value)
                    walk_mp3s()
                except Exception, e:
                    print "error with directory " + value
                    print e

    except Usage, err:
        print >> sys.stderr, sys.argv[0].split("/")[-1] + ": " + str(err.msg)
        print >> sys.stderr, "\t for help use --help"
        return 2

if __name__ == "__main__":
    sys.exit(main())

Monday, November 02, 2009

Sunday, October 18, 2009

Colayer's approach to collaboration software

Chances are you have not heard of Colayer, a Swiss-Indian company producing SaaS-based collaboration software. I did a small project with them, which is how I got to know them. When I first saw their product I immediately thought they were onto something good, so it is worthwhile to share a bit of their application concepts. They follow an approach I have not seen anywhere else.

On first look, Colayer seems to be a mixture of wikis and forums: the logical data structure resembles a hierarchical web-based forum, and the forum entries are editable and versioned like in a wiki. But there is more: presence and real-time. All users that are currently logged in are visible, and one can have real-time chats within the context of the page one is on, or see updates to the page in real time (similar to Google Docs). These chats are treated as atomic page elements (called sems in Colayer parlance), just like the forum entries or other texts. Through this mechanism, all communication around one topic stays on one page and in the same context.

There are two more crucial elements: time and semantics. Each sem's visibility is controlled by its age and its importance. As such, a simple chat is given less weight than a project decision and will fade out of view after some time. All new items from all pages (i.e. discussions or topics) are aggregated on a personal home page and shown within the context where they occurred.

Below is a screenshot of such different sems in one page. One page corresponds to one topic or forum or wiki page. You can see the hierarchical model and the different semantics (denoted by the colors).

Here is an example screen shot that aggregates different recent sems on one page (essentially a context-aware display of new items including time and context in the same display). Note that this way of displaying new items manages to map importance, time and context into a two-dimensional page, which I find a very cool achievement.

The funny thing about Colayer's product (especially when compared to Google Wave) is that one "gets it" when first looking at it. It solves a problem I am facing in my work on a daily basis: where to put or find crucial information - on an internal mailing list or on the wiki?

The Colayer application is delivered as a browser-based SaaS solution (mainly targeted towards company-internal collaboration). This limits potential usage scenarios outside of the firewall. It would be cool if Colayer found a way of opening up their application to other data sources or consumers. It would be worth it, the app rocks.

Thursday, October 01, 2009

Found out about FlexEvent.UPDATE_COMPLETE

Here is a small bit about the asynchronous nature of the Flex layout mechanisms that I learned while slapping together a presales demo yesterday:

When changing properties of UIComponents, listen for FlexEvent.UPDATE_COMPLETE events. They are fired when the change is actually done. In my case I needed the textWidth of a Label after changing the label text. Right after calling the setter of text, the getter of textWidth will still return the old value.

[Some background reading]

Tuesday, September 29, 2009

Talk at Java User Group

Yesterday, I gave a talk at the Java User Group Switzerland (JUGS) titled "Agile RESTful Web Development". It was about the REST style in principle and hands-on RESTful development with Apache Sling. I enjoyed giving the talk and think that it was well received. Here's the slide deck:

Thursday, September 17, 2009

Ruby script for generating Google sitemaps

I just wrote a small Ruby script to generate a Google sitemap out of a file directory. I thought it might come in handy as a quick start for someone, so here is the code (requires the builder gem):

require 'builder'

htmlfiles = Dir.glob("*.html")
x = => $stdout, :indent => 1)
x.urlset("xmlns" => "") {
  for file in htmlfiles do
    # is a placeholder -- use your site's base URL
    x.url { x.loc "{file}"; x.lastmod "2009-09-17"; x.changefreq "monthly"; x.priority "0.8" }
  end
}

Thursday, July 16, 2009

NoSQL: A long-time relation(ship) comes to an end

(cross-posting from here)

OK, I admit it, declaring that "the RDBMS is dead" is a meme that has been going around the software industry for a while. Remember object-oriented databases that were supposed to replace the relational ones? Well, guess who is still here. However, despite the RDBMS's amazing survival skills I would like to propose a related prediction:

I believe that the year 2009 will go down in history as the year when the "relational model default" ended. The term "relational model default" was coined by me to describe a peculiar thing that goes on in application development: start talking to your average application developer about some arbitrary business requirement, and chances are that he will simultaneously construct a mental relational model to fit those requirements.

That relational approach to modeling your problem may or may not be suitable. The real problem is that all too often this default does not get challenged. As a consequence, whatever the fitting data model might be, it gets shoehorned into tables and relations.

This default "thinking" has not yet changed for the masses, but I believe that it has changed for the early adopters (which means that invariably it will change for the masses in some years).

I see the default changing from:

"I need to store some data, i.e. I need a relational database"

to:

"I need to store something, let me see the data to decide how to store it."
The most concrete and visible manifestation of the rising interest in non-relational data stores is the "NoSQL" movement. NoSQL denotes a group of people interested in exploring and comparing alternatives to traditional relational data storage like MySQL or Postgres. The inaugural get-together has been covered in Computerworld, see also Johan Oskarsson's post, and there is, of course, a Twitter hashtag.

Beyond the NoSQL group I have a second data point to offer: there is a Cambrian Explosion happening in terms of projects exploring non-relational data stores. During the Cambrian Explosion a major diversification of organisms took place. Similarly, a plethora of new projects that explore alternatives to the relational model continues to gain interest. Here is an incomplete list:

AllegroGraph, Amazon's SimpleDB, Cassandra, CouchDB, Dynomite, Google's App Engine datastore, HBase, Hypertable, Kai, MemcacheDB, Mongo DB, Neo4J, OpenRDF, Project Voldemort, Redis, Ringo, Scalaris, ThruDB, Tokyo Cabinet (and Tokyo Tyrant and LightCloud)

Last, but certainly not least, there are Apache Jackrabbit and Apache Sling.

From my perspective there are three main areas of innovation in this Cambrian Explosion of data stores:

1. Models

In the relational model you break down your data into tables and relations. This model implies that the data is somewhat tabular. However, in some cases the data simply is not tabular.

Consider web content, which is hierarchical and mixes fine-granular data with binary files (this model is implemented in Jackrabbit). Other (not mutually exclusive) alternative models are document-oriented, key-value pairs, or Graphs/RDF.

One very important aspect of many alternative models is that they are schemaless. That means that they accommodate Data First approaches, where it is not required to define the data structure before one can actually store any data. This enables agile approaches to software development in the short term, as well as more flexibility in the long-term evolution of business requirements.
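A toy illustration of the Data First idea: documents with differing fields can be stored as-is, and decisions about structure are deferred until read time (the documents below are made up):

```python
import json

# Data First: store documents as-is, no schema declared up front
docs = [
    {"title": "Hello", "tags": ["intro"]},
    {"title": "Bye", "author": "mm"},  # different fields, still storable
]
store = [json.dumps(d) for d in docs]

# decide how to read the data only once you have seen it
titles = [json.loads(d)["title"] for d in store]
print(titles)  # ['Hello', 'Bye']
```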

Without defining a data structure first it is not possible to store anything at all in an RDBMS. This fact is probably one of the root causes of the relational default thinking. An RDBMS-based developer simply cannot develop anything without thinking about table structure.

2. Scalability

A second area of innovation is scalability. This can be split into two parts. One is scalability achieved by distributing the data store across separate machines, the approach pioneered by Google. As opposed to classical clustering of RDBMSs, the number of machines considered is on the order of hundreds rather than tens. Obviously, different trade-offs regarding consistency and availability of individual cluster nodes must be made when architecting for such a high number of cluster nodes. Eventual consistency is one of the interesting concepts invented in this space.

While the commoditization of server hardware triggered this first approach to scalability, a second area is related to the rise of multi-core processors. For a number of years CPUs have not gotten faster, but rather the number of cores has increased. There is no explicit contradiction in running a classical RDBMS on a multi-core machine and even having the RDBMS take advantage of them. However, it seems to me that the SQL language is a poor fit for queries in a multi-core environment when compared with alternatives such as Map/Reduce which are parallel by design.

3. Web

The third area of innovation revolves around the fact that the web is the dominant paradigm for computing in our time. This is also acknowledged by the two considerations discussed above. However, a third one is that HTTP is used for accessing the data. Other types of connectivity that were typically implemented as JDBC or ODBC drivers are not needed/used anymore. In many cases the data store exposes its resources in a RESTful API. An obvious benefit is the ubiquitous availability of clients including the browser itself. The classical RDBMS approach involving a dedicated driver looks like a client-server architecture mindset in comparison (I wrote about this 1.5 years ago).

At this point let me re-iterate that RDBMSs are here to stay, just like mainframes never went away. Moreover, a couple of the innovation areas cited above are not that new at all, especially, when it comes to non-relational data models (for example, I recently dug into the foundations of the Lotus Notes document store and came out very impressed). However, it is only now that the relational model default will disappear.

What about content management systems?

Considering the content management system industry as a whole I am extremely happy about this shift away from RDBMSs. Especially the model aspect is crucial: RDBMSs embody a fundamentally wrong model for content. There are varying opinions in the industry about what "content" really is, but one thing is more or less universally accepted: it is (at least partially) unstructured. Well, RDBMSs are designed for structured data. Duh.

So why are there one gazillion LAMP-based CMSs? I blame the relational model default. But as this default vanishes we will see more and more CMSs that are not based on an RDBMS (see the Jackrabbit wiki for a list of JCR-based ones, as well as the recent PHP-based JCR implementations such as Jackalope, the one for TYPO3, or the Midgard content repository).

Don't laugh, but I truly envision a better (CMS) world once more CMSs are built upon proper tools and not forced into a relational model anymore. It will be a better world for developers and consequently for the CMS users.

What about Day?

REST and content repositories were invented and evangelized by Day's Chief Scientist Roy Fielding and Day's CTO David Nüscheler years ago already. So it is no surprise that Day's content management systems are in an excellent shape with respect to these considerations. CQ5 is built upon Apache Jackrabbit, i.e. a data store that implements a content-centric model, and Apache Sling, a web framework designed to be RESTful right from the start.

When it comes to scaling: a week ago we gave a live demonstration on how to install and cluster CQ5 on Amazon's EC2 service. But, expect even more exciting news in this area.

Monday, June 22, 2009

Jazoon talk on "Scalable Agile Web Development"

On Thursday, I will give a talk at the Jazoon conference in Zurich. It will be about Apache Sling, the web framework for content-centric applications. The agenda is:

Scalable Agile Web Development: REST meets JCR meets OSGI

This session is a very hands-on lab that shows how a real web application is developed from scratch in a very agile fashion, leveraging a heavy-weight, enterprise-ready back-end yet allowing for unprecedented agility in building REST-style web applications. Thinking of a classic J2EE stack this may sound like a contradiction.

Agility of development begins with the amount of tooling and setup we need to get started, so expect to see the entire walk-through from installation of the server software to the development of a complete application within the time constraints of the session.

(1) Web architecture, think outside the box.
(2) Meet: apache sling.
(3) Building a real-life webapp from scratch.

The full conference agenda is here. I shall also help Michael Dürig with his session on Scala and Sling.

Wednesday, May 06, 2009

CMIS Technical Committee

It has been a while since I have been on a standards committee (the last one was OMTP), but I have now joined the Technical Committee of CMIS: Content Management Interoperability Services. Better interop is certainly something the CMS world is in dire need of.

Tuesday, April 28, 2009

minimeme says: Hello world!

Today, I am happy to announce that minimeme is finally "officially" going live. minimeme is a news aggregator focused on tech and software development news.

minimeme was born out of a personal frustration of mine: each morning I would skim through my feed reader only to find the relevant items twice or more. On the other hand, the signal-to-noise ratio of many feeds was way too low. I felt like a machine trying to retrieve the important items. So I decided to build a machine to do that for me.

There is no human intervention in the news selection - it is all done by a bias-free, neutral algorithm. Hence the claim "little Switzerland of tech news": minimeme is supposed to be neutral like Switzerland.

Having tested the algorithm for a couple of months I believe minimeme is now stable enough to be officially let loose. On top of the two currently implemented sections "dev" (feed) and "valley" (feed) there is a Twitter account you might like to follow. "dev" covers software development aspects from Ruby to CSS to REST. In the "valley" section you will find news from Google to startups to gadgets.

For the future I plan to add other topics as well as look into some recommendation algorithms. Let me know on the feedback forum which features you would like to see.

Thursday, April 16, 2009

The Lifetime of a CMS Installation

(cross-posting from here)

CMS analyst Janus Boye has blogged about the expected lifetime of a CMS installation, i.e. for how long an installed CMS can be expected to be in production. His guess is a lifetime of 3 years. In the blog's comments Janus and I got into a discussion about the accuracy of that guess, where he asked Day to publish actual real data about this topic.

I like this idea because publishing this data provides a benefit to our potential new customers: a reliable indicator (without any hand-waving or gut feelings) of the CMS's lifetime that can be used in business plans.

The data

The data I have used is taken from Day's support contracts. Only customer data from outside of Europe was used (simply because it was available to me). This selection is likely to bias the results towards shorter lifetimes, as Day's oldest customers are based in Europe. The basic assumption is that the lifetime of the CMS is equivalent to the duration of the support contract. The end point used for each contract period is the date up to which the contract is paid for as of today.

You might argue that there could be customers that have a contract but do not actually use the product anymore, which could in fact be the case (I do not know of any). On the other hand, I am aware of customers that still use the product and have terminated their support contract. Therefore, in order to reduce selection bias I did not remove any data points due to this particular consideration.

Each customer was counted once for each product he purchased, i.e. a customer that has two distinct support contracts for CRX and CQ was sampled twice. I discarded all OEM contracts because of their different nature (they would skew the result towards longer lifetimes). Finally, I also dropped a data point where the support contract was cancelled because the customer went out of business altogether.

I believe that this data set is reasonably unbiased to provide meaningful results with respect to the question of the lifetime of a customer's CQ/CRX installation.

The Method

Luckily for Day, the data is what is called "right censored". That means that it is unknown for how long an existing support contract will go on - actually the majority of the available data points are right censored.

The scientific discipline that is concerned with analyzing data of this kind is called "survival analysis". One is interested in the survival function, which gives the probability that the event of interest has not yet occurred by a given time. The survival function is a property of a random variable, i.e. it needs to be estimated (in the statistical sense of the word).

One well-known estimator for the survival function is the Kaplan-Meier estimator (which is non-parametric, i.e. there are no underlying assumptions about the distribution of the data). In a nutshell:

The Kaplan-Meier estimate of the survival function, Ŝ(t), corresponds to the non-parametric MLE of S(t). The resulting estimate is a step function that has jumps at the observed event times t_i. In general, it is assumed the t_i are ordered: 0 ≤ t_1 < t_2 < … < t_n. If the number of events at time t_i is d_i, and the number of individuals at risk (i.e., who have not experienced the event) at a time just before t_i is Y_i, then the Kaplan-Meier estimate of the survival function is given by:

Ŝ(t) = ∏_{t_i ≤ t} (1 − d_i / Y_i)

and its estimated variance by Greenwood's formula:

Var[Ŝ(t)] = Ŝ(t)² · ∑_{t_i ≤ t} d_i / (Y_i (Y_i − d_i))

The quantity of interest is the mean survival time (and its respective estimate), which is given by:

μ = ∫₀^∞ S(t) dt

Because Ŝ(t) may not converge to zero, this estimate may diverge. Therefore the integral is only taken up to a finite time τ, i.e. μ̂ = ∫₀^τ Ŝ(t) dt. A reasonable choice of τ is the largest observed or censored time.


Resisting a geek's urge to implement the estimator myself, I used the freely available R to calculate the results. Here is a plot of the Kaplan-Meier estimate of the survival function with 95% confidence bounds (time is in days):

And finally, the estimated value for the mean survival time, i.e. the estimated lifetime of a Day CMS installation is: 2453 days with a standard deviation of 154 days. That's about 6.7 years. Mind you, this result is likely to be lower than if the whole customer base had been analyzed.
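For the curious, the estimator itself is only a few lines. Here is a minimal Python sketch on toy, made-up contract durations (not the actual Day data):

```python
def kaplan_meier(times, events):
    """times: observed durations; events: 1 = contract ended, 0 = right-censored."""
    event_times = sorted(set(t for t, e in zip(times, events) if e))
    s, curve = 1.0, []
    for ti in event_times:
        at_risk = sum(1 for t in times if t >= ti)                  # Y_i
        d = sum(1 for t, e in zip(times, events) if e and t == ti)  # d_i
        s *= 1 - d / at_risk
        curve.append((ti, s))
    return curve

def restricted_mean(curve, tau):
    """Integrate the step function S(t) from 0 up to tau."""
    mean, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in curve:
        if t > tau:
            break
        mean += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return mean + prev_s * (tau - prev_t)

# toy data: four support contracts, two of them still running (censored)
times, events = [2, 3, 5, 8], [1, 0, 1, 0]
curve = kaplan_meier(times, events)
print(curve)                      # [(2, 0.75), (5, 0.375)]
print(restricted_mean(curve, 8))  # 5.375
```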

Thursday, March 05, 2009

HATEOAS in 3 lines

Stefan Tilkov brilliantly sums up the out-of-band information problem in REST's HATEOAS constraint on the REST-discuss group:

Given the representation contains

<link rel="some-concept" href="/some-uri">

you don't hardcode the string "/some-uri" into your client, but rather the string "some-concept".
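To make the constraint concrete, here is a toy Python client resolving a link by its relation name. The representation and the rels are made up, and a real client would use a proper parser rather than a regex:

```python
import re

# a hypothetical representation returned by a server
representation = '''
<order self="/orders/17">
  <link rel="payment" href="/orders/17/payment">
  <link rel="cancel" href="/orders/17/cancel">
</link></order>
'''.replace("</link>", "")  # keep the toy markup minimal

# the client hardcodes the *relation name*, never the URI
links = dict(re.findall(r'<link rel="([^"]+)" href="([^"]+)">', representation))
payment_uri = links["payment"]
print(payment_uri)  # /orders/17/payment
```

The server is free to change "/orders/17/payment" to anything else; the client keeps working because it only depends on the rel.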

Wednesday, March 04, 2009

The misuse of the term "RESTful" in the Rails community

Today I went to a talk at the local Ruby on Rails group. The speaker was quite clueful. He had even implemented his own DSL to describe his business problem. Obviously, the guy was not a noobie in Ruby.

However, what really turned me off was his usage of the word "RESTful". For him, it seemed to be a way to describe the inner workings of his application, like, say, "separation of concerns". RoR guys are generally not the most clueless people, but nobody in the audience challenged him about this. It seemed to be the generally accepted usage of the term in the Rails community.

This made me think that DHH and Rails have done two things to REST:
  • First, they greatly help to evangelize the term "RESTful"
  • Second, they hijacked the meaning of the term and changed it from "architectural style" to "application architecture"
As it happens, I listened to a podcast from the Pragmatic Programmers on my way home. It was about the .NET Ruby implementation and, of course, Rails and consequently REST were brought up. One of the speakers said that he was only introduced to REST through Rails. He went on to explain REST in a way that confused the hell out of me, but the essence was along the lines of "HTTP is good". If the Rails community is fuzzy about what REST is, people who get it second-hand from them are as well.

I believe that a part of the misunderstanding is that the term "architectural style" (as opposed to "architecture") is not understood well enough in the development community. However, Roy Fielding has written a brilliant post about that difference between an architectural style and an architecture: "On Software Architecture".
Web implementations are not equivalent to Web architecture and Web architecture is not equivalent to the REST style.
RESTful-Rails-people please have a look at that post.

PS: Ted Neward had some predictions for 2009 (as I silently predicted, nobody cared that I did not make any predictions for 2009), one of them just came to my mind (emphasis mine):
Roy Fielding will officially disown most of the "REST"ful authors and software packages available. Nobody will care--or worse, somebody looking to make a name for themselves will proclaim that Roy "doesn't really understand REST". And they'll be right--Roy doesn't understand what they consider to be REST, and the fact that he created the term will be of no importance anymore. Being "REST"ful will equate to "I did it myself!", complete with expectations of a gold star and a lollipop.

Tuesday, January 27, 2009

Microformat rel-tags are a broken spec

In my current pet project I do a fair amount of parsing of rel-tags (the microformat spec for "tagging"). At first I got a bit agitated about how many occurrences there are where the spec is not implemented correctly. But I have come to realize that the spec is simply broken. There are two ways I can think of for a spec to be broken:
  1. If it's internally inconsistent or inconsistent with other specs.
  2. If it's somehow useless.
The rel-tag spec almost meets 1., which makes it a little bit of 2. Here's why: the spec says that this tag
<a href="" rel="tag">fish</a>
denotes "tech" rather than "fish." This means that this microformat restricts the URL space on your server. You need to have the "tag" folder and in it there must be a file "tech" - unless you link to another site which is not a solution to the problem.
Being able to control my own URL space is one of the principles of the web. That's what I mean by "the rel-tag spec is inconsistent with other specs." The problem is much the same as with /favicon.ico:
The use of a reserved location on a website conflicts with the Architecture of the World Wide Web and is known as link squatting or URI squatting.

As a result, the web is full of rel-tags whose URLs are constructed in makeshift ways just to satisfy the spec. This has been spotted previously, of course. I wonder why this criticism has not been addressed so far.

Tuesday, January 06, 2009

One button - endless possibilities

By pure chance I found a useful feature on my iPhone: when you play music but have another application in the foreground double-clicking "the" button will bring up a mini iPod control. That's nice e.g. for skipping a song without having to leave the app in the foreground.

Related news from the Macworld:

Apple Introduces Revolutionary New Laptop With No Keyboard