Tuesday, November 03, 2009

Python script to set genre in iTunes with Last.fm tags

Now that I have started to seriously use iTunes I figured it might be nice to have the genre tag set in a meaningful way. Since I have a reasonably large collection of mp3s doing that manually was out of question - I wrote me a Python script to do that. There seems to be a large demand for such a functionality (at least I found a lot of questions on how to automatically set the genre tag) so maybe someone else finds the script useful. It is pasted below.

General Strategy

The basic idea is to use Last.fm's tags for genre tagging. In iTunes the genre tag is IMO best used when it only contains one single genre, i.e. something like "Electronica", not something like "Electronica / Dance". On the other hand dropping all but one tag would lose a lot of information, so I decided to use the groupings tag for additional information that is contained in the list of tags that an artist has on Last.fm. In the example above that would be something like "Electronica, Dance, 80s, German". In that way it is simple to use iTunes' Smart Playlist feature to create play lists of all, say, dance music. This approach is probably not suitable for classical music..

The ID3 field that is exposed in iTunes' UI as "grouping" is defined in the ID3v2 spec as:
TIT1
The 'Content group description' frame is used if the sound belongs to a larger category of sounds/music. For example, classical music is often sorted in different musical sections (e.g. "Piano Concerto", "Weather - Hurricane").
So, the strategy I described above seems to be kind of in line with the spec. In general, it is a good idea to have a look at the ID3v2 spec if you consider dabbling with mp3 tags.

Practical Considerations

If one would just take an artist's highest-rated Last.fm tag for the genre one would end up with pretty inconsistent genre tags (think "hip-hop", "hip hop", and "hiphop"). Therefore, I chose to use a fixed set of values for genre. In a previous version of ID3 the list of possible genres was fixed. While this is clearly a terrible idea to start with it came along handy in this case. I used his as a fixed list for genres.

The second practical consideration was which Last.fm tags to include. In Last.fm parlance each artist tag comes with a weight (values form 0 to 100). Selecting only the tags with weight larger than 50 worked out fine for me (usually I had 1-5 tags per artist).

A third thing you might want to be aware of: if you programmatically change tags in an mp3 iTunes will not pick up these changes automatically. A simple way of letting it know: select the "Get Info" command on these items. This will trigger a reload of the new tag values.

Script

To run the script you will need the Python libraries mutagen and pylast installed. Run it with the option
-d directory_with_mp3s
The script will walk along this directory and modify all mp3s it finds. Also, you will need a Last.fm API key and set your API_KEY and API_SECRET accordingly in the script.


#!/usr/bin/env python
# encoding: utf-8
"""
tag_groupings.py

Created by Michael Marth on 2009-11-02.
Copyright (c) 2009 marth.software.services. All rights reserved.
"""

import sys
import getopt
import pylast
import os.path
from mutagen.id3 import TCON, ID3, TIT1

help_message = '''
Adds ID3 tags to mp3 files for genre and groupings. Tag values are retrieved from Last.FM. Usage:
-d mp3_directory
'''

class Usage(Exception):
def __init__(self, msg):
self.msg = msg

all_genres = TCON.GENRES
genre_cache = {}
groupings_cache = {}
API_KEY = "your key here"
API_SECRET = "your secret here"
network = pylast.get_lastfm_network(api_key = API_KEY, api_secret = API_SECRET)

def artist_to_genre(artist):
if genre_cache.has_key(artist):
return genre_cache[artist]
else:
tags = network.get_artist(artist).get_top_tags()
for tag in tags:
if all_genres.__contains__(tag[0].name.title()):
genre_cache[artist] = tag[0].name.title()
print "%20s %s" % (artist,tag[0].name.title())
return tag[0].name.title()

def artist_to_groupings(artist):
if groupings_cache.has_key(artist):
return groupings_cache[artist]
else:
tags = network.get_artist(artist).get_top_tags()
relevant_tags = []
for tag in tags:
if int(tag[1]) >= 50:
relevant_tags.append(tag[0].name.title())
groupings = ", ".join(relevant_tags)
groupings_cache[artist] = groupings
print "%20s %s" % (artist,groupings)
return groupings

def walk_mp3s():
for root, dirs, files in os.walk('.'):
for name in files:
if name.endswith(".mp3"):
audio = ID3(os.path.join(root, name))
artist = audio["TPE1"]
genre = artist_to_genre(artist[0])
grouping = artist_to_groupings(artist[0])
if genre != None:
audio["TCON"] = TCON(encoding=3, text=genre)
if grouping != None:
audio["TIT1"] = TIT1(encoding=3, text=grouping)
audio.save()

def main(argv=None):
if argv is None:
argv = sys.argv
try:
try:
opts, args = getopt.getopt(argv[1:], "ho:vd:", ["help", "output="])
except getopt.error, msg:
raise Usage(msg)

# option processing
for option, value in opts:
if option == "-v":
verbose = True
if option in ("-h", "--help"):
raise Usage(help_message)
if option in ("-o", "--output"):
output = value
if option in ("-d"):
try:
os.chdir(value)
except Exception,e:
print "error with directory " + value
print e
walk_mp3s()

except Usage, err:
print >> sys.stderr, sys.argv[0].split("/")[-1] + ": " + str(err.msg)
print >> sys.stderr, "\t for help use --help"
return 2

if __name__ == "__main__":
sys.exit(main())

Monday, November 02, 2009

Interviewed by Internet Briefing Blog

Reto Hartinger of the Internet Briefing group has interviewed me about what I work on these days. The interview is in German.

Sunday, October 18, 2009

Colayer's approach to collaboration software

Chances are you have not heard of Colayer, a Swiss-Indian company producing a SaaS-based collaboration software. I did a small project with the guys, that is how I got to know. When I first saw their product I immediately thought the guys are onto something good, so it is worthwhile to share a bit of their application concepts. They follow an approach I have not seen anywhere else.

On first look Colayer seems to be a mixture between wikis and forums: the logical data structure resembles a hierarchical web-based forum and the forum entries are editable and versioned like in a wiki. But there is more: presence and real-time. All users that are currently logged in are visible and one can have real-time chats within the context of the page one is in or see updates to the page in real-time (similar as in Google Docs). These chats are treated as atomic page elements (called sems in Colayer parlance) just like the forum entries or other texts. Through this mechanism, all communication around one topic stays on one page and in the same context.

There are two more crucial elements: time and semantics. All sem's visibility is controlled by their age and their importance. As such, a simple chat is given less weight that a project decision and will fade out of view after some time. All new items from all pages (i.e. discussions or topics) are aggregated on a personal home page and shown within the context where they occurred.

Below is a screenshot of such different sems in one page. One page corresponds to one topic or forum or wiki page. You can see the hierarchical model and the different semantics (denoted by the colors).



Here is an example screen shot that aggregates different recent sems on one page (essentially a context-aware display of new items including time and context in the same display). Note that this way of displaying new items manages to map importance, time and context into a two-dimensional page, which I find a very cool achievement.



The funny thing about Colayer's product (especially when compared to Google Wave) is that one "gets it" when first looking at it. It solves a problem I am facing in my work on a daily basis: where to put or find crucial information - on an internal mailing list or on the wiki?

The Colayer application is delivered as a browser-based SaaS solution (mainly targeted towards company-internal collaboration). This limits potential usage scenarios outside of the firewall. It would be cool if Colayer found a way of opening up their application to other data sources or consumers. It would be worth it, the app rocks.

Thursday, October 01, 2009

Found out about FlexEvent.UPDATE_COMPLETE

Here is a small bit about the asynchronous nature of the Flex layout mechanisms that I learned while slapping together a presales demo yesterday:

When changing properties of UIComponents listen for FlexEvent.UPDATE_COMPLETE events. They get fired when the change is actually done. In my case I needed to get the textWidth of a Label after changing the label text. Right after calling the setter of text the getter of textWidth will still return the old value.

[Some background reading]

Tuesday, September 29, 2009

Talk at Java User Group

Yesterday, I gave a talk at the Java User Group Switzerland (JUGS) titled "Agile RESTful Web Development". It was about the REST style in principle and hands-on RESTful development with Apache Sling. I enjoyed giving the talk and think that it was well received. Here's the slide deck:

Thursday, September 17, 2009

Ruby script for generating Google sitemaps

I just wrote a small Ruby script to generate a Google sitemap out of file directory. I thought that it might come handy as a quick start for someone, so here is the code (requires the builder gem)

require 'builder'
htmlfiles = Dir.glob("*.html")
x = Builder::XmlMarkup.new(:target => $stdout, :indent => 1)
x.instruct!
x.urlset( "xmlns" => "http://www.sitemaps.org/schemas/sitemap/0.9" ) {
for file in htmlfiles do
x.url{x.loc "http://www.example.com/#{file}"; x.lastmod "2009-09-17"; x.changefreq "monthly"; x.priority "0.8"}
end
}

Thursday, July 16, 2009

NoSQL: A long-time relation(ship) comes to an end

(cross-posting from here)

OK, I admit it, declaring that "the RDBMS is dead" is a meme that has been going around the software industry for a while. Remember object-oriented data bases that were supposed to replace the relational ones? Well, guess who is still here. However, despite the RDBMS's amazing survival skills I would like to propose a related prediction:

I believe that the year 2009 will go down in history as the year when the "relational model default" ended. The term "relational model default" was coined by me to describe a peculiar thing that goes on in application development: start talking to your average application developer about some arbitrary business requirement and chances are that simultaneously he mentally constructs a relational model to fit those requirements.

That relational approach to modeling your problem may or may not be suitable. The real problem is that all too often this default does not get challenged. As a consequence, whatever the fitting data model might be, it gets shoehorned into tables and relations.

This default "thinking" has not yet changed for the masses, but I believe that it has changed for the early adopters (which means that invariably it will change for the masses in some years).

I see the default to change from:

"I need to store some data i.e. I need a relational database"
to:
"I need to store something, let me see the data to decide how to store it."
The most concrete and visible manifestation of the rising interest in non-relational data store is the "NoSQL" movement. NoSQL denotes a group of people interested in exploring and comparing alternatives to the traditional relational data storages like MySQL or Postgres. The inaugural get-together has been covered in Computerworld, see also Johan Oskarsson's post and there is, of course, a Hashtag.

Other than the NoSQL group I have a second data point to offer: there is a Cambrian Explosion happening in terms of projects exploring non-relational data stores. During the Cambrian Explosion a major diversification of organisms took place. Similarly a plethora of new projects that explore alternatives to relational models continue to gain interest. Here is an incomplete list:


AllegroGraph, Amazon's SimpleDB, Cassandra, CouchDB, Dynomite, Google's App Engine datastore, HBase, Hypertable, Kai, MemcacheDB, Mongo DB, Neo4J, OpenRDF, Project Voldemort, Redis, Ringo, Scalaris , ThruDB, Tokyo Cabinet (and Tokyo Tyrant and LightCloud)

Last, but certainly not least, there are Apache Jackrabbit and Apache Sling.

From my perspective there are three main areas of innovation in this Cambrian Explosion of data stores:

1. Models

In the relational model you break down your data into tables and relations. This model implies that the data is somewhat tabular. However, in some cases the data simply is not tabular.

Consider web content, which is hierarchical and mixes fine-granular data with binary files (this model is implemented in Jackrabbit). Other (not mutually exclusive) alternative models are document-oriented, key-value pairs, or Graphs/RDF.

One very important aspect of many alternative models is that they are schemaless. That means that they accommodate for Data First approaches where it is not required to define the data structure before one can actually store any data. This enables agile approaches to software development in the short term as well as more flexibility in the long term evolution of business requirements.

Without defining a data structure first it is not possible to store anything at all in an RDBMS. This fact is probably one of the root causes of the relational default thinking. An RDBMS-based developer simply cannot develop anything without thinking about table structure.


2. Scalability

A second area of innovation is scalability. This can be split down into two sections: One is scalability achieved by distributing the data store across separate machines, the approach pioneered by Google. Opposed to classical clustering of RDBMSs the order of magnitude of machines that are considered is hundreds rather than ten. Obviously, different trade-offs regarding consistency and availability of individual cluster nodes must be taken when architecting for such a high number of cluster nodes. Eventual consistency is one of the interesting concepts invented in this space.


While the commoditization of server hardware triggered this first approach to scalability, a second area is related to the rise of multi-core processors. For a number of years CPUs have not gotten faster, but rather the number of cores has increased. There is no explicit contradiction in running a classical RDBMS on a multi-core machine and even having the RDBMS take advantage of them. However, it seems to me that the SQL language is a poor fit for queries in a multi-core environment when compared with alternatives such as Map/Reduce which are parallel by design.


3. Web

The third area of innovation revolves around the fact that the web is the dominant paradigm for computing in our time. This is also acknowledged by the two considerations discussed above. However, a third one is that HTTP is used for accessing the data. Other types of connectivity that were typically implemented as JDBC or ODBC drivers are not needed/used anymore. In many cases the data store exposes its resources in a RESTful API. An obvious benefit is the ubiquitous availability of clients including the browser itself. The classical RDBMS approach involving a dedicated driver looks like a client-server architecture mindset in comparison (I wrote about this 1.5 years ago).

At this point let me re-iterate that RDBMSs are here to stay, just like mainframes never went away. Moreover, a couple of the innovation areas cited above are not that new at all, especially, when it comes to non-relational data models (for example, I recently dug into the foundations of the Lotus Notes document store and came out very impressed). However, it is only now that the relational model default will disappear.


What about content management systems?

Considering the content management system industry as a whole I am extremely happy about this shift away from RDBMSs. Especially the model aspect is crucial: RDBMSs embody a fundamentally wrong model for content. There are varying opinions in the industry about what "content" really is, but one thing is more or less universally accepted: it is (at least partially) unstructured. Well, RDBMSs are designed for structured data. Duh.

So why are there one gazillion LAMP-based CMSs? I blame the relational model default. But as this default vanishes we will see more and more CMSs that are not based on an RDBMS (see the Jackrabbit wiki for a list of JCR-based ones, as well as the recent PHP-based JCR implementations Jackalope or for Typo3 or the Midgard content repository).

Don't laugh, but I truly envision a better (CMS) world once more CMSs are built upon proper tools and not forced into a relational model anymore. It will be a better world for developers and consequently for the CMS users.

What about Day?

REST and content repositories were invented and evangelized by Day's Chief Scientist Roy and Day's CTO David years ago already. So it is no surprise that Day's content management systems are in an excellent shape with respect to these considerations. CQ5 is built upon Apache Jackrabbit, i.e. a data store that implements a content-centric model, and Apache Sling, a web framework designed to be RESTful right from the start.

When it comes to scaling: a week ago we gave a live demonstration on how to install and cluster CQ5 on Amazon's EC2 service. But, expect even more exciting news in this area.