Alex Ross: “The Rest is Noise”

“The Rest is Noise” by Alex Ross gives a great overview of the musical history of the twentieth century. Starting with a last toe in the Romantic period of the nineteenth century, the twentieth century hosted an incredible evolution and revolution of musical production and reception. The book runs from turn-of-the-century opera, through the French-German dichotomy of Debussy versus Schoenberg, to the conceptual music of Cage. Along the way it touches upon jazz, the American composers, minimalism and The Beatles. Completeness and thoroughness are the keywords.

The first chapters of the book made me run to the local music dealer and spend more than a hundred euros on the enchanting canon of French impressionistic music. Since then, Debussy and Ravel have lived within a one-meter radius of my CD player. They were later joined by Shostakovich, Stravinsky and Gershwin. Driven by the detailed yet accessible analysis of the works, I ventured into the second part of the book.

However, in keeping with the more scattered situation of the music scene in the second half of the century, the book took on a more fragmented tone and style. Whereas the first part took only two weekends to read, the second part stayed with me for a significantly longer period. Still, Alex Ross managed to convey the same inspiration, as my acquisition of Steve Reich’s “Music for 18 Musicians” and my renewed interest in “Sgt. Pepper’s Lonely Hearts Club Band” show.

In conclusion, “The Rest is Noise” deserves a place on every music lover’s bookshelf. Be aware that, on top of the modest price of the book, you will almost certainly spend a lot of money on records from the list at the end of the book.

Why the Gender Genie does not work

The Gender Genie

Argamon, Koppel

The Gender Genie is a popular offshoot of some papers that Argamon and Koppel published in 2000-something. A quick summary: if you have a lot of texts and you know the gender of each author, you can “train” a model that distinguishes male texts from female texts. In this case the model consists of a set of function words (e.g. me, myself, your). The training classifies each of these words as a “female” word or a “male” word and assigns it an “importance factor”. So, a text that contains many “me”s and some “with”s is probably female, even if there are some “above”s in the text.
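A minimal sketch of how such a weighted function-word model could score a text. The weights below are invented for illustration (positive for “female” cues, negative for “male” cues); they are not taken from the papers or from the Gender Genie itself, and score_text and guess_gender are hypothetical names.

```r
# Hypothetical importance factors: positive = "female" word, negative = "male" word.
weights <- c(me = 1.2, myself = 0.8, your = 0.5, with = 0.6, above = -0.9, the = -0.4)

score_text <- function(text, weights) {
  # Lowercase the text and split it on anything that is not a letter or apostrophe.
  tokens <- strsplit(tolower(text), "[^a-z']+")[[1]]
  tokens <- tokens[tokens != ""]
  # Count only the function words that the model knows about.
  counts <- table(factor(tokens, levels = names(weights)))
  # Weighted sum: each occurrence contributes its importance factor.
  sum(as.numeric(counts) * weights)
}

guess_gender <- function(text, weights) {
  if (score_text(text, weights) > 0) "female" else "male"
}

guess_gender("Come with me and sit with me above the shop", weights)
```

In this toy text the two “me”s and two “with”s outweigh the single “above” and “the”, which mirrors the reasoning in the paragraph above.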

Argamon and Koppel are clever people, and they also incorporate a “genre” difference, which you have to specify yourself. They probably have three models that predict the gender of an author, one for each genre (fiction, non-fiction, blog). That is in itself a very important observation: social/gender patterns can be different for each situation/genre.

But their insight is at the same time their problem. The insight that you have to define the context in order to get a good social prediction leads inevitably to the conclusion that you have to define that context so narrowly that you cannot generalize anymore. An example: distinguishing between fiction and non-fiction is nice, but within non-fiction you have popular science books, but also philosophy and “learn a language” books. Within the science books, you have the die-hard physics books, but also more humanities-oriented books. In the humanities, you have linguistics, but you also have literature. Within literature, etc. It is always possible to go deeper into the taxonomy. The same depth of structure is present on the demographic side. OK, females talk differently from males, but there are older and younger females, and some are highly educated while others are not. Or, some females live in New York, others in California, etc. So, on both the situational and the social side there are many distinctions that you can make, and all these distinctions influence each other.

If you want to claim that you can predict the gender of a text independently of all these possible confounding features, you have to train your model on a corpus so large that it flattens out all these distinctions. However, such a corpus is not available.

So, I am not saying that the Gender Genie is totally useless. On the contrary, you have to start somewhere. In the case of the Gender Genie, that somewhere is the prediction of gender, controlling (at a relatively high level) for genre. From there, we have to dive into the taxonomies of situational and social distinctions.

Yves F. Peeters: I will comply

Made by Yves F. Peeters

If Heisenberg were a variational linguist …

The current state of variational linguistics can be compared to a fundamental, philosophical idea from quantum mechanics, i.e. Heisenberg’s Uncertainty Principle. That principle states roughly that all physical quantities that can be observed are subject to unpredictable fluctuations, so that their values are not precisely defined. In the world of physics, this can be applied to an electron, which has a momentum (mass times velocity) and a position. Traditionally, the electron is presented as moving (momentum) around a nucleus (position). However, Heisenberg said that the momentum and the position cannot be determined simultaneously; either you have a precise idea of the position, or you know the exact speed. There is always a trade-off between momentum and position. This has led to the joke that Heisenberg, driving his car, was stopped by a policeman who asked: “Do you know how fast you were going?” Heisenberg replied: “Well, I know exactly where I was…”
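In its standard textbook form, which is more specific than the rough statement above, the principle bounds the product of the uncertainties in position and momentum:

```latex
\[
  \Delta x \, \Delta p \;\ge\; \frac{\hbar}{2}
\]
```

Here $\Delta x$ and $\Delta p$ are the standard deviations of position and momentum, and $\hbar$ is the reduced Planck constant. The smaller one uncertainty is made, the larger the other must be.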

The implication of the Uncertainty Principle is an incomplete physical system of probabilities. Unlike in meteorology, where probabilistic predictions are due to our incomplete grasp of all the explaining dimensions, the incompleteness and uncertainty in quantum mechanics are fundamental. This was upsetting for Einstein, who is reported to have said: “God does not play dice with the universe”. Heisenberg noticed the linguistic (or rather semantic) problems that a system of probabilities implies. In the classic view, physical phenomena are accurately described in mathematics and logic (either true or false, “tertium non datur”), and a language that correlates with the logical symbols exists. For quantum mechanics, however, the situation changes. Let’s take the concept of temperature. In our classic view, we have a very strict definition of what temperature is. At the level of quantum physics, the atomic level, this definition no longer holds. All that is left is a correlation between the measurement of temperature and certain statistical expectations about the properties of the atom.

A number of people (linguists, philosophers) have followed this semantic path and have drifted off in the direction of the Sapir-Whorf hypothesis. Others have pointed out that the observer’s paradox links up neatly with the Uncertainty Principle. The observer’s paradox addresses the problem of a language user changing his (linguistic) behavior because he knows his language use is being observed. This is an important topic in the methodology of Labov, who points out that the sociolinguist should study the vernacular, i.e. the speech of someone who is not paying attention to his speech.

However, in my opinion, the Uncertainty Principle is also at work in the current state of variational linguistics. Variational linguistics is the study of grammar in relation to context. One could say that a variational linguist sees “Language” as a phenomenon with two characteristics: on the one hand, it is the production of sounds, words, clauses, sentences, text and discourse; on the other hand, it is a social phenomenon, spoken by people in a certain setting. Just like in quantum physics, it is only possible to get statistical expectations of these two dimensions, and just like the momentum and position of an electron, it is not possible to predict both dimensions at the same time.

Before I give an example that might clarify this philosophical idea, let me introduce the two dimensions of variational linguistics. Each dimension is a collection of many variables. The grammar dimension contains phonology, morphology, syntax, etc. (and these are again group names for even more fine-grained variables), and the context dimension contains variables that describe the social characteristics of the language user and the setting of the utterance (register, audience, topic, etc.). The variational linguist observes that both the grammar dimension and the context dimension have many configurations, and assumes that a configuration on the context dimension correlates with a configuration on the grammar dimension.

The example that I’d like to give falls back on a classic sociolinguistic finding, namely the correlation between the pronunciation of a final /r/ in New York American English and social class. If Labov were a traditional physicist, he would have quantified both the pronunciation (X) and social class (Y) and written out a formula of the form X = Y * l, with “l” the Labov constant. He did no such thing, because it is hard to quantify social class and, more importantly, because such a constant cannot be calculated due to the probabilistic properties of the correlation. Instead, sociolinguists have been writing “variable rules”. These rules are of the form “K ~ L + M + …”, with “K” a linguistic feature and L, M, etc. some explaining context features.
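A rule of the form “K ~ L + M” can be sketched as a logistic regression in R. This is only an illustration under stated assumptions: the counts below are made up (they are not Labov's data), and the column names social_class, style, r_pronounced and tokens are hypothetical.

```r
# Made-up counts: how often final /r/ was pronounced out of 100 tokens per cell.
d <- data.frame(
  social_class = factor(c("lower", "lower", "upper", "upper"),
                        levels = c("lower", "upper")),
  style        = factor(c("casual", "careful", "casual", "careful"),
                        levels = c("casual", "careful")),
  r_pronounced = c(10, 20, 40, 60),
  tokens       = c(100, 100, 100, 100)
)

# K ~ L + M: the linguistic feature explained by two context features.
fit <- glm(cbind(r_pronounced, tokens - r_pronounced) ~ social_class + style,
           family = binomial, data = d)

coef(fit)                                # effects on the log-odds scale
cbind(d, p_hat = round(fitted(fit), 2))  # estimated probability of /r/ per cell
```

The model returns a probability of /r/ for each context, never a certainty, which is exactly the probabilistic character that rules out a fixed “Labov constant”.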

The classic sociolinguistic finding resembles the Uncertainty Principle most closely if you consider that you cannot predict both dimensions, grammar and context, at the same time. Even if you know the context of an utterance perfectly, you cannot predict the pronunciation of /r/ with 100% accuracy. Vice versa, if you observe a pronunciation of /r/, variational linguistics is not able to pin down the context exactly. The best you can do is give a probabilistic estimate.

As a conclusion, it appears that the current state of variational linguistics is incomplete: there are no fundamental laws that work in all cases. The question now is: will variational linguistics ever be a complete theory? In other words, will we hypothetically be able to oversee all possible underlying dimensions and variables and make 100% correct estimations?

Language variety recognition Flanders

Recently, “De Standaard” published an article on the recognition of language varieties, based on speech. A computer in Technopolis guesses your origin when you read a text out loud. That sounds like sociolinguistics from an engineering point of view!

The engineer at hand is D. van Compernolle, a professor from my University (of Leuven/Louvain) but in a different faculty (ESAT, PSI department). In my opinion, this is a fascinating application of fundamental research in sociolinguistics. It would be great if (socio-)linguists would start to think from this application-minded point of view.

Guessing at how the technique of van Compernolle actually works, I assume that they have read the work of Labov, Milroy, Pierrehumbert, Preston, Eckert (and all the great sociolinguists out there) and must have thought: if one can observe differences between classes, regions and so on, and if these observations reach statistical significance, then it must be possible to base predictions on them! Of course, that is basically what a regression analysis does. The engineer would probably use a machine learning technique (Naive Bayes?) to find these prediction patterns, but this would give more or less the same result. So, if you read a text out loud at Technopolis, a computer filters out a number of distinctive sounds (do it yourself with PRAAT: http://www.fon.hum.uva.nl/praat/) and compares them to a Gold Standard data set.
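To make that guess concrete, here is a minimal Gaussian Naive Bayes sketch in base R. Everything in it is an assumption: the “gold standard” values are invented, the features f1 and f2 merely stand in for acoustic measurements, and classify_region is a hypothetical name. It says nothing about how the Technopolis system is really built.

```r
# Invented gold standard: two acoustic measurements per speaker plus a region label.
gold <- data.frame(
  region = c("east", "east", "east", "west", "west", "west"),
  f1     = c(300, 320, 310, 400, 420, 410),
  f2     = c(2200, 2250, 2300, 1900, 1950, 2000)
)
features <- c("f1", "f2")

classify_region <- function(x, gold, features) {
  by_region <- split(gold[features], gold$region)
  # Log posterior per region: log prior + sum of per-feature Gaussian log densities.
  log_post <- sapply(by_region, function(g) {
    log(nrow(g) / nrow(gold)) +
      sum(dnorm(x[features], mean = colMeans(g), sd = sapply(g, sd), log = TRUE))
  })
  names(which.max(log_post))
}

# A new speaker whose measurements lie close to the "east" group.
classify_region(c(f1 = 315, f2 = 2240), gold, features)
```

The “naive” part is the assumption that the features are independent within a region, which is why the log densities can simply be added.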

The article mentions that they would like to look at intonation as well. But there is so much more! If you let people just speak freely into a microphone, without a prescribed text, they would use lexical and morphosyntactic markers as well. Possibly these differences are not as “significant” as the phonological differences, but that does not make them less interesting.

To conclude, I am making a mental note to think more from an application-based point of view. It would be great if my PhD could lead to an application like the one in Technopolis. From the other direction, engineers could keep an eye out for the fundamental research in Natural Language Processing. The phonological differences between groups of people have been a topic in quantitative sociolinguistics since the 60s (and even before). The engineers are about 50 years behind…

R collapse data points

I have a corpus for my research that consists of frequency counts of many linguistic variables. I entered these counts in a table, where every row is one single text, and the columns are the variables. This gives me many thousands of rows, and as such it is very easy to find significant differences (for the same effect size, a p-value gets smaller as N grows).
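The effect of N on the p-value can be shown with a small calculation. This is a sketch with assumed numbers: a fixed mean difference of 0.1 and a standard deviation of 1 in two equally sized groups, with only the group size changing.

```r
# Assumed for illustration: same effect, same spread, growing group size n.
mean_diff <- 0.1
s <- 1
n <- c(50, 500, 5000)

# Two-sample t statistic for equal group sizes and equal variances.
t_stat <- mean_diff / (s * sqrt(2 / n))

# Two-sided p-value from the t distribution with 2n - 2 degrees of freedom.
p_value <- 2 * pt(-abs(t_stat), df = 2 * n - 2)

data.frame(n = n, t = round(t_stat, 2), p = signif(p_value, 2))
```

The difference between the groups is identical in all three rows; only the number of observations makes it “significant”.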

So, it is only fair to “collapse” some data points. In the current phase of my research, I thought it would be interesting to collapse the frequency counts of all articles by one author. I didn’t quickly find an R solution, so I wrote a function myself. It’s called “collapsr”:


collapsr = function(t, l, nc){
  # take in a data frame "t" (headers = TRUE)
  # every row is an observation
  # every column is
  #   a) a numeric count of a certain feature
  #   b) a factor with categorical information
  # the function takes the numeric features, listed in nc, (as text)
  # and collapses the information
  # for the given "l" (categorical)
  out = c() # the result, now still empty
  nms = c() # the column names, now still empty
  i = which(colnames(t) == l) # finding the index of the column in "t"
  # as.factor() also covers character columns, which have no levels of their own
  for (s in levels(as.factor(t[, i]))){ # take every level from the category
    nms = cbind(nms, s) # add the name of the level to colnames
    tt = as.data.frame(t[t[, i] == s, ]) # make a df from the rows that represent the given level
    lingo = c() # declare a variable to store the sums in
    for (num in nc){ # for each numerical feature
      ni = which(colnames(tt) == num) # find the index of the feature
      lingo = cbind(lingo, as.numeric(sum(na.omit(tt[ni])))) # sum all the values for that feature
    }
    out = cbind(out, lingo)
  }
  dim(out) = c(length(nc), length(nms)) # one column per level, one row per feature
  colnames(out) = nms
  rownames(out) = nc
  return(as.data.frame(t(out))) # transpose: one row per level
}

# example
# read in usenet dataset
#t <- read.delim("usenet_bepolitics.txt", header=TRUE)
#t.collapse <- collapsr(t, "sender", c("vlaming", "waal"))
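
As a cross-check, base R's aggregate() can do the same kind of per-group summation. The sketch below uses a made-up toy data frame instead of the usenet data set; the column names follow the example above, the values are invented.

```r
# Toy data: one row per text, with the author and two feature counts.
d <- data.frame(
  sender  = c("an", "bob", "an", "bob", "an"),
  vlaming = c(1, 0, 2, NA, 3),
  waal    = c(0, 4, 1, 1, 0)
)
nc <- c("vlaming", "waal")

# Sum every numeric feature per sender, ignoring missing values.
d.collapse <- aggregate(d[nc], by = list(sender = d$sender), FUN = sum, na.rm = TRUE)
d.collapse
```

One difference: aggregate() returns the grouping variable as an ordinary column, not as row names.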

At a later stage, I wanted to collapse data points for every author, but it was necessary to keep track of their interaction pattern. Without going into details: every author writes an “email” to another author, and using Social Network Analysis, I found two communities. Every author belongs to exactly one community, so the interaction pattern can be “1-1” (from a community 1 author to another community 1 author), “1-2” (from a community 1 author to a community 2 author), etc.

So, I had to extend my collapsr function to the collapsr.duo function:


collapsr.duo = function(t, l, ll, nc){
  # take in a data frame "t" (headers = TRUE)
  # every row is an observation
  # every column is
  #   a) a numeric count of a certain feature
  #   b) factors with categorical information
  # the function takes the numeric features, listed in nc, (as text)
  # and collapses the information
  # for the given combination "l" and "ll" (categorical)
  out = c() # the result, now still empty
  nms = c() # the column names, now still empty
  i = which(colnames(t) == l) # finding the index of the column l in "t"
  for (s in levels(as.factor(as.vector(t[,i])))){ # take every level from the category l
    tt = as.data.frame(t[t[,i] == s, ]) # make a df from the rows that represent the given level
    j = which(colnames(tt) == ll) # finding the index of the column ll
    for (ss in levels(as.factor(as.vector((tt[,j]))))){ # take every level of ll that occurs within s
      ttt = as.data.frame(tt[tt[,j] == ss, ])
      nms = cbind(nms, paste(as.character(s), as.character(ss))) # add the name of the levels l and ll to colnames
      lingo = c() # declare a variable to store the sums in
      for (num in nc){ # for each numerical feature
        ni = which(colnames(ttt) == num) # find the index of the feature
        lingo = cbind(lingo, as.numeric(sum(na.omit(ttt[ni])))) # sum all the values for that feature
      }
      out = cbind(out, lingo)
    }
  }
  dim(out) = c(length(nc), length(nms))
  colnames(out) = nms
  rownames(out) = nc
  return(as.data.frame(t(out)))
}

# example
# read in usenet dataset
#t <- read.delim("usenet_bepolitics.txt", header=TRUE, sep="\t")
#t.collapse <- collapsr.duo(t, "sender", "from_to", c("vlaming", "waal"))
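
The two-factor collapse can also be sketched with base R's aggregate() by passing two grouping variables. Again, the toy data below is made up for illustration; only the column names come from the example above.

```r
# Toy data: the author, the interaction pattern, and two feature counts.
d <- data.frame(
  sender  = c("an", "an", "an", "bob", "bob"),
  from_to = c("1-1", "1-2", "1-1", "2-1", "2-1"),
  vlaming = c(1, 2, 3, 0, 4),
  waal    = c(0, 1, 0, 2, 2)
)
nc <- c("vlaming", "waal")

# Sum every numeric feature per combination of sender and interaction pattern.
d.collapse <- aggregate(d[nc],
                        by = list(sender = d$sender, from_to = d$from_to),
                        FUN = sum, na.rm = TRUE)
d.collapse
```

Only the combinations that actually occur in the data appear in the result.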

More information and some sample datasets are available at http://perswww.kuleuven.be/~u0055858/resources/Rcode and http://perswww.kuleuven.be/~u0055858/resources/Rdatasets

Dear Miles & More and SN Brussels Airlines

Open letter to Miles & More and Brussels Airlines

Dear Miles & More and SN Brussels Airlines,

I am a frequent flyer. Every Monday I fly from Berlin to Brussels, and on Wednesday I return home. Allowing for holidays and exceptions, I come to almost 100 flights a year.

Loyally, I always choose SN Brussels Airlines. They usually guarantee punctual flights and a comfortable journey among fellow commuters. Because I pay for my flights myself, I choose the b.light fare, with which I can fly there and back for about 100 euros. That is very affordable and a fitting price for the service offered.

It is therefore no wonder that I joined the frequent flyer programme. First I was a member of Privilege, and recently I became a member of Miles & More. That switch has to do with the merger of SN and Lufthansa, and the admission of SN to the Star Alliance group. As a result, the frequent flyer programme changed as well.

That my status miles from SN suddenly became award miles at M&M, I can still understand. But that I now received only 125 miles per flight instead of 250, I liked a lot less. On top of that, M&M also raised the number of miles required to obtain a Frequent Flyer status.

A quick calculation taught me that, at my pace of flying back and forth, I would never be able to reach a status. With the meagre 125 miles per flight it also takes an eternity before I can afford a “free” flight. In other words, my loyalty is effectively never rewarded.

The cause lies in the misleading name “frequent flyer”. After all, it is not about the frequency with which you fly, but about the amount of money you bring in. So it is not a “frequent flyer” programme, but a “much money” programme.

In my dictionary, loyalty is measured by how frequently you use a particular service provider. In your book, however, loyalty is measured in cents. That is not fair.

Vocal PA system from Sonora Leuven

For my sister’s wedding party I played a few songs with friends. “Real” music is always just that little bit nicer than a DJ spinning records. Our line-up was very small, with only a piano, a rhythm guitar and three female singers. Despite that limited group, we still managed to make beautiful music.

For an audience of about 100 people you do need a little amplification. I am not a complete klutz with sound equipment and knew that a small vocal PA system should be enough. After contacting a few PA rental companies in and around Leuven (which generally just did not respond or had a non-working email address; you should know that I live in Berlin, where an email is somewhat cheaper than a phone call), I ended up at Sonora Leuven (http://www.sonora.name). They are not that easy to find via Google, but in the end they turned out to be exactly what I wanted.

Smooth email communication, sound expertise with recommendations and suggestions (which did not feel at all like “take this as well, so I earn money on it”, but rather like “I speak from experience”) that certainly proved their worth later, and very punctual and correct handling of everything. Did I mention that they have a convenient pavement and garage door, which make picking up the equipment really easy-peasy?

Anyway, I am very satisfied.

Corporate photography

I am taking pictures of my fellow PhD students – and one day of my supervisors. The goal is to make them look like semi-relaxed, “I am working and I do not know you’re there” pictures, but still with a professional air about them. And that is way more difficult than I ever thought.

Usually, my pictures from birthday parties and so on are completely spontaneous, and everybody is in a sense not really worried about the pictures. This time it is different. Everybody knows that these pictures will end up on the website, and moreover, it is not spontaneous at all!

So, I am confronted with the need to “motivate” people for the pictures, make them feel at ease, and press the shutter at exactly the right moment. Very difficult!

Michael O’Leary is a parasite

After reading an interview with Michael O’Leary, the so-called brain behind Ryanair, I am convinced that people should never ever fly with Ryanair again. This man is an absolute parasite who started a company solely to make profit and money, without the slightest love for his product and his clients. Every passenger on his airplanes is merely a bag of money that “wants to go from point A to point B”.

Although there is some truth in this statement – and for many tourists, youngsters and one-time flyers it is the full truth – I do not accept it as such. Yes, I’d like to transport myself from A to B, but I am not prepared to be dehumanized for the length of the trip. And that is exactly what Michael O’Leary wants: from the time you arrive at your departure airport until you leave the front door of your arrival airport (which is not as close to your actual destination as you’d expect), you are a bag of money, tossed around without respect.

Perhaps one-time flyers are prepared to sell their soul for a slightly cheaper fare, but frequent flyers will never accept Ryanair as their means of transportation. As a 50-flights-per-year person (which is still not so much, actually!) I really enjoy staying faithful to my carrier Brussels Airlines, the usually fair prices, the convenient departure and arrival times and the fact that I do not have the impression that they see me as a bag of money. In sum, Ryanair might be interesting for tourists, but a large segment of the market will steer clear of it.

The actual point that I’d like to make, however, is not about the quality differences between Ryanair and other, decent carriers. Instead, I’d like to accuse Michael O’Leary of not having the slightest ethical sense. A rough quote from the interview: “I do not invest in long-distance flights yet, because that market is still good. Only when it starts going bad in that market, I’ll start something.” I shiver when I read this. How can you be so crude?

Perhaps more importantly, why do people choose money over quality? Quality is what makes your day, not money in your pocket.