I just discovered this cool blog post on OpenCog. It is a very interesting blog by someone who has both the time and the talent for nice experiments (and he seems to have the needed hardware as well…). Although his work is not directly in my line of research, which is quantitative sociolinguistics, it touches on my fondness for computational linguistics, statistics and lexical semantics via NLP.
Pushing the limits of sociolinguistics May 15, 2009
This essay is merely a wandering of my mind and contains thought experiments that I would like to debate with my colleagues.
Studying actual language use shows immediately that there is a huge amount of variation to be accounted for. Differences between two distinct languages (inter-lingual variation) are easy to spot. You could try to compare Dutch and German, and make an intuitive list of similarities and differences. Sometimes variation within one language (intra-lingual variation) is just as clear. Intuitively, laymen can write down what the difference is between Belgian Dutch and Netherlandic Dutch, or German German and Austrian German, or British English and Irish English, etc.
Sociolinguistics
Sociolinguistics looks for a connection between social parameters and intra-lingual variation. A very simple research question could be about the difference between male and female speakers of Irish English. Traditionally, sociolinguists look for differences along the following sociological lines: region, socio-economic situation (social class), ethnicity, sex (gender), age, education, religion, power differences between two speakers, social distance between speakers, etc. They hope to find a correlation with a certain linguistic feature. These linguistic features can be of all kinds: syntax (e.g. word order), morphology (e.g. adjective declension), phonology (e.g. h-deletion), pragmatics (e.g. requests), etc.
There are two ways of looking at the situation: either you start from a certain sociological profile and look at how a linguistic feature behaves, or you start from a varying linguistic feature and see whether there is a sociological profile that correlates with the internal linguistic variation. A lot of interesting research has been performed, with big names such as Labov, Chambers, Eckert, etc. Traditionally, the sociolinguists (sometimes called variationists) are opposed to generative linguists. The latter do research on an abstract concept, “competence”, that might be the perfect language. They are not interested in variation, but want to describe the language as “it is in all our heads”. The variationists, on the other hand, are more interested in language as “it comes out of our mouth”, thus incorporating “errors” and “mistakes”. Moreover, this variation is not random but claimed to be sociologically motivated.
Looking at the large body of work that sociolinguists have already produced, you cannot but notice that the research only looks at one feature at a time, or sometimes combines just a couple of features. As an example, male, working-class people from village A are compared to male, working-class people from village B, and the difference between them is measured by their use of the word “like”. It would of course be very ambitious (but oh so complete and beautiful) if research could compare all possible sociological fingerprints, based on all possible linguistic features. This task is in itself not feasible because of limits to data collection, time constraints, sampling, ethics etc.
Not everything, but…
Starting from this ideal situation, one could say that each possible sociological fingerprint matches a (probably) unique “vector” of linguistic features. The granularity of these sociological fingerprints will be so high that probably only one or two actual language users fit each fingerprint. One could conclude that we could skip the step of detailed sociological description, and just observe individual people.
Based on their linguistic feature vectors, one could use similarity measures, as used in neo-structuralist distributional lexical semantics, to determine which language users “speak” alike. Only at that point does sociology come in, looking for sociological structure within the group of “speak alike” language users.
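To make the idea of comparing speakers by their feature vectors a bit more concrete, here is a minimal sketch in Python. It uses cosine similarity, which is only one of the measures common in distributional semantics and is my choice for the illustration. The speaker names and the three relative frequencies per speaker are invented; nothing here comes from real data.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equally long feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical speakers; each position holds an invented relative frequency
# for one linguistic feature (say h-deletion, a word-order variant, "like").
speakers = {
    "speaker_a": [0.80, 0.10, 0.30],
    "speaker_b": [0.75, 0.15, 0.35],
    "speaker_c": [0.10, 0.90, 0.60],
}

def most_similar(name, vectors):
    """Return (other_speaker, similarity) for the speaker closest to `name`."""
    target = vectors[name]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != name
    ]
    return max(scored, key=lambda pair: pair[1])

print(most_similar("speaker_a", speakers))
```

With these invented numbers, speaker_a ends up closest to speaker_b. In a real study the grouping would come from a clustering step over all pairwise similarities, and only then would one look for sociological structure inside each group.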
However, a big flaw in the reasoning appears here. The leap from a sociological fingerprint to an actual language user is problematic, as language users tend to “change” their language depending on situational micro-sociological parameters, such as power, social distance, etc. Moreover, one cannot expect the same behavioural tendencies for all language users alike; I might speak more regionally when I meet somebody from my own region, but somebody else might actually try to speak more “standard” when he or she meets someone from my region.
This calls for a three-dimensional matrix representation. The cells contain the observed values; we assume that the sociological fingerprints of the speakers (who will be “described” by their linguistic features) are kept in a separate database. The left-to-right axis holds the linguistic features; the top-to-bottom axis holds the different speakers that we observe in the corpus; the front-to-back axis holds the conversational partners. In other words, for every conversational partner, we make a separate vector of linguistic features for the speaker.
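As a rough sketch of what such a three-dimensional matrix could look like in practice, the Python snippet below stores one value per (speaker, partner, feature) cell using plain nested lists. The axis labels, the helper names `set_value` and `feature_vector`, and the numbers are all hypothetical and only serve the illustration.

```python
# Hypothetical axis labels; a real corpus would have many more of each.
SPEAKERS = ["speaker_a", "speaker_b"]
PARTNERS = ["same_region", "other_region"]
FEATURES = ["h_deletion", "quotative_like"]

# cube[speaker][partner][feature], every cell starting at 0.0
cube = [[[0.0 for _ in FEATURES] for _ in PARTNERS] for _ in SPEAKERS]

def set_value(speaker, partner, feature, value):
    """Store the observed value of one feature for one speaker-partner pair."""
    s = SPEAKERS.index(speaker)
    p = PARTNERS.index(partner)
    f = FEATURES.index(feature)
    cube[s][p][f] = value

def feature_vector(speaker, partner):
    """Return a copy of the speaker's feature vector towards one partner."""
    return list(cube[SPEAKERS.index(speaker)][PARTNERS.index(partner)])

# Invented values: speaker_a drops the h more often with someone from home.
set_value("speaker_a", "same_region", "h_deletion", 0.7)
set_value("speaker_a", "other_region", "h_deletion", 0.2)

print(feature_vector("speaker_a", "same_region"))
print(feature_vector("speaker_a", "other_region"))
```

The same speaker now has a different vector for each conversational partner, which is exactly what the flat one-vector-per-speaker picture could not express. A further dimension, such as age, would simply add one more level of nesting.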
The Sky is the Limit: caveat
A representation in a three-dimensional matrix might seem the end of what we can conceptualize, but it is not the limit. Consider a four-dimensional matrix, with the three dimensions already mentioned, plus a dimension of age. Suddenly, the study is no longer “in apparent time” and synchronic, but opens up to a diachronic survey.
This of course all seems wonderful, but there is an important caveat, and it is called time. Collecting this amount of data would take years, and the transcription of the recorded samples alone would take decades. At that point, we would only have raw data, with no coding for linguistic features or sociological fingerprints whatsoever.
There would be two ways of dealing with this issue. On the one hand, the coding could be done automatically. Computational linguists might be able to help by implementing programs that fill in coding schemes automatically. On the other hand, the principle of “many hands make light work” can be applied. If there were some sort of generally accepted coding schema, a standard way of POS tagging, syntactic parsing and transcription of audio material, it would be possible to combine all the research that is done in the field into one very large dataset that can be studied.
Will this remain a dream?