Göteborg Spoken Language Corpus (GSLC)

GSLC is an incrementally growing corpus of spoken language from different social activities. Based on the fact that spoken language varies considerably in different social activities with regard to pronunciation, vocabulary and grammar, the goal of the corpus is to include spoken language from as many social activities as possible.

The Transcription Standard

The transcription standard used in GSLC consists of two parts. Göteborg Transcription Standard (GTS) is the language independent part, dealing with the format for utterances, overlaps, speakers, comments, etc. Modified Standard Orthography, version 6 (MSO6) is the standard for how to write spoken Swedish.

In MSO6, standard orthography is used unless there are several spoken language pronunciations of a word (see Allwood (1998), "Some Frequency based Differences between Spoken and Written Swedish", for a detailed discussion). When there are several variants, these are kept apart graphically. According to this principle, the Swedish word "jag" (I), which is mostly pronounced "ja" but occasionally "jag", is written in both these ways, depending on which form is actually used. Which variants can be distinguished is, however, to some extent arbitrary and has therefore in some cases been decided by stipulation. Thus, we have not, in general, distinguished words on the basis of vowel length.

For an example of a transcription with a short explanation see example.

Through this practice, words that are pronounced the same way but kept apart in standard orthography will sometimes coincide. This happens, for example, to "jag" (I) pronounced as "ja" and "ja" (yes). When this happens, the words are disambiguated by brackets or numerical indexes; in this case, the forms are "ja{g}" (jag) and "ja" (yes). If the spoken form is produced by just removing letters from the standard form, brackets are used to indicate the corresponding standard form. If the spoken forms cannot be disambiguated by brackets, numerical indexes are used. For example, the spoken form "å" can mean "och" ("and") or "att" ("to", the infinitive marker), so the transcribed form is "å0" for "och" and "å1" for "att". Thus, MSO maintains the same degree of disambiguation as standard written orthography but adds the disambiguations that spoken language itself makes, e.g. for Swedish standard orthography "att" (that, to), which can be pronounced as "å" ("to", the infinitive marker) or "att" ("that", the conjunction). However, no attempt is made to separate homonyms that are distinguished in neither written nor spoken language. This means that one cannot tell from a word form like "springa" (run, chink) whether it is a verb or a noun.
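The bracket and index conventions described above can be illustrated with a small Python sketch. It is not part of the GSLC tools. The function names and the lookup table are hypothetical, the table contains only the two indexed forms mentioned in the text, and the sketch assumes that a trailing digit on a token is always a disambiguation index.

```python
import re

# Hypothetical lookup table for indexed forms; it holds only the two
# entries named in the text, not the full MSO6 inventory.
INDEXED_FORMS = {"å0": "och", "å1": "att"}


def mso_to_standard(token: str) -> str:
    """Map one MSO token to its standard orthographic form."""
    if token in INDEXED_FORMS:
        return INDEXED_FORMS[token]
    # Brackets mark letters omitted in speech: "ja{g}" -> "jag".
    return re.sub(r"\{([^{}]*)\}", r"\1", token)


def mso_to_spoken(token: str) -> str:
    """Recover the spoken form by dropping bracketed letters and indexes."""
    without_brackets = re.sub(r"\{[^{}]*\}", "", token)
    # Assumption: trailing digits are always disambiguation indexes.
    return re.sub(r"\d+$", "", without_brackets)


if __name__ == "__main__":
    for tok in ["ja{g}", "ja", "å0", "å1", "springa"]:
        print(tok, mso_to_standard(tok), mso_to_spoken(tok))
```

Under these assumptions, "ja{g}" and "ja" share the spoken form "ja" but map to different standard forms, while "springa" passes through unchanged because the transcription carries no information about whether it is a verb or a noun.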

Analyses

Regarding analysis of the corpus, we have produced a first book of frequencies of Swedish spoken language. The book contains word frequencies both for the words in MSO format and in standard format. It also contains comparisons between word frequencies in spoken and written language. These lists are given in alphabetical and frequency order. There are lists of frequencies for collocations in MSO, standard orthography and written language. Connected with the word frequencies, there are lists of words which are unique to, or very much more common in, spoken language in MSO, spoken language rendered in standard orthography, or written language. Finally, there are statistics on the parts of speech represented in the corpus, based on an automatic probabilistic tagging that yields 96% correct classification.
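As a rough illustration of the kind of lists described here, the following Python sketch counts word frequencies and two-word collocations over a toy token sequence. The token list is invented for the example and is not corpus data, collocations are simplified to adjacent word pairs, and the alphabetical ordering uses plain code point order rather than Swedish collation.

```python
from collections import Counter

# Invented toy sequence of MSO tokens (not taken from GSLC).
tokens = ["ja{g}", "springa", "å0", "ja{g}", "springa", "ja"]

# Word frequencies for the MSO forms.
word_freq = Counter(tokens)

# Simplified collocations: frequencies of adjacent word pairs.
bigram_freq = Counter(zip(tokens, tokens[1:]))

# The same word list in frequency order and in alphabetical order.
# sorted() compares code points, which only approximates Swedish collation.
by_frequency = word_freq.most_common()
by_alphabet = sorted(word_freq.items())

if __name__ == "__main__":
    print(by_frequency)
    print(by_alphabet)
    print(bigram_freq.most_common(1))
```

The same counting could be run a second time on tokens converted to standard orthography to obtain the parallel standard-format lists.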

Further, there has been work on the corpus using various kinds of manual coding for communication management (including hesitations, changes, feedback and turn-taking), speech acts, obligations, maximal grammatical units, etc. For this work, sample transcriptions with coding and the coding manuals are available.