Language corpora and ethnography

I am due to give a paper entitled ‘sign language corpus creation as digital humanities ethnography’ at the 4th international corpus linguistics conference, to be held in Birmingham (UK) from 27 to 30 July 2007.

With this paper I aim to suggest that if the design of a corpus (in this case of sign language; see for example the ECHO sign language datasets) extends its research value well beyond linguistics as a discipline (to include, for example, education, sociology, psychology and history), then the methodology for creating such a corpus can usefully be based on ethnographic methods, in particular ethnographic collaboration with language users, in this case with deaf people themselves. At this point I am unsure to what extent this might also include virtual ethnographic approaches. Any ideas?

Secondly, it is clear that current language corpora are less dynamic (in terms of interaction between users, producers and commentators of language within the datasets) than they could be. For example, wikis might be considered corpora of lexical meanings that derive from broad participation in the construction of lemmata and the semantic description of lexis, although they lack data on word frequency and distribution. The second question I am pondering for my paper is therefore: what role might virtual ethnographers play in working towards ‘next-generation’ language corpora, that is, more dynamic datasets based on broad participation? Again, your comments are welcome.


  1. Ernst,
    A wonderful domain to explore! There are many possibilities; here are a couple that come to mind…

    I’m taking ‘broad participation in the construction of lemmata and the semantic description of lexis’ to evoke a kind of public participation and a low-threshold kind of (web?) interface. If so, as an ethnographer, you could follow the variety of ways in which such practices are meaningful to participants, and suggest ways of enhancing and supporting these. (Perhaps the work with Simakova, on public databases, would be useful?)

    Another possibility, which I think was evoked in your poster at the Virtual Ethno Workshop in the fall, would be to use ethnographic approaches to make explicit the richness and context of particular aspects of the corpus. The ethnographic practice would then be that of pointing to the messiness and excess of corpus material (again, by tracing the variety of meanings the corpus carries). A more STS approach would then move to showing the disciplining and normalising of corpus contents, in order to ensure particular kinds of scientific legitimacy and circulation.

  2. Excellent suggestions. I did of course find that Wiktionary already contains a section on word frequency lists, which includes summaries taken from text corpora, but the frequency reports seem to lack any particular insight: they are like mountains that are climbed because they are there.

