What have corpora ever done for us?

Following a conversation over on the Facebook page I use for talking about teaching and language, I’ve decided to post a talk I did at IATEFL many moons ago. I do remember, with a faint smile, that Dave Willis himself came along to watch this one, but at some point became overcome with either rage or tedium and flounced out, thus allowing me to make the cheap jibe about Elvis having left the building before carrying on. Were this post to generate even a tenth of that heady level of excitement, I’d be delighted!

Written maybe ten years ago, at the height of the corpora promo boom, it was intended as a partially tongue-in-cheek critical overview of corpora linguistics. And yes, for those of you that were wondering, the title WAS inspired by this rather splendid Monty Python sketch:

With that in place, here goes nothing . . .

The use of computers to store and help analyse language has obviously revolutionised many aspects of language teaching, and corpora linguists have become an ever-present feature at IATEFL and other similar conferences. Obviously, much good has come from this. We have had a whole new generation of much-improved dictionaries, all of which contain better information about usage, collocation and frequency; superb new reference books such as the Longman Grammar of Spoken and Written English have been made possible; and, perhaps inadvertently, corpora linguistics helped to launch the Lexical Approach and thus to move language at least some way back towards the centre of language teaching. Nevertheless, it seems to me that despite all these advances, corpora linguistics has also had several negative side-effects on the way teachers perceive their roles, and that corpora have actually enslaved us in ways which are not entirely healthy. I would like to move on to consider the ways in which I feel this has occurred.

The fallacy of frequency

Corpora linguists repeatedly promote their products with often highly-detailed reference to frequency counts, and the idea that frequency is central has become a common one. However, should a Pre-Intermediate learner wish to be passed the salt over dinner, simply knowing the infrequent item ‘Salt’ will facilitate this in a way that knowing the far more frequent ‘Could’, ‘you’, ‘pass’, ‘the’ and ‘please’ would not. Generally, it’s not the most common words which carry core meanings; rather, it’s the far rarer items that do. Simply knowing the 800 most common words in the language makes you only able to say a lot about not very much. In the same way, failure to learn words which may well be low-frequency generally, but which are possibly much higher frequency within specific types of conversations, condemns you to not being able to say very much about a lot!! Frequency tells us nothing more than what is frequent. It cannot tell us what’s useful, what’s necessary or even what’s teachable.

There are deeper problems here to do with the way in which frequency is actually calculated. Corpora remain word-obsessed and the process of lemmatisation compounds this. Hence, an idiom like ‘You’re a dark horse’ is entered not as a two-word idiom, but rather as one example of ‘dark’ and another of ‘horse’, thus defaulting on two fronts. Similarly, plural nouns are currently counted as other examples of singular ones, which is a rather major oversight. Is, for instance, the singular of ‘Many Happy Returns’ ‘A Happy Return’? ‘Meetings’ is not simply the plural of ‘meeting’, and it collocates with different words. Finally, knowing that, say, ‘get’ is a very common word does little to help teachers know whether ‘get on with it’ is more frequent than ‘Let’s get down to business’. Sadly, until corpora start sorting by chunk they will remain of limited relevance.
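To make the difference between counting words and counting chunks concrete, here is a minimal Python sketch. It is not a description of how any published corpus tool works: the sample sentences and the helper names `tokenize` and `ngram_counts` are invented purely for illustration, and a real corpus would need far more careful tokenising.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens.

    Note: this naive version lets chunks run across sentence boundaries.
    """
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


# A tiny invented sample, standing in for a corpus.
sample = (
    "You're a dark horse. It was a dark night. "
    "Come on, get on with it. Let's get down to business. "
    "Just get on with it, will you?"
)

tokens = tokenize(sample)
words = ngram_counts(tokens, 1)    # single-word frequency
bigrams = ngram_counts(tokens, 2)  # two-word chunks
chunks4 = ngram_counts(tokens, 4)  # four-word chunks
chunks5 = ngram_counts(tokens, 5)  # five-word chunks

print(words["get"], words["dark"])
print(bigrams["dark horse"])
print(chunks4["get on with it"], chunks5["let's get down to business"])
```

In this toy sample the single-word count tells us only that ‘get’ and ‘dark’ turn up several times; it is the chunk counts that separate ‘get on with it’ from ‘Let’s get down to business’, and ‘dark horse’ from any other use of ‘dark’.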

The fallibility of human endeavour

That corpora need to be approached cautiously and with one’s intuition fully tuned is made apparent by a cursory glance at the word ‘thaw’ on several published CDs. Should one access the word, wishing to know whether snow melts or thaws, one would be surprised to learn that a far more frequent example of the word, and thus – if we follow the logic of corpora linguists – a more useful collocate for our students is actually John, as in John Thaw, the late, great British actor.

Similarly, I once saw a Jane Willis talk wherein she suggested that one of the most common three-word lexical items in the English language was ‘Princess of Wales’. It was only when pushed during questioning that she actually admitted that the corpus she had taken this data from was based almost exclusively on a couple of radio phone-in programmes. In the same way, the actual construction of corpora-based materials – dictionaries and the like – also inevitably involves a degree of hammering out by researchers, often by means of a vote or a fudge. Corpora are by necessity human constructs based on limited samples of data, are easily skewed by input and thus are best viewed sceptically.

The limitations of what corpora can offer

While spoken language, conversation, may well form the basis – even the majority – of many corpora, what corpora can’t show us is what typical conversations look like. It’s not possible, for instance, to access ten typical conversations had by people talking about what they did last night or to look at the 20 most common ways of answering the question “So what do you do for a living, then?”. As such, if we want to present our students with models of the kinds of conversations they themselves might actually want to have, we are forced to fall back on our (actually ample) experience of such conversations in order to script them. However, I would argue that it is precisely because we have got such broad experience of such conversations that we do tend to know how they work and sound and look.

For teaching purposes, we need to be able to script conversations that aren’t so culturally and spatially bound as to exclude students; we need to ensure the conversations students are exposed to still somehow facilitate intra-class bonding. Input needs to be proto-typical and to include items which are easy for us to systematise and for learners to appropriate and assimilate. Corpora cannot do this for us.

Corpora and the non-native speaker teacher

It is often claimed – mainly by those who are employed to make, package and sell corpora – that corpora are an invaluable aid for the non-native speaker teacher. I would personally argue that the opposite is far too often true and that as they stand, corpora massively favour native speakers.

One understandable reaction many teachers, both native and non-native, have to the notion that they should teach more spoken English is the “But I’d never say this or that bit of language” response when faced with a spoken text. Ironically, written texts never elicit a similar “But I’d never write that myself” response, and there are several reasons for this, I feel. There is possibly an assumption that writing is a more creative realm where anything goes; there’s also the fact that the grammar and the lexis of the written language have already been codified and disseminated and are thus more familiar to teachers; thirdly, I think, there’s the fact that we pin our identities on our speech – our idiolect, our regional, class-based, age-oriented, in-group, gender-based grasp of lexis and grammar – far more profoundly than we do on what we write. We are so aware of differences in the way we speak that we usually fail to notice the massive similarities. A good example of this is the fact that every EFL book which focuses on the UK / US divide fails to note that the vast majority of the language used in both countries is remarkably similar, and instead frets over the present perfect, sidewalks versus pavements and the correct pronunciation of aluminium. Yet for every “It came out of the blue” / “It came out of left field” divergence, there must surely be ten other idioms we all have in common.

Given this, I personally feel it doesn’t take much to persuade non-native speaker teachers to stick to the already familiar, tried-and-tested formula of written texts and comprehension questions and structural grammar. By spending so much time pointing out relatively obscure quirks and neologisms, such as the fact that ‘like’ is being increasingly used to report speech (as in “He was like ‘Hi’ so I was like ‘Bye’”), corpora linguists are inadvertently making spoken English more of a foreign language for non-native speaker teachers than is perhaps wise for people who claim to believe – as I do – that spoken English should become much more a part of General English than is currently the case. Too relentless a focus on the new, the odd, the interesting, the different obscures the wealth of English that unites us all.

I also feel that it is not only many non-native speaker teachers who would never use ‘like’ in this way, but also many native speakers. The vast majority of language teachers do NOT need corpora to tell us that this is a relatively unuseful piece of lexis, so long as it remains relatively unused. Indeed, my own rule of thumb would be that if YOU don’t say it, don’t TEACH it. English as a foreign language is NOT English as the corpora know it. If you believe, as I do, that the kind of model conversations coursebooks provide for teaching purposes should be better modelled on the information provided by corpora than is currently the case, then I find it hard to see how you couldn’t also support the idea that corpora specialists should concentrate more on insights which will be of direct use to coursebook writers and teachers alike. Indeed, given the problematic status of spoken language within the classroom at present, I’d go so far as to assert that doing anything less serves to sabotage attempts to spread a methodology based on spoken language (and here, of course, I’m compelled to acknowledge my own interest in this area as a coursebook writer).

I find it particularly interesting to note that the constructors of corpora – or at least their backers – seem as yet very reluctant to work on a corpus of English as used by non-native speakers. Obviously, this would be in essence the same corpus, but with much left out. This is precisely the point: that which is left out by competent non-native speakers has no real place in most – and especially most pre-Advanced – teaching materials.

Animal Farm (or Beware of the oppressive tendencies of those who come claiming to liberate us!!)

It would be churlish to deny that corpora have provided us with some useful insights into such features of language as the fact that ‘would’ is three times more common when talking about past habits than ‘used to’ is, but at the same time it must also be added that the way in which corpora have been presented has all too often intimidated us into pretending that we didn’t already know much – if not most – of what they confirm. For example, Mike McCarthy, at IATEFL Brighton 2001, spent half an hour blinding us with the statistics that showed – entirely unsurprisingly – that ‘take the mickey’ is far more common than ‘mickey-taker’ or ‘mickey-taking’. Surely any fluent speaker of the language could have guessed this (dubiously relevant) fact themselves, based on their own intuitions about the language.

The relentless emphasis on the finality of corporal truth not only denies the reality of the classroom practitioner who has to get in there each and every day and try to give their students information about the language being studied, but also refuses to acknowledge the fact that we all have heard and read millions and millions more words than any corpus will ever hold and thus have good hunches about words as a result. Sure, hunches about language can be wrong, but more often than not, they aren’t. I personally really resent the notion that corpora are useful not only for showing us the errors of our ways, but also for confirming when we’re right. The implication is that we are not right UNTIL we’ve checked! This way lies madness – and the deskilling of us all!!


Obviously, it is important that teachers do keep themselves up-to-date with corpora findings and adapt their understanding of the way language works accordingly. Here I totally agree with Ron Carter that one thing corpora have helped us become more aware of is the fact that grammar is much broader than sentence-based / tense-based grammar would seem to suggest. Words have their own micro-grammar and so lexis needs to continuously be grammaticalised in typical ways. Nevertheless, it is also vital that teachers are encouraged to believe that they can tap into and trust their own inner corpora.

If Carter and McCarthy can proclaim that the more students are encouraged and trained to notice, the more they actually will notice, then the same must surely be true for us as teachers. Indeed, the true sign of corpora-work well done is its own eventual redundancy. This really brings me to my final point – one of the great ironies of corpora is that they have actually unwittingly made teachers more intuitive, not less. What corpora have done is to place language back at the centre of classrooms and, as such, we all now have to think much more about how we actually use language.

To a degree, corpora and teachers exist in a parent-child relationship, and many teachers are now ready to leave home. Thanks Mum and Dad – you’ve done a great job, we may be back to visit every now and then, but we’ve basically already got the message!

However, lest we forget, corpora are bankrolled by major publishing houses and have endless spin-off publications derived from them in an effort to recoup much of this investment. As such, maybe I’m expecting too much by asking those in receipt of the publisher’s pound to loosen the reins on much of their power and place it back where it rightly belongs – back in the hands of the humble classroom practitioners!!!


20 responses

  1. get on with is nearly 1.6 times as frequent as get down to http://corpus.byu.edu/coca/?c=coca&q=20787193

    thaw as in John Thaw (first collocation to the left) does top the British corpus but not the American one http://corpus.byu.edu/coca/?c=coca&q=20787648&q1=20787673 though more interesting results if you plug thaw here http://www.wordandphrase.info

    and was wondering why you put up the frequency strawman argument, why no mention of concordance lines?

    maybe a better title would have been – Corpus, it’s not the saviour it’s just a very useful toy 🙂


    1. Thanks for this Mura.

      It doesn’t surprise me at all that GET ON WITH is more frequent than GET DOWN TO. In fact, I’d have expected it to be more than 1.6 times as frequent, given the common currency of HOW DO YOU GET ON WITH THEM, I DON’T REALLY GET ON VERY WELL WITH MY DAD and so on, whereas GET DOWN TO isn’t that widely used outside of GETTING DOWN TO BUSINESS. This doesn’t disprove my comment above, though, as what I was getting at is the fact that even when we can search for phrasal verbs like this, we still can’t discover, for instance, whether ‘get on with it’ is more frequent than ‘Let’s get down to business’.

      I put up the frequency argument because I think it’s important to think about frequency!!
      It IS obviously important and, as a writer, it is one of the things I base my selection of language to teach on, particularly at the lowest levels. However, as someone smarter than me once said, frequency doesn’t tell you what’s useful; it just tells you what’s frequent, and as I mentioned above, words that are low frequency may either be frequent within particular kinds of exchanges, or else may be vital in order to communicate a message on occasion.

      As for concordance lines, what do you think needed saying about them? The only thing I could think of worth mentioning about them is that, as raw data, they are often obscure, full of lexis most learners wouldn’t know and hardly proto-typical!!

      Liked the alternative Python-inspired title, by the way!

      1. hi again

        the links to COCA weren’t meant to disprove anything you said but to show how accessible and quick corpora is nowadays, especially COCA.

        concordance lines suffer from the limits you mention but they also provide ready-made multiple examples of words in context. for someone like yourself who has written books, coming up with relevant examples in class may not be an issue. however, for the rest of us, having a databank of example sentences to deploy in class is great.

        admittedly data driven learning has yet to prove itself with learners but how about teachers? i think it is a very effective way for teachers to build up their awareness.


      2. Not convinced that the COCA stuff gives you anything you can’t now get from just Googling with quotation marks around chunks, to be honest, and it certainly doesn’t give examples worth taking into class. As I said, what I DO think corpora HAVE done is help inform dictionaries and the like, so we have better reference sources, and the ‘cooked’ data that goes into them is generally far more useful for teachers and students alike than the ‘raw’ corpora data or examples.

        That said, as a writer and as a teacher, I do still occasionally use Google as a kind of corpus just to check hunches or check common collocations or to see how chunks are actually used or simply for ideas on how best to ‘cook’ things. Anything that helps teachers become more aware of language in action is obviously a good thing. Just not the be-all-and-end-all that I felt corpora were being pushed as a few years back!

  2. I enjoyed this post first time around and tried to find my original reply to this but couldn’t so here are three things they have done for us:
    1. Provided us with excellent learner dictionaries, packed full of real examples and useful information about frequency, meaning and usage.
    2. Provided learners with an initial list of high frequency lexis on which they can focus at the early stages of learning. Without this, many learners (and teachers and textbooks) may spend a lot of time learning/teaching obscure items which are of very little use.
    3. Provided teachers and researchers with real evidence about how language is used beyond sentence level. This is something which should at least inform what is taught in classrooms and must be better than guesswork. It is surely worth knowing that modal verbs are not the only or indeed always the most frequent way to express modality, for example.

    I agree that there are not enough open-access corpora but this is changing, thankfully.


    1. Hi Chris –
      Thanks for taking the time to post a response.

      Absolutely agree on the dictionary front, though I’d suggest that the examples that work best are actually NOT ‘real examples’, but ones tweaked so as to be of maximum utility. That’s why, say, the Macmillan Advanced Learners will always trump the old Cobuild dictionary, which prided itself on authentic examples, and thus rendered itself semi-unintelligible to all but the best learners.

      Also agree that it is good to know about frequency, so long as we’re aware of the limitations of the concept, the limitations in the way frequency is calculated and so on. We’re currently in the middle of a Starter level book and are trying to ensure that we only explicitly teach two or three star words, so I’m obviously not averse to the concept. I also think anything that can help teachers focus more on lexis with a higher surrender value and maybe spend less time on relatively useless items can only be good, so no arguments there either.

      As for the third point you make, well I kind of agree, but I’m just not convinced that most teachers / learners DO actually glean such insights from corpora. I for one was aware as soon as I’d read The Lexical Approach that there were many lexical ways of doing what can also be done with grammar, such as expressing modalities. I’m not convinced that much we didn’t intuitively already kind of sense has really been uncovered by corpora, and at the time I initially wrote this talk I was concerned that the hair-splitting minutiae that corpora linguists obsessed over at conference talks were actually doing their good work more harm than good. If that’s changed since then, then good.

  3. I know what you mean about The Lexical Approach but my issue with what Lewis wrote, much as I like it, was that he seemed so averse to using corpus data. Teachers and materials designers need to have an idea about what chunks etc. they should focus on beyond his rather vague ideas about teaching the most useful ones, and frequency seems a good starting point to me. And it may seem obvious that we can express modality with lexis and lexico-grammar but I don’t see this a lot in textbooks because I suspect many writers (not you!) simply adopt the same old structural syllabus and don’t look at a corpus. If you take something as common as tense and aspect, I can’t recall a textbook which tells learners something as simple as the fact that Past Simple is much more frequent than Past Perfect. Instead, everything is treated as equally important and learners then overuse Past Perfect when they encounter it.

    1. Well, hard to argue with much of this Chris.
      I’m in total agreement with all of this.

      It’s amazing how little of this stuff has trickled down to coursebooks yet.

      Much of it though is due to the old vicious circle of teacher expectation (or at least publishers’ notions of teachers’ expectations) and the publishing industry’s innate conservatism.

      In a perfect world, you’d not really even need to tag stuff as grammar or vocabulary, but there’d be space for the vast intermediary zone. With INNOVATIONS, we tried to cover as much of this kind of lexico-grammar as we could and more often than not, it simply confounded a broader audience who don’t see things as ‘grammar’ unless it’s first appeared in Headway or Murphy’s! Sad, but true.

    2. This is actually a misrepresentation of Lewis – he was absolutely in favour of corpora and fond of quoting Sinclair’s ‘why would a botanist study plastic flowers’ (or something to that effect) and also the frequency of the past perfect continuous. I agree with you that frequency is worth looking at and looking at concordances can be helpful – especially when dealing with different genres such as Academic English.

      1. I can’t see it’s a misrepresentation. He is very sniffy in the books about Willis’ lexical syllabus and went on to produce/publish the LTP book of collocations which seemed to be based on intuition. As I said, both lexical approach books had a big influence on me and much of what is there is excellent but I think that this was a weakness.

  4. […] Dellar’s recent post dismisses the hype behind corpora that was prevalent a few years back with typical gusto. I would […]

  5. Another great post, Hugh. The phrase that resonated with me most was “trust your inner corpora”. I might have it framed and put above my bed.

    1. Glad you enjoyed it Amanda.

      The inner corpora idea is very much what Hoey’s getting at in Lexical Priming. It’s a lovely idea as it’s totally not to do with native or non-native: it’s just to do with how much language of different types you’ve exposed yourself to – and how able you are to access and analyse the contents of the brain’s own corpora.

  6. So you have posted this one after all?!
    Another “in response to…” post is on its way 🙂

    1. Look forward to that Leo.
      In your own time.

  7. Emmanuel Quilala

    Hi Hugh,

    My name is Emmanuel and I am using, for the first time, your Outcomes Advanced textbook in my class. I am trying to create an exam that contains four components: writing, reading comprehension, listening comprehension and speaking. I got hold of the Examview Test Bank from my supervisor and was very disappointed because it did not have the capability of creating this specific type of exam. Or did I miss something while exploring it?

    Can I get advice from you please because I would like to give my class an exam that revolves around the topics and contents of your textbook.

    Sincerely yours,

    Emmanuel Quilala

    1. Hi there Emmanuel –
      Thanks for getting in touch.
      Sorry for not getting back to you earlier.
      I’ll email you about this today.

    2. Hi Emmanuel –
      I just wanted to check whether or not you now had all the information you needed about the Examview stuff that accompanies the OUTCOMES series?
      Please do let me know.

  8. […] highlighted a post, mentioned already, which was a response to @hughdellar‘s earlier post, which had reiterated doubts from ten years ago about the use(fulness) of corpora.  It includes a […]

  9. […] entitled ‘What have corpora ever done for us?’ For those interested, the post is still online, even if that specific blog has […]
