Corpus Linguistics 101

Apr 15, 2016

What is a corpus (plural: corpora)?

“A collection of naturally occurring examples of language, consisting of anything from a few sentences to a set of written texts or tape recordings, which have been collected for linguistic study. More recently, the word has been reserved for collections of texts (or parts of text) that are stored and accessed electronically” (Hunston, 2002, p. 2).

“A corpus is a principled collection of authentic texts stored electronically that can be used to discover information about language that may not have been noticed through intuition alone” (Bennett, 2010, p. 12).

What is corpus linguistics?

“Corpus linguistics is essentially a methodology or set of methodologies, rather than a theory of language description.

Essentially, corpus linguistics means this:

looking at naturally occurring language;
looking at relatively large amounts of such language;
observing relative frequencies, either in raw form or mediated through statistical operations;
observing patterns of association, either between a feature and a text type or between groups of words.

Reduced to its essence in this way, corpus linguistics appears to be ‘theory neutral,’ although the practice of doing corpus linguistics is never neutral, as each practitioner defines what is meant by a ‘feature’ and what frequencies should be observed, in line with a theoretical approach to what matters in language” (Hunston, 2006, p. 244).

What is the value of corpus linguistics?

“Corpus Linguistics serves to answer two fundamental research questions:

What particular patterns are associated with lexical or grammatical features?
How do these patterns differ within varieties and registers?” (Bennett, 2010, p. 2)

What are the limitations of corpus linguistics?

Corpora cannot provide all possible language no matter how large they are.
Corpora cannot provide explanation or information.
Corpora present language out of context.
Corpora cannot do much without the appropriate (concordancing) software.

The history of corpus linguistics

The principles have been around for almost a century through lexicographers’ work.

The Brown Corpus was the first computer-based corpus; it was created in 1961 and contained about one million words (Bennett, 2010).

Some renowned corpus linguistics scholars

John Sinclair, Susan Hunston, Douglas Biber, Geoffrey Leech, Stig Johansson, Susan Conrad, Gill Francis, Michael McCarthy, Averil Coxhead, Randi Reppen, Anne O’Keeffe, Ronald Carter, and Edward Finegan.

The characteristics of the corpus approach

It is empirical, analyzing the actual patterns of language use in natural texts.
It utilizes a large and principled collection of natural texts as the basis for analysis.
It makes extensive use of computers for analysis.
It depends on both quantitative and qualitative analytical techniques. (Biber, Conrad, & Reppen, 1998, p. 4, as cited in Bennett, 2010, p. 7).

How does corpus linguistics help language teachers?

Frequency
Phraseology
Lexicogrammar
Register
ESP
Nuances of language
Syllabus design
Cultural attitudes
Translation studies

What are the various types of corpora?

How can a teacher create her/his own corpora?

Bennett’s (2010) framework for creating corpus-designed activities

Ask a research question.
Determine the register on which your students are focused.
Select a corpus appropriate for the register (or compile authentic texts from that register).
Utilize a concordancing program for quantitative analysis.
Engage in qualitative analysis.
Create exercises for students.
Engage students in a whole-language activity. (p. 18).
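To give a feel for what the concordancing step in the framework above involves, here is a minimal keyword-in-context (KWIC) sketch. It is not Bennett's procedure or any particular concordancer's method; the function name `kwic`, the `width` parameter, and the sample sentence are all made up for illustration, and a real concordancing program does far more (sorting, wildcards, tagged searches).

```python
import re

def kwic(text, node, width=3):
    """Return (left context, node word, right context) for each hit.

    Assumption: tokens are lowercase alphanumeric runs; real tools
    tokenize far more carefully.
    """
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Hypothetical sample text, not taken from any corpus.
sample = ("Perhaps the results are clear. "
          "The results are perhaps less clear than expected.")

for left, node, right in kwic(sample, "perhaps", width=2):
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>20}  [{node}]  {right}")
```

Lining up the node word in a column is what lets a teacher or learner scan the contexts on either side for recurring patterns, which is where the qualitative analysis step begins.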

Corpus Linguistics Terminology

Token: every occurrence of a word in a text counts as one token.
Type: repeated occurrences of the same word are counted as one type only.
Hapax: a word / token that occurs one time only.
Lemma: a superordinate of a group of word forms (see below).
Word form: e.g. “book” and “books” are two word forms belonging to the same lemma.
Tag: to add a code to each word in a corpus to indicate, for example, its part of speech.
Parse: to analyze a text into constituents (e.g. clauses and groups).
Annotate: a superordinate term for tagging and parsing. (Hunston, 2002, pp. 16 – 20)
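The first few terms above can be made concrete with a short sketch that counts tokens, types, hapaxes, and lemmas in a sample sentence. The tokenizer (lowercased alphabetic strings), the hand-made `LEMMA_MAP`, and the sample sentence are simplifying assumptions for illustration only; real corpus tools use far more careful tokenization and lemmatization.

```python
import re
from collections import Counter

# Hypothetical, hand-made lemma table covering only the sample below.
LEMMA_MAP = {"books": "book", "is": "be", "are": "be"}

def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def corpus_stats(text):
    tokens = tokenize(text)
    type_freq = Counter(tokens)  # one entry per type
    lemma_freq = Counter(LEMMA_MAP.get(t, t) for t in tokens)
    return {
        "tokens": len(tokens),    # every running word counts
        "types": len(type_freq),  # distinct word forms
        "hapaxes": sorted(w for w, n in type_freq.items() if n == 1),
        "lemma_freq": lemma_freq,  # word forms grouped under a lemma
    }

sample = "The book is on the table. The books are on the shelf."
stats = corpus_stats(sample)
print(stats["tokens"], "tokens,", stats["types"], "types")
print("hapaxes:", stats["hapaxes"])
print("lemma 'book':", stats["lemma_freq"]["book"])
```

In this sample, “book” and “books” are two separate types (and each is a hapax), but they are counted together once they are grouped under the lemma “book”.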

Some Popular Corpora Available for Free

The Compleat Lexical Tutor contains many activities for data-driven language learning (DDLL), including a concordancer and a vocabulary profiler. http://www.lextutor.ca/
British National Corpus (BNC) “is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written”. http://www.natcorp.ox.ac.uk/
Collins Cobuild Concordance and Collocation Sampler contains around 56 million words. http://www.collins.co.uk/corpus/CorpusSearch.aspx
Open American National Corpus (OANC): includes over 14 million words from the Second Release that can be freely distributed. http://www.americannationalcorpus.org/OANC/index.html
The Michigan Corpus of Upper-Level Student Papers (MICUSP) “is a collection of around 830 A grade papers (roughly 2.6 million words) from a range of disciplines across four academic divisions (Humanities and Arts, Social Sciences, Biological and Health Sciences, Physical Sciences) of the University of Michigan (U-M), Ann Arbor. MICUSP was created by a team of researchers and students at the U-M English Language Institute (ELI)”. http://micusp.elicorpora.info/
Michigan Corpus of Academic Spoken English (MICASE) “is a collection of nearly 1.8 million words of transcribed speech (almost 200 hours of recordings) from the University of Michigan (U-M) in Ann Arbor, created by researchers and students at the U-M English Language Institute (ELI). MICASE contains data from a wide range of speech events (including lectures, classroom discussions, lab sections, seminars, and advising sessions) and locations across the university”. http://micase.elicorpora.info/

Example of a study that uses language corpora

Hatzitheodorou A. and Mattheoudakis M. (2011). The impact of culture on the use of stance exponents as persuasive devices: the case of GRICLE and English native speaker corpora. In A. Frankenberg-Garcia, L. Flowerdew, and G. Aston (Eds.), New trends in corpora and language learning (pp. 229 – 246). New York, N.Y.: Continuum International Publishing Group.

Purpose of the study: to examine how stance exponents are used by advanced Greek learners of English at tertiary level and by native speakers of English (American university students) to achieve persuasion. Stance exponents are:

a) boosters (e.g. of course, undoubtedly, evidently),

b) attitude markers (e.g. luckily, hopefully, unfortunately), and

c) hedges (e.g. possibly, perhaps, probably).

Participants: “176 Greek native speakers in the third and fourth years of university studies at the School of English, Aristotle University of Thessaloniki in Greece, aged 20 to 22 years” (p. 234).

Corpora used in data analysis:

GRICLE (Greek Corpus of Learner English): 177,490 words.
LOCNESS (Louvain Corpus of Native English Speakers): 149,580 words.
PELCRA (Polish and English Language Corpora for Research and Application): 25,467 words.

HNC (Hellenic National Corpus): 1,725,214 words.
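Because these corpora differ in size, raw counts from them cannot be compared directly; a common approach is to normalize counts to a rate per fixed number of words. The sketch below shows the arithmetic only. The corpus sizes come from the list above, but the raw counts are hypothetical placeholders, not the study's data, and the function name `per_10k` is made up for illustration.

```python
def per_10k(raw_count, corpus_size):
    """Normalize a raw frequency to occurrences per 10,000 words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count / corpus_size * 10_000

# Corpus sizes as listed above.
corpus_sizes = {"GRICLE": 177_490, "LOCNESS": 149_580}

# HYPOTHETICAL raw counts of some feature, for illustration only;
# these are NOT figures from the study.
hypothetical_counts = {"GRICLE": 120, "LOCNESS": 120}

for name, size in corpus_sizes.items():
    rate = per_10k(hypothetical_counts[name], size)
    print(f"{name}: {rate:.2f} per 10,000 words")
```

The same raw count gives a higher normalized rate in the smaller corpus, which is why frequency comparisons across corpora are usually reported as normalized rates rather than raw totals.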

Findings: Greek learners are more emphatic writers and tend to use more boosters and attitude markers than their American counterparts do. On the other hand, American university students use more hedges than the Greek learners. The authors relate this to the tendency of the Greek learners to transfer some linguistic and cultural patterns from their L1 to their L2, e.g. projecting a confident attitude and showing certainty. The study is an important addition to cross-cultural and contrastive rhetoric research.

Books:

1. Bennett, G. R. (2010). Using corpora in the language learning classroom: Corpus linguistics for teachers. Ann Arbor, MI: University of Michigan Press.

This is an enjoyable and easy-to-read book written in simple language for classroom teachers. It provides readers with the basic theoretical foundations of corpus linguistics, along with many examples of corpus-designed activities that language teachers can use as is in their classrooms or adapt to suit their learners. The book also has some useful guidelines on how to create one’s own corpora for classroom use. It is highly recommended for busy language teachers who want a good understanding of the subject and hands-on activities without delving into complex academic details and / or arguments.

2. Jaén M. M., Serrano F. and Calzada Pérez M. (Eds.) (2010). Exploring new paths in language pedagogy: Lexis and corpus-based language teaching. Oakville, CT, USA: Equinox.

This book is a collection of 18 chapters divided into three distinct sections that update the readers on the recent theory, research, and practice in a) the development of L2 vocabulary knowledge and b) the contribution of corpus-based evidence to language teaching. The book is a very good combination of theory and practice in these two areas and is written mainly to benefit learners / teachers of English as a foreign / second language even though some chapters include other languages such as French and Spanish. It is a very good read for people who have a particular interest in vocabulary learning / acquisition and are willing to learn how corpus linguistics could be of help in this regard.

3. Sinclair J. M. (Ed.) (2004). How to use corpora in language teaching. Philadelphia, PA: J. Benjamins.

This book is a collection of twelve different chapters that range in focus from practical issues of concern to language teachers (basically delineating how to use corpora in the language classroom) to recent research in corpus linguistics. The chapters are mainly written by a new generation of language teachers who have the zeal and acknowledged expertise in diverse areas related to the use of corpora in language learning. The book thus updates the readers on recent research findings, draws their attention to future research directions, and provides them with practical ideas that they can use with their own learners.

4. Frankenberg-Garcia A., Aston G. and Flowerdew L. (Eds.) (2011). New trends in corpora and language learning. New York, N.Y.: Continuum International Publishing Group.

This book comprises a collection of articles / chapters that give the readers an up-to-date idea of the recent research and developments in the use of corpora not only in language learning but also in other areas such as translation studies. The book is divided into three macro parts, five chapters each, that represent the current trends in corpus linguistics study design and application. Part one (corpora with language learners: use) presents five distinct uses of corpora by different learners coming from five different contexts. Part two (corpora for language learners: tools) gives the readers innovative examples of classroom-focused tools that were designed to help language learners meet their specific needs. The final part (corpora by language learners: learner language) presents exploratory studies conducted to investigate the language characteristic of various groups of learners using learner corpora. The book is scholarly and written for professionals who have very good knowledge of corpus linguistics and want to be updated on the latest studies and the direction of future research.

5. Kawaguchi Y., Takagaki T., Tomimori N. and Tsuruga Y. (Eds.) (2007). Corpus-based perspectives in linguistics. Philadelphia: J. Benjamins Pub. Co.

This book is not written for language teachers. It is rather written from a linguistics (versus applied linguistics) perspective in order to report on some corpus-based research findings across various domains such as dictionaries, translation, sociolinguistics, linguistic atlases, and others. The twenty-one chapters of the book thus introduce several studies that focus on other languages such as Chinese, Japanese, French, and Malay. It will be of interest to readers who would like to explore the area of corpus linguistics outside of the classroom, which is a very rich and diverse domain.

6. Hunston, S. (2002). Corpora in applied linguistics. New York: Cambridge University Press.

This book is mainly written for students of applied linguistics in order to provide them with solid (and sometimes detailed) knowledge about corpus linguistics and how to use language corpora for various purposes. There are two important themes that run consistently through the book: 1) how corpus linguistics has affected the study, theories, and even description of language, and 2) an explanation of the various methods used in investigating the various corpora. The book also has three chapters (i.e. 6 – 8) devoted to the use of corpora in language teaching particularly ESL / EFL, but similar examples can be used in the teaching of other languages.

7. Aston G., Bernardini S. and Stewart D. (Eds.) (2004). Corpora and language learners. Philadelphia: John Benjamins Pub.

Similar to the fourth book on this list.

8. Aijmer K. and Altenberg B. (Eds.) (2004). Advances in Corpus Linguistics. Amsterdam: Rodopi.

This book is a collection of the presentations given at the 23rd ICAME (International Computer Archive of Modern and Medieval English) conference, held in 2002. The 22 chapters are divided into six broad categories that cover a wide range of interests including grammar and lexis. The book is basically written for corpus linguists and other interested researchers.

9. Renouf A. and Kehoe A. (Eds.) (2006). The Changing Face of Corpus Linguistics. Amsterdam: Rodopi.

Similar to the previous reference, except that it covers the 24th ICAME conference, held in 2003.

Journal Articles:

1. Romer, U. (2011). Corpus Research Applications in Second Language Teaching. Annual Review of Applied Linguistics, 31, 205 – 225.

This article is an excellent resource for language teachers as it elucidates how corpus linguistics could have many applications in the second language classroom. The author divides the pedagogical corpus applications into indirect applications (e.g. materials writing and syllabus design) and direct ones (e.g. teacher-corpus interaction and learner-corpus interaction) and gives examples of each using general and specialized language corpora. Highly recommended for language teachers!

2. Kachru, Y. (2008). Language variation and corpus linguistics. World Englishes, 27 (1), 1 – 8.

This journal article gives a critical account of corpus linguistics and some of its shortcomings by examining data from the lexicon and grammar of world Englishes. The author’s position is to avoid over-reliance on patterns of use as this could trick the researchers. She states, for example, that analyzing language corpora cannot help the researchers detect the recurrent features of language users (i.e. diatypic variation versus dialect variation).

3. Flowerdew, L. (2009). Applying corpus linguistics to pedagogy: A critical evaluation. International Journal of Corpus Linguistics, 14 (3), 393 – 417.

This article is a response to some of the literature that adopted a more critical review of corpus linguistics and data-driven learning (DDL). The article reviews some of these criticisms (particularly the decontextualisation of corpus data) and offers some suggestions and insights in applying corpus linguistics to pedagogy. It is an empowering article for language teachers, as the author has her eyes clearly focused on pedagogy.

Specialized Journals:

1. Corpus Linguistics and Linguistic Theory

“[It] is a newly founded, peer-reviewed journal publishing high-quality original corpus-based research focusing on theoretically relevant issues in all core areas of linguistic research (phonology, morphology, syntax, semantics, pragmatics), or other recognized topic areas”. [Source: http://www.degruyter.com/view/j/cllt. Accessed: Feb. 14, 2012. Available through UofT].

2. The International Journal of Corpus Linguistics (IJCL)

“[It] publishes original research covering methodological, applied and theoretical work in any area of corpus linguistics. Through its focus on empirical language research, IJCL provides a forum for the presentation of new findings and innovative approaches in any area of linguistics (e.g. lexicology, grammar, discourse analysis, stylistics, sociolinguistics, morphology, contrastive linguistics), applied linguistics (e.g. language teaching, forensic linguistics), and translation studies. Based on its interest in corpus methodology, IJCL also invites contributions on the interface between corpus and computational linguistics. The journal has a major reviews section publishing book reviews as well as corpus and software reviews. The language of the journal is English, but contributions are also invited on studies of languages other than English. IJCL occasionally publishes special issues (for details please contact the editor). All contributions are peer-reviewed”. [Source: http://benjamins.com/#catalog/journals/ijcl. Accessed Feb. 14, 2012. Available through UofT].
