Corpus linguistics is the study of a language as it is expressed in its text corpus (plural: corpora), its “real world” body of text. Corpus linguistics holds that reliable analysis of a language is most feasible with corpora collected in the field, in their natural context (“realia”), and with minimal experimental interference. The text-corpus method is a digestive approach that derives a set of abstract rules governing a natural language from texts in that language, and explores how that language relates to other languages. Corpora were first derived from source texts manually, but the process is now automated.

Corpora have been used not only for linguistic research but also to compile dictionaries (beginning with The American Heritage Dictionary of the English Language in 1969) and grammar guides such as A Comprehensive Grammar of the English Language, published in 1985.

Experts in the field hold different views on annotating a corpus. These range from John McHardy Sinclair, who advocates minimal annotation so that texts speak for themselves, to the Survey of English Usage team (University College London), which advocates annotation as a way to gain greater linguistic understanding through rigorous recording.


A corpus is a database that stores a large sample of what is written and spoken in a language. The scientists who study a language this way (corpus linguists) gather as much published material in the language (English, for example) as they can and put it on a computer: texts from newspapers, books, magazines, pamphlets, newsletters, medicine leaflets and so on. All this information gathered in one place is called a written corpus (after all, it contains only written texts).

As for the spoken corpus, things get much more interesting. Linguists record (with people’s permission) conversations at work, in the supermarket, at home, on the phone, in the street, on park benches, on buses, etc. They also record TV shows, interviews, radio shows, news broadcasts, etc. Afterwards, they transcribe everything and transfer it to the computer, thus obtaining the spoken corpus (the data of the spoken language).

With these two sets of data, the written corpus and the spoken corpus, we linguistic researchers can check all sorts of things with the help of a program developed to search the information in the corpus. That is how we discover interesting things.
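To give a feel for what such a search program does, here is a minimal sketch of a keyword-in-context (concordance) search in Python. The function names `tokenize` and `concordance`, the `width` parameter and the sample sentences are all invented for illustration; real corpus tools are far more sophisticated.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, keyword, width=3):
    """Return every occurrence of `keyword` with up to `width` tokens of context on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Toy sample standing in for a real corpus (assumption: invented text).
sample = "I think the bus is late. The driver said the bus was full."
hits = concordance(tokenize(sample), "bus", width=2)

for left, keyword, right in hits:
    print(f"{left:>15} [{keyword}] {right}")
```

Each hit shows the search word with its neighbours on both sides, which is how a linguist can see at a glance how a word is actually used.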

For example, did you know that the most used word in the English language is the article “the”? That is in the written corpus! However, if we look only at the spoken corpus, we find that the most used word is the pronoun “I”! If we put the two corpora together, “the” beats every other word.
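The kind of count behind this observation can be sketched in a few lines of Python. The two sample texts below are invented and were chosen only so that they mimic the pattern described; they are not real corpus data, and `word_counts` is a hypothetical helper name.

```python
from collections import Counter
import re

def word_counts(text):
    """Count lowercase word tokens in a text (crude tokenizer, for illustration only)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Tiny invented stand-ins for a written and a spoken corpus.
written = "The report describes the results of the study in the appendix."
spoken = "I think I saw it, and I said so."

written_counts = word_counts(written)
spoken_counts = word_counts(spoken)
combined_counts = written_counts + spoken_counts  # merge the two corpora

print(written_counts.most_common(1))
print(spoken_counts.most_common(1))
print(combined_counts.most_common(1))
```

Note that the tokenizer lowercases everything, so the pronoun “I” is counted as `i`.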

Another curiosity: did you know that the passive voice in English is used much more often in scientific and journalistic texts than in everyday conversation? In other words, if you want to learn English just to travel and make friends, you don’t need to memorize the rules of the passive voice. But if you want to be a good journalist or write good scientific texts, it is a different story.

With the corpus we also discover which words are most often used together with other words (collocations). We found that the past simple is used more often than the present perfect. And we also found that the present simple is by far the most used tense in the English language.
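As a rough sketch of how collocations can be found, the Python snippet below counts the words that appear within a small window to the right of a chosen word. The helper name `collocates`, the `window` parameter and the sample sentences are assumptions made for illustration. Real studies usually rank candidates with an association measure such as mutual information rather than raw counts, and they respect sentence boundaries, which this sketch ignores.

```python
from collections import Counter
import re

def collocates(tokens, node, window=2):
    """Count the tokens found within `window` positions to the right of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            counts.update(tokens[i + 1:i + 1 + window])
    return counts

# Invented sample text; a real corpus would contain millions of words.
sample = "She made a decision. He made a mistake. They made a decision quickly."
tokens = re.findall(r"[a-z']+", sample.lower())

made_collocates = collocates(tokens, "made", window=2)
print(made_collocates.most_common())
```

In this toy sample, “decision” follows “made” more often than “mistake” does, which is the sort of pattern a dictionary or textbook author can then report.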

In short, with this wonderful science English teachers can get an idea of what to teach their students. Book authors can give more accurate information about one grammatical structure or another, and they can also tell readers and students how words are used in combination with other words.

And that, folks, is how I am able to tell you, based on this information, how one word or another is used in English and how a given word ranks in frequency. Keep in mind that the explanation given here is very simple and is meant only to satisfy the curiosity of many. After all, there is still a lot to be said about corpus linguistics and its benefits for the teaching and learning of a language.


Some early efforts at grammatical description were based, at least in part, on corpora of particular religious or cultural significance. For example, the Prātiśākhya literature described the sound patterns of Sanskrit as found in the Vedas, and Pāṇini’s grammar of Classical Sanskrit was based, at least in part, on analysis of that same corpus. Likewise, the early Arabic grammarians paid special attention to the language of the Qur’an. In the Western European tradition, scholars prepared concordances to allow detailed study of the language of the Bible and other canonical texts.

English corpora

A milestone in modern corpus linguistics was the publication of Computational Analysis of Present-Day American English in 1967. Written by Henry Kučera and W. Nelson Francis, the work was based on an analysis of the Brown Corpus, a contemporary compilation of about a million words of American English, carefully selected from a wide variety of sources. Kučera and Francis subjected the Brown Corpus to a variety of computational analyses and then combined elements of linguistics, language teaching, psychology, statistics, and sociology to create a rich and varied work. Another important publication was Randolph Quirk’s “Towards a description of English Usage” in 1960 [4], in which he introduced the Survey of English Usage.

Shortly afterward, the Boston publisher Houghton Mifflin approached Kučera to supply a million-word, three-line citation base for its new American Heritage Dictionary, the first dictionary compiled using corpus linguistics. The AHD took the innovative step of combining prescriptive elements (how the language should be used) with descriptive information (how it is actually used).

Other publishers followed suit. The British publisher Collins’ COBUILD monolingual learner’s dictionary, designed for users learning English as a foreign language, was compiled using the Bank of English. The Survey of English Usage Corpus was used in the development of one of the most important corpus-based grammars, which was written by Quirk et al. and published in 1985 as A Comprehensive Grammar of the English Language.

The Brown Corpus also spawned a number of similarly structured corpora: the LOB Corpus (1960s British English), Kolhapur (Indian English), Wellington (New Zealand English), the Australian Corpus of English (Australian English), the Frown Corpus (early 1990s American English) and the FLOB Corpus (1990s British English). Other corpora represent many languages, varieties, and modes. They include the International Corpus of English and the British National Corpus, a 100-million-word collection of a range of spoken and written texts created in the 1990s by a consortium of publishers, universities (Oxford and Lancaster) and the British Library. For contemporary American English, work has stalled on the American National Corpus.

The first computerized corpus of transcribed spoken language was constructed in 1971 by the Montreal French Project. It contained one million words and inspired Shana Poplack’s much larger corpus of French spoken in the Ottawa-Hull area.

Multilingual corpora

In the 1990s, many of the first notable successes of statistical methods in natural language processing (NLP) occurred in the field of machine translation, primarily due to work at IBM Research. These systems were able to take advantage of the existing multilingual text corpora that had been produced by the Parliament of Canada and the European Union as a result of laws requiring the translation of all governmental proceedings into all the official languages of the corresponding systems of government.

Ancient language corpora

In addition to these corpora of living languages, computerized corpora have also been built from collections of texts in ancient languages. An example is the Andersen-Forbes database of the Hebrew Bible, developed since the 1970s, in which each clause is parsed using graphs representing up to seven levels of syntax, and each segment is tagged with seven fields of information. The Quranic Arabic Corpus is an annotated corpus for the Classical Arabic language of the Qur’an. This is a recent project with several layers of annotation, including morphological segmentation, part-of-speech tagging, and syntactic analysis using dependency grammar.

Corpora from specific fields

In addition to pure linguistic inquiry, researchers have begun to apply corpus linguistics to other academic and professional fields, such as the emerging subdiscipline of law and corpus linguistics, which seeks to understand legal texts using corpus data and tools.


Corpus linguistics has spawned several research methods that attempt to trace a path from data to theory. Wallis and Nelson (2001) first introduced what they called the 3A perspective: Annotation, Abstraction and Analysis.

  • Annotation consists of applying a schema to texts. Annotations can include structural markup, part-of-speech tagging, parsing, and various other representations.
  • Abstraction is the translation (mapping) of terms in the schema to terms in a theoretically motivated model or dataset. Abstraction typically includes linguist-directed search, but can also include, for example, rule learning for parsers.
  • Analysis consists of probing, manipulating and statistically generalizing from the data set. The analysis may include statistical evaluations, rule base optimization, or knowledge discovery methods.
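The three stages can be illustrated with a toy pipeline in Python. This is only a sketch under loud assumptions: the tag lexicon, the content/function word model and the function names (`annotate`, `abstract_tags`, `analyse`) are all hypothetical and are not taken from Wallis and Nelson.

```python
from collections import Counter

# Annotation: apply a schema to the text. Here the schema is a toy
# part-of-speech lexicon (hypothetical; real corpora use full tagsets).
LEXICON = {"the": "DET", "a": "DET", "dog": "NOUN", "cat": "NOUN",
           "chased": "VERB", "saw": "VERB", "quickly": "ADV"}

def annotate(tokens):
    """Pair each token with its tag, or 'UNK' if the lexicon does not know it."""
    return [(token, LEXICON.get(token, "UNK")) for token in tokens]

# Abstraction: map schema terms to the terms of a research model.
# The model here (content vs. function words) is an assumption for the demo.
TAG_TO_CLASS = {"NOUN": "content", "VERB": "content", "ADV": "content",
                "DET": "function"}

def abstract_tags(annotated):
    """Translate each tag into the model's category."""
    return [TAG_TO_CLASS.get(tag, "other") for _, tag in annotated]

# Analysis: generalize statistically from the abstracted data.
def analyse(classes):
    """Return the proportion of tokens falling into each category."""
    counts = Counter(classes)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

tokens = "the dog chased a cat".split()
proportions = analyse(abstract_tags(annotate(tokens)))
print(proportions)
```

Keeping the stages as separate functions mirrors the idea that the same annotation can feed different abstractions, and the same abstraction can feed different analyses.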

Most lexical corpora today are tagged with word classes (part-of-speech tagging, or POS tagging). However, even corpus linguists working with ‘plain unannotated text’ inevitably apply some method of isolating salient terms. In such situations, annotation and abstraction are combined in a lexical search.
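The difference between searching plain text and searching a POS-tagged corpus can be sketched as follows. The sentences and their hand-assigned tags are invented for illustration, and the tag labels are an assumed, simplified tagset.

```python
import re

# Plain, unannotated text: a lexical search isolates the term by its form alone.
plain = "They record the talks. The record was broken. We record everything."
plain_hits = re.findall(r"\brecord\b", plain)

# The same text with hand-assigned POS tags (hypothetical, simplified tagset).
tagged = [("They", "PRON"), ("record", "VERB"), ("the", "DET"), ("talks", "NOUN"),
          ("The", "DET"), ("record", "NOUN"), ("was", "AUX"), ("broken", "VERB"),
          ("We", "PRON"), ("record", "VERB"), ("everything", "PRON")]

# With tags available, the search can separate the noun from the verb.
noun_hits = [word for word, tag in tagged if word.lower() == "record" and tag == "NOUN"]
verb_hits = [word for word, tag in tagged if word.lower() == "record" and tag == "VERB"]

print(len(plain_hits), len(noun_hits), len(verb_hits))
```

The plain search finds every occurrence of the form, while the tagged search can tell the noun “record” apart from the verb.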

The advantage of publishing an annotated corpus is that other users can then perform experiments on it (via corpus managers). Linguists whose interests and perspectives differ from those of the corpus’s creators can build on this work. By sharing data, corpus linguists are able to treat the corpus as a locus of linguistic debate and further study.

