
2. Corpus-based lexicography

The recent development of corpus linguistics has given rise to corpus-based lexicography and a new generation of corpus-based dictionaries. For example, the COBUILD English Dictionary used the Bank of English, a corpus of 20 million words of contemporary English developed at the University of Birmingham. The Longman Dictionary of Contemporary English and the Oxford Advanced Learner’s Dictionary of Current English used the British National Corpus.

The British National Corpus is a very large (over 100 million words) corpus of modern English, both spoken and written. The Corpus is designed to represent as wide a range of modern British English as possible. The written part (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, and school and university essays, among many other kinds of text. Texts are selected for inclusion in the corpus according to three independent selection criteria: domain (75% of texts come from informative writing, e.g. from the fields of applied sciences or art; 25% come from imaginative writing, i.e. literary and creative works), time (mostly texts since 1975) and medium (60% of written texts are books, 25% are periodicals).

The spoken part (10%) of the British National Corpus includes a large amount of unscripted informal conversation, recorded by volunteers selected from different age groups, regions and social classes in a demographically balanced way, together with spoken language collected in many different contexts, ranging from formal business or government meetings to radio shows and phone-ins.

The use of corpora in dictionary-making gives a compiler many opportunities. Among the most important are the opportunities:

1) to produce and revise dictionaries much more quickly than before, thus providing up-to-date information about language;

2) to give more complete and precise definitions since a larger number of natural examples are examined;

3) to keep track of new words entering the language, and of existing words changing their meanings, thanks to an open-ended (constantly growing) monitor corpus;

4) to describe uses of particular words or phrases that are typical of particular varieties and genres, since corpus data carry rich textual information: regional variety, author, date, part-of-speech tags, genre, etc.;

5) to organize examples extracted from corpora easily into more meaningful groups for analysis, and to describe and present them with special stress on their collocations, for example by sorting the right-hand context of a word alphabetically so that all instances of a particular collocate can be seen together;

6) to treat phrases and collocations more systematically than was previously possible, owing to the ability to call up word combinations rather than single words and to the existence of mutual information tools, which establish relationships between co-occurring words;

7) to register cultural connotations and underlying ideologies which a language has.
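Points 5 and 6 above can be illustrated with a small sketch. It is not the software used by any of the dictionary teams mentioned here; the function names (`tokenize`, `kwic_sorted_by_right`, `pmi`), the crude tokenizer and the five-sentence sample are assumptions made for illustration only. The first function builds concordance lines for a node word and sorts them alphabetically by right-hand context; the second computes pointwise mutual information for an adjacent word pair, one common way of measuring how strongly two words attract each other.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def kwic_sorted_by_right(tokens, node, width=3):
    """Return (left, node, right) concordance lines for `node`,
    sorted alphabetically by the right-hand context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return sorted(lines, key=lambda line: line[2])

def pmi(tokens, w1, w2):
    """Pointwise mutual information of the adjacent pair (w1, w2):
    log2(P(w1, w2) / (P(w1) * P(w2))). Returns None if the pair never occurs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair_count = bigrams[(w1, w2)]
    if pair_count == 0:
        return None
    n = len(tokens)
    p_pair = pair_count / (n - 1)  # there are n - 1 adjacent pairs in n tokens
    return math.log2(p_pair / ((unigrams[w1] / n) * (unigrams[w2] / n)))

# Toy sample: partly the example sentences quoted in this section, partly invented.
sample = ("She bears little resemblance to her mother. The document bore his signature. "
          "He could not bear the pain. They bear a heavy load. I cannot bear the noise.")
tokens = tokenize(sample)

for left, node, right in kwic_sorted_by_right(tokens, "bear"):
    print(f"{left:>20}  {node}  {right}")
```

On a sample this small the mutual information score means very little; with a corpus of millions of words, high-scoring pairs are candidates for treatment as collocations in the dictionary entry.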

If we compare the third (1974), the fifth (1995), and the seventh (2005) editions of OALD, we see that the more prominent and frequent meanings of words come to the fore, displacing less frequent ones even where those remain the primary meanings of the words. The first meaning of the verb to bear in the fifth edition is ‘to show’, as used, for example, in: The document bore his signature or She bears little resemblance to her mother. That was a new development compared with the third edition, where the first meaning of to bear is the prototypical sense of the verb: ‘to carry’, e.g. to bear a heavy load. In the seventh edition of OALD, to bear is defined first as ‘to accept, to deal with’, as in: The pain was almost more than he could bear, with the meaning ‘to show’ only the sixth to be mentioned.

We may suggest that in its present state OALD fully justifies its name as a dictionary of Current English. The use of the British National Corpus enabled the editors, as the Preface to the fifth edition by Jonathan Crowther reads, “to determine as never before the relative frequency of words and their meanings, to identify new words and co-occurrences of words, and to present a wholly accurate picture of the syntactic patterns of today’s English”.

Some of the lexicographical giants have their own electronic text archives, which they use depending on the type of dictionary being compiled. For example, the Longman Corpus Network is a diverse, far-reaching group of databases consisting of many millions of words. Five highly sophisticated language databases form the nucleus of the Network: the Longman Learners’ Corpus (10 million words of writing in English by learners of the language from over 125 different countries); the Longman Written American Corpus (140 million words of American newspaper and book text); the Longman Spoken American Corpus (a unique resource of 5 million words of everyday American speech); the Spoken British Corpus (which gives objective information for the first time on what spoken English is really like and how it differs from written British English); and the Longman/Lancaster Corpus (over 30 million words covering an extensive range of written texts, from literature to bus timetables).

If we look at the first (1978) and the third (1995) editions of LDOCE, we shall observe similar developments. For example, the adjective flat is now registered in the meaning ‘not busy’ (The building industry’s been completely flat for several years), which was missing in the first edition.

Most current learner’s dictionaries have one feature in common: they no longer use invented examples but rely on corpora of authentic English. The Collins COBUILD group was the first to emphasize providing the learner with authentic data of real English. The techniques used to create the Collins COBUILD English Language Dictionary involved modern computer technology: “For the first time, a dictionary has been compiled by the thorough examination of a representative group of English texts, spoken and written, running to many million words” (J. Sinclair, Editor in Chief, Introduction to COBUILD, 1987). The large group of texts, called the Bank of English, contained about 200 million words including 15 million words of spoken English. This provided reasonable grounds for measuring the frequencies of words and for deciding which words should be left out as outdated items and which should be preserved and given precedence in the dictionary. The words in the corpus came from books, magazines, newspapers, pamphlets, leaflets, conversations, and radio and television broadcasts. The aim of collecting this textual evidence on computer was “to provide a fair representation of contemporary English”.

The new generation of ‘computer-corpus-based’ dictionaries differs from traditional dictionaries in that computer tools are applied, which affects the choice of words for the dictionary and the order of meanings in entries. It is obvious, however, that any dictionary of any period rests on a certain corpus of words. The difference lies in how the data are processed: in traditional dictionaries the corpus was collected and analyzed manually.

Most contemporary learner’s dictionaries are to a varying extent ‘computer-corpus-based’. LDOCE was the first to benefit from a thorough scrutiny of the Spoken English Corpus, which had a remarkable effect on the coverage of some frequent words. The corpora that make up the Longman Corpus Network enabled the lexicographers to establish frequencies of usage and the most common constructions of words. The major result of this study was the marking of the 3,000 most frequent words “both in spoken and written English, again relying on the authentic data from American as well as British English” (D. Summers, Introduction to LDOCE, 1995). Although the use of computerized corpora in this case primarily affected the choice of words rather than the definitions (one of the integral features of LDOCE is its 2,000-word defining vocabulary), it caused a considerable transformation. As a result, greatly enhanced coverage was achieved, and already in the third edition of the dictionary (1995) all the definitions of a word’s meanings were presented in frequency order, with the most common meanings first.
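The idea of marking the most frequent words in spoken and written English can be sketched as follows. This is only an illustration under stated assumptions: the function names (`top_n`, `mark_core_vocabulary`) and the labels 'S', 'W' and 'S+W' are hypothetical and are not LDOCE's actual notation, and the two tiny word lists are invented stand-ins for a spoken and a written corpus.

```python
from collections import Counter

def top_n(tokens, n):
    """Return the set of the n most frequent word types in a token list."""
    return {word for word, _ in Counter(tokens).most_common(n)}

def mark_core_vocabulary(spoken_tokens, written_tokens, n=3000):
    """Label each word by whether it falls among the top n words of the
    spoken corpus ('S'), the written corpus ('W'), or both ('S+W')."""
    spoken_top = top_n(spoken_tokens, n)
    written_top = top_n(written_tokens, n)
    marks = {}
    for word in spoken_top | written_top:
        if word in spoken_top and word in written_top:
            marks[word] = "S+W"
        elif word in spoken_top:
            marks[word] = "S"
        else:
            marks[word] = "W"
    return marks

# Invented toy data; a real study would use millions of running words.
spoken = "you you you well well know know i mean yeah".split()
written = "the the the the you you you know know data".split()

marks = mark_core_vocabulary(spoken, written, n=3)
print(sorted(marks.items()))
```

With real corpora the cut-off `n` would be set to 3,000, and ties at the cut-off would need an explicit policy, which this sketch does not attempt.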

The significant innovations that were introduced in the fourth (1989), the fifth (1995), and the seventh (2005) editions of OALD by A. S. Hornby stemmed from the use of the British National Corpus, “a massive and carefully balanced computer data bank of modern written and spoken English developed by a consortium of British publishers led by Oxford University Press” (J. Crowther, Preface to OALD, 1995).

Behind CIDE (1995) is a corpus called the International Cambridge Language Survey. It covers instances of words within one hundred million items representing major varieties of English. The specific innovation of this dictionary is the way it solves the problem of polysemy. Each entry presents only one core meaning of the word, to which the reader is directed by the guide-word, e.g. head: 1) BODY PART, 2) MIND, 3) TOP PART (Diana, the guest of honour, sat at the head of the table), 4) LEADER (the head of the History Department), 5) DEVICE (You need to keep your tape recorder heads clean by using a special cleaning fluid). The search for the right word among homographs is thus facilitated through the use of innovative design features.

For its development, modern lexicography depends on further use of computer corpora providing authentic data and texts of real English. Its progress also presupposes the creation of new devices that help learners to use English actively and to the best advantage.

One such device is the Russian Learner Corpus of English, which is part of the International Corpus of Learner English (ICLE). It consists of essays written by Russian students of English, which have been keyed into a computer and are now available in computer-readable form. If we compare the use of words in the Russian Learner Corpus with a Native Speaker Corpus of comparable size and register, we can identify factors of non-nativeness and lack of idiomaticity in Russian learner writing. For example, the use of -ly adverbs as modal disjuncts is largely limited to certainly and really, which are grossly overused in the Russian corpus, partly because their translation equivalents конечно, действительно, безусловно are functionally prominent words in the Russian language. These -ly adverbs are overused to the exclusion of other modal adverbs: clearly, definitely and apparently are underrepresented in the Russian Learner Corpus, and admittedly, presumably and supposedly are absent altogether.
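A comparison of this kind can be sketched as follows. The sketch assumes that raw counts are first normalized to frequencies per million words, so that corpora of different sizes can be compared, and that a word is then labelled according to which corpus shows the higher rate. The function names (`per_million`, `compare_usage`), the status labels and all the counts are invented for illustration; they are not figures from the Russian Learner Corpus.

```python
from collections import Counter

def per_million(count, corpus_size):
    """Normalize a raw count to a frequency per million running words."""
    return count * 1_000_000 / corpus_size

def compare_usage(learner_tokens, native_tokens, words):
    """For each word, return (learner rate, native rate, status), where the
    status says whether learners overuse, underuse or never use the word."""
    learner_counts = Counter(learner_tokens)
    native_counts = Counter(native_tokens)
    report = {}
    for word in words:
        learner_rate = per_million(learner_counts[word], len(learner_tokens))
        native_rate = per_million(native_counts[word], len(native_tokens))
        if learner_rate == 0 and native_rate == 0:
            status = "absent in both"
        elif learner_rate == 0:
            status = "absent in learner corpus"
        elif native_rate == 0:
            status = "learner corpus only"
        elif learner_rate > native_rate:
            status = "overused"
        elif learner_rate < native_rate:
            status = "underused"
        else:
            status = "equal"
        report[word] = (learner_rate, native_rate, status)
    return report

# Invented toy corpora of 100 tokens each; "x" stands for all other words.
learner = ["certainly"] * 4 + ["really"] * 3 + ["clearly"] + ["x"] * 92
native = ["certainly"] + ["really"] + ["clearly"] * 3 + ["presumably"] * 2 + ["x"] * 93

report = compare_usage(learner, native, ["certainly", "clearly", "presumably"])
for word, (l_rate, n_rate, status) in report.items():
    print(f"{word:12} learner={l_rate:>9.0f} native={n_rate:>9.0f} {status}")
```

A real study would also test whether a difference in rates is statistically significant before calling it overuse or underuse; this sketch only compares the normalized rates.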

Particularly disappointing are the numbers showing certainly in initial position, which is immediately associated with spoken language. The interpersonal or pragmatic uses of certainly (Certainly, science and technology influence people; Certainly, there are great changes; Certainly, I know from my own experience) are much rarer in the Native Speaker Corpus. This suggests that non-native writing tends to be more speech-oriented.
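Counting how often a word opens a sentence can be sketched as below. The sentence splitter is a rough assumption (it splits after '.', '!' or '?' followed by whitespace), the function name `initial_position_share` is hypothetical, and the sample combines two of the learner sentences quoted above with one invented sentence.

```python
import re

def initial_position_share(text, word):
    """Return (sentence-initial occurrences, all occurrences) of `word`."""
    # Rough sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    initial = 0
    total = 0
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        total += tokens.count(word)
        if tokens and tokens[0] == word:
            initial += 1
    return initial, total

sample_text = ("Certainly, science and technology influence people. "
               "Certainly, there are great changes. "
               "This is certainly true.")

initial, total = initial_position_share(sample_text, "certainly")
print(f"{initial} of {total} occurrences are sentence-initial")
```

Running the same count over a learner corpus and a native speaker corpus of comparable size gives the kind of positional comparison described in this paragraph.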

This indicates that the learners’ sense of the functional significance and utility of modal adverbs in modern English needs to be corrected so that it is not dominated by the mother-tongue pattern. Contrastive analysis is crucial here because it enables us to target the problem areas more accurately and to the greater benefit of a given national group of learners. It may also lead to the production of more learner-aware teaching materials, designed either for learners of English in general or for the needs of specific national groups. Particularly interesting are the findings that relate to the overuse and underuse of language items in discourse. This information can then be presented in dictionaries specially meant for learners from a particular mother-tongue background.