Text vs. Corpus
(1) Literary analysis by computer studies a specific text, looking for lexical patterns that enhance the interpretation of the text. Computer literary analysis is in its early stages, primarily because the energies of interested researchers are exhausted in getting the literary texts marked up for analysis and put online. Nevertheless, some interesting work has been done. The following list is a sample:
Armstrong, Guyda. 1996. Machiavelli's Il Principe.
Lancashire, Ian. 1993. Computer-assisted critical analysis: Atwood's Handmaid's Tale. In George P. Landow and Paul Delany (eds.), The Digital Word: Text-based Computing in the Humanities. MIT Press.
Price, Kenneth, et al. 2001. Whitman's Manuscripts in Scholarly Context.
Smith, John B. 1980. Imagery and the Mind of Stephen Dedalus: A Computer-Assisted Study of Joyce's Portrait of the Artist as a Young Man. Lewisburg: Bucknell UP.
Tetreault, Ronald. 1997. Electrifying Wordsworth.
(2) Linguistic analysis by computer often involves the use of a corpus, a representative sampling of a textual or spoken genre. The Table of Contents from the Brown Corpus Manual gives an idea of what goes into making a representative sample.
Computer Literary Analysis
(3) Lexical patterns in a literary text can be used to support or refute a particular interpretation. Ian Lancashire (1993), in Computer-Assisted Critical Analysis: Atwood's Handmaid's Tale, uses such patterns to argue against a feminist interpretation of the character Offred in the novel. One such pattern involves the words that collocate with the word woman versus those that collocate with the word man. These are shown below.
| woman/women | man/men |
| pregnant 5/11 | humming 2/3 |
| deceived 2/3 | cruel 1/2 |
| desperate 2/3 | driven 2/4 |
| seated 2/4 | standing 2/2 |
| kneeling on the floor 2/2 | studies 2/4 |
| sitting 2/2 | wandering 1/2 |
| fur 2/4 | uniforms 4/9 |
| reproductive 3/4 | passports 2/5 |
| | loved 2/6 |
| | cars 3/10 |
| (frequency collocating/total frequency) |
According to Lancashire, the words that collocate with woman/women show "subservience and confinement . . . . In contrast, the words man and men collocate with terms of action and movement." (p. 305)
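Collocation counts of this kind can be approximated with a few lines of Python. The sketch below is only an illustration, not Lancashire's procedure: the function name `collocates`, the window of four words on either side, and the sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter

def collocates(text, nodes, window=4):
    """Count words that occur within `window` tokens of any node word.

    Returns (collocate counts, total counts for every word in the text).
    The window size is an assumption; choose it to suit the study.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    near = Counter()
    for i, tok in enumerate(tokens):
        if tok in nodes:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            near.update(left + right)
    return near, Counter(tokens)

# Invented sample text, used only to show the output format.
sample = ("The woman was seated by the window. A man was standing near the cars. "
          "Another woman was seated on the floor.")
near_woman, totals = collocates(sample, {"woman", "women"})

# Report each word as "frequency collocating/total frequency".
for word in ("seated", "standing"):
    print(f"{word} {near_woman[word]}/{totals[word]}")
```

A word that falls inside the windows of two different node occurrences is counted twice here, so a real study would need to decide how to treat overlapping windows.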
(4) Computer literary analysis can also track changes in an author's writing over time or even in a single literary work. For example, in Charles Dickens' A Christmas Carol, there are many references to death. However, if we split the text into 1000-word chunks, we can easily see that references to death are concentrated at the beginning of the story (where Marley’s status as a ghost is established) and again at the climax (where Scrooge looks upon his own death).
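The chunking idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the set `DEATH_WORDS` is a guess at what might count as a reference to death, the function name `count_per_chunk` is hypothetical, and the toy text with a chunk size of 6 stands in for the full novel with a chunk size of 1000.

```python
import re

def count_per_chunk(text, targets, chunk_size=1000):
    """Split the text into chunks of `chunk_size` words and count,
    for each chunk, how many words belong to the `targets` set."""
    tokens = re.findall(r"[a-z']+", text.lower())
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    return [sum(1 for tok in chunk if tok in targets) for chunk in chunks]

# Assumed word list; a real study would justify its own list.
DEATH_WORDS = {"dead", "death", "die", "died", "dying"}

# Toy text: a death word at the start and at the end, filler in between.
toy = "Marley was dead to begin with . " + "the clerk worked on " * 3 + "he saw his own death"
print(count_per_chunk(toy, DEATH_WORDS, chunk_size=6))  # [1, 0, 0, 1]
```

Plotting the returned list against chunk number shows where in the text the references cluster.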
(5) Relevant websites for Computer Literary Analysis (Humanities Computing):
Association for Computers and the Humanities (ACH)
Association for Linguistic and Literary Computing (ALLC)
The Text Encoding Initiative (TEI)
Applications¹
Applications in Lexicography
(A) Find new meanings of words by comparing corpora. Using the Hong Kong Web Concordancer, give a definition for the word virtual based on the Brown Corpus compiled in 1961. Then give a definition for virtual based on Computing Texts compiled in 1998. What was the meaning/context of this word in 1961? In 1998?
1961: ___________________________________________________________________________________________________
1998: ___________________________________________________________________________________________________
To write a specialized dictionary, you have to examine the texts that represent the specialized field. To determine what entries should go into the dictionary, you have to look at the distribution of words in the text by frequency. The most frequently occurring words will usually be words of the standard rather than the specialized language. Words that belong to the specialized vocabulary will occur less often.
(B) Generate a word frequency list. The columns below give a
partial frequency list for Charles Darwin's The Formation of Vegetable
Mould, which contains 60,567 words.
If you were writing a dictionary to aid a reader of Darwin's works,
what words would you include in the dictionary?
| 4627 the | 38 petioles | 1 wisseuschaft |
| 2771 of | 21 denudation | 1 wise |
| 1587 in | 19 concretions | 1 utricularia |
| 1513 and | 18 calciferous | 1 unsupported |
| 1260 a | 16 perrier | 1 terricolous |
| 1026 to | 15 calcareous | 1 terrestrial |
| 685 by | 12 zoolog | 1 schist |
| 608 that | 11 claparede | 1 safe |
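A frequency list like the one above can be produced with Python's standard library. The sketch below is one possible approach, not the tool used to build the Darwin list: the function name `frequency_list`, the tokenizing rule (runs of letters, lowercased), and the sample sentence are assumptions.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens).most_common()

# Invented sample sentence, standing in for the full text.
sample = "The worms drag the petioles of leaves into the burrows of the worms."
for word, freq in frequency_list(sample)[:3]:
    print(freq, word)
```

Different tokenizing rules (for example, how hyphens and apostrophes are handled) will give slightly different counts, so the rule should be reported along with the list.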
(C) Identify words that belong to the specialized language. You have probably chosen to exclude the very common words in the first column, but what other words would you exclude? If you are writing a specialized dictionary, you want to exclude words that belong to the standard language. To do this, you would have to compare the frequency list from Darwin with a frequency list from a representative sample of English, as contained in a corpus like the Brown Corpus of about 1,000,000 words. This course will teach you how to do such a comparison, but for the moment you can search for each of the words in the second and third columns above in a word frequency list extracted from the British National Corpus. Go to the list on the web, click on Edit, and go to "Find (on this page)". Then type in the word you are looking for. Does it occur in the standard English of the reference list? If it occurs, is it relatively less frequent there than in the Darwin text (compare each count with the size of its corpus, since the two differ greatly in length)? If you answered "No" to the first question or "Yes" to the second, the word should probably go into your specialized Darwin dictionary.
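The comparison can be automated once both frequency lists are available. The sketch below is a rough illustration only. The Darwin counts and text size come from the table above, but the reference counts, the reference size, the function name `specialized_candidates`, and the threshold of 10 are all invented for the example.

```python
def specialized_candidates(text_counts, text_size, ref_counts, ref_size, ratio=10.0):
    """Return words that are absent from the reference list, or whose
    relative frequency in the text is at least `ratio` times their
    relative frequency in the reference corpus."""
    picks = []
    for word, count in text_counts.items():
        text_rate = count / text_size
        ref_count = ref_counts.get(word, 0)
        if ref_count == 0 or text_rate / (ref_count / ref_size) >= ratio:
            picks.append(word)
    return picks

# Counts taken from the Darwin table above.
darwin = {"the": 4627, "petioles": 38, "denudation": 21, "safe": 1}
DARWIN_SIZE = 60567

# Hypothetical reference counts, NOT real corpus figures.
reference = {"the": 60000, "safe": 50, "denudation": 1}
REFERENCE_SIZE = 1_000_000

print(specialized_candidates(darwin, DARWIN_SIZE, reference, REFERENCE_SIZE))
# ['petioles', 'denudation']
```

Dividing each count by the size of its corpus is what makes the two lists comparable; raw counts from a 60,567-word text and a much larger corpus cannot be compared directly.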
(D) Write dictionary entries. Using the Collins CoBuild Concordancer, give a brief definition of the word cause.
- After "Type in your query", type cause/VERB
- Check off "American books, ephemera, and radio"
- Press "Show Concs"
_____________________________________________________________________________________________________________
_____________________________________________________________________________________________________________
(E) Look again at the citations for the verb cause. What semantic pattern characterizes the objects of cause? ___________________________
_____________________________________________________________________________________________________________
If you were writing an ESL dictionary, you might want to include this semantic pattern as a note to the dictionary user.
Applications in Language Teaching
(F) Compare lexical use over time. In the Montclair State ESL Writing Program, students are asked to write sample essays at the beginning and end of the semester on the same topic. With permission from the student authors, we store these essays on sapir in a database called MELD (Montclair Electronic Language Database). Below is the top of the word frequency list from the intake and exit essays for Summer 2000. Are there any differences between the two lists? Consider both the frequency of each word and its rank order in the list. Also, keep in mind that the August 3rd sample is longer.
| June 26, 2000 (3062 words in 17 essays) | August 3, 2000 (4783 words in 18 essays) |
| 134 to | 205 to |
| 121 I | 189 the |
| 120 the | 131 and |
| 79 and | 123 I |
| 55 have | 85 you |
| 55 at | 80 would |
| 48 home | 80 at |
| 46 a | 79 of |
| 45 of | 75 traditional |
| 43 traditional | 74 is |
| 43 they | 69 home |
| 43 that | 68 or |
| 42 is | 65 in |
| 40 in | 64 not |
| 38 you | 63 be |
| 37 study | 60 school |
| 35 can | 55 have |
| 34 or | 53 studying |
| 32 would | 50 schools |
| 32 be | 49 will |
| 32 are | 49 a |
| 30 technology | 42 that |
| 30 for | 42 for |
| 27 with | 40 more |
| 27 studying | 40 are |
| 26 students | 37 technology |
| 26 schools | 35 people |
| 23 computers | 35 because |
| 23 because | 34 study |
| 22 on | 32 with |
| 22 not | 32 can |
| 16 people | 32 by |
| | 30 my |
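Because the two samples differ in length, raw counts are easier to compare after normalizing. The sketch below converts a few of the counts from the table above into occurrences per 1000 words; the function name `per_thousand` and the choice of words are assumptions made for illustration.

```python
def per_thousand(count, total_words):
    """Express a raw count as occurrences per 1000 words of text."""
    return round(1000 * count / total_words, 1)

JUNE_TOTAL, AUGUST_TOTAL = 3062, 4783

# A few counts copied from the table above.
june = {"to": 134, "I": 121, "would": 32, "not": 22}
august = {"to": 205, "I": 123, "would": 80, "not": 64}

for word in june:
    print(word,
          per_thousand(june[word], JUNE_TOTAL),
          per_thousand(august[word], AUGUST_TOTAL))
```

The same normalization applies to any pair of samples of unequal size, including part-of-speech counts.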
(G) Compare POS usage over time. We can also tag the essays for part of speech information (although tagger accuracy goes down with error-prone writing). Does the table below show any changes in part of speech usage?
| June 26, 2000 (3062 words in 17 essays) | August 3, 2000 (4783 words in 18 essays) |
| 447 NN noun | 636 NN |
| 399 IN preposition | 589 IN |
| 282 PRP personal pronoun | 414 VB |
| 267 VB uninflected verb | 401 PRP |
| 245 DT determiner | 396 JJ |
| 222 JJ adjective | 341 DT |
| 164 VBP present tense verb | 223 CC |
| 155 NNS plural noun | 206 TO |
| 131 TO pre-verbal 'auxiliary' | 198 MD |
| 123 CC coordinating conjunction | 197 NNS |
| 109 RB adverb | 190 RB |
| 102 MD modal | 189 VBG |
| 100 VBG verb present participle | 169 VBP |
| 64 NNP proper noun | 166 NNP |
| 62 CD cardinal number | 158 CD |
| 56 VBZ 3rd person singular present tense | 95 VBZ |
(H) Track specific L2 errors. The MELD database is marked for error, so that we can analyze patterns of error. You will learn to search for and track errors. What pattern of error do the following data illustrate? How would the data have to be marked up to signal these errors?
. . . the traditional school systems is very necessary for educating a child.
the class teacher explain it in front of the class
the skills and knowledge that the student need for his career
if somebody ask me
talking to different people need different tactics
computers provides infinite information
its informations are also current
Applications in Translation
(I) Disambiguate word senses. The word sentence in English can refer to a judicial phenomenon or a grammatical one. If the former, then the French equivalent of sentence is peine; if the latter, it is phrase. To avoid consigning a criminal to a phrase of five years, the translator (or the machine translation system) has to be aware of the context of the word sentence, which determines which French equivalent to use. Computational tools allow us to model the contexts of sentence that disambiguate it. To get some sense of how this works,
- Go to the Hong Kong Web Concordancer.
- Type sentence in the empty box
- Select corpus: The Times January 1995
- Sort type: sort left
- Click on "Search for concordances"
List some words surrounding sentence that clearly mark it as equivalent to peine: _____________ _____________ _____________
List some words that clearly mark it as equivalent to phrase: _____________ _____________ _____________ _____________
This concordancer gives you a window of seven words on either side of the chosen word. Is this always enough to disambiguate the meaning of sentence? ________ If not, you can click on any highlighted instance of sentence and get a larger context.
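A concordance display of this kind (often called KWIC, for "key word in context") can be sketched in Python. The code below is not the Hong Kong Web Concordancer's implementation. The function name `kwic`, the tokenizing rule, and the sample text are assumptions; the seven-word window and the "sort left" ordering follow the description above.

```python
import re

def kwic(text, node, window=7):
    """Return (left context, node, right context) for each occurrence of
    `node`, with up to `window` words on either side, sorted by the word
    immediately to the left of the node ("sort left")."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    # Sort on the last word of the left context; empty contexts sort first.
    return sorted(lines, key=lambda l: l[0].split()[-1].lower() if l[0] else "")

# Invented sample text containing both senses of "sentence".
sample = ("The judge passed a sentence of five years. "
          "Each sentence in the essay needs a verb.")
for left, node, right in kwic(sample, "sentence"):
    print(f"{left:>40} [{node}] {right}")
```

Raising `window` plays the same role as clicking on a highlighted instance to get a larger context.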
Now go to the Collins CoBuild Concordancer to determine whether any of the collocates of sentence that disambiguate it in The Times collocate with any statistical significance in the Collins corpus.
- Scroll down to CobuildDirect Collocation Sampler
- Type in the word sentence
- Select Mutual Information (MI) score as the significance score to be calculated.²
Do any of the words that seemed to disambiguate sentence in The Times appear on the Collins MI list? If so, which ones?
_______________________________________________________________________________________________________
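To see how a mutual information score behaves, it can be computed by hand from raw counts. The sketch below is an assumption-laden illustration, not the CobuildDirect calculation: it estimates each probability as a count divided by corpus size, uses a base-2 logarithm (a common choice, assumed here), and the counts in the example are invented.

```python
import math

def mutual_information(pair_count, x_count, y_count, corpus_size):
    """MI = log2( P(x,y) / (P(x) * P(y)) ), with each probability
    estimated as a raw count divided by the corpus size."""
    p_xy = pair_count / corpus_size
    p_x = x_count / corpus_size
    p_y = y_count / corpus_size
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: in a 1,000,000-word corpus, word x occurs 200 times,
# word y occurs 100 times, and the two occur near each other 8 times.
print(round(mutual_information(8, 200, 100, 1_000_000), 2))  # 8.64
```

A score near zero means the two words co-occur about as often as chance would predict; with very low counts the score becomes unreliable, which is one reason concordancers usually report it alongside raw frequencies.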
(J) Translate sentences à la AltaVista's Babel Fish.
Applications in Language Arts Teaching/Publishing
(K) Determine the ratio of Latinate to Anglo-Saxon morphemes in English.
There are two classes of affixes in English (Siegel, 1979). Basically,
the Class I affixes align with Latinate words while the Class II affixes
align with words of Anglo-Saxon origin. The Class I affixes cause phonological
change in the root word; for example, the stress placement for cónsequence
is different from that of consequéntial.
The Class II affixes do not cause phonological change; for example, géntleman
and géntlemanly have the same stress placement.
Which suffixes occur more frequently in English, the Anglo-Saxon suffixes
or the Latinate suffixes? From a sample list of English words, the following
lists can be retrieved to answer this question.
| -ial | -ly |
| actuarial | bimonthly |
| adverbial | burly |
| aerial | cuddly |
| alluvial | drizzly |
| ambrosial | friendly |
| arterial | gentlemanly |
| artificial | ghastly |
| asocial | ghostly |
| axial | grisly |
| bacterial | grizzly |
| baronial | hilly |
| beneficial | jowly |
| bestial | mealy |
| biaxial | melancholy |
| bicentennial | niggardly |
| biennial | northerly |
| bilabial | orderly |
| binomial | otherworldly |
| biracial | puddly |
| bronchial | scraggly |
| celestial | silly |
| censorial | snuffly |
| centennial | sniffly |
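Lists like these can be retrieved from any word list by matching on the ending of each word. The sketch below is a minimal illustration; the function name `words_with_suffix` and the sample word list are assumptions.

```python
def words_with_suffix(words, suffix):
    """Return, in alphabetical order, the words that end in `suffix`
    (excluding a word that consists of the suffix alone)."""
    return sorted(w for w in words if w.endswith(suffix) and len(w) > len(suffix))

# Invented sample word list.
sample = ["aerial", "friendly", "axial", "worm", "ghostly", "silly", "trial"]

ial_words = words_with_suffix(sample, "ial")
ly_words = words_with_suffix(sample, "ly")
print(ial_words)  # ['aerial', 'axial', 'trial']
print(ly_words)   # ['friendly', 'ghostly', 'silly']
print(len(ial_words) / len(ly_words))  # ratio of -ial words to -ly words
```

Matching on spelling alone over-generates: words such as silly or melancholy end in the letters -ly without containing the suffix, so the retrieved lists need to be checked by hand before the counts are compared.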
(L) Determine the usage of Latinate forms in children's literature
in English. In (K) we introduced the notion of Class I and Class II
morphemes. Because Class II affixes don’t involve phonological change,
we might predict that they would be easier to learn. This might also be
true because of their Anglo-Saxon roots. We can look for support for this
prediction by comparing children’s texts with adult texts with respect
to the distribution of Latinate and Anglo-Saxon affixes.
Off-the-Shelf Software
Many of the tasks that we have examined above can be done with existing software, and it helps to have some familiarity with what is out there. Catherine Ball's Tutorial on Corpora and Concordances gives a good introduction to what is available. Reading this tutorial is today's homework. (You are not required to enter your email address to use the tutorial.)
To give you a feel for using off-the-shelf software, we will look at some of the features of TACT, Text Analysis Computing Tools, a package of concordancing tools developed at the University of Toronto.
TACT is installed on the machines in DI282. Follow this path to
find it:
C:/apps/course.rel/TACT/text
Follow these steps to see the basic functionality of TACT.
1. Move volpone.tdb into the TACT folder.
2. Move volpone.mks into the TACT folder.
3. In the TACT folder, click on Usebase.exe.
4. Type in "volpone" at the prompt.
5. Wait until you see a Usebase window. This window is controlled with the Spacebar, the arrow keys, and the Enter key.
6. Using the Spacebar, get to Select.
7. Under Select, choose Selected word list.
8. Type in "act", or some other word you would like to track.
9. To get out of the word list, hit the Enter key.
10. Under Displays, look at KWIC, Variable context, Text, Distribution, and Collocates.
¹ Results of the sample applications given here are illustrative only. The numbers do not represent statistical significance.
² Mutual information compares the probability (P) of two words occurring in close proximity with the product of the probabilities of each word occurring separately: log( P(x,y) / (P(x) P(y)) ). A high positive number indicates that the two words are closely connected.