Text Analysis Tools: Worksheet 1
 
 

Text vs. Corpus

(1) Literary analysis by computer studies a specific text, looking for lexical patterns that enhance the interpretation of the text. Computer literary analysis is in its early stages, primarily because the energies of interested researchers are exhausted in getting literary texts marked up for analysis and put online. Nevertheless, some interesting work has been done. The following list is a sample:

Armstrong, Guyda. 1996. Machiavelli's Il Principe.

Lancashire, Ian. 1993. Computer-assisted critical analysis: Atwood's Handmaid's Tale. In George P. Landow and Paul Delany (eds.), The Digital Word: Text-based Computing in the Humanities. MIT Press.

Price, Kenneth, et al. 2001. Whitman's Manuscripts in Scholarly Context

Smith, John B. 1980. Imagery and the Mind of Stephen Dedalus: A Computer-Assisted Study of Joyce's Portrait of the Artist as a Young Man. Lewisburg: Bucknell UP.
 

Tetreault, Ronald. 1997. Electrifying Wordsworth.

(2) Linguistic analysis by computer often involves the use of a corpus, a representative sampling of written or spoken genres. The Table of Contents from the Brown Corpus Manual gives an idea of what goes into making a representative sample.



Computer Literary Analysis

(3) Lexical patterns in a literary text can be used to support or refute a particular interpretation. Ian Lancashire (1993), in Computer-Assisted Critical Analysis: Atwood's Handmaid's Tale, uses such patterns to argue against a feminist interpretation of the character Offred in the novel. One such pattern involves the words that collocate with the word woman versus those that collocate with the word man. These are shown below.

woman/women                   man/men
pregnant 5/11                 cruel 1/2
deceived 2/3                  driven 2/4
desperate 2/3                 standing 2/2
seated 2/4                    studies 2/4
kneeling on the floor 2/2     wandering 1/2
sitting 2/2                   uniforms 4/9
fur 2/4                       passports 2/5
reproductive 3/4              loved 2/6
humming 2/3                   cars 3/10

(frequency collocating/total frequency)

 

According to Lancashire, the words that collocate with woman/women show "subservience and confinement . . . . In contrast, the words man and men collocate with terms of action and movement." (p. 305)
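The "frequency collocating/total frequency" figures can be approximated with a few lines of Python. The following is a minimal sketch, not Lancashire's actual procedure: the five-word window, the simple tokenizer, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def collocation_ratios(tokens, node_words, window=5):
    """Return {word: (collocating frequency, total frequency)}.

    A token counts as collocating if it lies within `window` words of any
    node word. The window size is an assumption; the worksheet does not
    say which span Lancashire used.
    """
    total = Counter(tokens)
    near = Counter()
    counted = set()  # token positions already counted, so no token is counted twice
    for i, tok in enumerate(tokens):
        if tok not in node_words:
            continue
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i and j not in counted and tokens[j] not in node_words:
                counted.add(j)
                near[tokens[j]] += 1
    return {w: (near[w], total[w]) for w in near}

sample = "The woman was seated. A man was seated in the car. The woman sat seated."
ratios = collocation_ratios(tokenize(sample), {"woman", "women"}, window=2)
print(ratios["seated"])
```

Running the same function with {"man", "men"} as the node words gives the second column, and the two result sets can then be compared by hand.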

(4) Computer literary analysis can also track changes in an author's writing over time or even in a single literary work. For example, in Charles Dickens' A Christmas Carol, there are many references to death. However, if we split the text into 1000-word chunks, we can easily see that references to death are concentrated at the beginning of the story (where Marley’s status as a ghost is established) and again at the climax (where Scrooge looks upon his own death).

Words        References to death
1-1000       10
1001-2000    5
2001-3000    0
3001-4000    1
4001-5000    6
5001-6000    8
7001-7599    0
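
The chunking described in (4) can be sketched in Python as follows. This is an illustrative sketch only: the function name, the tokenizer, and the set of "death" word forms are assumptions, since the worksheet does not say which forms were counted.

```python
import re

def count_by_chunk(text, targets, chunk_size=1000):
    """Split text into chunks of chunk_size words and count target words in each.

    Returns a list of (word range, count) pairs, e.g. ("1-1000", 10).
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    results = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        label = f"{start + 1}-{start + len(chunk)}"
        results.append((label, sum(1 for t in chunk if t in targets)))
    return results

# Hypothetical set of word forms treated as references to death.
DEATH_WORDS = {"death", "dead", "die", "died", "dying"}

# Tiny demonstration with a chunk size of 5 words instead of 1000.
demo = "Marley was dead to begin with there is no doubt whatever about that"
for label, count in count_by_chunk(demo, DEATH_WORDS, chunk_size=5):
    print(label, count)
```

With the full text of a novel loaded into a string, the same call with the default chunk size produces one row per 1000 words.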

 

(5) Relevant websites for Computer Literary Analysis (Humanities Computing):
Association for Computers and the Humanities (ACH)
Association for Linguistic and Literary Computing (ALLC)
The Text Encoding Initiative (TEI)

Applications [1]

Applications in Lexicography

(A) Find new meanings of words by comparing corpora. Using the Hong Kong Web Concordancer, give a definition for the word virtual based on the Brown corpus compiled in 1961. Then give a definition for virtual based on Computing Texts compiled in 1998. What was the meaning/context of this word in 1961? In 1998?

1961: ___________________________________________________________________________________________________

1998: ___________________________________________________________________________________________________
 
 

To write a specialized dictionary, you have to examine texts that represent the specialized field. To determine which entries should go into the dictionary, look at the distribution of words in those texts by frequency. The most frequent words will usually belong to the standard language rather than the specialized one; words that belong to the specialized vocabulary will occur less often.

(B) Generate a word frequency list. The columns below give a partial frequency list for Charles Darwin's The Formation of Vegetable Mould, which contains 60,567 words.
If you were writing a dictionary to aid a reader of Darwin's works, what words would you include in the dictionary?
 
4627 the      38 petioles       1 wisseuschaft
2771 of       21 denudation     1 wise
1587 in       19 concretions    1 utricularia
1513 and      18 calciferous    1 unsupported
1260 a        16 perrier        1 terricolous
1026 to       15 calcareous     1 terrestrial
 685 by       12 zoolog         1 schist
 608 that     11 claparede      1 safe
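
A frequency list of this kind can be generated with Python's standard library. The sketch below is one possible approach, not the tool used to produce the list above; the function name and the letters-only tokenizer are assumptions.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted(Counter(tokens).items(), key=lambda pair: (-pair[1], pair[0]))

demo = "The worm and the mould and the worm"
for word, count in frequency_list(demo):
    print(count, word)
```

Different tokenizers make different decisions about hyphens, apostrophes, and abbreviations such as "zoolog", so counts from two tools will rarely match exactly.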

(C) Identify words that belong to the specialized language. You have probably chosen to exclude the very common words in the first column, but what other words would you exclude? If you are writing a specialized dictionary, you want to exclude words that belong to the standard language. To do this, you would have to compare the frequency list from Darwin with a frequency list from a representative sample of English, as contained in a corpus like the Brown Corpus of about 1,000,000 words. This course will teach you how to do such a comparison, but for the moment you can search for each of the words in the second and third columns above in a word frequency list extracted from the British National Corpus. Go to the list on the web, click on Edit, and go to "Find (on this page)". Then type in the word you are looking for. Does it occur in the standard English of the reference corpus? If it occurs, is it less frequent there, relative to the size of the corpus, than it is in the Darwin text? If you answered "No" to the first question or "Yes" to the second, the word should probably go into your specialized Darwin dictionary.
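
Because the Darwin text and a reference corpus differ greatly in size, raw counts have to be converted to rates before they can be compared. The sketch below shows one way to do this. The function name and the threshold of 2.0 are assumptions, and the reference counts are invented placeholders, not real corpus figures.

```python
def specialized_candidates(text_counts, text_size, ref_counts, ref_size, min_ratio=2.0):
    """Return words that are absent from the reference corpus, or at least
    min_ratio times more frequent (per million words) in the text than in
    the reference corpus."""
    candidates = []
    for word, count in text_counts.items():
        text_rate = count / text_size * 1_000_000
        ref_count = ref_counts.get(word, 0)
        if ref_count == 0:
            candidates.append(word)  # never occurs in the standard-language sample
            continue
        ref_rate = ref_count / ref_size * 1_000_000
        if text_rate / ref_rate >= min_ratio:
            candidates.append(word)
    return candidates

# Darwin counts are taken from the list in (B); the text has 60,567 words.
darwin = {"the": 4627, "petioles": 38, "safe": 1}

# PLACEHOLDER reference counts for a one-million-word corpus, made up for
# this example only. Replace them with counts from a real frequency list.
reference = {"the": 60000, "safe": 60}

print(specialized_candidates(darwin, 60567, reference, 1_000_000))
```

The threshold is a judgment call: a lower value admits more borderline words, a higher value keeps only the most distinctive vocabulary.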
 

(D) Write dictionary entries. Using the Collins CoBuild Concordancer, give a brief definition of the word cause.

- After "Type in your query", type cause/VERB
- Check off "American books, ephemera, and radio"
- Press "Show Concs"

_____________________________________________________________________________________________________________

_____________________________________________________________________________________________________________
 

(E) Look again at the citations for the verb cause. What semantic pattern characterizes the objects of cause? ___________________________

_____________________________________________________________________________________________________________

If you were writing an ESL dictionary, you might want to include this semantic pattern as a note to the dictionary user.


Applications in Language Teaching

(F) Compare lexical use over time. In the Montclair State ESL Writing Program, students are asked to write sample essays at the beginning and end of the semester on the same topic. With permission from the student authors, we store these essays on sapir in a database called MELD (Montclair Electronic Language Database). Below is the top of the word frequency list from the intake and exit essays for Summer 2000. Are there any differences between the two lists? Consider both the frequency of each word and its rank order in the list. Also, keep in mind that the August 3rd sample is longer.

June 26, 2000                 August 3, 2000
3062 words in 17 essays       4783 words in 18 essays

134 to                        205 to
121 I                         189 the
120 the                       131 and
 79 and                       123 I
 55 have                       85 you
 55 at                         80 would
 48 home                       80 at
 46 a                          79 of
 45 of                         75 traditional
 43 traditional                74 is
 43 they                       69 home
 43 that                       68 or
 42 is                         65 in
 40 in                         64 not
 38 you                        63 be
 37 study                      60 school
 35 can                        55 have
 34 or                         53 studying
 32 would                      50 schools
 32 be                         49 will
 32 are                        49 a
 30 technology                 42 that
 30 for                        42 for
 27 with                       40 more
 27 studying                   40 are
 26 students                   37 technology
 26 schools                    35 people
 23 computers                  35 because
 23 because                    34 study
 22 on                         32 with
 22 not                        32 can
 ...                           32 by
 16 people                     30 my
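
Since the August sample is longer, raw counts are not directly comparable. One common remedy, sketched below, is to convert each count to a rate per 1,000 words. The counts are copied from the two lists above; the function and variable names are assumptions for illustration.

```python
JUNE_TOTAL, AUGUST_TOTAL = 3062, 4783

# Counts copied from the intake (June) and exit (August) frequency lists.
june = {"I": 121, "you": 38, "would": 32, "people": 16}
august = {"I": 123, "you": 85, "would": 80, "people": 35}

def per_thousand(count, total_words):
    """Occurrences per 1,000 words of running text."""
    return count / total_words * 1000

for word in june:
    before = per_thousand(june[word], JUNE_TOTAL)
    after = per_thousand(august[word], AUGUST_TOTAL)
    print(f"{word:8} {before:6.1f} {after:6.1f}")
```

A word whose raw count barely changes, such as I, can turn out to have dropped considerably once sample length is taken into account.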

 

(G) Compare POS usage over time. We can also tag the essays for part of speech information (although tagger accuracy goes down with error-prone writing). Does the table below show any changes in part of speech usage?

June 26, 2000                                        August 3, 2000
3062 words in 17 essays                              4783 words in 18 essays

447 NN   noun                                        636 NN
399 IN   preposition                                 589 IN
282 PRP  personal pronoun                            414 VB
267 VB   uninflected verb                            401 PRP
245 DT   determiner                                  396 JJ
222 JJ   adjective                                   341 DT
164 VBP  present tense verb (not 3rd person sing.)   223 CC
155 NNS  plural noun                                 206 TO
131 TO   pre-verbal 'auxiliary'                      198 MD
123 CC   coordinating conjunction                    197 NNS
109 RB   adverb                                      190 RB
102 MD   modal                                       189 VBG
100 VBG  verb, present participle                    169 VBP
 64 NNP  proper noun                                 166 NNP
 62 CD   cardinal number                             158 CD
 56 VBZ  3rd person singular present tense verb       95 VBZ

 

(H) Track specific L2 errors. The MELD database is marked for error, so that we can analyze patterns of error. You will learn to search for and track errors. What pattern of error do the following data illustrate? How would the data have to be marked up to signal these errors?

. . . the traditional school systems is very necessary for educating a child.
the class teacher explain it in front of the class
the skills and knowledge that the student need for his career
if somebody ask me
talking to different people need different tactics
computers provides infinite information
its informations are also current


Applications in Translation

(I) Disambiguate word senses. The word sentence in English can refer to a judicial phenomenon or a grammatical one. If the former, the French equivalent of sentence is peine; if the latter, it is phrase. To avoid consigning a criminal to a phrase of five years, the translator (or the machine translation system) has to be aware of the context of the word sentence, which determines which French equivalent to use. Computational tools allow us to model the contexts of sentence that disambiguate it. To get some sense of how this works:
- Go to the Hong Kong Web Concordancer.
- Type sentence in the empty box
- Select corpus: The Times January 1995
- Sort type: sort left
- Click on "Search for concordances"

List some words surrounding sentence that clearly mark it as equivalent to peine: _____________ _____________ _____________

List some words that clearly mark it as equivalent to phrase: _____________ _____________ _____________ _____________

This concordancer gives you a window of seven words on either side of the chosen word. Is this always enough to disambiguate the meaning of sentence? ________ If not, you can click on any highlighted instance of sentence and get a larger context.
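
The concordance display used here, a keyword in context (KWIC) with a fixed window and a "sort left" option, can be sketched in Python. This is a minimal illustration and not the Hong Kong Web Concordancer's implementation; the function name and tokenizer are assumptions.

```python
import re

def kwic(text, node, window=7):
    """Return (left context, keyword, right context) triples for each hit,
    sorted by the word immediately to the left of the keyword ("sort left")."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, tok, right))
    hits.sort(key=lambda hit: hit[0][-1].lower() if hit[0] else "")
    return hits

demo = "The judge passed a prison sentence of five years. Every sentence needs a verb."
for left, keyword, right in kwic(demo, "sentence", window=3):
    print(f"{' '.join(left):>25}  {keyword}  {' '.join(right)}")
```

Sorting on the left neighbour groups together contexts such as "prison sentence" and "life sentence", which makes disambiguating collocates easier to spot.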

Now go to the Collins CoBuild Concordancer to determine whether any of the collocates of sentence that disambiguate it in The Times collocate with any statistical significance in the Collins corpus.
- Scroll down to CobuildDirect Collocation Sampler
- Type in the word sentence
- Select Mutual Information (MI) score as the significance score to be calculated. [2]

Do any of the words that seemed to disambiguate sentence in The Times appear on the Collins MI list? If so, which ones?

_______________________________________________________________________________________________________
 
 

(J) Translate sentences à la AltaVista's Babel Fish.



Applications in Language Arts Teaching/Publishing

(K) Determine the ratio of Latinate to Anglo-Saxon morphemes in English. There are two classes of affixes in English (Siegel, 1979). Basically, the Class I affixes align with Latinate words while the Class II affixes align with words of Anglo-Saxon origin. The Class I affixes cause phonological change in the root word; for example, the stress placement of cónsequence is different from that of consequéntial. The Class II affixes do not cause phonological change; for example, géntleman and géntlemanly have the same stress placement.
Which suffixes occur more frequently in English, the Anglo-Saxon suffixes or the Latinate suffixes? From a sample list of English words, the following lists can be retrieved to answer this question.

-ial            -ly
actuarial       bimonthly
adverbial       burly
aerial          cuddly
alluvial        drizzly
ambrosial       friendly
arterial        gentlemanly
artificial      ghastly
asocial         ghostly
axial           grisly
bacterial       grizzly
baronial        hilly
beneficial      jowly
bestial         mealy
biaxial         melancholy
bicentennial    niggardly
biennial        northerly
bilabial        orderly
binomial        otherworldly
biracial        puddly
bronchial       scraggly
celestial       silly
censorial       sniffly
centennial      snuffly
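
Lists like these can be retrieved from a word list by matching word endings. The sketch below uses a hypothetical function name and a tiny sample list; it matches spelling only, which is why a word such as melancholy, whose final -ly is not a suffix, ends up in the -ly column.

```python
def words_with_ending(words, ending):
    """Return the words that end in the given letters, in alphabetical order.

    This is a purely orthographic match: it cannot tell a real suffix
    from a word that merely happens to end in the same letters.
    """
    return sorted(w for w in words if w.lower().endswith(ending) and len(w) > len(ending))

sample = ["aerial", "axial", "burly", "melancholy", "friendly", "worm"]
print(words_with_ending(sample, "ial"))
print(words_with_ending(sample, "ly"))
```

Comparing the lengths of the two result lists over a large word list gives a first, rough answer to the question in (K), subject to hand-checking the false matches.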

 
 
 

(L) Determine the usage of Latinate forms in children's literature in English. In (K) we introduced the notion of Class I and Class II morphemes. Because Class II affixes don’t involve phonological change, we might predict that they would be easier to learn. This might also be true because of their Anglo-Saxon roots. We can look for support for this prediction by comparing children’s texts with adult texts with respect to the distribution of Latinate and Anglo-Saxon affixes.
 

Off-the-Shelf Software

Many of the tasks that we have examined above can be done with existing software, and it helps to have some familiarity with what is out there. Catherine Ball's Tutorial on Corpora and Concordances gives a good introduction to what is available. Reading this tutorial is today's homework. (You are not required to enter your email address to use the tutorial.)

To give you a feel for using off-the-shelf software, we will look at some of the features of TACT, Text Analysis Computing Tools, a package of concordancing tools developed at the University of Toronto.

TACT is installed on the machines in DI282.  Follow this path to find it:
    C:/apps/course.rel/TACT/text

Follow these steps to see the basic functionality of TACT.

1.  Move volpone.tdb into the TACT folder.
2.  Move volpone.mks into the TACT folder.
3.  In the TACT folder, click on Usebase.exe.
4.  Type in "volpone" at the prompt.
5.  Wait until you see a Usebase window.  This window is controlled with the Spacebar,
     the arrow keys, and the Enter key.
6.  Using the Spacebar, get to Select.
7.  Under Select, choose Selected word list.
8.  Type in "act", or some other word you would like to track.
9.  To get out of the word list, hit the Enter key.
10.  Under Displays, look at KWIC, Variable context, Text, Distribution, and Collocates.
 
 

[1] Results of the sample applications given here are illustrative only. The numbers do not represent statistical significance.

[2] Mutual information compares the probability (P) of two words occurring in close proximity with the product of the probabilities of each word occurring separately: MI(x, y) = log [ P(x, y) / (P(x) P(y)) ]. A high positive number indicates that the two words are closely connected.
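
The mutual information formula can be computed directly from corpus counts. The following is a minimal sketch under stated assumptions: probabilities are estimated as simple relative frequencies, the logarithm is taken to base 2 (the formula above does not specify a base), and the function name and example counts are invented for illustration.

```python
import math

def mutual_information(pair_count, x_count, y_count, corpus_size):
    """MI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with each probability
    estimated as a count divided by the corpus size."""
    p_xy = pair_count / corpus_size
    p_x = x_count / corpus_size
    p_y = y_count / corpus_size
    return math.log2(p_xy / (p_x * p_y))

# Invented counts: x and y each occur 100 times in 100,000 words
# and occur near each other 10 times.
print(mutual_information(10, 100, 100, 100_000))
```

A score near zero means the two words co-occur about as often as chance predicts; pairs that never co-occur have no defined score, because the logarithm of zero is undefined.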
 
 
 
 
 
 
 
 
 
 
 
 
