Uses hand-created social networks to compare the tragedies of Hamlet, Macbeth, and King Lear, and to compare English and Chinese literature as reflected in the novelistic structures of Charles Dickens’ Our Mutual Friend and Cao Xueqin’s The Story of the Stone.
Edges exist between nodes when characters directly speak to each other.
Also uses hand-created social network maps. Focuses on comparing the plays’ social networks to real-life social networks, according to the following features and methodology:
“The analysis of each play was carried out using the printed text, by tabulating the speaking characters present on the stage at each time slice through the play. A new time slice was deemed to begin whenever a character was stated or could be inferred to have left the stage. The scene size is the number of speaking characters present during the time slice.
The network structure calculations were obtained by treating each speaking character as a node, and deeming two characters to be linked if there was at least one time slice of the play in which both were present.
We calculated the connectance (C) of the network of each play. This is the proportion of possible links between characters that are in fact realized, and in a network containing S nodes and L observed links C is given by L/S² (Dunne et al. 2002; Williams et al. 2002). This parameter ranges from 0 for a group of completely unlinked nodes to 1 for a fully connected set in which each character interacts with each other character.
We also calculated the characteristic path length (D), or "degrees of separation." For each pair of characters, the number of links in the shortest possible route connecting the two is found. D is the mean of these path lengths for all pairs of characters in the network (for a formal statement, see Montoya and Sole 2002:406).
Finally, we calculated the cluster coefficient (T) for each play. This coefficient is the probability that two nodes each linked to a third will also be linked themselves (for a formal definition, see Montoya and Sole 2002:406). This parameter reflects the extent to which the network is subdivided into densely interconnected sub-parts. In a randomly connected network, the cluster coefficient (T) is equal to the connectance (C).”
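The three metrics described in the quoted passage can be sketched on a toy network. The four-character cast, the edge list, and the convention of averaging the cluster coefficient over nodes with at least two neighbors are illustrative assumptions, not details from the study:

```python
from collections import deque
from itertools import combinations

def build_adjacency(nodes, edges):
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def connectance(nodes, edges):
    # C = L / S^2: realized links over possible links.
    return len(edges) / len(nodes) ** 2

def characteristic_path_length(nodes, edges):
    # D: mean shortest path length over all connected pairs (BFS).
    adj = build_adjacency(nodes, edges)
    lengths = []
    for src, dst in combinations(nodes, 2):
        seen, queue = {src}, deque([(src, 0)])
        while queue:
            node, dist = queue.popleft()
            if node == dst:
                lengths.append(dist)
                break
            for nb in adj[node] - seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return sum(lengths) / len(lengths)

def cluster_coefficient(nodes, edges):
    # T: probability that two neighbors of a node are themselves linked,
    # averaged over nodes with at least two neighbors (one common convention).
    adj = build_adjacency(nodes, edges)
    per_node = []
    for n in nodes:
        nbrs = adj[n]
        if len(nbrs) < 2:
            continue
        pairs = list(combinations(nbrs, 2))
        linked = sum(1 for a, b in pairs if b in adj[a])
        per_node.append(linked / len(pairs))
    return sum(per_node) / len(per_node)

# Hypothetical cast: four speaking characters, four co-presence links.
nodes = ["Duke", "Countess", "Fool", "Ghost"]
edges = [("Duke", "Countess"), ("Duke", "Fool"),
         ("Countess", "Fool"), ("Fool", "Ghost")]

print(connectance(nodes, edges))               # 4 / 16 = 0.25
print(characteristic_path_length(nodes, edges))  # 8 / 6 ≈ 1.333
print(cluster_coefficient(nodes, edges))         # 7 / 9 ≈ 0.778
```

Note that in this toy network T (≈0.78) far exceeds C (0.25), the signature of clustering the authors contrast with a randomly connected network, where T would equal C.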
Uses combinatorial ngrams to find the sources of quotations:
“The methods used within the present study may be best illustrated by comparison with previous work in the field of historical text reuse. In their research on intertextuality in classical texts, Neil Coffee et al. use a sliding window technique to find passages wherein two texts share at least two words, then rank the matching window results according to the proximity and rarity of the words shared in the two windows. Jean-Gabriel Ganascia et al. also use a sliding window technique and score matches based on word rarity, though they allow for missing words in their sliding windows such that two windows of five words can count as a match if they share three words in common. David Smith et al. use a sliding n-gram window to establish candidate pairs, leverage the Smith-Waterman algorithm to align matching sequences in documents, and finally sort documents with matching windows according to the frequency of their common n-grams. Similarly, Constance Mews et al. also use a sequence alignment technique that matches strings based on identical words. Moving away from a sliding window technique, Glenn Roe et al. measure the cosine distance between passages in a term document matrix, while David Bamman et al. use Moore's Bilingual Sentence Aligner and a translation probability table generated from MGIZA++ to identify cross-lingual instances of intertextuality.
The present study extends the candidate retrieval step of Jean-Gabriel Ganascia et al., and implements a minimal probabilistic model to remove high probability ngrams from the database in order to reduce both storage requirements and processing time. The first step of the method used within the present study is to preprocess each file in a corpus, transform each file into an array of sentences using the Punkt sentence boundary detector, drop stopwords, remove non-token punctuation, and lowercase all text. Because orthography was non-standardized in the eighteenth century, the lookup table from MorphAdorner is then used to regularize the spelling of each word in the text, using a simple hash table replacement. Finally, the WordNet Lemmatizer is used to transform each word into its lemma form.
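The preprocessing pipeline just quoted can be approximated in a few lines. This sketch substitutes a naive regex splitter for the Punkt detector and omits the MorphAdorner spelling table and WordNet lemmatization (marked where they would apply); the stopword list is an illustrative subset, not the study's:

```python
import re

# Illustrative stopword subset; the study's actual list is not given here.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "but", "her", "his", "my"}

def preprocess(text):
    # Naive sentence split stands in for the Punkt sentence boundary detector.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    cleaned = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())   # lowercase, drop punctuation
        tokens = [t for t in tokens if t not in STOPWORDS]  # drop stopwords
        # Here the lookup-table spelling regularization and WordNet
        # lemmatization would be applied per token.
        cleaned.append(tokens)
    return cleaned

print(preprocess("All saw her spots, but few her brightness took."))
# → [['all', 'saw', 'spots', 'few', 'brightness', 'took']]
```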
Once the texts are all represented as lists of clean and normalized sentences, the next step of analysis is to find sentences with unusually similar language. To accomplish this, a sliding window of length w is slid over each sentence in the corpus such that for each sentence, the window first contains words 0 through w-1 from the sentence, then 1 through w, then 2 through w+1, and so on. For each of these windows, a list of all possible combinations of c words from the window is generated. Suppose c is 3, and the algorithm is considering the following sentence: All saw her spots but few her brightness took.
Using this window of text, the list of "combinatorial ngrams" produced includes "all saw her", "all saw spots", "all saw but" . . . "her brightness took". In order to minimize the storage requirements and maximize the utility of these ngrams, the algorithm next estimates the rarity of each ngram by calculating the product of the relative frequency of each word in the ngram. The rarity of the ngram "spots few brightness", for instance, is calculated as p(spots) * p(few) * p(brightness). Because ngrams with high probability are much more likely to be found often in a corpus and are therefore much less useful for detecting textual reuse, they are dropped from the list, and the remaining ngrams are stored in a database.
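The windowed combination and rarity-filtering steps described above can be sketched as follows. The window length w = 7 and the probability threshold are illustrative assumptions (the study does not fix them in this excerpt), and the tiny corpus exists only to exercise the filter:

```python
from collections import Counter
from itertools import combinations

def windowed_combinations(tokens, w=7, c=3):
    # Slide a window of length w over the sentence and emit every
    # order-preserving combination of c words from each window.
    grams = set()
    for start in range(max(1, len(tokens) - w + 1)):
        grams.update(combinations(tokens[start:start + w], c))
    return grams

def rare_ngrams(sentences, w=7, c=3, threshold=0.005):
    # Rarity of an ngram = product of the relative corpus frequencies of
    # its words; high-probability ngrams are dropped before storage.
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    freq = {tok: n / total for tok, n in counts.items()}
    index = {}
    for i, sentence in enumerate(sentences):
        for gram in windowed_combinations(sentence, w, c):
            rarity = 1.0
            for tok in gram:
                rarity *= freq[tok]
            if rarity < threshold:                # keep only rare ngrams
                index.setdefault(gram, set()).add(i)
    return index

sentences = [
    ["all", "saw", "spots", "few", "brightness", "took"],
    ["all", "find", "spots", "few", "observe", "brightness"],
    ["love", "love", "love"],                     # a deliberately common word
]
index = rare_ngrams(sentences)
print(("spots", "few", "brightness") in index)    # True: rare, kept
print(("love", "love", "love") in index)          # False: too probable, dropped
```

The `index` maps each surviving ngram to the sentences containing it, which is the lookup structure the database stage below relies on.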
Once this database is produced, one can scour it for instances of textual reuse by feeding in additional documents, processing them in the same way, and comparing the combinatorial ngrams in each sentence from the input document to the sentences archived in the database. An estimation of the similarity of two sentences can be produced by simply summing up the number of combinatorial ngrams shared by those two sentences and normalizing by the length of the sentence. Take for example the following passage from Nathaniel Lee’s Alexander the Great, which the present method flags as a candidate for the quotation from Betsy Thoughtless discussed above: All find my spots, but few observe my brightness
While the sequential sliding window technique often employed in studies of text reuse only identifies a single shared trigram between these two passages ("spots but few"), the combinatorial ngram method gives a stronger indication of their semantic similarity, yielding 10 combinatorial trigram matches for the pair. Using this simple and scalable technique on the works of Eliza Haywood, we shall see below, allows one to rapidly identify instances of textual reuse and thereby improve our understanding of the production and dissemination of literary texts in the long eighteenth century.”
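The contrast the passage draws between the two techniques can be reproduced directly on the quoted lines. Taking the window to be the whole sentence with c = 3 and applying no stopword removal (assumptions for this illustration), the combinatorial method yields the ten shared trigrams reported above, where sequential trigrams find only one:

```python
from itertools import combinations

def combinatorial_trigrams(tokens):
    # Order-preserving combinations of 3 tokens (window = whole sentence).
    return set(combinations(tokens, 3))

def sequential_trigrams(tokens):
    # Conventional sliding-window trigrams, for comparison.
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

haywood = "all saw her spots but few her brightness took".split()
lee = "all find my spots but few observe my brightness".split()

shared_comb = combinatorial_trigrams(haywood) & combinatorial_trigrams(lee)
shared_seq = sequential_trigrams(haywood) & sequential_trigrams(lee)

print(len(shared_comb))  # 10 shared combinatorial trigrams
print(shared_seq)        # only {('spots', 'but', 'few')}
```

A similarity score as described above would then normalize the match count by sentence length, e.g. `len(shared_comb) / len(haywood)`.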
email: sgauch at uark dot edu
email: lawrenceevalyn at gmail dot com
email: kclabill at uark dot edu
email: mshukla at email dot uark dot edu
email: mpchiari at uark dot edu