CLIP: Computational Literature Project


This project applies state-of-the-art computational approaches to literary analysis. Our goal is to make research contributions in both Computer Science and Literature. Our initial work focuses on social network analysis of dramatic literature and on tracing quotations and their contexts.

Social Network Analysis of Plays


We are investigating how to represent plays as social networks and how social network analysis can be used to compare authors, genres, and contexts. We will develop automatic techniques to build and analyze social networks of characters in plays and explore different ways of presenting the resulting networks.


Objectives

  • Model a play as a social network
  • Automatically build social networks of characters
  • Use social network analysis to compare authors, genres, and contexts
  • Investigate different ways of displaying the resulting social networks

Research Questions

  • Can the social network of a play’s staged interactions stand as a proxy for its basic plot structure?

  • Can social network analysis be used to distinguish Shakespeare’s comedies, tragedies, and histories?

  • Do these social networks shed light on Shakespeare’s works of ambiguous genre?

Related Work

Moretti - "Network Theory, Plot Analysis"

Uses hand-created social networks to compare the tragedies of Hamlet, Macbeth, and King Lear, and to compare English and Chinese literature as reflected in the novelistic structures of Charles Dickens’ Our Mutual Friend and Cao Xueqin’s The Story of the Stone.
Edges exist between nodes when characters directly speak to each other.
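Moretti's edge rule can be sketched with a simple heuristic: treat each speaker in a scene as addressing the previous speaker. The scene data below is a toy illustration, not an actual encoding of a play.

```python
def build_speech_network(scenes):
    """Build an undirected character network in Moretti's style:
    an edge joins two characters when one speaks directly to the
    other. Simplifying assumption: within a scene, each speaker
    addresses the previous speaker."""
    edges = set()
    for turns in scenes:  # each scene is an ordered list of speakers
        for a, b in zip(turns, turns[1:]):
            if a != b:
                edges.add(frozenset((a, b)))
    return edges

# Toy scene: Hamlet exchanges lines with Horatio, then with the Ghost
scenes = [["Hamlet", "Horatio", "Hamlet", "Ghost"]]
edges = build_speech_network(scenes)
```

For the toy scene this yields two edges, Hamlet–Horatio and Hamlet–Ghost; a real pipeline would need stage directions or speech-act annotation to decide who is actually being addressed.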

Stiller et al. - "The Small World of Shakespeare's Plays"

Also uses hand-created social network maps. Focuses on comparing the plays’ social networks to real-life social networks, according to the following features and methodology:

“The analysis of each play was carried out using the printed text, by tabulating the speaking characters present on the stage at each time slice through the play. A new time slice was deemed to begin whenever a character was stated or could be inferred to have left the stage. The scene size is the number of speaking characters present during the time slice.
The network structure calculations were obtained by treating each speaking character as a node, and deeming two characters to be linked if there was at least one time slice of the play in which both were present.
We calculated the connectance (C) of the network of each play. This is the proportion of possible links between characters that are in fact realized, and in a network containing S nodes and L observed links C is given by L/S² (Dunne et al. 2002; Williams et al. 2002). This parameter ranges from 0 for a group of completely unlinked nodes to 1 for a fully connected set in which each character interacts with each other character.
We also calculated the characteristic path length (D), or "degrees of separation." For each pair of characters, the number of links in the shortest possible route connecting the two is found. D is the mean of these path lengths for all pairs of characters in the network (for a formal statement, see Montoya and Sole 2002:406).
Finally, we calculated the cluster coefficient (T) for each play. This coefficient is the probability that two nodes each linked to a third will also be linked themselves (for a formal definition, see Montoya and Sole 2002:406). This parameter reflects the extent to which the network is subdivided into densely interconnected sub-parts. In a randomly connected network, the cluster coefficient (T) is equal to the connectance (C).”
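The three measures in the passage above can be computed directly from a node set and link set. The sketch below uses plain breadth-first search; the metric definitions follow the quoted text, but the four-character network is invented for illustration.

```python
from collections import deque
from itertools import combinations

def network_metrics(nodes, links):
    """Connectance C, characteristic path length D, and cluster
    coefficient T for an undirected network, per Stiller et al.
    nodes: set of names; links: set of frozenset pairs."""
    S, L = len(nodes), len(links)
    adj = {n: set() for n in nodes}
    for a, b in (tuple(link) for link in links):
        adj[a].add(b)
        adj[b].add(a)

    # Connectance: realized fraction of possible links, C = L / S^2.
    C = L / S ** 2

    # Shortest-path lengths from one node to all others (BFS).
    def bfs(src):
        dist, queue = {src: 0}, deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    # Characteristic path length: mean shortest path over connected
    # pairs (each pair is counted twice here; the mean is unchanged).
    pair_dists = []
    for u in nodes:
        dist = bfs(u)
        pair_dists += [d for v, d in dist.items() if v != u]
    D = sum(pair_dists) / len(pair_dists)

    # Cluster coefficient: probability that two neighbors of a
    # common node are themselves linked.
    linked = triples = 0
    for n in nodes:
        for a, b in combinations(sorted(adj[n]), 2):
            triples += 1
            linked += b in adj[a]
    T = linked / triples if triples else 0.0
    return C, D, T

# Toy network: triangle A-B-C plus a pendant node D attached to C
links = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]}
C, D, T = network_metrics({"A", "B", "C", "D"}, links)
```

For the toy network this gives C = 4/16 = 0.25, D = 4/3, and T = 3/5, matching the definitions above by hand calculation.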

Quotation Tracking


We will explore how literary works are quoted by subsequent authors. We will track quotations across time to analyze how their frequency and contexts change. We will also explore partial matches to identify paraphrases and misquotations.

Research Questions

  • How and how often was Shakespeare quoted in texts from the 15th through the 20th century?

  • What works were most quoted, and what parts of those works?

  • Which authors clearly quoted from a broad knowledge of Shakespeare, and which authors were likely just recycling quotations from books of poetic extracts?


Objectives

  • Automatically find the exact match for a quote in a corpus of texts from the 15th century through the 20th century.
  • Find partial quotations and misquotations of literary works.
  • Track and study changes in a quote's literary context.
  • Create a web interface to search the corpus of texts and analyze the context using different methods.
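The exact-match step can be sketched as a search over normalized word sequences. The normalization rules and context-window size below are illustrative choices, not the project's actual pipeline.

```python
import re

def normalize(text):
    """Lowercase and strip punctuation so small typographic
    differences do not defeat an exact match."""
    return re.sub(r"[^a-z0-9\s]+", " ", text.lower()).split()

def find_quote(quote, text, context=5):
    """Return each exact occurrence of the quote in the text,
    with a few words of surrounding context."""
    q, words = normalize(quote), normalize(text)
    hits = []
    for i in range(len(words) - len(q) + 1):
        if words[i:i + len(q)] == q:
            hits.append(" ".join(words[max(0, i - context):i + len(q) + context]))
    return hits

hits = find_quote("to be or not to be",
                  "Hamlet asks: To be, or not to be; that is the question.")
```

The returned context snippets are what a web interface would highlight; a production system would also need the book identifier and page offset for each hit.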


Progress

  • We have a fully working web application that allows the user to search for a quote.
  • We can find the exact match for the quote.
  • We display the quote highlighted in the part of the book where it was found.
  • We display a timeline of the quote using the dates of the books in the database.
  • We have scaled the application to the ECCO-TCP, EEBO-TCP, and Irish_texts corpora, a total of 5,500 texts dated 1400-1999.

Related Work

Duhaime - Textual Reuse in the Eighteenth Century: Mining Eliza Haywood's Quotations

Uses combinatorial ngrams to find the sources of quotations:

“The methods used within the present study may be best illustrated by comparison with previous work in the field of historical text reuse. In their research on intertextuality in classical texts, Neil Coffee et al. use a sliding window technique to find passages wherein two texts share at least two words, then rank the matching window results according to the proximity and rarity of the words shared in the two windows. Jean-Gabriel Ganascia et al. also use a sliding window technique and score matches based on word rarity, though they allow for missing words in their sliding windows such that two windows of five words can count as a match if they share three words in common. David Smith et al. use a sliding n-gram window to establish candidate pairs, leverage the Smith-Waterman algorithm to align matching sequences in documents, and finally sort documents with matching windows according to the frequency of their common n-grams. Similarly, Constance Mews et al. also use a sequence alignment technique that matches strings based on identical words. Moving away from a sliding window technique, Glenn Roe et al. measure the cosine distance between passages in a term document matrix, while David Bamman et al. use Moore's Bilingual Sentence Aligner and a translation probability table generated from MGIZA++ to identify cross-lingual instances of intertextuality.
The present study extends the candidate retrieval step of Jean-Gabriel Ganascia et al., and implements a minimal probabilistic model to remove high probability ngrams from the database in order to reduce both storage requirements and processing time. The first step of the method used within the present study is to preprocess each file in a corpus, transform each file into an array of sentences using the Punkt sentence boundary detector, drop stopwords, remove non-token punctuation, and lowercase all text. Because orthography was non-standardized in the eighteenth century, the lookup table from MorphAdorner is then used to regularize the spelling of each word in the text, using a simple hash table replacement. Finally, the WordNet Lemmatizer is used to transform each word into its lemma form.
Once the texts are all represented as lists of clean and normalized sentences, the next step of analysis is to find sentences with unusually similar language. To accomplish this, a sliding window of length w is slid over each sentence in the corpus such that for each sentence, the window first contains words 0 through w-1 from the sentence, then 1 through w, then 2 through w+1, and so on. For each of these windows, a list of all possible combinations of c words from the window is generated. Suppose c is 3, and the algorithm is considering the following sentence: All saw her spots but few her brightness took.
Using this window of text, the list of "combinatorial ngrams" produced includes "all saw her", "all saw spots", "all saw but" . . . "her brightness took". In order to minimize the storage requirements and maximize the utility of these ngrams, the algorithm next estimates the rarity of each ngram by calculating the product of the relative frequency of each word in the ngram. The rarity of the ngram "spots few brightness", for instance, is calculated as p(spots) * p(few) * p(brightness). Because ngrams with high probability are much more likely to be found often in a corpus and are therefore much less useful for detecting textual reuse, they are dropped from the list, and the remaining ngrams are stored in a database.
Once this database is produced, one can scour it for instances of textual reuse by feeding in additional documents, processing them in the same way, and comparing the combinatorial ngrams in each sentence from the input document to the sentences archived in the database. An estimation of the similarity of two sentences can be produced by simply summing up the number of combinatorial ngrams shared by those two sentences and normalizing by the length of the sentence. Take for example the following passage from Nathaniel Lee’s Alexander the Great, which the present method flags as a candidate for the quotation from Betsy Thoughtless discussed above: All find my spots, but few observe my brightness
While the sequential sliding window technique often employed in studies of text reuse only identifies a single shared trigram between these two passages ("spots but few"), the combinatorial ngram method gives a stronger indication of their semantic similarity, yielding 10 combinatorial trigram matches for the pair. Using this simple and scalable technique on the works of Eliza Haywood, we shall see below, allows one to rapidly identify instances of textual reuse and thereby improve our understanding of the production and dissemination of literary texts in the long eighteenth century.”
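The combinatorial ngram step can be sketched as follows. This version skips the stopword removal, spelling regularization, and rarity filtering described above, so its match count for the example pair differs from the essay's figure of 10.

```python
from itertools import combinations

def combinatorial_ngrams(words, w=5, c=3):
    """All order-preserving c-word combinations drawn from each
    w-word sliding window over a sentence."""
    ngrams = set()
    for i in range(max(1, len(words) - w + 1)):
        ngrams.update(combinations(words[i:i + w], c))
    return ngrams

def shared_ngrams(s1, s2, w=5, c=3):
    """Duhaime-style similarity signal: combinatorial ngrams
    common to two (already normalized) sentences."""
    return (combinatorial_ngrams(s1.split(), w, c)
            & combinatorial_ngrams(s2.split(), w, c))

shared = shared_ngrams("all saw her spots but few her brightness took",
                       "all find my spots but few observe my brightness")
```

A plain sliding trigram finds only "spots but few" shared between the two lines; the combinatorial version also surfaces non-contiguous matches such as "all spots but" and "but few brightness".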


People

The following people are involved in CLIP:


Dr. Susan Gauch

Principal Investigator
email: sgauch at uark dot edu

Current Participants

Lawrence Evalyn

email: lawrenceevalyn at gmail dot com

Kevin Labille

email: kclabill at uark dot edu

Manisha Shukla

email: mshukla at email dot uark dot edu

Preston Evans

email: pdevans at uark dot edu

Marion Chiariglione

email: mpchiari at uark dot edu

Pierre-Emmanuel Vignon