The Sparsity of Poetry

As part of my new work with computational reading, I’ve begun creating a library of network tables for different authors’ corpora. I’m taking a course on network theory right now, and I want to see what happens when we represent an author’s writings as a network: what can the structure of that network tell us about the formal features of that author’s work?

The process begins by making a document-term matrix for an author’s works: essentially a table in which the works are the rows, the words are the columns, and each cell records how often that word appears in that work. When I did this for the poetry of Friedrich Hölderlin, something very interesting happened before I even got to the network table. In a corpus of 253 poems, there are around 11,000 unique words once stop words (those little ones like and, the, or, but, etc.) are excluded. By way of comparison, a corpus of 150 novels contains close to 300,000 unique words.
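To make the idea of a document-term matrix concrete, here is a minimal sketch in Python. It is not the code I used for Hölderlin: the three toy "poems", the tiny stop-word list, and the function names are all invented for illustration.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the poems (not Hölderlin's text).
poems = {
    "poem_a": "das Herz und das Leben",
    "poem_b": "das Leben der Götter",
    "poem_c": "der Strom und das Herz",
}

# A tiny illustrative stop-word list (an assumption, far from complete).
STOP_WORDS = {"das", "der", "und"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def document_term_matrix(docs):
    """Return (vocabulary, rows): one row of word counts per document."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    vocabulary = sorted(set().union(*counts.values()))
    rows = {name: [c[term] for term in vocabulary] for name, c in counts.items()}
    return vocabulary, rows


vocabulary, rows = document_term_matrix(poems)
print(vocabulary)
for name, row in rows.items():
    print(name, row)
```

Each row is a poem, each column a word, and most cells are zero. That preponderance of zeros is what "sparsity" refers to in what follows.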

When I tried to control for sparsity, asking the computer to keep only those words that appear in at least 60% of the poems, I got 0 words. When I asked it to keep only those words that appear in at least one third of the poems, I got just two: “life” and “heart” (Leben and Herz). In other words, across Hölderlin’s entire collection of poems, only two words appear in at least one third of the poems. Those words happen to be two of the most elementary words in the German language.
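The filtering step can be sketched the same way: count how many poems each word appears in, then keep only the words whose share of poems meets a threshold. Again, this is an illustrative sketch and not the tool I actually ran; the toy word lists and the names `document_frequency` and `words_in_share` are assumptions.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each word, how many documents contain it at least once."""
    df = Counter()
    for words in docs:
        df.update(set(words))  # set() so repeats within a poem count once
    return df


def words_in_share(docs, min_share):
    """Return the words that occur in at least `min_share` of the documents."""
    df = document_frequency(docs)
    total = len(docs)
    return sorted(w for w, n in df.items() if n / total >= min_share)


# Hypothetical tokenized poems with stop words already removed.
toy_poems = [
    ["herz", "leben"],
    ["leben", "götter"],
    ["strom", "herz"],
    ["leben", "himmel"],
]

print(words_in_share(toy_poems, 0.6))  # words found in at least 60% of poems
print(words_in_share(toy_poems, 0.5))  # words found in at least half of poems
```

Lowering the threshold lets more words through, which is the same move as going from 60% to one third in the Hölderlin corpus. With 253 poems, a one-third threshold means a word has to turn up in at least 85 of them.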

There is a basic point here: poems are often short, so the odds of words overlapping between them are lower, especially when we’re dealing not with high-probability words like conjunctions, articles, or prepositions, but with words of high semantic value (like heart and life). But I was still taken aback by just how sparse Hölderlin’s poetic vocabulary was. No word of significance appears in over half of the poems and only two appear in one third of the corpus. You have to go down to ten percent to get a list of words that exceeds 100. There are just 116 words in Hölderlin’s vocabulary that appear in at least 10% of his poetic corpus, that is, in about 25 poems.

It made me realize in a very visceral way something I had always felt but never been able to articulate: that I often go to read poetry to experience language in its singularity, its distinction. I know this is a very modern way of thinking about poetry, but it is one that I seek out when I pick up a poem to read. The repetitions within the poem stand contrapuntally to the repetitions of language we experience in our everyday life. Poems mark out clearings in our cluttered linguistic lives.