This is a first look at a work in progress. I’m using Python to study text from an etymological perspective. Specifically, I’m measuring how many words in a given English language text have Anglo-Saxon origin. Many people (including myself) think that Anglo-Saxon words convey a different sense than their counterparts of French/Latin origin. To demonstrate the point in a small way, I’ve included a Latin and Anglo-Saxon version of each heading in this blog post.
English is a Germanic language with Scandinavian influence, with a big layer of Old French poured on top. That Old French (Anglo Norman French, to be specific) was principally derived from Latin, so English is a hybrid between two major Indo-European language groups. Those mongrel origins are a big part of why English is messy and rich.
French was introduced to English as the language of conquerers and nobility. French was also the language of some European royalty in the 18th and 19th century, further adding to its reputation as a language associated with high status. Even today, English words with French origins often have higher cultural status than their counterparts with Anglo-Saxon origins (think cuisine versus cooking, illumination versus light, create versus make, and escargot versus snail). By contrast, the Anglo-Saxon words are often considered more visceral (think sea versus ocean, sweat versus perspire, and free versus emancipated — more on that last pair in a moment).
For instance, when taunting someone, you reach for blunt Anglo-Saxon words. “Your mother was a hamster, and your father smelled of elderberries!” is 100% Anglo-Saxon, except for “elderberries” which was coined in Middle English from “elder” and “berry”, both of Anglo-Saxon origin.
Legal documents and government issuances, on the other hand, tend to include more words of Latin/French origin. It’s no coincidence that the Latin/French words “Emancipation Proclamation” describe a legal act, but if you want to stir the heart about emancipation, you say something like “Free at last!”(1) which is all Anglo-Saxon.
Others have written more eloquently than I about how word origin influences tone (Annalisa Quinn at NPR, Gemma Varnom, and M. Birch, to suggest a few), so I won’t belabor the point more than I already have. But I wanted to talk about how it inspired the project I’ve been working on.
The Project (The Work)
I should preface this by saying that I Am Not A Linguist, and I don’t even play one on TV.
I thought it would be interesting to perform lexicographical analysis of text from an etymological perspective. My etymological categorization is necessarily simple. When I look at a text, I put each word into one of three etymological categories: Anglo-Saxon, non-Anglo-Saxon, or unknown. From this rough grouping I generate statistics that allow me to compare one text to another.
For instance, does one author consistently use more Anglo-Saxon words than other authors? Does an author’s usage of Anglo-Saxon words change from one work to another? Also of interest to me is the etymology of words as the book progresses from front to back. Do the relative frequencies of etymologies change as the book progresses towards its exciting conclusion? For authors writing in English as a second language, is their word selection influenced by their first language?
All of the questions above can be explored with the tool I’ve written. It’s easier to show the tool’s output than describe it, so here’s an analysis of Lewis Carroll’s 1865 work “Alice’s Adventures in Wonderland”.
The graph below shows the relative frequency of the three etymological categories as the book progresses from beginning to end.
This graph shows the relative frequency of the three etymological categories as counting statistics for various part-of-speech categories.
The table below is a more detailed version of the chart immediately above. Some percentages may not add up to 100% due to rounding.
|Total||%age of All Words||Anglo-Saxon||non-Anglo-Saxon||Unknown|
|All Words||26624||100%||18233 (68%)||3812 (14%)||4579 (17%)|
|Unique||3528||13%||1354 (38%)||899 (25%)||1275 (36%)|
|Nouns||8522||32%||4521 (53%)||2354 (27%)||1647 (19%)|
|Verbs||5479||20%||2994 (54%)||565 (10%)||1920 (35%)|
|Adjectives||1639||6%||896 (54%)||375 (22%)||368 (22%)|
|Adverbs||1974||7%||1348 (68%)||420 (21%)||206 (10%)|
|Other||9010||33%||8474 (94%)||98 (1%)||438 (4%)|
Observations (What I See)
There’s some minor observations to be made here, but the strength of this tool will be in comparative analysis. It’s hard to draw conclusions from one analysis before I have an idea of what’s typical.
For instance, at first glance, the ratio of Anglo-Saxon to non-Anglo-Saxon words looks dramatic, but this says more about English than it does about Carroll. The most common words in English are overwhelmingly Anglo-Saxon in origin. (2) For the small sample size of works I’ve processed so far (just 8 in total), I can see that it’s common for roughly three quarters of the words to be Anglo-Saxon. Alice in Wonderland isn’t an outlier by that standard.
We can also see that the frequency of Anglo-Saxon words decreases slightly throughout the book. This is the kind of trend that I find interesting, but in this case it’s due to an increase in the number of words of unknown etymology. Sometimes a word’s etymology is truly unknown. More often, though, the etymology is classified as unknown for other reasons. Most likely, it’s simply not in my etymological database (which isn’t very complete yet). Also, the word could be a proper noun, an invented word (like “woodshadows” from James Joyce’s Ulysses), or a word for which the etymology is ambiguous. An example of this last category is “bank” which is Anglo-Saxon in origin when referring to the side of a river, but French/Italian in origin when referring to a place that handles money.
At present, the quantity of words classified as “unknown” is too large for my tastes, and I plan to reduce it by improving both my database and the tool.
Verbs are overrepresented in the “unknown” category. My guess is that this is an artifact of my stemmer having difficulty stemming verbs. (I’m currently using the Snowball Stemmer from NLTK.)
As you can see, at this point it’s easier to draw conclusions about the representation of the data than it is about the data themselves. That leads me to the next (and final) topic in this post.
Future (What’s to Come)
As I said in the introduction, this is an early look at a work in progress. Here’s some of the things I’d like to add –
- Better etymological data
- Large scale comparisons of text to look for trends (across authors, genres, etc.)
- More numeric (rather than visual) descriptions of the data to facilitate automated comparison. One idea is to add the mean and standard deviation of the percentage of Anglo-Saxon words.
- Open sourcing
If you have any suggestions on how to use this tool or make it more interesting, I’d love to hear them in the comments below. I moderate all comments to filter spam which is yet another Viking influence on England.
Like English itself, “Endnote” is an etymological hybrid. “End” is of Anglo-Saxon origin, while “note” comes from Old French/Latin.
1. Martin Luther King, Jr. isn’t the only person to have said “Free at last!”, but his use of it is perhaps the most famous. His “I Have a Dream” speech makes brilliant use of etymological contrasts. Many of his memorable phrases in that speech (“I have a dream today”, “Let freedom ring”, “Free at last”) are Anglo-Saxon.
2. In 2014 I pulled from Wikipedia a list of the 100 most common English words. At the time, it contained just four non-Anglo-Saxon words. They were “just” (ME < Latin), “people” (ME < Anglo-French < Latin), “use” (ME < Old French, replaced OE brucan, cognate w/modern Swedish bruk-), and “because” (ME < Fr ‘par cause’). There are lots of ways to count the 100 most common words, and doubtless the list would have been different in Carroll’s day. But my guess is that the presence of Anglo-Saxon hasn’t changed dramatically from that 96% regardless of when and how one counts.