How to Measure Anglo-Saxonicity – With a Ruler or Yardstick?

Summary (Nutshell)

This is a first look at a work in progress. I’m using Python to study text from an etymological perspective. Specifically, I’m measuring how many words in a given English language text have Anglo-Saxon origin. Many people (including myself) think that Anglo-Saxon words convey a different sense than their counterparts of French/Latin origin. To demonstrate the point in a small way, I’ve included a Latin and Anglo-Saxon version of each heading in this blog post.

Background (Milieu)

English is a Germanic language with Scandinavian influence, with a big layer of Old French poured on top. That Old French (Anglo Norman French, to be specific) was principally derived from Latin, so English is a hybrid between two major Indo-European language groups. Those mongrel origins are a big part of why English is messy and rich.

French was introduced to English as the language of conquerers and nobility. French was also the language of some European royalty in the 18th and 19th century, further adding to its reputation as a language associated with high status. Even today, English words with French origins often have higher cultural status than their counterparts with Anglo-Saxon origins (think cuisine versus cooking, illumination versus light, create versus make, and escargot versus snail). By contrast, the Anglo-Saxon words are often considered more visceral (think sea versus ocean, sweat versus perspire, and free versus emancipated — more on that last pair in a moment).

For instance, when taunting someone, you reach for blunt Anglo-Saxon words. “Your mother was a hamster, and your father smelled of elderberries!” is 100% Anglo-Saxon, except for “elderberries” which was coined in Middle English from “elder” and “berry”, both of Anglo-Saxon origin.

A still of the French taunting King Arthur from Monty Python and the Holy Grail
William of Normandy in 1067, addressing his English subjects.

Legal documents and government issuances, on the other hand, tend to include more words of Latin/French origin. It’s no coincidence that the Latin/French words “Emancipation Proclamation” describe a legal act, but if you want to stir the heart about emancipation, you say something like “Free at last!”(1)  which is all Anglo-Saxon.

Others have written more eloquently than I about how word origin influences tone (Annalisa Quinn at NPRGemma Varnom, and M. Birch, to suggest a few), so I won’t belabor the point more than I already have. But I wanted to talk about how it inspired the project I’ve been working on.

The Project (The Work)

I should preface this by saying that I Am Not A Linguist, and I don’t even play one on TV.

I thought it would be interesting to perform lexicographical analysis of text from an etymological perspective. My etymological categorization is necessarily simple. When I look at a text, I put each word into one of three etymological categories: Anglo-Saxon, non-Anglo-Saxon, or unknown. From this rough grouping I generate statistics that allow me to compare one text to another.

For instance, does one author consistently use more Anglo-Saxon words than other authors? Does an author’s usage of Anglo-Saxon words change from one work to another? Also of interest to me is the etymology of words as the book progresses from front to back. Do the relative frequencies of etymologies change as the book progresses towards its exciting conclusion? For authors writing in English as a second language, is their word selection influenced by their first language?

All of the questions above can be explored with the tool I’ve written. It’s easier to show the tool’s output than describe it, so here’s an analysis of Lewis Carroll’s 1865 work “Alice’s Adventures in Wonderland”.

The graph below shows the relative frequency of the three etymological categories as the book progresses from beginning to end.

A graphical representation of how the etymological ration of Alice in Wonderland changes as one progresses through the book

This graph shows the relative frequency of the three etymological categories as counting statistics for various part-of-speech categories.

A graphical representation of the counts of words by parts of speech and etymology in Alice in Wonderland

The table below is a more detailed version of the chart immediately above. Some percentages may not add up to 100% due to rounding.

Total %age of All Words Anglo-Saxon non-Anglo-Saxon Unknown
All Words 26624 100% 18233 (68%) 3812 (14%) 4579 (17%)
Unique 3528 13% 1354 (38%) 899 (25%) 1275 (36%)
Nouns 8522 32% 4521 (53%) 2354 (27%) 1647 (19%)
Verbs 5479 20% 2994 (54%) 565 (10%) 1920 (35%)
Adjectives 1639 6% 896 (54%) 375 (22%) 368 (22%)
Adverbs 1974 7% 1348 (68%) 420 (21%) 206 (10%)
Other 9010 33% 8474 (94%) 98 (1%) 438 (4%)

Observations (What I See)

There’s some minor observations to be made here, but the strength of this tool will be in comparative analysis. It’s hard to draw conclusions from one analysis before I have an idea of what’s typical.

For instance, at first glance, the ratio of Anglo-Saxon to non-Anglo-Saxon words looks dramatic, but this says more about English than it does about Carroll. The most common words in English are overwhelmingly Anglo-Saxon in origin. (2)  For the small sample size of works I’ve processed so far (just 8 in total), I can see that it’s common for roughly three quarters of the words to be Anglo-Saxon. Alice in Wonderland isn’t an outlier by that standard.

We can also see that the frequency of Anglo-Saxon words decreases slightly throughout the book. This is the kind of trend that I find interesting, but in this case it’s due to an increase in the number of words of unknown etymology. Sometimes a word’s etymology is truly unknown. More often, though, the etymology is classified as unknown for other reasons. Most likely, it’s simply not in my etymological database (which isn’t very complete yet). Also, the word could be a proper noun, an invented word (like “woodshadows” from James Joyce’s Ulysses), or a word for which the etymology is ambiguous. An example of this last category is “bank” which is Anglo-Saxon in origin when referring to the side of a river, but French/Italian in origin when referring to a place that handles money.

At present, the quantity of words classified as “unknown” is too large for my tastes, and I plan to reduce it by improving both my database and the tool.

Verbs are overrepresented in the “unknown” category. My guess is that this is an artifact of my stemmer having difficulty stemming verbs. (I’m currently using the Snowball Stemmer from NLTK.)

As you can see, at this point it’s easier to draw conclusions about the representation of the data than it is about the data themselves. That leads me to the next (and final) topic in this post.

Future (What’s to Come)

As I said in the introduction, this is an early look at a work in progress. Here’s some of the things I’d like to add –

  • Better etymological data
  • Large scale comparisons of text to look for trends (across authors, genres, etc.)
  • More numeric (rather than visual) descriptions of the data to facilitate automated comparison. One idea is to add the mean and standard deviation of the percentage of Anglo-Saxon words.
  • Open sourcing

If you have any suggestions on how to use this tool or make it more interesting, I’d love to hear them in the comments below. I moderate all comments to filter spam which is yet another Viking influence on England.

Endnotes

Like English itself, “Endnote” is an etymological hybrid. “End” is of Anglo-Saxon origin, while  “note” comes from Old French/Latin.

1. Martin Luther King, Jr. isn’t the only person to have said “Free at last!”, but his use of it is perhaps the most famous. His “I Have a Dream” speech makes brilliant use of etymological contrasts. Many of his memorable phrases in that speech (“I have a dream today”, “Let freedom ring”, “Free at last”) are Anglo-Saxon.

2. In 2014 I pulled from Wikipedia a list of the 100 most common English words. At the time, it contained just four non-Anglo-Saxon words. They were “just” (ME < Latin), “people” (ME < Anglo-French < Latin), “use” (ME < Old French, replaced OE brucan, cognate w/modern Swedish bruk-), and “because” (ME < Fr ‘par cause’). There are lots of ways to count the 100 most common words, and doubtless the list would have been different in Carroll’s day. But my guess is that the presence of Anglo-Saxon hasn’t changed dramatically from that 96% regardless of when and how one counts.

5 thoughts on “How to Measure Anglo-Saxonicity – With a Ruler or Yardstick?”

  1. The graph would be smoother, and trends a bit more visible, with wider windows and/or a sliding-window approach.

    1. Hi Terry,
      Thanks for your comment. I agree with you, although I’m not sure how much I want to sand off those peaks and valleys. I think the smoothness (or lack thereof) is informative in and of itself. This is what I was thinking of when I mentioned calculating & showing the standard deviation.

      When you say “sliding window”, are you thinking of something like a moving average? I was thinking of plotting something like that as a line on top of the existing graph. That would retain the peaks and valleys while giving the eye a smoother curve to follow. Does that sound helpful to you?

      Plotting variance on top of all that might get too busy; I’ll have to see.

  2. You might want to calibrate the data via the British National Corpus (http://www.natcorp.ox.ac.uk/) so that you have a baseline for what an ‘expected’ proportion of Anglo-Saxon words would be.

    If you’re interested in sharing this with the community, you might consider using spaCy rather than NLTK, which is a bit awkward to deploy.

    1. Hi Ben,
      Calibration is a great suggestion. Most of the texts I’ve been playing around with are from Project Gutenberg. Many are from the 1800s and aren’t representative of how language is used today.

      It would also be interesting to work on the American national corpus and observe differences between that and the British national corpus. Which do you think would be “more Anglo-Saxon”?

      I’ll look into spaCy – thanks for the tip.

      1. I would /expect/ American English to have proportionately fewer Anglo-Saxon words due to the many other cultures that became part of modern America; but I suppose it’s worth measuring to be sure!

        It’s also an interesting question to wonder whether modern American English is closer to modern British English than it is to 1800s American English – I suspect that word choice is more similar across the modern languages, even if spellings may have diverged in the post-Webster era.

Leave a Reply

Your email address will not be published.