How to Measure Anglo-Saxonicity – With a Ruler or Yardstick?

Summary (Nutshell)

This is a first look at a work in progress. I’m using Python to study text from an etymological perspective. Specifically, I’m measuring how many words in a given English language text have Anglo-Saxon origin. Many people (including myself) think that Anglo-Saxon words convey a different sense than their counterparts of French/Latin origin. To demonstrate the point in a small way, I’ve included a Latin and Anglo-Saxon version of each heading in this blog post.

Background (Milieu)

English is a Germanic language with Scandinavian influence, with a big layer of Old French poured on top. That Old French (Anglo Norman French, to be specific) was principally derived from Latin, so English is a hybrid between two major Indo-European language groups. Those mongrel origins are a big part of why English is messy and rich.

French was introduced to English as the language of conquerers and nobility. French was also the language of some European royalty in the 18th and 19th century, further adding to its reputation as a language associated with high status. Even today, English words with French origins often have higher cultural status than their counterparts with Anglo-Saxon origins (think cuisine versus cooking, illumination versus light, create versus make, and escargot versus snail). By contrast, the Anglo-Saxon words are often considered more visceral (think sea versus ocean, sweat versus perspire, and free versus emancipated — more on that last pair in a moment).

For instance, when taunting someone, you reach for blunt Anglo-Saxon words. “Your mother was a hamster, and your father smelled of elderberries!” is 100% Anglo-Saxon, except for “elderberries” which was coined in Middle English from “elder” and “berry”, both of Anglo-Saxon origin.

A still of the French taunting King Arthur from Monty Python and the Holy Grail
William of Normandy in 1067, addressing his English subjects.

Legal documents and government issuances, on the other hand, tend to include more words of Latin/French origin. It’s no coincidence that the Latin/French words “Emancipation Proclamation” describe a legal act, but if you want to stir the heart about emancipation, you say something like “Free at last!”(1)  which is all Anglo-Saxon.

Others have written more eloquently than I about how word origin influences tone (Annalisa Quinn at NPRGemma Varnom, and M. Birch, to suggest a few), so I won’t belabor the point more than I already have. But I wanted to talk about how it inspired the project I’ve been working on.

The Project (The Work)

I should preface this by saying that I Am Not A Linguist, and I don’t even play one on TV.

I thought it would be interesting to perform lexicographical analysis of text from an etymological perspective. My etymological categorization is necessarily simple. When I look at a text, I put each word into one of three etymological categories: Anglo-Saxon, non-Anglo-Saxon, or unknown. From this rough grouping I generate statistics that allow me to compare one text to another.

For instance, does one author consistently use more Anglo-Saxon words than other authors? Does an author’s usage of Anglo-Saxon words change from one work to another? Also of interest to me is the etymology of words as the book progresses from front to back. Do the relative frequencies of etymologies change as the book progresses towards its exciting conclusion? For authors writing in English as a second language, is their word selection influenced by their first language?

All of the questions above can be explored with the tool I’ve written. It’s easier to show the tool’s output than describe it, so here’s an analysis of Lewis Carroll’s 1865 work “Alice’s Adventures in Wonderland”.

The graph below shows the relative frequency of the three etymological categories as the book progresses from beginning to end.

A graphical representation of how the etymological ration of Alice in Wonderland changes as one progresses through the book

This graph shows the relative frequency of the three etymological categories as counting statistics for various part-of-speech categories.

A graphical representation of the counts of words by parts of speech and etymology in Alice in Wonderland

The table below is a more detailed version of the chart immediately above. Some percentages may not add up to 100% due to rounding.

Total %age of All Words Anglo-Saxon non-Anglo-Saxon Unknown
All Words 26624 100% 18233 (68%) 3812 (14%) 4579 (17%)
Unique 3528 13% 1354 (38%) 899 (25%) 1275 (36%)
Nouns 8522 32% 4521 (53%) 2354 (27%) 1647 (19%)
Verbs 5479 20% 2994 (54%) 565 (10%) 1920 (35%)
Adjectives 1639 6% 896 (54%) 375 (22%) 368 (22%)
Adverbs 1974 7% 1348 (68%) 420 (21%) 206 (10%)
Other 9010 33% 8474 (94%) 98 (1%) 438 (4%)

Observations (What I See)

There’s some minor observations to be made here, but the strength of this tool will be in comparative analysis. It’s hard to draw conclusions from one analysis before I have an idea of what’s typical.

For instance, at first glance, the ratio of Anglo-Saxon to non-Anglo-Saxon words looks dramatic, but this says more about English than it does about Carroll. The most common words in English are overwhelmingly Anglo-Saxon in origin. (2)  For the small sample size of works I’ve processed so far (just 8 in total), I can see that it’s common for roughly three quarters of the words to be Anglo-Saxon. Alice in Wonderland isn’t an outlier by that standard.

We can also see that the frequency of Anglo-Saxon words decreases slightly throughout the book. This is the kind of trend that I find interesting, but in this case it’s due to an increase in the number of words of unknown etymology. Sometimes a word’s etymology is truly unknown. More often, though, the etymology is classified as unknown for other reasons. Most likely, it’s simply not in my etymological database (which isn’t very complete yet). Also, the word could be a proper noun, an invented word (like “woodshadows” from James Joyce’s Ulysses), or a word for which the etymology is ambiguous. An example of this last category is “bank” which is Anglo-Saxon in origin when referring to the side of a river, but French/Italian in origin when referring to a place that handles money.

At present, the quantity of words classified as “unknown” is too large for my tastes, and I plan to reduce it by improving both my database and the tool.

Verbs are overrepresented in the “unknown” category. My guess is that this is an artifact of my stemmer having difficulty stemming verbs. (I’m currently using the Snowball Stemmer from NLTK.)

As you can see, at this point it’s easier to draw conclusions about the representation of the data than it is about the data themselves. That leads me to the next (and final) topic in this post.

Future (What’s to Come)

As I said in the introduction, this is an early look at a work in progress. Here’s some of the things I’d like to add –

  • Better etymological data
  • Large scale comparisons of text to look for trends (across authors, genres, etc.)
  • More numeric (rather than visual) descriptions of the data to facilitate automated comparison. One idea is to add the mean and standard deviation of the percentage of Anglo-Saxon words.
  • Open sourcing

If you have any suggestions on how to use this tool or make it more interesting, I’d love to hear them in the comments below. I moderate all comments to filter spam which is yet another Viking influence on England.


Like English itself, “Endnote” is an etymological hybrid. “End” is of Anglo-Saxon origin, while  “note” comes from Old French/Latin.

1. Martin Luther King, Jr. isn’t the only person to have said “Free at last!”, but his use of it is perhaps the most famous. His “I Have a Dream” speech makes brilliant use of etymological contrasts. Many of his memorable phrases in that speech (“I have a dream today”, “Let freedom ring”, “Free at last”) are Anglo-Saxon.

2. In 2014 I pulled from Wikipedia a list of the 100 most common English words. At the time, it contained just four non-Anglo-Saxon words. They were “just” (ME < Latin), “people” (ME < Anglo-French < Latin), “use” (ME < Old French, replaced OE brucan, cognate w/modern Swedish bruk-), and “because” (ME < Fr ‘par cause’). There are lots of ways to count the 100 most common words, and doubtless the list would have been different in Carroll’s day. But my guess is that the presence of Anglo-Saxon hasn’t changed dramatically from that 96% regardless of when and how one counts.

Thanks to PYPTUG!

The logo of the Python Piedmont Triad Users Group
Thanks to PYPTUG, the Python Piedmont Triad Users Group, for inviting me to speak at their monthly meeting last night!

I gave a slightly expanded version of the talk I gave at PyData Carolinas 2016 about connecting Python to compiled languages like C, C++, and Fortran. (Slides from that talk are here.) I appreciate the time and attention of everyone who attended last night, especially Francois Dion for organizing and reminding us of some of the interesting new things in Python 3.5 and 3.6.

Last night’s talk wasn’t recorded, but you can see the version of the talk I gave at PyData at , or you can watch the embedded version below.


Pandas Surprise


Part of learning how to use any tool is exploring its strengths and weaknesses. I’m just starting to use the Python library Pandas, and my naïve use of it exposed a weakness that surprised me.


A photo of the many shapes and colors in Lucky Charms cereal
Thanks to bradleypjohnson for sharing this Lucky Charms photo under CC BY 2.0.

I have a long list of objects, each with the properties “color” and “shape”. I want to count the frequency of each color/shape combination. A sample of what I’m trying to achieve could be represented in a grid like this –

       circle square star
blue        8     41   18
orange      5     33   25
red        53     64   58

At first I implemented this with a dictionary of collections.Counter instances where the top level dictionary is keyed by shape, like so –

import collections
SHAPES = ('square', 'circle', 'star', )
frequencies = {shape: collections.Counter() for shape in SHAPES}

Then I counted my frequencies using the code below. (For simplicity, assume that my objects are simple 2-tuples of (shape, color)).

for shape, color in all_my_objects:
    frequencies[shape][color] += 1

So far, so good.

Enter the Pandas

This looked to me like a perfect opportunity to use a Pandas DataFrame which would nicely support the operations I wanted to do after tallying the frequencies, like adding a column to represent the total number (sum) of instances of each color.

It was especially easy to try out a DataFrame because my counting loop ( for...all_my_objects) wouldn’t change, only the definition of frequencies. (Note that the code below requires I know in advance all the possible colors I can expect to see, which the Dict + Counter version does not. This isn’t a problem for me in my real-world application.)

import pandas as pd
frequencies = pd.DataFrame(columns=SHAPES, index=COLORS, data=0,
for shape, color in all_my_objects:
    frequencies[shape][color] += 1

It Works, But…

Both versions of the code get the job done, but using the DataFrame as a frequency counter turned out to be astonishingly slow. A DataFrame is simply not optimized for repeatedly accessing individual cells as I do above.

How Slow is it?

To isolate the effect pandas was having on performance, I used Python’s timeit module to benchmark some simpler variations on this code. In the version of Python I’m using (3.6), the default number of iterations for each timeit test is 1 million.

First, I timed how long it takes to increment a simple variable, just to get a baseline.

Second, I timed how long it takes to increment a variable stored inside a collections.Counter inside a dict. This mimics the first version of my code (above) for a frequency counter. It’s more complex than the simple variable version because Python has to resolve two hash table references (one inside the dict, and one inside the Counter). I expected this to be slower, and it was.

Third, I timed how long it takes to increment one cell inside a 2×2 NumPy array. Since Pandas is built atop NumPy, this gives an idea of how the DataFrame’s backing store performs without Pandas involved.

Fourth, I timed how long it takes to increment one cell inside a 2×2 Pandas DataStore. This is what I had used in my real code.

Raw Benchmark Results

Here’s what timeit showed me. Sorry for the cramped formatting.

$ python
 Python 3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13)
 [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import timeit
 >>> timeit.timeit('data += 1', setup='data=0')
 >>> timeit.timeit('data[0][0]+=1',setup='from collections import Counter;data={0:Counter()}')
 >>> timeit.timeit('data[0][0]+=1',setup='import numpy as np;data=np.zeros((2,2))')
 >>> timeit.timeit('data[0][0]+=1',setup='import pandas as pd;data=pd.DataFrame(data=[[0,0],[0,0]],dtype="int")')

Benchmark Results Summary

Here’s a summary of the results from above (decimals truncated at 3 digits). The rightmost column shows the results normalized so the fastest method (incrementing a simple variable) equals 1.

Actual (seconds) Normalized (seconds)
Simple variable 0.092 1
Dict + Counter 0.683 7.398
Numpy 2D array 0.890 9.639
Pandas DataFrame 157.564 1704.784

As you can see, resolving the index references in the middle two cases (Dict + Counter in one case, NumPy array indices in the other) slows things down, which should come as no surprise. The NumPy array is a little slower than the Dict + Counter.

The DataFrame, however, is about 150 – 200 times slower than either of those two methods. Ouch!

I can’t really even give you a graph of all four of these methods together because the time consumed by the DataFrame throws the chart scale out of whack.

Here’s a bar chart of the first three methods –

A bar chart of the first three methods in the preceding table

Here’s a bar chart of all four –

A bar chart of all four methods in the preceding table

Why Is My DataFrame Access So Slow?

One of the nice features of DataFrames is that they support dictionary-like labels for rows and columns. For instance, if I define my frequencies to look like this –

>>> SHAPES = ('square', 'circle', 'star', )
>>> COLORS = ('red', 'blue', 'orange')
>>> pd.DataFrame(columns=SHAPES, index=COLORS, data=0, dtype='int')
        square  circle  star
red          0       0     0
blue         0       0     0
orange       0       0     0

Then frequencies['square']['orange'] is a valid reference.

Not only that, DataFrames support a variety of indexing and slicing options including –

  • A single label, e.g. 5 or 'a'
  • A list or array of labels ['a', 'b', 'c']
  • A slice object with labels 'a':'f'
  • A boolean array
  • A callable function with one argument

Here are those techniques applied in order to the frequencies DataFrame so you can see how they work –

>>> frequencies['star']
red       0
blue      0
orange    0
Name: star, dtype: int64
>>> frequencies[['square', 'star']]
        square  star
red          0     0
blue         0     0
orange       0     0
>>> frequencies['red':'blue']
      square  circle  star
red        0       0     0
blue       0       0     0
>>> frequencies[[True, False, True]]
        square  circle  star
red          0       0     0
orange       0       0     0
>>> frequencies[lambda x: 'star']
red       0
blue      0
orange    0
Name: star, dtype: int64

This flexibility has a price. Slicing (which is what is invoked by the square brackets) calls an object’s __getitem__() method. The parameter to __getitem__()  is the whatever was inside the square brackets. A DataFrame’s __getitem__() has to figure out what the passed parameter represents. Determining whether the parameter is a label reference, a callable, a boolean array, or something else takes time.

If you look at the DataFrame’s __getitem__() implementation, you can see all the code that has to execute to resolve a reference. (I linked to the version of the code that was current when I wrote this in February of 2017. By the time you read this, the actual implementation may differ.) Not only does __getitem__() have a lot to do, but because I’m accessing a cell (rather than a whole row or column), there’s two slice operations, so __getitem__() gets invoked twice each time I increment my counter.

This explains why the DataFrame is so much slower than the other methods. The dictionary and Counter both only support key lookup in a hash table, and a NumPy array has far fewer slicing options than a DataFrame, so its __getitem__() implementation can be much simpler.

Better DataFrame Indexing?

DataFrames support a few methods that exist explicitly to support “fast” getting and setting of scalars. Those methods are .at() (for label lookups) and .iat() (for integer-based index lookups). It also provides get_value() and set_value(), but those methods are deprecated in the version I have (0.19.2).

“Fast” is how the Panda’s documentation describes these methods. Let’s use timeit to get some hard data. I’ll try at() and iat(); I’ll also try get_value()/set_value() even though they’re deprecated.

>>> timeit.timeit("['red','square']+=1",setup="import pandas as pd;data=pd.DataFrame(columns=('square','circle','star'),index=('red','blue','orange'),data=0,dtype='int')")
>>> timeit.timeit('data.iat[0,0]+=1',setup='import pandas as pd;data=pd.DataFrame(data=[[0,0],[0,0]],dtype="int")')
>>> timeit.timeit('data.set_value(0,0,data.get_value(0,0)+1)',setup='import pandas as pd;data=pd.DataFrame(data=[[0,0],[0,0]],dtype="int")')

These methods are better, but they’re still pretty bad. Let’s put those numbers in context by comparing them to other techniques. This time, for normalized results, I’m going to use my Dict + Counter method as the baseline of 1 and compare all other methods to that. The row “DataFrame (naïve)” refers to naïve slicing, like frequencies[0][0].

Actual (seconds) Normalized (seconds)
Dict + Counter 0.683 1
Numpy 2D array 0.890 1.302
DataFrame (get/set) 15.050 22.009
DataFrame (at) 36.331 53.130
DataFrame (iat) 42.015 61.441
DataFrame (naïve) 157.564 230.417

The best I can do with a DataFrame uses deprecated methods, and is still over 20 times slower than the Dict + Counter. If I use non-deprecated methods, it’s over 50 times slower.


I like label-based access to my frequency counters, I like the way I can manipulate data in a DataFrame (not shown here, but it’s useful in my real-world code), and I like speed. I don’t necessarily need blazing fast speed, I just don’t want slow.

I can have my cake and eat it too by combining methods. I do my counting with the Dict + Counter method, and use the result as initialization data to a DataFrame constructor.

SHAPES = ('square', 'circle', 'star', )
frequencies = {shape: collections.Counter() for shape in SHAPES}
for shape, color in all_my_objects:
    frequencies[shape][color] += 1

frequencies = pd.DataFrame(data=frequencies)

The frequencies DataFrame now looks something like this –

         circle square star
 blue         8     41   18
 orange       5     33   25
 red         53     64   58

The rows and columns appear in essentially random order; they’re ordered by whatever order Python returns the dict keys during DataFrame initialization. Getting them in a specific order is left as an exercise for the reader.

There’s one more detail to be aware of. If a particular (shape, color) combination doesn’t appear in my data, it will be represented by NaN in the DataFrame. They’re easy to set to 0 with frequencies.fillna(0).


What I was trying to do with Pandas – unfortunately, the very first thing I ever tried to do with it – didn’t play to its strengths. It didn’t break my code, but it slowed it down by a factor of ~1700. Since I had thousands of items to process, the difference was hard to overlook!

Pandas looks great for some things, and I expect I’ll continue using it. This was just a bump in the road, albeit an interesting one.

Coercing Objects to Integer, Revisited


I recently wrote a blog post that involved exception handling, and gave short shrift to the part of exception handling I didn’t want to talk about in order to focus on the part I did want to talk about. For some readers, that clearly backfired.


My recent blog post about coercing Python objects to integers caught people’s attention in a way I hadn’t intended. The point I was trying to make was that an innocent-looking call like int(an_object) calls the method an_object.__int__(), and since that can be arbitrary code, it can raise arbitrary exceptions. Therefore, it’s insufficient to catch only the usual exceptions of ValueError and TypeError if you don’t know the type of an_object in advance.

Here’s the code I suggested –

def int_or_else(value, else_value=None):
    """Given a value, returns the value as an int if possible.
    If not, returns else_value which defaults to None.
        return int(value)
    # I don't like catch-all excepts, but since objects can raise arbitrary
    # exceptions when executing __int__(), then any exception is
    # possible here, even if only TypeError and ValueError are
    # really likely.
    except Exception:
        return else_value

Several commenters objected to the fact that this code discards (and therefore silences/masks/hides) all exceptions. Here’s why I made that choice.

The Two Parts of Exception Handling

In Python, there’s two parts to consider about exception handling — what to catch, and what to do with the exception once you’ve caught it. My intention was to write only about the former.

The latter is an interesting topic, too. Once you’ve caught an exception, you might want to log it and then discard it, log it and then re-raise it, re-raise it as a different exception, silence it, let it pass up to the caller, modify its attributes and re-raise it, etc. There’s enough material for an entire blog post about different ways to react to an exception, and the pros and cons of each.

Someday I might write that post about different ways to react to trapped exceptions, and if I do, I’ll dedicate the entire post to the subject to give it the attention it deserves. That other blog post – that was not it. In fact, it was the opposite. I gave the topic of processing the trapped exception as little attention as possible so as not to detract attention from what I wanted to be the main topic (what exceptions need to be trapped).

That backfired.


My post was not advocacy of discarding exceptions, nor was it advocacy of not discarding exceptions. What’s the right choice? It depends. One situation where you might want to discard exceptions is in a blog post where you’re trying to keep the code as brief as possible for readability. Then again, you might regret that. :-)

In the future, I’ll be clearer about what shortcuts I’m taking for brevity of presentation.

Agree? Disagree? I’d like to hear from you. I like it when people agree with me. Those who disagree can expand my horizons, and I like that too. In short, all civil comments are welcome. I feel I’ve spent enough time thinking about this topic for now, but that doesn’t make me right! Let me know what you think.

A Postcard of Tunisia

Earlier on this blog I briefly mentioned working with some Libyans in Tunis, the capital of Tunisia. We chose to meet at that location because it’s close to Libya but much safer than Tripoli. Now that I’ve been back for a while and had a chance to catch up, I wanted to write more about my experience.

A photo of the translator translating live
English-to-Arabic translation on the fly!


I was there with Tobias McNulty of Caktus Group. We (Tobias and I) trained the Libyan employees of Libya’s High National Election Commission (HNEC) in the maintenance and use of the HNEC-commissioned SMS-based voter registration system that I had helped to develop while working with Caktus. The system has been open sourced as Smart Elect.

If the big picture was promoting democracy, the medium picture was training system admins and developers. And the very small picture was working together on the nitty gritty of features and bug fixes, like figuring out that if a @property method raises an exception when invoked by hasattr(), the exception isn’t propagated under Python 2.7.

The admin training consisted of a comprehensive review of the system, including the obscure corners and edge case handling. The developers were eager to get their hands dirty, so after some organizational review, we dove into fixing bugs and implementing some new features that HNEC wanted.

A photo of a trainee
Abdullah (Photo by Tobias McNulty)

Tobias and I worked with the developers as both mentors and peers. Grinding through bugs from start to finish was really valuable. Our trainees have good development experience, but working in groups with us allowed them to participate in our approach to debugging, problem reporting, development, and test. It seemed a little different from what they were used to. We were very methodical about creating an issue in our tracker, creating a branch for that issue, reviewing one another’s code, documenting the fix, etc. “It’s a lot of process,” said one trainee after working through one particular bug with us. He’s right. I wish I had thought to ask if Libyan culture has a proverb similar to “For want of a nail…“. I could have said, “For want of filing an issue in the tracker, a voter was disenfranchised,” but it doesn’t have the same ring to it.

A photo of Tobias and a trainee
Tobias and Ahmed

This was my first trip to Africa, and, grand notions aside, what stood out to me was how mundane much of the experience was. The guys we worked with would have fit right in at any coding meetup I’ve been to. They had opinions about laptops. They were distracted by their phones. Everyone enjoyed a successful bug hunt. I remember one trainee being tired at 5PM, saying he had no more left in him, and seeing him there grinning 2 hours later when we finally solved the problem we’d been working on.

Outside of the training, I especially enjoyed the dinners at Sakura/Pasta Cosy and Chez Zina (my favorites, in that order).

We also ate at Le Bon Vieux Temps, where the handwritten chalkboard menu is carted from table to table on a charming-but-impractical frame. Tunisia is principally French speaking, with Arabic on an almost equal footing. At Le Bon Vieux Temps (“The Good Old Times”), the menu was all in French, and my vestigial French came in handy for translating the menu into English for the Libyans who in turn peppered the waiter with questions in Arabic. (That night in the restaurant began and ended my career as a French-to-English translator.)

On the weekends we rested, walked in the city, and paid a visit to the Bardo National Museum. The Bardo was famously attacked in 2015, and has since sprouted a razor wire fence around the entire property. Bored soldiers sat on a truck by the gate and motioned us to enter. It’s a nice museum, and I’m glad I went.

A photo of my entrance pass to the Bardo Museum

Inside the classroom and out, I got to know and really like our Libyan colleagues. They were generous with their good humor and kindness. If they lacked anything, it was a willingness to complain.

Libya is a difficult place to live at the moment. I think we all know that in an abstract sense, but talking to my Libyan friends made it more concrete for me. Banks don’t have enough cash. Electricity isn’t reliable. People they know have been kidnapped. My friends have a lot on their minds, and yet they found rooom to squeeze in opinions about good software development practices.

A photo of a trainee

I’m glad I got the chance to go, and to get to know the people I did. In addition to working with Tobias and the Libyans, I had a lot of non-work experiences I’ll remember for a long time. I walked among ruins in Carthage that are over 2000 years old. I drove solo (and lost) through rush hour traffic in Tunis and survived. I saw a Tunisian wedding, and got to use the word “ululating” for the first time outside of Scrabble or Bananagrams. I swam in the Mediterranean. I saw flocks of flamingoes (many, many thanks to Hichem and Claudia of Les Amis des Oiseaux).

HNEC is now better positioned than ever to use the Smart Elect system, and I hope they do so again soon. That’s partly for egotistical reasons — I like to see my work get used. Who doesn’t? But more importantly, if it gets used, that means Libyans are voting to determine their own future.

How Best to Coerce Python Objects to Integers?


In my opinion, the best way in Python to safely coerce things to integers requires use of an (almost) “naked” except, which is a construct I rarely want to use. Read on to see how I arrived at this conclusion, or you can jump ahead to what I think is the best solution.

The Problem

Suppose you had to write a Python function to convert to integer string values representing temperatures, like this list —

['22', '24', '24', '24', '23', '27']

The strings come from a file that a human has typed in, so even though most of the values are good, a few will have errors ('25C') that int() will reject.

Let’s Explore Some Solutions

You might write a function like this —

def force_to_int(value):
    """Given a value, returns the value as an int if possible.
    Otherwise returns None.
        return int(value)
    except ValueError:
        return None

Here’s that function in action at the Python prompt —

>>> print(force_to_int('42'))
>>> print(force_to_int('oops'))

That works! However, it’s not as robust as it could be.

Suppose this function gets input that’s even more unexpected, like None

>>> print(force_to_int(None))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in force_to_int
TypeError: int() argument must be a string or a number, not 'NoneType'

Hmmm, let’s write a better version that catches TypeError in addition to ValueError

def force_to_int(value):
    """Given a value, returns the value as an int if possible.
    Otherwise returns None.
        return int(value)
    except (ValueError, TypeError):
        return None

Let’s give that a try at the Python prompt —

>>> print(force_to_int(None))

Aha! Now we’re getting somewhere. Let’s try some other types —

>>> import datetime
>>> print(force_to_int(
>>> print(force_to_int({}))
>>> print(force_to_int(complex(3,3)))
>>> print(force_to_int(ValueError))

OK, looks good! Time to pop open a cold one and…

Wait, I can still feed input to this function that will break it. Watch this —

>>> class Unintable():
 ...    def __int__(self):
 ...        raise ArithmeticError
 >>> trouble = Unintable()
 >>> print(force_to_int(trouble))
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "<stdin>", line 6, in force_to_int
   File "<stdin>", line 3, in __int__


While the class Unintable is contrived, it reminds us that classes control their own conversion to int, and can raise any error they please, even a custom error. A scenario that’s more realistic than the Unintable class might be a class that wraps an industrial sensor. Calling int() on an instance normally returns a value representing pressure or temperature. However, it might reasonably raise a SensorNotReadyError.

And Finally, the Naked Except

Since any exception is possible when calling int(), our code has to accomodate that. That requires the ugly “naked” except. A “naked” except is an except statement that doesn’t specify which exceptions it catches, so it catches all of them, even SyntaxError. They give bugs a place to hide, and I don’t like them. Here, I think it’s the only choice —

def force_to_int(value):
    """Given a value, returns the value as an int if possible.
    Otherwise returns None.
        return int(value)
        return None

At the Python prompt —

>>> print(int_or_else(trouble))

Now the bones of the function are complete.

Complete, Except For One Exception

Graham Dumpleton‘s comment below pointed out that there’s a difference between what I call a ‘naked’ except —


And this —

except Exception:

The former traps even SystemExit which you don’t want to trap without good reason. From the Python documentation for SystemExit —

It inherits from BaseException instead of Exception so that it is not accidentally caught by code that catches Exception. This allows the exception to properly propagate up and cause the interpreter to exit.

The difference between these two is only a side note here, but I wanted to point it out because (a) it was educational for me and (b) it explains why I’ve updated this post to hedge on what I was originally calling a ‘naked’ except.

The Final Version

We can make this a bit nicer by allowing the caller to control the non-int return value, giving the “naked” except a fig leaf, and changing the function name —

def int_or_else(value, else_value=None):
    """Given a value, returns the value as an int if possible. 
    If not, returns else_value which defaults to None.
        return int(value)
    # I don't like catch-all excepts, but since objects can raise arbitrary
    # exceptions when executing __int__(), then any exception is
    # possible here, even if only TypeError and ValueError are 
    # really likely.
    except Exception:
        return else_value

At the Python prompt —

>>> print(int_or_else(trouble))
>>> print(int_or_else(trouble, 'spaghetti'))

So there you have it. I’m happy with this function. It feels bulletproof. It contains an (almost) naked except, but that only covers one simple line of code that’s unlikely to hide anything nasty.

You might also want to read a post I made about the exception handling choices in this post.

I release this code into the public domain, and I’ll even throw in the valuable Unintable class for free!

The image in this post is public domain and comes to us courtesy of Wikimedia Commons.

Creating PDF Documents Using LibreOffice and Python, Part 4

This is the fourth and final post in a series on creating PDFs using LibreOffice and Python. The first three parts are here:

They’re all a supplement to a talk I gave at PyOhio 2016.

This final post is here to point you to a working code example that you can download from my Bitbucket repository. It’s enough to get you started so you can experiment with your own goals in mind.

One thing I mention in the code that’s worth repeating here is that the code uses ElementTree to manipulate XML. It’s sufficient for this demo, and the fact that it’s part of the Python standard library means you can run the demo without installing any third party libraries. For real world (i.e. non-demo) usage, I recommend lxml as a more robust and helpful alternative to ElementTree.

A Curious Coincidence: Stinkin’ Badges

Treasure of the Sierra Madre movie posterThe title of my PyOhio talk was “We Don’t Need No Stinkin’ PDF Library: Build PDFs with Python the Lazy Way”. You know the “we don’t need no stinkin’ [whatever]” meme, don’t you? It’s from the Mel Brooks movie Blazing Saddles. (You can find the clip on YouTube.) Did you know that Blazing Saddles is quoting another movie?

The night before I gave my talk, I walked from my AirBnB to a nearby bar and bottle shop. (It’s simply called “The Bottle Shop”. Ohioans are plain dealers, apparently). I settled in there, happy with a pint of stout. On the big screen they were playing an old black and white Western — The Treasure of the Sierra Madre.

I didn’t realize until it happened on the screen that this movie is the inspiration for the “We don’t need no stinkin’ badges” quote, although no one ever actually says “We don’t need no stinkin’ badges”. The actual line is “Badges? We ain’t got no badges. We don’t need no badges! I don’t have to show you any stinkin’ badges!”

It’s pretty close to the line from B. Traven’s novel of the same name.

I didn’t have time in my talk to mention Blazing Saddles, the mysterious B. Traven, The Treasure of the Sierra Madre, Humphrey Bogart, The Bottle Shop, nor the stout. But I was amused by our brief coincidence in Columbus.

Just in Time to Vote!

I just returned from Tunisia a couple of days ago.

In 2014 and 2015 I worked for Caktus Group to help develop an SMS-based voter registration system on behalf of the Libyan government (specifically the High National Election Commission, or HNEC). The open source version of this system is called Smart Elect.

For the last three weeks in Tunisia (which is next door to Libya and a whole lot safer), Caktus’ Tobias McNulty and I trained a dozen HNEC employees on how to use, develop, and maintain the system. We talked about Python, Django, open source culture, GitHub Flow, and, of course, the upcoming U.S. election.

On the eve of that election, I thought it appropriate to express gratitude for my opportunity to participate in the messy sausage-making that is democracy. Good luck to our new Libyan friends; I hope they get the opportunity to do the same in the very near future.



Creating PDF Documents Using LibreOffice and Python, Part 3

This is part 3 of a 4-part series on creating PDFs using LibreOffice. You should read part 1 and part 2 if you haven’t already. This series is a supplement to a talk I gave at PyOhio 2016.

Here in part 3, I review the conversation we (the audience and I) had at the end of the PyOhio talk. I committed the speaker’s cardinal sin of not repeating (into the microphone) the questions people asked, so they’re inaudible in the video. In addition, we had some interesting conversations among multiple people that didn’t get picked up by the microphone. I don’t want them to get lost, so I summarized them here.

The most interesting thing I learned out of this conversation is that LibreOffice can open PDFs; once opened they’re like an ordinary LibreOffice document. You can edit them, save them to ODF, export to PDF, etc. Is this cool, or what?

First Question: What about Using Excel or Word?

One of the attendees jumped in to confirm that modern MS Word formats are XML-based. However, he went on to say, the XML contains a statement at the top that says something like “You cannot legally read the rest of this file”. I made a joke about not having one’s lawyer present when reading the file.

In all seriousness, I can’t find anything online that suggests that Microsoft’s XML contains a warning like that, and the few examples I looked at didn’t have have any such warning. If you can shed any light on this, please do so in the comments!

We also discussed the fact that one must invoke the office app (LibreOffice or Word, Excel, etc.) in order to render the document to PDF. LibreOffice has a reputation for performing badly when invoked repeatedly for this purpose. LibreOffice 5 may have addressed some of these problems, but as of this writing it’s still pretty new so the jury is still out on how this will work in practice.

Another attendee noted that Microsoft can save to LibreOffice format, so if Word (or Excel) is your document-editing tool of choice, you can still use LibreOffice to render it to PDF. That’s really useful if MS Office is your tool of choice but you’re doing rendering on a BSD/Linux server.

Question 2: What about Scraping PDFs?

The questioner noted that scraping a semi-complex PDF is very painful. It’d be ideal, he said, to be able to take a complex form like the 1040 and extract key value pairs of the question and answer. Is the story getting better for scraping PDFs?

My answer was that for the little experience I have with scraping PDFs, I’ve used PDFMiner, and the attendee said he was using the same.

Someone else chimed in that it’s a great use case for [Amazon’s] Mechanical Turk; in his case he was dealing with old faxes that had been scanned.

Question 3: Helper Libraries

Matt Wilson asked if it would make sense to begin building helper libraries to simplify common tasks related to manipulating LibreOffice XML. My answer was that I wasn’t sure since each project has very specific needs. Someone else suggested that one would have to start learning the spec in order to begin creating abstractions.

In the YouTube comments, Paul Hoffman1 called our attention to OdfPy a “thin abstraction over direct XML access”. It looks quite interesting.

Comment 1: Back to Scraping

One of the attendees commented that he had used Jython and PDFBox for PDF scraping. “It took a lot to get started, but once I started to figure out my way around it, it was a pretty good tool and it moved pretty speedily as compared to some of the other tools I used.” He went on to say that it was pretty complete and that it worked very well.

Question 4: About XML Parsing

The question was what I used to parse the XML, and my answer was that I used ElementTree from the standard library. Your favorite XML parsing library will work just fine.

Question 5: Protecting Bookmarks

The question was whether or not I did anything special to protect the bookmarks in the document. My answer was that I didn’t. (I’m not even sure it’s possible.) If you go through multiple rounds of editing with your client, those invisible bookmarks are inevitably going to get moved or deleted, so expect a little maintenance work related to that.

Comment 2: Weasyprint

One of the attendees commented that Weasyprint is a useful HTML/CSS to PDF converter. My observation was that tools in this class (HTML/CSS to PDF converters) are not as precise as either of the methods I outlined in this talk, but if you don’t need precision they’re probably a nice way to go.

Question 6: unoconv in a Web Server

Can one use unoconv in a Web server? My answer was that it’s possible, but it’s not practical to use it in-process. For me, it worked to do so in a demo of an intranet application, but that’s about as far as you want to go with it. It’s much more practical to use a distributed processing application (Celery, for example).

One of the attendees concurred that it makes sense to spin it off into a separate process, but “unoconv inexplicably crashes when it feels like it”.

Comment 3: Converting from Word

The initial comment was that pandoc might help with converting from Word to LibreOffice. This started a conversation which I’d summarize this way:

  • LibreOffice can open MS Office docs, so use that instead of pandoc and save as LibreOffice
  • If you open MS Office documents with LibreOffice, double check the formatting because it doesn’t always survive the transition
  • LibreOffice can open PDFs for editing.