How to Measure Anglo-Saxonicity – With a Ruler or Yardstick?

Summary (Nutshell)

This is a first look at a work in progress. I’m using Python to study text from an etymological perspective. Specifically, I’m measuring how many words in a given English language text have Anglo-Saxon origin. Many people (including myself) think that Anglo-Saxon words convey a different sense than their counterparts of French/Latin origin. To demonstrate the point in a small way, I’ve included a Latin and Anglo-Saxon version of each heading in this blog post.

Background (Milieu)

English is a Germanic language with Scandinavian influence, with a big layer of Old French poured on top. That Old French (Anglo-Norman French, to be specific) was principally derived from Latin, so English is a hybrid between two major Indo-European language groups. Those mongrel origins are a big part of why English is messy and rich.

French was introduced to English as the language of conquerors and nobility. French was also the language of some European royalty in the 18th and 19th centuries, further adding to its reputation as a language associated with high status. Even today, English words with French origins often have higher cultural status than their counterparts with Anglo-Saxon origins (think cuisine versus cooking, illumination versus light, create versus make, and escargot versus snail). By contrast, the Anglo-Saxon words are often considered more visceral (think sea versus ocean, sweat versus perspire, and free versus emancipated; more on that last pair in a moment).

For instance, when taunting someone, you reach for blunt Anglo-Saxon words. “Your mother was a hamster, and your father smelled of elderberries!” is 100% Anglo-Saxon, except for “elderberries” which was coined in Middle English from “elder” and “berry”, both of Anglo-Saxon origin.

A still of the French taunting King Arthur from Monty Python and the Holy Grail
William of Normandy in 1067, addressing his English subjects.

Legal documents and government issuances, on the other hand, tend to include more words of Latin/French origin. It’s no coincidence that the Latin/French words “Emancipation Proclamation” describe a legal act, but if you want to stir the heart about emancipation, you say something like “Free at last!” (1), which is all Anglo-Saxon.

Others have written more eloquently than I about how word origin influences tone (Annalisa Quinn at NPR, Gemma Varnom, and M. Birch, to suggest a few), so I won’t belabor the point more than I already have. But I wanted to talk about how it inspired the project I’ve been working on.

The Project (The Work)

I should preface this by saying that I Am Not A Linguist, and I don’t even play one on TV.

I thought it would be interesting to perform lexicographical analysis of text from an etymological perspective. My etymological categorization is necessarily simple. When I look at a text, I put each word into one of three etymological categories: Anglo-Saxon, non-Anglo-Saxon, or unknown. From this rough grouping I generate statistics that allow me to compare one text to another.
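To make the three-way grouping concrete, here is a minimal sketch of the idea. It is not the actual tool: the ETYMOLOGY dictionary is a tiny hypothetical stand-in for a real etymological database, the function name categorize is made up for illustration, and the sketch skips stemming entirely.

```python
import re
from collections import Counter

# Hypothetical, hand-made lookup table standing in for a real
# etymological database. The entries are word pairs mentioned in this post.
ETYMOLOGY = {
    'free': 'Anglo-Saxon',
    'at': 'Anglo-Saxon',
    'last': 'Anglo-Saxon',
    'sea': 'Anglo-Saxon',
    'sweat': 'Anglo-Saxon',
    'make': 'Anglo-Saxon',
    'ocean': 'non-Anglo-Saxon',
    'perspire': 'non-Anglo-Saxon',
    'create': 'non-Anglo-Saxon',
    'cuisine': 'non-Anglo-Saxon',
}


def categorize(text, etymology=ETYMOLOGY):
    """Count the words of text in each etymological category.

    Any word missing from the lookup table is counted as 'unknown'.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(etymology.get(word, 'unknown') for word in words)


counts = categorize("Free at last! Alice crossed the ocean.")
```

With this toy table, "free", "at", and "last" count as Anglo-Saxon, "ocean" as non-Anglo-Saxon, and everything else lands in "unknown", which is also what happens to any word a real database doesn't cover.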

For instance, does one author consistently use more Anglo-Saxon words than other authors? Does an author’s usage of Anglo-Saxon words change from one work to another? Also of interest to me is the etymology of words as the book progresses from front to back. Do the relative frequencies of etymologies change as the book progresses towards its exciting conclusion? For authors writing in English as a second language, is their word selection influenced by their first language?

All of the questions above can be explored with the tool I’ve written. It’s easier to show the tool’s output than describe it, so here’s an analysis of Lewis Carroll’s 1865 work “Alice’s Adventures in Wonderland”.

The graph below shows the relative frequency of the three etymological categories as the book progresses from beginning to end.

A graphical representation of how the etymological ratio of Alice in Wonderland changes as one progresses through the book

This graph shows the relative frequency of the three etymological categories as counting statistics for various part-of-speech categories.

A graphical representation of the counts of words by parts of speech and etymology in Alice in Wonderland

The table below is a more detailed version of the chart immediately above. Some percentages may not add up to 100% due to rounding.

            Total  % of All Words  Anglo-Saxon   non-Anglo-Saxon  Unknown
All Words   26624  100%            18233 (68%)   3812 (14%)       4579 (17%)
Unique       3528   13%             1354 (38%)    899 (25%)       1275 (36%)
Nouns        8522   32%             4521 (53%)   2354 (27%)       1647 (19%)
Verbs        5479   20%             2994 (54%)    565 (10%)       1920 (35%)
Adjectives   1639    6%              896 (54%)    375 (22%)        368 (22%)
Adverbs      1974    7%             1348 (68%)    420 (21%)        206 (10%)
Other        9010   33%             8474 (94%)     98 (1%)         438 (4%)

Observations (What I See)

There are some minor observations to be made here, but the strength of this tool will be in comparative analysis. It’s hard to draw conclusions from one analysis before I have an idea of what’s typical.

For instance, at first glance, the ratio of Anglo-Saxon to non-Anglo-Saxon words looks dramatic, but this says more about English than it does about Carroll. The most common words in English are overwhelmingly Anglo-Saxon in origin. (2)  For the small sample size of works I’ve processed so far (just 8 in total), I can see that it’s common for roughly three quarters of the words to be Anglo-Saxon. Alice in Wonderland isn’t an outlier by that standard.

We can also see that the frequency of Anglo-Saxon words decreases slightly throughout the book. This is the kind of trend that I find interesting, but in this case it’s due to an increase in the number of words of unknown etymology. Sometimes a word’s etymology is truly unknown. More often, though, the etymology is classified as unknown for other reasons. Most likely, it’s simply not in my etymological database (which isn’t very complete yet). Also, the word could be a proper noun, an invented word (like “woodshadows” from James Joyce’s Ulysses), or a word for which the etymology is ambiguous. An example of this last category is “bank” which is Anglo-Saxon in origin when referring to the side of a river, but French/Italian in origin when referring to a place that handles money.

At present, the quantity of words classified as “unknown” is too large for my tastes, and I plan to reduce it by improving both my database and the tool.

Verbs are overrepresented in the “unknown” category. My guess is that this is an artifact of my stemmer having difficulty stemming verbs. (I’m currently using the Snowball Stemmer from NLTK.)

As you can see, at this point it’s easier to draw conclusions about the representation of the data than it is about the data themselves. That leads me to the next (and final) topic in this post.

Future (What’s to Come)

As I said in the introduction, this is an early look at a work in progress. Here are some of the things I’d like to add –

  • Better etymological data
  • Large scale comparisons of text to look for trends (across authors, genres, etc.)
  • More numeric (rather than visual) descriptions of the data to facilitate automated comparison. One idea is to add the mean and standard deviation of the percentage of Anglo-Saxon words.
  • Open sourcing
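The numeric summary idea in the list above could look something like the sketch below. It assumes each word has already been given a category label, splits the labels into fixed-size chunks in reading order, and reports the mean and standard deviation of the Anglo-Saxon percentage across chunks. The function names and the chunking scheme are my own illustration, not part of the existing tool.

```python
import statistics


def anglo_saxon_percentages(categories, chunk_size):
    """Return the percentage of Anglo-Saxon words in each chunk.

    categories is a list of category labels, one per word, in reading order.
    """
    percentages = []
    for start in range(0, len(categories), chunk_size):
        chunk = categories[start:start + chunk_size]
        anglo_saxon = sum(1 for label in chunk if label == 'Anglo-Saxon')
        percentages.append(100.0 * anglo_saxon / len(chunk))
    return percentages


def summarize(percentages):
    """Return (mean, population standard deviation) of the percentages."""
    return statistics.mean(percentages), statistics.pstdev(percentages)


# Made-up labels for an eight-word "book", split into two chunks of four.
labels = (['Anglo-Saxon'] * 3 + ['unknown'] +
          ['Anglo-Saxon'] * 2 + ['non-Anglo-Saxon'] * 2)
per_chunk = anglo_saxon_percentages(labels, chunk_size=4)
mean, spread = summarize(per_chunk)
```

Two numbers per text would make it easy to compare many works automatically, at the cost of hiding where in the book the changes happen.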

If you have any suggestions on how to use this tool or make it more interesting, I’d love to hear them in the comments below. I moderate all comments to filter spam, which is yet another Viking influence on England.

Endnotes

Like English itself, “Endnote” is an etymological hybrid. “End” is of Anglo-Saxon origin, while  “note” comes from Old French/Latin.

1. Martin Luther King, Jr. isn’t the only person to have said “Free at last!”, but his use of it is perhaps the most famous. His “I Have a Dream” speech makes brilliant use of etymological contrasts. Many of his memorable phrases in that speech (“I have a dream today”, “Let freedom ring”, “Free at last”) are Anglo-Saxon.

2. In 2014 I pulled from Wikipedia a list of the 100 most common English words. At the time, it contained just four non-Anglo-Saxon words. They were “just” (ME < Latin), “people” (ME < Anglo-French < Latin), “use” (ME < Old French, replaced OE brucan, cognate w/modern Swedish bruk-), and “because” (ME < Fr ‘par cause’). There are lots of ways to count the 100 most common words, and doubtless the list would have been different in Carroll’s day. But my guess is that the presence of Anglo-Saxon hasn’t changed dramatically from that 96% regardless of when and how one counts.

Pandas Surprise

Summary

Part of learning how to use any tool is exploring its strengths and weaknesses. I’m just starting to use the Python library Pandas, and my naïve use of it exposed a weakness that surprised me.

Background

A photo of the many shapes and colors in Lucky Charms cereal
Thanks to bradleypjohnson for sharing this Lucky Charms photo under CC BY 2.0.

I have a long list of objects, each with the properties “color” and “shape”. I want to count the frequency of each color/shape combination. A sample of what I’m trying to achieve could be represented in a grid like this –

       circle square star
blue        8     41   18
orange      5     33   25
red        53     64   58

At first I implemented this with a dictionary of collections.Counter instances where the top level dictionary is keyed by shape, like so –

import collections
SHAPES = ('square', 'circle', 'star', )
frequencies = {shape: collections.Counter() for shape in SHAPES}

Then I counted my frequencies using the code below. (For simplicity, assume that my objects are simple 2-tuples of (shape, color)).

for shape, color in all_my_objects:
    frequencies[shape][color] += 1

So far, so good.

Enter the Pandas

This looked to me like a perfect opportunity to use a Pandas DataFrame which would nicely support the operations I wanted to do after tallying the frequencies, like adding a column to represent the total number (sum) of instances of each color.

It was especially easy to try out a DataFrame because my counting loop (for ... in all_my_objects) wouldn’t change, only the definition of frequencies. (Note that the code below requires I know in advance all the possible colors I can expect to see, which the Dict + Counter version does not. This isn’t a problem for me in my real-world application.)

import pandas as pd
COLORS = ('red', 'blue', 'orange')  # every possible color, known in advance
frequencies = pd.DataFrame(columns=SHAPES, index=COLORS, data=0,
                           dtype='int')
for shape, color in all_my_objects:
    frequencies[shape][color] += 1

It Works, But…

Both versions of the code get the job done, but using the DataFrame as a frequency counter turned out to be astonishingly slow. A DataFrame is simply not optimized for repeatedly accessing individual cells as I do above.

How Slow is it?

To isolate the effect pandas was having on performance, I used Python’s timeit module to benchmark some simpler variations on this code. In the version of Python I’m using (3.6), the default number of iterations for each timeit test is 1 million.

First, I timed how long it takes to increment a simple variable, just to get a baseline.

Second, I timed how long it takes to increment a variable stored inside a collections.Counter inside a dict. This mimics the first version of my code (above) for a frequency counter. It’s more complex than the simple variable version because Python has to resolve two hash table references (one inside the dict, and one inside the Counter). I expected this to be slower, and it was.

Third, I timed how long it takes to increment one cell inside a 2×2 NumPy array. Since Pandas is built atop NumPy, this gives an idea of how the DataFrame’s backing store performs without Pandas involved.

Fourth, I timed how long it takes to increment one cell inside a 2×2 Pandas DataFrame. This is what I had used in my real code.

Raw Benchmark Results

Here’s what timeit showed me. Sorry for the cramped formatting.

$ python
 Python 3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13)
 [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import timeit
 >>> timeit.timeit('data += 1', setup='data=0')
 0.09242476700455882
 >>> timeit.timeit('data[0][0]+=1',setup='from collections import Counter;data={0:Counter()}')
 0.6838196019816678
 >>> timeit.timeit('data[0][0]+=1',setup='import numpy as np;data=np.zeros((2,2))')
 0.8909121589967981
 >>> timeit.timeit('data[0][0]+=1',setup='import pandas as pd;data=pd.DataFrame(data=[[0,0],[0,0]],dtype="int")')
 157.56428507200326
 >>>

Benchmark Results Summary

Here’s a summary of the results from above (decimals truncated at 3 digits). The rightmost column shows the results normalized so the fastest method (incrementing a simple variable) equals 1.

                  Actual (seconds)  Normalized
Simple variable              0.092       1
Dict + Counter               0.683       7.398
NumPy 2D array               0.890       9.639
Pandas DataFrame           157.564    1704.784

As you can see, resolving the index references in the middle two cases (Dict + Counter in one case, NumPy array indices in the other) slows things down, which should come as no surprise. The NumPy array is a little slower than the Dict + Counter.

The DataFrame, however, is roughly 175 to 230 times slower than either of those two methods. Ouch!
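For anyone who wants to check the arithmetic, here is a small sketch that recomputes the ratios from the raw timeit numbers shown earlier. The timings are copied from my benchmark run; the normalize() helper is just something I made up for this illustration.

```python
# Raw timings (seconds per 1 million iterations) copied from the
# timeit session above.
timings = {
    'Simple variable': 0.09242476700455882,
    'Dict + Counter': 0.6838196019816678,
    'NumPy 2D array': 0.8909121589967981,
    'Pandas DataFrame': 157.56428507200326,
}


def normalize(timings, baseline):
    """Express every timing as a multiple of the baseline method's timing."""
    return {name: seconds / timings[baseline]
            for name, seconds in timings.items()}


vs_variable = normalize(timings, 'Simple variable')
vs_counter = normalize(timings, 'Dict + Counter')
df_vs_numpy = timings['Pandas DataFrame'] / timings['NumPy 2D array']

for name, ratio in vs_variable.items():
    print(f'{name:<18}{ratio:10.3f}')
```

The same helper with 'Dict + Counter' as the baseline shows how the DataFrame compares to the approach I started with.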

I can’t really even give you a graph of all four of these methods together because the time consumed by the DataFrame throws the chart scale out of whack.

Here’s a bar chart of the first three methods –

A bar chart of the first three methods in the preceding table

Here’s a bar chart of all four –

A bar chart of all four methods in the preceding table

Why Is My DataFrame Access So Slow?

One of the nice features of DataFrames is that they support dictionary-like labels for rows and columns. For instance, if I define my frequencies to look like this –

>>> SHAPES = ('square', 'circle', 'star', )
>>> COLORS = ('red', 'blue', 'orange')
>>> pd.DataFrame(columns=SHAPES, index=COLORS, data=0, dtype='int')
        square  circle  star
red          0       0     0
blue         0       0     0
orange       0       0     0
>>>

Then frequencies['square']['orange'] is a valid reference.

Not only that, DataFrames support a variety of indexing and slicing options including –

  • A single label, e.g. 5 or 'a'
  • A list or array of labels ['a', 'b', 'c']
  • A slice object with labels 'a':'f'
  • A boolean array
  • A callable function with one argument

Here are those techniques applied in order to the frequencies DataFrame so you can see how they work –

>>> frequencies['star']
red       0
blue      0
orange    0
Name: star, dtype: int64
>>> frequencies[['square', 'star']]
        square  star
red          0     0
blue         0     0
orange       0     0
>>> frequencies['red':'blue']
      square  circle  star
red        0       0     0
blue       0       0     0
>>> frequencies[[True, False, True]]
        square  circle  star
red          0       0     0
orange       0       0     0
>>> frequencies[lambda x: 'star']
red       0
blue      0
orange    0
Name: star, dtype: int64

This flexibility has a price. Indexing or slicing (anything written with square brackets) calls an object’s __getitem__() method. The parameter to __getitem__() is whatever was inside the square brackets. A DataFrame’s __getitem__() has to figure out what the passed parameter represents. Determining whether the parameter is a label reference, a callable, a boolean array, or something else takes time.

If you look at the DataFrame’s __getitem__() implementation, you can see all the code that has to execute to resolve a reference. (I linked to the version of the code that was current when I wrote this in February of 2017. By the time you read this, the actual implementation may differ.) Not only does __getitem__() have a lot to do, but because I’m accessing a cell (rather than a whole row or column), there are two slice operations, so __getitem__() gets invoked twice each time I increment my counter.

This explains why the DataFrame is so much slower than the other methods. The dictionary and Counter both only support key lookup in a hash table, and a NumPy array has far fewer slicing options than a DataFrame, so its __getitem__() implementation can be much simpler.
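To see the "invoked twice" part in action without reading the Pandas source, here is a toy stand-in. TracedDict is a made-up dict subclass that records every __getitem__() call; it says nothing about how Pandas is actually implemented, it only shows what Python does with chained square brackets.

```python
# Every __getitem__() call made on a TracedDict is recorded here.
calls = []


class TracedDict(dict):
    """A dict that logs each square-bracket lookup (illustration only)."""

    def __init__(self, name, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.name = name

    def __getitem__(self, key):
        calls.append((self.name, key))
        return super().__getitem__(key)


# A one-column, one-row "table": the outer lookup is by shape,
# the inner lookup is by color.
column = TracedDict('column', orange=0)
table = TracedDict('table', square=column)

# One increment of one cell...
table['square']['orange'] += 1

# ...costs two __getitem__() calls (plus one __setitem__() on the column).
print(calls)
```

In this toy each lookup is a cheap hash table access. In a DataFrame, each of those two calls has to work out what kind of key it was handed before it can do anything.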

Better DataFrame Indexing?

DataFrames support a few accessors that exist explicitly to support “fast” getting and setting of scalars. Those accessors are .at (for label lookups) and .iat (for integer-based index lookups); both are used with square brackets, as in frequencies.at['red', 'square']. DataFrames also provide get_value() and set_value(), but those methods are deprecated in the version I have (0.19.2).

“Fast” is how the Pandas documentation describes these accessors. Let’s use timeit to get some hard data. I’ll try .at and .iat; I’ll also try get_value()/set_value() even though they’re deprecated.

>>> timeit.timeit("data.at['red','square']+=1",setup="import pandas as pd;data=pd.DataFrame(columns=('square','circle','star'),index=('red','blue','orange'),data=0,dtype='int')")
36.33179204000044
>>> timeit.timeit('data.iat[0,0]+=1',setup='import pandas as pd;data=pd.DataFrame(data=[[0,0],[0,0]],dtype="int")')
42.01523362501757
>>> timeit.timeit('data.set_value(0,0,data.get_value(0,0)+1)',setup='import pandas as pd;data=pd.DataFrame(data=[[0,0],[0,0]],dtype="int")')
15.050199927005451
>>>

These methods are better, but they’re still pretty bad. Let’s put those numbers in context by comparing them to other techniques. This time, for normalized results, I’m going to use my Dict + Counter method as the baseline of 1 and compare all other methods to that. The row “DataFrame (naïve)” refers to naïve slicing, like frequencies[0][0].

                     Actual (seconds)  Normalized
Dict + Counter                  0.683       1
NumPy 2D array                  0.890       1.302
DataFrame (get/set)            15.050      22.009
DataFrame (at)                 36.331      53.130
DataFrame (iat)                42.015      61.441
DataFrame (naïve)             157.564     230.417

The best I can do with a DataFrame uses deprecated methods, and is still over 20 times slower than the Dict + Counter. If I use non-deprecated methods, it’s over 50 times slower.
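Outside of a timeit one-liner, the non-deprecated scalar accessors read like this. This is just a small sketch using the same SHAPES and COLORS as elsewhere in this post; it shows the syntax, not a recommendation to count this way.

```python
import pandas as pd

SHAPES = ('square', 'circle', 'star')
COLORS = ('red', 'blue', 'orange')

# A grid of zeros: one row per color, one column per shape.
frequencies = pd.DataFrame(0, index=list(COLORS), columns=list(SHAPES),
                           dtype='int64')

# .at takes (row label, column label).
frequencies.at['red', 'square'] += 1

# .iat takes (row position, column position); (0, 0) is the same cell.
frequencies.iat[0, 0] += 1

print(frequencies)
```

Note that both accessors take the row first and the column second, which is the reverse of the frequencies[shape][color] chained lookup.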

Workaround

I like label-based access to my frequency counters, I like the way I can manipulate data in a DataFrame (not shown here, but it’s useful in my real-world code), and I like speed. I don’t necessarily need blazing fast speed, I just don’t want slow.

I can have my cake and eat it too by combining methods. I do my counting with the Dict + Counter method, and use the result as initialization data to a DataFrame constructor.

SHAPES = ('square', 'circle', 'star', )
frequencies = {shape: collections.Counter() for shape in SHAPES}
for shape, color in all_my_objects:
    frequencies[shape][color] += 1

frequencies = pd.DataFrame(data=frequencies)

The frequencies DataFrame now looks something like this –

         circle square star
 blue         8     41   18
 orange       5     33   25
 red         53     64   58

The rows and columns appear in essentially random order; they’re ordered by whatever order Python returns the dict keys during DataFrame initialization. Getting them in a specific order is left as an exercise for the reader.

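There’s one more detail to be aware of. If a particular (shape, color) combination doesn’t appear in my data, it will be represented by NaN in the DataFrame. Those are easy to set to 0 with frequencies = frequencies.fillna(0). (Note that fillna() returns a new DataFrame by default rather than modifying the original.)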
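Here is one way the whole workaround might look with the ordering and NaN details handled. It is a sketch under a few assumptions: the sample data in all_my_objects is made up, I convert each Counter to a plain dict before handing it to the constructor just to be conservative, and reindex() is only one of several ways to fix the row and column order.

```python
import collections

import pandas as pd

SHAPES = ('square', 'circle', 'star')
COLORS = ('red', 'blue', 'orange')

# Made-up sample data: (shape, color) 2-tuples.
all_my_objects = [('square', 'red'), ('star', 'blue'),
                  ('square', 'red'), ('circle', 'orange')]

# Count with the fast Dict + Counter approach.
counts = {shape: collections.Counter() for shape in SHAPES}
for shape, color in all_my_objects:
    counts[shape][color] += 1

# Build the DataFrame once, after counting is done.
frequencies = pd.DataFrame(
    data={shape: dict(counter) for shape, counter in counts.items()})

# Put rows and columns in a known order, then replace NaN with 0.
# NaN forces float columns, so convert back to int afterwards.
frequencies = frequencies.reindex(index=list(COLORS), columns=list(SHAPES))
frequencies = frequencies.fillna(0).astype(int)

# The kind of operation that made a DataFrame attractive in the first place:
# the total number of objects of each color.
totals = frequencies.sum(axis=1)
```

All of the per-cell work happens in plain Python, and Pandas only sees the finished counts.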

Conclusion

What I was trying to do with Pandas – unfortunately, the very first thing I ever tried to do with it – didn’t play to its strengths. It didn’t break my code, but the naïve DataFrame counter was ~230 times slower than the Dict + Counter version (and ~1700 times slower than incrementing a simple variable). Since I had thousands of items to process, the difference was hard to overlook!

Pandas looks great for some things, and I expect I’ll continue using it. This was just a bump in the road, albeit an interesting one.

Coercing Objects to Integer, Revisited

Summary

I recently wrote a blog post that involved exception handling, and gave short shrift to the part of exception handling I didn’t want to talk about in order to focus on the part I did want to talk about. For some readers, that clearly backfired.

Background

My recent blog post about coercing Python objects to integers caught people’s attention in a way I hadn’t intended. The point I was trying to make was that an innocent-looking call like int(an_object) calls the method an_object.__int__(), and since that can be arbitrary code, it can raise arbitrary exceptions. Therefore, it’s insufficient to catch only the usual exceptions of ValueError and TypeError if you don’t know the type of an_object in advance.

Here’s the code I suggested –

def int_or_else(value, else_value=None):
    """Given a value, returns the value as an int if possible.
    If not, returns else_value which defaults to None.
    """
    try:
        return int(value)
    # I don't like catch-all excepts, but since objects can raise arbitrary
    # exceptions when executing __int__(), then any exception is
    # possible here, even if only TypeError and ValueError are
    # really likely.
    except Exception:
        return else_value

Several commenters objected to the fact that this code discards (and therefore silences/masks/hides) all exceptions. Here’s why I made that choice.

The Two Parts of Exception Handling

In Python, there are two parts to consider about exception handling: what to catch, and what to do with the exception once you’ve caught it. My intention was to write only about the former.

The latter is an interesting topic, too. Once you’ve caught an exception, you might want to log it and then discard it, log it and then re-raise it, re-raise it as a different exception, silence it, let it pass up to the caller, modify its attributes and re-raise it, etc. There’s enough material for an entire blog post about different ways to react to an exception, and the pros and cons of each.

Someday I might write that post about different ways to react to trapped exceptions, and if I do, I’ll dedicate the entire post to the subject to give it the attention it deserves. That other blog post – that was not it. In fact, it was the opposite. I gave the topic of processing the trapped exception as little attention as possible so as not to detract attention from what I wanted to be the main topic (what exceptions need to be trapped).

That backfired.

Conclusion

My post was not advocacy of discarding exceptions, nor was it advocacy of not discarding exceptions. What’s the right choice? It depends. One situation where you might want to discard exceptions is in a blog post where you’re trying to keep the code as brief as possible for readability. Then again, you might regret that. :-)

In the future, I’ll be clearer about what shortcuts I’m taking for brevity of presentation.

Agree? Disagree? I’d like to hear from you. I like it when people agree with me. Those who disagree can expand my horizons, and I like that too. In short, all civil comments are welcome. I feel I’ve spent enough time thinking about this topic for now, but that doesn’t make me right! Let me know what you think.

How Best to Coerce Python Objects to Integers?

Summary

In my opinion, the best way in Python to safely coerce things to integers requires use of an (almost) “naked” except, which is a construct I rarely want to use. Read on to see how I arrived at this conclusion, or you can jump ahead to what I think is the best solution.

The Problem

Suppose you had to write a Python function to convert string values representing temperatures to integers, like the values in this list:

['22', '24', '24', '24', '23', '27']

The strings come from a file that a human has typed in, so even though most of the values are good, a few will have errors ('25C') that int() will reject.

Let’s Explore Some Solutions

You might write a function like this —

def force_to_int(value):
    """Given a value, returns the value as an int if possible.
    Otherwise returns None.
    """
    try:
        return int(value)
    except ValueError:
        return None

Here’s that function in action at the Python prompt —

>>> print(force_to_int('42'))
42
>>> print(force_to_int('oops'))
None

That works! However, it’s not as robust as it could be.

Suppose this function gets input that’s even more unexpected, like None:

>>> print(force_to_int(None))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in force_to_int
TypeError: int() argument must be a string or a number, not 'NoneType'

Hmmm, let’s write a better version that catches TypeError in addition to ValueError:

def force_to_int(value):
    """Given a value, returns the value as an int if possible.
    Otherwise returns None.
    """
    try:
        return int(value)
    except (ValueError, TypeError):
        return None

Let’s give that a try at the Python prompt —

>>> print(force_to_int(None))
None

Aha! Now we’re getting somewhere. Let’s try some other types —

>>> import datetime
>>> print(force_to_int(datetime.datetime.now()))
None
>>> print(force_to_int({}))
None
>>> print(force_to_int(complex(3,3)))
None
>>> print(force_to_int(ValueError))
None

OK, looks good! Time to pop open a cold one and…

Wait, I can still feed input to this function that will break it. Watch this —

>>> class Unintable():
...     def __int__(self):
...         raise ArithmeticError
...
>>>
>>> trouble = Unintable()
>>> print(force_to_int(trouble))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in force_to_int
  File "<stdin>", line 3, in __int__
ArithmeticError

Dang!

While the class Unintable is contrived, it reminds us that classes control their own conversion to int, and can raise any error they please, even a custom error. A scenario that’s more realistic than the Unintable class might be a class that wraps an industrial sensor. Calling int() on an instance normally returns a value representing pressure or temperature. However, it might reasonably raise a SensorNotReadyError.
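Here is a rough sketch of what that sensor scenario might look like. Everything in it is hypothetical: PressureSensor and SensorNotReadyError are names invented for this illustration, and a real sensor wrapper would talk to hardware rather than hold a stored reading.

```python
class SensorNotReadyError(Exception):
    """Raised when the (hypothetical) sensor has no reading to report."""


class PressureSensor:
    """A toy wrapper around an industrial sensor.

    A real class would query the device; this one just holds a value
    so that the behavior of __int__() is easy to see.
    """

    def __init__(self, reading=None):
        self._reading = reading

    def __int__(self):
        if self._reading is None:
            raise SensorNotReadyError('no reading available yet')
        return int(self._reading)


ready = PressureSensor(101.7)
not_ready = PressureSensor()

print(int(ready))   # the normal case: the reading as an int

try:
    int(not_ready)
except SensorNotReadyError as error:
    # Neither ValueError nor TypeError would have caught this.
    print('caught:', error)
```

Since SensorNotReadyError doesn't inherit from ValueError or TypeError, the version of force_to_int() that catches only those two would let it escape, just as it did with ArithmeticError.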

And Finally, the Naked Except

Since any exception is possible when calling int(), our code has to accommodate that. That requires the ugly “naked” except. A “naked” except is an except statement that doesn’t specify which exceptions it catches, so it catches all of them, even SyntaxError. They give bugs a place to hide, and I don’t like them. Here, I think it’s the only choice:

def force_to_int(value):
    """Given a value, returns the value as an int if possible.
    Otherwise returns None.
    """
    try:
        return int(value)
    except:
        return None

At the Python prompt —

>>> print(force_to_int(trouble))
None

Now the bones of the function are complete.

Complete, Except For One Exception

Graham Dumpleton’s comment below pointed out that there’s a difference between what I call a ‘naked’ except:

except:

And this —

except Exception:

The former traps even SystemExit which you don’t want to trap without good reason. From the Python documentation for SystemExit —

It inherits from BaseException instead of Exception so that it is not accidentally caught by code that catches Exception. This allows the exception to properly propagate up and cause the interpreter to exit.

The difference between these two is only a side note here, but I wanted to point it out because (a) it was educational for me and (b) it explains why I’ve updated this post to hedge on what I was originally calling a ‘naked’ except.
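A quick way to see the difference for yourself is the sketch below. The helper function is made up for this illustration; it simply reports whether a given exception would be trapped by except Exception.

```python
def caught_by_except_exception(exc):
    """Return True if 'except Exception' traps exc, False otherwise."""
    try:
        raise exc
    except Exception:
        return True
    except BaseException:
        # Only reached by exceptions that do NOT inherit from Exception,
        # such as SystemExit and KeyboardInterrupt.
        return False


for exc in (ValueError('bad value'), ArithmeticError(),
            SystemExit(), KeyboardInterrupt()):
    print(type(exc).__name__, caught_by_except_exception(exc))
```

A truly naked except behaves like the except BaseException branch here, which is why it would swallow SystemExit too.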

The Final Version

We can make this a bit nicer by allowing the caller to control the non-int return value, giving the “naked” except a fig leaf, and changing the function name —

def int_or_else(value, else_value=None):
    """Given a value, returns the value as an int if possible. 
    If not, returns else_value which defaults to None.
    """
    try:
        return int(value)
    # I don't like catch-all excepts, but since objects can raise arbitrary
    # exceptions when executing __int__(), then any exception is
    # possible here, even if only TypeError and ValueError are 
    # really likely.
    except Exception:
        return else_value

At the Python prompt —

>>> print(int_or_else(trouble))
None
>>> print(int_or_else(trouble, 'spaghetti'))
spaghetti

So there you have it. I’m happy with this function. It feels bulletproof. It contains an (almost) naked except, but that only covers one simple line of code that’s unlikely to hide anything nasty.
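To tie this back to the original problem, here is the function applied to a list of typed-in temperature strings. The sample list (including the bad '25C' entry) is made up for illustration; the function itself is repeated so the snippet runs on its own.

```python
def int_or_else(value, else_value=None):
    """Given a value, returns the value as an int if possible.
    If not, returns else_value which defaults to None.
    """
    try:
        return int(value)
    # Catch-all on purpose: __int__() can raise anything.
    except Exception:
        return else_value


# Mostly good values, with one typo of the kind a human might make.
raw_temperatures = ['22', '24', '25C', '27']

temperatures = [int_or_else(value) for value in raw_temperatures]

# Keep only the values that converted cleanly.
valid = [t for t in temperatures if t is not None]

print(temperatures)
print(valid)
```

Leaving None in place of the bad entries keeps the positions lined up with the input, which is handy if you want to report which values failed.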

You might also want to read a post I made about the exception handling choices in this post.

I release this code into the public domain, and I’ll even throw in the valuable Unintable class for free!

The image in this post is public domain and comes to us courtesy of Wikimedia Commons.

Thanks for PyData Carolinas

My PyData Pass

Thanks to all who made PyData Carolinas 2016 a success! I had conversations about eating well while on the road, conveyor belts, and a Fortran algorithm to calculate the interaction of charged particles. Great stuff!

My talk was on getting Python to talk to compiled languages; specifically C, Fortran, and C++.

I’m grateful to the PyData A/V team who did a great job capturing the presentations. Thanks to them, you can see my talk on YouTube at https://www.youtube.com/watch?v=aUSokzzsEko , or you can watch the embedded version below.

 

Creating PDF Documents Using LibreOffice and Python

This post is a supplement to a talk I’m giving at PyOhio about using Python to create PDFs “the lazy way”. It’s the first of a series on this subject which is a bit too big for just one blog post.

In the talk and in this series, I advocate a technique for creating PDFs that uses LibreOffice (or OpenOffice) to do most of the hard work, and I contrast that to the common solution of using ReportLab (or a library like it).

This technique offers some unique benefits, and in some common use cases—most importantly, perhaps in your case—it can be much more efficient than the alternative. I’ll compare and contrast the two in another blog post. In this post I just want to describe the technique I’m advocating.

Background

Creating PDFs programmatically is a task most Python programmers encounter at least once.

When I talk about creating PDFs programmatically, I’m thinking of the situation where one wants to create a lot of PDFs that follow a template. For instance, you might work for a bank that wants to produce end-of-month account statements for each of its 100,000 customers. The cover page will always contain the bank’s logo, some legal boilerplate, the month and year, and a bland stock photo of happy customers doing something unrelated to banking, like this one.

The first page after that will be a summary of the customer’s accounts, and then subsequent pages contain information about the account—a list of transactions, changes in values of stocks, etc.

Each PDF will be different, but similar because they follow a template. Computers are great for this sort of thing, and this technique is particularly good at it. As I said above, I’ll tell you why I think it’s good in another blog post. For now, I want to stop talking mysteriously about “the technique” and actually describe it.

Outline

Here’s a concise outline. Don’t worry if you don’t understand all the steps; they’re fleshed out below.

  1. Create a LibreOffice document that will serve as a template for the documents you want to create. (Note: I mean “template” in the general sense of a form or skeleton, not a LibreOffice .ott template file.)
  2. Unzip that document.
  3. Manipulate the document’s XML using standard Python libraries.
  4. Zip the modified files into a new LibreOffice .odt file.
  5. Ask LibreOffice to export the document in PDF format.

Let’s go through these step by step. I encourage you to follow along. We’re not going to write a single line of Python code, just explore a process. Writing Python would come later when you automate steps 2 – 5.

1. Create a LibreOffice Document to Use as a Template

This step will probably require the most work.

We usually know in advance at least some of the content we want. For instance, in the bank example above, we know what the cover page will look like, where each section should appear in the document, and how a section (e.g. a list of account transactions) should be formatted, even if we don’t know in advance the exact values of each transaction.

Your job during this step is to create a LibreOffice document that will serve as a skeleton (or template, or form) for your final documents. For content that you don’t know (words in paragraphs, images, bullet points in a list, table contents, etc.), leave placeholders.

If you want to play along with this blog post, here’s the LibreOffice document that I’ll use in the examples below.

2. Unzip the Document

This is a trick you might not know: LibreOffice documents are ZIP files. (This is true of all documents that follow the Open Document Format for Office Applications.) You can unzip them with command line tools, or with the zipfile module in Python’s standard library.

On my Mac, the following command unzips the document into the directory unzipped.

unzip practice.odt -d unzipped

After unzipping, you’ll see a bunch of files like this:

drwxr-xr-x  11 philip staff    374 Jul 27 16:43 Configurations2/
drwxr-xr-x   3 philip staff    102 Jul 27 16:43 META-INF/
drwxr-xr-x   3 philip staff    102 Jul 27 16:43 Thumbnails/
-rw-r--r--   1 philip staff   8988 Jul 27 16:44 content.xml
-rw-r--r--   1 philip staff    899 Jul 27  2016 manifest.rdf
-rw-r--r--   1 philip staff   1005 Jul 27  2016 meta.xml
-rw-r--r--   1 philip staff     39 Jul 27  2016 mimetype
-rw-r--r--   1 philip staff  10319 Jul 27  2016 settings.xml
-rw-r--r--   1 philip staff  14903 Jul 27  2016 styles.xml

Of the files above, you’re only likely to be interested in content.xml. (You might also want to explore styles.xml, but I consider that an advanced topic, and I’m trying to maintain a rigorous standard of laziness.)
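When you get around to automating this step, the zipfile module can do the same job as the unzip command. Here’s a minimal sketch; the function name unzip_odt is my own invention for illustration, and the file and directory names are the ones from the command above.

```python
import zipfile


def unzip_odt(odt_path, target_dir):
    """Extract every member of an ODF document into target_dir.

    Returns the member names in archive order, which is handy for
    checking that content.xml is really in there.
    """
    with zipfile.ZipFile(odt_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

Calling unzip_odt("practice.odt", "unzipped") should leave you with the same set of files as the command line version.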

3. Manipulate the XML

The XML in content.xml is governed by the 846-page Open Document Format for Office Applications. You might think I’m going to suggest you read it, or at least familiarize yourself with it.

Heck no! That’s not the lazy way. I’m very pleased that it’s an ISO standard, but I don’t want to learn it if I can save time and effort by not doing so, and you shouldn’t have to either.

Instead I suggest you use what I use: common sense and intuition, which can get you surprisingly far. For instance, if you see this in the XML:

<text:p text:style-name="P4">
 The fox jumped over the dog.
</text:p>

You don’t have to read 846 pages of documentation to guess that you can change it to this—

<text:p text:style-name="P4">
 The quick brown fox jumped over the lazy dog.
</text:p>

Or even this—

<text:p text:style-name="P4">
 No one expects the Spanish Inquisition!
</text:p>

Are you starting to see some possibilities?

If you’re doing this programmatically, you can use LibreOffice bookmarks to demarcate the text you want to replace. Bookmarks are visible in the XML and trivial to locate using XPath. You can see this in my example document where I’ve surrounded two blank space characters with bookmarks where adjectives might go to describe the fox and dog.

<text:p text:style-name="P1">
    The
    <text:bookmark-start text:name="fox_type_placeholder"/>
    <text:s/>
    <text:bookmark-end text:name="fox_type_placeholder"/>
    <text:s/>
    fox jumped over the
    <text:bookmark-start text:name="dog_type_placeholder"/>
    <text:s/>
    <text:bookmark-end text:name="dog_type_placeholder"/>
    <text:s/>
    dog.
</text:p>

What do you think will happen if you replace the first occurrence of <text:s/> with quick brown?
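If you’d like to try that replacement from Python rather than a text editor, here’s a rough sketch using the standard library’s ElementTree. It assumes the placeholder is always a single <text:s/> sitting directly after the bookmark start, as in my example document. The function name fill_bookmark and the trimmed-down SAMPLE string are made up for illustration. ElementTree has no way to ask an element for its parent, so the sketch walks a paragraph’s children instead of using a single XPath expression.

```python
import xml.etree.ElementTree as ET

# The namespace that the "text:" prefix stands for in content.xml.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
NS = {"text": TEXT_NS}

# A cut-down paragraph modeled on the example above (illustrative only).
SAMPLE = (
    '<text:p xmlns:text="%s" text:style-name="P1">The '
    '<text:bookmark-start text:name="fox_type_placeholder"/>'
    '<text:s/>'
    '<text:bookmark-end text:name="fox_type_placeholder"/>'
    '<text:s/>fox jumped over the dog.</text:p>' % TEXT_NS
)


def fill_bookmark(paragraph, name, replacement):
    """Swap the <text:s/> after the named bookmark start for plain text.

    Returns True if a replacement was made, False otherwise.
    """
    start_tag = "{%s}bookmark-start" % TEXT_NS
    name_attr = "{%s}name" % TEXT_NS
    space_tag = "{%s}s" % TEXT_NS
    children = list(paragraph)
    for index, child in enumerate(children):
        if child.tag != start_tag or child.get(name_attr) != name:
            continue
        if index + 1 >= len(children):
            return False
        placeholder = children[index + 1]
        if placeholder.tag != space_tag:
            return False
        # Text that follows an element lives in that element's .tail,
        # so the new text goes on the bookmark start's tail. Removing
        # the placeholder also drops its tail, so carry that over too.
        child.tail = (child.tail or "") + replacement + (placeholder.tail or "")
        paragraph.remove(placeholder)
        return True
    return False


paragraph = ET.fromstring(SAMPLE)
fill_bookmark(paragraph, "fox_type_placeholder", "quick brown")
```

The bookmarks themselves stay in place, so they remain visible if you open the result in LibreOffice.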

Text isn’t the only thing you can change.

If you have a list item with bullets and you want another bullet or three, you can just duplicate existing bullets. For instance, if you start with this—

<text:list xml:id="list3413943092755896283" text:style-name="L1">
    <text:list-item>
        <text:p text:style-name="P2">First</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Second</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Third</text:p>
    </text:list-item>
</text:list>

You can turn it into this—

<text:list xml:id="list3413943092755896283" text:style-name="L1">
    <text:list-item>
        <text:p text:style-name="P2">First</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Second</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Third</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Fourth</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Fifth</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Sixth</text:p>
    </text:list-item>
</text:list>
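The same duplication can be sketched in Python by deep-copying an existing list item and changing its text. This assumes each item holds a single text:p paragraph, as in the example above; append_list_items and SAMPLE_LIST are names I made up for illustration.

```python
import copy
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# A shortened version of the three-item list above (illustrative only).
SAMPLE_LIST = (
    '<text:list xmlns:text="%s" text:style-name="L1">'
    '<text:list-item><text:p text:style-name="P2">First</text:p></text:list-item>'
    '<text:list-item><text:p text:style-name="P2">Second</text:p></text:list-item>'
    '<text:list-item><text:p text:style-name="P2">Third</text:p></text:list-item>'
    '</text:list>' % TEXT_NS
)


def append_list_items(list_element, labels):
    """Append one new list item per label, cloned from the last item."""
    item_tag = "{%s}list-item" % TEXT_NS
    paragraph_tag = "{%s}p" % TEXT_NS
    template = list_element.findall(item_tag)[-1]
    for label in labels:
        # deepcopy keeps the paragraph's style attributes intact.
        new_item = copy.deepcopy(template)
        new_item.find(paragraph_tag).text = label
        list_element.append(new_item)


bullet_list = ET.fromstring(SAMPLE_LIST)
append_list_items(bullet_list, ["Fourth", "Fifth", "Sixth"])
```

Cloning an existing item means the new bullets pick up whatever style the old ones had, without my having to know what P2 means.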

Note that the text:list element itself has what looks like a unique id associated with it. This is a yellow flag that indicates to me that if you want to copy the entire list, you’ll need to give it a new unique id, and hope that LibreOffice doesn’t reference that id in some other file.

I’m sure the details are somewhere in that 846-page document. You can read that document, or you can also just try your change and see what happens. The worst case scenario is that LibreOffice will tell you that your document is corrupted and you’ll have to go back and explore some more.

4. Zip a New LibreOffice File

Once you’ve made the changes you want, it’s time to reverse step 2, using your modified content.xml.

Here’s the command that works on my Mac—

cd unzipped && zip -r ../my_new_file.odt * && cd ..

Note that this command doesn’t respect the OpenDocument specification, which has rules regarding how the mimetype file should be represented in the ZIP file (as the first file in the archive, and uncompressed, per OpenDocument v1.2 part 3, § 3.3 MIME Media Type). It works for me, maybe because LibreOffice is forgiving. It’s not something you should rely on, however. In another post, I’ll present some Python code that constructs the ZIP file according to the standard.
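In the meantime, here’s a rough sketch of how the two rules mentioned above (mimetype first, and stored uncompressed) might be handled with the zipfile module. It isn’t a claim of full compliance with the specification, and the function name zip_odt is made up for illustration.

```python
import os
import zipfile


def zip_odt(source_dir, odt_path):
    """Zip source_dir into odt_path, writing mimetype first and uncompressed.

    odt_path should live outside source_dir; otherwise the half-written
    archive would be picked up by the directory walk below.
    """
    with zipfile.ZipFile(odt_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # The mimetype file goes in first and is stored, not deflated.
        archive.write(os.path.join(source_dir, "mimetype"), "mimetype",
                      compress_type=zipfile.ZIP_STORED)
        for folder, _subdirs, filenames in os.walk(source_dir):
            for filename in filenames:
                full_path = os.path.join(folder, filename)
                arcname = os.path.relpath(full_path, source_dir)
                if arcname != "mimetype":
                    archive.write(full_path, arcname)
```

Calling zip_odt("unzipped", "my_new_file.odt") would stand in for the command line version above.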

5. Export to PDF via LibreOffice

If you’re just experimenting, you can just open the document in LibreOffice manually and then use the “File/Export as PDF…” menu item. (Opening manually is also a good test that you didn’t do anything objectionable to the XML.)

Programmatically, I recommend using unoconv for converting your finished document to PDF.
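Here’s a minimal sketch of calling unoconv from Python. It assumes unoconv is installed and on your PATH, and it uses unoconv’s -f option to ask for PDF output. The function names are made up for illustration.

```python
import subprocess


def unoconv_command(odt_path):
    """Build the unoconv command line that requests PDF output."""
    return ["unoconv", "-f", "pdf", odt_path]


def export_to_pdf(odt_path):
    """Run unoconv on odt_path; raises CalledProcessError if it fails."""
    subprocess.check_call(unoconv_command(odt_path))
```

Keeping the command in its own function makes it easy to inspect or log before running it across 100,000 documents.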

Review

So there you have it! If you feel underwhelmed, keep in mind that this was only a proof of concept. In some future posts, I’ll explain why I think this method is often an excellent choice (and also when it isn’t).

Photo Credit

Thanks to the National Cancer Institute for making many photos available for free, including the one used in this blog post which was taken by Rhoda Baer.

Embedding Python: How To Confuse Python and Yourself

This is a cautionary tale about how embedded Python finds its runtime files under Windows. Don’t worry, though — everyone lives happily ever after.

The story begins with a client’s request to build an executable under Windows that embeds Python. This is not so hard; there is documentation for embedding Python.

I had two versions of Python installed on my Windows virtual machine because I was experimenting with different Pythons at the time. (Doesn’t everyone go through an experimental phase in their youth?) In C:\Python27 I had installed 64-bit Python 2.7.9 from Python.org, and in C:\Users\philip\Miniconda2 I had installed 64-bit Python 2.7.11 from Continuum. The 2.7.9 version was an older install. I was only interested in using the 2.7.11 version.

My executable’s C code told Python where to find its runtime files —

Py_SetPythonHome("C:/Users/philip/Miniconda2");

After compiling and linking, I ran my Python-embedding executable which imported my hello_world.py file. It printed “Hello world!” as expected.

Here Come the Dragons

I thought everything was fine until I added these print statements to my Python code —

import sys
print sys.exec_prefix
print sys.version

The output was not what I expected —

C:/Users/philip/Miniconda2

2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit (AMD64)]

This is contradictory! The Miniconda Python was 2.7.11, yet sys.version identified itself as Python 2.7.9. How can the same module report two different Pythons simultaneously?

What Went Wrong

The sys module reported information about two Pythons simultaneously because I was mixing parts from two different Python runtimes simultaneously.

Python’s runtime consists of a dynamically loaded library (python27.dll), the standard library which is largely written in Python itself, and an executable. The executable is usually python.exe, but in this case it was my C program that embedded Python. When my C program asked Windows to find python27.dll, Windows searched for it in these directories as documented in the Windows DLL search strategy  —

  1. The directory where the executable module for the current process is located.
  2. The current directory.
  3. The Windows system directory (usually C:\Windows\system32).
  4. The Windows directory (usually C:\Windows).
  5. The directories listed in the PATH environment variable.
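The important property of that list is that the first match wins. Here’s a small Python sketch that models just that rule. It doesn’t ask Windows anything; the function name and the list of directories you pass in are entirely up to you, so treat it as an illustration rather than a reimplementation of the Windows loader.

```python
import os


def first_directory_containing(filename, directories):
    """Return the first directory, in order, that contains filename.

    Returns None if no directory in the list has the file.
    """
    for directory in directories:
        if os.path.isfile(os.path.join(directory, filename)):
            return directory
    return None
```

Feed it python27.dll and the directories in search order, and it tells you which copy a first-match-wins search would pick up.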

My problem was that Windows found C:\Windows\system32\python27.dll first, and that was from my Python 2.7.9 installation. Meanwhile, my call to Py_SetPythonHome() had told Python to use the standard library from Miniconda Python. The value of sys.version comes from a string hardcoded in the runtime DLL, while sys.exec_prefix is derived from the value I passed in Py_SetPythonHome(). I was using the standard library from one Python installation, and the runtime DLL from another.

Consequences

I didn’t experiment with this for long, and I might never have noticed the problem if I hadn’t been lucky enough to double-check my Python setup with the sys module. The standard library probably doesn’t care much about which interpreter it runs under. Still, I can imagine a few cases where changes or bug fixes made to the Python part of the standard library in versions 2.7.10 and 2.7.11 rely on corresponding changes to the binary runtime, and that code might behave badly.

Both of the Pythons I was using were built with the same compiler, so theoretically binary extensions like numpy should run just fine under either Python. But I could certainly forgive numpy if it crashed as a result.

In short, this is neither a typical nor a supported use of Python, which puts it in the “Here there be dragons” category.

The Solution

The solution was very simple. I copied C:\Users\philip\Miniconda2\python27.dll into the same directory as my custom executable. Since that’s the first location Windows searches when loading a DLL, it isolates my code from other Python DLLs that might appear in (or disappear from) other locations in the file system. Problem solved!
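To confirm that a fix like this took effect, the check that exposed the problem is worth keeping around. Here’s a small sketch that gathers the relevant values in one place. The function name runtime_report is made up, and including the location of the os module is my own addition as a hint about where the standard library was loaded from. The code avoids print statements, so it should run under both Python 2 and Python 3.

```python
import os
import sys


def runtime_report():
    """Collect values that reveal which Python pieces are in use."""
    return {
        # Comes from the interpreter binary (the DLL, in my case).
        "version": sys.version,
        # Derived from the Python home directory.
        "exec_prefix": sys.exec_prefix,
        # Shows which standard library directory os was imported from.
        "stdlib_os_module": getattr(os, "__file__", None),
    }
```

If the version string and the directories don’t belong to the same installation, you’re mixing runtimes the way I was.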

How To Set Up CentOS to Build Linux Wheels for Python

PEP 513 gives guidelines on how to build broadly compatible Linux platform wheels for Python. The PEP names CentOS 5.11 as the reference OS on which a Python wheel must run if it is to earn the right to the manylinux1 name.

The surest way to get one’s binary package to run on CentOS 5.11 is to build it there. This post explains how I set up a CentOS 5.11 VirtualBox guest to build manylinux1 Python wheels.

PEP 513 offers a prebuilt Docker container of CentOS 5.11. If you’re on Linux and/or you’re familiar with Docker, that’s probably a better route than building a VM.

Note that I’m not at all a Linux expert. If I’ve done something foolish or incorrect, I’d like to hear about it in the comments. Please be nice. =)

Why CentOS 5.11?

CentOS has a few things going for it that make it a good choice —

  • It’s free
  • As a derivative of Red Hat Enterprise Linux, it’s a conservative distro, so the libraries on it are likely older than the libraries on contemporaneous distros.
  • At the time PEP 513 was written, CentOS 5.11 was already over a year old. That increases the odds that other distros will have the libraries that it has.

Its age is also a disadvantage, because at this stage the only updates CentOS 5 will receive are critical security updates, and the CentOS team “recommends that you start moving workloads from CentOS 5 to CentOS Linux 6 or CentOS Linux 7”.

CentOS 5.11 Setup

Download the CentOS 5.11 ISO and install under VirtualBox. During the CentOS installation, I opted to disable SELinux. Since I only use this installation for builds and not as a server or daily desktop, I don’t feel the need for high security.

Once CentOS is installed, let the updater download and install its patches.

Next, you’ll want to install VirtualBox guest additions to make the guest OS easier to use. In order to do that, you first have to add yourself to the sudoers file.

Add Yourself to sudoers

Open a terminal and enter the following commands —

  1. su -
  2. vim /etc/sudoers
  3. At the end of the file, add this line:
    your_username ALL=(ALL) ALL
  4. Save the file with :wq!
  5. Type exit to exit the su - shell.

Now you should be able to run commands with sudo.

Build the VirtualBox Guest Additions

  1. Install GCC:
    sudo yum install gcc gcc-c++
  2. Insert the Guest Additions CD.
  3. Start a terminal and cd /media/VBOXADDITIONS_xxxx. Note that the exact name of the VBOXADDITIONS directory changes with each version of VirtualBox.
  4. sudo ./VBoxLinuxAdditions.run
  5. Eject the Guest Additions CD and reboot.

Add Packages

Install the things you’ll need to build Python, and to use pip —

sudo yum install xz zlib zlib-devel openssl-devel pcre-devel sqlite-devel

Build Python

CentOS 5.11 comes with Python 2.4. You will undoubtedly want a newer Python, so download and untar the source code for the Python you want to use and then build it. I built Python 2.7.11 with these steps —

 sudo ./configure --enable-unicode=ucs4
 sudo make altinstall

It’s important to use UCS4 (as opposed to the default UCS2) during the configure step to increase your odds of being compatible with the Pythons built for other Linux distros.
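Once the build finishes, you can check from inside Python whether you got a UCS4 build. This sketch relies on sys.maxunicode, which is 65535 on a UCS2 build of Python 2 and 1114111 on a UCS4 build. The function name is made up for illustration. On Python 3.3 and later the distinction no longer exists, so the check always reports True there.

```python
import sys


def is_ucs4_build():
    """Return True if this interpreter can represent code points above U+FFFF."""
    return sys.maxunicode > 0xFFFF
```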

make altinstall tells Python to install itself in such a way that it doesn’t interfere with the default (system) Python.

Once my Python was built, I added a symlink to make it the default Python in my shell —

 sudo ln -s /usr/local/bin/python2.7 /usr/local/bin/python

At this point, if you start a new terminal and type python, you should get Python 2.7.11.

Download and Install pip

wget --no-check-certificate https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py

This also installs setuptools and wheel, both of which we need.

Fix pip

At this point, pip will malfunction as described in issue 1918. The fix is not difficult.

Use sudo vim to edit these 3 scripts —

  • /usr/local/bin/pip
  • /usr/local/bin/pip2
  • /usr/local/bin/pip2.7

In each, change the first line from this —

#!/usr/bin/python

to this —

#!/usr/local/bin/python
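If you’d rather not edit three files by hand, the same change can be sketched in Python. The function name replace_shebang is made up for illustration. It only touches a file whose first line exactly matches the old value, and you’d need to run it with enough privileges to write to /usr/local/bin.

```python
def replace_shebang(script_path, old, new):
    """Replace the first line of script_path if it equals old.

    Returns True if the file was changed, False if it was left alone.
    """
    with open(script_path) as handle:
        lines = handle.readlines()
    if not lines or lines[0].rstrip("\n") != old:
        return False
    lines[0] = new + "\n"
    with open(script_path, "w") as handle:
        handle.writelines(lines)
    return True
```

You’d call it once for each of the three scripts, passing #!/usr/bin/python as old and #!/usr/local/bin/python as new.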

Enjoy, and Build Some Wheels!

You’re done! Go build some great wheels!

 

My Code Got Cyndi Lauper-ed, and I’m Glad

Let’s start with a quick musical quiz. Who wrote this 1983 pop hit? I’ll give you a hint — it’s not the person who sings it.

You can see the songwriting credits here (assuming that link survives time), but for most of you it won’t be a useful clue.

Those of you who grew up in Philadelphia in the 1980s might recognize the name Robert Hazard, leader of Robert Hazard and the Heroes, author of Escalator of Life, Change Reaction, and Out of the Blue, and, of course, Girls Just Wanna Have Fun. Robert’s popularity never grew much out of the Philly/NJ area, but Cyndi Lauper’s version of Girls Just Wanna Have Fun sold a zillion copies worldwide and touched a lot of people who had never heard of Robert Hazard, and never will.

I have something in common with him — someone has given my work far more exposure than I ever expected it to get. (Another thing we have in common is growing up in the same small Philadelphia suburb. But without Cyndi Lauper’s involvement, that’s just trivia.)

I was surprised to learn from http://pypi-ranking.info that posix_ipc, one of my open source packages, is currently in the top .5% (½ of 1%) of the most downloaded on PyPI. Now, posix_ipc might be good at what it does, but it fills a tiny niche that’s nowhere near big enough to justify all of those 1.7-million-and-counting downloads. Why is it a top 1% download? Because it has become part of something much bigger — the massively popular OpenStack.

OpenStack didn’t have to rewrite portions of posix_ipc (like Cyndi did with Girls Just Wanna Have Fun, with Robert’s permission). They haven’t yet made a video of it that includes a nod to the Marx Brothers (like Cyndi did, with or without Robert’s permission). And as far as I know, OpenStack has yet to be nominated for a Grammy. But they have shown me the value of putting something out into the world, because you never know where it will end up.

So thanks, OpenStack! And thanks to Robert Hazard for music I enjoyed growing up (and still do). R.I.P, Robert.

 

My Function Failed Inspection

Python’s inspect module allows one to examine the signature of functions, like so:

$ python3
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 23 2015, 02:52:03)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import inspect
>>> def f(foo=42):
...     pass
...
>>> print(inspect.signature(f))
(foo=42)

I wanted to use function signature inspection during unit testing of my sysv_ipc and posix_ipc modules to ensure that my code matched its documentation.

Unfortunately, inspect doesn’t work with functions written in C, as you can see in the example below that uses math.sqrt().

$ python3
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 23 2015, 02:52:03)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import inspect
>>> import math
>>> inspect.signature(math.sqrt)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/inspect.py", line 2055, in signature
    return _signature_internal(obj)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/inspect.py", line 1957, in _signature_internal
    skip_bound_arg=skip_bound_arg)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/inspect.py", line 1890, in _signature_from_builtin
    raise ValueError("no signature found for builtin {!r}".format(func))
ValueError: no signature found for builtin <built-in function sqrt>
>>>

Since posix_ipc and sysv_ipc are written in C, I have to test their functions and methods with keyword arguments by calling each one with each keyword argument specified.
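Here’s a simplified sketch of that idea, not the actual test code from either module. The helper name rejected_keywords is made up, and the usage line exercises the built-in sorted() rather than posix_ipc so that it runs anywhere. It treats any TypeError as a rejected keyword, which is a crude assumption, because a TypeError can also come from a bad value.

```python
def rejected_keywords(func, positional, keywords):
    """Call func once per keyword argument and report the ones it rejects.

    positional is a tuple of positional arguments passed on every call;
    keywords maps each keyword name to a value to try it with.
    """
    rejected = []
    for name, value in keywords.items():
        try:
            func(*positional, **{name: value})
        except TypeError:
            rejected.append(name)
    return rejected


# sorted() is implemented in C and accepts key and reverse as keywords.
problems = rejected_keywords(sorted, ([3, 1, 2],), {"key": abs, "reverse": True})
```

An empty list means every documented keyword was accepted, which is the property I want my unit tests to confirm.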