Creating PDF Documents Using LibreOffice and Python, Part 2

This is part 2 of a 4-part series on creating PDFs using LibreOffice. You should read part 1 if you haven’t already. This series is a supplement to a talk I gave at PyOhio 2016.

Here in part 2, I compare and contrast the two approaches I outlined in part 1 — the obvious approach of using ReportLab, and the LibreOffice approach that I think is underappreciated. Both approaches can be good in the right situation, but neither is better than the other all the time. In some cases, the difference is dramatic.

Without further ado, here’s the 10 categories in which I want to compare these two, and how each approach stacks up. (The compare/contrast portion of my PyOhio talk starts at the 17 minute mark.)

1. Cross-Platform?

Both ReportLab and the LibreOffice technique run on Windows, Linux, OS X, and BSD. I haven’t researched mobile operating systems like iOS and Android, but you’re not likely to want to construct PDFs on a mobile device.

2. Python 2/3 Support?

Both approaches can be used with Python 2 and 3.

3. FOSS?

ReportLab is under a BSD License.

LibreOffice is under the MPL v2.0 which is a BSD/GPL hybrid. However the details don’t matter much since you’re not going to use the source code anyway.

4. Repairability?

By repairability, I’m referring to the ease with which you can fix things that don’t behave the way you want them to.

ReportLab scores very well here, because it’s pure Python and BSD license gives you a lot of flexibility. You can read, debug, patch, and copy the code. When debugging, you can step directly from your code into ReportLab code. If you patch ReportLab, it’s easy to roll out a patched version to your servers using pip.

LibreOffice, on the other hand, is a large office suite written in C++ (and maybe Java?). It’s orders of magnitude more complicated. Think of it as an unrepairable black box.

5. Power?

ReportLab includes lots of cool stuff out of the box, like bar, pie, line, and other kinds of charts, a table of contents generator, and probably lots of other things I don’t know about.

It’s also extensible, so it you want something it doesn’t have (like a list of figures generator) you can write it or search online to see if someone else has already done it.

LibreOffice has even more to offer, though. It’s an entire office suite, after all! It not only handles all of the normal text document things (like headings, foot/endnotes, autonumbering lists, etc.), you can do more sophisticated things like embedding spreadsheets in documents. 1001 Creative Ways to Use an Office Suite could be a blog post all its own (or 1001 of them!).

6. Scalability?

ReportLab is just Python so one can run multiple concurrent threads or processes just as with any other Python code.

Unfortunately, LibreOffice does not scale. It’s not possible to run multiple LibreOffice processes simultaneously on one machine. For probably 99.99% of users, this isn’t a concern, but it can be a problem for automation. It means you have to be willing to create your PDFs synchronously.

7. Speed?

Warning: Guess approaching!

My hunch is that ReportLab is faster, maybe by a lot. But that’s backed by no data whatsoever. Benchmarking would be time-consuming. It would require inventing a variety of relatively complex PDFs and generating them using both methods. And that still might not tell you much about your use case.

In the grand tradition of arguing on the Internet, I’m not going to let my ignorance or lack of data keep me from having an opinion. But understand that it’s a guess, and take it with a huge grain of salt.

8. Experimentation?

You’re probably not producing this PDF for yourself, but for someone else. That someone might be an immediate co-worker, another department in your company, or a customer that’s in a completely different company. Experimenting with the output PDF is an important part of the process because it usually takes many tweaks to get the PDF to look the way your client wants.

Just as with developing software, the end result will be a moving target as ideas evolve. And also like software development, you want your tools and process to add as little friction as possible to the evolution.

With ReportLab, experimentation can be time-consuming. If you have a complex PDF, you’ll have a non-trivial amount of Python/ReportLab code to generate it. As code gets more complex, it gets harder to change. That’s not specific to ReportLab, it’s just a general software development principle. So when your client wants to change, say, how the page footer is formatted, or how figures are numbered, or the document font, the usual difficulties of maintaining code apply.

With LibreOffice, changing the document is extremely easy because you’re using a tool built expressly for that purpose. It’s straightforward, and you can immediately see the results of your changes.

9. Complexity?

By complexity, I’m referring to the complexity of one’s code relative to the complexity of the PDF you’re trying to create.

With ReportLab, the relationship is roughly linear complexity. If you have a complex PDF, you’ll have reasonably complex Python code to create it.

With LibreOffice, the relationship is non-linear. Deleting and duplicating XML elements and changing text are easy. Creating new elements is difficult. For instance, our trivial PDF example contained two paragraphs and a table. As I demonstrate in part 1 and in my talk, it’s easy to add, delete, and change table rows, but if you asked me to add an image to that document, I would be stuck because there’s no image in the XML for me to copy.

Obviously, I could add an image to the document and then see how that’s expressed in the XML, but that only works if I know in advance that I’m going to need an image.

10. Strengths

ReportLab is a safe choice. It does one thing and it does it well. The fact that it’s extensible means you can always get it to do what you want (although you might have to write more code than you planned). It’s the well-traveled path, so you’ll be able to find fellow travelers (and their tutorials and advice). It can handle extremely varied output.

Using the LibreOffice method is best when there’s a high ratio of static to dynamic content. Think about the extreme example of a 900-page PDF in which there’s only one paragraph of dynamic content. You would have to write very little code to populate that one paragraph, whereas with ReportLab you’d have to write code to generate all 900 pages, even though they never change.

The LibreOffice method requires less code — maybe a lot less, depending on your situation. The tradeoff is that you have to do more document construction work, but to me that’s still a win for two reasons. First, you get to use a tool built expressly for that purpose. Second, it’s easier (and cheaper) to find LibreOffice/document editing skills than Python/software development skills. Your client might even be able to build most of the document, which will save them money and give them control over and investment in the outcome. That makes for a happy client.

What’s Next

In my next post in this series, I’ll discuss some of the questions asked at my PyOhio talk, and in the fourth and final post I’ll present some useful code snippets. Stay tuned!

 

Creating PDF Documents Using LibreOffice and Python

This post is a supplement to a talk I’m giving at PyOhio about using Python to create PDFs “the lazy way”. It’s the first of a series on this subject which is a bit too big for just one blog post.

In the talk and in this series, I advocate a technique for creating PDFs that uses LibreOffice (or OpenOffice) to do most of the hard work, and I contrast that to the common solution of using ReportLab (or a library like it).

This technique offers some unique benefits, and in some common use cases—most importantly, perhaps in your case—it can be much more efficient than the alternative. I’ll compare and contrast the two in another blog post. In this post I just want to describe the technique I’m advocating.

Background

Creating PDFs programmatically is a task most Python programmers encounter at least once.

When I talk about creating PDFs programmtically, I’m thinking of the situation where one wants to create a lot of PDFs that follow a template. For instance, you might work for a bank that wants to produce end-of-month account statements for each of its 100,000 customers. The cover page will always contain the bank’s logo, some legal boilerplate, the month and year, and a bland stock photo 17068-a-woman-and-older-man-sitting-at-a-table-pvof happy customers doing something  unrelated to banking, like this one.

The first page after that will be a summary of the customer’s accounts, and then subsequent pages contain information about the account—a list of transactions, changes in values of stocks, etc.

Each PDF will be different, but similar because they follow a template. Computers are great for this sort of thing, and this technique is particularly good at it. As I said above, I’ll tell you why I think it’s good in another blog post. For now, I want to stop talking mysteriously about “the technique” and actually describe it.

Outline

Here’s a concise outline. Don’t worry if you don’t understand all the steps; they’re fleshed out below.

  1. Create a LibreOffice document that will serve as a template for the documents you want to create. (Note: I mean “template” in the general sense of a form or skeleton, not a LibreOffice .ott template file.)
  2. Unzip that document.
  3. Manipulate the document’s XML using standard Python libraries.
  4. Zip the modified files into a new LibreOffice .odt file.
  5. Ask LibreOffice to export the document in PDF format.

Let’s go through these step by step. I encourage you to follow along. We’re not going to write a single line of Python code, just explore a process. Writing Python would come later when you automate steps 2 – 5.

1. Create a LibreOffice Document to Use as a Template

This step will probably require the most work.

We usually know in advance at least some of the content we want. For instance, in the bank example above, we know what the cover page will look like, where each section should appear in the document, and how a section (e.g. a list of account transactions) should be formatted, even if we don’t know in advance the exact values of each transaction.

Your job during this step is to create a LibreOffice document that will serve as a skeleton (or template, or form) for your final documents. For content that you don’t know (words in paragraphs, images, bullet points in a list, table contents, etc.), leave placeholders.

If you want to play along with this blog post, here’s the LibreOffice document that I’ll use in the examples below.

2. Unzip the Document

This is a trick you might not know—LibreOffice documents are ZIP files. (This is true of all documents that follow the Open Document Format for Office Applications). You can unzip them with command line tools, or with the zipfile module in Python’s standard library.

On my Mac, the following command unzips the document into the directory unzipped.

unzip practice.odt -d unzipped

After unzipping, you’ll see a bunch of files like this:

drwxr-xr-x  11 philip staff    374 Jul 27 16:43 Configurations2/
drwxr-xr-x   3 philip staff    102 Jul 27 16:43 META-INF/
drwxr-xr-x   3 philip staff    102 Jul 27 16:43 Thumbnails/
-rw-r--r--   1 philip staff   8988 Jul 27 16:44 content.xml
-rw-r--r--   1 philip staff    899 Jul 27  2016 manifest.rdf
-rw-r--r--   1 philip staff   1005 Jul 27  2016 meta.xml
-rw-r--r--   1 philip staff     39 Jul 27  2016 mimetype
-rw-r--r--   1 philip staff  10319 Jul 27  2016 settings.xml
-rw-r--r--   1 philip staff  14903 Jul 27  2016 styles.xml

Of the files above, you’re only likely to be interested in content.xml. (You might also want to explore styles.xml, but I consider that an advanced topic, and I’m trying to maintain a rigorous standard of laziness.)

3. Manipulate the XML

The XML in content.xml is governed by the 846-page Open Document Format for Office Applications. You might think I’m going to suggest you read it, or at least familiarize yourself with it.

Heck no! That’s not the lazy way. I’m very pleased that it’s an ISO standard, but I don’t want to learn it if I can save time and effort by not doing so, and you shouldn’t have to either.

Instead I suggest you use what I use: common sense and intution, which can get you surprisingly far. For instance, if you see this in the XML—

<text:p text:style-name="P4">
 The fox jumped over the dog.
</text:p>

You don’t have to read 846 pages of documentation to guess that you can change it to this—

<text:p text:style-name="P4">
 The quick brown fox jumped over the lazy dog.
</text:p>

Or even this—

<text:p text:style-name="P4">
 No one expects the Spanish Inquisition!
</text:p>

Are you starting to see some possibilities?

If you’re doing this programmatically, you can use LibreOffice bookmarks to demarcate the text you want to replace. Bookmarks are visible in the XML and trivial to locate using XPath. You can see this in my example document where I’ve surrounded two blank space characters with bookmarks where adjectives might go to describe the fox and dog.

<text:p text:style-name="P1">
    The
    <text:bookmark-start text:name="fox_type_placeholder"/>
    <text:s/>
    <text:bookmark-end text:name="fox_type_placeholder"/>
    <text:s/>
    fox jumped over the
    <text:bookmark-start text:name="dog_type_placeholder"/>
    <text:s/>
    <text:bookmark-end text:name="dog_type_placeholder"/>
    <text:s/>
    dog.
</text:p>

What do you think will happen if you replace the first occurrence of  <text:s/> with quick brown?

Text isn’t the only thing you can change.

If you have a list item with bullets and you want another bullet or three, you can just duplicate existing bullets. For instance, if you start with this—

<text:list xml:id="list3413943092755896283" text:style-name="L1">
    <text:list-item>
        <text:p text:style-name="P2">First</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Second</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Third</text:p>
    </text:list-item>
</text:list>

You can turn it into this—

<text:list xml:id="list3413943092755896283" text:style-name="L1">
    <text:list-item>
        <text:p text:style-name="P2">First</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Second</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Third</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Fourth</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Fifth</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Sixth</text:p>
    </text:list-item>
</text:list>

Note that the text:list element itself has what looks like a unique id associated with it. This is a yellow flag that indicates to me that if you want to copy the entire list, you’ll need to give it a new unique id, and hope that LibreOffice  doesn’t reference that id in some other file.

I’m sure the details are somewhere in that 846-page document. You can read that document, or you can also just try your change and see what happens. The worst case scenario is that LibreOffice will tell you that your document is corrupted and you’ll have to go back and explore some more.

4. Zip a New LibreOffice File

Once you’ve made the changes you want, it’s time to reverse step 2, using your modified content.xml.

Here’s the command that works on my Mac—

cd unzipped && zip -r ../my_new_file.odt * && cd ..

Note that this command doesn’t respect the OpenDocument specification which has rules regarding how the mime type file should be represented in the zip file (as the first file in the archive, and uncompressed, per OpenDocument v1.2 part3, § 3.3 MIME Media Type). It works for me, maybe because LibreOffice is forgiving. It’s not something you should rely on, however. In another post, I’ll present some Python code that constructs the ZIP file according to standard.

5. Export to PDF via LibreOffice

If you’re just experimenting, you can just open the document in LibreOffice manually and then use the “File/Export as PDF…” menu item. (Opening manually is also a good test that you didn’t do anything objectionable to the XML.)

Programmatically, I recommend using unoconv for converting your finished document to PDF.

Review

So there you have it! If you feel underwhelmed, keep in mind that this was only a proof of concept. In some future posts, I’ll explain why I think this method is often an excellent choice (and also when it isn’t).

Photo Credit

Thanks to the National Cancer Institute for making many photos available for free, including the one used in this blog post which was taken by Rhoda Baer.

Embedding Python: How To Confuse Python and Yourself

This is a cautionary tale about how embedded Python finds its runtime files under Windows. Don’t worry, though — everyone lives happily ever after.

The story begins with a client’s request to build an executable under Windows that embeds Python. This is not so hard; there is documentation for embedding Python.

I had two versions of Python installed on my Windows virtual machine because I was experimenting with different Pythons at the time. (Doesn’t everyone go through an experimental phase in their youth?) In C:\Python27 I had installed 64-bit Python 2.7.9 from Python.org, and in C:\Users\philip\Miniconda2 I had installed 64-bit Python 2.7.11 from Continuum. The 2.7.9 version was an older install. I was only interested in using the 2.7.11 version.

My executable’s C code told Python where to find its runtime files —

Py_SetPythonHome("C:/Users/philip/Miniconda2");

After compiling and linking, I ran my Python-embedding executable which imported my hello_world.py file. It printed “Hello world!” as expected.

Here Come the Dragons

I thought everything was fine until I added these print statements to my Python code —

import sys
print sys.exec_prefix
print sys.version

The output was not what I expected —

C:/Users/philip/Miniconda2

2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit (AMD64)]

This is contradictory! The Miniconda Python was 2.7.11, yet sys.version identified itself as Python 2.7.9. How can the same module report two different Pythons simultaneously?

What Went Wrong

The sys module reported information about two Pythons simultaneously because I was mixing parts from two different Python runtimes simultaneously.

Python’s runtime consists of a dynamically loaded library (python27.dll), the standard library which is largely written in Python itself, and an executable. The executable is usually python.exe, but in this case it was my C program that embedded Python. When my C program asked Windows to find python27.dll, Windows searched for it in these directories as documented in the Windows DLL search strategy  —

  1. The directory where the executable module for the current process is located.
  2. The current directory.
  3. The Windows system directory (usually C:\Windows\system32).
  4. The Windows directory (usually C:\Windows).
  5. The directories listed in the PATH environment variable.

My problem was that Windows found C:\Windows\system32\python27.dll first, and that was from my Python 2.7.9 installation. Meanwhile, my call to Py_SetPythonHome() had told Python to use the standard library from Miniconda Python. The value of sys.version comes from a string hardcoded in the runtime DLL, while sys.exec_prefix is derived from the value I passed in Py_SetPythonHome(). I was using the standard library from one Python installation, and the runtime DLL from another.

Consequences

Although I didn’t experiment with this for long, I might not have noticed that there was a problem if I hadn’t been lucky enough to double check my Python setup with the sys module. The standard library probably doesn’t care about which interpreter it runs under. I can imagine a few cases may exist where changes/bug fixes were made to the Python part of the standard library for versions 2.7.10 and 2.7.11 that rely on corresponding changes to the binary runtime, and that code might behave badly.

Both of the Pythons I was using were built with the same compiler, so theoretically binary extensions like numpy should run just fine under either Python. But I could certainly forgive numpy if it crashed as a result.

In short, this is neither a typical nor a supported use of Python which puts it in the “Here there be dragons” category.

The Solution

The solution was very simple. I copied C:\Users\philip\Miniconda2\python27.dll into the same directory as my custom executable. Since that’s the first location Windows searches when loading a DLL, it isolates my code from other Python DLLs that might appear in (or disappear from) other locations in the file system. Problem solved!

CentOS Revisited – How Old is Too Old?

I recently started working with CentOS 5.11 as an operating system on which to build Python wheels for Linux. I wrote about why I used CentOS 5.11. The oversimplified reason is because it’s old. After using it for a while, I’ve started to wonder, how old is too old?

Why CentOS 5.11, Again?

To understand why the authors of PEP 513 recommend CentOS 5.11, we must briefly consider a subject that Python programmers can usually ignore—binary dependencies.

When one builds a binary on Linux (or any system, for that matter), it comes with dependencies on runtime libraries. Even a simple “hello world” program will depend on the C runtime library (glibc if you build with GCC). The benefit of building binaries on an older Linux is that the runtime dependencies—particularly glibc—are likely to be present on newer systems. The reverse is not true. If you link your binary to a brand new glibc, it won’t be able to run on older systems because they don’t have the glibc needed to load your binary.

CentOS 5.11 was released on the last day of September 2014 (according to DistroWatch.org), so it’s only 1½ years old. But it’s a derivative of Red Hat which is a notoriously conservative distro. To give you some idea of how conservative it is, CentOS 5.11 provides Python 2.4.3 which was released in 2006, eight years before CentOS 5.11, and almost ten years before PEP 513 was released. CentOS 5.11 is a snapshot of what was state-of-the-art some years ago.

There’s nothing wrong with a Linux distro that chooses to be this conservative. If you want modern software, or bleeding edge, there are distros for that. RedHat Enterprise Linux (and thus CentOS) is not one of them.

CentOS 5.11’s “old school” attitude makes it a very safe bet that the versions of it base libraries (like glibc) will appear on other Linux distros, and that’s why it’s a good choice for building Linux wheels.

Great! Are There Any Downsides?

Yup.

The point of using an older Linux is to ensure that runtime libraries are not too new to appear on other Linuxes. But what if the opposite happens?

What happens if a library on CentOS 5.11 is so old that some of the Linux world no longer supports it? That’s what happened when I tried (for one of my clients) to wrap a Fortran library with Python and distribute it as a wheel. I built the Fortran code with the default GFortran/GCC version which was 4.1.2.

The resulting binary has a dependency on libgfortran.so.1. This library has become old enough that it’s not always easy to install. For instance, it’s not in the repositories of the very popular Ubuntu 14.04 LTS.

That’s particularly surprising when you consider that Ubuntu 14.04 LTS was released about six months before CentOS 5.11. Despite this, the former had already dropped support for the default libgfortran of the latter.

This is a good example of how CentOS 5.11 helps to avoid dependency problems, but doesn’t entirely solve them. In short, caveat munitor (builder beware).

How I Resolved the libgfortran.so.1 Dependency

I was able to build a wheel for my client that solved the specific libgfortran.so.1 dependency problem described above. I set the binary’s rpath to include the binary’s directory ($ORIGIN) and shipped libgfortran.so.1 as part of the Python wheel in the same directory as the custom shared library. The relevant Makefile portion looks like this—

gfortran -shared               \
         -fPIC                 \
         -Wall                 \
         -Wl,-rpath,'$$ORIGIN' \
         -o my_library.so      \
         $(OBJS)

And running ldd on the resulting library shows that the binary uses the local libgfortran.so.1 as intended—

$ ldd libmy_library.so
   linux-vdso.so.1 => (0x00007ffd0d93d000)
   libfftw3.so.3 => /usr/lib/x86_64-linux-gnu/libfftw3.so.3 (0x00007f1b97297000)
   libgfortran.so.1 => /home/philip/miniconda2/lib/python2.7/site-packages/my_library/bin/./libgfortran.so.1 (0x00007f1b97000000)
   libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f1b96cfa000)
   libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f1b96ae4000)
   libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1b9671f000)
   /lib64/ld-linux-x86-64.so.2 (0x00007f1b9f8b9000)

Matplotlib Advice

A little bonus tip: if you want to install matplotlib on CentOS 5.11, save some time and read this Stack Overflow comment by Byron Dover first.

How To Set Up CentOS to Build Linux Wheels for Python

PEP 513 gives guidelines on how to build broadly compatible Linux platform wheels for Python. The PEP names CentOS 5.11 as the reference OS on which a Python wheel must run if it is to earn the right to the manylinux1 name.

The surest way to get one’s binary package to run on CentOS 5.11 is to build it there. This post explains how I set up a CentOS 5.11 VirtualBox guest to build manylinux1 Python wheels.

PEP 513 offers a prebuilt Docker container of CentOS 5.11. If you’re on Linux and/or you’re familiar with Docker, that’s probably a better route than building a VM.

Note that I’m not at all a Linux expert. If I’ve done something foolish or incorrect, I’d like to hear about it in the comments. Please be nice. =)

Why CentOS 5.11?

CentOS has a few things going for it that make it a good choice —

  • It’s free
  • As a derivative of Red Hat Enterprise Linux, it’s a conservative distro, so the libraries on it are likely older than the libraries on contemporaneous distros.
  • At the time PEP 513 was written, CentOS 5.11 was already over a year old. That increases the odds that other distros will have the libraries that it has.

It’s age is also a disadvantage, because at this stage the only updates CentOS 5 will receive are critical security updates and the CentOS team “recommends that you start moving workloads from CentOS 5 to CentOS Linux 6 or CentOS Linux 7”.

CentOS 5.11 Setup

Download the CentOS 5.11 ISO and install under VirtualBox. During the CentOS installation, I opted to disable SELinux. Since I only use this installation for builds and not as a server or daily desktop, I don’t feel the need for high security.

Once CentOS is installed, let the updater download and install its patches.

Next, you’ll want to install VirtualBox guest additions to make the guest OS easier to use. In order to do that, you first have to add yourself to the sudoers file.

Add Yourself to sudoers

Open a terminal and enter the following commands —

  1. su -
  2. vim /etc/sudoers
  3. At the end of the file, add this line:
    your_username ALL=(ALL) ALL
  4. Save the file with :wq!
  5. Type exit to exit the su - shell.

Now you should be able to run commands with sudo.

Build the VirtualBox Guest Additions

  1. Install GCC:
    sudo yum install gcc gcc-c++
  2. Insert the Guest Additions CD.
  3. Start a terminal and cd /media/VBOXADDITIONS_xxxx. Note that the exact name of the VBOXADDITIONS directory changes with each each version of VirtualBox.
  4. sudo ./VBoxLinuxAdditions.run
  5. Eject the Guest Additions CD and reboot.

Add Packages

Install the things you’ll need to build Python, and to use pip —

sudo yum install xz zlib zlib-devel openssl-devel pcre-devel sqlite-devel

Build Python

CentOS 5.11 comes with Python 2.4. You will undoubtedly want a newer Python, so download and untar the source code for the Python you want to use and then build it. I built Python 2.7.11 with these steps —

 sudo ./configure --enable-unicode=ucs4
 sudo make altinstall

It’s important to use UCS4 (as opposed to the default UCS2) during the configure step to increase your odds of being compatible with the Pythons built for other Linux distros.

make altinstall tells Python to install itself in such a way that it doesn’t interfere with the default (system) Python.

Once my Python was built, I added a symlink to make it the default Python in my shell —

 sudo ln -s /usr/local/bin/python2.7 /usr/local/bin/python

At this point, if you start a new terminal and type python, you should
get Python 2.7.11.

Download and Install pip

wget --no-check-certificate https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py

This also installs setuptools and wheel, both of which we need.

Fix pip

At this point, pip will malfunction as described in issue 1918. The fix is not difficult.

Use sudo vim to edit these 3 scripts —

  • /usr/local/bin/pip
  • /usr/local/bin/pip2
  • /usr/local/bin/pip2.7

In each, change the first line from this —

#!/usr/bin/python

to this —

#!/usr/local/bin/python

Enjoy, and Build Some Wheels!

You’re done! Go build some great wheels!

 

My Code Got Cyndi Lauper-ed, and I’m Glad

Let’s start with a quick musical quiz. Who wrote this 1983 pop hit? I’ll give you a hint — it’s not the person who sings it.

You can see the songwriting credits here (assuming that link survives time), but for most of you it won’t be a useful clue.

Those of you who grew up in Philadelphia in the 1980s might recognize the name Robert Hazard, leader of Robert Hazard and the Heroes, author of Escalator of Life, Change Reaction, and Out of the Blue, and, of course, Girls Just Wanna Have Fun. Robert’s popularity never grew much out of the Philly/NJ area, but Cyndi Lauper’s version of Girls Just Wanna Have Fun sold a zillion copies worldwide and touched a lot of people who had never heard of Robert Hazard, and never will.

I have something in common with him — someone has given my work far more exposure than I ever expected it to get. (Another thing we have in common is growing up in the same small Philadelphia suburb. But without Cyndi Lauper’s involvement, that’s just trivia.)

I was surprised to learn from http://pypi-ranking.info that posix_ipc, one of my open source packages, is currently in the top .5% (½ of 1%) of the most downloaded on PyPI. Now, posix_ipc might be good at what it does, but it fills a tiny niche that’s nowhere near big enough to justify all of those 1.7-million-and-counting downloads. Why is it a top 1% download? Because it has become part of something much bigger — the massively popular OpenStack.

OpenStack didn’t have to rewrite portions of posix_ipc (like Cyndi did with Girls Just Wanna Have Fun, with Robert’s permission). They haven’t yet made a video of it that includes a nod to the Marx Brothers (like Cyndi did, with or without Robert’s permission). And as far as I know, OpenStack has yet to be nominated for a Grammy. But they have shown me the value of putting something out into the world, because you never know where it will end up.

So thanks, OpenStack! And thanks to Robert Hazard for music I enjoyed growing up (and still do). R.I.P, Robert.

 

My Function Failed Inspection

Python’s inspect module allows one to examine the signature of functions, like so:

$ python3
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 23 2015, 02:52:03)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import inspect
>>> def f(foo=42):
... pass
...
>>> print(inspect.signature(f))
(foo=42)

I wanted to use function signature inspection during unit testing of my sysv_ipc and posix_ipc modules to ensure that my code matched its documentation.

Unfortunately, inspect doesn’t work with functions written in C, as you can see in the example below that uses math.sqrt().

$ python3
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 23 2015, 02:52:03)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import inspect
>>> import math
>>> inspect.signature(math.sqrt) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/inspect.py", line 2055, in signature return _signature_internal(obj) File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/inspect.py", line 1957, in _signature_internal skip_bound_arg=skip_bound_arg) File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/inspect.py", line 1890, in _signature_from_builtin raise ValueError("no signature found for builtin {!r}".format(func)) ValueError: no signature found for builtin <built-in function sqrt> 
>>>

Since posix_ipc and sysv_ipc are written in C, I have to test their functions and methods with keyword arguments by calling each one with each keyword argument specified.

Maybe there is no current working directory

A few years ago I wrote a Django app for a client. One part of the app called os.getcwd(), and another part (that I thought of as completely separate) used a temporary directory to build PDFs.

Occasionally the call to os.getcwd() would raise an error. I was confused. How can there be no current directory? It took me a while to figure it out, but in hindsight it’s kind of obvious (as these things often are).

My PDF-building code created a temporary directory, set that directory to be the current working directory, and then removed the directory once the PDF was built. After that, there was no current working directory. It’s easy to demonstrate —

$ python3
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 23 2015, 02:52:03)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> import tempfile
>>> temp_dir = tempfile.TemporaryDirectory()
>>> os.chdir(temp_dir.name)
>>> os.getcwd()
'/private/var/folders/9f/4zptd_j10dx343w0r7tvkm0r0000gn/T/tmpfcwpe1xs'
>>> temp_dir.cleanup()
>>> os.getcwd()
Traceback (most recent call last):
File "", line 1, in
FileNotFoundError: [Errno 2] No such file or directory
>>>

I have a feeling I’m not the only developer who was foolish enough to assume that os.getcwd() would never fail. At least now I know better, and you do too!

 

 

I’m not that type of variable

This is a story about the details of a C type leaking into Python.

As part of testing my Python wrapper for SysV IPC, I wrote tests for the time-related attributes of the IPC objects that change when something happens to the object. For instance, when someone sends a message to a queue, the queue’s last_send_time attribute (msg_stime in the C code) is updated to the current time.

I have a hard time imagining many programmers care about these attributes. Knowing the last time someone changed the uid of a message queue, for instance, just doesn’t have many use cases. But they’re part of the SysV IPC API and so I want my package to expose them.

I wrote tests to ensure that each one changed when it was supposed to. The tests failed consistently although the code worked when I tested it “by hand” in an interactive shell. Here’s the relevant portion of a failing test:

[python]
def test_property_last_change_time(self):
"""exercise MessageQueue.last_change_time"""
original_last_change_time = self.mq.last_change_time
# This might seem like a no-op, but setting the UID to
# any value triggers a call to msgctl(…IPC_STAT…)
# which should set last_change_time.
self.mq.uid = self.mq.uid
# Ensure the time actually changed.
self.assertNotEqual(self.mq.last_change_time,
original_last_change_time)
[/python]

The problem is obvious, right? No, it wasn’t obvious to me, either.

The problem is that in C, a message queue’s last change time (msg_ctime) is of variable type time_t which is typedef-ed as an integral type (int or long) on most (all?) systems. Because the test above executed in less than 1 second, the assertion always failed. Setting self.mq.uid correctly caused an update to the last change time (msg_ctime), it was just being updated to the same value that had been saved in the first line of the test.

My solution was to add a little sleeping, like so:
[python]
def test_property_last_change_time(self):
"""exercise MessageQueue.last_change_time"""
original_last_change_time = self.mq.last_change_time
time.sleep(1.1)
# This might seem like a no-op, but setting the UID to
# any value triggers a call to msgctl(…IPC_STAT…)
# which should set last_change_time.
self.mq.uid = self.mq.uid
# Ensure the time actually changed.
self.assertNotEqual(self.mq.last_change_time,
original_last_change_time)
[/python]
That ensured that the value stored in original_last_change_time at the start of the test would differ from self.mq.last_change_time by at least 1 at the end of the test.

Lessons learned: don’t forget about C types, especially when you’re wrapping a C API.