Coercing Objects to Integer, Revisited

Summary

I recently wrote a blog post that involved exception handling, and gave short shrift to the part of exception handling I didn’t want to talk about in order to focus on the part I did want to talk about. For some readers, that clearly backfired.

Background

My recent blog post about coercing Python objects to integers caught people’s attention in a way I hadn’t intended. The point I was trying to make was that an innocent-looking call like int(an_object) calls the method an_object.__int__(), and since that can be arbitrary code, it can raise arbitrary exceptions. Therefore, it’s insufficient to catch only the usual exceptions of ValueError and TypeError if you don’t know the type of an_object in advance.

Here’s the code I suggested –

def int_or_else(value, else_value=None):
    """Given a value, returns the value as an int if possible.
    If not, returns else_value which defaults to None.
    """
    try:
        return int(value)
    # I don't like catch-all excepts, but since objects can raise arbitrary
    # exceptions when executing __int__(), then any exception is
    # possible here, even if only TypeError and ValueError are
    # really likely.
    except Exception:
        return else_value

Several commenters objected to the fact that this code discards (and therefore silences/masks/hides) all exceptions. Here’s why I made that choice.

The Two Parts of Exception Handling

In Python, there are two parts to consider about exception handling: what to catch, and what to do with the exception once you’ve caught it. My intention was to write only about the former.

The latter is an interesting topic, too. Once you’ve caught an exception, you might want to log it and then discard it, log it and then re-raise it, re-raise it as a different exception, silence it, let it pass up to the caller, modify its attributes and re-raise it, etc. There’s enough material for an entire blog post about different ways to react to an exception, and the pros and cons of each.
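
To make two of those options concrete, here is a minimal sketch. The names int_or_else_logged, int_or_raise, and CoercionError are made up for illustration and were not part of the original post. The first function logs the exception and then discards it; the second re-raises it as a different exception.

```python
import logging

logger = logging.getLogger(__name__)


class CoercionError(Exception):
    """Hypothetical application-level error, used only in this sketch."""


def int_or_else_logged(value, else_value=None):
    """Log the exception (with traceback), then discard it."""
    try:
        return int(value)
    except Exception:
        logger.exception("could not coerce %r to int", value)
        return else_value


def int_or_raise(value):
    """Re-raise whatever int() raised as one predictable exception type."""
    try:
        return int(value)
    except Exception as exc:
        # "from exc" keeps the original exception available as __cause__.
        raise CoercionError("could not coerce {!r} to int".format(value)) from exc
```

Which reaction is appropriate depends on the caller; neither is meant as a recommendation.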

Someday I might write that post about different ways to react to trapped exceptions, and if I do, I’ll dedicate the entire post to the subject to give it the attention it deserves. That other blog post – that was not it. In fact, it was the opposite. I gave the topic of processing the trapped exception as little attention as possible so as not to detract attention from what I wanted to be the main topic (what exceptions need to be trapped).

That backfired.

Conclusion

My post was not advocacy of discarding exceptions, nor was it advocacy of not discarding exceptions. What’s the right choice? It depends. One situation where you might want to discard exceptions is in a blog post where you’re trying to keep the code as brief as possible for readability. Then again, you might regret that. :-)

In the future, I’ll be clearer about what shortcuts I’m taking for brevity of presentation.

Agree? Disagree? I’d like to hear from you. I like it when people agree with me. Those who disagree can expand my horizons, and I like that too. In short, all civil comments are welcome. I feel I’ve spent enough time thinking about this topic for now, but that doesn’t make me right! Let me know what you think.

A Postcard of Tunisia

Earlier on this blog I briefly mentioned working with some Libyans in Tunis, the capital of Tunisia. We chose to meet at that location because it’s close to Libya but much safer than Tripoli. Now that I’ve been back for a while and had a chance to catch up, I wanted to write more about my experience.

A photo of the translator translating live
English-to-Arabic translation on the fly!

I was there with Tobias McNulty of Caktus Group. We (Tobias and I) trained the Libyan employees of Libya’s High National Election Commission (HNEC) in the maintenance and use of the HNEC-commissioned SMS-based voter registration system that I had helped to develop while working with Caktus. The system has been open sourced as Smart Elect.

If the big picture was promoting democracy, the medium picture was training system admins and developers. And the very small picture was working together on the nitty gritty of features and bug fixes, like figuring out that if a @property method raises an exception when invoked by hasattr(), the exception isn’t propagated under Python 2.7.
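
For the curious, the hasattr() behavior can be sketched in a few lines. The Gauge class is invented for illustration. Under Python 2.7, hasattr() swallows the exception and returns False in both cases below; under Python 3, only AttributeError is treated as "no such attribute" and anything else propagates.

```python
class Gauge:
    """Hypothetical class, used only to show how hasattr() reacts."""

    @property
    def broken(self):
        # Not an AttributeError, so Python 3's hasattr() lets it propagate.
        raise RuntimeError("sensor offline")

    @property
    def missing(self):
        # hasattr() treats AttributeError as "no such attribute".
        raise AttributeError("not available")


gauge = Gauge()

missing_result = hasattr(gauge, "missing")    # False on Python 2 and 3

try:
    broken_result = hasattr(gauge, "broken")  # Python 2.7 would return False here
except RuntimeError:
    broken_result = "raised"                  # Python 3 ends up here
```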

The admin training consisted of a comprehensive review of the system, including the obscure corners and edge case handling. The developers were eager to get their hands dirty, so after some organizational review, we dove into fixing bugs and implementing some new features that HNEC wanted.

A photo of a trainee
Abdullah (Photo by Tobias McNulty)

Tobias and I worked with the developers as both mentors and peers. Grinding through bugs from start to finish was really valuable. Our trainees have good development experience, but working in groups with us allowed them to participate in our approach to debugging, problem reporting, development, and testing. It seemed a little different from what they were used to. We were very methodical about creating an issue in our tracker, creating a branch for that issue, reviewing one another’s code, documenting the fix, etc. “It’s a lot of process,” said one trainee after working through one particular bug with us. He’s right. I wish I had thought to ask if Libyan culture has a proverb similar to “For want of a nail…”. I could have said, “For want of filing an issue in the tracker, a voter was disenfranchised,” but it doesn’t have the same ring to it.

A photo of Tobias and a trainee
Tobias and Ahmed

This was my first trip to Africa, and, grand notions aside, what stood out to me was how mundane much of the experience was. The guys we worked with would have fit right in at any coding meetup I’ve been to. They had opinions about laptops. They were distracted by their phones. Everyone enjoyed a successful bug hunt. I remember one trainee being tired at 5PM, saying he had no more left in him, and seeing him there grinning 2 hours later when we finally solved the problem we’d been working on.

Outside of the training, I especially enjoyed the dinners at Sakura/Pasta Cosy and Chez Zina (my favorites, in that order).

We also ate at Le Bon Vieux Temps, where the handwritten chalkboard menu is carted from table to table on a charming-but-impractical frame. Tunisia is principally French speaking, with Arabic on an almost equal footing. At Le Bon Vieux Temps (“The Good Old Times”), the menu was all in French, and my vestigial French came in handy for translating the menu into English for the Libyans who in turn peppered the waiter with questions in Arabic. (That night in the restaurant began and ended my career as a French-to-English translator.)

On the weekends we rested, walked in the city, and paid a visit to the Bardo National Museum. The Bardo was famously attacked in 2015, and has since sprouted a razor wire fence around the entire property. Bored soldiers sat on a truck by the gate and motioned us to enter. It’s a nice museum, and I’m glad I went.

A photo of my entrance pass to the Bardo Museum

Inside the classroom and out, I got to know and really like our Libyan colleagues. They were generous with their good humor and kindness. If they lacked anything, it was a willingness to complain.

Libya is a difficult place to live at the moment. I think we all know that in an abstract sense, but talking to my Libyan friends made it more concrete for me. Banks don’t have enough cash. Electricity isn’t reliable. People they know have been kidnapped. My friends have a lot on their minds, and yet they found room to squeeze in opinions about good software development practices.

A photo of a trainee
Munir

I’m glad I got the chance to go, and to get to know the people I did. In addition to working with Tobias and the Libyans, I had a lot of non-work experiences I’ll remember for a long time. I walked among ruins in Carthage that are over 2000 years old. I drove solo (and lost) through rush hour traffic in Tunis and survived. I saw a Tunisian wedding, and got to use the word “ululating” for the first time outside of Scrabble or Bananagrams. I swam in the Mediterranean. I saw flocks of flamingoes (many, many thanks to Hichem and Claudia of Les Amis des Oiseaux).

HNEC is now better positioned than ever to use the Smart Elect system, and I hope they do so again soon. That’s partly for egotistical reasons — I like to see my work get used. Who doesn’t? But more importantly, if it gets used, that means Libyans are voting to determine their own future.

How Best to Coerce Python Objects to Integers?

Summary

In my opinion, the best way in Python to safely coerce things to integers requires use of an (almost) “naked” except, which is a construct I rarely want to use. Read on to see how I arrived at this conclusion, or you can jump ahead to what I think is the best solution.

The Problem

Suppose you had to write a Python function to convert string values representing temperatures to integers, like this list:

['22', '24', '24', '24', '23', '27']

The strings come from a file that a human has typed in, so even though most of the values are good, a few will have errors ('25C') that int() will reject.

Let’s Explore Some Solutions

You might write a function like this —

def force_to_int(value):
    """Given a value, returns the value as an int if possible.
    Otherwise returns None.
    """
    try:
        return int(value)
    except ValueError:
        return None

Here’s that function in action at the Python prompt —

>>> print(force_to_int('42'))
42
>>> print(force_to_int('oops'))
None

That works! However, it’s not as robust as it could be.

Suppose this function gets input that’s even more unexpected, like None:

>>> print(force_to_int(None))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in force_to_int
TypeError: int() argument must be a string or a number, not 'NoneType'

Hmmm, let’s write a better version that catches TypeError in addition to ValueError

def force_to_int(value):
    """Given a value, returns the value as an int if possible.
    Otherwise returns None.
    """
    try:
        return int(value)
    except (ValueError, TypeError):
        return None

Let’s give that a try at the Python prompt —

>>> print(force_to_int(None))
None

Aha! Now we’re getting somewhere. Let’s try some other types —

>>> import datetime
>>> print(force_to_int(datetime.datetime.now()))
None
>>> print(force_to_int({}))
None
>>> print(force_to_int(complex(3,3)))
None
>>> print(force_to_int(ValueError))
None

OK, looks good! Time to pop open a cold one and…

Wait, I can still feed input to this function that will break it. Watch this —

>>> class Unintable():
...     def __int__(self):
...         raise ArithmeticError
...
>>>
>>> trouble = Unintable()
>>> print(force_to_int(trouble))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in force_to_int
  File "<stdin>", line 3, in __int__
ArithmeticError

Dang!

While the class Unintable is contrived, it reminds us that classes control their own conversion to int, and can raise any error they please, even a custom error. A scenario that’s more realistic than the Unintable class might be a class that wraps an industrial sensor. Calling int() on an instance normally returns a value representing pressure or temperature. However, it might reasonably raise a SensorNotReadyError.
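
The sensor scenario might look something like the sketch below. PressureSensor and SensorNotReadyError are hypothetical names invented for this example; no real sensor library is implied. The point is only that a handler for ValueError and TypeError lets the custom error escape.

```python
class SensorNotReadyError(Exception):
    """Hypothetical error raised while the sensor has no reading yet."""


class PressureSensor:
    """Hypothetical wrapper around an industrial sensor."""

    def __init__(self, reading=None):
        self.reading = reading

    def __int__(self):
        if self.reading is None:
            raise SensorNotReadyError("no reading available yet")
        return int(self.reading)


def force_to_int(value):
    """Same as the version above that catches ValueError and TypeError."""
    try:
        return int(value)
    except (ValueError, TypeError):
        return None


ready = force_to_int(PressureSensor(101.3))  # a sensor with a reading converts fine

try:
    force_to_int(PressureSensor())           # no reading yet
    escaped = False
except SensorNotReadyError:
    escaped = True                           # the custom error got past force_to_int()
```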

And Finally, the Naked Except

Since any exception is possible when calling int(), our code has to accommodate that. That requires the ugly “naked” except. A “naked” except is an except statement that doesn’t specify which exceptions it catches, so it catches all of them, even SyntaxError. Naked excepts give bugs a place to hide, and I don’t like them. Here, I think it’s the only choice:

def force_to_int(value):
    """Given a value, returns the value as an int if possible.
    Otherwise returns None.
    """
    try:
        return int(value)
    except:
        return None

At the Python prompt —

>>> print(force_to_int(trouble))
None

Now the bones of the function are complete.

Complete, Except For One Exception

Graham Dumpleton’s comment below pointed out that there’s a difference between what I call a ‘naked’ except:

except:

And this —

except Exception:

The former traps even SystemExit which you don’t want to trap without good reason. From the Python documentation for SystemExit —

It inherits from BaseException instead of Exception so that it is not accidentally caught by code that catches Exception. This allows the exception to properly propagate up and cause the interpreter to exit.

The difference between these two is only a side note here, but I wanted to point it out because (a) it was educational for me and (b) it explains why I’ve updated this post to hedge on what I was originally calling a ‘naked’ except.
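
A small sketch shows the difference in practice. The Quitter class and the function names naked() and fig_leaf() are invented for this illustration.

```python
class Quitter:
    """Hypothetical class whose __int__() asks the interpreter to exit."""

    def __int__(self):
        raise SystemExit(1)


def naked(value):
    try:
        return int(value)
    except:  # catches every BaseException subclass, including SystemExit
        return None


def fig_leaf(value):
    try:
        return int(value)
    except Exception:  # SystemExit and KeyboardInterrupt are not caught
        return None


naked_result = naked(Quitter())  # the exit request is silently swallowed

try:
    fig_leaf(Quitter())
    exit_propagated = False
except SystemExit:
    exit_propagated = True       # the exit request reaches the caller
```

Outside of a demo you would not normally catch SystemExit at all; it is caught here only so the script can report what happened.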

The Final Version

We can make this a bit nicer by allowing the caller to control the non-int return value, giving the “naked” except a fig leaf, and changing the function name —

def int_or_else(value, else_value=None):
    """Given a value, returns the value as an int if possible. 
    If not, returns else_value which defaults to None.
    """
    try:
        return int(value)
    # I don't like catch-all excepts, but since objects can raise arbitrary
    # exceptions when executing __int__(), then any exception is
    # possible here, even if only TypeError and ValueError are 
    # really likely.
    except Exception:
        return else_value

At the Python prompt —

>>> print(int_or_else(trouble))
None
>>> print(int_or_else(trouble, 'spaghetti'))
spaghetti

So there you have it. I’m happy with this function. It feels bulletproof. It contains an (almost) naked except, but that only covers one simple line of code that’s unlikely to hide anything nasty.
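
As a quick usage sketch, here is the function applied to the temperature list from the start of the post. The two bad entries ('25C' and None) are added here for illustration.

```python
def int_or_else(value, else_value=None):
    """Given a value, returns the value as an int if possible.
    If not, returns else_value which defaults to None.
    """
    try:
        return int(value)
    except Exception:
        return else_value


raw_temperatures = ['22', '24', '24', '24', '23', '27', '25C', None]

# Bad entries become None instead of stopping the whole conversion.
temperatures = [int_or_else(raw) for raw in raw_temperatures]

# Keep only the readings that converted cleanly.
valid = [t for t in temperatures if t is not None]
```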

You might also want to read a post I made about the exception handling choices in this post.

I release this code into the public domain, and I’ll even throw in the valuable Unintable class for free!

The image in this post is public domain and comes to us courtesy of Wikimedia Commons.

Creating PDF Documents Using LibreOffice and Python, Part 4

This is the fourth and final post in a series on creating PDFs using LibreOffice and Python. The first three parts are here:

They’re all a supplement to a talk I gave at PyOhio 2016.

This final post is here to point you to a working code example that you can download from my Bitbucket repository. It’s enough to get you started so you can experiment with your own goals in mind.

https://bitbucket.org/philip_semanchuk/pdfs_from_python

One thing I mention in the code that’s worth repeating here is that the code uses ElementTree to manipulate XML. It’s sufficient for this demo, and the fact that it’s part of the Python standard library means you can run the demo without installing any third party libraries. For real world (i.e. non-demo) usage, I recommend lxml as a more robust and helpful alternative to ElementTree.

A Curious Coincidence: Stinkin’ Badges

Treasure of the Sierra Madre movie poster

The title of my PyOhio talk was “We Don’t Need No Stinkin’ PDF Library: Build PDFs with Python the Lazy Way”. You know the “we don’t need no stinkin’ [whatever]” meme, don’t you? It’s from the Mel Brooks movie Blazing Saddles. (You can find the clip on YouTube.) Did you know that Blazing Saddles is quoting another movie?

The night before I gave my talk, I walked from my AirBnB to a nearby bar and bottle shop. (It’s simply called “The Bottle Shop”. Ohioans are plain dealers, apparently). I settled in there, happy with a pint of stout. On the big screen they were playing an old black and white Western — The Treasure of the Sierra Madre.

I didn’t realize until it happened on the screen that this movie is the inspiration for the “We don’t need no stinkin’ badges” quote, although no one ever actually says “We don’t need no stinkin’ badges”. The actual line is “Badges? We ain’t got no badges. We don’t need no badges! I don’t have to show you any stinkin’ badges!”

It’s pretty close to the line from B. Traven’s novel of the same name.

I didn’t have time in my talk to mention Blazing Saddles, the mysterious B. Traven, The Treasure of the Sierra Madre, Humphrey Bogart, The Bottle Shop, nor the stout. But I was amused by our brief coincidence in Columbus.

Just in Time to Vote!

I just returned from Tunisia a couple of days ago.

In 2014 and 2015 I worked for Caktus Group to help develop an SMS-based voter registration system on behalf of the Libyan government (specifically the High National Election Commission, or HNEC). The open source version of this system is called Smart Elect.

For the last three weeks in Tunisia (which is next door to Libya and a whole lot safer), Caktus’ Tobias McNulty and I trained a dozen HNEC employees on how to use, develop, and maintain the system. We talked about Python, Django, open source culture, GitHub Flow, and, of course, the upcoming U.S. election.

On the eve of that election, I thought it appropriate to express gratitude for my opportunity to participate in the messy sausage-making that is democracy. Good luck to our new Libyan friends; I hope they get the opportunity to do the same in the very near future.

Creating PDF Documents Using LibreOffice and Python, Part 3

This is part 3 of a 4-part series on creating PDFs using LibreOffice. You should read part 1 and part 2 if you haven’t already. This series is a supplement to a talk I gave at PyOhio 2016.

Here in part 3, I review the conversation we (the audience and I) had at the end of the PyOhio talk. I committed the speaker’s cardinal sin of not repeating (into the microphone) the questions people asked, so they’re inaudible in the video. In addition, we had some interesting conversations among multiple people that didn’t get picked up by the microphone. I don’t want them to get lost, so I summarized them here.

The most interesting thing I learned out of this conversation is that LibreOffice can open PDFs; once opened they’re like an ordinary LibreOffice document. You can edit them, save them to ODF, export to PDF, etc. Is this cool, or what?

First Question: What about Using Excel or Word?

One of the attendees jumped in to confirm that modern MS Word formats are XML-based. However, he went on to say, the XML contains a statement at the top that says something like “You cannot legally read the rest of this file”. I made a joke about not having one’s lawyer present when reading the file.

In all seriousness, I can’t find anything online that suggests that Microsoft’s XML contains a warning like that, and the few examples I looked at didn’t have any such warning. If you can shed any light on this, please do so in the comments!

We also discussed the fact that one must invoke the office app (LibreOffice or Word, Excel, etc.) in order to render the document to PDF. LibreOffice has a reputation for performing badly when invoked repeatedly for this purpose. LibreOffice 5 may have addressed some of these problems, but as of this writing it’s still pretty new so the jury is still out on how this will work in practice.

Another attendee noted that Microsoft Office can save to LibreOffice format, so if Word (or Excel) is your document-editing tool of choice, you can still use LibreOffice to render it to PDF. That’s really useful if you’re doing the rendering on a BSD/Linux server.

Question 2: What about Scraping PDFs?

The questioner noted that scraping a semi-complex PDF is very painful. It’d be ideal, he said, to be able to take a complex form like the 1040 and extract key value pairs of the question and answer. Is the story getting better for scraping PDFs?

My answer was that for the little experience I have with scraping PDFs, I’ve used PDFMiner, and the attendee said he was using the same.

Someone else chimed in that it’s a great use case for [Amazon’s] Mechanical Turk; in his case he was dealing with old faxes that had been scanned.

Question 3: Helper Libraries

Matt Wilson asked if it would make sense to begin building helper libraries to simplify common tasks related to manipulating LibreOffice XML. My answer was that I wasn’t sure since each project has very specific needs. Someone else suggested that one would have to start learning the spec in order to begin creating abstractions.

In the YouTube comments, Paul Hoffman called our attention to OdfPy, a “thin abstraction over direct XML access”. It looks quite interesting.

Comment 1: Back to Scraping

One of the attendees commented that he had used Jython and PDFBox for PDF scraping. “It took a lot to get started, but once I started to figure out my way around it, it was a pretty good tool and it moved pretty speedily as compared to some of the other tools I used.” He went on to say that it was pretty complete and that it worked very well.

Question 4: About XML Parsing

The question was what I used to parse the XML, and my answer was that I used ElementTree from the standard library. Your favorite XML parsing library will work just fine.

Question 5: Protecting Bookmarks

The question was whether or not I did anything special to protect the bookmarks in the document. My answer was that I didn’t. (I’m not even sure it’s possible.) If you go through multiple rounds of editing with your client, those invisible bookmarks are inevitably going to get moved or deleted, so expect a little maintenance work related to that.

Comment 2: Weasyprint

One of the attendees commented that Weasyprint is a useful HTML/CSS to PDF converter. My observation was that tools in this class (HTML/CSS to PDF converters) are not as precise as either of the methods I outlined in this talk, but if you don’t need precision they’re probably a nice way to go.

Question 6: unoconv in a Web Server

Can one use unoconv in a Web server? My answer was that it’s possible, but it’s not practical to use it in-process. For me, it worked to do so in a demo of an intranet application, but that’s about as far as you want to go with it. It’s much more practical to use a distributed processing application (Celery, for example).

One of the attendees concurred that it makes sense to spin it off into a separate process, but “unoconv inexplicably crashes when it feels like it”.

Comment 3: Converting from Word

The initial comment was that pandoc might help with converting from Word to LibreOffice. This started a conversation which I’d summarize this way:

  • LibreOffice can open MS Office docs, so use that instead of pandoc and save as LibreOffice
  • If you open MS Office documents with LibreOffice, double check the formatting because it doesn’t always survive the transition
  • LibreOffice can open PDFs for editing.

Thanks for PyData Carolinas

My PyData Pass

Thanks to all who made PyData Carolinas 2016 a success! I had conversations about eating well while on the road, conveyor belts, and a Fortran algorithm to calculate the interaction of charged particles. Great stuff!

My talk was on getting Python to talk to compiled languages; specifically C, Fortran, and C++.

I’m grateful to the PyData A/V team who did a great job capturing the presentations. Thanks to them, you can see my talk on YouTube at https://www.youtube.com/watch?v=aUSokzzsEko

Creating PDF Documents Using LibreOffice and Python, Part 2

This is part 2 of a 4-part series on creating PDFs using LibreOffice. You should read part 1 if you haven’t already. This series is a supplement to a talk I gave at PyOhio 2016.

Here in part 2, I compare and contrast the two approaches I outlined in part 1 — the obvious approach of using ReportLab, and the LibreOffice approach that I think is underappreciated. Both approaches can be good in the right situation, but neither is better than the other all the time. In some cases, the difference is dramatic.

Without further ado, here are the 10 categories in which I want to compare these two, and how each approach stacks up. (The compare/contrast portion of my PyOhio talk starts at the 17 minute mark.)

1. Cross-Platform?

Both ReportLab and the LibreOffice technique run on Windows, Linux, OS X, and BSD. I haven’t researched mobile operating systems like iOS and Android, but you’re not likely to want to construct PDFs on a mobile device.

2. Python 2/3 Support?

Both approaches can be used with Python 2 and 3.

3. FOSS?

ReportLab is under a BSD License.

LibreOffice is under the MPL v2.0, which is a BSD/GPL hybrid. However, the details don’t matter much since you’re not going to use the source code anyway.

4. Repairability?

By repairability, I’m referring to the ease with which you can fix things that don’t behave the way you want them to.

ReportLab scores very well here, because it’s pure Python and its BSD license gives you a lot of flexibility. You can read, debug, patch, and copy the code. When debugging, you can step directly from your code into ReportLab code. If you patch ReportLab, it’s easy to roll out a patched version to your servers using pip.

LibreOffice, on the other hand, is a large office suite written in C++ (and maybe Java?). It’s orders of magnitude more complicated. Think of it as an unrepairable black box.

5. Power?

ReportLab includes lots of cool stuff out of the box, like bar, pie, line, and other kinds of charts, a table of contents generator, and probably lots of other things I don’t know about.

It’s also extensible, so if you want something it doesn’t have (like a list of figures generator) you can write it or search online to see if someone else has already done it.

LibreOffice has even more to offer, though. It’s an entire office suite, after all! It not only handles all of the normal text document things (like headings, foot/endnotes, autonumbering lists, etc.), you can do more sophisticated things like embedding spreadsheets in documents. 1001 Creative Ways to Use an Office Suite could be a blog post all its own (or 1001 of them!).

6. Scalability?

ReportLab is just Python so one can run multiple concurrent threads or processes just as with any other Python code.

Unfortunately, LibreOffice does not scale. It’s not possible to run multiple LibreOffice processes simultaneously on one machine. For probably 99.99% of users, this isn’t a concern, but it can be a problem for automation. It means you have to be willing to create your PDFs synchronously.

7. Speed?

Warning: Guess approaching!

My hunch is that ReportLab is faster, maybe by a lot. But that’s backed by no data whatsoever. Benchmarking would be time-consuming. It would require inventing a variety of relatively complex PDFs and generating them using both methods. And that still might not tell you much about your use case.

In the grand tradition of arguing on the Internet, I’m not going to let my ignorance or lack of data keep me from having an opinion. But understand that it’s a guess, and take it with a huge grain of salt.

8. Experimentation?

You’re probably not producing this PDF for yourself, but for someone else. That someone might be an immediate co-worker, another department in your company, or a customer that’s in a completely different company. Experimenting with the output PDF is an important part of the process because it usually takes many tweaks to get the PDF to look the way your client wants.

Just as with developing software, the end result will be a moving target as ideas evolve. And also like software development, you want your tools and process to add as little friction as possible to the evolution.

With ReportLab, experimentation can be time-consuming. If you have a complex PDF, you’ll have a non-trivial amount of Python/ReportLab code to generate it. As code gets more complex, it gets harder to change. That’s not specific to ReportLab, it’s just a general software development principle. So when your client wants to change, say, how the page footer is formatted, or how figures are numbered, or the document font, the usual difficulties of maintaining code apply.

With LibreOffice, changing the document is extremely easy because you’re using a tool built expressly for that purpose. It’s straightforward, and you can immediately see the results of your changes.

9. Complexity?

By complexity, I’m referring to the complexity of one’s code relative to the complexity of the PDF you’re trying to create.

With ReportLab, the relationship is roughly linear. If you have a complex PDF, you’ll have reasonably complex Python code to create it.

With LibreOffice, the relationship is non-linear. Deleting and duplicating XML elements and changing text are easy. Creating new elements is difficult. For instance, our trivial PDF example contained two paragraphs and a table. As I demonstrate in part 1 and in my talk, it’s easy to add, delete, and change table rows, but if you asked me to add an image to that document, I would be stuck because there’s no image in the XML for me to copy.

Obviously, I could add an image to the document and then see how that’s expressed in the XML, but that only works if I know in advance that I’m going to need an image.
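
To illustrate the "easy" half of that claim, here is a minimal sketch of duplicating a table row with ElementTree. The XML fragment is a drastically simplified stand-in for what a real content.xml contains (real rows carry style attributes and more), and the account names and balances are invented.

```python
import copy
import xml.etree.ElementTree as ET

# Namespace URIs used by ODF documents.
TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Simplified stand-in for a table inside content.xml.
FRAGMENT = """
<table:table xmlns:table="{table}" xmlns:text="{text}">
  <table:table-row>
    <table:table-cell><text:p>Checking</text:p></table:table-cell>
    <table:table-cell><text:p>100.00</text:p></table:table-cell>
  </table:table-row>
</table:table>
""".format(table=TABLE_NS, text=TEXT_NS).strip()

ROW = "{%s}table-row" % TABLE_NS
PARAGRAPH = "{%s}p" % TEXT_NS

table = ET.fromstring(FRAGMENT)
template_row = table.find(ROW)

# Copy the existing row once per extra record, then change only the text.
for name, balance in [("Savings", "2500.00"), ("Brokerage", "310.75")]:
    new_row = copy.deepcopy(template_row)
    cells = new_row.findall(".//" + PARAGRAPH)
    cells[0].text = name
    cells[1].text = balance
    table.append(new_row)

rows = [[p.text for p in row.iter(PARAGRAPH)] for row in table.findall(ROW)]
```

Nothing here had to know how a row is built; it only copied one that already existed. That is also why an element with no example in the document is the hard case.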

10. Strengths

ReportLab is a safe choice. It does one thing and it does it well. The fact that it’s extensible means you can always get it to do what you want (although you might have to write more code than you planned). It’s the well-traveled path, so you’ll be able to find fellow travelers (and their tutorials and advice). It can handle extremely varied output.

Using the LibreOffice method is best when there’s a high ratio of static to dynamic content. Think about the extreme example of a 900-page PDF in which there’s only one paragraph of dynamic content. You would have to write very little code to populate that one paragraph, whereas with ReportLab you’d have to write code to generate all 900 pages, even though they never change.

The LibreOffice method requires less code — maybe a lot less, depending on your situation. The tradeoff is that you have to do more document construction work, but to me that’s still a win for two reasons. First, you get to use a tool built expressly for that purpose. Second, it’s easier (and cheaper) to find LibreOffice/document editing skills than Python/software development skills. Your client might even be able to build most of the document, which will save them money and give them control over and investment in the outcome. That makes for a happy client.

What’s Next

In my next post in this series, I’ll discuss some of the questions asked at my PyOhio talk, and in the fourth and final post I’ll present some useful code snippets. Stay tuned!

♡s to PyOhio

  • To conference volunteers too numerous to mention
  • To Jason, Eric, and Jan for their hospitality which helped me to feel at home away from home
  • To Oscar the AirBnB cat for headbutting me affectionately and repeatedly in the face at 5:45 AM only on the morning for which I had my alarm set for 6:15. (He let me sleep the other mornings.)

I hope to see y’all at PyData Carolinas 2016!

Creating PDF Documents Using LibreOffice and Python

This post is a supplement to a talk I’m giving at PyOhio about using Python to create PDFs “the lazy way”. It’s the first of a series on this subject which is a bit too big for just one blog post.

In the talk and in this series, I advocate a technique for creating PDFs that uses LibreOffice (or OpenOffice) to do most of the hard work, and I contrast that to the common solution of using ReportLab (or a library like it).

This technique offers some unique benefits, and in some common use cases—most importantly, perhaps in your case—it can be much more efficient than the alternative. I’ll compare and contrast the two in another blog post. In this post I just want to describe the technique I’m advocating.

Background

Creating PDFs programmatically is a task most Python programmers encounter at least once.

When I talk about creating PDFs programmatically, I’m thinking of the situation where one wants to create a lot of PDFs that follow a template. For instance, you might work for a bank that wants to produce end-of-month account statements for each of its 100,000 customers. The cover page will always contain the bank’s logo, some legal boilerplate, the month and year, and a bland stock photo of happy customers doing something unrelated to banking, like this one.

The first page after that will be a summary of the customer’s accounts, and then subsequent pages contain information about the account—a list of transactions, changes in values of stocks, etc.

Each PDF will be different, but similar because they follow a template. Computers are great for this sort of thing, and this technique is particularly good at it. As I said above, I’ll tell you why I think it’s good in another blog post. For now, I want to stop talking mysteriously about “the technique” and actually describe it.

Outline

Here’s a concise outline. Don’t worry if you don’t understand all the steps; they’re fleshed out below.

  1. Create a LibreOffice document that will serve as a template for the documents you want to create. (Note: I mean “template” in the general sense of a form or skeleton, not a LibreOffice .ott template file.)
  2. Unzip that document.
  3. Manipulate the document’s XML using standard Python libraries.
  4. Zip the modified files into a new LibreOffice .odt file.
  5. Ask LibreOffice to export the document in PDF format.

Let’s go through these step by step. I encourage you to follow along. We’re not going to write a single line of Python code, just explore a process. Writing Python would come later when you automate steps 2 – 5.

1. Create a LibreOffice Document to Use as a Template

This step will probably require the most work.

We usually know in advance at least some of the content we want. For instance, in the bank example above, we know what the cover page will look like, where each section should appear in the document, and how a section (e.g. a list of account transactions) should be formatted, even if we don’t know in advance the exact values of each transaction.

Your job during this step is to create a LibreOffice document that will serve as a skeleton (or template, or form) for your final documents. For content that you don’t know (words in paragraphs, images, bullet points in a list, table contents, etc.), leave placeholders.

If you want to play along with this blog post, here’s the LibreOffice document that I’ll use in the examples below.

2. Unzip the Document

This is a trick you might not know—LibreOffice documents are ZIP files. (This is true of all documents that follow the Open Document Format for Office Applications). You can unzip them with command line tools, or with the zipfile module in Python’s standard library.

On my Mac, the following command unzips the document into the directory unzipped.

unzip practice.odt -d unzipped

After unzipping, you’ll see a bunch of files like this:

drwxr-xr-x  11 philip staff    374 Jul 27 16:43 Configurations2/
drwxr-xr-x   3 philip staff    102 Jul 27 16:43 META-INF/
drwxr-xr-x   3 philip staff    102 Jul 27 16:43 Thumbnails/
-rw-r--r--   1 philip staff   8988 Jul 27 16:44 content.xml
-rw-r--r--   1 philip staff    899 Jul 27  2016 manifest.rdf
-rw-r--r--   1 philip staff   1005 Jul 27  2016 meta.xml
-rw-r--r--   1 philip staff     39 Jul 27  2016 mimetype
-rw-r--r--   1 philip staff  10319 Jul 27  2016 settings.xml
-rw-r--r--   1 philip staff  14903 Jul 27  2016 styles.xml

Of the files above, you’re only likely to be interested in content.xml. (You might also want to explore styles.xml, but I consider that an advanced topic, and I’m trying to maintain a rigorous standard of laziness.)
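When you get around to automating this step, the zipfile module mentioned above can stand in for the command line tool. Here is a minimal sketch; the function name unzip_odt is hypothetical, and the file and directory names simply mirror the command shown earlier.

```python
import zipfile


def unzip_odt(odt_path, target_dir):
    """Extract every member of an .odt file into target_dir.

    Returns the list of member names so the caller can see what was extracted.
    """
    with zipfile.ZipFile(odt_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Roughly the same effect as: unzip practice.odt -d unzipped
# unzip_odt("practice.odt", "unzipped")
```

Returning the member names is just a convenience for exploring; the only file you are likely to open afterwards is content.xml.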

3. Manipulate the XML

The XML in content.xml is governed by the 846-page Open Document Format for Office Applications. You might think I’m going to suggest you read it, or at least familiarize yourself with it.

Heck no! That’s not the lazy way. I’m very pleased that it’s an ISO standard, but I don’t want to learn it if I can save time and effort by not doing so, and you shouldn’t have to either.

Instead I suggest you use what I use: common sense and intuition, which can get you surprisingly far. For instance, if you see this in the XML:

<text:p text:style-name="P4">
 The fox jumped over the dog.
</text:p>

You don’t have to read 846 pages of documentation to guess that you can change it to this—

<text:p text:style-name="P4">
 The quick brown fox jumped over the lazy dog.
</text:p>

Or even this—

<text:p text:style-name="P4">
 No one expects the Spanish Inquisition!
</text:p>

Are you starting to see some possibilities?

If you’re doing this programmatically, you can use LibreOffice bookmarks to demarcate the text you want to replace. Bookmarks are visible in the XML and trivial to locate using XPath. You can see this in my example document where I’ve surrounded two blank space characters with bookmarks where adjectives might go to describe the fox and dog.

<text:p text:style-name="P1">
    The
    <text:bookmark-start text:name="fox_type_placeholder"/>
    <text:s/>
    <text:bookmark-end text:name="fox_type_placeholder"/>
    <text:s/>
    fox jumped over the
    <text:bookmark-start text:name="dog_type_placeholder"/>
    <text:s/>
    <text:bookmark-end text:name="dog_type_placeholder"/>
    <text:s/>
    dog.
</text:p>

What do you think will happen if you replace the first occurrence of <text:s/> with quick brown?
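For later automation, the bookmark replacement could be sketched with the standard library's xml.etree.ElementTree. This is only an illustration under a few assumptions: the helper name fill_bookmark is made up, the sample string is a trimmed-down version of the paragraph above with the text namespace declared inline, and the bookmark start and end are assumed to be siblings under the same parent. ElementTree has no parent pointers, so the sketch walks the tree instead of using an XPath query (a library such as lxml would make that part shorter).

```python
import xml.etree.ElementTree as ET

# Namespace URI that the "text:" prefix maps to in OpenDocument files.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
NS = {"text": TEXT_NS}

# A trimmed-down stand-in for the paragraph in content.xml.
SAMPLE = (
    '<text:p xmlns:text="' + TEXT_NS + '" text:style-name="P1">The '
    '<text:bookmark-start text:name="fox_type_placeholder"/><text:s/>'
    '<text:bookmark-end text:name="fox_type_placeholder"/><text:s/>'
    'fox jumped over the '
    '<text:bookmark-start text:name="dog_type_placeholder"/><text:s/>'
    '<text:bookmark-end text:name="dog_type_placeholder"/><text:s/>'
    'dog.</text:p>'
)


def fill_bookmark(root, name, new_text):
    """Replace whatever sits between a bookmark's start and end with new_text.

    Returns True if the bookmark was found, False otherwise. Assumes the
    start and end elements share the same parent.
    """
    start_tag = "{%s}bookmark-start" % TEXT_NS
    end_tag = "{%s}bookmark-end" % TEXT_NS
    name_attr = "{%s}name" % TEXT_NS
    for parent in root.iter():
        children = list(parent)
        starts = [i for i, c in enumerate(children)
                  if c.tag == start_tag and c.get(name_attr) == name]
        ends = [i for i, c in enumerate(children)
                if c.tag == end_tag and c.get(name_attr) == name]
        if starts and ends and starts[0] < ends[0]:
            # Drop the placeholder elements between start and end ...
            for sibling in children[starts[0] + 1:ends[0]]:
                parent.remove(sibling)
            # ... and put the new text right after the start element.
            children[starts[0]].tail = new_text
            return True
    return False


root = ET.fromstring(SAMPLE)
fill_bookmark(root, "fox_type_placeholder", "quick brown")
fill_bookmark(root, "dog_type_placeholder", "lazy")
```

The <text:s/> that follows each bookmark end is left alone, so the space between the adjective and the noun survives.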

Text isn’t the only thing you can change.

If you have a list item with bullets and you want another bullet or three, you can just duplicate existing bullets. For instance, if you start with this—

<text:list xml:id="list3413943092755896283" text:style-name="L1">
    <text:list-item>
        <text:p text:style-name="P2">First</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Second</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Third</text:p>
    </text:list-item>
</text:list>

You can turn it into this—

<text:list xml:id="list3413943092755896283" text:style-name="L1">
    <text:list-item>
        <text:p text:style-name="P2">First</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Second</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Third</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Fourth</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Fifth</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Sixth</text:p>
    </text:list-item>
</text:list>

Note that the text:list element itself has what looks like a unique id associated with it. This is a yellow flag that indicates to me that if you want to copy the entire list, you’ll need to give it a new unique id, and hope that LibreOffice doesn’t reference that id in some other file.

I’m sure the details are somewhere in that 846-page document. You can read that document, or you can also just try your change and see what happens. The worst case scenario is that LibreOffice will tell you that your document is corrupted and you’ll have to go back and explore some more.
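The bullet duplication shown above is also easy to automate. The following is a rough sketch, not a definitive recipe: append_list_items is a hypothetical helper, the sample XML is the three-item list from above with the namespace declared inline, and it assumes each list item holds a single plain-text paragraph.

```python
import copy
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
NS = {"text": TEXT_NS}

SAMPLE_LIST = (
    '<text:list xmlns:text="' + TEXT_NS + '" '
    'xml:id="list3413943092755896283" text:style-name="L1">'
    '<text:list-item><text:p text:style-name="P2">First</text:p></text:list-item>'
    '<text:list-item><text:p text:style-name="P2">Second</text:p></text:list-item>'
    '<text:list-item><text:p text:style-name="P2">Third</text:p></text:list-item>'
    '</text:list>'
)


def append_list_items(list_element, labels):
    """Clone the last existing list item once per label and append the clones.

    Cloning keeps the paragraph style (P2 here) without having to know
    anything about it. Assumes each item contains one text-only paragraph.
    """
    template = list_element.findall("text:list-item", NS)[-1]
    for label in labels:
        new_item = copy.deepcopy(template)
        new_item.find("text:p", NS).text = label
        list_element.append(new_item)


bullets = ET.fromstring(SAMPLE_LIST)
append_list_items(bullets, ["Fourth", "Fifth", "Sixth"])
```

Only the list items are copied, so the id on the text:list element itself is untouched and the yellow flag mentioned above doesn't come into play.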

4. Zip a New LibreOffice File

Once you’ve made the changes you want, it’s time to reverse step 2, using your modified content.xml.

Here’s the command that works on my Mac—

cd unzipped && zip -r ../my_new_file.odt * && cd ..

Note that this command doesn’t respect the OpenDocument specification, which has rules regarding how the mimetype file should be represented in the zip file (as the first file in the archive, and uncompressed, per OpenDocument v1.2 part 3, § 3.3 MIME Media Type). It works for me, maybe because LibreOffice is forgiving. It’s not something you should rely on, however. In another post, I’ll present some Python code that constructs the ZIP file according to the standard.
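In the meantime, here is one way the two rules just mentioned (mimetype first, and stored uncompressed) might be honored with the zipfile module. Treat it as a sketch under assumptions rather than a guarantee of full conformance: zip_odt is a hypothetical name, and the output file is assumed to live outside the directory being zipped.

```python
import os
import zipfile


def zip_odt(source_dir, odt_path):
    """Zip source_dir into odt_path with mimetype first and uncompressed."""
    with zipfile.ZipFile(odt_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # The mimetype file goes in first, stored without compression.
        archive.write(
            os.path.join(source_dir, "mimetype"),
            "mimetype",
            compress_type=zipfile.ZIP_STORED,
        )
        # Everything else follows, compressed as usual.
        for folder, _subfolders, filenames in os.walk(source_dir):
            for filename in filenames:
                full_path = os.path.join(folder, filename)
                arcname = os.path.relpath(full_path, source_dir)
                if arcname == "mimetype":
                    continue  # already written above
                archive.write(full_path, arcname)


# Counterpart of: cd unzipped && zip -r ../my_new_file.odt * && cd ..
# zip_odt("unzipped", "my_new_file.odt")
```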

5. Export to PDF via LibreOffice

If you’re just experimenting, you can just open the document in LibreOffice manually and then use the “File/Export as PDF…” menu item. (Opening manually is also a good test that you didn’t do anything objectionable to the XML.)

Programmatically, I recommend using unoconv for converting your finished document to PDF.
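Calling unoconv from Python can be as simple as shelling out with subprocess. The sketch below assumes unoconv is installed and on your PATH, and that -f (output format) and -o (output file) behave as described in its help text; both function names are hypothetical.

```python
import subprocess


def build_unoconv_command(odt_path, pdf_path):
    """Return the argument list for converting odt_path to a PDF at pdf_path."""
    return ["unoconv", "-f", "pdf", "-o", pdf_path, odt_path]


def export_to_pdf(odt_path, pdf_path):
    """Run unoconv and raise CalledProcessError if the conversion fails."""
    subprocess.run(build_unoconv_command(odt_path, pdf_path), check=True)


# export_to_pdf("my_new_file.odt", "my_new_file.pdf")
```

Keeping the command construction separate from the call makes it easy to print or log exactly what will be run before LibreOffice gets involved.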

Review

So there you have it! If you feel underwhelmed, keep in mind that this was only a proof of concept. In some future posts, I’ll explain why I think this method is often an excellent choice (and also when it isn’t).

Photo Credit

Thanks to the National Cancer Institute for making many photos available for free, including the one used in this blog post which was taken by Rhoda Baer.