Creating PDF Documents Using LibreOffice and Python

This post is a supplement to a talk I’m giving at PyOhio about using Python to create PDFs “the lazy way”. It’s the first of a series on this subject which is a bit too big for just one blog post.

In the talk and in this series, I advocate a technique for creating PDFs that uses LibreOffice (or OpenOffice) to do most of the hard work, and I contrast that to the common solution of using ReportLab (or a library like it).

This technique offers some unique benefits, and in some common use cases—most importantly, perhaps in your case—it can be much more efficient than the alternative. I’ll compare and contrast the two in another blog post. In this post I just want to describe the technique I’m advocating.

Background

Creating PDFs programmatically is a task most Python programmers encounter at least once.

When I talk about creating PDFs programmatically, I’m thinking of the situation where one wants to create a lot of PDFs that follow a template. For instance, you might work for a bank that wants to produce end-of-month account statements for each of its 100,000 customers. The cover page will always contain the bank’s logo, some legal boilerplate, the month and year, and a bland stock photo of happy customers doing something unrelated to banking, like this one.

The first page after that will be a summary of the customer’s accounts, and then subsequent pages contain information about the account—a list of transactions, changes in values of stocks, etc.

Each PDF will be different, but similar because they follow a template. Computers are great for this sort of thing, and this technique is particularly good at it. As I said above, I’ll tell you why I think it’s good in another blog post. For now, I want to stop talking mysteriously about “the technique” and actually describe it.

Outline

Here’s a concise outline. Don’t worry if you don’t understand all the steps; they’re fleshed out below.

  1. Create a LibreOffice document that will serve as a template for the documents you want to create. (Note: I mean “template” in the general sense of a form or skeleton, not a LibreOffice .ott template file.)
  2. Unzip that document.
  3. Manipulate the document’s XML using standard Python libraries.
  4. Zip the modified files into a new LibreOffice .odt file.
  5. Ask LibreOffice to export the document in PDF format.

Let’s go through these step by step. I encourage you to follow along. We’re not going to write a single line of Python code, just explore a process. Writing Python would come later when you automate steps 2 – 5.

1. Create a LibreOffice Document to Use as a Template

This step will probably require the most work.

We usually know in advance at least some of the content we want. For instance, in the bank example above, we know what the cover page will look like, where each section should appear in the document, and how a section (e.g. a list of account transactions) should be formatted, even if we don’t know in advance the exact values of each transaction.

Your job during this step is to create a LibreOffice document that will serve as a skeleton (or template, or form) for your final documents. For content that you don’t know (words in paragraphs, images, bullet points in a list, table contents, etc.), leave placeholders.

If you want to play along with this blog post, here’s the LibreOffice document that I’ll use in the examples below.

2. Unzip the Document

This is a trick you might not know: LibreOffice documents are ZIP files. (This is true of all documents that follow the Open Document Format for Office Applications.) You can unzip them with command line tools, or with the zipfile module in Python’s standard library.

On my Mac, the following command unzips the document into the directory unzipped.

unzip practice.odt -d unzipped
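If you later automate this step, the same extraction can be done with the zipfile module. The following is a minimal sketch, not code from the talk; the function name unzip_document is a placeholder chosen for illustration.

```python
import zipfile


def unzip_document(odt_path, target_dir):
    """Extract every member of an ODF document (a ZIP archive) into target_dir."""
    with zipfile.ZipFile(odt_path) as archive:
        archive.extractall(target_dir)
        # Return the member names so the caller can see what was unpacked.
        return archive.namelist()


# Example call, assuming practice.odt sits in the current directory:
# unzip_document("practice.odt", "unzipped")
```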

After unzipping, you’ll see a bunch of files like this:

drwxr-xr-x  11 philip staff    374 Jul 27 16:43 Configurations2/
drwxr-xr-x   3 philip staff    102 Jul 27 16:43 META-INF/
drwxr-xr-x   3 philip staff    102 Jul 27 16:43 Thumbnails/
-rw-r--r--   1 philip staff   8988 Jul 27 16:44 content.xml
-rw-r--r--   1 philip staff    899 Jul 27  2016 manifest.rdf
-rw-r--r--   1 philip staff   1005 Jul 27  2016 meta.xml
-rw-r--r--   1 philip staff     39 Jul 27  2016 mimetype
-rw-r--r--   1 philip staff  10319 Jul 27  2016 settings.xml
-rw-r--r--   1 philip staff  14903 Jul 27  2016 styles.xml

Of the files above, you’re only likely to be interested in content.xml. (You might also want to explore styles.xml, but I consider that an advanced topic, and I’m trying to maintain a rigorous standard of laziness.)

3. Manipulate the XML

The XML in content.xml is governed by the 846-page Open Document Format for Office Applications. You might think I’m going to suggest you read it, or at least familiarize yourself with it.

Heck no! That’s not the lazy way. I’m very pleased that it’s an ISO standard, but I don’t want to learn it if I can save time and effort by not doing so, and you shouldn’t have to either.

Instead I suggest you use what I use: common sense and intuition, which can get you surprisingly far. For instance, if you see this in the XML:

<text:p text:style-name="P4">
 The fox jumped over the dog.
</text:p>

You don’t have to read 846 pages of documentation to guess that you can change it to this—

<text:p text:style-name="P4">
 The quick brown fox jumped over the lazy dog.
</text:p>

Or even this—

<text:p text:style-name="P4">
 No one expects the Spanish Inquisition!
</text:p>

Are you starting to see some possibilities?

If you’re doing this programmatically, you can use LibreOffice bookmarks to demarcate the text you want to replace. Bookmarks are visible in the XML and trivial to locate using XPath. You can see this in my example document where I’ve surrounded two blank space characters with bookmarks where adjectives might go to describe the fox and dog.

<text:p text:style-name="P1">
    The
    <text:bookmark-start text:name="fox_type_placeholder"/>
    <text:s/>
    <text:bookmark-end text:name="fox_type_placeholder"/>
    <text:s/>
    fox jumped over the
    <text:bookmark-start text:name="dog_type_placeholder"/>
    <text:s/>
    <text:bookmark-end text:name="dog_type_placeholder"/>
    <text:s/>
    dog.
</text:p>

What do you think will happen if you replace the first occurrence of <text:s/> with quick brown?
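One way to try that replacement from Python is sketched below, using the standard library's xml.etree.ElementTree. It is a rough illustration under some assumptions: the helper name fill_bookmark is made up, the sample string is a trimmed copy of the paragraph above with the namespace declared inline, and the placeholder is assumed to be a single <text:s/> directly after the bookmark start.

```python
import xml.etree.ElementTree as ET

# Namespace URI that the "text:" prefix is bound to in ODF content.xml files.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
NS = {"text": TEXT_NS}
NAME_ATTR = f"{{{TEXT_NS}}}name"
SPACE_TAG = f"{{{TEXT_NS}}}s"

# A trimmed copy of the example paragraph (hypothetical test input).
SAMPLE = (
    f'<text:p xmlns:text="{TEXT_NS}" text:style-name="P1">The '
    '<text:bookmark-start text:name="fox_type_placeholder"/><text:s/>'
    '<text:bookmark-end text:name="fox_type_placeholder"/><text:s/>'
    'fox jumped over the '
    '<text:bookmark-start text:name="dog_type_placeholder"/><text:s/>'
    '<text:bookmark-end text:name="dog_type_placeholder"/><text:s/>'
    'dog.</text:p>'
)


def fill_bookmark(root, name, replacement):
    """Swap the <text:s/> that follows the named bookmark start for plain text."""
    # ElementTree has no parent pointers, so look at each possible parent.
    for parent in root.iter():
        children = list(parent)
        for start in parent.findall("text:bookmark-start", NS):
            if start.get(NAME_ATTR) != name:
                continue
            following = children[children.index(start) + 1:][:1]
            if not following or following[0].tag != SPACE_TAG:
                continue
            placeholder = following[0]
            # Text that follows an element is stored in that element's .tail.
            start.tail = (start.tail or "") + replacement + (placeholder.tail or "")
            parent.remove(placeholder)
            return True
    return False


root = ET.fromstring(SAMPLE)
fill_bookmark(root, "fox_type_placeholder", "quick brown")
```

When writing a whole content.xml back out, note that ElementTree may rename namespace prefixes unless they are registered with ET.register_namespace first.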

Text isn’t the only thing you can change.

If you have a bulleted list and you want another bullet or three, you can just duplicate existing list items. For instance, if you start with this:

<text:list xml:id="list3413943092755896283" text:style-name="L1">
    <text:list-item>
        <text:p text:style-name="P2">First</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Second</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Third</text:p>
    </text:list-item>
</text:list>

You can turn it into this—

<text:list xml:id="list3413943092755896283" text:style-name="L1">
    <text:list-item>
        <text:p text:style-name="P2">First</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Second</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Third</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Fourth</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Fifth</text:p>
    </text:list-item>
    <text:list-item>
        <text:p text:style-name="P2">Sixth</text:p>
    </text:list-item>
</text:list>
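Done programmatically, the duplication could look like the sketch below. It assumes each list item holds a single plain-text paragraph, as in the example; the helper name append_list_items and the inline sample are illustrative only.

```python
import copy
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
NS = {"text": TEXT_NS}

# The three-item list from the example, with the namespace declared inline.
LIST_SAMPLE = (
    f'<text:list xmlns:text="{TEXT_NS}" xml:id="list3413943092755896283" '
    'text:style-name="L1">'
    '<text:list-item><text:p text:style-name="P2">First</text:p></text:list-item>'
    '<text:list-item><text:p text:style-name="P2">Second</text:p></text:list-item>'
    '<text:list-item><text:p text:style-name="P2">Third</text:p></text:list-item>'
    '</text:list>'
)


def append_list_items(list_element, labels):
    """Append one list item per label by cloning the last existing item."""
    template = list_element.findall("text:list-item", NS)[-1]
    for label in labels:
        new_item = copy.deepcopy(template)
        # Assumption: the item contains one paragraph with plain text only.
        new_item.find("text:p", NS).text = label
        list_element.append(new_item)


bullets = ET.fromstring(LIST_SAMPLE)
append_list_items(bullets, ["Fourth", "Fifth", "Sixth"])
```

Cloning an existing item keeps its style attributes, so the new bullets should look like the old ones.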

Note that the text:list element itself has what looks like a unique id associated with it. This is a yellow flag that indicates to me that if you want to copy the entire list, you’ll need to give it a new unique id, and hope that LibreOffice doesn’t reference that id in some other file.

I’m sure the details are somewhere in that 846-page document. You can read that document, or you can also just try your change and see what happens. The worst case scenario is that LibreOffice will tell you that your document is corrupted and you’ll have to go back and explore some more.
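As an experiment along those lines, copying a whole list and giving the copy a fresh id might be sketched as follows. This is untested against LibreOffice itself: the id format ("list" plus random hex digits) is an assumption, as is the idea that changing xml:id alone is enough. The helper name copy_list_with_new_id is hypothetical.

```python
import copy
import uuid
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
# ElementTree stores the xml:id attribute under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

ONE_ITEM_LIST = (
    f'<text:list xmlns:text="{TEXT_NS}" xml:id="list3413943092755896283" '
    'text:style-name="L1">'
    '<text:list-item><text:p text:style-name="P2">First</text:p></text:list-item>'
    '</text:list>'
)


def copy_list_with_new_id(list_element):
    """Deep-copy a text:list element and assign the copy a new xml:id."""
    clone = copy.deepcopy(list_element)
    # Assumed id scheme: any unique name starting with a letter.
    clone.set(XML_ID, "list" + uuid.uuid4().hex)
    return clone


original = ET.fromstring(ONE_ITEM_LIST)
duplicate = copy_list_with_new_id(original)
```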

4. Zip a New LibreOffice File

Once you’ve made the changes you want, it’s time to reverse step 2, using your modified content.xml.

Here’s the command that works on my Mac—

cd unzipped && zip -r ../my_new_file.odt * && cd ..

Note that this command doesn’t respect the OpenDocument specification, which has rules regarding how the mimetype file should be represented in the ZIP file (as the first file in the archive, and uncompressed, per OpenDocument v1.2 part 3, § 3.3 MIME Media Type). It works for me, maybe because LibreOffice is forgiving. It’s not something you should rely on, however. In another post, I’ll present some Python code that constructs the ZIP file according to the standard.
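In the meantime, here is a minimal sketch of how the two rules quoted above (mimetype first, and stored uncompressed) could be honored with the zipfile module. It is not the code promised for the later post, it does not check any other requirement of the specification, and the function name zip_document is a placeholder.

```python
import zipfile
from pathlib import Path


def zip_document(source_dir, odt_path):
    """Zip an unpacked ODF directory, writing mimetype first and uncompressed."""
    source_dir = Path(source_dir)
    with zipfile.ZipFile(odt_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # The mimetype member goes in first, stored without compression.
        archive.write(source_dir / "mimetype", "mimetype",
                      compress_type=zipfile.ZIP_STORED)
        # Everything else is added afterwards with normal compression.
        for path in sorted(source_dir.rglob("*")):
            relative = path.relative_to(source_dir).as_posix()
            if path.is_file() and relative != "mimetype":
                archive.write(path, relative)


# Example call, mirroring the shell command above:
# zip_document("unzipped", "my_new_file.odt")
```

The output path should live outside the source directory so the new archive is not swept up into itself.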

5. Export to PDF via LibreOffice

If you’re just experimenting, you can open the document in LibreOffice manually and then use the “File/Export as PDF…” menu item. (Opening manually is also a good test that you didn’t do anything objectionable to the XML.)

Programmatically, I recommend using unoconv for converting your finished document to PDF.
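From Python, unoconv can be driven through subprocess. The sketch below assumes unoconv is installed and on the PATH; the two function names are placeholders, and the only unoconv option used is -f pdf, which selects the output format.

```python
import subprocess


def build_unoconv_command(odt_path):
    """Return the argument list for converting one document to PDF."""
    return ["unoconv", "-f", "pdf", str(odt_path)]


def export_to_pdf(odt_path):
    """Run unoconv and raise CalledProcessError if the conversion fails."""
    subprocess.run(build_unoconv_command(odt_path), check=True)


# Example call (requires unoconv and LibreOffice to be installed):
# export_to_pdf("my_new_file.odt")
```

Keeping the command in its own function makes it easy to inspect or log before anything is executed.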

Review

So there you have it! If you feel underwhelmed, keep in mind that this was only a proof of concept. In some future posts, I’ll explain why I think this method is often an excellent choice (and also when it isn’t).

Photo Credit

Thanks to the National Cancer Institute for making many photos available for free, including the one used in this blog post which was taken by Rhoda Baer.

11 thoughts on “Creating PDF Documents Using LibreOffice and Python”

  1. Hi,

    Funny to see someone using this technique, which I currently use daily.

    I use a library called appy.pod for this, which was created by a friend of mine. Heard of it? Basically, it allows you to put some Python code into your template, in fields and comments of Writer documents. It can also be used, although with less complexity, in Calc.
    Here it is:
    http://appyframework.org/

      1. Philip, I had the same thoughts as Marc when reading your post. I really recommend that you have a look at appy.pod which I use daily on several production sites. Note that the appy framework is actually two parts (“gen” and “pod”) and you need only the pod part. Note also that appy does not yet fully support Python 3.

    1. Good question! I don’t use the Python-UNO bridge for a couple of reasons.

      First of all, one of the advantages of the method I described is that you get to do most of your document creation work in LibreOffice rather than describing a document via Python code. The latter is so much more abstract. I elaborate on this point in my talk, but here on my blog I haven’t yet written the post that describes the pros (and cons) of this method.

      Secondly, I tried Python-UNO some time ago and thought it was a noble effort, but difficult to use. It seemed complicated, wasn’t well documented, and the Java flavor of the API leaks through. I could live with the Java flavor but the other points killed my enthusiasm.

      If I’m going to describe my document in code, I’d rather do it in ReportLab. As you pointed out, there’s a cost to using the LibreOffice runtime, and ReportLab avoids that in addition to being a very capable library.

      If you have experience with Python-UNO, I’d like to hear what you think. Mine was brief and didn’t leave me wanting more.

      1. We use Python-UNO for similar reports as you describe. We encountered two main problems:

        1) The UNO API is really complicated. For that reason I created a Pythonic wrapper. Look at https://github.com/seznam/pyoo

        2) LibreOffice is not a server application and is not very stable. Use one LibreOffice process for each application process, restart it after every few requests and be prepared to kill everything and retry on occasional errors.

  2. I do not think that we are making a lot of Python-UNO calls. We are trying to minimize them because each of them takes some time. For example we are writing whole tables in one call.

    I guess that LibreOffice simply does not handle edge cases well. I saw many segfaults when I passed invalid (or out of range) arguments to UNO calls. Also, most attempts at concurrent access ended in an error.
