Creating PDF Documents Using LibreOffice and Python, Part 2

This is part 2 of a 4-part series on creating PDFs using LibreOffice. You should read part 1 if you haven’t already. This series is a supplement to a talk I gave at PyOhio 2016.

Here in part 2, I compare and contrast the two approaches I outlined in part 1 — the obvious approach of using ReportLab, and the LibreOffice approach that I think is underappreciated. Both approaches can be good in the right situation, but neither is better than the other all the time. In some cases, the difference is dramatic.

Without further ado, here are the 10 categories in which I want to compare these two, and how each approach stacks up. (The compare/contrast portion of my PyOhio talk starts at the 17-minute mark.)

1. Cross-Platform?

Both ReportLab and the LibreOffice technique run on Windows, Linux, OS X, and BSD. I haven’t researched mobile operating systems like iOS and Android, but you’re not likely to want to construct PDFs on a mobile device.

2. Python 2/3 Support?

Both approaches can be used with Python 2 and 3.

3. FOSS?

ReportLab is under a BSD License.

LibreOffice is under the MPL v2.0, which is a BSD/GPL hybrid. However, the details don’t matter much since you’re not going to use the source code anyway.

4. Repairability?

By repairability, I’m referring to the ease with which you can fix things that don’t behave the way you want them to.

ReportLab scores very well here, because it’s pure Python and its BSD license gives you a lot of flexibility. You can read, debug, patch, and copy the code. When debugging, you can step directly from your code into ReportLab code. If you patch ReportLab, it’s easy to roll out a patched version to your servers using pip.

LibreOffice, on the other hand, is a large office suite written in C++ (and maybe Java?). It’s orders of magnitude more complicated. Think of it as an unrepairable black box.

5. Power?

ReportLab includes lots of cool stuff out of the box, like bar, pie, line, and other kinds of charts, a table of contents generator, and probably lots of other things I don’t know about.

It’s also extensible, so if you want something it doesn’t have (like a list of figures generator) you can write it or search online to see if someone else has already done it.

LibreOffice has even more to offer, though. It’s an entire office suite, after all! Not only does it handle all of the normal text document things (like headings, foot/endnotes, autonumbering lists, etc.), it also lets you do more sophisticated things like embedding spreadsheets in documents. 1001 Creative Ways to Use an Office Suite could be a blog post all its own (or 1001 of them!).

6. Scalability?

ReportLab is just Python, so you can run multiple concurrent threads or processes just as with any other Python code.

Unfortunately, LibreOffice does not scale. It’s not possible to run multiple LibreOffice processes simultaneously on one machine. For probably 99.99% of users, this isn’t a concern, but it can be a problem for automation. It means you have to be willing to create your PDFs synchronously.
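
One way to live with that constraint is to let the rest of your code run concurrently and funnel only the conversion step through a lock. The following is a minimal sketch, not a tested production setup. It assumes LibreOffice's `soffice` executable is on the PATH and that you convert with its `--headless --convert-to pdf` command-line options; the function names and the injectable `run` parameter are hypothetical, invented for this illustration.

```python
import subprocess
import threading
from concurrent.futures import ThreadPoolExecutor

# One lock shared by every worker, so only one conversion runs at a time.
_soffice_lock = threading.Lock()


def build_convert_command(odt_path, out_dir):
    """Return the argument list for a headless ODT-to-PDF conversion."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", out_dir, odt_path]


def convert_serialized(odt_path, out_dir, run=subprocess.run):
    """Convert one document while holding the shared lock.

    `run` is injectable (an assumption of this sketch) so the locking
    logic can be exercised without LibreOffice installed.
    """
    with _soffice_lock:
        return run(build_convert_command(odt_path, out_dir), check=True)


def convert_all(odt_paths, out_dir, run=subprocess.run, workers=4):
    """Submit conversions from a thread pool; the lock serializes them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(
            lambda path: convert_serialized(path, out_dir, run=run),
            odt_paths))
```

Worker threads can still prepare documents in parallel; they simply queue up at the lock when it is time to convert.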

7. Speed?

Warning: Guess approaching!

My hunch is that ReportLab is faster, maybe by a lot. But that’s backed by no data whatsoever. Benchmarking would be time-consuming. It would require inventing a variety of relatively complex PDFs and generating them using both methods. And that still might not tell you much about your use case.

In the grand tradition of arguing on the Internet, I’m not going to let my ignorance or lack of data keep me from having an opinion. But understand that it’s a guess, and take it with a huge grain of salt.
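
If you'd rather measure than guess, a small timing harness is enough to get started. This sketch assumes you wrap each approach in a zero-argument function that builds one PDF; the helper names are hypothetical, and the code produces no benchmark numbers of its own.

```python
import statistics
import time


def time_generator(make_pdf, repeats=5):
    """Call `make_pdf` `repeats` times; return the median wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        make_pdf()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(generators, repeats=5):
    """Map each generator's label to its median runtime in seconds."""
    return {label: time_generator(fn, repeats)
            for label, fn in generators.items()}
```

You would call it as `compare({"reportlab": make_with_reportlab, "libreoffice": make_with_libreoffice})`, where both functions are placeholders for your own code. The results only describe the documents you feed it, so they may still not say much about a different use case.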

8. Experimentation?

You’re probably not producing this PDF for yourself, but for someone else. That someone might be an immediate co-worker, another department in your company, or a customer that’s in a completely different company. Experimenting with the output PDF is an important part of the process because it usually takes many tweaks to get the PDF to look the way your client wants.

Just as with developing software, the end result will be a moving target as ideas evolve. And also like software development, you want your tools and process to add as little friction as possible to the evolution.

With ReportLab, experimentation can be time-consuming. If you have a complex PDF, you’ll have a non-trivial amount of Python/ReportLab code to generate it. As code gets more complex, it gets harder to change. That’s not specific to ReportLab; it’s just a general software development principle. So when your client wants to change, say, how the page footer is formatted, or how figures are numbered, or the document font, the usual difficulties of maintaining code apply.

With LibreOffice, changing the document is extremely easy because you’re using a tool built expressly for that purpose. It’s straightforward, and you can immediately see the results of your changes.

9. Complexity?

By complexity, I’m referring to the complexity of one’s code relative to the complexity of the PDF you’re trying to create.

With ReportLab, the relationship is roughly linear. If you have a complex PDF, you’ll have reasonably complex Python code to create it.

With LibreOffice, the relationship is non-linear. Deleting and duplicating XML elements and changing text are easy. Creating new elements is difficult. For instance, our trivial PDF example contained two paragraphs and a table. As I demonstrate in part 1 and in my talk, it’s easy to add, delete, and change table rows, but if you asked me to add an image to that document, I would be stuck because there’s no image in the XML for me to copy.

Obviously, I could add an image to the document and then see how that’s expressed in the XML, but that only works if I know in advance that I’m going to need an image.
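
To make the "copy what's already there" idea concrete, here is a minimal sketch that clones an existing table row and changes its text. The XML is a tiny stand-in I made up for the table portion of a document's content.xml (real files carry many more attributes and styles), and `append_row` is a hypothetical helper, not code from part 1.

```python
import copy
import xml.etree.ElementTree as ET

# Namespaces used by OpenDocument for tables and text.
TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
NS = {"table": TABLE_NS, "text": TEXT_NS}

# Simplified stand-in for a table in content.xml (illustration only).
SAMPLE = f"""
<table:table xmlns:table="{TABLE_NS}" xmlns:text="{TEXT_NS}">
  <table:table-row>
    <table:table-cell><text:p>Widget</text:p></table:table-cell>
    <table:table-cell><text:p>1</text:p></table:table-cell>
  </table:table-row>
</table:table>
""".strip()


def append_row(table, values):
    """Clone the table's last row and overwrite the text of each cell."""
    template = table.findall("table:table-row", NS)[-1]
    new_row = copy.deepcopy(template)
    paragraphs = new_row.findall("table:table-cell/text:p", NS)
    for paragraph, value in zip(paragraphs, values):
        paragraph.text = value
    table.append(new_row)
    return new_row


table = ET.fromstring(SAMPLE)
append_row(table, ["Gadget", "2"])
```

The clone inherits whatever structure and styling the template row had, which is why this works for rows and fails for an image that was never in the document.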

10. Strengths

ReportLab is a safe choice. It does one thing and it does it well. The fact that it’s extensible means you can always get it to do what you want (although you might have to write more code than you planned). It’s the well-traveled path, so you’ll be able to find fellow travelers (and their tutorials and advice). It can handle extremely varied output.

Using the LibreOffice method is best when there’s a high ratio of static to dynamic content. Think about the extreme example of a 900-page PDF in which there’s only one paragraph of dynamic content. You would have to write very little code to populate that one paragraph, whereas with ReportLab you’d have to write code to generate all 900 pages, even though they never change.
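
For that one-dynamic-paragraph case, the code can be as small as a text substitution inside the document's content.xml. This is a sketch under assumptions: the `{{NAME}}`-style placeholder is a convention I invented for the example, `fill_placeholder` is a hypothetical helper, and it relies on the placeholder appearing as one unbroken string in the XML.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def fill_placeholder(odt_bytes, placeholder, value):
    """Return a copy of an ODT archive with `placeholder` replaced in content.xml."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as src, \
            zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "content.xml":
                text = data.decode("utf-8")
                # Escape &, <, and > so the value can't break the XML.
                data = text.replace(placeholder, escape(value)).encode("utf-8")
            # Keep each entry's order and compression; the "mimetype"
            # entry is expected to stay first and uncompressed.
            dst.writestr(info.filename, data, compress_type=info.compress_type)
    return out.getvalue()
```

The other 899 pages pass through untouched. One caveat: an editor may split text you typed across several XML elements, so check that the placeholder survived intact in content.xml before relying on a plain string replace.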

The LibreOffice method requires less code — maybe a lot less, depending on your situation. The tradeoff is that you have to do more document construction work, but to me that’s still a win for two reasons. First, you get to use a tool built expressly for that purpose. Second, it’s easier (and cheaper) to find LibreOffice/document editing skills than Python/software development skills. Your client might even be able to build most of the document, which will save them money and give them control over and investment in the outcome. That makes for a happy client.

What’s Next

In my next post in this series, I’ll discuss some of the questions asked at my PyOhio talk, and in the fourth and final post I’ll present some useful code snippets. Stay tuned!

 

♡s to PyOhio

  • To conference volunteers too numerous to mention
  • To Jason, Eric, and Jan for their hospitality which helped me to feel at home away from home
  • To Oscar the AirBnB cat for headbutting me affectionately and repeatedly in the face at 5:45 AM only on the morning for which I had my alarm set for 6:15. (He let me sleep the other mornings.)

I hope to see y’all at PyData Carolinas 2016!
