Here in part 3, I review the conversation we (the audience and I) had at the end of the PyOhio talk. I committed the speaker’s cardinal sin of not repeating (into the microphone) the questions people asked, so they’re inaudible in the video. In addition, we had some interesting conversations among multiple people that didn’t get picked up by the microphone. I don’t want them to get lost, so I summarized them here.
The most interesting thing I learned out of this conversation is that LibreOffice can open PDFs; once opened they’re like an ordinary LibreOffice document. You can edit them, save them to ODF, export to PDF, etc. Is this cool, or what?
First Question: What about Using Excel or Word?
One of the attendees jumped in to confirm that modern MS Word formats are XML-based. However, he went on to say, the XML contains a statement at the top that says something like “You cannot legally read the rest of this file”. I made a joke about not having one’s lawyer present when reading the file.
In all seriousness, I can’t find anything online that suggests that Microsoft’s XML contains a warning like that, and the few examples I looked at didn’t have have any such warning. If you can shed any light on this, please do so in the comments!
We also discussed the fact that one must invoke the office app (LibreOffice or Word, Excel, etc.) in order to render the document to PDF. LibreOffice has a reputation for performing badly when invoked repeatedly for this purpose. LibreOffice 5 may have addressed some of these problems, but as of this writing it’s still pretty new so the jury is still out on how this will work in practice.
Another attendee noted that Microsoft can save to LibreOffice format, so if Word (or Excel) is your document-editing tool of choice, you can still use LibreOffice to render it to PDF. That’s really useful if MS Office is your tool of choice but you’re doing rendering on a BSD/Linux server.
Question 2: What about Scraping PDFs?
The questioner noted that scraping a semi-complex PDF is very painful. It’d be ideal, he said, to be able to take a complex form like the 1040 and extract key value pairs of the question and answer. Is the story getting better for scraping PDFs?
My answer was that for the little experience I have with scraping PDFs, I’ve used PDFMiner, and the attendee said he was using the same.
Someone else chimed in that it’s a great use case for [Amazon’s] Mechanical Turk; in his case he was dealing with old faxes that had been scanned.
Question 3: Helper Libraries
Matt Wilson asked if it would make sense to begin building helper libraries to simplify common tasks related to manipulating LibreOffice XML. My answer was that I wasn’t sure since each project has very specific needs. Someone else suggested that one would have to start learning the spec in order to begin creating abstractions.
In the YouTube comments, Paul Hoffman1 called our attention to OdfPy a “thin abstraction over direct XML access”. It looks quite interesting.
Comment 1: Back to Scraping
One of the attendees commented that he had used Jython and PDFBox for PDF scraping. “It took a lot to get started, but once I started to figure out my way around it, it was a pretty good tool and it moved pretty speedily as compared to some of the other tools I used.” He went on to say that it was pretty complete and that it worked very well.
Question 4: About XML Parsing
The question was what I used to parse the XML, and my answer was that I used ElementTree from the standard library. Your favorite XML parsing library will work just fine.
Question 5: Protecting Bookmarks
The question was whether or not I did anything special to protect the bookmarks in the document. My answer was that I didn’t. (I’m not even sure it’s possible.) If you go through multiple rounds of editing with your client, those invisible bookmarks are inevitably going to get moved or deleted, so expect a little maintenance work related to that.
Comment 2: Weasyprint
One of the attendees commented that Weasyprint is a useful HTML/CSS to PDF converter. My observation was that tools in this class (HTML/CSS to PDF converters) are not as precise as either of the methods I outlined in this talk, but if you don’t need precision they’re probably a nice way to go.
Question 6: unoconv in a Web Server
Can one use unoconv in a Web server? My answer was that it’s possible, but it’s not practical to use it in-process. For me, it worked to do so in a demo of an intranet application, but that’s about as far as you want to go with it. It’s much more practical to use a distributed processing application (Celery, for example).
One of the attendees concurred that it makes sense to spin it off into a separate process, but “unoconv inexplicably crashes when it feels like it”.
Comment 3: Converting from Word
The initial comment was that pandoc might help with converting from Word to LibreOffice. This started a conversation which I’d summarize this way:
- LibreOffice can open MS Office docs, so use that instead of pandoc and save as LibreOffice
- If you open MS Office documents with LibreOffice, double check the formatting because it doesn’t always survive the transition
- LibreOffice can open PDFs for editing.