Using Word templates and docx to generate PDF

In the theme of automating tasks for the everyday insurance agency, this article is going to explore the use of various PDF writers to generate PDF documents based on a template, specifically, a Word template. Let's start by looking at what we need to create and then working backwards. Recently, I was tasked with the requirement to generate Certificates of Insurance based on a few parameters. In this use case, the certificate was generated using a number of known parameters; the insured name and address information, the various policy details, and a few comments regarding coverages. However, this certificate had to be replicated 80 times with only the policy numbers and effective start and end dates had to be changed. Therefore, the customer service representative had to log into the certificate generating application, enter the details for one, then click a "generate" button and repeat the process 80 times. Working as efficiently as possibly, they could complete 1 certificate every 2-3 minutes. The issue here was the manual labor of copying and pasting values into the certificate application and then waiting for the certificate to generate, then the manual step of saving the file, because it would always try to save the file using the same filename. It took 1 CSR an entire day to generate the 80 certificates. We can do better!

Template to PDF

First, a caveat: This is for educational purposes only. I did not build this program to generate the certificates of insurance. Instead, I wrote the program to prove that the concept could be done. There are licensing requirements with the company that issues the certificates and this article does not intend to circumvent those requirements. With this solution, I will leave it to the client to determine whether or not they wish to pursue obtaining the proper license to auto-generate the certificates. Now, onto the article...

The first step is to determine the data that must change from certificate to certificate. Here, the policy number and the start and end dates for the policy require changing. From certificate to certificate, we only need to change those three values, but they occur in four different places throughout the cert. So, there are four rows of three values for each certificate, and there are 80 certificates. This sounds like a job for csv, but we'll get to that in a minute.

Template to PDF

So, we have our master document and we have our data identified, but there is no easy way to modify a PDF document. In fact, it's kind of a pain in the ... Instead, we need to convert the master document into a template that we can use to format our 80 PDF documents. If it isn't known, docx is a Python package that makes it very easy to manipulate Word documents and Word has a very nice WYSIWYG editor, further, when you use the "Save As PDF" option, you get a perfect copy of the PDF as it was seen as a Word document; therefore, it makes a LOT of sense to try that method first.

Using Adobe Writer (I have the full version), I exported the PDF as a Word document (note: you need to have a writable copy of the PDF to perform this function. If the PDF is password protected, you are out of luck). Upon opening the file in Word, there were a few minor formatting issues, that were easily corrected: Some of the lines didn't line up correctly, the font and paragraph spacings needed adjustment, etc. Once I had the Word document in the format that I wanted, I could "templatize" the Word document by placing anchors. I used the standard template tags of {{ policy1 }}, {{ eff1 }} and {{ exp1 }} to correspond to the values that I wanted to replace.

Template to PDF

We can now get into the code to look at how to replace the values from {{ policy1 }} to our desired values. Remember when I said that this looks like a job for csv? Well, here's where it can come in handy. I created a file called policies.csv in my working directory and placed the values for policy1, eff1, exp1 in the file as comma separated values (note: if you create or save the csv using Excel, you will get a KeyError error because Excel adds a garbage character to the first value in the file. Stick with Notepad if you're using Windows). I created row headings for the csv so the entire file looked as follows:

Template to PDF

It is best to create a Unit Test case to run against everything, but I am choosing to skip that step for simplicity at this moment (this article is already long enough). First, you need to install docx if you haven't already done so (pip install docx), then open IDLE and get to work. Let's first read the CSV and parse the values out from it:

def read_file():
    policy_file = 'policies3.csv'

    with open(policy_file, newline='') as csvfile:
        policydata = csv.DictReader(csvfile, delimiter=',', quotechar='|')
        lines = []
        for row in policydata:
            print(row)
            lines.append([row['policy1'], row['eff1'], row['exp1'],
                          row['policy2'], row['eff2'], row['exp2'],
                          row['policy3'], row['eff3'], row['exp3'],
                          row['policy4'], row['eff4'], row['exp4']])
    return lines

This little snippet will return an array of a data dictionary where each element of the array is a single row in the csv, and the element of the array is a data dictionary separated with the name/value pairs of our data. Very nice. Now, all we have to do is iterate through the lines, extract the values and replace them in our Word document template using docx:

def add_csv_values_to_docs_template(filename, values):
    template = "coi_template.docx"
    dest_file = filename + ".docx"

    document = Document(template)

    style = document.styles['Normal']
    font = style.font
    font.name = 'Arial'
    font.size = Pt(6)
    font.bold = True

    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    if '{{ policy1 }}' in paragraph.text:
                        paragraph.text = values[0]
                    if '{{ eff1 }}' in paragraph.text:
                        paragraph.text = values[1]
                    if '{{ exp1 }}' in paragraph.text:
                        paragraph.text = values[2]
                    [ ... repeat for the remaining values ... ]

    document.save(dest_file)

Let's step through this. In the first part, we are creating a unique filename based on the iteration of the program (this is done in main()). In the second, we are opening our template as a docx Word Document. In the next section, the font was coming out very large because it was using the standard size 11 font, which was huge in this certificate (it was font size 6), so I am adjusting the Normal style to use size 6 instead of the default 11. Finally, the program will iterate through the document's tables to find the values (note: all of the values we need are in tables instead of paragraphs, so we have to loop through the tables in the document instead of just the paragraphs). If it finds a template value of {{ policy1 }}, it will replace it with the first value in the spreadsheet. Please note that these are not elif statements because the program is not selecting a template tag and determining if that tag is one or the other. Instead we are looping through the whole document and finding one matching tag or finding another matching tag, etc.

Template to PDF

The code now reads from a CSV file and populates the values that were configured in the Word Document using docx. The last step is to covert the document to a PDF file, which will require a little effort. I believe there are three solid ways that the Word document can be converted to PDF: 1) unoconv (which is a script that uses LibreOffice's runtime libraries to save the file to PDF) 2) Word (as a subprocess) and 3) Open Office as a subprocess. For OpenOffice and LibreOffice, the results were the same; the format was a little off and the images didn't convert from the document to the PDF. So, the most reliable was the Word document (I am running Windows 10):

def convert_docx_to_pdf(method, filename):

    path = os.path.dirname(os.path.abspath(__file__))    
    try:
        word = win32com.client.DispatchEx("Word.Application")
        worddoc = word.Documents.Open(os.path.join(path, filename))
        new_filename = filename.replace(".docx", r".pdf")
        print(new_filename)
        worddoc.SaveAs(os.path.join(path, new_filename), FileFormat=17)
        worddoc.Close()
        word.Quit()
    except Exception:
        return Exception

The code first creates a com connection with the Word application, opens the document, saves the file as a PDF and then closes the document and the application. Easy Peasy.

Template to PDF

The last thing to do is to replace the remaining values with template anchors in the template document, then write a loop that iterates through the csv to create a new file for each row, which can be done as follows:

def main():
    lines = read_file()
    i = 0
    for line in lines:
        filename = "certificate_of_insurance_" + str(i)
        add_csv_values_to_docs_template(filename, line)
        i += 1

To summarize, here was a situation where a customer service representative (CSR) was required to load an application, change a couple of values and click submit 80 different times. Due to copying and pasting, this would have lead to at least 8 hours of their time, as well as a number of manual errors. This code took approximately an hour to write and try various methods, can be reused, and can generate n number of certificates within a couple of minutes. It took about 4 seconds per file (it takes time to load Word and generate the PDF). So, what took > 8 hours, now takes < 5 minutes to complete.

It's funny, people look at a typical agency and solution providers try to sell massive solutions for tens of thousands of dollars to automate, but these repetitive tasks are the ones that occupy the most time. For less than $1,000, an agency can reduce the workload of a CSR considerably!

Let me know your thoughts on Twitter.