Image to PDF in Python
#python #benchmarking
As part of my current role at work, I'm required to automate the production of
many plots using results from PSCAD simulations.
These plots are generated with matplotlib.
To streamline the plot reviewing process, we generally combine the plots into
a single PDF file. Currently, this process is incredibly slow, sometimes taking
minutes to generate a single PDF with 50+ pages.
As a learning exercise in Python benchmarking, and to improve performance, I
decided to undertake an investigation into this component of our simulation
workflow, hopefully removing the bottleneck.
Current Practices
The current methodology for generating PDFs looks something like this:
- Generate simulation data for each scenario, including plots.
- Store the names of the folders containing output plots in a dictionary, keyed by scenario number.
- Iterate through the output folders using this dictionary and generate a PDF for each scenario.
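The dictionary from step 2 is not shown in the original code, but as a hypothetical illustration (the scenario numbers and test names here are made up) it might look like this:

```python
# Hypothetical illustration of the pdf_groups structure described above:
# scenario number -> list of test names whose output folders belong to it.
pdf_groups = {
    1: ["test_001", "test_002"],
    2: ["test_003", "test_004"],
}

# Step 3 then walks this dictionary and builds one PDF per scenario.
for scenario, tests in pdf_groups.items():
    print(f"Scenario {scenario}: {len(tests)} tests")
```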
I narrowed down the performance bottleneck to step 3 above. For this step, we have the following Python code:
# Create a unique PDF for each scenario
for key in pdf_groups.keys():
    pdf = FPDF(orientation='L', unit='mm', format='A4')
    # Loop through the tests in the scenario, keeping track of the first and last
    first_test = ""
    last_test = ""
    for test in pdf_groups[key]:
        if last_test == "":
            first_test = test
        for signal in SIGNALS_TO_CAPTURE:
            image = f"{out_folder}\\{test}\\{test}_{signal}.png"
            # Only add the graph to the PDF if it exists.
            if os.path.isfile(image):
                pdf.add_page()
                pdf.image(image, x=0, y=20, w=300, h=150)
            else:
                print(f"File: {image} does not exist.", file=sys.stderr)
        last_test = test
    pdf.output(out_folder + "\\" + first_test + "-"
               + re.search(r"[0-9]+", last_test).group(0) + ".pdf", "F")
The code itself is O(nm), where n is the number of scenarios and m is the number of tests per scenario.
My intuition tells me that the bottleneck is somewhere in the FPDF package,
which is used here to embed the images in the generated PDF output. We will
benchmark this entire code segment.
Benchmarking
There are a few options for benchmarking the speed of code in Python, including timeit, cProfile, and line_profiler.
I'm opting to use line_profiler, as this will give me a line-by-line breakdown
of my code's execution speed.
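For contrast, a quick aggregate timing can be had from the standard library's timeit, though it gives no per-line breakdown (a minimal sketch, not used in this post):

```python
import timeit

# timeit runs a snippet many times and returns the total elapsed time;
# good for micro-benchmarks, but it cannot tell you *which* line is slow.
duration = timeit.timeit("sum(range(1000))", number=10_000)
print(f"10,000 runs took {duration:.4f} s")
```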
Benchmarking Setup
To support the profiling process, we need to isolate this segment of code and create a custom script for profiling. Here's what I ended up with.
@profile
def benchmark_with_FPDF(tests, out_folder, in_folder):
    pdf = FPDF(orientation='L', unit='mm', format='A4')
    for test in tests:
        for signal in SIGNALS_TO_CAPTURE:
            image = in_folder + "\\" + test + "\\" + test + "_" + signal + ".png"
            # Only add the graph to the PDF if it exists.
            if os.path.isfile(image):
                pdf.add_page()
                pdf.image(image, x=0, y=20, w=300, h=150)
            else:
                print("File: " + image + " does not exist.")
    pdf.output(out_folder + "\\FPDF_benchmark.pdf", "F")
Note the use of the @profile decorator; line_profiler uses this to signal
which function to profile.
To profile the script with line_profiler, we just run:
kernprof -lv script_to_profile.py
Results
Before using line_profiler, I ran a very crude test shown below.
start_time = datetime.now()
benchmark_with_FPDF(tests, out_folder, in_folder)
total_time = datetime.now() - start_time
print(f"Benchmark took {str(total_time)}")
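As a side note, time.perf_counter() is generally a better fit than datetime.now() for this kind of crude wall-clock measurement; a minimal sketch with a stand-in workload:

```python
import time

# time.perf_counter() is a monotonic, high-resolution clock, so it is not
# affected by system clock adjustments the way datetime.now() can be.
start = time.perf_counter()
total = sum(range(1_000_000))  # stand-in for benchmark_with_FPDF(...)
elapsed = time.perf_counter() - start
print(f"Benchmark took {elapsed:.6f} s")
```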
This gave me a runtime of 45.574701 seconds. Note that the input was 24 PNG
images, generating a 24-page PDF.
The results from the initial profiling are captured below.
Benchmark took 0:01:48.941374
Wrote profile results to benchmark.py.lprof
Timer unit: 1e-06 s
Total time: 108.941 s
File: .\benchmark.py
Function: benchmark_with_FPDF at line 20
Line # Hits Time Per Hit % Time Line Contents
==============================================================
20 @profile
21 def benchmark_with_FPDF(tests, out_folder, in_folder):
22 1 59.7 59.7 0.0 pdf = FPDF(orientation='L', unit='mm', format='A4')
23 3 1.6 0.5 0.0 for test in tests:
24 24 25.5 1.1 0.0 for signal in SIGNALS_TO_CAPTURE:
25 24 39.2 1.6 0.0 image = in_folder + "\\" + test + "\\"+ test + "_" + signal + ".png"
26 # Only add the graph to PDF if it exists.
27 24 2324.0 96.8 0.0 if (os.path.isfile(image)):
28 24 708.6 29.5 0.0 pdf.add_page()
29 24 108695906.0 4528996.1 99.8 pdf.image(image, x=0, y=20, w=300, h=150)
30 else:
31 print("File: " + image + " does not exist.")
32
33
34 1 242092.4 242092.4 0.2 pdf.output(out_folder + "\\FPDF_benchmark.pdf", "F")
According to the results, 99.8% of execution time is spent on this FPDF API call.
pdf.image(image, x=0, y=20, w=300, h=150)
My intuition seems to be correct; it's clear that the bottleneck is in FPDF.
Alternatives to FPDF
After some research, I found img2pdf and Pillow to be candidates for replacing FPDF.
It seems that img2pdf actually uses Pillow internally. The img2pdf API
seems straightforward to use, so I'll just be testing that.
The benchmarking function, rewritten to use img2pdf, is included below.
@profile
def benchmark_with_img2pdf(tests, out_folder, in_folder):
    # Get a list of paths to the images
    paths = []
    for test in tests:
        for signal in SIGNALS_TO_CAPTURE:
            image = in_folder + "\\" + test + "\\" + test + "_" + signal + ".png"
            # Only add the image to the list if it exists.
            if os.path.isfile(image):
                paths.append(image)
            else:
                print("File: " + image + " does not exist.")
    # Specify paper size (A4 landscape)
    a4inpt = (img2pdf.mm_to_pt(297), img2pdf.mm_to_pt(210))
    layout_fun = img2pdf.get_layout_fun(a4inpt)
    # Generate PDF
    with open(f"{out_folder}\\benchmark_img2pdf.pdf", "wb") as f:
        f.write(img2pdf.convert(paths, layout_fun=layout_fun))
As before, we run the script again with line_profiler.
Results
Using my crude test script again, the img2pdf API only took 1.887416 seconds.
The line_profiler results confirm this:
Total time: 94.2377 s
File: .\benchmark.py
Function: benchmark_with_FPDF at line 21
Line # Hits Time Per Hit % Time Line Contents
==============================================================
21 @profile
22 def benchmark_with_FPDF(tests, out_folder, in_folder):
23 1 110.0 110.0 0.0 pdf = FPDF(orientation='L', unit='mm', format='A4')
24 3 2.5 0.8 0.0 for test in tests:
25 24 44.1 1.8 0.0 for signal in SIGNALS_TO_CAPTURE:
26 24 63.1 2.6 0.0 image = in_folder + "\\" + test + "\\"+ test + "_" + signal + ".png"
27 # Only add the graph to PDF if it exists.
28 24 3338.2 139.1 0.0 if (os.path.isfile(image)):
29 24 1062.9 44.3 0.0 pdf.add_page()
30 24 93822206.6 3909258.6 99.6 pdf.image(image, x=0, y=20, w=300, h=150)
31 else:
32 print("File: " + image + " does not exist.")
33
34
35 1 410847.1 410847.1 0.4 pdf.output(out_folder + "\\FPDF_benchmark.pdf", "F")
Total time: 1.88704 s
File: .\benchmark.py
Function: benchmark_with_img2pdf at line 37
Line # Hits Time Per Hit % Time Line Contents
==============================================================
37 @profile
38 def benchmark_with_img2pdf(tests, out_folder, in_folder):
39 # Get a list of paths to the images
40 1 0.5 0.5 0.0 paths = []
41 3 1.2 0.4 0.0 for test in tests:
42 24 8.7 0.4 0.0 for signal in SIGNALS_TO_CAPTURE:
43 24 25.2 1.1 0.0 image = in_folder + "\\" + test + "\\"+ test + "_" + signal + ".png"
44 # Only add the graph to PDF if it exists.
45 24 938.6 39.1 0.0 if (os.path.isfile(image)):
46 24 15.7 0.7 0.0 paths.append(image)
47 else:
48 print("File: " + image + " does not exist.")
49
50 # specify paper size (A4 landscape)
51 1 3.9 3.9 0.0 a4inpt = (img2pdf.mm_to_pt(297), img2pdf.mm_to_pt(210))
52 1 9.8 9.8 0.0 layout_fun = img2pdf.get_layout_fun(a4inpt)
53
54 # Generate PDF
55 1 408.9 408.9 0.0 with open(f"{out_folder}\\benchmark_img2pdf.pdf","wb") as f:
56 1 1885623.0 1885623.0 99.9 f.write(img2pdf.convert(paths, layout_fun=layout_fun))
- FPDF took approximately 94.2377 seconds to complete the conversion process.
- img2pdf completed the same task in approximately 1.88704 seconds.
Therefore, by migrating our image-to-PDF conversion scripts to use img2pdf, we should observe roughly a 50x increase in PDF generation speed. This is huge for multi-hour batched simulation runs, which can have 50+ test cases across multiple scenarios.
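The 50x figure can be sanity-checked directly from the two profiled totals:

```python
# Speedup implied by the two profiled totals reported above.
fpdf_time = 94.2377     # seconds, FPDF
img2pdf_time = 1.88704  # seconds, img2pdf
speedup = fpdf_time / img2pdf_time
print(f"Speedup: {speedup:.1f}x")
```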
Conclusion
There was a major bottleneck in our image-to-PDF conversion code. By moving
from FPDF to img2pdf, we saw a roughly 50x increase in program execution speed.