Image to PDF in Python
#python #benchmarking
As part of my current role at work, I'm required to automate the production of
many plots using results from PSCAD simulations.
These plots are generated with matplotlib.
To streamline the plot reviewing process, we generally combine the plots into
a single PDF file. Currently, this process is incredibly slow, sometimes taking
minutes to generate a single PDF with 50+ pages.
As a learning exercise in Python benchmarking, and to improve performance, I
decided to undertake an investigation into this component of our simulation
workflow, hopefully removing the bottleneck.
Current Practices
The current methodology for generating PDFs looks something like this:
- Generate simulation data for each scenario, including plots.
- Store the names of the folders containing output plots in a dictionary, keyed by scenario number.
- Iterate through the output folders using this dictionary and generate a PDF for each scenario.
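The dictionary from step 2 is not shown in the original code, but as a hypothetical illustration (the scenario numbers and test names here are made up) it might look like this:

```python
# Hypothetical illustration of the pdf_groups structure described above:
# scenario number -> list of test names whose output folders belong to it.
pdf_groups = {
    1: ["test_001", "test_002"],
    2: ["test_003", "test_004"],
}

# Step 3 then walks this dictionary and builds one PDF per scenario.
for scenario, tests in pdf_groups.items():
    print(f"Scenario {scenario}: {len(tests)} tests")
```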
I narrowed down the performance bottleneck to step 3 above. For this step, we have the following Python code:
# Create a unique PDF for each scenario
for key in pdf_groups.keys():
    pdf = FPDF(orientation='L', unit='mm', format='A4')
    # Loop through the tests in the scenario, keeping track of the first and last
    first_test = ""
    last_test = ""
    for test in pdf_groups[key]:
        if last_test == "":
            first_test = test
        for signal in SIGNALS_TO_CAPTURE:
            image = f"{out_folder}\\{test}\\{test}_{signal}.png"
            # Only add the graph to the PDF if it exists.
            if os.path.isfile(image):
                pdf.add_page()
                pdf.image(image, x=0, y=20, w=300, h=150)
            else:
                print(f"File: {image} does not exist.", file=sys.stderr)
        last_test = test
    pdf.output(out_folder + "\\" + first_test + "-"
               + re.search(r"[0-9]+", last_test).group(0) + ".pdf", "F")
The code itself is O(nm), where n is the number of scenarios and m is the number of tests per scenario.
My intuition tells me that the bottleneck is somewhere in the FPDF package,
which is used here to embed the images in the generated PDF output. We will
benchmark this entire code segment.
Benchmarking
There are a few options for benchmarking the speed of code in Python, including timeit, cProfile, and line_profiler.
I'm opting to use line_profiler, as this will give me a line-by-line breakdown
of my code's execution speed.
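For contrast, a quick aggregate timing can be had from the standard library's timeit, though it gives no per-line breakdown (a minimal sketch, not used in this post):

```python
import timeit

# timeit runs a snippet many times and returns the total elapsed time;
# good for micro-benchmarks, but it cannot tell you *which* line is slow.
duration = timeit.timeit("sum(range(1000))", number=10_000)
print(f"10,000 runs took {duration:.4f} s")
```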
Benchmarking Setup
To support the profiling process, we need to isolate this segment of code and create a custom script for profiling. Here's what I ended up with.
@profile
def benchmark_with_FPDF(tests, out_folder, in_folder):
    pdf = FPDF(orientation='L', unit='mm', format='A4')
    for test in tests:
        for signal in SIGNALS_TO_CAPTURE:
            image = in_folder + "\\" + test + "\\" + test + "_" + signal + ".png"
            # Only add the graph to the PDF if it exists.
            if os.path.isfile(image):
                pdf.add_page()
                pdf.image(image, x=0, y=20, w=300, h=150)
            else:
                print("File: " + image + " does not exist.")
    pdf.output(out_folder + "\\FPDF_benchmark.pdf", "F")
Note the use of the @profile decorator; line_profiler uses this to signal
which function to profile.
To profile the script with line_profiler, we just run:
kernprof -lv script_to_profile.py
Results
Before using line_profiler, I ran a very crude test shown below.
start_time = datetime.now()
benchmark_with_FPDF(tests, out_folder, in_folder)
total_time = datetime.now() - start_time
print(f"Benchmark took {str(total_time)}")
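As a side note, time.perf_counter() is generally a better fit than datetime.now() for this kind of crude wall-clock measurement; a minimal sketch with a stand-in workload:

```python
import time

# time.perf_counter() is a monotonic, high-resolution clock, so it is not
# affected by system clock adjustments the way datetime.now() can be.
start = time.perf_counter()
total = sum(range(1_000_000))  # stand-in for benchmark_with_FPDF(...)
elapsed = time.perf_counter() - start
print(f"Benchmark took {elapsed:.6f} s")
```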
This gave me a runtime of 45.574701 seconds. Note that the input was 24 PNG
images, generating a 24-page PDF.
The results from the initial profiling are captured below.
Benchmark took 0:01:48.941374
Wrote profile results to benchmark.py.lprof
Timer unit: 1e-06 s
Total time: 108.941 s
File: .\benchmark.py
Function: benchmark_with_FPDF at line 20
Line # Hits Time Per Hit % Time Line Contents
==============================================================
20 @profile
21 def benchmark_with_FPDF(tests, out_folder, in_folder):
22 1 59.7 59.7 0.0 pdf = FPDF(orientation='L', unit='mm', format='A4')
23 3 1.6 0.5 0.0 for test in tests:
24 24 25.5 1.1 0.0 for signal in SIGNALS_TO_CAPTURE:
25 24 39.2 1.6 0.0 image = in_folder + "\\" + test + "\\"+ test + "_" + signal + ".png"
26 # Only add the graph to PDF if it exists.
27 24 2324.0 96.8 0.0 if (os.path.isfile(image)):
28 24 708.6 29.5 0.0 pdf.add_page()
29 24 108695906.0 4528996.1 99.8 pdf.image(image, x=0, y=20, w=300, h=150)
30 else:
31 print("File: " + image + " does not exist.")
32
33
34 1 242092.4 242092.4 0.2 pdf.output(out_folder + "\\FPDF_benchmark.pdf", "F")
According to the results, 99.8% of execution time is spent on this FPDF API call.
pdf.image(image, x=0, y=20, w=300, h=150)
My intuition seems to be correct; it's clear that the bottleneck is in FPDF.
Alternatives to FPDF
After some research, I found img2pdf and Pillow to be candidates for replacing FPDF.
It seems that img2pdf actually uses Pillow internally. The img2pdf API
seems straightforward to use, so I'll just be testing that.
The benchmarking function, rewritten to use img2pdf, is included below.
@profile
def benchmark_with_img2pdf(tests, out_folder, in_folder):
    # Get a list of paths to the images
    paths = []
    for test in tests:
        for signal in SIGNALS_TO_CAPTURE:
            image = in_folder + "\\" + test + "\\" + test + "_" + signal + ".png"
            # Only add the image to the list if it exists.
            if os.path.isfile(image):
                paths.append(image)
            else:
                print("File: " + image + " does not exist.")
    # Specify paper size (A4 landscape)
    a4inpt = (img2pdf.mm_to_pt(297), img2pdf.mm_to_pt(210))
    layout_fun = img2pdf.get_layout_fun(a4inpt)
    # Generate PDF
    with open(f"{out_folder}\\benchmark_img2pdf.pdf", "wb") as f:
        f.write(img2pdf.convert(paths, layout_fun=layout_fun))
As before, we run the script again with line_profiler.
Results
Using my crude test script again, the img2pdf API only took 1.887416 seconds.
The line_profiler results confirm this:
Total time: 94.2377 s
File: .\benchmark.py
Function: benchmark_with_FPDF at line 21
Line # Hits Time Per Hit % Time Line Contents
==============================================================
21 @profile
22 def benchmark_with_FPDF(tests, out_folder, in_folder):
23 1 110.0 110.0 0.0 pdf = FPDF(orientation='L', unit='mm', format='A4')
24 3 2.5 0.8 0.0 for test in tests:
25 24 44.1 1.8 0.0 for signal in SIGNALS_TO_CAPTURE:
26 24 63.1 2.6 0.0 image = in_folder + "\\" + test + "\\"+ test + "_" + signal + ".png"
27 # Only add the graph to PDF if it exists.
28 24 3338.2 139.1 0.0 if (os.path.isfile(image)):
29 24 1062.9 44.3 0.0 pdf.add_page()
30 24 93822206.6 3909258.6 99.6 pdf.image(image, x=0, y=20, w=300, h=150)
31 else:
32 print("File: " + image + " does not exist.")
33
34
35 1 410847.1 410847.1 0.4 pdf.output(out_folder + "\\FPDF_benchmark.pdf", "F")
Total time: 1.88704 s
File: .\benchmark.py
Function: benchmark_with_img2pdf at line 37
Line # Hits Time Per Hit % Time Line Contents
==============================================================
37 @profile
38 def benchmark_with_img2pdf(tests, out_folder, in_folder):
39 # Get a list of paths to the images
40 1 0.5 0.5 0.0 paths = []
41 3 1.2 0.4 0.0 for test in tests:
42 24 8.7 0.4 0.0 for signal in SIGNALS_TO_CAPTURE:
43 24 25.2 1.1 0.0 image = in_folder + "\\" + test + "\\"+ test + "_" + signal + ".png"
44 # Only add the graph to PDF if it exists.
45 24 938.6 39.1 0.0 if (os.path.isfile(image)):
46 24 15.7 0.7 0.0 paths.append(image)
47 else:
48 print("File: " + image + " does not exist.")
49
50 # specify paper size (A4 landscape)
51 1 3.9 3.9 0.0 a4inpt = (img2pdf.mm_to_pt(297), img2pdf.mm_to_pt(210))
52 1 9.8 9.8 0.0 layout_fun = img2pdf.get_layout_fun(a4inpt)
53
54 # Generate PDF
55 1 408.9 408.9 0.0 with open(f"{out_folder}\\benchmark_img2pdf.pdf","wb") as f:
56 1 1885623.0 1885623.0 99.9 f.write(img2pdf.convert(paths, layout_fun=layout_fun))
- FPDF took approximately 94.2377 seconds to complete the conversion process.
- img2pdf completed the same task in approximately 1.88704 seconds.
Therefore, by migrating our image-to-PDF conversion scripts to use img2pdf, we should observe roughly a 50x increase in PDF generation speed. This is huge for multi-hour batched simulation runs, which can have 50+ test cases across multiple scenarios.
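The 50x figure can be sanity-checked directly from the two profiled totals:

```python
# Speedup implied by the two profiled totals reported above.
fpdf_time = 94.2377     # seconds, FPDF
img2pdf_time = 1.88704  # seconds, img2pdf
speedup = fpdf_time / img2pdf_time
print(f"Speedup: {speedup:.1f}x")
```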
Conclusion
There was a major bottleneck in our image-to-PDF conversion code. By moving
from FPDF to img2pdf, we saw a roughly 50x increase in program execution speed.