Introduction
Since the start of 2022, Python and C have remained neck and neck in the top two positions of the TIOBE index, making them the most popular programming languages. The contrast could not be starker: Python favors programmer productivity while C favors computational speed; the two sit at opposite ends of this trade-off spectrum. This post explores how Numba, a just-in-time (JIT) compiler, can speed up numerical computations in Python without additional effort or any rewriting of existing Python code, i.e., improving computational speed without trading off any programmer productivity. I will illustrate this crucial point with an example, comparing code readability and execution speed: Numba vs. Numpy vs. JavaScript.
Background
In computer science, programming languages generally lie along a spectrum between low-level and high-level, with high-level languages offering more abstraction from the computer’s instruction set architecture and hence greater expressiveness to the programmer. The primary purpose of abstraction is to improve the programmer’s productivity: the programmer spends less time on the detailed “hows” and focuses more on the “what” of the big picture.

An excellent example of a high-level programming language is SQL, a declarative language. In SQL, the programmer states what they want without painstakingly specifying the time-consuming nitty-gritty details of how to get it. This is why the “programmer” here is generally referred to as a “database query user” rather than a “programmer”. Greater abstraction improves programmer productivity in terms of less time spent coding and a lower risk of bugs. Unfortunately, the price to pay is a reduction in computational speed, known as the abstraction penalty: it is more challenging, or even impossible, for the compiler or interpreter to figure out a fast way of performing the computations.

In contrast, an excellent example of a low-level programming language is C, a procedural language, meaning the programmer specifies in detail each of the steps for performing the computations. The result of this hand-holding, close-to-the-metal approach is faster computational speed, at the expense of the programmer spending more time on menial, detailed tasks and an increased risk of bugs, especially the security bugs that C and C++ are notorious for.
The trade-off between programmer productivity and computational speed can be appreciated by comparing conciseness indices and performance benchmarks. For example, when comparing Python to C, The Computer Language Benchmarks Game (formerly called The Great Computer Language Shootout) indicates that Python generally takes 10 to 100 times longer than C to complete equivalent computations. Conciseness, on the other hand, can be indicated by simple measures like lines of code (LOC) or more sophisticated measures like gearing factors or even compression ratios, as described in this paper. However, all these measures provide only general indications of programmer productivity and computational speed, not accurate quantifications. In particular, programmer productivity depends not only on the time spent writing code but also on reading (maintaining), testing, and debugging it; no conciseness index can fully represent all those aspects. For example, Haskell code is generally exceedingly concise but also extremely hard and time-consuming for most programmers to read; i.e., to most programmers the code looks terse rather than concise.
Depending on the situation (use case, problem domain, team, cost of the programmer’s time, cost of computational power, etc.), the trade-off mentioned earlier may or may not be an issue. A good example is Python. Its rise in popularity is due not only to the ease of both writing and reading Python code but also to the significant improvements in the efficiency of computational power over the past decades, which have made the cost of computation much lower than the cost of the programmer’s time in many use cases. Even in such use cases, however, there is often a strong need for computational speed in certain parts of the program. Programmers usually meet such requirements using libraries like Numpy and TensorFlow, which take care of specific tasks such as heavy number crunching. This approach is a compromise: the programmer invests some additional time coding against the libraries’ application programming interfaces (APIs) instead of writing plain Python only. Still, the time investment is much less than if the programmer were to, say, write all those parts in C.
In an ideal world, such compromises and trade-offs would not be necessary; Julia is a programming language designed from the ground up to be simultaneously expressive and fast. However, it is a relatively young language, so it remains to be seen whether Julia will achieve this goal. In the meantime, for existing programming languages, new technologies help to reduce this compromise gap in two ways:
- Improving the computational speed of high-level languages through better compilers or interpreters, such as just-in-time (JIT) compilers. A good example is the V8 JIT compiler for JavaScript, which made JavaScript so fast that it expanded JavaScript’s usage beyond its original web browser front-end to powering web servers in the back-end, e.g., in Node.js and Express.js.
- Improving the programmer productivity of low-level languages through coding aids like linters and intelligent code editors, and reducing the risk of bugs through automated code analysis tools. More recently, tools like GitHub Copilot and ChatGPT have used artificial intelligence (AI) to help programmers code even faster and better.
Unfortunately, the standard Python interpreter (CPython) still has no general JIT compiler. One popular way to speed up Python code is Cython; however, that approach requires modifying existing code. Numba, on the other hand, is a Python just-in-time (JIT) compiler that speeds up selected functions. Unlike Cython, it works on plain Python code without modifications other than a simple one-line decorator. Hence, by using Numba, one can speed up execution without compromising the readability of the code or trading off the programmer’s productivity in any way. This alternative has certain limitations for general use cases, but in this post, I will demonstrate that it works well on numerical computations.
Illustrative problem
The goal is to write a program to solve the Seven-Eleven puzzle, copied from the Internet and reproduced below (original author unknown).
You walk into a 7-11 store and pick up four different items. You place them on the checkout counter. The clerk says to you, “That will be $7.11.” You laugh and say, “How did you get that price?” The clerk says, “I took the four individual prices and multiplied them together, and the result was $7.11.” “MULTIPLY?!” you say. “You can’t do that; you need to ADD the four prices together.” The clerk says, “Doesn’t matter, the result is still a total of $7.11 whether I multiply the four prices or add the four prices.”
Assume here that:
- The goal is to solve the puzzle as quickly as possible, i.e., spend as little programmer’s time as possible; no prize for coming up with an elegant solution.
- There is no need to generalize the puzzle beyond the stated problem.
Although not always the case, such situations do arise in actual work: a brute-force method is preferred if it saves a lot of the programmer’s time while still meeting the required computational speed.
All solutions described below were benchmarked on a desktop PC with an AMD Athlon 200GE CPU.
Illustrative solutions
Before we start, note that:
- Given the nature of such puzzles, the smallest monetary unit here should be 1 cent, i.e., no fractional cents.
- The puzzle does not state that there is only one solution, nor that only one solution is needed even if more exist; hence, it is reasonable to search for all possible solutions.
The first point is crucial: it implies that our solution is best expressed in integral values of cents rather than floating-point values of dollars, since floating-point numbers cannot represent base-10 decimals exactly, which means a higher risk of bugs in our solution.
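To see why, a quick sketch in plain Python shows the rounding error that creeps in when prices are represented as floating-point dollars rather than integral cents:

```python
# Floating-point dollars accumulate representation error:
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False

# Integral cents are exact:
print(10 + 20 == 30)     # True
```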
Python only
The following code is a straightforward search for all possible solutions, expressing everything in cents. Note that this is not entirely a brute-force search: in each nested loop, we start the search from the value of the enclosing loop’s variable instead of from one, i.e., we assume a ≤ b ≤ c ≤ d without any loss of generality. For example, b iterates starting from a:
for b in range(a, 712):
rather than from 1:
for b in range(1, 712):
This particular observation eliminates a large number of unnecessary searches, roughly 96% of the search space, but the exact figure is peripheral to our discussion here, as I assume this iteration technique is an essential skill for most professional programmers.
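As a quick sanity check on that figure: the number of non-decreasing 4-tuples drawn from n values is the number of 4-element multisets, C(n + 3, 4), so the fraction of the full n⁴ search space skipped by the ordering trick can be computed directly with the standard library:

```python
from math import comb

n = 711
ordered = n ** 4                # full brute-force search space
nondecreasing = comb(n + 3, 4)  # 4-element multisets from n values
saved = 1 - nondecreasing / ordered
print(f"{saved:.1%}")           # 95.8%
```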
The code took 2.00 seconds to run.
```python
import time

def main():
    for a in range(1, 712):
        if 711000000 % a != 0:
            continue
        for b in range(a, 712):
            if 711000000 % b != 0:
                continue
            for c in range(b, 712):
                if 711000000 % c != 0:
                    continue
                for d in range(c, 712):
                    if 711000000 % d != 0:
                        continue
                    total = a + b + c + d
                    if (total == 711 and
                            total * 1000000 == a * b * c * d):
                        print(a, b, c, d)

# get the start time
st = time.time()
main()
# get the end time
et = time.time()
# get the execution time
elapsed_time = et - st
print('Execution time:', elapsed_time, 'seconds')
```
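For reference, the well-known answer to this puzzle is $1.20, $1.25, $1.50, and $3.16, which the program prints in cents as `120 125 150 316`; the arithmetic is easy to verify by hand:

```python
a, b, c, d = 120, 125, 150, 316  # prices in cents
print(a + b + c + d)             # 711
print(a * b * c * d)             # 711000000
```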
JavaScript only
The JavaScript code below took a mere 0.15 seconds to complete.
```javascript
const start = performance.now();
for (let a = 1; a <= 711; a++) {
  if (711000000 % a !== 0) continue;
  for (let b = a; b <= 711; b++) {
    if (711000000 % b !== 0) continue;
    for (let c = b; c <= 711; c++) {
      if (711000000 % c !== 0) continue;
      for (let d = c; d <= 711; d++) {
        if (711000000 % d !== 0) continue;
        if (a + b + c + d === 711 &&
            a * b * c * d === 711000000)
          console.log(a, b, c, d);
      }
    }
  }
}
const end = performance.now();
console.log(`Execution time: ${end - start} ms`);
```
The above is a one-to-one translation of our earlier Python code into JavaScript. However, there is one crucial difference: integers in Python 3 have no size limit, while JavaScript numbers are 64-bit floating-point values by default, which leaves only 53 bits of integer precision, with the safe range bounded by Number.MIN_SAFE_INTEGER and Number.MAX_SAFE_INTEGER. Hence, the programmer is responsible for ensuring there are no arithmetic overflow bugs. In addition to its default numbers, JavaScript supports BigInt values with no size limit, just like Python 3 integers. This gives us a clue to one key driver of the speed difference, besides JavaScript having a fast JIT compiler: Python is effectively using BigInt! To see how much of an impact this can have, let us modify our JavaScript code to use BigInt and rerun it, as shown below. It took 1.15 seconds, more than seven times as long as the earlier version, although still faster than Python 3.
```javascript
const start = performance.now();
for (let a = 1n; a <= 711n; a++) {
  if (711000000n % a !== 0n) continue;
  for (let b = a; b <= 711n; b++) {
    if (711000000n % b !== 0n) continue;
    for (let c = b; c <= 711n; c++) {
      if (711000000n % c !== 0n) continue;
      for (let d = c; d <= 711n; d++) {
        if (711000000n % d !== 0n) continue;
        if (a + b + c + d === 711n &&
            a * b * c * d === 711000000n)
          console.log(a, b, c, d);
      }
    }
  }
}
const end = performance.now();
console.log(`Execution time: ${end - start} ms`);
```
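It is worth noting that, for this particular puzzle, the intermediate values never actually exceed the 53-bit safe range, which is why the default-number JavaScript version happens to produce correct results; checking such bounds is exactly the kind of care the programmer must take. A quick bound check (sketched in Python, where big integers make the check trivially safe):

```python
# Largest intermediate value in the search: a*b*c*d with all four at 711.
largest = 711 ** 4
print(largest)                  # 255551481441
print(largest < 2 ** 53)        # True: fits in JavaScript's safe range
print(711 * 1000000 < 2 ** 53)  # True: the scaled sum also fits
```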
Python with Numpy
Using the Numpy library, the code below took 0.40 seconds. Note that:
- Numpy uses fixed-size integers (int64 here), not BigInt.
- The code below required a Cartesian product function, which is not built into Numpy; the implementation used here was found on StackOverflow.
- The code is extremely time-consuming to read!
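As a side note, the same Cartesian product can also be formed with the standard numpy.meshgrid, shown below on a tiny input for illustration; the StackOverflow helper in the listing that follows implements the same operation:

```python
import numpy

# Cartesian product of [1, 2] with itself via numpy.meshgrid,
# producing one row per ordered pair.
xs = numpy.array([1, 2])
grids = numpy.meshgrid(xs, xs, indexing='ij')
pairs = numpy.stack(grids, axis=-1).reshape(-1, 2)
print(pairs)
# [[1 1]
#  [1 2]
#  [2 1]
#  [2 2]]
```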
```python
import numpy
import time

# see StackOverflow link above
def cartesian_product_transpose(*arrays):
    broadcastable = numpy.ix_(*arrays)
    broadcasted = numpy.broadcast_arrays(*broadcastable)
    rows = numpy.prod(broadcasted[0].shape)
    cols = len(broadcasted)
    dtype = numpy.result_type(*arrays)
    out = numpy.empty(rows * cols, dtype=dtype)
    start, end = 0, rows
    for a in broadcasted:
        out[start:end] = a.reshape(-1)
        start = end
        end += rows
    return out.reshape(cols, rows).T

# get the start time
st = time.time()
all_ints = numpy.arange(1, 712, dtype=numpy.int64)
divisors = all_ints[(711000000 % all_ints) == 0]
# First, get the Cartesian product
pre_candidates = cartesian_product_transpose(
    divisors, divisors, divisors, divisors)
assert pre_candidates.shape[0] == divisors.shape[0] ** 4
# Then, filter out duplicates due to the
# commutative nature of sums and products
candidates = pre_candidates[
    (pre_candidates[:, 0] <= pre_candidates[:, 1]) &
    (pre_candidates[:, 1] <= pre_candidates[:, 2]) &
    (pre_candidates[:, 2] <= pre_candidates[:, 3]), :]
candidates2 = candidates[numpy.sum(candidates, axis=1) == 711]
sums = numpy.sum(candidates2, axis=1)
products = numpy.prod(candidates2, axis=1)
print(candidates2[sums * 1000000 == products, :])
# get the end time
et = time.time()
# get the execution time
elapsed_time = et - st
print('Execution time:', elapsed_time, 'seconds')
```
Python with Numba
Using Numba, the following program took 1.00 second to run. It is identical to the original Python-only program above, except for two extra lines:

```python
from numba import jit

@jit(nopython=True)
```
```python
import time

from numba import jit

@jit(nopython=True)
def main():
    for a in range(1, 712):
        if 711000000 % a != 0:
            continue
        for b in range(a, 712):
            if 711000000 % b != 0:
                continue
            for c in range(b, 712):
                if 711000000 % c != 0:
                    continue
                for d in range(c, 712):
                    if 711000000 % d != 0:
                        continue
                    total = a + b + c + d
                    if (total == 711 and
                            total * 1000000 == a * b * c * d):
                        print(a, b, c, d)

# get the start time
st = time.time()
main()
# get the end time
et = time.time()
# get the execution time
elapsed_time = et - st
print('Execution time:', elapsed_time, 'seconds')
```
Conclusion
The following table summarises our findings. “Safe integer” refers to unlimited-size integers (the only integer type in Python 3; BigInt in JavaScript), which carry no risk of arithmetic overflow bugs.
| Method | Run time | Readability | No extra coding | Safe integer |
| --- | --- | --- | --- | --- |
| Plain Python 3 | 2.00s | ✓ | ✓ | ✓ |
| Python + Numpy | 0.40s | ✗ | ✗ | ✗ |
| Python + Numba | 1.00s | ✓ | ✓ | ✗ |
| Plain JavaScript | 0.15s | ✓ | ✓ | ✗ |
| JavaScript + BigInt | 1.15s | ✓ | * | ✓ |

- Minimal extra coding for JavaScript + BigInt.
In this post, I have demonstrated that Numba can provide meaningful computational speed-ups with essentially no programmer effort while retaining the readability of the original Python code. Compared to vanilla JavaScript, there is still room for improvement for this JIT compiler; nonetheless, it should prove useful to many Python programmers even in its current state. Programmers do, however, need basic computer science knowledge, e.g., exercising caution with potential arithmetic overflow, as in our example.