Python Comprehension Is Not Fast

published at 12 Oct, 2016 by Szymon LipiƄski tags: python programming c++

In the previous blog post I have presented results of benchmarking list comprehensions and maps in Python.

The results looked fine, however there was one thing I couldn’t stop thinking about. The huge memory overhead and the huge execution time.

The process used 50MB of memory from the beginning. Then it had just to iterate through a list of integers (which are small). The number of integers is also small (only 1M). Then it had to make 1M divisions to find the even numbers, and square it (making 0.5M multiplications, and 0.5M additions).

Why did it take 1.5 minute?

The answer is simple:

Because Python is slow.

The Previous Winner

The winner of the competition was the version with a list comprehension and squaring the numbers in a loop:

@profile
def sum_numbers_list_comprehension_for_square(data):
    res = 0
    for x in [n for n in data if n % 2 == 0]:
        res += x*x
    return res

I have tested it with Python 2 and Python 3. The results were:

Python Version Time[s] Operation Overhead [MB] * Memory Usage [MB] **
2 70 3.7 47
3 76 3.9 53

* - Memory Overhead is the amount of memory used additionally for the computations

** - Memory Usage is the amount of memory that Python used before running the test function

Let’s Reimplement It With C++

The Data Memory Usage

First the memory overhead. We keep 1M of integers in a list. In Python it was generated with the below code. The memory size of the list is 8.6MB.

list(range(1, 1*1000*1000 + 1))

In C++ I will use:

#include <vector>
auto data = std::vector<int>(MAX);
std::iota(data.begin(), data.end(), 1);

This will use exactly:

4B (int size) * 1.000.000 + 24B (vector overhead) = 3.8MB

The Implementation

The implementation is pretty simple. Take a look at the code, there is no pointer. This is another myth, that we need to use lots of pointers in C++ code. They are inside the classes that we use, they are nice and managed by the class destructors. No need to insert * everywhere.

#import <vector>
#import <iostream>
#import <numeric>

using namespace std;

typedef unsigned long long int big_number;

const int MAX = 1*1000*1000;

big_number sum_for(std::vector<int> const &v)
{
  big_number res = 0;
  for(auto n : v)
  {
    if (n % 2 == 0)
    {
      res += static_cast<big_number>(n) * static_cast<big_number>(n);
    }
  }
  return res;
}

int main()
{
  auto data = std::vector<int>(MAX);
  std::iota(data.begin(), data.end(), 1);

  std::cout << "There are " << data.size() <<  " numbers." << std::endl;
  std::cout << "The first is " << data[0] << std::endl;
  std::cout << "The last is " << data[1000000-1] << std::endl;
  
  auto res = sum_for(data);
  
  std::cout << "The result is: " << res << std::endl;

  return 1;
};

Compilation

This will be compiled using clang++ with modern C++ features enabled (-std=c++1z) – this is needed for the auto and the for (auto n:v) loop, with optimization (-O2) and the result will be stored in the test binary file.

clang++ -std=c++1z -O2 -o test test.cpp

Let’s Run It

So now let’s see how fast this can be:

time ./test
There are 1000000 numbers.
The first is 1
The last is 1000000
The result is: 166667166667000000

real    0m0.003s
user    0m0.000s
sys     0m0.000s

Yes, really. That code run in 3ms. It used 6MB of the memory. This time includes starting the program, creating the 1M of integers in a container, iterating, filtering, squaring, summing.

Final Thoughts

The C++ implementation uses almost 8 times less memory and runs 23.333 times faster. Yes, that’s 23 thousand.

Yes, there is the numpy which could be used, however people write about different results of np.power(n, 2) and a*a or a**2.