Python Comprehension Is Not Fast

Author: Szymon Lipiński
Published at: 2016-10-12

In the previous blog post I have presented results of benchmarking list comprehensions, and maps in Python.

The results looked fine, however there was one thing that I couldn’t stop thinking about. The memory overhead, and the huge execution time.

The process wanted to use the 50MB from the beginning. Then it had just to iterate through a list of integers (which are small). The number of integers is also small (only 1M). Then it had to make 1M divisions to find the even numbers, and square it (making 0.5M multiplications, and 0.5M additions).

Why did it take 1.5 minute?

The answer is simple:

Because Python is slow.

The Previous Winner

The winner of the competition was the version with a list comprehension, and squaring the numbers in a loop:

@profile
def sum_numbers_list_comprehension_for_square(data):
    res = 0
    for x in [n for n in data if n % 2 == 0]:
        res += x*x
    return res

I have tested that with Python 2 and Python 3. The results were:

Python Version Time[s] Operation Overhead [MB] Memory Usage [MB]*
2 70 3.7 47
3 76 3.9 53

* - Memory Usage is the amount of memory that Python used before running the test function

Let’s Reimplement It With C++

Data Memory Usage

First the memory overhead. The list with 1M of integers in Python was generated with:

list(range(1, 1*1000*1000 + 1))

and it took 8.6MB.

In C++ I will use:

#include <vector>
auto data = std::vector<int>(MAX);
std::iota(data.begin(), data.end(), 1);

This will use exactly:

4B (int size) * 1.000.000 + 24B (vector overhead) = 3.8MB

Implementation

The implementation is pretty simple. Btw. take a look at the code, there is no pointer. This is another myth, that we need to use lots of pointers in C++ code. They are inside the classes that we use, they are nice, and managed by the classes destructors. No need to insert * everywhere.

#import <vector>
#import <iostream>
#import <numeric>

using namespace std;

typedef unsigned long long int big_number;

const int MAX = 1*1000*1000;

big_number sum_for(std::vector<int> const &v)
{
  big_number res = 0;
  for(auto n : v)
  {
    if (n % 2 == 0)
    {
      res += static_cast<big_number>(n) * static_cast<big_number>(n);
    }
  }
  return res;
}

int main()
{
  auto data = std::vector<int>(MAX);
  std::iota(data.begin(), data.end(), 1);

  std::cout << "There are " << data.size() <<  " numbers." << std::endl;
  std::cout << "The first is " << data[0] << std::endl;
  std::cout << "The last is " << data[1000000-1] << std::endl;

  auto res = sum_for(data);

  std::cout << "The result is: " << res << std::endl;

  return 1;
};

Compilation

This will be compiled using clang++, with modern C++ features enabled (-std=c++1z) - this is needed for the auto, and the for (auto n:v) loop, with optimization (-O2) and the result will be stored in the test binary file.

clang++ -std=c++1z -O2 -o test test.cpp

Let’s Run It

So now let’s see how fast this can be:

time ./test
There are 1000000 numbers.
The first is 1
The last is 1000000
The result is: 166667166667000000

real    0m0.003s
user    0m0.000s
sys     0m0.000s

Yes, really. That code run in 3ms. It used 6MB. This time includes starting the program, creating the 1M of integers in a container, iterating, filtering, squaring, summing.

Final Thoughts

The C++ implementation uses almost 8 times less memory, and runs 23.333 times faster. Yes, that’s 23 thousand.

Yes, there is numpy which could be used, however people write about different results of np.power(n, 2) and a*a or a**2.

The comments are disabled. If you want to write something to me, you can use e.g. Twitter.