Hi Quants,

I'll be brief here.

I'd like to introduce you to vectorization in general and in Python.

What is vectorization?

In computer science, vectorization means applying the mathematics of vectors and their operations to whole arrays at once (Python users typically know arrays as Lists, or as NumPy arrays), instead of using “for loops over arrays” to perform calculations element by element. Here is a tiny illustration:
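A minimal sketch of the contrast (the numbers are made up; only the difference between the two styles matters). I use NumPy for the vectorized version, since plain Python lists have no built-in vectorized arithmetic:

import numpy as np

prices = np.array([101.2, 102.5, 99.8, 100.4, 103.1])

# The “for loop over arrays” way: touch one element at a time.
doubled_loop = []
for p in prices:
    doubled_loop.append(p * 2)

# The vectorized way: one operation applied to the whole array at once.
doubled_vec = prices * 2

Both produce the same values; the second one hands the whole array to optimized, compiled code instead of looping in the Python interpreter.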

Why use vectorization?

If you are using a computer from before 1995-ish, then indeed you do not need to care too much about vectorization, because your processor most likely does not support true parallelism (multiple cores) or fast SIMD instructions operating on blocks of memory.

If you are using a computer from after 2000-ish, which I bet you do, then vectorization will take advantage of the incredibly fast SIMD instructions in your processor AND of those cores you are proud to have, giving you true parallelism in the calculations behind your backtests/indicators. You can even push this to your NVIDIA card and run backtests at blazing speed, but I do not want to wander there in this post.

Note that if you use the QC web/cloud, you are using computers from after 2000… 😊 So vectorization is in scope for you.

How to use vectorization?

This is a college-level subject, even a graduate-level one in many aspects, so I will not fully teach it in a forum post. Google is your friend. NumPy and Pandas, which are both included in the QC cloud environment so you can use them, are also very close friends here: they implement vectorization for you. And, as you probably know, self.History() returns a Pandas DataFrame, so you can apply vectorization to its results right away. Compared to using a for loop to iterate over the rows of the DataFrame, you'll gain 20-50-100-200-1000X… A small sketch of what that looks like is just below.
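To make that concrete, here is a minimal sketch. The self.History() call shown in the comment is only an example (the symbol, period and resolution are placeholders), and the standalone snippet fakes a DataFrame with the same 'close' column so it runs outside QC:

import numpy as np
import pandas as pd

# Inside a QCAlgorithm this would be something like:
#     history = self.History(self.Symbol("SPY"), 252, Resolution.Daily)
# Here we fake a DataFrame with a 'close' column so the snippet runs standalone.
history = pd.DataFrame({'close': 100 + np.cumsum(np.random.randn(252))})

# Loop version: walk the closes one by one to build daily returns.
closes = history['close'].values
returns_loop = []
for i in range(1, len(closes)):
    returns_loop.append(closes[i] / closes[i - 1] - 1)

# Vectorized version: one Pandas expression over the whole column.
returns_vec = history['close'].pct_change().dropna()
mean_return = returns_vec.mean()   # also vectorized, no explicit loop

On a few hundred daily bars you won't feel the difference; on minute data across many symbols you very much will.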

You could have a read of this:

and this:

as starters, if you are more interested.

Pandas is incredibly vast and powerful. Even NumPy is far more powerful than most people give it credit for. Keep in mind that vectorization is one of the foundations of data analytics. And if backtesting and algo trading are not data analytics…

SHOW ME CODE! SHOW ME COOOOOOOOOOOOOODE!!!!!!!!!!!!

Alright.

I put together a very easy-to-understand comparison, to put things in perspective and hopefully spark your interest.

Have a look at this:

Timer unit: 1e-06 s

Total time: 226.546 s
File: test20.py
Function: f1 at line 6

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     6                                           @profile
     7                                           def f1():
     8      1000  186045965.8 186046.0     82.1      l1 = random.sample(range(1, 1_000_000), 50_000)
     9      1000       4287.7      4.3      0.0      _sum = 0
    10  50001000   18733153.4      0.4      8.3      for item in l1:
    11  50000000   21757529.3      0.4      9.6          _sum += item
    12      1000       4978.6      5.0      0.0      return _sum / len(l1)

Total time: 165.398 s
File: test20.py
Function: f2 at line 14

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    14                                           @profile
    15                                           def f2():
    16      1000  164171321.4 164171.3     99.3      l1 = random.sample(range(1, 1_000_000), 50_000)
    17      1000    1223668.1   1223.7      0.7      _sum = sum(l1)
    18      1000       3147.0      3.1      0.0      return _sum / len(l1)

Total time: 168.975 s
File: test20.py
Function: f3 at line 20

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    20                                           @profile
    21                                           def f3():
    22      1000  168338808.0 168338.8     99.6      l1 = random.sample(range(1, 1_000_000), 50_000)
    23      1000     632884.5    632.9      0.4      _sum = math.fsum(l1)
    24      1000       3317.4      3.3      0.0      return _sum / len(l1)

Total time: 268.41 s
File: test20.py
Function: f4 at line 26

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    26                                           @profile
    27                                           def f4():
    28      1000  167759822.9 167759.8     62.5      l1 = random.sample(range(1, 1_000_000), 50_000)
    29      1000  100650675.6 100650.7     37.5      return statistics.mean(l1)

Total time: 165.881 s
File: test20.py
Function: f5 at line 31

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    31                                           @profile
    32                                           def f5():
    33      1000  165213090.9 165213.1     99.6      l1 = random.sample(range(1, 1_000_000), 50_000)
    34      1000     668202.6    668.2      0.4      return statistics.fmean(l1)

Total time: 172.987 s
File: test20.py
Function: f6 at line 36

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    36                                           @profile
    37                                           def f6():
    38      1000  169385695.9 169385.7     97.9      l1 = random.sample(range(1, 1_000_000), 50_000)
    39      1000    3595975.9   3596.0      2.1      _sum = numpy.sum(l1)
    40      1000       4905.4      4.9      0.0      return _sum / len(l1)

Total time: 0.380018 s
File: test20.py
Function: f7 at line 42

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    42                                           @profile
    43                                           def f7():
    44      1000     338150.1    338.2     89.0      l1 = numpy.random.randint(1_000_000, size=50_000)
    45      1000      40276.7     40.3     10.6      _sum = numpy.sum(l1)
    46      1000       1591.3      1.6      0.4      return _sum / len(l1)

I compute a mean/average in many ways. Is computing averages something you do in your algos? 😊

f1 : The sum is calculated with a “for loop”. Then the sum is divided by the number of items. Very classic huh?

f2 : I use the built-in sum() Python function.

f3 : I use the math.fsum() function.

f4 : I use statistics.mean() directly.

f5 : I use statistics.fmean().

f6 : I use numpy.sum() to compute the sum, but on a plain Python List (so the list still has to be converted to an array on every call).

f7 : I fully use the vectorization implemented in Numpy: the data is generated directly as a NumPy array (numpy.random.randint) and summed with numpy.sum(), so no Python-level loop is involved at any point.

Time needed to execute each function 1000x:

f1 = 226s

f2 = 165s

f3 = 168s

f4 = 268s

f5 = 165s

f6 = 172s

f7 = ……………………………. drum roll, 0.38s
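For anyone who wants to reproduce this, here is a reconstruction of test20.py based on the profiler output above. The function bodies are exactly what the profiler shows; the imports and the driver loop calling each function 1000 times are the parts I am filling in (the @profile decorators from line_profiler/kernprof are omitted so it runs standalone):

import math
import random
import statistics
import numpy

def f1():
    l1 = random.sample(range(1, 1_000_000), 50_000)
    _sum = 0
    for item in l1:                 # explicit Python for loop
        _sum += item
    return _sum / len(l1)

def f2():
    l1 = random.sample(range(1, 1_000_000), 50_000)
    _sum = sum(l1)                  # built-in sum()
    return _sum / len(l1)

def f3():
    l1 = random.sample(range(1, 1_000_000), 50_000)
    _sum = math.fsum(l1)            # math.fsum()
    return _sum / len(l1)

def f4():
    l1 = random.sample(range(1, 1_000_000), 50_000)
    return statistics.mean(l1)

def f5():
    l1 = random.sample(range(1, 1_000_000), 50_000)
    return statistics.fmean(l1)     # Python 3.8+

def f6():
    l1 = random.sample(range(1, 1_000_000), 50_000)
    _sum = numpy.sum(l1)            # NumPy sum over a plain Python list
    return _sum / len(l1)

def f7():
    l1 = numpy.random.randint(1_000_000, size=50_000)  # data born as a NumPy array
    _sum = numpy.sum(l1)                                # vectorized sum
    return _sum / len(l1)

if __name__ == "__main__":
    for f in (f1, f2, f3, f4, f5, f6, f7):
        for _ in range(1000):
            f()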

Now, what if I told you that using vectorization over Pandas DataFrames gives even bigger gains? There we're talking about operations over vectors, of course, but also over matrices. Pandas (through NumPy) taps specialized SIMD CPU instructions for those matrix operations, and the gains over “for loops” are even bigger. A small multi-symbol sketch is just below.
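A small, made-up sketch of what that looks like for many symbols at once (the symbol names and random prices are just placeholders):

import numpy as np
import pandas as pd

# Fake wide DataFrame of daily closes: 252 days x 100 symbols.
closes = pd.DataFrame(
    np.random.uniform(10, 500, size=(252, 100)),
    columns=[f"SYM{i}" for i in range(100)],
)

# Loop version: nested for loops, symbol by symbol, row by row.
mean_loop = {}
for col in closes.columns:
    total = 0.0
    for value in closes[col]:
        total += value
    mean_loop[col] = total / len(closes)

# Vectorized version: whole-matrix operations, one expression each.
daily_returns = closes.pct_change()   # matrix of returns, all symbols at once
mean_returns = daily_returns.mean()   # one mean per symbol
corr = daily_returns.corr()           # full correlation matrix, no loop in sight

The vectorized lines hand the whole matrix to compiled routines in one go, which is exactly where those gains come from.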

By now, I guess you understand that vectorization means you can crunch a lot more data for a lot more symbols in a lot less time.

Hope you enjoyed reading this.

Fred