【Python】NumPyで相関係数算出と速度パフォーマンス

Pythonで２つのデータ群の相関係数を求めるプログラムを作成した。NumPyで提供される関数の使用度合いを変化させて実行速度への影響を調べた。

実装
計測方法

実装

NumPyまたPythonの標準ライブラリであるmathを用いる。

import math
import numpy as np

各実装は関数corrcoefXで表記し、2つのデータ群x, yを引数に取り、相関係数を返り値とする。

また、以下のプログラムで記述している変数sxx, syy, sxyは必ずしも標準偏差と共分散に対応していない。実装方法によって値が異なることに注意されたし。

０．オリジナル

NumPyの関数をほとんど用いず、各要素の操作はmap関数で展開した逐次処理で実装している。この状態から一部をNumPyの関数に書き換える等して、実行速度への影響を確認する。

def corrcoef0(x, y):
    avg_x = np.mean(x)
    avg_y = np.mean(y)
    sxx = sum(list(map(lambda x :x ** 2 - avg_x ** 2, x)))
    syy = sum(list(map(lambda y: y ** 2 - avg_y ** 2, y)))
    sxy = sum(list(map(lambda xi, yi: (xi - avg_x) * ( yi - avg_y), x, y)))
    rxy = sxy / math.sqrt(sxx * syy)
    return rxy

１．リスト生成処理を省略

corrcoef0ではmapオブジェクトをリストに変換してから総和を算出していたが、sum関数はmapオブジェクトに適用できるのでリスト生成過程を省略する。

def corrcoef1(x, y):
    avg_x = np.mean(x)
    avg_y = np.mean(y)
    sxx = sum(map(lambda xi: xi ** 2- avg_x ** 2, x))
    syy = sum(map(lambda yi: yi ** 2- avg_y ** 2, y))
    sxy = sum(map(lambda xi, yi: (xi - avg_x) * (yi - avg_y), x, y))
    rxy = sxy / math.sqrt(sxx * syy)
    return rxy

２．map関数によるイテレータ不使用

Numpy配列は行列演算に適しており、各要素への同一処理は纏めて行った方が良い。

arr = np.arange(5)  # array([0, 1, 2, 3, 4])
# 定数との演算は各要素に適用される
arr_inc = arr + 1  # array([1, 2, 3, 4, 5])
arr_sq = arr ** 2  # array([0, 1, 4, 9, 16])
# 配列同士の演算は行列演算に基づく
res = arr_inc - arr  # array([1, 1, 1, 1, 1])

sxx, syy, sxyの算出方法を置換した。

def corrcoef3(x, y):
    avg_x = np.mean(x)
    avg_y = np.mean(y)
    sxx = np.sum(x ** 2 - avg_x ** 2)
    syy = np.sum(y ** 2 - avg_y ** 2)
    sxy = np.sum((x - avg_x) * (y - avg_y))
    rxy = sxy / np.sqrt(sxx * syy)
    return rxy

３．math.sqrtをnp.sqrtに置換

平方根の算出に用いたmath.sqrtをnp.sqrtに書き換えた。

def corrcoef3(x, y):
    avg_x = np.mean(x)
    avg_y = np.mean(y)
    sxx = np.sum(x ** 2 - avg_x ** 2)
    syy = np.sum(y ** 2 - avg_y ** 2)
    sxy = np.sum((x - avg_x) * (y - avg_y))
    rxy = sxy / np.sqrt(sxx * syy)
    return rxy

４．標準偏差の算出にnp.stdを使用

上と異なり、平均を算出・平方根の適用を行っているので変数が示す意味が異なる。

def corrcoef4(x, y):
    avg_x = np.mean(x)
    avg_y = np.mean(y)
    sxx = np.std(x)
    syy = np.std(y)
    sxy = np.mean((x - avg_x) * (y - avg_y))
    rxy = sxy / (sxx * syy)
    return rxy

５．分散/共分散の算出にnp.covを使用

np.covの返り値は行列（Numpy配列）であり、対角成分からデータ群の分散、その他成分から共分散を取得できる。

def corrcoef5(x, y):
    s = np.cov(x, y)
    sxx = s[0, 0]
    syy = s[1, 1]
    sxy = s[0, 1]
    rxy = sxy / math.sqrt(sxx * syy)
    return rxy

６．相関係数の算出にnp.corrcoefを使用

np.corrcoefの返り値は行列（Numpy配列）であり、対角成分以外から相関係数を取得できる。

def corrcoef6(x, y):
    rxy = np.corrcoef(x, y)
    return rxy[0, 1]

計測方法

各実装であるcorrcoefX関数を100回実行し、その実行時間を計測する。時間計測にはtimeitモジュールを用いた。

また、データ数によるパフォーマンスの変化を調べるため、データ数10から10,000まで10刻みでデータ群を生成し計測した。

テストプログラムを以下に示す。

import timeit

LOOP = 100
MAX_NUM = 10_000

if __name__ == "__main__":
    nx = np.arange(10, MAX_NUM + 1, 10)
    xs = [np.random.rand(n) for n in nx]
    ys = [np.random.rand(n) for n in nx]

    ty0 = np.array([timeit.timeit(lambda: corrcoef0(x, y), number=LOOP) for x, y in zip(xs, ys)])
    ty1 = np.array([timeit.timeit(lambda: corrcoef1(x, y), number=LOOP) for x, y in zip(xs, ys)])
    ty2 = np.array([timeit.timeit(lambda: corrcoef2(x, y), number=LOOP) for x, y in zip(xs, ys)])
    ty3 = np.array([timeit.timeit(lambda: corrcoef3(x, y), number=LOOP) for x, y in zip(xs, ys)])
    ty4 = np.array([timeit.timeit(lambda: corrcoef4(x, y), number=LOOP) for x, y in zip(xs, ys)])
    ty5 = np.array([timeit.timeit(lambda: corrcoef5(x, y), number=LOOP) for x, y in zip(xs, ys)])
    ty6 = np.array([timeit.timeit(lambda: corrcoef6(x, y), number=LOOP) for x, y in zip(xs, ys)])