
Efficiently return the index of the first value satisfying a condition in an array

Python, performance testing

紫衣仙女 2019-11-20 12:48:42

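I need to find the index of the first value in a 1d NumPy array, or Pandas numeric series, that satisfies a condition. The array is large, and the index may be near the start or the end of the array, or the condition may not be met at all. I can't tell in advance which is more likely. If the condition is not met, the return value should be -1. I've considered a few approaches.

Attempt 1

# func(arr) returns a Boolean array
idx = next(iter(np.where(func(arr))[0]), -1)

But this is often too slow, because func(arr) applies a vectorised function to the entire array rather than stopping when the condition is met. Specifically, it is expensive when the condition is met near the start of the array.

Attempt 2

np.argmax is marginally faster, but cannot identify when the condition is never met:

np.random.seed(0)
arr = np.random.rand(10**7)

assert next(iter(np.where(arr > 0.999999)[0]), -1) == np.argmax(arr > 0.999999)

%timeit next(iter(np.where(arr > 0.999999)[0]), -1)  # 21.2 ms
%timeit np.argmax(arr > 0.999999)                    # 17.7 ms

np.argmax(arr > 1.0) returns 0, even though the condition is not satisfied by any element.

Attempt 3

# func(val) returns a Boolean scalar
idx = next((idx for idx, val in enumerate(arr) if func(val)), -1)

But this is too slow when the condition is met near the end of the array. Presumably this is because the generator expression incurs an expensive overhead from a large number of __next__ calls.

Is this always a trade-off, or is there a way, for a generic func, to extract the first index efficiently?

Benchmarking

For benchmarking, assume func finds the index where a value is greater than a given constant:

# Python 3.6.5, NumPy 1.14.3, Numba 0.38.0
import numpy as np

np.random.seed(0)
arr = np.random.rand(10**7)
m = 0.9
n = 0.999999

# Start of array benchmark
%timeit next(iter(np.where(arr > m)[0]), -1)                       # 43.5 ms
%timeit next((idx for idx, val in enumerate(arr) if val > m), -1)  # 2.5 μs

# End of array benchmark
%timeit next(iter(np.where(arr > n)[0]), -1)                       # 21.4 ms
%timeit next((idx for idx, val in enumerate(arr) if val > n), -1)  # 39.2 ms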
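On Attempt 2, the ambiguous 0 returned by np.argmax can be resolved by checking the mask at the returned position. The following is a minimal sketch; the helper name first_index_argmax is made up for illustration. It only fixes the "never met" case and still evaluates the condition over the whole array, so it does not address the speed concern.

```python
import numpy as np

def first_index_argmax(mask):
    """Return the index of the first True in a 1-D Boolean array, or -1.

    Hypothetical helper: np.argmax returns 0 both when mask[0] is True and
    when no element is True, so the result is verified against the mask.
    """
    if mask.size == 0:
        return -1  # np.argmax raises ValueError on an empty array
    idx = int(np.argmax(mask))
    return idx if mask[idx] else -1

arr = np.array([0.2, 0.5, 0.95, 0.1, 0.99])
print(first_index_argmax(arr > 0.9))  # 2
print(first_index_argmax(arr > 1.0))  # -1
```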

3 Answers

胡說叔叔

Contributed 1804 experience points, received more than 8 likes

numba

With numba it is possible to optimise both scenarios. Syntactically, you only need to write a function with a simple for loop:

from numba import njit

@njit
def get_first_index_nb(A, k):
    for i in range(len(A)):
        if A[i] > k:
            return i
    return -1

idx = get_first_index_nb(A, 0.9)

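Numba improves performance by JIT ("Just In Time") compiling the code and leveraging CPU-level optimisations. A regular for loop without the @njit decorator would typically be slower than the methods you've already tried when the condition is met late.

For a Pandas numeric series df['data'], you can simply feed the NumPy representation to the JIT-compiled function: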

idx = get_first_index_nb(df['data'].values, 0.9)
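Note that the function returns a position in the underlying array, not a Pandas index label. The sketch below, using made-up data, shows how the position could be mapped back to a label. It uses .to_numpy(), the accessor recommended in newer Pandas versions, in place of .values, and falls back to a no-op decorator if Numba is not installed (that fallback is an assumption for portability, not part of the answer).

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Assumption for portability: run as plain Python without Numba.
    def njit(func):
        return func

@njit
def get_first_index_nb(A, k):
    for i in range(len(A)):
        if A[i] > k:
            return i
    return -1

# Hypothetical data: a Series whose labels differ from its positions.
s = pd.Series([0.1, 0.95, 0.3], index=[10, 20, 30])

pos = get_first_index_nb(s.to_numpy(), 0.9)   # positional index
label = s.index[pos] if pos != -1 else None   # corresponding index label
print(pos, label)  # 1 20
```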

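Generalisation

Since numba permits functions as arguments, and assuming the passed function can also be JIT-compiled, you can arrive at a method to calculate the nth index at which an arbitrary condition func is met.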

@njit
def get_nth_index_count(A, func, count):
    c = 0
    for i in range(len(A)):
        if func(A[i]):
            c += 1
            if c == count:
                return i
    return -1

@njit
def func(val):
    return val > 0.9

# get index of 3rd value where func evaluates to True
idx = get_nth_index_count(arr, func, 3)

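For the 3rd last value, you can feed the reversed array, arr[::-1], instead and subtract the result from len(arr) - 1; the - 1 is necessary to account for 0-indexing. A result of -1 from the reversed search still means no match and should be returned unchanged.

Performance benchmarking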
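The index arithmetic for the reversed search can be sketched as follows. The wrapper get_nth_last_index and the sample data are made up for illustration, and the try/except fallback (an assumption, not part of the answer) lets the sketch run without Numba.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for portability: run as plain Python without Numba.
    def njit(func):
        return func

@njit
def get_nth_index_count(A, func, count):
    c = 0
    for i in range(len(A)):
        if func(A[i]):
            c += 1
            if c == count:
                return i
    return -1

@njit
def func(val):
    return val > 0.9

def get_nth_last_index(A, func, count):
    """Hypothetical wrapper: index of the nth-from-last match, or -1."""
    rev = get_nth_index_count(A[::-1], func, count)
    # Position rev in the reversed view is position len(A) - 1 - rev in A.
    return -1 if rev == -1 else len(A) - 1 - rev

arr = np.array([0.95, 0.1, 0.99, 0.2, 0.91, 0.97])  # matches at 0, 2, 4, 5
print(get_nth_last_index(arr, func, 3))  # 2
print(get_nth_last_index(arr, func, 5))  # -1
```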

# Python 3.6.5, NumPy 1.14.3, Numba 0.38.0
import numpy as np
from numba import njit

np.random.seed(0)
arr = np.random.rand(10**7)
m = 0.9
n = 0.999999

@njit
def get_first_index_nb(A, k):
    for i in range(len(A)):
        if A[i] > k:
            return i
    return -1

# Same loop without the @njit decorator, for comparison
def get_first_index_np(A, k):
    for i in range(len(A)):
        if A[i] > k:
            return i
    return -1

# Condition met near the start of the array
%timeit get_first_index_nb(arr, m)                                 # 375 ns
%timeit get_first_index_np(arr, m)                                 # 2.71 μs
%timeit next(iter(np.where(arr > m)[0]), -1)                       # 43.5 ms
%timeit next((idx for idx, val in enumerate(arr) if val > m), -1)  # 2.5 μs

# Condition met near the end of the array
%timeit get_first_index_nb(arr, n)                                 # 204 μs
%timeit get_first_index_np(arr, n)                                 # 44.8 ms
%timeit next(iter(np.where(arr > n)[0]), -1)                       # 21.4 ms
%timeit next((idx for idx, val in enumerate(arr) if val > n), -1)  # 39.2 ms
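Where Numba is not an option, one possible middle ground between the vectorised and generator timings above is to apply the vectorised condition block by block and stop at the first block containing a match. This is a different technique from the answer's JIT-compiled loop and was not benchmarked here; the helper name first_index_chunked and the default chunk_size are assumptions chosen for illustration.

```python
import numpy as np

def first_index_chunked(arr, func, chunk_size=65536):
    """Hypothetical helper: first index where func(arr) is True, or -1.

    func must accept an array and return a Boolean array of the same length.
    Only the blocks up to and including the first match are evaluated.
    """
    for start in range(0, len(arr), chunk_size):
        mask = func(arr[start:start + chunk_size])
        idx = int(np.argmax(mask))
        if mask[idx]:  # argmax returns 0 when nothing matches, so verify
            return start + idx
    return -1

np.random.seed(0)
arr = np.random.rand(10**6)
# Agrees with the full-array approach from the question
assert first_index_chunked(arr, lambda a: a > 0.9) == next(iter(np.where(arr > 0.9)[0]), -1)
print(first_index_chunked(arr, lambda a: a > 2.0))  # -1
```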

Answered 2019-11-20

Smart貓小萌

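Contributed 1911 experience points, received more than 7 likes

I wanted to do something similar and found that the solutions proposed in this question did not really help me. In particular, the numba solution was much slower for me than the more conventional methods presented in the question itself. I have a list times_all, typically on the order of tens of thousands of elements, and want to find the index of the first element of times_all that is greater than a time_event. And I have thousands of time_events.

My solution is to divide times_all into chunks of, for example, 100 elements, first determine which chunk time_event belongs to, keep the index of the first element of that chunk, then find the matching index within that chunk, and add the two indices together. This relies on times_all being sorted in ascending order. Here is minimal code. For me, it runs orders of magnitude faster than the other solutions on this page.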

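import numpy as np

def event_time_2_index(time_event, times_all, STEPS=100):
    # Assumes times_all is sorted in ascending order.
    if len(times_all) == 0:
        return -1
    # Coarse pass: sample every STEPS-th element.
    time_indices_jumps = np.arange(0, len(times_all), STEPS)
    time_list_jumps = [times_all[idx] for idx in time_indices_jumps]
    time_list_jumps_idx = next((idx for idx, val in enumerate(time_list_jumps)
                                if val > time_event), -1)
    if time_list_jumps_idx == -1:
        # No sampled value exceeds time_event: only the last chunk can match.
        time_list_jumps_idx = len(time_indices_jumps)
    # Start of the chunk preceding the first sampled value above time_event.
    index_in_jumps = time_indices_jumps[max(time_list_jumps_idx - 1, 0)]
    times_cropped = times_all[index_in_jumps:]
    # Fine pass: linear search from the start of that chunk.
    event_index_rel = next((idx for idx, val in enumerate(times_cropped)
                            if val > time_event), -1)
    if event_index_rel == -1:
        return -1
    return event_index_rel + index_in_jumps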
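Since the chunked lookup above already relies on times_all being sorted, a binary search is a natural alternative. The sketch below uses np.searchsorted, which is a swapped-in technique rather than this answer's method, and handles all events in a single call; the function name event_times_to_indices and the sample data are made up for illustration.

```python
import numpy as np

def event_times_to_indices(time_events, times_all):
    """Hypothetical helper: for each event, the index of the first element
    of times_all strictly greater than it, or -1 if there is none.

    Assumes times_all is sorted in ascending order.
    """
    times_all = np.asarray(times_all)
    # side='right' returns the first position whose value is strictly greater.
    idx = np.searchsorted(times_all, time_events, side='right')
    # A result equal to len(times_all) means no element is greater.
    return np.where(idx == len(times_all), -1, idx)

times_all = np.array([0.0, 1.0, 2.0, 2.0, 3.5])
events = np.array([-1.0, 1.0, 2.0, 3.5])
print(event_times_to_indices(events, times_all).tolist())  # [0, 2, 4, -1]
```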

Answered 2019-11-20
