What does numpy's vectorize do?

I have a function that removes a set of stop words from a text:

import re

def clean_text(raw_text, stopwords_set):
    # replace everything that is not a letter with a space
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    # lower case + split --> list of words
    words = letters_only.lower().split()
    # now remove the stop words
    meaningful_words = [w for w in words if w not in stopwords_set]
    # join the remaining words together to get the cleaned tweet
    return " ".join(meaningful_words)

I have a dataset of 1.6 million tweets in a pandas DataFrame. If I simply apply this function to the DataFrame like this:

dataframe['clean_text'] = dataframe.apply(
    lambda text: clean_text(text, set(stopwords.words('english'))), 
    axis = 1)

the computation takes about 2 minutes to finish. However, when I use np.vectorize like this:

dataframe['clean_text'] = np.vectorize(clean_text)(
    dataframe['text'], set(stopwords.words('english')))

the computation finishes after about 10 seconds.

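That would not be surprising in itself, if it weren't for the fact that both approaches use only one core on my machine. I assumed that vectorizing would automatically use multiple cores to finish faster and gain speed that way, but it seems to be doing something different.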

What kind of "magic" does numpy's vectorize do?

Source

2017-03-03 Zelphir

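Also, have you read the documentation for 'np.vectorize'? It states: "The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop." – Divakar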

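@Divakar Then how do you explain the speedup? Even knowing that, I can't see how to explain it, so this doesn't help me yet. Please keep it constructive, thanks. – Zelphir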

Could you compare it against a for-loop version? – Divakar

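I wondered how vectorize would handle these inputs. It is designed to take array inputs, broadcast them against each other, and feed every element, as a scalar, to your function. In particular I wondered how it would handle the set.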

With your function, plus an added print(stop_words), I get:

In [98]: words = set('one two three four five'.split()) 
In [99]: f=np.vectorize(clean_text) 
In [100]: f(['this is one line with two words'],words) 
{'five', 'four', 'three', 'one', 'two'} 
{'five', 'four', 'three', 'one', 'two'} 
Out[100]: 
array(['this is line with words'], 
     dtype='<U23')

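The set is displayed twice because vectorize runs a test call to determine the dtype of the return array. But contrary to what I feared, it passes the whole set to the function. That is because wrapping a set in an array just creates a 0-d object array: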

In [101]: np.array(words) 
Out[101]: array({'five', 'four', 'three', 'one', 'two'}, dtype=object)
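The behaviour described above can be checked with a small sketch. The traced_clean_text wrapper and the seen list are illustrative names that are not part of the original post; they only record what np.vectorize hands to the function as its second argument.

```python
import re
import numpy as np

def clean_text(raw_text, stopwords_set):
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in stopwords_set)

seen = []  # hypothetical helper: type of the second argument on each call

def traced_clean_text(raw_text, stopwords_set):
    seen.append(type(stopwords_set))
    return clean_text(raw_text, stopwords_set)

words = set("one two three four five".split())

# Wrapping a set in an array does not unpack it into elements.
wrapped = np.array(words)
print(wrapped.shape, wrapped.dtype)  # 0-d array of dtype object

f = np.vectorize(traced_clean_text)
result = f(["this is one line with two words"], words)
print(result)
print(seen)  # each call, including the dtype test call, received a whole set
```

The seen list can hold more entries than there are texts, because the test call that determines the output dtype is recorded as well.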

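Since we don't want the vectorized function to iterate over the second argument, we really should use the excluded parameter. The speed difference is probably negligible.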

In [104]: f=np.vectorize(clean_text, excluded=[1]) 
In [105]: f(['this is one line with two words'],words)
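A sketch of why excluded matters once the second argument is something numpy can iterate over. The list-valued stop_list is an assumption made for illustration; the post itself passes a set, which happens to survive as a single 0-d object.

```python
import re
import numpy as np

def clean_text(raw_text, stopwords_set):
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in stopwords_set)

text = ["this is one line with two words"]
stop_list = ["one", "two"]  # hypothetical: a list instead of the post's set

# Without excluded, the list is broadcast against text: each call receives
# a single string, and "w not in stopwords_set" turns into a substring test.
broadcast = np.vectorize(clean_text)(text, stop_list)

# With excluded=[1], the second positional argument is passed through whole.
whole = np.vectorize(clean_text, excluded=[1])(text, stop_list)

print(broadcast.shape, broadcast)
print(whole.shape, whole)
```

In the broadcast case the result has one entry per stop word rather than one per text, which is rarely what is wanted.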

But with only one array or data series to iterate over, vectorize is little more than a 1-d iteration or list comprehension:

In [111]: text = ['this is one line with two words'] 
In [112]: [clean_text(t, words) for t in text] 
Out[112]: ['this is line with words']

If I make the text list much longer (10000 items):

In [121]: timeit [clean_text(t, words) for t in text] 
10 loops, best of 3: 98.2 ms per loop 
In [122]: f=np.vectorize(clean_text, excluded=[1]) 
In [123]: timeit f(text,words) 
10 loops, best of 3: 158 ms per loop 
In [124]: f=np.vectorize(clean_text) 
In [125]: timeit f(text,words) 
10 loops, best of 3: 108 ms per loop

excluded actually slows vectorize down (158 ms); without it, the list comprehension and vectorize perform about the same (98.2 ms vs. 108 ms).
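To repeat the comparison outside IPython, something like the following sketch could be used. It assumes the 10000-item list is the same sentence repeated (the post does not show how the list was built) and uses the standard timeit module. Absolute numbers will differ between machines and numpy versions.

```python
import re
import timeit
import numpy as np

def clean_text(raw_text, stopwords_set):
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in stopwords_set)

words = set("one two three four five".split())
# Assumption: the long list is one sentence repeated 10000 times.
text = ["this is one line with two words"] * 10000

f_plain = np.vectorize(clean_text)
f_excluded = np.vectorize(clean_text, excluded=[1])

candidates = {
    "list comprehension": lambda: [clean_text(t, words) for t in text],
    "vectorize": lambda: f_plain(text, words),
    "vectorize, excluded=[1]": lambda: f_excluded(text, words),
}

for name, fn in candidates.items():
    # best of 3 runs, one call per run, reported in milliseconds
    best = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{name}: {best * 1e3:.1f} ms")
```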

So if pandas apply is much slower, it's not because vectorize is magical. It's because apply is slow.
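One difference between the two calls in the question that the timings above do not isolate: in the apply call, set(stopwords.words('english')) sits inside the lambda and is therefore evaluated once per row, whereas in the np.vectorize call it is an ordinary argument and is evaluated once. How much of the 2-minute figure this accounts for was not measured in this thread. The sketch below only shows the difference in evaluation counts; load_stopwords is a hypothetical stand-in for the NLTK call so that the example runs without NLTK.

```python
import re

calls = {"count": 0}

def load_stopwords():
    """Hypothetical stand-in for set(stopwords.words('english'))."""
    calls["count"] += 1
    return {"one", "two", "three", "four", "five"}

def clean_text(raw_text, stopwords_set):
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in stopwords_set)

texts = ["this is one line with two words"] * 1000

# Pattern of the apply call: the set is rebuilt inside the lambda, once per text.
per_call = list(map(lambda t: clean_text(t, load_stopwords()), texts))
rebuilt = calls["count"]

# Pattern of the np.vectorize call: the set is built once and then reused.
calls["count"] = 0
stop = load_stopwords()
hoisted = [clean_text(t, stop) for t in texts]
built_once = calls["count"]

print(rebuilt, built_once)
```

Building the set once before calling apply would make the two timings in the question more directly comparable.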

Source

2017-03-03 17:55:11 hpaulj

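I see. 'apply' is slow and 'vectorize' is "normal", so it looks as if 'vectorize' is speeding things up, when actually it just brings things back to the speed they "should" have (which is a speedup). Thanks for the timings! – Zelphir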
