Does anyone know whether there is a way to get Python's `random.sample()` to work with a generator object? I am trying to take a random sample from a very large text corpus. The problem is that `random.sample()` raises the following error:

    TypeError: object of type 'generator' has no len()

I was thinking that maybe there is some way of doing this with something from `itertools`, but couldn't find anything with a bit of searching.

A somewhat made-up example:

```python
import random

def list_item(ls):
    for item in ls:
        yield item

random.sample(list_item(range(100)), 20)
```

**UPDATE**

As per MartinPieters' request, I did some timing of the three currently proposed methods. The results are as follows:

```
Sampling 1000 from 10000
Using iterSample            0.0163 s
Using sample_from_iterable  0.0098 s
Using iter_sample_fast      0.0148 s

Sampling 10000 from 100000
Using iterSample            0.1786 s
Using sample_from_iterable  0.1320 s
Using iter_sample_fast      0.1576 s

Sampling 100000 from 1000000
Using iterSample            3.2740 s
Using sample_from_iterable  1.9860 s
Using iter_sample_fast      1.4586 s

Sampling 200000 from 1000000
Using iterSample            7.6115 s
Using sample_from_iterable  3.0663 s
Using iter_sample_fast      1.4101 s

Sampling 500000 from 1000000
Using iterSample            39.2595 s
Using sample_from_iterable  4.9994 s
Using iter_sample_fast      1.2178 s

Sampling 2000000 from 5000000
Using iterSample            798.8016 s
Using sample_from_iterable  28.6618 s
Using iter_sample_fast      6.6482 s
```

So it turns out that `array.insert` has a serious drawback when it comes to large sample ssizes. The code I used to time the methods:

```python
from heapq import nlargest
import random
import timeit

def iterSample(iterable, samplesize):
    results = []
    for i, v in enumerate(iterable):
        r = random.randint(0, i)
        if r < samplesize:
            if i < samplesize:
                results.insert(r, v)  # add first samplesize items in random order
            else:
                results[r] = v  # at a decreasing rate, replace random items
    if len(results) < samplesize:
        raise ValueError("Sample larger than population.")
    return results

def sample_from_iterable(iterable, samplesize):
    return (x for _, x in nlargest(samplesize, ((random.random(), x) for x in iterable)))
```

I also ran a test to check that all the methods indeed take an unbiased sample of the generator. So for all methods I sampled 1000 elements from 10000, 100000 times, and computed the average frequency of occurrence of each item in the population, which turned out to be ~.1, as one would expect for all three methods.
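The unbiasedness check described above can be sketched roughly as follows. This is a minimal illustration, not the original benchmark code: it reuses the `sample_from_iterable` method from the post, and the `trials` count and population/sample sizes are scaled down so it runs quickly.

```python
import random
from collections import Counter
from heapq import nlargest

def sample_from_iterable(iterable, samplesize):
    # one-pass sampling: tag each item with a random key and keep the top-k
    return (x for _, x in nlargest(samplesize, ((random.random(), x) for x in iterable)))

# Repeatedly sample 10 items from a 100-item generator and count how often
# each element of the population shows up. For an unbiased sampler, every
# element should appear in roughly samplesize/population = 10% of trials.
random.seed(0)  # fixed seed so the check is reproducible
trials = 2000
counts = Counter()
for _ in range(trials):
    counts.update(sample_from_iterable(iter(range(100)), 10))

freqs = [counts[i] / trials for i in range(100)]
print(min(freqs), max(freqs))  # both should be close to 0.1
```

The mean frequency over all elements is exactly 0.1 by construction (each trial contributes 10 hits over 100 elements); the interesting check is that no individual element's frequency drifts far from 0.1.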