3 回答

TA貢獻1869條經驗 獲得超4個贊
使用熊貓
主要區別在于,pandas 已將所有數據轉換為正確的
dtype
,(例如datetime
,?int
, 和float
),并且代碼更簡潔。此外,數據現在采用了一種有用的格式來執行時間序列分析和繪圖,但我建議添加列名稱。
df.columns = ['datetime', ..., 'price']
這可以通過 1 行矢量化操作來完成。
如
timeit
測試所示,對于 1M 行數據,使用 pandas 比使用 讀取文件with open
和str
查找:00
.
讀取文件并
pandas.read_csv
解析第 0 列中的日期。使用
header=None
,因為測試數據中沒有提供標題
使用布爾索引選擇秒為 0 的日期
使用
.dt
訪問器獲取.second
.
import pandas as pd
# read the file which apparently has no header and parse the date column
df = pd.read_csv('test.csv', header=None, parse_dates=[0])
# using Boolean indexing to select data when seconds = 00
top_of_the_minute = df[df[0].dt.second == 0]
# save the data
top_of_the_minute.to_csv('clean.csv', header=False, index=False)
# display(top_of_the_minute)
? ? ? ? ? ? ? ? ? ? 0? 1? 2? ? ?3? ? ?4? ?5? ? ? 6? ? ? 7? ? ? ? 8
5 2020-08-03 22:17:00? 0? 0? 4803? 4800? 91? 28.05? 24.05? 58.8917
6 2020-08-03 22:17:00? 0? 0? 4802? 4800? 91? 28.05? 24.05? 58.8925
7 2020-08-03 22:17:00? 0? 0? 4805? 4800? 91? 28.05? 24.05? 58.9341
8 2020-08-03 22:17:00? 0? 0? 4802? 4800? 91? 28.05? 24.05? 58.9683
9 2020-08-03 22:17:00? 0? 0? 4802? 4800? 91? 28.05? 23.05? 58.9780
# example: rename columns
top_of_the_minute.columns = ['datetime', 'v1', 'v2', 'v3', 'v4', 'v5', 'p1', 'p2', 'p3']
# example: plot the data
p = top_of_the_minute.plot('datetime', 'p3')
p.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
p.set_xlim('2020-08', '2020-09')
test.csv
2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917
2020-08-03 22:17:13,0,0,4802,4800,91,28.05,24.05,58.8925
2020-08-03 22:17:14,0,0,4805,4800,91,28.05,24.05,58.9341
2020-08-03 22:17:15,0,0,4802,4800,91,28.05,24.05,58.9683
2020-08-03 22:17:18,0,0,4802,4800,91,28.05,23.05,58.978
2020-08-03 22:17:00,0,0,4803,4800,91,28.05,24.05,58.8917
2020-08-03 22:17:00,0,0,4802,4800,91,28.05,24.05,58.8925
2020-08-03 22:17:00,0,0,4805,4800,91,28.05,24.05,58.9341
2020-08-03 22:17:00,0,0,4802,4800,91,28.05,24.05,58.9683
2020-08-03 22:17:00,0,0,4802,4800,91,28.05,23.05,58.978
%%timeit測試
創建測試數據
# read test.csv
df = pd.read_csv('test.csv', header=None, parse_dates=[0])
# create a dataframe with 1M rows?
df = pd.concat([df] * 100000)
# save the new test data
df.to_csv('test.csv', index=False, header=False)
test_sk
def test_sk(path: str):
? ? zero_entries = []
? ? with open(path, "r") as file:
? ? ? ? for line in file:
? ? ? ? ? ? semi_index = line.index(',')
? ? ? ? ? ? if line[:semi_index].endswith(':00'):
? ? ? ? ? ? ? ? zero_entries.append(line)
? ? return zero_entries
%%timeit
result_sk = test_sk('test.csv')
[out]:
668 ms ± 5.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
test_tm
def test_tm(path: str):
? ? df = pd.read_csv(path, header=None, parse_dates=[0])
? ? return df[df[0].dt.second == 0]
%%timeit
result_tm = test_tm('test.csv')
[out]:
774 ms ± 7.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

TA貢獻1871條經驗 獲得超13個贊
試試這個
finalmasterlist2 = []
for i in range(len(altmasterlist)):
if ":00" in altmasterlist[i][0]:
finalmasterlist2.extend(altmasterlist[i])
print("finalemasterlist_2")
print(finalmasterlist2)
輸入:
2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917
2020-08-03 22:17:13,0,0,4802,4800,91,28.05,24.05,58.8925
2020-08-03 22:17:00,0,0,4805,4800,91,28.05,24.05,58.9341
2020-08-03 22:17:15,0,0,4802,4800,91,28.05,24.05,58.9683
2020-08-03 22:17:18,0,0,4802,4800,91,28.05,23.05,58.978
輸出:
['2020-08-03 22:17:00', '0', '0', '4805', '4800', '91', '28.05', '24.05', '58.9341']

TA貢獻1757條經驗 獲得超7個贊
你說你的文件很大?也許最好在閱讀時拆分數據。
您可以在沒有庫的情況下這樣做:
zero_entries = []
with open(path_to_file, "r") as file:
# iterates over every line
for line in file:
# finds the end if the first cell
timestamp_end = line.index(',')
# checks if the timestamp ends on zero seconds and adds it to a list.
if line[:timestamp_end].endswith(':00'):
zero_entries.append(line)
print(zero_entries)
我假設您的時間戳將始終是該行的第一個元素。
根據您的文件大小,這將比 Trenton 的解決方案快得多(我用 ~58k 行對其進行了測試):
import time
import pandas as pd
path = r"txt.csv"
start = time.time()
zero_entries = []
with open(path, "r") as file:
for line in file:
semi_index = line.index(',')
if line[:semi_index].endswith(':00'):
zero_entries.append(line)
end = time.time()
print(end-start)
start = time.time()
df = pd.read_csv(path, header=None, parse_dates=[0])
# using Boolean indexing to select data when seconds = 00
top_of_the_minute = df[df[0].dt.second == 0]
end = time.time()
print(end-start)
0.04886937141418457 # built-in
0.27971720695495605 # pandas
添加回答
舉報