How do I load a data text file containing many comment lines into pandas?

Python

皈依舞 2023-09-26 15:09:56

I'm trying to read a delimited text file into a dataframe in Python. The delimiter is not recognized when I use pd.read_table. If I explicitly set sep = ' ', I get an error: Error tokenizing data. C error. Notably, the defaults work when I use np.loadtxt().

Example:

pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt', comment = '%', header = None)

   0
0  1850 1 -0.777 0.412 NaN NaN...
1  1850 2 -0.239 0.458 NaN NaN...
2  1850 3 -0.426 0.447 NaN NaN...
3  1850 4 -0.680 0.367 NaN NaN...
4  1850 5 -0.687 0.298 NaN NaN...

If I set sep = ' ', I get a different error:

pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt', comment = '%', header = None, sep = ' ')

ParserError: Error tokenizing data. C error: Expected 2 fields in line 78, saw 58

When I looked up this error, people suggested using header = None (already done) and setting sep = explicitly, but that is exactly what causes the problem: Python Pandas Error tokenizing data. I looked at line 78 and could not see anything wrong with it. If I set error_bad_lines=False, I get an empty df, which suggests every entry has a problem.

Notably, this works when I use np.loadtxt():

pd.DataFrame(np.loadtxt('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt', comments = '%'))

        0    1      2      3   4   5   6   7   8   9  10  11
0  1850.0  1.0 -0.777  0.412 NaN NaN NaN NaN NaN NaN NaN NaN
1  1850.0  2.0 -0.239  0.458 NaN NaN NaN NaN NaN NaN NaN NaN
2  1850.0  3.0 -0.426  0.447 NaN NaN NaN NaN NaN NaN NaN NaN
3  1850.0  4.0 -0.680  0.367 NaN NaN NaN NaN NaN NaN NaN NaN
4  1850.0  5.0 -0.687  0.298 NaN NaN NaN NaN NaN NaN NaN NaN

This suggests to me that there is nothing wrong with the file, and that the problem lies in how I am calling pd.read_table(). I looked through the np.loadtxt() documentation hoping to set sep to the same value, but it only shows delimiter=None (https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html).

I would like to import this as a pd.DataFrame and set the names, rather than importing it as a matrix and then converting it to a pd.DataFrame. What am I doing wrong?

2 answers

慕娘9325324

Contributed 1783 experience points, received 5+ likes

This one is fairly tricky. Try the snippet below:

import pandas as pd

url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'

# sep=r'\s+' treats any run of whitespace as one delimiter;
# comment='%' drops the commented header lines
df = pd.read_csv(url,
                 sep=r'\s+',
                 comment='%',
                 usecols=range(12),  # keep all 12 data columns
                 names=('Year', 'Month', 'M.Anomaly', 'M.Unc.', 'A.Anomaly',
                        'A.Unc.', '5y.Anomaly', '5y.Unc.', '10y.Anomaly', '10y.Unc.',
                        '20y.Anomaly', '20y.Unc.'))
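
To illustrate why a whitespace regex for sep works where a single space does not, here is a minimal sketch on a made-up sample. The sample text and its reduced four-column layout are assumptions for illustration only, not the real file.

```python
import io
import pandas as pd

# Made-up sample imitating the file's layout: '%' comment lines followed by
# rows whose columns are aligned with a varying number of spaces.
sample = """\
% Global Average Temperature Anomaly (illustrative sample)
% Year, Month, Anomaly, Unc.
  1850     1    -0.777  0.412
  1850     2    -0.239  0.458
"""

# sep=r'\s+' collapses each run of whitespace into a single delimiter,
# and comment='%' discards the commented lines.
df = pd.read_csv(io.StringIO(sample), sep=r'\s+', comment='%', header=None,
                 names=['Year', 'Month', 'Anomaly', 'Unc.'])
print(df)

# With a single space as the separator, every extra space yields an empty
# field, so one row splits into far more pieces than it has columns.
fields = '  1850     1    -0.777  0.412'.split(' ')
print(len(fields))
```

This is likely the reason the sep = ' ' call in the question reports many more fields than expected.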

Answered 2023-09-26

料青山看我應如是

Contributed 1772 experience points, received 8+ likes

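The problem is that the file has 77 lines of commented text for 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Air Temperatures'.

Two of those lines are headers.

There is a block of data, then two more header lines, and then a new set of data for 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Water Temperatures'.
This solution splits the two tables in the file into separate dataframes.
It is not as neat as the other answer, but the data is correctly separated into different dataframes.
The headers were a pain; it would probably be easier to create a custom header by hand and skip the lines of code that separate the headers from the text.
The important point is separating the air data from the ice data.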

import requests
import pandas as pd

# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text

# convert data into a list of lines, dropping the '% ' comment marker
data = [d.strip().replace('% ', '') for d in data.split('\n')]

# specify the data from the ranges in the file
air_header1 = data[74].split()  # first header row
air_header2 = [v.strip() for v in data[75].split(',')]  # second header row

# combine the 2 parts of the header into a single header
air_header = air_header2[:2] + [f'{air_header1[i // 2]}_{v}' for i, v in enumerate(air_header2[2:])]
air_data = [v.split() for v in data[77:2125]]

h2o_header1 = data[2129].split()  # first header row
h2o_header2 = [v.strip() for v in data[2130].split(',')]  # second header row

# combine the 2 parts of the header into a single header
h2o_header = h2o_header2[:2] + [f'{h2o_header1[i // 2]}_{v}' for i, v in enumerate(h2o_header2[2:])]
h2o_data = [v.split() for v in data[2132:4180]]

# create the dataframes
air = pd.DataFrame(air_data, columns=air_header)
h2o = pd.DataFrame(h2o_data, columns=h2o_header)
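
The slice indices above are tied to the file's layout at the time of writing. As a possible alternative, the sketch below splits the text wherever a comment or blank line interrupts a run of data lines. The helper split_tables and the sample text are hypothetical, and the sketch assumes each table is one unbroken run of data lines.

```python
import pandas as pd

def split_tables(text, comment='%'):
    """Group consecutive data lines into tables; a comment or blank line ends a group."""
    tables, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(comment):
            current.append(stripped.split())
        elif current:
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables

# Made-up sample with two commented headers and two data blocks.
sample = """\
% Table one (made-up header)
% Year, Month, Anomaly, Unc.
  1850     1    -0.777  0.412
  1850     2    -0.239  0.458
%
% Table two (made-up header)
% Year, Month, Anomaly, Unc.
  1850     1    -0.724  0.370
  1850     2    -0.221  0.430
"""

names = ['Year', 'Month', 'Anomaly', 'Unc.']
air, h2o = (pd.DataFrame(rows, columns=names) for rows in split_tables(sample))
print(air)
print(h2o)
```

With the real file, the same function could be applied to response.text, provided the file really contains exactly two such runs.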

Without the header code

Simplify the code by using a manually written header list.

import pandas as pd
import requests

# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text

# convert data into a list
data = [d.strip().replace('% ', '') for d in data.split('\n')]

# manually created header
headers = ['Year', 'Month', 'Monthly_Anomaly', 'Monthly_Unc.',
           'Annual_Anomaly', 'Annual_Unc.',
           'Five-year_Anomaly', 'Five-year_Unc.',
           'Ten-year_Anomaly', 'Ten-year_Unc.',
           'Twenty-year_Anomaly', 'Twenty-year_Unc.']

# separate the air and h2o data
air_data = [v.split() for v in data[77:2125]]
h2o_data = [v.split() for v in data[2132:4180]]

# create the dataframes
air = pd.DataFrame(air_data, columns=headers)
h2o = pd.DataFrame(h2o_data, columns=headers)

air

   Year  Month  Monthly_Anomaly  Monthly_Unc.  Annual_Anomaly  Annual_Unc.  Five-year_Anomaly  Five-year_Unc.  Ten-year_Anomaly  Ten-year_Unc.  Twenty-year_Anomaly  Twenty-year_Unc.
0  1850      1           -0.777         0.412             NaN          NaN                NaN             NaN               NaN            NaN                  NaN               NaN
1  1850      2           -0.239         0.458             NaN          NaN                NaN             NaN               NaN            NaN                  NaN               NaN
2  1850      3           -0.426         0.447             NaN          NaN                NaN             NaN               NaN            NaN                  NaN               NaN

h2o

   Year  Month  Monthly_Anomaly  Monthly_Unc.  Annual_Anomaly  Annual_Unc.  Five-year_Anomaly  Five-year_Unc.  Ten-year_Anomaly  Ten-year_Unc.  Twenty-year_Anomaly  Twenty-year_Unc.
0  1850      1           -0.724         0.370             NaN          NaN                NaN             NaN               NaN            NaN                  NaN               NaN
1  1850      2           -0.221         0.430             NaN          NaN                NaN             NaN               NaN            NaN                  NaN               NaN
2  1850      3           -0.443         0.419             NaN          NaN                NaN             NaN               NaN            NaN                  NaN               NaN
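
One thing to keep in mind: because the rows are built with str.split(), every value in these dataframes is a string, including 'NaN'. A minimal sketch of converting them to numeric dtypes follows; the two sample rows and the shortened column list are illustrative assumptions.

```python
import pandas as pd

# Rows as produced by str.split(): every value is a string.
rows = [['1850', '1', '-0.777', '0.412', 'NaN'],
        ['1850', '2', '-0.239', '0.458', 'NaN']]
names = ['Year', 'Month', 'Monthly_Anomaly', 'Monthly_Unc.', 'Annual_Anomaly']
air = pd.DataFrame(rows, columns=names)
print(air.dtypes)  # all columns have object dtype at this point

# Convert everything to float first ('NaN' becomes a real missing value),
# then cast the two calendar columns back to integers.
air = air.astype(float).astype({'Year': 'int64', 'Month': 'int64'})
print(air.dtypes)
```

After the conversion, numeric operations such as air['Monthly_Anomaly'].mean() behave as expected.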

Answered 2023-09-26
