
Cleaning stray headers out of a DataFrame with Python/Pandas

Python

慕絲7291255 2022-05-19 14:30:18

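I have a corrupted DataFrame in which the header is repeated at random places inside the data. How can I ignore or remove these rows when loading it? Because of the stray header, pandas raises an error on load. I would like to either skip this row while loading with pandas, or somehow remove it before loading it with pandas. The file looks like this:

col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
col1, col2, col3 <- this is the random copy of the header inside the dataframe
0, 1, 1
0, 0, 0
1, 1, 1

I want:

col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
0, 1, 1
0, 0, 0
1, 1, 1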


2 Answers

白衣染霜花

Contributed 1796 experience points, received over 10 likes

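Pass na_filter=False to read_csv. Because the stray header row puts text into every column, pandas parses the columns as strings; na_filter=False also keeps empty fields as empty strings rather than NaN. Then locate all rows with bad data and filter them out of your DataFrame.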

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv('sample.csv', header=0, na_filter=False)
>>> df
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
3  col1  col2  col3
4     0     1     1
5     0     0     0
6     1     1     1
>>> type(df.iloc[0, 0])
<class 'str'>

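Now that the data in each column has been parsed as strings, look for the values col1, col2, and col3 in df. If a row holds them in every column (that is, none of its columns contains '0' or '1'), flag it in a new column built with np.where(), like so: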

>>> df['Tag'] = np.where(
...     ~df['col1'].isin(['0', '1'])
...     & ~df['col2'].isin(['0', '1'])
...     & ~df['col3'].isin(['0', '1']),
...     'Remove', "Don't remove")
>>> df
   col1  col2  col3           Tag
0     0     1     1  Don't remove
1     0     0     0  Don't remove
2     1     1     1  Don't remove
3  col1  col2  col3        Remove
4     0     1     1  Don't remove
5     0     0     0  Don't remove
6     1     1     1  Don't remove
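The condition above hard-codes the valid values '0' and '1'. If the columns are simply meant to be numeric, one alternative (not part of the original answer) is to let pd.to_numeric flag whatever fails to parse. The following is a minimal sketch under that assumption; the in-memory string `raw` stands in for sample.csv, and the names `numeric` and `clean` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for sample.csv, with the header repeated mid-file.
raw = (
    "col1,col2,col3\n"
    "0,1,1\n0,0,0\n1,1,1\n"
    "col1,col2,col3\n"
    "0,1,1\n0,0,0\n1,1,1\n"
)

# na_filter=False keeps empty fields as '' instead of NaN.
df = pd.read_csv(io.StringIO(raw), header=0, na_filter=False)

# Try to convert every column to numbers; non-numeric cells become NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")

# Keep only the rows in which every column parsed, then restore integers.
clean = numeric.dropna().astype(int).reset_index(drop=True)
print(clean)
```

This drops any row containing a non-numeric cell, so it is only appropriate when every data column is expected to be numeric.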

Now use isin() to filter out the rows marked Remove in the Tag column.

>>> df2 = df[~df['Tag'].isin(['Remove'])]
>>> df2
   col1  col2  col3           Tag
0     0     1     1  Don't remove
1     0     0     0  Don't remove
2     1     1     1  Don't remove
4     0     1     1  Don't remove
5     0     0     0  Don't remove
6     1     1     1  Don't remove

Drop the Tag column:

>>> df2 = df2[['col1', 'col2', 'col3']]
>>> df2
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
4     0     1     1
5     0     0     0
6     1     1     1

Finally, cast your DataFrame to int if you need the values to be integers:

>>> df2 = df2.astype(int)
>>> df2
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
4     0     1     1
5     0     0     0
6     1     1     1
>>> type(df2['col1'][0])  # a NumPy integer type; the exact width is platform dependent

Note: if you want a standard index, use:

>>> df2.reset_index(inplace=True, drop=True)
>>> df2
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
3     0     1     1
4     0     0     0
5     1     1     1
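For reference, the steps above can be condensed into a few lines without the helper Tag column. This is only a sketch of the same rule: the in-memory string `raw` stands in for sample.csv and is an assumption, as is the variable name `remove`.

```python
import io

import pandas as pd

# Hypothetical stand-in for sample.csv, with the header repeated mid-file.
raw = (
    "col1,col2,col3\n"
    "0,1,1\n0,0,0\n1,1,1\n"
    "col1,col2,col3\n"
    "0,1,1\n0,0,0\n1,1,1\n"
)

df = pd.read_csv(io.StringIO(raw), header=0, na_filter=False)

# Same rule as the Tag column: remove a row when none of its
# columns holds '0' or '1'.
remove = ~df.isin(["0", "1"]).any(axis=1)

# Filter, cast to int, and rebuild a standard 0..n-1 index in one chain.
df2 = df[~remove].astype(int).reset_index(drop=True)
print(df2)
```

A boolean mask avoids adding and later dropping a column, but the rule itself is unchanged, so it still only fits data whose valid values are '0' and '1'.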

Answered 2022-05-19

BIG陽

Contributed 1859 experience points, received over 6 likes

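You only need to do the following:

Assuming df_raw is your original DataFrame, with the column headers as its column names and repeated in a few other rows, the corrected DataFrame is df.

# Keep only the rows that do not contain the headers.
headers = df_raw.columns.tolist()
df = df_raw[df_raw[headers[0]] != headers[0]].reset_index(drop=True)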

Assumption:

- We assume that the appearance of the first column's header in a row means that row has to be dropped.
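If that assumption is too loose for your data (for example, when a genuine data row could contain the header text in its first column), a stricter variant is to drop a row only when every cell equals its own column name. The sketch below is an illustration rather than part of the original answer; the sample string `raw` and the name `is_header` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample: the second data line is a repeated header, the third
# is a real data row that merely starts with the text 'col1'.
raw = (
    "col1,col2,col3\n"
    "0,1,1\n"
    "col1,col2,col3\n"
    "col1,5,6\n"
)

df_raw = pd.read_csv(io.StringIO(raw))

# Start with all True and keep a row flagged only while each cell
# matches the name of the column it sits in.
is_header = pd.Series(True, index=df_raw.index)
for col in df_raw.columns:
    is_header &= df_raw[col] == col

df = df_raw[~is_header].reset_index(drop=True)
print(df)
```

Here the first-column check would have discarded the last row as well, while the all-columns check keeps it.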

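In detail

Here is a detailed code block with which anyone can

- create the data,
- write it to a csv file,
- load it as a DataFrame, and then
- drop the rows that are headers.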

import pandas as pd

# Make a csv file to load as a DataFrame
data = '''col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1'''

# Write the data to a csv file
with open('data.csv', 'w') as f:
    f.write(data)

# Load your data with header=None; skipinitialspace=True strips the
# space after each comma, so the names become 'col2' rather than ' col2'
df_raw = pd.read_csv('data.csv', header=None, skipinitialspace=True)

# Declare which row holds the header data:
# assuming the top one, we set this to zero.
header_row_number = 0

# Read in the column headers
headers = df_raw.iloc[header_row_number].tolist()

# Set the new column headers
df_raw.columns = headers

# Keep only the rows that are not headers.
# We assume that the appearance of the first column header
# means that row has to be dropped.
# Then reset the index (and drop the old index column).
# Note: the values are still strings here; use df.astype(int) if needed.
df = df_raw[df_raw[headers[0]] != headers[0]].reset_index(drop=True)
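The question also asks about removing the stray header before pandas ever sees it. One way to sketch that, assuming the repeated line is an exact copy of the first line apart from surrounding whitespace, is to filter the raw text and hand the result to read_csv. The string `text` and the list `kept` are hypothetical names; with a real file, the text would come from open(...).read() instead.

```python
import io

import pandas as pd

# Hypothetical file contents, matching the layout shown in the question.
text = (
    "col1, col2, col3\n"
    "0, 1, 1\n0, 0, 0\n1, 1, 1\n"
    "col1, col2, col3\n"
    "0, 1, 1\n0, 0, 0\n1, 1, 1\n"
)

lines = text.splitlines()
header = lines[0]

# Keep the first header plus every later line that is not a copy of it.
kept = [header] + [ln for ln in lines[1:] if ln.strip() != header.strip()]

# skipinitialspace=True strips the blank after each comma.
df = pd.read_csv(io.StringIO("\n".join(kept)), skipinitialspace=True)
print(df)
```

Because the header copies never reach the parser, the columns are read as integers directly and no later type conversion is needed.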

Answered 2022-05-19
