2 Answers

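Contributed 1796 experience points, received 10+ likes
Pass na_filter=False when reading the file. Because the repeated header row holds text, every column is then read in as strings. After that, find all the rows that contain bad data and filter them out of your DataFrame.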
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv('sample.csv', header=0, na_filter=False)
>>> df
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
3  col1  col2  col3
4     0     1     1
5     0     0     0
6     1     1     1
>>> type(df.iloc[0,0])
<class 'str'>
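To see why the values come back as text, here is a minimal sketch. It assumes the file contents shown in the output above; io.StringIO stands in for sample.csv so the snippet runs without a file, and the name df_demo is only illustrative. The stray header row is what blocks numeric parsing; na_filter=False only turns off missing-value detection.

```python
import io

import pandas as pd

# Assumed contents of sample.csv: the header line is repeated mid-file.
csv_text = (
    "col1,col2,col3\n"
    "0,1,1\n0,0,0\n1,1,1\n"
    "col1,col2,col3\n"
    "0,1,1\n0,0,0\n1,1,1\n"
)

df_demo = pd.read_csv(io.StringIO(csv_text), header=0, na_filter=False)

# Each column mixes digits with header text, so nothing is parsed as a number.
print(df_demo.dtypes)
print(type(df_demo.iloc[0, 0]))
```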
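Now that every column holds strings, find the rows of df whose values are the stray header text (col1, col2, col3), that is, rows where no column holds '0' or '1', and tag them in a new column with np.where(), as follows: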
>>> bad = ~df['col1'].isin(['0', '1']) & ~df['col2'].isin(['0', '1']) & ~df['col3'].isin(['0', '1'])
>>> df['Tag'] = np.where(bad, 'Remove', "Don't remove")
>>> df
   col1  col2  col3           Tag
0     0     1     1  Don't remove
1     0     0     0  Don't remove
2     1     1     1  Don't remove
3  col1  col2  col3        Remove
4     0     1     1  Don't remove
5     0     0     0  Don't remove
6     1     1     1  Don't remove
Now use isin() to filter out the rows marked Remove in the Tag column.
>>> df2 = df[~df['Tag'].isin(['Remove'])]
>>> df2
  col1 col2 col3           Tag
0    0    1    1  Don't remove
1    0    0    0  Don't remove
2    1    1    1  Don't remove
4    0    1    1  Don't remove
5    0    0    0  Don't remove
6    1    1    1  Don't remove
Drop the Tag column:
>>> df2 = df2[['col1', 'col2', 'col3']]
>>> df2
  col1 col2 col3
0    0    1    1
1    0    0    0
2    1    1    1
4    0    1    1
5    0    0    0
6    1    1    1
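The tag-then-filter steps above can also be collapsed into a single boolean mask, which avoids naming each column and never creates the helper Tag column. This is only a sketch of the same condition (a row is removed when none of its cells is '0' or '1'); df_demo is a hypothetical stand-in for the string-typed df.

```python
import pandas as pd

# Hypothetical stand-in for the string-typed df read from sample.csv.
df_demo = pd.DataFrame({
    "col1": ["0", "0", "1", "col1", "0", "0", "1"],
    "col2": ["1", "0", "1", "col2", "1", "0", "1"],
    "col3": ["1", "0", "1", "col3", "1", "0", "1"],
})

# True wherever a cell holds one of the expected values.
valid = df_demo.isin(["0", "1"])

# Same rule as the Tag step: remove a row only if no cell is '0' or '1'.
to_remove = ~valid.any(axis=1)

clean = df_demo[~to_remove]
print(clean)
```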
Finally, cast your DataFrame to int if you need the values to be integers:
>>> df2 = df2.astype(int)
>>> df2
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
4     0     1     1
5     0     0     0
6     1     1     1
>>> type(df2['col1'][0])
<class 'numpy.int32'>
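A different route, not the one used in this answer, is to let pd.to_numeric find the bad cells: with errors="coerce" anything that is not a number becomes NaN, and dropna() then removes those rows. Note the rule is slightly stricter than the Tag step, since a row is dropped if any cell fails to parse. A rough sketch, with df_demo as a hypothetical stand-in for the string-typed df:

```python
import pandas as pd

# Hypothetical stand-in for the string-typed df read from sample.csv.
df_demo = pd.DataFrame({
    "col1": ["0", "0", "1", "col1", "0", "0", "1"],
    "col2": ["1", "0", "1", "col2", "1", "0", "1"],
    "col3": ["1", "0", "1", "col3", "1", "0", "1"],
})

# Non-numeric cells (such as the repeated header text) become NaN.
numeric = df_demo.apply(pd.to_numeric, errors="coerce")

# Drop every row with at least one unparseable cell, then cast to int
# and renumber the index.
clean = numeric.dropna().astype(int).reset_index(drop=True)
print(clean)
```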
Note: if you want a standard index, use:
>>> df2.reset_index(inplace = True, drop = True)
>>> df2
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
3     0     1     1
4     0     0     0
5     1     1     1

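Contributed 1859 experience points, received 6+ likes
You only need to do the following:
Suppose df_raw is your original DataFrame, with the header text as its column names and the same header repeated in several other rows; the corrected DataFrame is then df.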
# Filter out only the rows without the headers in them.
headers = df_raw.columns.tolist()
df = df_raw[df_raw[headers[0]]!=headers[0]].reset_index(drop=True)
Assumption:
- We assume that any row where the first column header appears has to be dropped.
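If the first-column assumption is too loose for your data, one possible variation is to drop a row only when every cell equals the name of its own column. The sketch below compares the two rules on a small made-up df_raw (hypothetical data, in which row 1 happens to contain 'col1' in its first cell but is otherwise ordinary).

```python
import pandas as pd

# Hypothetical df_raw: row 3 is a repeated header, row 1 only looks like
# one in its first cell.
df_raw = pd.DataFrame({
    "col1": ["0", "col1", "1", "col1"],
    "col2": ["1", "0", "1", "col2"],
    "col3": ["1", "0", "1", "col3"],
})
headers = df_raw.columns.tolist()

# First-column rule from the answer: drops rows 1 and 3.
first_col_only = df_raw[df_raw[headers[0]] != headers[0]]

# Stricter rule: every cell must match its column name; drops only row 3.
is_header = df_raw.eq(headers, axis="columns").all(axis=1)
all_cols = df_raw[~is_header]

print(first_col_only)
print(all_cols)
```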
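In detail
Here is a fuller code block that anyone can run to
- create the data,
- write it to a csv file,
- load it as a DataFrame, and then
- drop the rows that are headers.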
import numpy as np
import pandas as pd
# make a csv file to load as dataframe
data = '''col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1'''
# Write the data to a csv file
with open('data.csv', 'w') as f:
    f.write(data)
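# Load your data with header=None; skipinitialspace=True drops the blank
# after each comma so header names and values carry no leading space
df_raw = pd.read_csv('data.csv', header=None, skipinitialspace=True)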
# Declare which row to find the header data:
# assuming the top one, we set this to zero.
header_row_number = 0
# Read in columns headers
headers = df_raw.iloc[header_row_number].tolist()
# Set new column headers
df_raw.columns = headers
# Filter out only the rows without the headers in them
# We assume that the appearance of the
# first column header means that row has to be dropped
# And reset index (and drop the old index column)
df = df_raw[df_raw[headers[0]]!=headers[0]].reset_index(drop=True)
df
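One thing to keep in mind: after this filter the remaining values are still strings, because the column was read together with the header text. A minimal follow-up sketch, assuming the same data as above (io.StringIO replaces the csv file here, and header=0 is used so the first line becomes the column names directly), converts the cleaned frame to numbers with pd.to_numeric.

```python
import io

import pandas as pd

data = """col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1"""

# skipinitialspace=True strips the blank that follows each comma.
df_raw = pd.read_csv(io.StringIO(data), header=0, skipinitialspace=True)

# Drop the rows whose first cell repeats the first column header.
first = df_raw.columns[0]
df = df_raw[df_raw[first] != first].reset_index(drop=True)

# The survivors are still text; convert each column to a numeric dtype.
df = df.apply(pd.to_numeric)
print(df.dtypes)
```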