4 Answers

Contributor with 2016 experience points, 9+ upvotes
It looks like you want the Jaccard distance between each pair of strings. Here is one way to do it, using groupby and scipy.spatial.distance.jaccard:
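from scipy.spatial.distance import jaccard

# Group rows by the first character of `name` ('a', 'b', 'c'),
# so that each group holds one pair of sequences.
g = df.groupby(df.name.str[0])

# For each pair, emit NaN for the first row and the distance for the second.
df['diff'] = [sim for _, seqs in g.seq for sim in
              [float('nan'), jaccard(*map(list, seqs))]]
print(df)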
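  name  seq      diff
1   a1  bbb       NaN
2   a2  bbc  0.333333
3   b1  fff       NaN
4   b2  fff  0.000000
5   c1  aaa       NaN
6   c2  acg  0.666667
The result is a proportion between 0 and 1 rather than a count of differing characters.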
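
A Jaccard distance is a proportion between 0 and 1, not a count. For equal-length strings compared position by position, a proportion of that kind can be sketched in plain Python as below. This is only an illustration: `mismatch_fraction` is a hypothetical helper, not part of SciPy, and it assumes both strings have the same length.

```python
def mismatch_fraction(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    if not a:
        return 0.0  # two empty strings have nothing that differs
    return sum(x != y for x, y in zip(a, b)) / len(a)


pairs = [("bbb", "bbc"), ("fff", "fff"), ("aaa", "acg")]
for first, second in pairs:
    frac = mismatch_fraction(first, second)
    # Multiplying by the length turns the proportion back into a count.
    print(first, second, round(frac, 6), round(frac * len(first)))
```

If a count of differing characters is what you need, multiply the proportion by the string length, or use one of the counting approaches in the other answers.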

Contributor with 1951 experience points, 3+ upvotes
An alternative using the Levenshtein distance:
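import Levenshtein

# Group key: first character of `name` ('a', 'b', 'c').
s = df['name'].str[0]

# Compute the distance between the first and last seq of each group,
# then map it onto the last row of each group only (other rows stay NaN).
out = df.assign(Diff=s.drop_duplicates(keep='last').map(
    df.groupby(s)['seq']
      .apply(lambda x: Levenshtein.distance(x.iloc[0], x.iloc[-1]))))
print(out)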
  name  seq  Diff
1   a1  bbb   NaN
2   a2  bbc   1.0
3   b1  fff   NaN
4   b2  fff   0.0
5   c1  aaa   NaN
6   c2  acg   2.0
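
The Levenshtein package is a third-party dependency. If installing it is not an option, the same distance can be sketched in plain Python with the standard dynamic-programming recurrence. The function name `levenshtein` below is illustrative, and this version will likely be slower than the compiled library on long strings.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    # prev[j] is the distance between the already processed prefix of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


for first, second in [("bbb", "bbc"), ("fff", "fff"), ("aaa", "acg")]:
    print(first, second, levenshtein(first, second))
```

Unlike a position-by-position comparison, this also handles strings of different lengths.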

Contributor with 1865 experience points, 7+ upvotes
As a first step, I recreated your data as follows:
#!/usr/bin/env python3
import pandas as pd
# Setup
data = {'name': {1: 'a1', 2: 'a2', 3: 'b1', 4: 'b2', 5: 'c1', 6: 'c2'}, 'seq': {1: 'bbb', 2: 'bbc', 3: 'fff', 4: 'fff', 5: 'aaa', 6: 'acg'}}
df = pd.DataFrame(data)
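Solution
You can iterate over the DataFrame and compare the seq value from the previous iteration with the value in the current one. To compare the two strings (stored in the seq column of the DataFrame), a simple generator expression is enough, as in this function: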
def diff_letters(a, b):
    # Count the positions where the two strings differ (assumes len(b) >= len(a)).
    return sum(a[i] != b[i] for i in range(len(a)))
Iterating over the DataFrame rows
diff = ['NA']
row_iterator = df.iterrows()
_, last = next(row_iterator)
# Iterate over the df and populate a list with the comparison results
for i, row in row_iterator:
    if i % 2 == 0:
        # even index labels are the second row of each pair
        diff.append(diff_letters(last['seq'], row['seq']))
    else:
        # for odd index labels append an NA value
        diff.append("NA")
    last = row
df['diff'] = diff
The result looks like this:
  name  seq diff
1   a1  bbb   NA
2   a2  bbc    1
3   b1  fff   NA
4   b2  fff    0
5   c1  aaa   NA
6   c2  acg    2
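
Pairing rows by the parity of the index label only works while the index stays in that exact order. As a hedged alternative, the previous seq can be looked up within each group of names instead. The sketch below assumes that the first character of name identifies a pair and that each pair has exactly two rows; it uses a zip-based variant of diff_letters and real NaN values rather than the string "NA".

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["a1", "a2", "b1", "b2", "c1", "c2"],
     "seq": ["bbb", "bbc", "fff", "fff", "aaa", "acg"]},
    index=range(1, 7),
)


def diff_letters(a, b):
    # Count positions where the two equal-length strings differ.
    return sum(x != y for x, y in zip(a, b))


# Previous seq within each group keyed by the first character of `name`;
# the first row of every group has no predecessor and becomes missing.
prev = df.groupby(df["name"].str[0])["seq"].shift()

df["diff"] = [
    float("nan") if pd.isna(p) else diff_letters(p, s)
    for p, s in zip(prev, df["seq"])
]
print(df)
```

Because the missing entries are NaN, the diff column stays numeric and can be used in later arithmetic or filtering.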

Contributor with 1801 experience points, 16+ upvotes
Check this out:
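import numpy as np
import pandas as pd

data = {'name': ['a1', 'a2', 'b1', 'b2', 'c1', 'c2'],
        'seq': ['bbb', 'bbc', 'fff', 'fff', 'aaa', 'acg']
        }
df = pd.DataFrame(data, columns=['name', 'seq'])
df['diff'] = np.nan
i = 0
# Walk through the rows two at a time: row i is the first of a pair,
# row i+1 is the second.
while i < len(df) - 1:
    item = df.at[i, 'seq']
    df.at[i, 'diff'] = np.nan
    diffCntr = 0
    # Count characters of the second string that do not occur anywhere
    # in the first string (a membership test, not a positional comparison).
    for j in df.at[i + 1, 'seq']:
        if item.find(j) < 0:
            diffCntr += 1
    df.at[i + 1, 'diff'] = diffCntr
    i += 2
print(df)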
The result looks like this:
  name  seq  diff
0   a1  bbb   NaN
1   a2  bbc   1.0
2   b1  fff   NaN
3   b2  fff   0.0
4   c1  aaa   NaN
5   c2  acg   2.0
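
One caveat about this approach: str.find checks whether a character occurs anywhere in the first string, so it is not a position-by-position comparison. The two measures agree on the sample data but can diverge on other inputs. The sketch below illustrates the difference; both helper names are hypothetical.

```python
def chars_missing_from(first: str, second: str) -> int:
    """Count characters of `second` that occur nowhere in `first`."""
    return sum(first.find(ch) < 0 for ch in second)


def positional_diff(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


for first, second in [("aaa", "acg"), ("abc", "cba")]:
    print(first, second,
          chars_missing_from(first, second),
          positional_diff(first, second))
```

For "abc" and "cba" the membership count is 0 while two positions differ, so pick the measure that matches what "difference" means for your sequences.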