PySpark: concatenating two columns of data type "struct" --> Error: cannot resolve due to data type mismatch

Python

慕尼黑5688855 2021-10-05 16:14:21

I have a table in PySpark that contains two columns of data type "struct". See the sample DataFrame below:

    word_verb                   word_noun
    {_1=cook, _2=VB}            {_1=chicken, _2=NN}
    {_1=pack, _2=VBN}           {_1=lunch, _2=NN}
    {_1=reconnected, _2=VBN}    {_1=wifi, _2=NN}

I want to concatenate the two columns so that I can run a frequency count on the concatenated verb and noun chunks. I tried the code below:

    df = df.withColumn('word_chunk_final', F.concat(F.col('word_verb'), F.col('word_noun')))

But I get the following error:

    AnalysisException: u"cannot resolve 'concat(`word_verb`, `word_noun`)' due to data type mismatch: input to function concat should have been string, binary or array, but it's [struct<_1:string,_2:string>, struct<_1:string,_2:string>]

My desired output table is shown below. The new concatenated field should have data type string:

    word_verb                   word_noun              word_chunk_final
    {_1=cook, _2=VB}            {_1=chicken, _2=NN}    cook chicken
    {_1=pack, _2=VBN}           {_1=lunch, _2=NN}      pack lunch
    {_1=reconnected, _2=VBN}    {_1=wifi, _2=NN}       reconnected wifi

2 Answers

萬千封印

Contributed 1891 experience points, received more than 3 upvotes

Your code is almost there.

Assuming your schema is as follows:

    df.printSchema()
    #root
    # |-- word_verb: struct (nullable = true)
    # |    |-- _1: string (nullable = true)
    # |    |-- _2: string (nullable = true)
    # |-- word_noun: struct (nullable = true)
    # |    |-- _1: string (nullable = true)
    # |    |-- _2: string (nullable = true)

You just need to access the value of the _1 field in each column:

    import pyspark.sql.functions as F

    df.withColumn(
        "word_chunk_final",
        F.concat_ws(' ', F.col('word_verb')['_1'], F.col('word_noun')['_1'])
    ).show()
    #+-----------------+------------+----------------+
    #|        word_verb|   word_noun|word_chunk_final|
    #+-----------------+------------+----------------+
    #|        [cook,VB]|[chicken,NN]|    cook chicken|
    #|       [pack,VBN]|  [lunch,NN]|      pack lunch|
    #|[reconnected,VBN]|   [wifi,NN]|reconnected wifi|
    #+-----------------+------------+----------------+

Also, you should use concat_ws ("concatenate with separator") rather than concat to join the strings with a space between them. It works much like str.join in Python.
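To make the str.join comparison concrete, here is a minimal plain-Python sketch that needs no Spark session. The helpers concat_like and concat_ws_like are hypothetical names invented for this illustration, not PySpark functions. They model how the two Spark functions treat string inputs: concat returns null as soon as any input is null, while concat_ws skips null inputs. Each struct from the question is modelled as a (_1, _2) tuple, which is an assumption made only for this example.

```python
from collections import Counter
from typing import Optional


def concat_like(*values: Optional[str]) -> Optional[str]:
    """Hypothetical model of concat on strings: any null input gives null."""
    if any(v is None for v in values):
        return None
    return "".join(values)


def concat_ws_like(sep: str, *values: Optional[str]) -> str:
    """Hypothetical model of concat_ws: join the non-null inputs with sep."""
    return sep.join(v for v in values if v is not None)


# Sample rows from the question; each struct is a (_1, _2) tuple here.
rows = [
    (("cook", "VB"), ("chicken", "NN")),
    (("pack", "VBN"), ("lunch", "NN")),
    (("reconnected", "VBN"), ("wifi", "NN")),
]

# Take field _1 (index 0) of each struct and join with a space,
# mirroring F.concat_ws(' ', col['_1'], col['_1']) in the answer.
chunks = [concat_ws_like(" ", verb[0], noun[0]) for verb, noun in rows]
print(chunks)  # ['cook chicken', 'pack lunch', 'reconnected wifi']

# Frequency count of the joined chunks, the asker's stated goal.
chunk_counts = Counter(chunks)
print(chunk_counts.most_common())

# Null handling differs between the two models.
print(concat_like("cook", None))          # None
print(concat_ws_like(" ", "cook", None))  # cook
```

On the actual DataFrame, the frequency count would presumably be a groupBy on word_chunk_final followed by count(); the Counter above only stands in for that step.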

2021-10-05
