
Text manipulation in a DataFrame: extracting words

Python

躍然一笑 2022-12-20 09:49:10

I want to check which word appears next to each number. For example, my DataFrame has this column:

Recipes
Halve the clementine and place into the cavity along with the bay leaves. Transfer the duck to a medium roasting tray and roast for around 1 hour 20 minutes.
Add the stock, then bring to the boil and reduce to a simmer for around 15 minutes.
2 heaped teaspoons Chinese five-spice
100 ml Marsala
1 litre organic chicken stock

I would like to get a new column containing the extracted values:

New Column
[1 hour, 20 minutes]
15 minutes
2 heaped
100 ml
1 litre

because I need to compare them with a list of values:

to_compare = ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]

and see how many elements each row has in common with it. Thank you for your help.


2 answers

動漫人物

Contributed 1815 experience points, received over 10 upvotes

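We use Series.str.extractall with the pattern "digits, one whitespace character, word". Then we check which of the matches appear in to_compare with Series.isin, and finally we use GroupBy.sum to count how many matches each row has.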

matches = df['Col'].str.extractall(r'(\d+\s\w+)')
df['matches'] = matches[0].isin(to_compare).groupby(level=0).sum()

                                                 Col  matches
0  Halve the clementine and place into the cavity...      2.0
1  Add the stock, then bring to the boil and redu...      1.0
2              2 heaped teaspoons Chinese five-spice      0.0
3                                     100 ml Marsala      1.0
4                      1 litre organic chicken stock      0.0

In addition, matches contains:

                  0
  match
0 0          1 hour
  1      20 minutes
1 0      15 minutes
2 0        2 heaped
3 0          100 ml
4 0         1 litre

To collect them into one list per row, use:

matches.groupby(level=0).agg(list)

0    [1 hour, 20 minutes]
1            [15 minutes]
2              [2 heaped]
3                [100 ml]
4               [1 litre]
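The steps above can be put together as a self-contained sketch. The column name `Col` follows the answer; the shortened recipe texts and the extra last row without any number are assumptions added here for illustration. That last row shows that a row with no match at all is missing from the grouped result, so it would end up as NaN unless it is filled with 0.

```python
import pandas as pd

to_compare = ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]

df = pd.DataFrame({"Col": [
    "roast for around 1 hour 20 minutes.",
    "reduce to a simmer for around 15 minutes.",
    "2 heaped teaspoons Chinese five-spice",
    "100 ml Marsala",
    "1 litre organic chicken stock",
    "a pinch of salt",  # hypothetical row with no number in it
]})

# One row per match, indexed by (original row label, match number)
matches = df["Col"].str.extractall(r"(\d+\s\w+)")

# For each original row, count the matches that also appear in to_compare
counts = matches[0].isin(to_compare).groupby(level=0).sum()

# Rows without any match are absent from counts: align to df and fill with 0
df["matches"] = counts.reindex(df.index).fillna(0).astype(int)

print(df["matches"].tolist())
```

Filling with 0 and casting to int is a design choice of this sketch; keeping NaN instead would distinguish "no number found" from "numbers found, none in to_compare".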

Answered 2022-12-20

慕森卡

Contributed 1806 experience points, received over 8 upvotes

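You can use a regular expression to build a pattern that extracts each number together with the word that follows it, then apply that function to the whole column of the DataFrame.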

import re

import pandas as pd

df = pd.DataFrame({'text': [
    "Halve the clementine and place into the cavity along with the bay leaves. Transfer the duck to a medium roasting tray and roast for around 1 hour 20 minutes.",
    "Add the stock, then bring to the boil and reduce to a simmer for around 15 minutes.",
    "2 heaped teaspoons Chinese five-spice",
    "100 ml Marsala",
    "1 litre organic chicken stock",
]})

def extract_qty(txt):
    # Return every "number word" pair found in txt, e.g. "15 minutes"
    return re.findall(r'\d+ \w+', txt)

df['extracted_qty'] = df['text'].apply(extract_qty)

#                                                 text         extracted_qty
#0  Halve the clementine and place into the cavity...  [1 hour, 20 minutes]
#1  Add the stock, then bring to the boil and redu...          [15 minutes]
#2              2 heaped teaspoons Chinese five-spice            [2 heaped]
#3                                     100 ml Marsala              [100 ml]
#4                      1 litre organic chicken stock             [1 litre]

Then use a list comprehension to keep only the values that also appear in to_compare:

to_compare = ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]

df['common'] = df['extracted_qty'].apply(lambda x: [el for el in x if el in to_compare])

#                         text         extracted_qty                common
#0    Halve the clementine ...  [1 hour, 20 minutes]  [1 hour, 20 minutes]
#1     Add the stock, then ...          [15 minutes]          [15 minutes]
#2      2 heaped teaspoons ...            [2 heaped]                    []
#3              100 ml Marsala              [100 ml]              [100 ml]
#4  1 litre organic chicken...             [1 litre]                    []
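Since the question asks how many elements each row has in common with to_compare, a count can be derived from the `common` column. The following is a minimal sketch of that extra step, not part of the original answer: the names `lookup` and `n_common` and the shortened recipe texts are assumptions made here for illustration.

```python
import re

import pandas as pd

to_compare = ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]
lookup = set(to_compare)  # hypothetical helper: set membership tests are fast

df = pd.DataFrame({"text": [
    "roast for around 1 hour 20 minutes.",
    "reduce to a simmer for around 15 minutes.",
    "2 heaped teaspoons Chinese five-spice",
    "100 ml Marsala",
    "1 litre organic chicken stock",
]})

# Extract every "number word" pair, as in the answer above
df["extracted_qty"] = df["text"].apply(lambda s: re.findall(r"\d+ \w+", s))

# Keep only the extracted values that also appear in to_compare
df["common"] = df["extracted_qty"].apply(lambda xs: [x for x in xs if x in lookup])

# Number of common elements per row
df["n_common"] = df["common"].apply(len)

print(df["n_common"].tolist())
```

Note that the bare "2" in to_compare can never match with this approach, because the pattern always captures a number together with the word that follows it and the comparison is an exact string match.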

Answered 2022-12-20
