
Grouping data by specific conditions and finding durations in R or Python

Python

陪伴而非守候 2022-08-11 20:18:12

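I have a dataset, df, that looks like this:

Subject  Recipient                            Length  Folder  Message  Date                 Edit
                                              80      out              1/2/2020 1:00:01 AM  T
                                              80      out              1/2/2020 1:00:05 AM  T
hey      [email protected],[email protected]  80      out              1/2/2020 1:00:10 AM  T
hey      [email protected],[email protected]  80      out              1/2/2020 1:00:15 AM  T
hey      [email protected],[email protected]  80      out              1/2/2020 1:00:30 AM  T
some     k                                    900     in      jjjjj    1/2/2020 1:00:35 AM  F
some     k                                    900     in      jjjjj    1/2/2020 1:00:36 AM  F
some     k                                    900     in      jjjjj    1/2/2020 1:00:37 AM  F
hey      [email protected],[email protected]  80      draft            1/2/2020 1:02:00 AM  T
hey      [email protected],[email protected]  80      draft            1/2/2020 1:02:05 AM  T
no       a                                    900     in      iii      1/2/2020 1:02:10 AM  F
no       a                                    900     in      iii      1/2/2020 1:02:15 AM  F
no       a                                    900     in      iii      1/2/2020 1:02:20 AM  F
no       a                                    900     in      iii      1/2/2020 1:02:25 AM  F

The dataset records when a user edits a message, leaves it, and later comes back to it. I am trying to capture the total time spent on the message at hand. I know I first have to group the rows. I want to group them on the following conditions: Folder == "out" or "draft", Message == "", and Edit == "T"; Length should also be the same across consecutive rows.

Once I have these groups, I want to find the duration of each one (start and end). For example, the first group lasts 29 seconds, because it starts at 1/2/2020 1:00:01 AM and ends at 1/2/2020 1:00:30 AM. The second group starts at 1/2/2020 1:02:00 AM and ends at 1:02:05 AM. Finally, the third group starts at 1/2/2020 1:03:00 AM and ends at 1:03:20 AM.

In addition, since these groups all belong to the same message, I want to link them together using this logic: if the Subject, Recipient, and Length in the last row of one group match the Subject, Recipient, and Length in the first row of the next group, then they all belong to the same group.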
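To make the 29-second figure concrete, here is a minimal Python sketch using only the standard library. It assumes the timestamps use a 12-hour clock with an AM/PM marker, and it represents each row as a plain dict. The helper name `qualifies` and the dict keys are illustrative choices, not part of the original dataset, and the Subject and Recipient columns are left out for brevity.

```python
from datetime import datetime

# 12-hour clock with an AM/PM marker, e.g. "1/2/2020 1:00:01 AM".
FMT = "%m/%d/%Y %I:%M:%S %p"


def qualifies(row):
    """Return True when a row meets the editing condition from the question.

    `qualifies` is a hypothetical helper name used only for this sketch.
    """
    return (
        row["folder"] in ("out", "draft")
        and row["message"] == ""
        and row["edit"] == "T"
    )


# The first five rows of the sample data (Subject/Recipient omitted).
rows = [
    {"length": 80, "folder": "out", "message": "", "date": "1/2/2020 1:00:01 AM", "edit": "T"},
    {"length": 80, "folder": "out", "message": "", "date": "1/2/2020 1:00:05 AM", "edit": "T"},
    {"length": 80, "folder": "out", "message": "", "date": "1/2/2020 1:00:10 AM", "edit": "T"},
    {"length": 80, "folder": "out", "message": "", "date": "1/2/2020 1:00:15 AM", "edit": "T"},
    {"length": 80, "folder": "out", "message": "", "date": "1/2/2020 1:00:30 AM", "edit": "T"},
]

# Keep the qualifying rows and measure the span between first and last edit.
times = [datetime.strptime(r["date"], FMT) for r in rows if qualifies(r)]
duration = (max(times) - min(times)).total_seconds()
print(duration)  # 29.0
```

The duration is simply the latest timestamp minus the earliest one within a run of qualifying rows, which is why the first group comes out at 29 seconds.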


1 Answer

POPMUISE


library(dplyr)

df %>%
  # The original data was loaded as factors, which have their uses, but
  # converting those to characters will be simpler to work with here.
  mutate_if(is.factor, as.character) %>%
  # I'm replacing NA in Subject & Recipient with an empty string, and trimming
  # excess spaces from the start and end. One of the recipients is " "
  # but I assume that's functionally the same as blank.
  mutate_at(c("Subject", "Recipient"), ~if_else(is.na(.), "", stringr::str_trim(.))) %>%
  filter(Subject != '') %>%
  # The timestamps use a 12-hour clock, so parse them with %I and %p;
  # %H would silently ignore the AM/PM marker.
  mutate(Date = as.POSIXct(Date, format = '%m/%d/%Y %I:%M:%S %p')) %>%
  # This assumes Edit was read in as a logical (T/F) column.
  mutate(cond = Edit & Folder %in% c('out', 'draft') & Message == '') %>%
  # Every non-matching row bumps the counter, so each unbroken run of
  # matching rows shares one segment id.
  mutate(segment = cumsum(!cond)) %>%
  filter(cond) %>%   # EDIT: Added to keep only the rows matching cond
  # Get summary stats for each segment
  group_by(Subject, Recipient, Length, segment) %>%
  summarize(Start = min(Date),
            End = max(Date),
            Duration = End - Start) %>%
  # This flags the rows where any of these columns doesn't match its
  # predecessor. TRUE = 1, so we get 1 when anything changes, hence `|`.
  # Look at ?lag for more on what those parameters mean. Length uses the
  # default NA because newer dplyr versions require `default` to have the
  # same type as the column; the first row is still flagged via Subject.
  mutate(new_group = as.integer((Subject != lag(Subject, 1, "")) |
                                (Recipient != lag(Recipient, 1, "")) |
                                (Length != lag(Length, 1)))) %>%
  ungroup() %>%
  mutate(group = LETTERS[cumsum(new_group)])

# A tibble: 3 x 9
  Subject Recipient                           Length segment Start               End                 Duration new_group group
1 hey     [email protected],[email protected]     80       0 2020-01-02 01:00:10 2020-01-02 01:00:30 20 secs          1 A
2 hey     [email protected],[email protected]     80       3 2020-01-02 01:02:00 2020-01-02 01:02:05  5 secs          0 A
3 hey     [email protected],[email protected]     80       7 2020-01-02 01:03:00 2020-01-02 01:03:20 20 secs          0 A
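Since the question also asks about Python, the same three steps (number the segments with a running count of non-matching rows, summarize each segment, then link adjacent segments whose Subject, Recipient, and Length match) can be sketched with the standard library alone. This is an illustration under stated assumptions, not a translation of the answer's exact behaviour: it uses only the non-blank-subject rows visible in the question (so the third segment is absent), the helper `make_row` and the dict keys are hypothetical, and segments are linked by comparing each summary with the one before it in time order.

```python
from datetime import datetime
from itertools import groupby
from string import ascii_uppercase

FMT = "%m/%d/%Y %I:%M:%S %p"  # 12-hour clock with AM/PM
R = "[email protected],[email protected]"  # recipient string as it appears in the post


def make_row(subject, recipient, length, folder, message, time, edit):
    """Hypothetical helper: build one row dict with a parsed timestamp."""
    return {"subject": subject, "recipient": recipient, "length": length,
            "folder": folder, "message": message, "edit": edit,
            "date": datetime.strptime("1/2/2020 " + time, FMT)}


rows = [
    make_row("hey", R, 80, "out", "", "1:00:10 AM", "T"),
    make_row("hey", R, 80, "out", "", "1:00:15 AM", "T"),
    make_row("hey", R, 80, "out", "", "1:00:30 AM", "T"),
    make_row("some", "k", 900, "in", "jjjjj", "1:00:35 AM", "F"),
    make_row("some", "k", 900, "in", "jjjjj", "1:00:36 AM", "F"),
    make_row("some", "k", 900, "in", "jjjjj", "1:00:37 AM", "F"),
    make_row("hey", R, 80, "draft", "", "1:02:00 AM", "T"),
    make_row("hey", R, 80, "draft", "", "1:02:05 AM", "T"),
    make_row("no", "a", 900, "in", "iii", "1:02:10 AM", "F"),
]

# Step 1: a running count of non-matching rows serves as the segment id,
# mirroring cumsum(!cond); only the matching rows are kept.
segment = 0
kept = []
for r in rows:
    cond = (r["edit"] == "T" and r["folder"] in ("out", "draft")
            and r["message"] == "")
    if cond:
        kept.append((segment, r))
    else:
        segment += 1


def run_key(pair):
    seg, r = pair
    return (r["subject"], r["recipient"], r["length"], seg)


# Step 2: summarize each run of rows sharing subject, recipient, length
# and segment id.
summaries = []
for key, grp in groupby(kept, key=run_key):
    dates = [r["date"] for _, r in grp]
    summaries.append({"identity": key[:3], "segment": key[3],
                      "start": min(dates), "end": max(dates),
                      "seconds": (max(dates) - min(dates)).total_seconds()})

# Step 3: start a new lettered group whenever the identity columns differ
# from the previous segment; otherwise reuse the current letter.
group_index = -1
previous = None
for s in summaries:
    if s["identity"] != previous:
        group_index += 1
    s["group"] = ascii_uppercase[group_index]
    previous = s["identity"]

for s in summaries:
    print(s["segment"], s["start"], s["end"], s["seconds"], s["group"])
```

On these rows the sketch yields segments 0 and 3, lasting 20 and 5 seconds, both in group A, which agrees with the first two rows of the tibble above.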

2022-08-11
