
R: Counting consecutive occurrences of values in a single column

R language

慕萊塢森 2019-10-19 15:06:20

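I want to create a sequence number within each run of equal values, i.e. an occurrence counter that restarts at 1 as soon as the value in the current row differs from the value in the previous row. An example of the input and the expected output is below.

dataset <- data.frame(input = c("a","b","b","a","a","c","a","a","a","a","b","c"))
dataset$counter <- c(1,1,2,1,2,1,1,2,3,4,1,1)
dataset
#    input counter
# 1      a       1
# 2      b       1
# 3      b       2
# 4      a       1
# 5      a       2
# 6      c       1
# 7      a       1
# 8      a       2
# 9      a       3
# 10     a       4
# 11     b       1
# 12     c       1

My question is very similar to this one: cumulative sequence of occurrences of values.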
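For reference, the rule described above can be written as a plain loop. This is only a sketch of the specification, not one of the answers; the function name run_counter_loop is hypothetical, and it assumes the input contains no NA values.

```r
# Sketch of the counting rule: restart at 1 whenever the value changes,
# otherwise continue counting. Assumes x has no NA values.
run_counter_loop <- function(x) {
  n <- length(x)
  counter <- integer(n)
  for (i in seq_len(n)) {
    if (i > 1L && x[i] == x[i - 1L]) {
      counter[i] <- counter[i - 1L] + 1L  # same value as previous row
    } else {
      counter[i] <- 1L                    # first row, or value changed
    }
  }
  counter
}

input <- c("a","b","b","a","a","c","a","a","a","a","b","c")
run_counter_loop(input)
```

A loop like this is easy to check against the expected output, but it is slow on long vectors, which is why vectorised approaches are preferable.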


3 Answers

揚帆大魚

Contributed 1799 experience points, received more than 9 likes

You need to use sequence and rle:

> sequence(rle(as.character(dataset$input))$lengths)

[1] 1 1 2 1 2 1 1 2 3 4 1 1
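To see why this works, it can help to look at the intermediate values. The sketch below uses a plain character vector in place of dataset$input. The as.character() call in the one-liner above matters when the column is a factor, because rle() only accepts atomic vectors.

```r
x <- c("a","b","b","a","a","c","a","a","a","a","b","c")

# rle() compresses the vector into runs: one length and one value per run
runs <- rle(x)
runs$lengths  # 1 2 2 1 4 1 1
runs$values   # "a" "b" "a" "c" "a" "b" "c"

# sequence() expands each run length n into 1..n and concatenates the pieces
counter <- sequence(runs$lengths)
counter       # 1 1 2 1 2 1 1 2 3 4 1 1
```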

2019-10-19

不負相思意

Contributed 1777 experience points, received more than 10 likes

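Since data.table v1.9.8 (NEWS item 16), you can use rowid together with rleid. setDT() is needed here because dataset is a data.frame:

library(data.table)
setDT(dataset)[, counter := rowid(rleid(input))]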

Timing code:

set.seed(1L)
library(data.table)
DT <- data.table(input = sample(letters, 1e6, TRUE))
DT1 <- copy(DT)
bench::mark(DT[, counter := seq_len(.N), by = rleid(input)],
            DT1[, counter := rowid(rleid(input))])

Timings:

  expression                                              min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>                                          <bch:t>  <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 DT[, `:=`(counter, seq_len(.N)), by = rleid(input)] 613.8ms  613.8ms      1.63    18.8MB     8.15     1     5      614ms
2 DT1[, `:=`(counter, rowid(rleid(input)))]            60.5ms   71.4ms     12.7     26.4MB    14.5      7     8      553ms

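An efficient and more direct version of the function written further below is now available in the data.table package as rleid. Using it, the solution is simply: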

setDT(dataset)[, counter := seq_len(.N), by=rleid(input)]

See ?rleid for more usage and examples. Thanks to @Henrik for suggesting this update to the post.
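The idea of grouping by a run id and then numbering the rows within each group can also be sketched in base R. This is an analogue for illustration only, not how data.table implements rleid; the variable names is_start and run_id are hypothetical, and the sketch assumes a non-empty vector without NA values.

```r
x <- c("a","b","b","a","a","c","a","a","a","a","b","c")

# TRUE wherever a new run starts (the first element always starts a run)
is_start <- c(TRUE, x[-1L] != x[-length(x)])

# Cumulative sum of the start flags gives one id per run
run_id <- cumsum(is_start)  # 1 2 2 3 3 4 5 5 5 5 6 7

# Number the elements within each run id
counter <- ave(seq_along(x), run_id, FUN = seq_along)
counter                     # 1 1 2 1 2 1 1 2 3 4 1 1
```

The run_id vector plays the role that rleid(input) plays in the by= clause above.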

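rle is definitely the most convenient way to do this (+1 @Ananda). On larger data, however, you can do better in terms of speed. You can use the (unexported) duplist and vecseq functions from data.table as follows: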

require(data.table)

# duplist and vecseq are unexported internals (note the `:::`),
# so they may change or disappear between data.table versions.
arun <- function(y) {
    w = data.table:::duplist(list(y))
    w = c(diff(w), length(y) - tail(w, 1L) + 1L)
    data.table:::vecseq(rep(1L, length(w)), w, length(y))
}

x <- c("a","b","b","a","a","c","a","a","a","a","b","c")
arun(x)
# [1] 1 1 2 1 2 1 1 2 3 4 1 1
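The arithmetic inside arun() can be followed in base R. The sketch below assumes, judging only from how w is used in the function, that duplist returns the start position of each run; the start positions here are computed with which() instead, and the final step uses plain index arithmetic rather than vecseq.

```r
x <- c("a","b","b","a","a","c","a","a","a","a","b","c")

# Start position of each run (stand-in for the assumed duplist result)
starts <- which(c(TRUE, x[-1L] != x[-length(x)]))  # 1 2 4 6 7 11 12

# Same length computation as in arun(): gaps between starts, plus the last run
lens <- c(diff(starts), length(x) - tail(starts, 1L) + 1L)  # 1 2 2 1 4 1 1

# Position within a run = index minus the start of its run, plus 1
counter <- seq_along(x) - rep(starts, lens) + 1L
counter  # 1 1 2 1 2 1 1 2 3 4 1 1
```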

Benchmark on larger data:

set.seed(1)
x <- sample(letters, 1e6, TRUE)

# rle solution
ananda <- function(y) {
    sequence(rle(y)$lengths)
}

require(microbenchmark)
microbenchmark(a1 <- arun(x), a2 <- ananda(x), times = 100)
# Unit: milliseconds
#             expr       min        lq    median       uq       max neval
#    a1 <- arun(x)  123.2827  132.6777  163.3844  185.439  563.5825   100
#  a2 <- ananda(x) 1382.1752 1899.2517 2066.4185 2247.233 3764.0040   100

identical(a1, a2) # [1] TRUE

2019-10-19

蝴蝶不菲

Contributed 1810 experience points, received more than 4 likes

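The runner package has a dedicated function for exactly this calculation. streak_run accepts a vector as input and was the fastest solution in the benchmark below.

library(microbenchmark); library(runner)
x <- sample(letters, 1e6, TRUE)
ananda <- function(y) sequence(rle(y)$lengths)

# The result is stored as `run` and times = 10 is used so that the call
# matches the reported output (neval 10) and the identical() check below.
microbenchmark(a2 <- ananda(x), run <- streak_run(x), times = 10)
# Unit: milliseconds
#                  expr     min      lq     mean  median       uq      max neval
#       a2 <- ananda(x) 580.744 718.117 1059.676 944.073 1399.649 1699.293    10
#  run <- streak_run(x)  37.682  39.568   42.277  40.591   43.947   52.917    10

identical(a2, run)
# [1] TRUE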

2019-10-19
