Questions tagged [dplyr]
Use this tag for questions relating to functions from the dplyr package, such as group_by, summarize, filter, and select.
                                	
36,903 questions
        
        
    914 votes, 5 answers, 165k views
    data.table vs dplyr: can one do something well the other can't or does poorly?
                Overview
I'm relatively familiar with data.table, not so much with dplyr.  I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that:
data....
            
        
       
    
    315 votes, 5 answers, 699k views
    Filter rows which contain a certain string
                I have to filter a data frame, keeping only those rows that contain the string RTB.
I'm using dplyr.
d.del <- df %>%
  group_by(TrackingPixel) %>%
  summarise(MonthDelivery = as....
            
        
       
    
    295 votes, 10 answers, 221k views
    Use dynamic name for new column/variable in `dplyr`
                I want to use dplyr::mutate() to create multiple new columns in a data frame. The column names and their contents should be dynamically generated.
Example data from iris:
library(dplyr)
iris <- ...
            
        
       
    
    274 votes, 8 answers, 157k views
    Extract a dplyr tbl column as a vector
                Is there a more succinct way to get one column of a dplyr tbl as a vector, from a tbl with database back-end (i.e. the data frame/table can't be subset directly)?
require(dplyr)
db <- src_sqlite(...
            
        
       
    
    269 votes, 7 answers, 208k views
    Display / print all rows of a tibble (tbl_df)
                tibble (previously tbl_df) is a version of a data frame created by the dplyr data frame manipulation package in R. It prevents long table outputs when accidentally calling the data frame.
Once a data ...
            
        
       
    
    244 votes, 11 answers, 287k views
    Relative frequencies / proportions with dplyr
                Suppose I want to calculate the proportion of different values within each group. For example, using the mtcars data, how do I calculate the relative frequency of number of gears by am (automatic/...
            
        
       
    
    232 votes, 5 answers, 368k views
    Can dplyr package be used for conditional mutating?
                Can mutate be used when the mutation is conditional (depending on the values of certain columns)?
This example helps show what I mean.
structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = ...
            
        
       
    
    208 votes, 7 answers, 633k views
    What does %>% function mean in R?
                I have seen the use of %>% (percent greater than percent) function in some packages like dplyr and rvest.  What does it mean? Is it a way to write closure blocks in R?
            
        
       
    
    204 votes, 10 answers, 174k views
    Select first and last row from grouped data
                Question
Using dplyr, how do I select the top and bottom observations/rows of grouped data in one statement?
Data & Example
Given a data frame:
df <- data.frame(id=c(1,1,1,2,2,2,3,3,3), 
       ...
            
        
       
    
    203 votes, 6 answers, 211k views
    Remove duplicated rows using dplyr
                I have a data.frame like this - 
set.seed(123)
df = data.frame(x=sample(0:1,10,replace=T),y=sample(0:1,10,replace=T),z=1:10)
> df
   x y  z
1  0 1  1
2  1 0  2
3  0 1  3
4  1 1  4
5  1 0  5
6  0 1 ...
            
        
       
    
    203 votes, 6 answers, 178k views
    How to interpret dplyr message `summarise()` regrouping output by 'x' (override with `.groups` argument)?
                I started getting a new message (see post title) when running group_by and summarise() after updating to dplyr development version 0.8.99.9003.
Here is an example to recreate the output:
library(...
            
        
       
    
    200 votes, 10 answers, 117k views
    Fixing a multiple warning "unknown column"
                I have a persistent, repeated warning of "unknown column" for all types of commands (from str(x) to installing package updates), and I'm not sure how to debug or fix it. 
The warning "unknown ...
            
        
       
    
    192 votes, 5 answers, 308k views
    Summarizing multiple columns with dplyr? [duplicate]
                I'm struggling a bit with the dplyr-syntax. I have a data frame with different variables and one grouping variable. Now I want to calculate the mean for each column within each group, using dplyr in R....
            
        
       
    
    184 votes, 10 answers, 122k views
    Group by multiple columns in dplyr, using string vector input
                I'm trying to transfer my understanding of plyr into dplyr, but I can't figure out how to group by multiple columns.
# make data with weird column names that can't be hard coded
data = data.frame(
  ...
            
        
       
    
    177 votes, 9 answers, 290k views
    Sum across multiple columns with dplyr
                My question involves summing up values across multiple columns of a data frame and creating a new column corresponding to this summation using dplyr. The data entries in the columns are binary (0, 1). I ...
            
        
       
    
    162 votes, 6 answers, 241k views
    How to select the rows with maximum values in each group with dplyr? [duplicate]
                I would like to select a row with maximum value in each group with dplyr.
Firstly I generate some random data to show my question
set.seed(1)
df <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
df$...
            
        
       
    
    159 votes, 2 answers, 249k views
    Can dplyr join on multiple columns or composite key?
                I realize that dplyr 0.3 allows you to join on different variables:
left_join(x, y, by = c("a" = "b")) will match x.a to y.b
However, is it possible to join on a combination of variables or do I ...
            
        
       
    
    151 votes, 4 answers, 611k views
    Error: could not find function "%>%"
                I'm running an example in R, going through the steps, and everything works so far except that this code produces an error:  
 words <- dtm %>%
 as.matrix %>%
 colnames %>%
 (function(x)...
            
        
       
    
    148 votes, 2 answers, 255k views
    Change value of variable with dplyr
                I regularly need to change the values of a variable based on the values on a different variable, like this:
mtcars$mpg[mtcars$cyl == 4] <- NA
I tried doing this with dplyr but failed miserably:
...
            
        
       
    
    148 votes, 8 answers, 110k views
    Applying a function to every row of a table using dplyr?
                When working with plyr I often found it useful to use adply for scalar functions that I have to apply to each and every row.
e.g.
data(iris)
library(plyr)
head(
     adply(iris, 1, transform , Max....
            
        
       
    
    142 votes, 6 answers, 61k views
    R Conditional evaluation when using the pipe operator %>%
                When using the pipe operator %>% with packages such as dplyr, ggvis, dycharts, etc., how do I do a step conditionally? For example:
step_1 %>%
step_2 %>%
if(condition)
step_3
These ...
            
        
       
    
    137 votes, 6 answers, 262k views
    Count number of rows by group using dplyr
                I am using the mtcars dataset. I want to find the number of records for a particular combination of data. Something very similar to the count(*) group by clause in SQL. ddply() from plyr is working ...
            
        
       
    
    133 votes, 10 answers, 190k views
    R dplyr: Drop multiple columns
                I have a dataframe and list of columns in that dataframe that I'd like to drop. Let's use the iris dataset as an example. I'd like to drop Sepal.Length and Sepal.Width and use only the remaining ...
            
        
       
    
    131 votes, 4 answers, 81k views
    Pass a string as variable name in dplyr::filter
                I'm using mtcars dataset to illustrate my question.
For example, I want to subset data to 4-cyl cars. I can do:
mtcars %>% filter(cyl == 4)
In my work, I need to pass a string variable as my ...
            
        
       
    
    128 votes, 7 answers, 78k views
    Filter for complete cases in data.frame using dplyr (case-wise deletion)
                Is it possible to filter a data.frame for complete cases using dplyr? complete.cases with a list of all variables works, of course. But that is a) verbose when there are a lot of variables and b) ...
            
        
       
    
    124 votes, 7 answers, 160k views
    Replacement for "rename" in dplyr
                I like plyr's renaming function rename.  I have recently started using dplyr, and was wondering if there is an easy way to rename variables using a function from dplyr, that is as easy to use as to ...
            
        
       
    
    123 votes, 6 answers, 139k views
    Getting the top values by group
                Here's a sample data frame:
d <- data.frame(
  x   = runif(90),
  grp = gl(3, 30)
) 
I want the subset of d containing the rows with the top 5 values of x for each value of grp.
Using base-R, my ...
            
        
       
    
    121 votes, 3 answers, 204k views
    How to specify names of columns for x and y when joining in dplyr?
                I have two data frames that I want to join using dplyr. One is a data frame containing first names.
test_data <- data.frame(first_name = c("john", "bill", "madison", "abby", "zzz"),
               ...
            
        
       
    
    121 votes, 3 answers, 317k views
    dplyr mutate with conditional values
                In a large dataframe ("myfile") with four columns I have to add a fifth column with values conditionally based on the first four columns.
Prefer answers with dplyr and mutate, mainly because of its ...
            
        
       
    
    120 votes, 9 answers, 141k views
    Find duplicated elements with dplyr
                I tried using the code presented here to find ALL duplicated elements with dplyr like this:
library(dplyr)
mtcars %>%
mutate(cyl.dup = cyl[duplicated(cyl) | duplicated(cyl, from.last = TRUE)])
...
            
        
       
    
    120 votes, 5 answers, 197k views
    Select columns based on string match - dplyr::select
                I have a data frame ("data") with lots and lots of columns. Some of the columns contain a certain string ("search_string").
How can I use dplyr::select() to give me a subset including only the ...
            
        
       
    
    119 votes, 4 answers, 58k views
    dplyr summarise: Equivalent of ".drop=FALSE" to keep groups with zero length in output
                When using summarise with plyr's ddply function, empty categories are dropped by default. You can change this behavior by adding .drop = FALSE. However, this doesn't work when using summarise with ...
            
        
       
    
    118 votes, 8 answers, 119k views
    Extract row corresponding to minimum value of a variable by group
                I wish to (1) group data by one variable (State), (2) within each group find the row of minimum value of another variable (Employees), and (3) extract the entire row.
(1) and (2) are easy one-liners, ...
            
        
       
    
    117 votes, 5 answers, 126k views
    Gather multiple sets of columns
                I have data from an online survey where respondents go through a loop of questions 1-3 times. The survey software (Qualtrics) records this data in multiple columns—that is, Q3.2 in the survey will ...
            
        
       
    
    109 votes, 7 answers, 389k views
    Filter multiple values on a string column in dplyr
                I have a data.frame with character data in one of the columns.
I would like to filter multiple options in the data.frame from the same column. Is there an easy way to do this that I'm missing?
Example:...
            
        
       
    
    109 votes, 12 answers, 74k views
    dplyr mutate/replace several columns on a subset of rows
                I'm in the process of trying out a dplyr-based workflow (rather than using mostly data.table, which I'm used to), and I've come across a problem that I can't find an equivalent dplyr solution to. I ...
            
        
       
    
    109 votes, 1 answer, 108k views
    R spreading multiple columns with tidyr [duplicate]
                Take this sample variable
df <- data.frame(month=rep(1:3,2),
                 student=rep(c("Amy", "Bob"), each=3),
                 A=c(9, 7, 6, 8, 6, 9),
                 B=c(6, 7, 8, 5, 6, 7))
...
            
        
       
    
    106 votes, 15 answers, 315k views
    How to get summary statistics by group
                I'm trying to get multiple summary statistics in R/S-PLUS grouped by a categorical column in one shot. I found a couple of functions, but all of them do one statistic per call, like aggregate().
data <-...
            
        
       
    
    101 votes, 6 answers, 67k views
    dplyr: "Error in n(): function should not be called directly"
                I am attempting to reproduce one of the examples in the dplyr package but am getting this error message. I am expecting to see a new column n produced with the frequency of each combination.  What am ...
            
        
       
    
    101 votes, 2 answers, 153k views
    Get dplyr count of distinct in a readable way
                I'm new to dplyr.
I need to count the distinct values in a group. Here's a table example:
data <- data.frame(aa = c(1, 2, 3, 4, NA), 
                   bb = c('a', 'b', 'a', 'c', 'c'))
I ...
            
        
       
    
    100 votes, 4 answers, 80k views
    Use pipe operator %>% with replacement functions like colnames()<-
                How can I use the pipe operator to pipe into replacement function like colnames()<- ?
Here's what I'm trying to do:
library(dplyr)
averages_df <- 
   group_by(mtcars, cyl) %>%
   summarise(...
            
        
       
    
    97 votes, 5 answers, 50k views
    R move column to last using dplyr
                For a data.frame with n columns, I would like to be able to move a column from any of 1-(n-1) positions, to be the nth column (i.e. a non-last column to be the last column). I would also like to do it ...
            
        
       
    
    96 votes, 4 answers, 17k views
    dplyr on data.table, am I really using data.table?
                If I use dplyr syntax on top of a data.table, do I get all the speed benefits of data.table while still using the syntax of dplyr? In other words, do I misuse the data.table if I query it with dplyr ...
            
        
       
    
    93 votes, 9 answers, 192k views
    dplyr change many data types
                I have a data.frame:
dat <- data.frame(fac1 = c(1, 2),
                  fac2 = c(4, 5),
                  fac3 = c(7, 8),
                  dbl1 = c('1', '2'),
                  dbl2 = c('4', '5'),...
            
        
       
    
    92 votes, 5 answers, 114k views
    How to create a lag variable within each group?
                I have a data.table:
require(data.table)
set.seed(1)
data <- data.table(time = c(1:3, 1:4),
                   groups = c(rep(c("b", "a"), c(3, 4))),
                   value = ...
            
        
       
    
    91 votes, 1 answer, 197k views
    Removing NA in dplyr pipe [duplicate]
                I tried to remove NAs from the subset using dplyr piping. Is my result an indication of a missed step? I'm trying to learn how to write functions using dplyr:
> outcome.df%>%
+ group_by(...
            
        
       
    
    90 votes, 6 answers, 184k views
    Changing factor levels with dplyr mutate
                This is probably simple and I feel stupid for asking. I want to change the levels of a factor in a data frame, using mutate. Simple example:
library("dplyr")
dat <- data.frame(x = factor("A"), y = ...
            
        
       
    
    90 votes, 4 answers, 125k views
    Applying group_by and summarise on data while keeping all the columns' info
                I have a large dataset with 22000 rows and 25 columns. I am trying to group my dataset based on one of the columns and take the min value of the other column based on the grouped dataset. However, the ...
            
        
       
    
    89 votes, 1 answer, 37k views
    Create new variables with mutate_at while keeping the original ones
                Consider this simple example:
library(dplyr)
library(tibble)
dataframe <- tibble(helloo = c(1,2,3,4,5,6),
                        ooooHH = c(1,1,1,2,2,2),
                        ahaaa = c(200,400,...
            
        
       
    
    88 votes, 5 answers, 128k views
    dplyr summarise_each with na.rm
                Is there a way to instruct dplyr to use summarise_each with na.rm=TRUE? I would like to take the mean of variables with summarise_each("mean") but I don't know how to specify it to ignore missing ...