Dplyr summarize n lines

1/18/2024

dplyr is an amazing tool for data wrangling and I use it daily. However, there is one type of operation that I frequently do that has historically caused me some confusion and frustration: row-wise means. Once I figured out what was going on, I wanted to share what I learned through this brief blog post. It will focus on how to avoid some common issues I ran into and how to speed up rowwise operations with large data frames.

Let's say we have a tibble (or data frame) containing 10 observations and 4 numerical variables: y, x1, x2, and x3. We can simulate this quickly using rnorm() to sample from different normal distributions.

Now let's say we want to add a new variable xmean to the tibble containing each observation's mean of x1, x2, and x3. If you are just learning dplyr, you would probably try to combine the mean() and mutate() functions, calling mean(c(x1, x2, x3)) inside mutate().

However, you'll notice in the resulting output that the new xmean variable contains repetitions of a constant value. What is going on here? Basically, what mutate() did was take all the numbers in x1, x2, and x3, combine them into one long vector of 30 numbers, and send that vector to the mean() function. The mean() function then returns a single value (the mean of all 30 numbers) and tries to put that into the new column xmean. But because the column needs to be a vector of 10 numbers to fit into the tibble, that single value gets recycled (i.e., repeated 10 times). To verify this is what happened, we can do the operation by hand and see that we get the same number: mean(c(x1, x2, x3))

So, clearly mutate() is not doing what we intended it to do. Luckily, dplyr 1.0.0 added some great features for doing operations within rows. The simplest version just adds a call to the rowwise() function to our pipeline. This did what we wanted it to do, despite the actual mutate() call being identical to what it was before! Pretty cool.

We can even save some time by selecting the variables to include in the mean() operation automatically, instead of listing them out in the c() function. This isn't much of a time savings in this case with only three variables, but in settings with more variables it can really add up. To do so, we just need to use a tidy selection function. In this case, all the variables we want to include start with the letter "x", so let's use starts_with().

# Error: Problem with `mutate()` input `xmean`.
# x `starts_with()` must be used within a *selecting* function.
# i Input `xmean` is `mean(starts_with("x"))`.

The error message basically says that we are in the "wrong context" for a selection function. But why? Basically, the problem is that mutate() doesn't know what to do with selection functions like starts_with().
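The code from this post did not survive on the page, so here is a minimal sketch of the row-wise mean workflow it describes. The seed and the means and standard deviations passed to rnorm() are assumptions chosen only for illustration. The last step uses c_across(), which is dplyr's documented helper for tidy selection inside a rowwise mutate(); the text above stops before naming a fix, so treat that step as one plausible way around the starts_with() error rather than the post's own solution.

```r
library(dplyr)
library(tibble)

set.seed(2024)  # arbitrary seed (assumption) so the simulated values are reproducible

# Simulate 10 observations of y, x1, x2, and x3.
# The distribution parameters are illustrative assumptions.
df <- tibble(
  y  = rnorm(10, mean = 100, sd = 15),
  x1 = rnorm(10, mean = 0,   sd = 1),
  x2 = rnorm(10, mean = 10,  sd = 2),
  x3 = rnorm(10, mean = 20,  sd = 5)
)

# Naive attempt: mean() receives one pooled vector of 30 numbers,
# so every row gets the same recycled value.
naive <- df %>%
  mutate(xmean = mean(c(x1, x2, x3)))

# Row-wise version: the identical mutate() call is now evaluated once per row.
by_row <- df %>%
  rowwise() %>%
  mutate(xmean = mean(c(x1, x2, x3))) %>%
  ungroup()

# Tidy selection inside a rowwise mutate() goes through c_across().
by_row_sel <- df %>%
  rowwise() %>%
  mutate(xmean = mean(c_across(starts_with("x")))) %>%
  ungroup()
```

Printing naive should show a single repeated value in xmean, while by_row and by_row_sel should each show a different mean per row. The ungroup() calls drop the rowwise grouping so that later steps in a pipeline go back to operating on whole columns.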
