With ddply in R, is it possible to avoid enumerating all the columns I need?

Other posts suggested that ddply is a good workhorse. I am trying to learn the xxply functions and I cannot solve this problem.

This is my data:

library(ggplot2)
(df= tips[1:5,])
             total_bill                   tip    sex smoker day   time size
1 16.989999999999998437 1.0100000000000000089 Female     No Sun Dinner    2
2 10.339999999999999858 1.6599999999999999201   Male     No Sun Dinner    3
3 21.010000000000001563 3.5000000000000000000   Male     No Sun Dinner    3
4 23.679999999999999716 3.3100000000000000533   Male     No Sun Dinner    2
5 24.589999999999999858 3.6099999999999998757 Female     No Sun Dinner    4

and I need to do something like this:

ddply(df
       ,.(<do I have to enumerate all columns I need to operate on here?)>
       , function(x) {if size>=3 return(size) else return(total_bill+tip)
     )

(The example is artificial and makes no real-life sense; it only demonstrates the problem I have with larger data.)

  1. I could not get the ddply code right from reading the help files alone. Any advice is appreciated, or even a pointer to a great ddply tutorial.

  2. I like that with ddply I can just pass my dataframe as input, but in the second argument, it is not nice that I am forced to enumerate all columns that I need later. Is there a way to pass the whole row (all columns)?

  3. I like defining the function on the fly, but I am not sure how to make my pseudocode correct in R (my last argument).

Based on your code, it doesn't look like you need to use plyr here at all. It seems to me you are calculating a new variable for each row of the data.frame. If that's the case, then just use some base R functions:

dat <- transform(dat, newval = ifelse(size >= 3, size, total_bill + tip))

  total_bill  tip    sex smoker day   time size newval
1      16.99 1.01 Female     No Sun Dinner    2  18.00
2      10.34 1.66   Male     No Sun Dinner    3   3.00
3      21.01 3.50   Male     No Sun Dinner    3   3.00
4      23.68 3.31   Male     No Sun Dinner    2  26.99
5      24.59 3.61 Female     No Sun Dinner    4   4.00

Sorry if I misunderstood what you are doing. If you do in fact need to pass each entire row of a data.frame into plyr with no grouping variable, perhaps you can treat it as an array with margin = 1, i.e., adply(dat, 1, ...)
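As a rough sketch of that adply idea (not tested against the asker's real data): the data frame below is rebuilt by hand from the five rows in the question, and the column name newval is only illustrative. With the margin set to 1, each row reaches the function as a one-row data.frame, so a plain if/else on a single value is fine here.

```r
library(plyr)

# First five rows of the question's data, rebuilt by hand (subset of columns)
dat <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61),
  size       = c(2, 3, 3, 2, 4)
)

# .margins = 1: the function is called once per row, with a one-row data.frame
res <- adply(dat, 1, function(row) {
  data.frame(newval = if (row$size >= 3) row$size else row$total_bill + row$tip)
})
```

This calls the function once per row, so on larger data the vectorized transform/ifelse version above is likely to be much faster.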

A great introduction to plyr is here: www.jstatsoft.org/v40/i01/paper

The second argument is the "splitting" variable. So in your sample data set, if you want to see the difference in spending habits between the sexes, you would supply .(sex); if you want all combinations of your categorical variables then yes, you would have to supply them all: .(sex, smoker, day, time).

On a separate note, when using ddply your function should take a data.frame and return a data.frame. Currently it returns a vector. Also, if is not vectorized; you should use ifelse instead.
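To illustrate what splitting on .(sex) does, here is a small sketch. The data frame is rebuilt by hand from the five rows in the question, and the output column name mean_tip is just an illustrative choice.

```r
library(plyr)

# First five rows of the question's data, rebuilt by hand (subset of columns)
df <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68, 24.59),
  tip        = c(1.01, 1.66, 3.50, 3.31, 3.61),
  sex        = c("Female", "Male", "Male", "Male", "Female"),
  size       = c(2, 3, 3, 2, 4)
)

# The data is split into one piece per level of sex; summarise is applied
# to each piece and the results are stacked into one data.frame
by_sex <- ddply(df, .(sex), summarise, mean_tip = mean(tip))
```

The result has one row per group rather than one row per input row, which is why the splitting variables only need to be the columns that define the groups.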

ddply(df, .(sex), function(x) {
      x$new.var <- ifelse(x$size >= 3, x$size, x$total_bill + x$tip)
      return(x)
})

If you don't specify the return value, R will return the last thing calculated, which is a vector.
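The point about if not being vectorized can be seen in base R alone. The vectors below are made-up values for illustration only.

```r
size  <- c(2, 3, 4)
total <- c(18.00, 12.00, 28.20)

# ifelse() tests every element and returns a vector of the same length
vec_result <- ifelse(size >= 3, size, total)

# if () expects a single TRUE/FALSE, so it has to be applied one element
# at a time, here via mapply()
loop_result <- mapply(function(s, t) if (s >= 3) s else t, size, total)
```

Both approaches give the same values; ifelse is simply the shorter and usually faster way to write an element-wise condition.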

My only other suggestion is to keep playing with plyr. Eventually it will click and you'll love it!

I don't know if this is still useful. While I am not sure whether this is adequate, I usually solve tasks similar to yours as follows:

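ddply(df
       , as.quoted(colnames(df))   # split on every column without listing them by hand
       # return a data.frame and use the vectorized ifelse instead of if/else
       , function(x) data.frame(newval = ifelse(x$size >= 3, x$size, x$total_bill + x$tip))
     )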

Comments
  • the transform trick is very nice and new to me. I saw that paper. thx
  • I think you can operate on all columns by using .var = names(df).
  • I need to learn *apply techniques and decided to go with plyr
  • @Justin Uh... I don't really know what I was thinking there. Let's pretend I meant to say this: If you wanted to operate on all of the factors in a data.frame, you can do it like this: ddply(df, .variables = names(df)[sapply(df, is.factor)], .fun = ...) Basically, it's sometimes easier to subset names(df) to what you need than to explicitly spell out each variable name.