aggregate(df, ...) returning NAs?


I would like to apply the aggregate function to this data frame, grouping by the variables "id" and "var1":

df <- structure(list (id = c(1L,1L,1L,1L,2L,2L,2L,2L),
        var1 = structure(c(1L,1L,2L,2L,1L,1L,2L,2L),
          .Label = c("A", "B"), class = "factor"), 
        var2 = c(1L,2L,1L,2L,1L,2L,1L,2L),
        values = c(37L,20L,22L,18L,30L,5L,41L,50L)),
        .Names = c("id","var1","var2","values"),
        class = "data.frame", row.names = c(NA,-8L))

# looks like
> df
  id var1 var2 values
1  1    A    1     37
2  1    A    2     20
3  1    B    1     22
4  1    B    2     18
5  2    A    1     30
6  2    A    2      5
7  2    B    1     41
8  2    B    2     50

However, if I do this I get a lot of warnings and a column full of NAs:

> agg <- aggregate(df, by=list(df$id, df$var1), mean)
Warning messages:
1: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA
4: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA
> agg
  Group.1 Group.2 id var1 var2 values
1       1       A  1   NA  1.5   28.5
2       2       A  2   NA  1.5   17.5
3       1       B  1   NA  1.5   20.0
4       2       B  2   NA  1.5   45.5

Is there a way to prevent these warnings? Has my aggregate result lost some data because of them?

Try the formula interface, which groups by id and var1 and applies mean to the remaining columns:

aggregate( . ~ id + var1 , data = df, mean)

#  id var1 var2 values
#1  1    A  1.5   28.5
#2  2    A  1.5   17.5
#3  1    B  1.5   20.0
#4  2    B  1.5   45.5
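
In the formula above, `.` stands for every column not named on the right-hand side, so the factor column var1 is only used for grouping and is never passed to mean. As a minimal sketch, assuming the same df as in the question (rebuilt here so the snippet runs on its own), you can also name the aggregated columns explicitly with cbind():

```r
# Rebuild the question's data frame so the example is self-contained
df <- data.frame(
  id     = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  var1   = factor(c("A", "A", "B", "B", "A", "A", "B", "B")),
  var2   = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
  values = c(37L, 20L, 22L, 18L, 30L, 5L, 41L, 50L)
)

# Name the columns to aggregate on the left-hand side instead of using `.`
res <- aggregate(cbind(var2, values) ~ id + var1, data = df, FUN = mean)
res
```

Note that the formula method defaults to na.action = na.omit, so rows containing NA in any of the variables used are dropped before aggregating. That does not matter for this data, which has no missing values.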

Here are some other options

Using dplyr

library(dplyr)
df %>% group_by(id, var1) %>% summarize(var2 = mean(var2), values = mean(values))
# or simply (summarise_each() and funs() are deprecated; across() needs dplyr >= 1.0.0)
df %>% group_by(id, var1) %>% summarise(across(everything(), mean))

#Source: local data frame [4 x 4]
#Groups: id
#  id var1 var2 values
#1  1    A  1.5   28.5
#2  2    A  1.5   17.5
#3  1    B  1.5   20.0
#4  2    B  1.5   45.5

Using data.table, you have two options:

library(data.table)
setDT(df)[, .(var2 = mean(var2), values = mean(values)), by = .(id, var1)] # option 1
setDT(df)[, lapply(.SD, mean), by=.(id,var1), .SDcols=c("var2","values")] # option 2

#   id var1 var2 values
#1:  1    A  1.5   28.5
#2:  1    B  1.5   20.0
#3:  2    A  1.5   17.5
#4:  2    B  1.5   45.5

Using ddply

library(plyr)
ddply(df, .(id,var1), colwise(mean))

#  id var1 var2 values
#1  1    A  1.5   28.5
#2  1    B  1.5   20.0
#3  2    A  1.5   17.5
#4  2    B  1.5   45.5


You need to limit the data frame passed as argument x to the columns you want FUN applied to. In your example you want the mean of the values column, grouped by id and var1, so pass df$values instead of the whole df:

agg <- aggregate(df$values, by=list(df$id, df$var1), mean)
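
With a bare vector such as df$values and an unnamed by list, the result columns come out as Group.1, Group.2 and x. A small sketch of one way to keep readable names, assuming the df from the question (rebuilt here so the snippet is self-contained): pass a sub-data-frame for x and name the elements of by.

```r
# Rebuild the question's data frame so the example is self-contained
df <- data.frame(
  id     = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  var1   = factor(c("A", "A", "B", "B", "A", "A", "B", "B")),
  var2   = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
  values = c(37L, 20L, 22L, 18L, 30L, 5L, 41L, 50L)
)

# x holds only the numeric columns to summarise;
# naming the `by` list gives the grouping columns proper names
agg <- aggregate(df[c("var2", "values")],
                 by = list(id = df$id, var1 = df$var1),
                 FUN = mean)
agg
```

Because var1 is no longer part of x, mean is never called on the factor and no warnings are raised.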


Because your first argument (x = df, ...) asked aggregate to apply the function to all of df's columns, including the factor column var1, not just the single column values.

You want (x = df$values, ...) instead.

Or use the formula interface as others have said.
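
To see where the warnings come from, it may help to check the column classes. This is only an illustrative sketch, assuming the df from the question (rebuilt here so it runs on its own). var1 is a factor, and mean() returns NA with a warning for a factor. aggregate calls mean once per group for that column, which matches the four warnings and the NA column in the question. The numeric columns are unaffected, so no data is lost there.

```r
# Rebuild the question's data frame so the example is self-contained
df <- data.frame(
  id     = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  var1   = factor(c("A", "A", "B", "B", "A", "A", "B", "B")),
  var2   = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
  values = c(37L, 20L, 22L, 18L, 30L, 5L, 41L, 50L)
)

# Inspect the class of every column; var1 is the only non-numeric one
sapply(df, class)

# mean() on a factor gives NA (the warning is suppressed here for brevity)
m <- suppressWarnings(mean(df$var1))

# Names of the columns that mean() can handle
num_cols <- names(df)[sapply(df, is.numeric)]
num_cols
```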


Comments
  • Run ?aggregate and read the section under ## S3 method for class 'formula'.
  • Because your first argument (x = df, ...) asked it to aggregate over all of df's columns (not just values). When you use the non-formula interface, you need to specify the column you want to aggregate (x = df$values, ...).
  • An alternative for the data.table option would be: setDT(df)[, lapply(.SD, mean), by=.(id,var1), .SDcols=c("var2","values")]
  • Nice options. Using the list method aggregate(df[c('var2', 'values')], df[c('id', 'var1')], FUN=mean)
  • I am not getting the warnings though. Can you check whether var2 is a factor or not?
  • Thanks, I am not sure how it happens. It must be some kind of clash between separating the 'x' and by variables.