## aggregate(df, ...) returning NAs?

I would like to apply the `aggregate` function to this data frame, grouping by the variables `id` and `var1`:

    df <- structure(list(
      id = c(1L,1L,1L,1L,2L,2L,2L,2L),
      var1 = structure(c(1L,1L,2L,2L,1L,1L,2L,2L), .Label = c("A", "B"), class = "factor"),
      var2 = c(1L,2L,1L,2L,1L,2L,1L,2L),
      values = c(37L,20L,22L,18L,30L,5L,41L,50L)),
      .Names = c("id","var1","var2","values"), class = "data.frame", row.names = c(NA,-8L))

    # looks like
    > df
      id var1 var2 values
    1  1    A    1     37
    2  1    A    2     20
    3  1    B    1     22
    4  1    B    2     18
    5  2    A    1     30
    6  2    A    2      5
    7  2    B    1     41
    8  2    B    2     50

However, if I do this, I get a lot of warnings and a column full of NAs:

    > agg <- aggregate(df, by=list(df$id, df$var1), mean)
    Warning messages:
    1: In mean.default(X[[i]], ...) :
      argument is not numeric or logical: returning NA
    2: In mean.default(X[[i]], ...) :
      argument is not numeric or logical: returning NA
    3: In mean.default(X[[i]], ...) :
      argument is not numeric or logical: returning NA
    4: In mean.default(X[[i]], ...) :
      argument is not numeric or logical: returning NA
    > agg
      Group.1 Group.2 id var1 var2 values
    1       1       A  1   NA  1.5   28.5
    2       2       A  2   NA  1.5   17.5
    3       1       B  1   NA  1.5   20.0
    4       2       B  2   NA  1.5   45.5

Is there a way to prevent these warnings? Has my aggregate result lost some data because of them?

Try this:

    aggregate(. ~ id + var1, data = df, mean)
    #  id var1 var2 values
    #1  1    A  1.5   28.5
    #2  2    A  1.5   17.5
    #3  1    B  1.5   20.0
    #4  2    B  1.5   45.5
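In the formula interface, the `.` on the left-hand side stands for every column not named on the right. As a minimal sketch, assuming the `df` from the question (rebuilt here with `data.frame()` so the snippet runs on its own), the columns to aggregate can also be named explicitly with `cbind()`; this should give the same means as the output above:

```r
# Same data as in the question, rebuilt with data.frame()
df <- data.frame(
  id     = rep(1:2, each = 4),
  var1   = factor(rep(c("A", "A", "B", "B"), times = 2)),
  var2   = rep(1:2, times = 4),
  values = c(37L, 20L, 22L, 18L, 30L, 5L, 41L, 50L)
)

# Name the columns to aggregate explicitly instead of using `.`
agg <- aggregate(cbind(var2, values) ~ id + var1, data = df, FUN = mean)
print(agg)
```

Spelling out the columns can be safer than `.` when the data frame later gains non-numeric columns.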

Here are some other options

Using `dplyr`

    library(dplyr)
    df %>%
      group_by(id, var1) %>%
      summarize(var2 = mean(var2), values = mean(values))
    # or simply
    # (summarise_each() and funs() are deprecated in current dplyr;
    #  summarise(across(c(var2, values), mean)) is the modern equivalent)
    df %>%
      group_by(id, var1) %>%
      summarise_each(funs(mean))
    #Source: local data frame [4 x 4]
    #Groups: id
    #  id var1 var2 values
    #1  1    A  1.5   28.5
    #2  2    A  1.5   17.5
    #3  1    B  1.5   20.0
    #4  2    B  1.5   45.5

Using `data.table`, you have two options:

    library(data.table)
    setDT(df)[, .(var2 = mean(var2), values = mean(values)), by = .(id, var1)]        # option 1
    setDT(df)[, lapply(.SD, mean), by = .(id, var1), .SDcols = c("var2", "values")]  # option 2
    #   id var1 var2 values
    #1:  1    A  1.5   28.5
    #2:  1    B  1.5   20.0
    #3:  2    A  1.5   17.5
    #4:  2    B  1.5   45.5

Using `ddply`

    library(plyr)
    ddply(df, .(id, var1), colwise(mean))
    #  id var1 var2 values
    #1  1    A  1.5   28.5
    #2  1    B  1.5   20.0
    #3  2    A  1.5   17.5
    #4  2    B  1.5   45.5

You need to limit the data frame provided for argument `x` to the columns you want `FUN` to be applied to. So in your example, you want to apply the `mean` function to the `values` column, grouped by `id` and `var1`, hence you need to specify `df$values` instead of just `df`:

    agg <- aggregate(df$values, by=list(df$id, df$var1), mean)
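When `x` is a bare vector and `by` is an unnamed list, the result columns come out as `Group.1`, `Group.2` and `x`. A small sketch, assuming the question's `df` (rebuilt here so the snippet is self-contained), that passes a one-column data frame and a named `by` list so the output keeps readable names:

```r
# Same data as in the question, rebuilt with data.frame()
df <- data.frame(
  id     = rep(1:2, each = 4),
  var1   = factor(rep(c("A", "A", "B", "B"), times = 2)),
  var2   = rep(1:2, times = 4),
  values = c(37L, 20L, 22L, 18L, 30L, 5L, 41L, 50L)
)

# df["values"] keeps the column name; naming the list elements names the groups
agg <- aggregate(df["values"],
                 by  = list(id = df$id, var1 = df$var1),
                 FUN = mean)
print(agg)
```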

This happens because your first argument (`x = df`) asks `aggregate` to apply the function to every column of `df`, including the factor `var1`, and not just the single column `values`.

You want `x = df$values`.

Or use the formula interface, as others have said.
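To illustrate where the NA column comes from: `mean()` applied to a factor returns `NA` with exactly the warning shown in the question, while the numeric columns are averaged as usual (the `var2` and `values` means in the question's output match the other answers). A sketch, assuming the question's `df`; the name `num_cols` is only illustrative:

```r
# Same data as in the question, rebuilt with data.frame()
df <- data.frame(
  id     = rep(1:2, each = 4),
  var1   = factor(rep(c("A", "A", "B", "B"), times = 2)),
  var2   = rep(1:2, times = 4),
  values = c(37L, 20L, 22L, 18L, 30L, 5L, 41L, 50L)
)

# mean() of a factor is NA (with a warning); this is what fills the var1 column
m <- suppressWarnings(mean(df$var1))
print(m)

# Aggregate only the numeric columns that are not grouping variables
num_cols <- setdiff(names(df)[sapply(df, is.numeric)], c("id", "var1"))
agg <- aggregate(df[num_cols], by = df[c("id", "var1")], FUN = mean)
print(agg)
```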

##### Comments

- Do `?aggregate` and read the section under "## S3 method for class 'formula'".
- Because your first argument (`x = df, ...`) asked it to aggregate over all the df's columns (not just `values`). When you use the non-formula interface, you need to specify the column you want to aggregate (`x = df$values, ...`).
- An alternative for the `data.table` option would be: `setDT(df)[, lapply(.SD, mean), by=.(id,var1), .SDcols=c("var2","values")]`
- Nice options. Using the `list` method: `aggregate(df[c('var2', 'values')], df[c('id', 'var1')], FUN=mean)`
- I am not getting the warnings though. Can you check whether `var2` is a factor or not?
- Thanks, I am not sure how it happens. It must be some kind of clash between separating the `x` and `by` variables.