aggregate(df, ...) returning NAs?
I would like to apply the aggregate function to this data frame, grouping by the variables "id" and "var1":
df <- structure(list(
  id     = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  var1   = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L),
                     .Label = c("A", "B"), class = "factor"),
  var2   = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
  values = c(37L, 20L, 22L, 18L, 30L, 5L, 41L, 50L)),
  .Names = c("id", "var1", "var2", "values"),
  class = "data.frame", row.names = c(NA, -8L))

# looks like
> df
  id var1 var2 values
1  1    A    1     37
2  1    A    2     20
3  1    B    1     22
4  1    B    2     18
5  2    A    1     30
6  2    A    2      5
7  2    B    1     41
8  2    B    2     50
However, when I do this I get a lot of warnings and a column full of NAs:
> agg <- aggregate(df, by = list(df$id, df$var1), mean)
Warning messages:
1: In mean.default(X[[i]], ...) : argument is not numeric or logical: returning NA
2: In mean.default(X[[i]], ...) : argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) : argument is not numeric or logical: returning NA
4: In mean.default(X[[i]], ...) : argument is not numeric or logical: returning NA
> agg
  Group.1 Group.2 id var1 var2 values
1       1       A  1   NA  1.5   28.5
2       2       A  2   NA  1.5   17.5
3       1       B  1   NA  1.5   20.0
4       2       B  2   NA  1.5   45.5
Is there a way to prevent these warnings? Has my aggregate result lost any data because of them?
aggregate(. ~ id + var1, data = df, mean)
#  id var1 var2 values
#1  1    A  1.5   28.5
#2  2    A  1.5   17.5
#3  1    B  1.5   20.0
#4  2    B  1.5   45.5
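One caveat worth knowing about the formula interface: it silently drops rows containing NA (its default is na.action = na.omit), which can shrink groups. A minimal sketch of keeping those rows instead, using a copy of the example data with one value set to NA for illustration, is to pass na.action = na.pass and let mean() skip the NAs itself via na.rm = TRUE:

```r
# Example data with one NA introduced for illustration
df2 <- data.frame(id     = rep(1:2, each = 4),
                  var1   = rep(c("A", "A", "B", "B"), 2),
                  var2   = rep(1:2, 4),
                  values = c(NA, 20, 22, 18, 30, 5, 41, 50))

# na.action = na.pass keeps the NA rows in their groups;
# na.rm = TRUE is forwarded to mean() so it ignores the NAs
agg <- aggregate(. ~ id + var1, data = df2, FUN = mean,
                 na.rm = TRUE, na.action = na.pass)
agg
# group (1, A) now averages only the non-NA value, i.e. 20
```

Without na.action = na.pass, the whole first row would be dropped before grouping.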
Here are some other options. With dplyr:
library(dplyr)
df %>%
  group_by(id, var1) %>%
  summarize(var2 = mean(var2), values = mean(values))

# or simply
df %>%
  group_by(id, var1) %>%
  summarise_each(funs(mean))

#Source: local data frame [4 x 4]
#Groups: id
#  id var1 var2 values
#1  1    A  1.5   28.5
#2  2    A  1.5   17.5
#3  1    B  1.5   20.0
#4  2    B  1.5   45.5
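Note that summarise_each()/funs() have since been deprecated; assuming dplyr >= 1.0, the equivalent with across() would be:

```r
library(dplyr)

df <- data.frame(id     = rep(1:2, each = 4),
                 var1   = rep(c("A", "A", "B", "B"), 2),
                 var2   = rep(1:2, 4),
                 values = c(37, 20, 22, 18, 30, 5, 41, 50))

# across(everything(), mean) applies mean to every non-grouping column
df %>%
  group_by(id, var1) %>%
  summarise(across(everything(), mean), .groups = "drop")
```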
With data.table, you have two options:
library(data.table)
setDT(df)[, .(var2 = mean(var2), values = mean(values)), by = .(id, var1)]        # option 1
setDT(df)[, lapply(.SD, mean), by = .(id, var1), .SDcols = c("var2", "values")]   # option 2
#   id var1 var2 values
#1:  1    A  1.5   28.5
#2:  1    B  1.5   20.0
#3:  2    A  1.5   17.5
#4:  2    B  1.5   45.5
And with plyr:

library(plyr)
ddply(df, .(id, var1), colwise(mean))
#  id var1 var2 values
#1  1    A  1.5   28.5
#2  1    B  1.5   20.0
#3  2    A  1.5   17.5
#4  2    B  1.5   45.5
You need to limit the data frame provided for argument x to the columns you want FUN to be applied to. So in your example, you want to apply the mean function to the values column, grouped by id and var1, hence you need to specify df$values instead of the whole df:

agg <- aggregate(df$values, by = list(df$id, df$var1), mean)
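That call labels its output columns Group.1, Group.2 and x. A small base-R touch (not part of the original answer): naming the elements of the by list gives readable group-column names, and the aggregated column can be renamed afterwards:

```r
df <- data.frame(id     = rep(1:2, each = 4),
                 var1   = rep(c("A", "A", "B", "B"), 2),
                 var2   = rep(1:2, 4),
                 values = c(37, 20, 22, 18, 30, 5, 41, 50))

# named list elements become the grouping column names
agg <- aggregate(df$values, by = list(id = df$id, var1 = df$var1), FUN = mean)
names(agg)[names(agg) == "x"] <- "values"
agg
#   id var1 values
# 1  1    A   28.5
# 2  2    A   17.5
# 3  1    B   20.0
# 4  2    B   45.5
```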
Because your first argument (x = df) asked it to aggregate over all of df's columns (not just the single column values), aggregate tried to take the mean of the factor column var1 as well, which is what produced the warnings and the NA column. Or use the formula interface as others have said.
- Do this: ?aggregate and read under "## S3 method for class 'formula'".
- Because your first argument (x = df) asked it to aggregate over all of df's columns (not just values). When you use the non-formula interface, you need to specify the column you want to aggregate.
- An alternative for the data.table option would be: setDT(df)[, lapply(.SD, mean), by = .(id, var1), .SDcols = c("var2", "values")]
- Nice options. Using base R, aggregate(df[c('var2', 'values')], df[c('id', 'var1')], FUN = mean) also works.
- I am not getting the warnings, though. Can you check whether var2 is a factor or not?
- Thanks, I am not sure how it happens. It must be some kind of clash between separating the 'x' and 'by' arguments.
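To settle the question raised in the comments about whether a column is a factor, a quick base-R check of every column's class (the sample data here mirrors the question's structure):

```r
df <- data.frame(id     = c(1L, 2L),
                 var1   = factor(c("A", "B")),
                 var2   = c(1L, 2L),
                 values = c(37L, 41L))

# one class label per column; mean() warns on any factor column,
# which is exactly what produced the NAs in the question
sapply(df, class)
#       id      var1      var2    values
# "integer"  "factor" "integer" "integer"
```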