R: rank individual column data in a large dataframe or matrix

I have a large file of patient data, and I want to rank the values within each column without changing the row order. For example:

patient<-c("a", "b", "c", "d", "e","f")
gene1<-c(500, 490, 500, 750, 550, 500)
gene2<-c(200, 470, 1000, 50, 720, 1100)
x<-data.frame(patient,gene1,gene2)
x
  patient gene1 gene2
1       a   500   200
2       b   490   470
3       c   500  1000
4       d   750    50
5       e   550   720
6       f   500  1100

I want to get something like this...

x
  patient gene1 gene2 
1       a     2     2
2       b     1     3
3       c     6     5
4       d     5     1
5       e     4     4
6       f     3     6

I can do this for individual columns using something like the code below, but I have thousands of columns of patient data to deal with, so doing it one column at a time is unrealistic.

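x <- read.csv("data.csv", row.names = "Patient")
# order() returns the row indices that sort gene1 in ascending order
order.scores <- order(x$gene1)
x$rank <- NA
# the row at sorted position k receives rank k
x$rank[order.scores] <- 1:nrow(x)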

Can anyone suggest a suitable function? Thanks!

Here's one way using the dplyr package. This ranks every column from the second to the last, assuming the first column is always patient.

You also need to pass ties.method = "first" to rank, so that ties are broken by order of appearance: among equal values, the one that appears first gets the lowest rank.

library(dplyr)

x %>% mutate_at(2:ncol(.), rank, ties.method = "first")

  patient gene1 gene2
1       a     2     2
2       b     1     3
3       c     3     5
4       d     6     1
5       e     5     4
6       f     4     6
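The effect of ties.method is easiest to see on gene1, where the value 500 appears three times. The following is a minimal base R sketch using the example data from the question; the variable names (rank_avg, rank_first, rank_min) are only illustrative.

```r
gene1 <- c(500, 490, 500, 750, 550, 500)

# default ("average"): the three tied values share the mean of ranks 2, 3 and 4
rank_avg <- rank(gene1)

# "first": ties are broken by position, so earlier rows get the lower rank
rank_first <- rank(gene1, ties.method = "first")

# "min": every tied value gets the lowest rank of its group
rank_min <- rank(gene1, ties.method = "min")

print(rbind(average = rank_avg, first = rank_first, min = rank_min))
```

Only "first" produces six distinct ranks here, which is why it reproduces the kind of output asked for in the question.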

This loop ranks each column in turn, starting from the second (by default, rank() gives tied values the average of their ranks):

for (i in 2:length(colnames(x))) {
  x[,i] <- rank(x[,i])
}

and yields this result:

  patient gene1 gene2
1       a     3     2
2       b     1     3
3       c     3     5
4       d     6     1
5       e     5     4
6       f     3     6

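Or, with order() in place of rank(). Note that order() returns the row indices that would sort the column, not ranks, so the result is generally not a ranking (here gene1 happens to match rank with ties.method = "first", but gene2 does not):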

for (i in 2:length(colnames(x))) {
  x[,i] <- order(x[,i])
}

yields

  patient gene1 gene2
1       a     2     4
2       b     1     1
3       c     3     2
4       d     6     5
5       e     5     3
6       f     4     6
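The difference between the two loops comes down to what rank() and order() return. As a small sketch on the gene2 column from the question (variable names are illustrative), order() gives the positions of the elements in sorted order, rank() gives each element's rank in place, and applying order() twice recovers the ranks:

```r
gene2 <- c(200, 470, 1000, 50, 720, 1100)

# order(): positions of the elements in ascending order
# (the smallest value, 50, sits in position 4, so the result starts with 4)
ord <- order(gene2)

# rank(): the rank of each element, reported at its original position
rnk <- rank(gene2, ties.method = "first")

# applying order() to the output of order() gives the ranks
rnk_from_order <- order(ord)

print(ord)
print(rnk)
print(rnk_from_order)
```

So if order() is used at all, it has to be applied twice per column, or used as an index as in the question's own code.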

Try out:

library(dplyr)
x %>% mutate_at(vars(starts_with("gene")), rank, ties.method = "first")
# or x %>% mutate_at(vars(contains("gene")), rank, ties.method = "first")
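If dplyr is not available, roughly the same selection by name can be done in base R. This is a sketch, assuming the columns to rank all have names beginning with "gene" as in the example; gene_cols and ranked are illustrative names.

```r
x <- data.frame(
  patient = c("a", "b", "c", "d", "e", "f"),
  gene1 = c(500, 490, 500, 750, 550, 500),
  gene2 = c(200, 470, 1000, 50, 720, 1100)
)

# names of the columns that start with "gene"
gene_cols <- grep("^gene", names(x), value = TRUE)

# rank each selected column; all other columns are left as they are
ranked <- x
ranked[gene_cols] <- lapply(x[gene_cols], rank, ties.method = "first")

print(ranked)
```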

Comments
  • your ranks for gene1 seem wrong.
  • great, thank you! In my data I had actually specified row.names = "patient", so I just switched the code to 1:ncol(.)
  • In that case you can use mutate_all()
  • thanks that worked! My gene list is only 33 genes, so working that into the code is no problem. It would be useful in other circumstances (such as large gene lists) to use a code that ranks for all genes without having to specify the gene names within the code. Any ideas?
  • You can use mutate_at on variables starting with or containing gene
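For the case raised in the comments, where the gene names should not have to appear in the code at all, one option is to select columns by type rather than by name. The sketch below assumes every numeric column is a gene column, which may not hold for real patient data (an age column, for example, would be ranked too). It also shows the matrix case from the title using apply(); is_num, ranked_df, m and ranked_m are illustrative names.

```r
x <- data.frame(
  patient = c("a", "b", "c", "d", "e", "f"),
  gene1 = c(500, 490, 500, 750, 550, 500),
  gene2 = c(200, 470, 1000, 50, 720, 1100)
)

# data frame: rank every numeric column, leave non-numeric columns untouched
is_num <- vapply(x, is.numeric, logical(1))
ranked_df <- x
ranked_df[is_num] <- lapply(x[is_num], rank, ties.method = "first")

# matrix: apply rank() down each column (MARGIN = 2)
m <- as.matrix(x[is_num])
rownames(m) <- x$patient
ranked_m <- apply(m, 2, rank, ties.method = "first")

print(ranked_df)
print(ranked_m)
```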