quantile normalization on pandas dataframe

Simply put, how can I apply quantile normalization to a large pandas DataFrame (probably 2,000,000 rows) in Python?

PS. I know there is a package named rpy2 that can run R in a subprocess, so I could use R's quantile normalization. However, R does not compute the correct result when I use the data set below:

5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
8.535579139044583634e-05,5.128625938538547123e-06,1.635991820040899643e-05,6.291814349531259308e-05,3.006704952043056075e-05,6.881341586355676286e-06
5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
2.845193046348194770e-05,1.538587781561563968e-05,2.944785276073619561e-05,4.194542899687506431e-05,6.013409904086112150e-05,1.032201237953351358e-05

Edit:

What I want:

Given the data shown above, how can I apply quantile normalization following the steps in https://en.wikipedia.org/wiki/Quantile_normalization?

I found a piece of Python code that claims to compute the quantile normalization:

import rpy2.robjects as robjects
import numpy as np
from rpy2.robjects.packages import importr
preprocessCore = importr('preprocessCore')


matrix = [ [1,2,3,4,5], [1,3,5,7,9], [2,4,6,8,10] ]
v = robjects.FloatVector([ element for col in matrix for element in col ])
m = robjects.r['matrix'](v, ncol = len(matrix), byrow=False)
Rnormalized_matrix = preprocessCore.normalize_quantiles(m)
normalized_matrix = np.array( Rnormalized_matrix)

The code works fine with the sample data it ships with, but when I tested it with the data given above, the result was wrong.

Since rpy2 only provides an interface for running R in a subprocess of Python, I ran the same test in R directly, and the result was still wrong. I therefore suspect that the R method itself is at fault.

Using the example dataset from the Wikipedia article:

import pandas as pd

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

df
Out: 
   C1  C2  C3
A   5   4   3
B   2   1   4
C   3   4   6
D   4   2   8

For each rank, the mean value can be calculated with the following:

rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()

rank_mean
Out: 
1    2.000000
2    3.000000
3    4.666667
4    5.666667
dtype: float64

Then the resulting Series, rank_mean, can be used as a mapping for the ranks to get the normalized results:

df.rank(method='min').stack().astype(int).map(rank_mean).unstack()
Out: 
         C1        C2        C3
A  5.666667  4.666667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  4.666667  4.666667
D  4.666667  3.000000  5.666667
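
For a frame with millions of rows, the stack/groupby/map steps above could also be written directly on the underlying NumPy array. The following is only a sketch, not the answerer's code: the helper name `quantile_normalize_min_ties` is made up for illustration, and it assumes the frame is fully numeric with no NaN values. It keeps the same tie rule as above (ties take the smaller rank mean).

```python
import numpy as np
import pandas as pd


def quantile_normalize_min_ties(df):
    """Hypothetical helper: quantile-normalize each column of df.

    Assumes numeric data without NaN. Tied values within a column all
    receive the rank mean of the smallest rank they occupy.
    """
    values = df.to_numpy(dtype=float)
    # Sort every column independently, then average across columns:
    # entry k is the mean of the (k+1)-th smallest value of each column.
    rank_mean = np.sort(values, axis=0).mean(axis=1)
    # Column-wise 'min' ranks, shifted to 0-based positions in rank_mean.
    ranks = df.rank(method='min').to_numpy().astype(int) - 1
    return pd.DataFrame(rank_mean[ranks], index=df.index, columns=df.columns)


df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
normalized = quantile_normalize_min_ties(df)
print(normalized)
```

On the Wikipedia example this should reproduce the table shown above; whether it is actually faster on a given data set would need to be measured.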

OK, I implemented the method myself, and it is relatively efficient.

In hindsight the logic is fairly simple, but I decided to post it here anyway for anyone who feels as confused as I was when I couldn't find working code through Google.

The code is on GitHub: Quantile Normalize

One thing worth noting is that both ayhan's and Shawn's code use the smaller rank mean for ties, whereas normalize.quantiles() from the R package preprocessCore uses the mean of the rank means for ties.

Using the above example:

> df

   C1  C2  C3
A   5   4   3
B   2   1   4
C   3   4   6
D   4   2   8

> normalize.quantiles(as.matrix(df))

         C1        C2        C3
A  5.666667  5.166667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  5.166667  4.666667
D  4.666667  3.000000  5.666667
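
The tie rule described here can be sketched in pandas as follows. This is an illustration of "mean of the rank means for ties", not a port of preprocessCore: the function name `quantile_normalize_mean_ties` is hypothetical, the data is assumed to be numeric without NaN, and agreement with R has only been checked against the small example above.

```python
import numpy as np
import pandas as pd


def quantile_normalize_mean_ties(df):
    """Hypothetical helper: tied values get the average of their rank means.

    Assumes numeric data without NaN and a unique index.
    """
    # Mean across columns of the k-th smallest value in each column.
    rank_mean = np.sort(df.to_numpy(dtype=float), axis=0).mean(axis=1)
    out = {}
    for col in df.columns:
        # 'first' gives every row a distinct rank, even among ties.
        first_rank = df[col].rank(method='first').astype(int).to_numpy()
        assigned = pd.Series(rank_mean[first_rank - 1], index=df.index)
        # Average the assigned rank means within each group of equal values.
        out[col] = assigned.groupby(df[col]).transform('mean')
    return pd.DataFrame(out, index=df.index)


df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
tie_averaged = quantile_normalize_mean_ties(df)
print(tie_averaged)
```

In column C2 the two 4s occupy ranks 3 and 4, so both receive (4.666667 + 5.666667) / 2 = 5.166667, which matches the R output above.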

It may be more robust to use the median of each row rather than the mean (based on code from Shawn. L):

import numpy as np
import pandas as pd

def quantileNormalize(df_input):
    df = df_input.copy()
    # sort each column independently (NaNs first)
    dic = {}
    for col in df:
        dic[col] = df[col].sort_values(na_position='first').values
    sorted_df = pd.DataFrame(dic)
    # reference distribution: median (instead of mean) across each sorted row
    # rank = sorted_df.mean(axis=1).tolist()
    rank = sorted_df.median(axis=1).tolist()
    for col in df:
        # compute the percentile rank in (0, 1] for each score in the column
        t = df[col].rank(pct=True, method='max').values
        # replace each score with the value of the reference distribution
        # at that percentile; NaNs stay NaN
        df[col] = [np.nanpercentile(rank, i * 100) if not np.isnan(i) else np.nan for i in t]
    return df

The code below gives results identical to preprocessCore::normalize.quantiles.use.target, and I find it simpler and clearer than the solutions above. Performance should also be good even for very large arrays.

import numpy as np

def quantile_normalize_using_target(x, target):
    """
    Both `x` and `target` are numpy arrays of equal lengths.
    """

    target_sorted = np.sort(target)

    return target_sorted[x.argsort().argsort()]

Once you have a pandas.DataFrame, this is easy to do:

quantile_normalize_using_target(df[0].to_numpy(),
                                df[1].to_numpy())

(This normalizes the first column to the second one, which serves as the reference distribution.)
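
To see why the double argsort works, here is a small worked example. The function is repeated so the snippet runs on its own, and the arrays are made-up toy values. Calling `argsort()` twice gives each element's zero-based rank, and that rank is then used to pick the value of the same rank from the sorted target. Tied values in `x` still get distinct ranks, so this sketch does not average ties.

```python
import numpy as np


def quantile_normalize_using_target(x, target):
    """Both `x` and `target` are numpy arrays of equal lengths."""
    target_sorted = np.sort(target)
    return target_sorted[x.argsort().argsort()]


x = np.array([3.0, 1.0, 2.0])            # toy input values
target = np.array([10.0, 30.0, 20.0])    # toy reference distribution

ranks = x.argsort().argsort()            # zero-based rank of each element of x
result = quantile_normalize_using_target(x, target)

print(ranks)    # [2 0 1]
print(result)   # [30. 10. 20.]
```

The largest element of `x` receives the largest target value, the smallest receives the smallest, and so on.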

Comments
  • I removed the "R" tag since you (1) aren't using R and (2) don't want R in the answer. But if you say "R cannot compute the correct result", it sounds like you are either disparaging R (to what end?) or want somebody to correct your unposted code. Either way, perhaps I'm misunderstanding what you want: quantile normalization needs a source and target distribution and I'm not certain which you're providing here. Can you clarify, please?
  • @r2evans Thanks for your comment; I have already edited the question. FYI, the code I googled runs R as a subprocess of Python. After running R directly, I found that the result was wrong. Besides, I'm not sure what you mean by 'target distribution'. According to the Wiki, the computation of quantile normalization doesn't involve that term. The question, which I hope is now clear, is how to apply quantile normalization to the data I gave.
  • You are right, my term of "target" isn't really good. The wiki references "making two distributions identical", so I was wondering what your two distributions were. Now that you provided additional code (and data, defined as matrix), I'm confused about which is your actual data to be quant-normed. (Perhaps a stupid question, but is it possible that the matrix is transposed compared with what you actually need?)
  • @r2evans I'm sorry for the confusion I caused. FYI, the actual data is a (2119055,124) matrix. Data I gave above is the tiny subset of it for testing. And yes, I did consider the question of transpose. As you could see, in the sample code, matrix is (3,5), but the normalized result is (5,3), therefore I summarized that to use this code I need to transpose the matrix first. To be more clear, my data is (4,6) and to use the code I will assign transposed data, i.e. (6,4) to variable matrix, and then continue.
  • elegant use of groupby, map, and stacking/unstacking. are you a pandas developer?
  • Thanks. No, I am just a regular user.
  • @ayhan Why did you do different ranking method in the first and second processing line, i.e. first vs min?
  • Just here to say that I made a package/answer called qnorm for Python which does handle ties: stackoverflow.com/a/62792272/9544516