Is there an efficient method of checking whether a column has mixed dtypes?


Consider

np.random.seed(0)
s1 = pd.Series([1, 2, 'a', 'b', [1, 2, 3]])
s2 = np.random.randn(len(s1))
s3 = np.random.choice(list('abcd'), len(s1))


df = pd.DataFrame({'A': s1, 'B': s2, 'C': s3})
df
           A         B  C
0          1  1.764052  a
1          2  0.400157  d
2          a  0.978738  c
3          b  2.240893  a
4  [1, 2, 3]  1.867558  a

Column "A" has mixed data types. I would like to come up with a really quick way of determining this. It would not be as simple as checking whether type == object, because that would identify "C" as a false positive.

I can think of doing this with

df.applymap(type).nunique() > 1

A     True
B    False
C    False
dtype: bool

But calling type through applymap is pretty slow, especially for larger frames.

%timeit df.applymap(type).nunique() > 1
3.95 ms ± 88 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Can we do better (perhaps with NumPy)? I can accept "No" if your argument is convincing enough. :-)


Here is an approach that uses the fact that in Python 3 different types cannot be compared. The idea is to run max over the array; being a builtin, it should be reasonably fast, and it short-circuits, raising as soon as it hits an incomparable pair.

def ismixed(a):
    try:
        max(a)
        return False
    except TypeError as e:  # we take this to imply mixed type
        msg, fst, and_, snd = str(e).rsplit(' ', 3)
        assert msg == "'>' not supported between instances of"
        assert and_ == "and"
        assert fst != snd
        return True
    except ValueError as e:  # catch empty arrays
        assert str(e) == "max() arg is an empty sequence"
        return False

It doesn't catch mixed numeric types, though. Also, objects that just do not support comparison may trip this up.
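To make those caveats concrete, a quick sketch (example inputs of my choosing, not from the original answer):

ismixed(np.array([1, 2.5, 3], dtype=object))
# False -- int/float comparisons succeed, so mixed numerics slip through

ismixed(np.array([{'a': 1}, {'b': 2}], dtype=object))
# AssertionError -- dicts don't support '>', and both type names are 'dict',
# so the fst != snd assert fires instead of returning an answer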

But it's reasonably fast. If we strip away all pandas overhead:

from timeit import timeit

v = df.values

list(map(ismixed, v.T))
# [True, False, False]
timeit(lambda: list(map(ismixed, v.T)), number=1000)
# 0.008936170022934675

For comparison, pandas' infer_dtype (see the next answer):

from pandas.api.types import infer_dtype

timeit(lambda: list(map(infer_dtype, v.T)), number=1000)
# 0.02499613002873957
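As a rough sketch (not part of the original answer; it reuses df, pd, and ismixed from above), the same idea can produce the boolean Series from the question:

def mixed_columns(frame):
    # run ismixed over each column's underlying values
    return pd.Series({col: ismixed(frame[col].values) for col in frame.columns})

mixed_columns(df)
# A     True
# B    False
# C    False
# dtype: bool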



In pandas, there's infer_dtype(), which might be helpful here.

Written in Cython (code link), it returns a string summarising the values in the passed object. It's used a lot in pandas' internals, so we might reasonably expect that it has been designed with efficiency in mind.

>>> from pandas.api.types import infer_dtype

Now, column A is a mix of integers and some other types:

>>> infer_dtype(df.A)
'mixed-integer'

Column B's values are all of floating type:

>>> infer_dtype(df.B)
'floating'

Column C contains strings:

>>> infer_dtype(df.C)
'string'

The general "catchall" type for mixed values is simply "mixed":

>>> infer_dtype(['a string', pd.Timedelta(10)])
'mixed'

A mix of floats and integers is 'mixed-integer-float':

>>> infer_dtype([3.141, 99])
'mixed-integer-float'

To build the check you describe in your question, one approach is a small function that catches the relevant mixed cases (note that 'mixed-integer' is needed because column A infers as 'mixed-integer', not plain 'mixed'):

def is_mixed(col):
    return infer_dtype(col) in ['mixed', 'mixed-integer']

Then you have:

>>> df.apply(is_mixed)
A     True
B    False
C    False
dtype: bool
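If you just want the names of the offending columns, a small follow-up sketch (my addition, reusing is_mixed from above) applies the boolean mask to the column index:

>>> df.columns[df.apply(is_mixed)]
Index(['A'], dtype='object')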



Not sure how you need the result, but you can map type over df.values.ravel() and build a dictionary mapping each column name to whether its slice of the flattened list contains more than one type. Since ravel flattens row by row, the slice l[i::df.shape[1]] picks out exactly the values of column i:

l = list(map(type, df.values.ravel()))
print ({df.columns[i]:len(set(l[i::df.shape[1]])) > 1 for i in range(df.shape[1])})
{'A': True, 'B': False, 'C': False}

Timing:

%timeit df.applymap(type).nunique() > 1
#3.25 ms ± 516 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit 
l = list(map(type, df.values.ravel()))
{df.columns[i]:len(set(l[i::df.shape[1]])) > 1 for i in range(df.shape[1])}
#100 µs ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

EDIT: for a larger dataframe, the improvement in time is less pronounced, though:

dfl = pd.concat([df] * 100000, ignore_index=True)

%timeit dfl.applymap(type).nunique() > 1
#519 ms ± 61.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
l = list(map(type, dfl.values.ravel()))
{dfl.columns[i]:len(set(l[i::dfl.shape[1]])) > 1 for i in range(dfl.shape[1])}
#254 ms ± 33.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A slightly faster solution based on the same idea:

%timeit { col: len(set(map(type, dfl[col])))>1 for col in dfl.columns}
#124 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
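Wrapped as a reusable helper (a sketch of mine, not part of the original answer) that returns the same boolean Series as the question:

def mixed_dtypes(frame):
    # one pass per column: collect the Python types and check for more than one
    return pd.Series({col: len(set(map(type, frame[col]))) > 1 for col in frame.columns})

mixed_dtypes(df)
# A     True
# B    False
# C    False
# dtype: bool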
