How can I replace invalid date strings with a default value in a DataFrame?

I have a DataFrame like the one below.

name   birthdate
-----------------
john   21011990
steve  14021986
bob    
alice  13020198

I want to detect invalid values in the birthdate column and replace them.

The birthdate column uses the date format "DDMMYYYY", but the DataFrame also contains invalid values such as "13020198" and "". I want to change the invalid data to 31125000.

I want a result like the one below.

name   birthdate
-----------------
john   21011990
steve  14021986
bob    31125000
alice  31125000

Thank you.

You can first create a mask of the invalid dates and then update their values:

mask = df.birthdate.apply(lambda x: pd.to_datetime(x, format='%d%m%Y', errors='coerce')).isna()

df.loc[mask, 'birthdate'] = 31125000

    name    birthdate
0   john    21011990
1   steve   14021986
2   bob     31125000
3   alice   31125000
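
A self-contained sketch of the same mask-then-assign idea, assuming the column holds strings. The helper name `invalid_date_mask` and the `min_year` cutoff of 1900 are assumptions for illustration, not part of the answer above. Whether the parser itself rejects a year such as 0198 depends on the timestamp range your pandas version supports, so the sketch checks the year explicitly as well. It also assigns the default as a string, so the column does not end up with mixed types.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["john", "steve", "bob", "alice"],
    "birthdate": ["21011990", "14021986", "", "13020198"],
})

def invalid_date_mask(values, fmt="%d%m%Y", min_year=1900):
    """Hypothetical helper: True where a value is not a date in `fmt`
    or falls before `min_year` (an assumed cutoff)."""
    parsed = pd.to_datetime(values, format=fmt, errors="coerce")
    # NaT marks values that failed to parse; the year test catches
    # dates that parse but are implausible, such as 13-02-0198.
    return parsed.isna() | (parsed.dt.year < min_year)

mask = invalid_date_mask(df["birthdate"])
df.loc[mask, "birthdate"] = "31125000"  # default kept as a string
print(df)
```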

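This would be my solution to keep the values as plain integers (note that the round trip through datetime reorders the digits to YYYYMMDD, as the output shows):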

import pandas as pd
import numpy as np

data = {'name':['J','S','B','A'],'birthdate':[21011990,14021986,'',13020198]}
df = pd.DataFrame(data)
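# Parse as DDMMYYYY; anything that does not parse becomes NaT, then cast to text
df['birthdate'] = pd.to_datetime(df['birthdate'],format='%d%m%Y',errors='coerce').astype(str)
# Strip the dashes from 'YYYY-MM-DD', swap 'NaT' for the default, and convert to int
df['birthdate'] = df['birthdate'].str.replace('-','',regex=True).replace('NaT',31125000,regex=True).astype(int)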
print(df)

Output:

  name  birthdate
0    J   19900121
1    S   19860214
2    B   31125000
3    A   31125000
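
If the original DDMMYYYY order should survive the round trip, one possible variation is to format the parsed dates back with `dt.strftime` instead of stripping dashes. This is a sketch under assumptions: the `DEFAULT` constant and the minimum year of 1900 are illustrative choices, and the result is a column of strings rather than integers.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["J", "S", "B", "A"],
    "birthdate": [21011990, 14021986, "", 13020198],
})

DEFAULT = "31125000"  # placeholder from the question; not a real date

# Cast to str first so integers and the empty string are parsed the same way.
parsed = pd.to_datetime(df["birthdate"].astype(str), format="%d%m%Y", errors="coerce")

# Keep only values that parsed and whose year is plausible (assumed cutoff: 1900).
valid = parsed.notna() & (parsed.dt.year >= 1900)

# Blank out invalid entries, format the rest back to DDMMYYYY, then fill the gaps.
df["birthdate"] = parsed.where(valid).dt.strftime("%d%m%Y").fillna(DEFAULT)
print(df)
```

Integers cannot carry a leading zero, so a date such as 01021990 stored as an integer would arrive as 7 digits; storing the column as strings avoids that ambiguity.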

Of course, it'd be easier if you kept the datetime format; then you could simply use:

df['birthdate'] = pd.to_datetime(df['birthdate'],format='%d%m%Y',errors='coerce').fillna(31125000)
print(df)

And you'd get:

  name            birthdate
0    J  1990-01-21 00:00:00
1    S  1986-02-14 00:00:00
2    B             31125000
3    A             31125000

Create a mask with to_datetime and errors='coerce', test for the missing values produced where the format does not match, and finally set the new values with Series.mask:

m = pd.to_datetime(df['birthdate'], format='%d%m%Y', errors='coerce').isna()
df['birthdate'] = df['birthdate'].mask(m, 31125000)

Or use @Chris A's solution from the comments, with DataFrame.loc:

df.loc[m, 'birthdate'] = 31125000

print (df)
    name birthdate
0   john  21011990
1  steve  14021986
2    bob  31125000
3  alice  31125000

Comments
  • Why is 13020198 invalid? It follows the DDMMYYYY format
  • Do you want to preserve the dates as datetime, or keep the format?
  • df.loc[pd.to_datetime(df['birthdate'], format='%d%m%Y', errors='coerce').isna(), 'birthdate'] = '31125000' ..?
  • Wouldn't it be enough to check whether the last four digits are in a range, such as between 1900 and 2020?
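
On the first and last comments: a small pure-Python sketch suggests the two checks catch different problems, so neither alone is enough. The function names and the sample value "31021990" are hypothetical; the 1900 to 2020 range is taken from the comment. "13020198" is a real calendar date (13 February of year 198) but fails the year range, while "31021990" has a plausible year but is not a real date (31 February).

```python
from datetime import datetime

def is_real_date(value, fmt="%d%m%Y"):
    """True if `value` parses as a calendar date in `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

def year_in_range(value, low=1900, high=2020):
    """True if `value` has 8 characters and its last four form a year in [low, high]."""
    year = value[-4:]
    return len(value) == 8 and year.isdigit() and low <= int(year) <= high

samples = ["21011990", "13020198", "31021990", ""]
# Map each sample to (parses as a date, year is plausible).
results = {s: (is_real_date(s), year_in_range(s)) for s in samples}
for value, checks in results.items():
    print(repr(value), checks)
```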