Dask dataframe: how to convert a column to datetime

I am trying to convert one column of my dataframe to datetime. Following the discussion at https://github.com/dask/dask/issues/863, I tried the following code:

import dask.dataframe as dd
import pandas as pd

df['time'].map_partitions(pd.to_datetime, columns='time').compute()

But I am getting the following error message:

ValueError: Metadata inference failed, please provide `meta` keyword

What exactly should I put under meta? Should I supply a dictionary of ALL the columns in df, or only the 'time' column? And what type should I put? I have tried dtype and datetime64, but none of them work so far.

Thank you, and I appreciate your guidance.

Update

I will include here the new error messages:

1) Using Timestamp

df['trd_exctn_dt'].map_partitions(pd.Timestamp).compute()

TypeError: Cannot convert input to Timestamp

2) Using datetime and meta

meta = ('time', pd.Timestamp)
df['time'].map_partitions(pd.to_datetime, meta=meta).compute()

TypeError: to_datetime() got an unexpected keyword argument 'meta'

3) Just using date time: gets stuck at 2%

df['trd_exctn_dt'].map_partitions(pd.to_datetime).compute()
[                                        ] | 2% Completed |  2min 20.3s

Also, I would like to be able to specify the date format, as I would do in pandas:

pd.to_datetime(df['time'], format='%m%d%Y')

Update 2

After updating to Dask 0.11, I no longer have problems with the meta keyword. Still, I can't get it past 2% on a 2GB dataframe.

df['trd_exctn_dt'].map_partitions(pd.to_datetime, meta=meta).compute()
[                                        ] | 2% Completed |  30min 45.7s

Update 3

It worked better this way:

meta = ('time', 'datetime64[ns]')  # dtype spelling that newer pandas accepts

def parse_dates(df):
    # parse the 'time' column of each pandas partition with an explicit format
    return pd.to_datetime(df['time'], format='%m/%d/%Y')

df.map_partitions(parse_dates, meta=meta)

I'm not sure whether this is the right approach or not.

Use astype

You can use the astype method to convert the dtype of a series to a NumPy dtype:

df.time.astype('M8[us]')

There is probably a way to specify a Pandas style dtype as well (edits welcome)
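
A minimal runnable sketch of the astype route, assuming an ISO-formatted string column (the frame and column name are made up for illustration):

import pandas as pd
import dask.dataframe as dd

# hypothetical frame with ISO-formatted date strings
pdf = pd.DataFrame({'time': ['2016-01-01', '2016-01-02', '2016-01-03']})
df = dd.from_pandas(pdf, npartitions=2)

# 'M8[ns]' is NumPy shorthand for datetime64 with nanosecond precision,
# the unit pandas uses by default
df['time'] = df['time'].astype('M8[ns]')
print(df.compute().dtypes)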

Use map_partitions and meta

When using black-box methods like map_partitions, dask.dataframe needs to know the type and names of the output. There are a few ways to do this listed in the docstring for map_partitions.

You can supply an empty Pandas object with the right dtype and name:

meta = pd.Series([], name='time', dtype=pd.Timestamp)

Or you can provide a tuple of (name, dtype) for a Series, or a dict for a DataFrame:

meta = ('time', pd.Timestamp)

Then everything should be fine:

df.time.map_partitions(pd.to_datetime, meta=meta)

If you were calling map_partitions on df instead then you would need to provide the dtypes for everything. That isn't the case in your example though.
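
Putting the pieces together, here is a minimal runnable sketch (the sample data and format string are made up; the 'datetime64[ns]' dtype spelling is the one reported to work without warnings in the comments below):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'time': ['01152016', '02292016', '12012016']})
df = dd.from_pandas(pdf, npartitions=2)

# meta tells dask the name and dtype of the output series up front
meta = ('time', 'datetime64[ns]')

# extra keyword arguments such as format are forwarded to pd.to_datetime
result = df['time'].map_partitions(pd.to_datetime, format='%m%d%Y', meta=meta)
print(result.compute())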

Dask also comes with dd.to_datetime, so this should work as well:

df['time'] = dd.to_datetime(df.time, unit='ns')

The values that unit accepts are the same as for pd.to_datetime in pandas; see the pandas documentation.
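
As a hedged sketch (the sample data is made up), dd.to_datetime also accepts a format string, mirroring the pandas signature:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'time': ['01/15/2016', '02/29/2016']})
df = dd.from_pandas(pdf, npartitions=1)

# dd.to_datetime mirrors pd.to_datetime and returns a lazy datetime series
df['time'] = dd.to_datetime(df['time'], format='%m/%d/%Y')
print(df.compute())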


I'm not sure if this is the right approach, but mapping the column worked for me:

df['time'] = df['time'].map(lambda x: pd.to_datetime(x, errors='coerce'))
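
Note that Series.map calls pd.to_datetime once per element. Converting each partition in a single vectorized call is usually faster; here is a sketch of the equivalent, reusing the same df and pd names (the meta tuple is an assumption about the column's name and target dtype):

# one vectorized pd.to_datetime call per partition instead of per element;
# errors='coerce' still turns unparseable values into NaT
df['time'] = df['time'].map_partitions(
    pd.to_datetime, errors='coerce', meta=('time', 'datetime64[ns]'))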


This worked for me:

ddf["Date"] = ddf["Date"].map_partitions(pd.to_datetime, format='%d/%m/%Y', meta=('Date', 'datetime64[ns]'))


If the datetime is in a non-ISO format then map_partitions yields better results:

import dask
import pandas as pd
from dask.distributed import Client
client = Client()

ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = (ddf.assign(datetime_nonISO=ddf['datetime'].astype(str).str.split(' ')
                  .apply(lambda x: x[1] + ' ' + x[0], meta=('object'))))

%%timeit
ddf.datetime = ddf.datetime.astype('M8[s]')
ddf.compute()

11.3 s ± 719 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = (ddf.assign(datetime_nonISO=ddf['datetime'].astype(str).str.split(' ')
                  .apply(lambda x: x[1] + ' ' + x[0], meta=('object'))))


%%timeit
ddf.datetime_nonISO = (ddf.datetime_nonISO.map_partitions(pd.to_datetime,
                       format='%H:%M:%S %Y-%m-%d', meta=('datetime64[s]')))
ddf.compute()

8.78 s ± 599 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = (ddf.assign(datetime_nonISO=ddf['datetime'].astype(str).str.split(' ')
                  .apply(lambda x: x[1] + ' ' + x[0], meta=('object'))))

%%timeit
ddf.datetime_nonISO = ddf.datetime_nonISO.astype('M8[s]')
ddf.compute()

1min 8s ± 3.65 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Comments
  • What dask version are you on?
  • MRocklin, you were right. I updated to version 0.11 and no longer get any problems with the meta keyword. Still, it reaches 1-2% in less than 30 seconds, but then gets stuck there for an hour. Any suggestions?
  • I think I semi-solved it by defining a function to parse the dates and applying it using map_partitions.
  • In our experience, using the format keyword always improves performance.
  • Thank you MRocklin! Please see my updates in the question.
  • This does not work for me anymore with pandas 0.20; I get "dtype <class 'pandas._libs.tslib.Timestamp'> not understood". It works, however, with meta = ('time', np.datetime64).
  • It worked for me with no future-deprecation warning using meta = ('time', 'datetime64[ns]').
  • Here is a link to the NumPy datetime documentation.
  • What if the month is in this format: 'JAN' (note it is uppercase)?
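
Regarding the last comment: strptime-style month matching is case-insensitive in CPython (and in the parser pandas ports from it), so %b should accept 'JAN' as well as 'Jan'. A quick hedged check with made-up strings:

import pandas as pd

# %b matches abbreviated month names, case-insensitively
s = pd.Series(['15JAN2017', '03FEB2017'])
print(pd.to_datetime(s, format='%d%b%Y'))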