Pandas Filtering and convert to Date to datetime64ns

vaex
pandas to_datetime
convert timestamp to date python pandas
pandas datetime cheat sheet
working with dates in pandas
pandas to_datetime timezone
extract time from datetime pandas

I am trying to figure out a problem but so far I could not find any solution I hope you might can help. I have a DataFrame and I would like to convert str to datatime but there are some invalid rows which I would like to filter out. Here are two examples:

Out[6]:
  #  name    date
  0  aa      2012-11-30T14:00:00+01:00
  1  bb      2012-12-01T08:16:00+01:00
  2  cc      2012-12-01T10:14:00+01:00
  3  ee      2012-12-01T11:05:00+01:00
  4  gg      2012-12-01T11:05:00+01:00

In [7]: df2
Out[7]:
  #  name    date
  0  aa      2012-11-30T14:00:00+01:00
  1  bb      2012-12-01T08:16:00+01:00
  2  cc      2012-12-01T10:14:00+01:00
  3  ee      2012-12-01T11:05:00+01:00
  4  ff      fsadfi2 2ih3ro
  5  gg      2012-12-01T11:05:00+01:00
In [11]: df.dtypes
Out[11]:
name    <class 'str'>
date    <class 'str'>
dtype: object

In [12]: df2.dtypes
Out[12]:
name    <class 'str'>
date    <class 'str'>
dtype: object

df I fine, it only has valid dates in the date column. But df2 has some invalid lines. Let's see df first with the following line I can convert to datetime:

df['pdate']=df.date.values.astype('datetime64[ns]')

Works well:

In [16]: df
Out[16]:
  #  name    date                       pdate
  0  aa      2012-11-30T14:00:00+01:00  2012-11-30 13:00:00.000000000
  1  bb      2012-12-01T08:16:00+01:00  2012-12-01 07:16:00.000000000
  2  cc      2012-12-01T10:14:00+01:00  2012-12-01 09:14:00.000000000
  3  ee      2012-12-01T11:05:00+01:00  2012-12-01 10:05:00.000000000
  4  gg      2012-12-01T11:05:00+01:00  2012-12-01 10:05:00.000000000

In [17]: df.dtypes
Out[17]:
name      <class 'str'>
date      <class 'str'>
pdate    datetime64[ns]
dtype: object

Now I try to filter out with a very simple str.contains::

In [18]: df2_filtered=df2[df2['date'].str.contains(':00')]

In [19]: df2_filtered
Out[19]:
  #  name    date
  0  aa      2012-11-30T14:00:00+01:00
  1  bb      2012-12-01T08:16:00+01:00
  2  cc      2012-12-01T10:14:00+01:00
  3  ee      2012-12-01T11:05:00+01:00
  4  gg      2012-12-01T11:05:00+01:00

In [20]: df2_filtered.dtypes
Out[20]:
name    <class 'str'>
date    <class 'str'>
dtype: object

It has only 5 Rows. Now I try to convert and I get a nice error message:

In [21]: df2_filtered['pdate']=df2_filtered.date.values.astype('datetime64[ns]')
    ...:
/usr/local/bin/ipython:1: DeprecationWarning: parsing timezone aware datetimes is deprecated; this will raise an error in the future
  #!/opt/local/Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-21-563087d6f949> in <module>
----> 1 df2_filtered['pdate']=df2_filtered.date.values.astype('datetime64[ns]')

/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py in __setitem__(self, name, value)
   4370         if isinstance(name, six.string_types):
   4371             if isinstance(value, (np.ndarray, Column)):
-> 4372                 self.add_column(name, value)
   4373             else:
   4374                 self.add_virtual_column(name, value)

/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py in add_column(self, name, data, dtype)
   5743         #     self._length_original = len(data)
   5744         #     self._index_end = self._length_unfiltered
-> 5745         super(DataFrameArrays, self).add_column(name, data, dtype=dtype)
   5746         self._length_unfiltered = int(round(self._length_original * self._active_fraction))
   5747         # self.set_active_fraction(self._active_fraction)

/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py in add_column(self, name, f_or_array, dtype)
   2872                     # give a better warning to avoid confusion
   2873                     if len(self) == len(ar):
-> 2874                         raise ValueError("Array is of length %s, while the length of the DataFrame is %s due to the filtering, the (unfiltered) length is %s." % (len(ar), len(self), self.length_unfiltered()))
   2875                 raise ValueError("array is of length %s, while the length of the DataFrame is %s" % (len(ar), self.length_original()))
   2876             # assert self.length_unfiltered() == len(data), "columns should be of equal length, length should be %d, while it is %d" % ( self.length_unfiltered(), len(data))

ValueError: Array is of length 5, while the length of the DataFrame is 5 due to the filtering, the (unfiltered) length is 6.

Saying: ValueError: Array is of length 5, while the length of the DataFrame is 5 due to the filtering, the (unfiltered) length is 6.

But as far as I understand in the df2_filtered I have only 5 rows. I do not know why does it matter how many rows are in the df2.

Basically my question is how can I filter out the unnecessary data and convert the column to a Datetime?

UPDATE

Based on Maarten Breddels I have tried to use:

df2_filtered['pdate']=df2_filtered.date.astype('datetime64[ns]')

Which seems working but when I try to use df2_filtered I get the following.

In [57]: df2_filtered
Out[57]: ERROR:MainThread:vaex:error evaluating: pdate at rows 0-5
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 94, in evaluate
    result = self[expression]
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 141, in __getitem__
    raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: 'Unknown variables or column: "astype(date, \'datetime64[ns]\')"'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py", line 3467, in table_part
    values[name] = df.evaluate(name)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py", line 5038, in evaluate
    dtype = dtypes[expression] = self.dtype(expression, internal=False)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py", line 2005, in dtype
    data = self.evaluate(expression, 0, 1, filtered=False, internal=True, parallel=False)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py", line 5143, in evaluate
    value = scope.evaluate(expression)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 94, in evaluate
    result = self[expression]
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 136, in __getitem__
    self.values[variable] = self.evaluate(expression)  # , out=self.buffers[variable])
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 100, in evaluate
    result = eval(expression, expression_namespace, self)
  File "<string>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/functions.py", line 2106, in _astype
    return x.astype(dtype)
AttributeError: 'ColumnStringArrow' object has no attribute 'astype'
ERROR:MainThread:vaex:error evaluating: pdate at rows 0-5
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 94, in evaluate
    result = self[expression]
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 141, in __getitem__
    raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: 'Unknown variables or column: "astype(date, \'datetime64[ns]\')"'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py", line 3467, in table_part
    values[name] = df.evaluate(name)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py", line 5038, in evaluate
    dtype = dtypes[expression] = self.dtype(expression, internal=False)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py", line 2005, in dtype
    data = self.evaluate(expression, 0, 1, filtered=False, internal=True, parallel=False)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/dataframe.py", line 5143, in evaluate
    value = scope.evaluate(expression)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 94, in evaluate
    result = self[expression]
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 136, in __getitem__
    self.values[variable] = self.evaluate(expression)  # , out=self.buffers[variable])
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/scopes.py", line 100, in evaluate
    result = eval(expression, expression_namespace, self)
  File "<string>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/vaex/functions.py", line 2106, in _astype
    return x.astype(dtype)
AttributeError: 'ColumnStringArrow' object has no attribute 'astype'

  #  name    date                       pdate
  0  aa      2012-11-30T14:00:00+01:00  error
  1  bb      2012-12-01T08:16:00+01:00  error
  2  cc      2012-12-01T10:14:00+01:00  error
  3  ee      2012-12-01T11:05:00+01:00  error
  4  gg      2012-12-01T11:05:00+01:00  error

vaex main author here. Good question, and what halumpago is correct, df2_filtered is internally still 6 rows long. What you try to do with the following:

# Adds a numpy arrays to the dataframe
df2_filtered['pdate'] = df2_filtered.date.values.astype('datetime64[ns]')

Is to add an array of length 5 (that is what .values gives you back) to a DataFrame that internally has a bunch of arrays of length 6. This is what the error message tries to convey. If the error is unclear, of something else, let us know at https://github.com/vaexio/vaex/issues . In theory we could support this, but we'd have to create an array of length 6, and copy you data in it. That wouldn't be ideal when you work with 1 billion rows.

In vaex, you preferrably do not work with the underlying arrays (ok, sometimes you need them, so that's why we have support for .values and friends). Instead, the expression system tries to mimic a Pandas series / numpy array as close as possible. If you remove the .values, you will add a new virtual column to the DataFrame instead of an array:

# Adds a virtual column (backed by an expression) to the dataframe
# at zero memory cost
df2_filtered['pdate']=df2_filtered.date.astype('datetime64[ns]')

Vaex will happily store that expression, and evaluate it only for the rows that you have not filtered out. Note that if you print out the size of the DataFrame in bytes before and after this new column, it's memory usage is the same:

print(df2.nbytes)
648

Note: We are ignoring the memory usage of the virtual columns bookkeeping, which are a few bytes of course, but negligible when you work with 100's of GB's of data).

Just for fun, to break the illusion that vaex DataFrame are the same as Pandas dataframes, you can actually remove your filter:

print(df2_filtered.drop_filter())
  #  name    date                       pdate
  0  aa      2012-11-30T14:00:00+01:00  2012-11-30 13:00:00.000000000
  1  bb      2012-12-01T08:16:00+01:00  2012-12-01 07:16:00.000000000
  2  cc      2012-12-01T10:14:00+01:00  2012-12-01 09:14:00.000000000
  3  ee      2012-12-01T11:05:00+01:00  2012-12-01 10:05:00.000000000
  4  ff      fsadfi2 2ih3ro             NaT
  5  gg      2012-12-01T11:05:00+01:00  2012-12-01 10:05:00.000000000

So the data was actually never really gone, we just hide it from you :).

This allows you to work with a massively large Vaex DataFrames and add many new columns, do a lot of filtering, and still have no MemoryError. We keep referring to the original data and just keep masks for filters, and expressions for calculations.

Getting started with Pandas time-series functionality, To instruct Pandas to convert the values, use the parse_dates Notice how the Dtype of the column date is datetime64[ns] . Pandas provides convenient filtering methods for a DataFrame when its index is a datetime type. pages int64 title object is_loaded bool loading_datetime datetime64[ns, UTC] datetime_col object if the column for date is stored as object then it should be converted to datetime. Another possible way to verify the data is by:

IIUC, pd.to_datetime which allows you to convert the column into the DateTime with certain keyword arguments. in this instance you need errors='coerce'

print(df)

   name                       date
0   aa  2012-11-30T14:00:00+01:00
1   bb  2012-12-01T08:16:00+01:00
2   cc  2012-12-01T10:14:00+01:00
3   ee  2012-12-01T11:05:00+01:00
4   ff              fsadfi22ih3ro
5   gg  2012-12-01T11:05:00+01:00

df['date'] = pd.to_datetime(df['date'],errors='coerce')
print(df)
      name                      date
0   aa 2012-11-30 14:00:00+01:00
1   bb 2012-12-01 08:16:00+01:00
2   cc 2012-12-01 10:14:00+01:00
3   ee 2012-12-01 11:05:00+01:00
4   ff                       NaT
5   gg 2012-12-01 11:05:00+01:00

now simply drop the rows with a .dropna() whilst subsetting the date column.

df.dropna(subset=['date'])
print(df)
    name                      date
0   aa 2012-11-30 14:00:00+01:00
1   bb 2012-12-01 08:16:00+01:00
2   cc 2012-12-01 10:14:00+01:00
3   ee 2012-12-01 11:05:00+01:00
5   gg 2012-12-01 11:05:00+01:00

print(df.dtypes)
name                                  object
date    datetime64[ns, pytz.FixedOffset(60)]
dtype: object

Time series / date functionality — pandas 1.1.0 documentation, Manipulating and converting date times with timezone information Out[45]: DatetimeIndex(['2012-01-04 10:00:00'], dtype='datetime64[ns]', freq=None) In [46 ]:� How do I convert a numpy.datetime64 object to a datetime.datetime (or Timestamp)? In the following code, I create a datetime, timestamp and datetime64 objects. import datetime import numpy as np import pandas as pd dt = datetime.datetime(2012, 5, 1) # A strange way to extract a T

I unfortunately don't have a full answer, but I might have an idea for this portion of your question:

I do not know why does it matter how many rows are in the df2.

It matters because, from what I understand, vaex constructs new columns by storing the operations that define that column (looks like they call them "Virtual Columns"). Pandas, in comparison, constructs new columns by computing, storing, and copying the actual values for the new column.

Pandas is really hard on memory, but gives you a lot of flexibility with moving your data around. You're not going to have this kind of flexibility with Virtual Columns, but your program will likely be better with memory usage.

Take a look at your line that's doing the filtering:

df2_filtered=df2[df2['date'].str.contains(':00')]

Unlike in Pandas, df2_filtered is not an actual "thing" in memory. Rather, it's a reference to the original df2 dataframe plus some extra logic that tells vaex to ignore anything that doesn't end with ':00'.

So, when you run:

df2_filtered['pdate']=df2_filtered.date.values.astype('datetime64[ns]')

you're effectively asking vaex to create a new column in df2 because df2_filtered is actually just a filtered reference to df2. vaex doesn't know what to do with the rows that were filtered out of df2, so it throws the error you're seeing.

So, to do the conversion, you need some way to fill in the missing values for df2. I'm unfortunately not familiar enough with vaex to be able to help with this. I tried out @datanovice's approach, but vaex was complaining that it doesn't know what to do with NaT values.

Working with Date and Time, Python Comprehensions � Python Lambda, Reduce, Map and Filter � Python This introduction to pandas is derived from Data School's pandas Q&A we will convert the Time column to datatime format # there are many options to Shape Reported object State object Time datetime64[ns] dtype: object. Pandas to_datetime () (pd.to_datetime ()) Function Is Smart to Convert to Datetime pandas.to_datetime () function could do the conversion to datetime in a smart way without being given the datetime format string. It will find the string pattern automatically and smartly.

Working with Pandas datetime, In this post we will explore the Pandas datetime methods which can be used We are using **parse_date** attribute to parse and convert the date columns in the csv Filter all rows between two dates i.e. 1989-JAN and 1995-Apr here 01:00:00', '2018-01-01 02:00:00'], dtype='datetime64[ns]', freq='H')� Convert the column type from string to datetime format in Pandas dataframe 21-01-2019 While working with data in Pandas, it is not an unusual thing to encounter time series data and we know Pandas is a very useful tool for working with time series data in python.

How do I convert dates in a Pandas DataFrame to a DateTime data , How do I convert dates in a Pandas DataFrame to a DateTime data datetime64 [ns] State object dtype: object C:\python\pandas examples>. pandas.Series.dt.tz_convert¶ Series.dt.tz_convert (* args, ** kwargs) [source] ¶ Convert tz-aware Datetime Array/Index from one time zone to another. Parameters tz str, pytz.timezone, dateutil.tz.tzfile or None. Time zone for time. Corresponding timestamps would be converted to this time zone of the Datetime Array/Index.

How to filter Pandas DataFrame rows by date in Python, How to filter Pandas DataFrame rows by date in Python. Filtering pandas. Warning: The data in the date column must be converted to datetime if it is in string� Working with date and time in pandas. City Colors Reported Shape Reported State Time Year; 0: Ithaca: NaN: TRIANGLE: NY: 1930-06-01 22:00:00

Comments
  • df['Date'] = pd.to_datetime(df['Date'],errors='coerce') then filter out NaT rows.
  • Hi Maarten, thanks for the detailed answer. Just let me say that Vaex is great! Of course there are a lot if thinks to learn but it is great so far. :) So If I understand right you are saying In [55]: df2_filtered['pdate']=df2_filtered.date.astype('datetime64[ns]') ...: should solve my problem? I have tried it works but when I ran df2_filtered I get a an other error. It is too long I can not even type it here..
  • Glad you like it, thanks! It works for me for vaex-core version 1.3, if not, then you may have found a bug, which you could report on github.com/vaexio/vaex/issues . If you could leave a piece of code that I can run, that would be great. Make sure you have the latest versions as well.
  • I have updated the original comment with the error message what I am getting now. I am not sure why is it working for you but I am pretty sure you know vaex a bit more than me.. :)
  • I just double checked vaex-core (1.3.0) - Core of vaex INSTALLED: 1.3.0 (latest) ,so I am on 1.3 as well.
  • Yeah, you have found an bug. It does hot happend in my case because the underlying data is a numpy array. Would you mind opening an issue. If I fix this, my answer should be work.
  • Thanks, but unfortunately this gives me an error just like to halumpago, Vaex does not know what to do with NaT values.
  • Thanks, that makes a bit clear but somehow seems a bit odd , I understand you vaex never copies data just hide filtered data, but if you filter and you do anything on that filtered data that still should not impact the other rows... at least that would be nice :)