BUG: Regression - AmbiguousTimeError creating DataFrame #42505

kenahoo · 2021-07-12T14:50:03Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

Pandas 1.25:

import pandas as pd
dt = pd.to_datetime('2019-11-03 01:00:00-0700').tz_convert('America/Los_Angeles')
pd.DataFrame({'dt': dt, 'value': [1]})
#                          dt  value
# 0 2019-11-03 01:00:00-07:00      1

Pandas 1.3:

import pandas as pd
dt = pd.to_datetime('2019-11-03 01:00:00-0700').tz_convert('America/Los_Angeles')
pd.DataFrame({'dt': dt, 'value': [1]})

#  File "pandas/_libs/tslibs/tzconversion.pyx", line 284, in pandas._libs.tslibs.tzconversion.tz_localize_to_utc
# pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from 2019-11-03 01:00:00, try using the 'ambiguous' argument

Problem description

The dt object already has a timezone attribute, so it shouldn't be converting/inferring at all.

Expected Output

1.25 and 1.3.0 should have the same output.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : f00ed8f
python : 3.7.3.final.0
python-bits : 64
OS : Darwin
OS-release : 20.5.0
Version : Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0
numpy : 1.21.0
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.2
setuptools : 57.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.1 (dt dec pq3 ext lo64)
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

The text was updated successfully, but these errors were encountered:

mzeitlin11 · 2021-07-12T15:51:13Z

Thanks for reporting this @kenahoo! Bisection indicates this changed with #39767, further investigations welcome!

rhshadrach · 2021-07-13T20:19:35Z

@mzeitlin11 - I think standard practice is to cc the author of a PR where a regression has (or may have) occurred.

cc @jbrockmendel

jmcomie · 2021-07-16T20:07:51Z

Brainstorm: Is it ever good practice to support a conversion from an offset to a named timezone? If we already "know" that 2019-11-03 01:00:00-0700 corresponds to the Los Angeles timezone, we can of course say that this is the first 1AM of the evening and it's therefore not an ambiguous time, but can that be generalized? If not it strikes me that the AmbiguousTimeError exception should simply occur on the initial tz_convert, and that it should be up to the user to strip the offset data from the datetime string and force the conversion a different way. Getting timezones right is often a pain as an end user but ultimately I think it's better to not shoulder ambiguities that the end user should resolve.

jbrockmendel · 2021-07-16T21:31:57Z

based on a quick look:

if you pass {"dt": [dt]} instead of {"dt": dt} it works as expected
otherwise, inside sequence_to_dt64ns we go through objects_to_datetime64ns after which we cast to dt64naive. that is my best guess as to where we need to tinker

kenahoo · 2021-07-17T15:27:15Z

@jmcomie - I don't think that analysis is correct. Once the timezone or offset is attached, you know the exact instant (in seconds past 1970 GMT), so attaching a new timezone is a simple matter of reinterpreting the clock/calendar time on the wall.

Additional fact: in my domain, the data 2019-11-03 01:00:00-0700 does not actually correspond to the Los Angeles timezone, sometimes it's quite literally 7 hours offset all year (someone's idea of a way to avoid DST problems), so it's 1 hour off of LA time in the summer. So our process needs to be: read in the time with the given -0700 offset (which is a perfectly valid set of data, even if weird to a user's eyes), then convert to user-friendly timezone for display and for calculations that depend on what day it is (energy used by day, etc.).

jmcomie · 2021-07-17T15:45:14Z

@jmcomie - I don't think that analysis is correct. Once the timezone or offset is attached, you know the exact instant (in seconds past 1970 GMT), so attaching a new timezone is a simple matter of reinterpreting the clock/calendar time on the wall.

Touché. You're right -- both time representations are precise and the conversion should be valid.

simonjayhawkins · 2021-10-16T19:24:18Z

changing milestone to 1.3.5

kenahoo · 2021-11-04T22:45:51Z

@simonjayhawkins the DST transition is coming this weekend and this issue does not seem to be fixed. Is there a workaround, or a chance of fixing it yet this week?

kenahoo · 2021-11-04T22:49:41Z

Additional test case - under pandas version 1.2.5, this works:

pd.DataFrame({'dt': dt, 'value': [1]})

but this throws the above error:

pd.DataFrame({'dt': dt, 'value': [1, 2, 3]})

(where dt is pd.to_datetime('2021-11-07 01:00:00-0400').tz_convert('America/New_York').)

jbrockmendel · 2021-11-05T22:21:34Z

Is there a workaround, or a chance of fixing it yet this week?

Does the workaround #42505 (comment) not work for you? There is essentially no chance of a release with a fix for this in the next week, no. The next planned release is end of November I think. A PR addressing this would be welcome.

kenahoo · 2021-11-06T04:40:37Z

@jbrockmendel - no, that workaround doesn't apply in all cases. For example, the second case I showed above would have to use [dt] * 3 or similar, rather than just [dt].

Am I correct that the proposed workaround is to simply never use a scalar datetime when creating a DataFrame? That's not really feasible to implement in all cases, partly because there are so many "array types" that can be passed to DataFrame().

In my code, I'm handed a dict containing the data to put into a DataFrame. We would have to scan through all the columns, detect any that pandas will interpret as an array (including lists, numpy arrays, Series, "list-like objects", tuples, etc. - I'm not sure the complete set), find the max length across all such columns, then find all datetime-like columns that are specified as scalars (I'm not sure the extent of types to consider - is it just pd.Timestamp? Or are numpy and datetime items problematic too?), and finally do some [val] * max_n transformation on them to protect them from whatever's causing this error inside numpy.

simonjayhawkins · 2021-11-08T12:51:45Z

I not sure working around this issue is necessarily the best way forward.

The release note from the commit that caused the change in behavior is...

Deprecated casting datetime.date objects to datetime64 when used as fill_value in :meth:DataFrame.unstack, :meth:DataFrame.shift, :meth:Series.shift, and :meth:DataFrame.reindex, pass pd.Timestamp(dateobj) instead (:issue:39767)

This is in no way related to the change in behavior reported in this issue.

PR restoring the 1.2.5 behavior is welcome.

AFAICT

we have the exception raised from pd.core.arrays.datetimes.DatetimeArray._from_sequence with the dtype specified.

on master and 1.2.5 we have

dt = pd.to_datetime("2019-11-03 01:00:00-0700").tz_convert("America/Los_Angeles")
print(dt)
dtype = pd.DataFrame({"dt": [dt], "value": [1]}).dtypes[0]
print(dtype)
arr = pd.core.arrays.datetimes.DatetimeArray._from_sequence([dt])
dtype_inferred = arr.dtype
print(dtype_inferred)
print(dtype == dtype_inferred)
print(arr)
pd.core.arrays.datetimes.DatetimeArray._from_sequence([dt], dtype=dtype)

2019-11-03 01:00:00-07:00
datetime64[ns, America/Los_Angeles]
datetime64[ns, America/Los_Angeles]
True
<DatetimeArray>
['2019-11-03 01:00:00-07:00']
Length: 1, dtype: datetime64[ns, America/Los_Angeles]
---------------------------------------------------------------------------
AmbiguousTimeError                        Traceback (most recent call last)
/tmp/ipykernel_4117/1065977013.py in <module>
      8 print(dtype == dtype_inferred)
      9 print(arr)
---> 10 pd.core.arrays.datetimes.DatetimeArray._from_sequence([dt], dtype=dtype)

~/miniconda3/envs/pandas-1.2.5/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py in _from_sequence(cls, scalars, dtype, copy)
    305     @classmethod
    306     def _from_sequence(cls, scalars, *, dtype=None, copy: bool = False):
--> 307         return cls._from_sequence_not_strict(scalars, dtype=dtype, copy=copy)
    308 
    309     @classmethod

~/miniconda3/envs/pandas-1.2.5/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py in _from_sequence_not_strict(cls, data, dtype, copy, tz, freq, dayfirst, yearfirst, ambiguous)
    324         freq, freq_infer = dtl.maybe_infer_freq(freq)
    325 
--> 326         subarr, tz, inferred_freq = sequence_to_dt64ns(
    327             data,
    328             dtype=dtype,

~/miniconda3/envs/pandas-1.2.5/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py in sequence_to_dt64ns(data, dtype, copy, tz, dayfirst, yearfirst, ambiguous)
   1999             # Convert tz-naive to UTC
   2000             tz = timezones.maybe_get_tz(tz)
-> 2001             data = tzconversion.tz_localize_to_utc(
   2002                 data.view("i8"), tz, ambiguous=ambiguous
   2003             )

pandas/_libs/tslibs/tzconversion.pyx in pandas._libs.tslibs.tzconversion.tz_localize_to_utc()

AmbiguousTimeError: Cannot infer dst time from 2019-11-03 01:00:00, try using the 'ambiguous' argument

Now this behavior of raising when the dtype is specified and is the same dtype as the inferred dtype when the dtype is not specified does indeed seem strange and maybe it is this that needs to be changed to fix the regression even though this is not changed from 1.2.5.

So what has changed?

There we two changes in #39767, maybe_promote and infer_dtype_from_scalar. Although there we extensive changes to maybe_promote, this is not hit in the code sample from the OP.

The change is that infer_dtype_from_scalar now returns a different val

dt = pd.to_datetime("2019-11-03 01:00:00-0700").tz_convert("America/Los_Angeles")
print(dt)
dtype, val = pd.core.dtypes.cast.infer_dtype_from_scalar(dt, pandas_dtype=True)
print(dtype, type(dtype))
print(val, type(val))

on 1.2.5

2019-11-03 01:00:00-07:00
datetime64[ns, America/Los_Angeles] <class 'pandas.core.dtypes.dtypes.DatetimeTZDtype'>
1572768000000000000 <class 'int'>

on master

2019-11-03 01:00:00-07:00
datetime64[ns, America/Los_Angeles] <class 'pandas.core.dtypes.dtypes.DatetimeTZDtype'>
2019-11-03 01:00:00-07:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>

Now the docstring for infer_dtype_from_scalar doesn't give any details on what is happening to val so not sure if the behavior on master is correct or not.

Suffice to say, using the int val in pd.core.arrays.datetimes.DatetimeArray._from_sequence with the dtype specified does not raise an exception.

dt = pd.to_datetime("2019-11-03 01:00:00-0700").tz_convert("America/Los_Angeles")
print(dt)
dtype = pd.DataFrame({"dt": [dt], "value": [1]}).dtypes[0]
print(dtype)
int_val = dt.value
print(int_val)
result = pd.core.arrays.datetimes.DatetimeArray._from_sequence([int_val], dtype=dtype)
print(result)

2019-11-03 01:00:00-07:00
datetime64[ns, America/Los_Angeles]
1572768000000000000
<DatetimeArray>
['2019-11-03 01:00:00-07:00']
Length: 1, dtype: datetime64[ns, America/Los_Angeles]

simonjayhawkins · 2021-11-18T11:38:56Z

@jbrockmendel any thoughts on the best way to fix?

Maybe three options

reverting the change in return type of infer_dtype_from_scalar
changing the handling of types in pd.core.arrays.datetimes.DatetimeArray._from_sequence
special case the passing of the return of infer_dtype_from_scalar onto pd.core.arrays.datetimes.DatetimeArray._from_sequence. (i.e. convert to integer)

I'm not sure about option 1, but I assume that the new behaviour is correct? (Needs a better docstring)

I'm not keen on option 2 for a backport PR and late in the 1.3.x series even though it's probably the correct long term solution.

That leaves option 3. Not normally keen on special casing but probably a trivial and easy fix and safe and suitable for a backport.

jbrockmendel · 2021-11-18T18:23:30Z

Best guess remains sequence_to_dt64ns (#42505 (comment)). i'll try to confirm this guess today.

kenahoo added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 12, 2021

kenahoo changed the title ~~BUG: Regression - AmbiguousTimeError creating~~ BUG: Regression - AmbiguousTimeError creating DataFrame Jul 12, 2021

mzeitlin11 added Constructors Series/DataFrame/Index/pd.array Constructors Regression Functionality that used to work in a prior pandas version Timezones Timezone data dtype and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 12, 2021

mzeitlin11 added this to the 1.3.1 milestone Jul 12, 2021

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jul 19, 2021

code sample for pandas-dev#42505

7e86855

simonjayhawkins modified the milestones: 1.3.1, 1.3.2 Jul 24, 2021

simonjayhawkins modified the milestones: 1.3.2, 1.3.3 Aug 15, 2021

mroeschke added the Bug label Aug 21, 2021

simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021

simonjayhawkins modified the milestones: 1.3.4, 1.3.5 Oct 16, 2021

jbrockmendel mentioned this issue Nov 18, 2021

BUG: DataFrame with scalar tzaware Timestamp #44518

Merged

4 tasks

jreback closed this as completed in #44518 Nov 20, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Regression - AmbiguousTimeError creating DataFrame #42505

BUG: Regression - AmbiguousTimeError creating DataFrame #42505

kenahoo commented Jul 12, 2021 •

edited

Loading

INSTALLED VERSIONS

mzeitlin11 commented Jul 12, 2021

rhshadrach commented Jul 13, 2021

jmcomie commented Jul 16, 2021

jbrockmendel commented Jul 16, 2021

kenahoo commented Jul 17, 2021

jmcomie commented Jul 17, 2021

simonjayhawkins commented Oct 16, 2021

kenahoo commented Nov 4, 2021

kenahoo commented Nov 4, 2021

jbrockmendel commented Nov 5, 2021

kenahoo commented Nov 6, 2021

simonjayhawkins commented Nov 8, 2021

simonjayhawkins commented Nov 18, 2021

jbrockmendel commented Nov 18, 2021

BUG: Regression - AmbiguousTimeError creating DataFrame #42505

BUG: Regression - AmbiguousTimeError creating DataFrame #42505

Comments

kenahoo commented Jul 12, 2021 • edited Loading

Code Sample, a copy-pastable example

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

mzeitlin11 commented Jul 12, 2021

rhshadrach commented Jul 13, 2021

jmcomie commented Jul 16, 2021

jbrockmendel commented Jul 16, 2021

kenahoo commented Jul 17, 2021

jmcomie commented Jul 17, 2021

simonjayhawkins commented Oct 16, 2021

kenahoo commented Nov 4, 2021

kenahoo commented Nov 4, 2021

jbrockmendel commented Nov 5, 2021

kenahoo commented Nov 6, 2021

simonjayhawkins commented Nov 8, 2021

simonjayhawkins commented Nov 18, 2021

jbrockmendel commented Nov 18, 2021

kenahoo commented Jul 12, 2021 •

edited

Loading

Output of `pd.show_versions()`