Remove unnecessary usage of _TSObject #17297

Merged
merged 5 commits into from
Aug 21, 2017

jbrockmendel
Member

This is part 2 in an N-part series of PRs to disentangle inter-dependent pieces of tslib.pyx (and by extension, lib.pyx and period.pyx).

tslib has a _TSObject class that is used as a container during conversion steps. In a number of the places where it is currently used, it is not needed. All this PR does is remove it in cases where it is either unused or unneeded.

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@jreback
Contributor

jreback commented Aug 20, 2017

pls show asv for all time series operations

Only those that are cdef so not exposed.  That way there is not a risk
of backward incompatibility
@jbrockmendel
Member Author

pls show asv for all time series operations

I'll get that started shortly.

@jbrockmendel
Member Author

Between this and the experience last time with Index._is_multi, I'm not at all convinced these measurements are meaningful.

asv continuous -f 1.1 -E virtualenv master less_dts
[...]
before           after         ratio
     [3b02e73b]       [73e67ef7]
!           17.2s           failed      n/a  gil.nogil_datetime_fields.time_datetime_field_daysinmonth
!           39.4s           failed      n/a  gil.nogil_datetime_fields.time_datetime_field_normalize
!           20.5s           failed      n/a  gil.nogil_datetime_fields.time_datetime_field_year
+         231±1μs        31.5±40ms   136.06  indexing.MultiIndexing.time_series_xs_mi_ix
+     4.00±0.02ms            482ms   120.49  frame_methods.Reindex.time_reindex_axis0
+        263±10μs        5.53±20ms    20.99  indexing.MultiIndexing.time_frame_xs_mi_ix
+          73.2ms            611ms     8.35  packers.JSON.time_write_json_mixed_float_int_T
+       327±100ms            2.36s     7.21  gil.NoGilGroupby.time_sum_4_notp
+           801ms            5.14s     6.41  frame_methods.Reindex.time_reindex_axis1
+          74.6ms            440ms     5.90  packers.JSON.time_write_json_T
+     19.9±0.04ms          103±2ms     5.19  series_methods.series_isin_int64.time_series_isin_int64_large
+          74.0ms            380ms     5.14  packers.JSON.time_write_json_mixed_float_int_str
+        178±20ms            855ms     4.81  frame_methods.Shift.time_shift_axis_1
+         191±8ms         912±30ms     4.78  indexing.Int64Indexing.time_getitem_array
+        363±10ms            1.72s     4.73  frame_methods.Reindex.time_reindex_both_axes
+      1.31±0.01s            5.86s     4.47  reshape.reshape_unstack_large_single_dtype.time_unstack_with_mask
+           3.06s            13.7s     4.46  join_merge.MergeCategoricals.time_merge_cat
+           13.1s            58.3s     4.45  join_merge.JoinIndex.time_left_outer_join_index
+       837±300ms            3.63s     4.33  gil.NoGilGroupby.time_sum_8_notp
+           4.89s            20.2s     4.13  join_merge.MergeCategoricals.time_merge_object
+      1.45±0.02s            5.78s     3.98  gil.NoGilGroupby.time_groups_2
+        264±30ms       1.03±0.01s     3.90  indexing.Int64Indexing.time_loc_scalar
+        252±10ms         956±20ms     3.79  indexing.Int64Indexing.time_getitem_lists
+        295±10ms       1.10±0.03s     3.74  indexing.Int64Indexing.time_ix_list_like
+        652±50ms            2.38s     3.65  indexing.MultiIndexing.time_multiindex_large_get_loc_warm
+        295±10ms       1.07±0.05s     3.64  indexing.Int64Indexing.time_ix_array
+        607±10ms            2.20s     3.62  indexing.MultiIndexing.time_multiindex_large_get_loc
+        287±20ms       1.04±0.05s     3.62  indexing.Int64Indexing.time_loc_array
+        247±10ms          885±9ms     3.58  indexing.Int64Indexing.time_getitem_list_like
+        351±30ms            1.20s     3.42  frame_methods.Dropna.time_dropna_axis1_any_mixed_dtypes
+        298±10ms       1.01±0.04s     3.41  indexing.Int64Indexing.time_loc_list_like
+           1.72s            5.43s     3.16  join_merge.i8merge.time_i8merge
+          66.1ms            208ms     3.15  packers.JSON.time_write_json
+           994ms            2.95s     2.97  packers.JSON.time_write_json_lines
+           750ms            2.16s     2.88  frame_methods.frame_nunique.time_frame_nunique
+           17.4s            47.4s     2.73  gil.nogil_datetime_fields.time_datetime_field_day
+        338±30ms            894ms     2.64  indexing.MultiIndexing.time_multiindex_get_indexer
+         311±4ms         811±30ms     2.60  packers.Packers.time_packers_read_csv
+        432±30ms            1.07s     2.47  indexing.StringIndexing.time_getitem_label_slice
+        303±40ms         748±20ms     2.47  frame_methods.frame_duplicated.time_frame_duplicated
+       168±0.6ms         376±10ms     2.24  inference.to_numeric_downcast.time_downcast('string-float', 'signed')
+         175±2ms          384±4ms     2.20  inference.to_numeric_downcast.time_downcast('string-float', None)
+         177±2ms         384±10ms     2.17  inference.to_numeric_downcast.time_downcast('string-float', 'integer')
+        105±20ms            219ms     2.09  gil.NoGilGroupby.time_sum_4
+           136ms            248ms     1.82  packers.JSON.time_write_json_date_index
+           1.32s            2.12s     1.61  reindex.Reindexing.time_reindex_multiindex
+     2.48±0.01ms      3.77±0.03ms     1.52  groupby.groupby_datetimetz.time_groupby_sum
+           232ms            345ms     1.49  gil.nogil_read_csv.time_read_csv_datetime
+        262±90ms            390ms     1.49  join_merge.Align.time_series_align_int64_index
+      74.6±0.6ms          103±8ms     1.39  io_bench.frame_to_csv2.time_frame_to_csv2
+     9.27±0.06ms       12.7±0.2ms     1.37  algorithms.Algorithms.time_factorize_string
+       112±0.5ms          152±6ms     1.36  frame_methods.series_string_vector_slice.time_series_string_vector_slice
+           8.60s            11.7s     1.36  gil.NoGilGroupby.time_groups_4
+           2.37s            3.17s     1.34  stat_ops.FrameOps.time_op('median', False, 'float', 1)
+           14.0s            18.3s     1.31  gil.NoGilGroupby.time_groups_8
+      61.9±0.5μs       80.6±0.1μs     1.30  indexing.Int64Indexing.time_ix_slice
+      42.1±0.2ms         54.5±3ms     1.30  packers.STATA.time_write_stata_with_validation
+      56.2±0.6ms       69.6±0.7ms     1.24  packers.HDF.time_write_hdf_store
+           373ms            461ms     1.24  groupby.groupby_multi_index.time_groupby_multi_index
+           6.61s            7.95s     1.20  join_merge.ConcatPanels.time_c_ordered_axis2
+        91.4±3ms          109±3ms     1.19  gil.nogil_factorize.time_factorize_strings_4
+         418±4ms        496±0.9ms     1.19  timeseries.SemiMonthOffset.time_end_apply_index
+     3.14±0.08ms      3.70±0.08ms     1.18  timeseries.AsOf.time_asof_nan
+     2.75±0.03ms       3.25±0.2ms     1.18  timeseries.TimeSeries.time_large_lookup_value
+     1.44±0.01ms      1.70±0.05ms     1.18  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('MonthBegin', 1)
+           2.93s            3.44s     1.18  join_merge.ConcatFrames.time_c_ordered_axis0
+     1.98±0.01ms      2.31±0.01ms     1.17  groupby.GroupBySuite.time_mean('int', 10000)
+     14.9±0.03ms       17.4±0.2ms     1.16  frame_methods.Formatting.time_repr_tall
+           780ms            907ms     1.16  index_object.Multi2.time_sortlevel_int64
+        95.6±1ms          111±2ms     1.16  frame_methods.frame_insert_100_columns_begin.time_frame_insert_500_columns_end
+     4.87±0.04ms       5.52±0.1ms     1.13  groupby.groupby_float32.time_groupby_sum
+           2.30s            2.61s     1.13  stat_ops.FrameOps.time_op('median', True, 'int', 1)
+     3.24±0.09ms      3.67±0.06ms     1.13  timeseries.AsOf.time_asof
+     5.66±0.07ms      6.40±0.03ms     1.13  algorithms.Algorithms.time_add_overflow_pos_arr
+        1.38±0ms      1.56±0.04ms     1.13  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('MonthBegin', 2)
+        1.19±0ms         1.33±0ms     1.12  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('Nano', 1)
+       830±0.9μs         929±90μs     1.12  indexing.DataFrameIndexing.time_loc_dups
+           1.88s            2.11s     1.12  groupby.GroupBySuite.time_diff('float', 10000)
+          8.90ms           9.94ms     1.12  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('FY5253Quarter_1', 1)
+        2.18±0ms       2.42±0.2ms     1.11  timeseries.ResampleDataFrame.time_max_string
+        1.92±0ms      2.14±0.04ms     1.11  stat_ops.stats_rolling_mean.time_rolling_mean
+           1.44s            1.58s     1.10  groupby.GroupBySuite.time_rank('int', 10000)
+     18.4±0.02ms           20.2ms     1.10  join_merge.MergeAsof.time_by_int
-      2.03±0.1ms      1.84±0.03ms     0.91  reindex.Duplicates.time_frame_drop_dups_bool
-      5.56±0.1μs      5.04±0.01μs     0.91  indexing.IndexingMethods.time_get_loc_float
-        429±30ns          388±1ns     0.90  period.period_standard_indexing.time_shape
-        1.26±0ms         1.14±0ms     0.90  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('Hour', 2)
-      1.88±0.2ms         1.68±0ms     0.89  period.Algorithms.time_drop_duplicates_pseries
-     1.29±0.03ms         1.15±0ms     0.89  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('Micro', 2)
-      7.09±0.2ms       6.31±0.1ms     0.89  binary_ops.Ops2.time_frame_int_div_by_zero
-           8.50s            7.52s     0.89  join_merge.ConcatPanels.time_c_ordered_axis1
-     2.81±0.07ms      2.48±0.01ms     0.88  rolling.SeriesRolling.time_rolling_max_l
-         530±7ms          464±5ms     0.88  inference.to_numeric_downcast.time_downcast('string-nint', 'signed')
-        359±20μs          315±1μs     0.88  reindex.Reindexing.time_reindex_dates
-        512±10ms         448±10ms     0.88  inference.to_numeric_downcast.time_downcast('string-int', 'signed')
-        455±20ms          397±3ms     0.87  replace.replace_convert.time_replace_frame_timedelta
-           26.6s            23.1s     0.87  replace.replace_large_dict.time_replace_large_dict
-     1.21±0.05ms      1.05±0.02ms     0.87  replace.replace_replacena.time_replace_replacena
-        460±40ms         400±10ms     0.87  inference.to_numeric_downcast.time_downcast('string-nint', 'float')
-        798±30μs          692±1μs     0.87  reindex.LevelAlign.time_reindex_level
-        89.0±3μs       76.6±0.2μs     0.86  frame_methods.frame_dtypes.time_frame_dtypes
-        323±20ms       277±0.04ms     0.86  packers.packers_read_sql.time_packers_read_sql
-     3.20±0.09ms      2.74±0.01ms     0.86  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('WeekOfMonth', 2)
-     4.91±0.07ms       4.19±0.2ms     0.85  binary_ops.Ops2.time_frame_int_mod
-        74.3±3μs       63.3±0.2μs     0.85  period.period_standard_indexing.time_series_loc
-        838±40μs          704±2μs     0.84  reindex.LevelAlign.time_align_level
-     21.5±0.04μs       18.0±0.4μs     0.84  timeseries.Offsets.time_custom_bday_incr
-          89.0μs           74.3μs     0.83  panel_methods.PanelMethods.time_shift_minor
-      9.29±0.4μs      7.73±0.01μs     0.83  period.Algorithms.time_drop_duplicates_pindex
-      14.3±0.7ms           11.8ms     0.83  join_merge.MergeAsof.time_on_int32
-        480±30ms          393±2ms     0.82  replace.replace_convert.time_replace_frame_timestamp
-      9.30±0.4ms      7.59±0.04ms     0.82  reindex.Duplicates.time_frame_drop_dups
-        97.6±5ms         79.2±1ms     0.81  plotting.TimeseriesPlotting.time_plot_regular_compat
-     35.7±0.07μs      28.9±0.04μs     0.81  indexing.Int64Indexing.time_iloc_list_like
-           48.4s            39.3s     0.81  gil.nogil_datetime_fields.time_period_to_datetime
-        404±30μs        325±0.9μs     0.80  reindex.Duplicates.time_series_drop_dups_int
-        98.3±8ms       78.8±0.8ms     0.80  gil.nogil_rolling_algos_fast.time_nogil_rolling_min
-     1.67±0.06ms      1.31±0.01ms     0.78  parser_vb.read_csv3.time_default_converter
-        341±10μs          266±5μs     0.78  reindex.FillMethod.time_pad
-        271±20μs        209±0.3μs     0.77  reindex.FillMethod.time_backfill_float32
-        515±20μs          396±2μs     0.77  reindex.Duplicates.time_series_drop_dups_string
-        10.5±1ms       7.94±0.1ms     0.76  reindex.LibFastZip.time_lib_fast_zip
-        567±60μs          429±3μs     0.76  period.Algorithms.time_value_counts_pindex
-      2.63±0.2ms      1.99±0.01ms     0.76  parser_vb.read_csv3.time_roundtrip_converter
-        641±30μs         485±10μs     0.76  reindex.Reindexing.time_reindex_columns
-        14.4±1ms      10.9±0.06ms     0.76  reindex.LibFastZip.time_lib_fast_zip_fillna
-      5.62±0.9ms      4.23±0.06ms     0.75  period.Constructor.time_from_pydatetime
-        342±50ms          250±4ms     0.73  replace.replace_convert.time_replace_series_timestamp
-        9.32±1ms      6.83±0.01ms     0.73  parser_vb.read_csv3.time_default_converter_with_decimal_python_engine
-      16.2±0.6ms       11.9±0.2ms     0.73  gil.nogil_read_csv.time_read_csv_object
-        15.3±1μs       11.1±0.1μs     0.72  period.period_standard_indexing.time_get_loc
-      3.67±0.4ms      2.65±0.01ms     0.72  reindex.Duplicates.time_frame_drop_dups_na_inplace
-        8.96±2μs      6.37±0.01μs     0.71  period.period_standard_indexing.time_shallow_copy
-        303±30μs        209±0.4μs     0.69  reindex.FillMethod.time_pad_float32
-      5.00±0.3ms      3.46±0.05ms     0.69  algorithms.Algorithms.time_duplicated_int
-           187ms            126ms     0.67  gil.nogil_kth_smallest.time_nogil_kth_smallest
-        19.8±3ms      13.3±0.04ms     0.67  parser_vb.read_csv1.time_sep
-           3.85s            2.53s     0.66  reshape.reshape_unstack_large_single_dtype.time_unstack_full_product
-      21.4±0.3ms      13.9±0.09ms     0.65  parser_vb.read_csv1.time_thousands
-           1.76s            978ms     0.55  groupby.Groups.time_groupby_groups('object_small')
-        17.9±1ms      9.83±0.03ms     0.55  reindex.Duplicates.time_frame_drop_dups_na
-        779±30μs         422±20μs     0.54  reindex.FillMethod.time_pad_daterange
-           1.29s            658ms     0.51  timeseries.AsOfDataFrame.time_asof
-           898ms            454ms     0.51  frame_methods.Dropna.time_dropna_axis0_all
-         466±7ms          234±6ms     0.50  inference.to_numeric_downcast.time_downcast('string-int', 'float')
-           196ms           89.7ms     0.46  packers.JSON.time_write_json_mixed_float_int
-        502±10ms          202±4ms     0.40  panel_ctor.Constructors1.time_panel_from_dict_all_different_indexes
-      1.06±0.01s         385±10ms     0.36  index_object.SetOperations.time_int64_symmetric_difference
-        424±20ms          148±4ms     0.35  panel_ctor.Constructors4.time_panel_from_dict_two_different_indexes
-           317ms            106ms     0.33  frame_methods.Dropna.time_dropna_axis0_any
-       395±0.1ms          116±1ms     0.29  index_object.SetOperations.time_int64_intersection
-           2.04s            599ms     0.29  frame_methods.Dropna.time_count_level_axis1_multi
-        187±50ms         52.5±2ms     0.28  parser_vb.read_csv_categorical.time_convert_post
-           1.89s            529ms     0.28  frame_methods.Dropna.time_count_level_axis0_multi
-           905ms         238±20ms     0.26  index_object.Multi1.time_duplicated
-           3.02s            788ms     0.26  frame_methods.Dropna.time_dropna_axis0_all_mixed_dtypes
-        624±20ms          153±6ms     0.24  binary_ops.Ops.time_frame_multi_and(True, 1)
-      1.04±0.07s         250±40ms     0.24  binary_ops.Ops.time_frame_multi_and(True, 'default')
-        384±20ms         87.7±4ms     0.23  index_object.SetOperations.time_int64_union
-        602±30ms         121±20ms     0.20  binary_ops.Ops.time_frame_multi_and(False, 'default')
-        411±10ms           72.1ms     0.18  frame_methods.Reindex.time_reindex_both_axes_ix
-     1.90±0.05ms        332±0.3μs     0.17  timeseries.DatetimeIndex.time_timeseries_is_month_start
-           3.45s            452ms     0.13  frame_methods.Dropna.time_count_level_axis1_mixed_dtypes_multi
-        161±20ms       15.6±0.1ms     0.10  parser_vb.read_csv2.time_comment
-           2.95s       18.6±0.1ms     0.01  join_merge.ConcatFrames.time_f_ordered_axis1

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback
Contributor

jreback commented Aug 21, 2017

i would disagree

you are touching some very performance sensitive code

and you are removing type definitions

thus the slowdown

you need to pick a benchmark and ensure that there is no perf gap

@gfyoung gfyoung added Clean Internals Related to non-user accessible pandas implementation labels Aug 21, 2017
@jbrockmendel
Member Author

I'll start by acknowledging that the edits to _Timestamp.to_pydatetime are probably less efficient than the status-quo version. If that is causing a 100x slowdown in Series.ix.__getitem__ or a 10x speedup in reading CSVs, then I understand this even less than I thought.

and you are removing type definitions

Snark aside, _Timestamp.to_pydatetime is a case where this comment is correct, and I'll be happy to revert that. But look at the others:

  • period.apply_mult is never used.
  • src.datetime.convert_pydatetime_to_datetimestruct is never even imported.
  • src.datetime.make_iso_8601_datetime ditto.
  • [...4 more of these in src.datetime]
  • tslib._is_multiple is never used.
  • tslib.m8_weekday is never used.
  • tslib.date_normalize creates tso = _TSObject() and then never uses it.
  • tslib.get_date_name_field declares _TSObject ts in its cdef block but then never uses it.
  • tslib.get_date_field ditto.
  • Timestamp.replace ditto.

Are any of these changes that could plausibly account for any changes in the asv output? Honestly asking.
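A claim like "never used" can be spot-checked mechanically before anything is deleted. The following is a minimal sketch, not the procedure used in this PR: count_references is a hypothetical helper, and the suffix list is an assumption about which files in a checkout are worth scanning.

```python
import re
from pathlib import Path

# Hypothetical helper: the name and the default suffix list are illustrative
# assumptions, not part of pandas.
SOURCE_SUFFIXES = (".pyx", ".pxd", ".pxi", ".py", ".h", ".c")


def count_references(root, name, suffixes=SOURCE_SUFFIXES):
    """Map each source file under `root` to the number of whole-word
    occurrences of `name` it contains (files with zero hits are omitted)."""
    pattern = re.compile(r"\b" + re.escape(name) + r"\b")
    hits = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            count = len(pattern.findall(text))
            if count:
                hits[str(path)] = count
    return hits
```

A name whose only hit is its own definition is a candidate for removal. A textual scan cannot see dynamic lookups such as getattr, so it is a first filter rather than proof.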

tslib.get_start_end_field actually has some action. In the last block this PR deletes an occurrence of ts = convert_to_tsobject(dtindex[i], None, None, 0, 0) where ts is never used. In each of the previous blocks, it replaces:

pandas_datetime_to_datetimestruct(dtindex[i], PANDAS_FR_ns, &dts)
ts = convert_to_tsobject(dtindex[i], None, None, 0, 0)
dow = ts_dayofweek(ts)

with

pandas_datetime_to_datetimestruct(dtindex[i], PANDAS_FR_ns, &dts)
dow = dayofweek(dts.year, dts.month, dts.day)

Note that ts_dayofweek is an inlined call that returns dayofweek(ts.dts.year, ts.dts.month, ts.dts.day) and that the call to convert_to_tsobject will end up making a redundant call to pandas_datetime_to_datetimestruct.

I have a hard time imagining how removing redundant calls to pandas_datetime_to_datetimestruct could slow things down.
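The equivalence of the two code paths in get_start_end_field can be illustrated in pure Python. This is only a sketch under stated assumptions: DatetimeStruct and TSObjectSketch are hypothetical stand-ins for the C-level struct and the _TSObject container, and the weekday formula (Sakamoto's method) was picked for illustration rather than taken from the pandas source.

```python
from collections import namedtuple
from datetime import date

# Hypothetical stand-ins for the C-level struct and the container class.
DatetimeStruct = namedtuple("DatetimeStruct", "year month day")


class TSObjectSketch:
    """Container that only wraps an already-computed struct."""

    def __init__(self, dts):
        self.dts = dts


_MONTH_OFFSETS = (0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4)


def dayofweek(year, month, day):
    """Day of week with Monday=0 ... Sunday=6 (Sakamoto's method)."""
    if month < 3:
        year -= 1
    sunday_based = (year + year // 4 - year // 100 + year // 400
                    + _MONTH_OFFSETS[month - 1] + day) % 7
    return (sunday_based + 6) % 7


def ts_dayofweek(ts):
    """Old path: unwrap the container, then delegate to dayofweek."""
    return dayofweek(ts.dts.year, ts.dts.month, ts.dts.day)


dts = DatetimeStruct(2017, 9, 1)
via_container = ts_dayofweek(TSObjectSketch(dts))   # old path
direct = dayofweek(dts.year, dts.month, dts.day)    # new path
assert via_container == direct == date(2017, 9, 1).weekday()
```

Both paths read the same three fields, so the container contributes nothing to the result. In this sketch, dropping it only removes the cost of building the wrapper.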

All that said, I am the least experienced person here and am willing to be convinced. But I see no evidence that the asv measurements above contain any meaningful information.

@jorisvandenbossche
Member

Yes, it is a problem with asv that the measurements are quite noisy, often too noisy to really rely upon (although I haven't seen results as noisy as the ones you show on my laptop). That said, asv is certainly still useful and has caught regressions before.

Given that asv is so noisy on your laptop, I would take some benchmarks out of it that could potentially be impacted (related to timeseries), and test the code snippet directly with %timeit (as I suggested before, I think). When doing a performance-related PR, I personally also rely more on that than on asv while developing (it is easier, quicker and more reliable for interactive use).
Eg you mention _Timestamp.to_pydatetime and tslib.get_start_end_field as ones that can be impacted (for better or worse). So take a benchmark (or write a small code snippet yourself) that uses this, and run it a few times with %timeit. If I understand you correctly, one with tslib.get_start_end_field should even be faster.

Eg you can easily test to_pydatetime:

In [80]: ts = pd.Timestamp("2017-09-01 09:00:00")

In [81]: %timeit ts.to_pydatetime()
751 ns ± 6.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [82]: s = pd.date_range("2016-01-01", periods=10000, freq='H')

In [85]: %timeit s.to_pydatetime()
3.43 ms ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
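Outside IPython, the same kind of measurement can be scripted with the standard-library timeit module. The sketch below is an assumption-laden stand-in, not part of the PR: best_per_call is a hypothetical helper, reporting the minimum over several runs is a design choice, and the workload is a plain datetime method so that the script runs without pandas installed.

```python
import timeit
from datetime import datetime


def best_per_call(func, number=100_000, repeat=7):
    """Best (minimum) time per call in seconds across `repeat` runs of
    `number` calls each."""
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return min(totals) / number


# Stand-in workload so the sketch runs without pandas; in a pandas environment
# one would pass e.g. pd.Timestamp("2017-09-01 09:00:00").to_pydatetime.
stand_in = datetime(2017, 9, 1, 9).isoformat
print(f"{best_per_call(stand_in) * 1e9:.0f} ns per call")
```

Running the same script on master and on the PR branch gives two numbers that can be compared directly. The minimum is less affected by background load than the mean.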

Since you mention there are both clean-ups of dead code and simplifications (like the one in to_pydatetime), maybe it's easier to review if those changes are kept separate?

Contributor

@jreback jreback left a comment

As @jorisvandenbossche already indicated, perf is very important. Cleanup must respect this; generally, proceeding iteratively and testing perf at each step is a good idea.

especially when changing cython code, some things are done for perf and may not be immediately obvious.

@jbrockmendel
Member Author

OK. Let's start with to_pydatetime.

In [3]: ts = pd.Timestamp("2017-09-01 09:00:00")
In [4]: %timeit ts.to_pydatetime()

In master this came back with

The slowest run took 20.69 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 484 ns per loop

Under the PR this was in fact much slower:

In [4]: %timeit ts.to_pydatetime()
The slowest run took 109.78 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.95 µs per loop

So I reverted the edits to to_pydatetime. Post-reversion that measurement is back to parity (well, 1.03% slower).

In [21]: %timeit ts.to_pydatetime()
The slowest run took 24.39 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 489 ns per loop

All subsequent timings here are post-reversion.

In [26]: s = pd.date_range("2016-01-01", periods=10000, freq='H')
In [27]: sb = pd.date_range("2016-01-01", periods=10000, freq='BH')
In [28]: %timeit s.to_pydatetime()
# PR     --> 100 loops, best of 3: 2.07 ms per loop
# master --> 100 loops, best of 3: 2.13 ms per loop
In [29]: %timeit sb.to_pydatetime()
# PR     --> 100 loops, best of 3: 2.08 ms per loop
# master --> 100 loops, best of 3: 2.12 ms per loop

Trying a method that will go through get_start_end_field. The BH versions actually go through the changed weekday code paths:

In [30]: %timeit s.is_year_end
# PR     --> 1000 loops, best of 3: 307 µs per loop
# master --> 1000 loops, best of 3: 1.7 ms per loop

In [31]: %timeit sb.is_year_end
# PR     --> 1000 loops, best of 3: 362 µs per loop
# master --> 100 loops, best of 3: 1.78 ms per loop

In [32]: %timeit sb.is_quarter_start
# PR     --> 1000 loops, best of 3: 345 µs per loop
# master --> 100 loops, best of 3: 1.73 ms per loop
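The size of the improvement follows directly from the timings above. A short calculation, using only the numbers reported in this session (converted to microseconds):

```python
# Timings copied from the session above, converted to microseconds.
timings_us = {
    "s.is_year_end": {"pr": 307, "master": 1700},
    "sb.is_year_end": {"pr": 362, "master": 1780},
    "sb.is_quarter_start": {"pr": 345, "master": 1730},
}

# Speedup = time on master divided by time under the PR.
speedups = {name: t["master"] / t["pr"] for name, t in timings_us.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x faster under the PR")
```

All three properties come out roughly five times faster under the PR.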

Given that asv is so noisy on your laptop,

BTW, this is from a fairly beefy desktop in my basement. There is non-zero background work running that can account for some noise, but both the level and variation should be much smaller than a laptop.

@codecov

codecov bot commented Aug 21, 2017

Codecov Report

Merging #17297 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #17297      +/-   ##
==========================================
- Coverage   91.03%   91.01%   -0.02%     
==========================================
  Files         162      162              
  Lines       49567    49567              
==========================================
- Hits        45123    45114       -9     
- Misses       4444     4453       +9
Flag Coverage Δ
#multiple 88.79% <ø> (ø) ⬆️
#single 40.24% <ø> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.72% <0%> (-0.1%) ⬇️
pandas/core/indexing.py 93.94% <0%> (ø) ⬆️
pandas/core/generic.py 92.03% <0%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3b02e73...a99133c. Read the comment docs.

@jorisvandenbossche
Member

Useful timings!

Given that asv is so noisy on your laptop,

BTW, this is from a fairly beefy desktop in my basement. There is non-zero background work running that can account for some noise, but both the level and variation should be much smaller than a laptop.

I don't know why it could be so variable on your desktop, all I can say is that I don't see such a large variation on mine.
Eg I just ran (for another reason) a single benchmark today, one that in your timings above had a factor difference of 3.7, while here it was 1.06:

(dev) joris@joris-XPS-13-9350:~/scipy/pandas/asv_bench$ asv continuous v0.20.0 master -b indexing.Int64Indexing.time_ix_list_like
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For pandas commit hash 8354a1df:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.................................................
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running indexing.Int64Indexing.time_ix_list_like    332±7μs
[ 50.00%] · For pandas commit hash 84fa7449:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..................................................
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· Running indexing.Int64Indexing.time_ix_list_like    353±20μs
BENCHMARKS NOT SIGNIFICANTLY CHANGED.
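The "+" and "-" markers in the earlier table and the verdict here both come from comparing the after/before ratio with a threshold factor (the earlier run passed -f 1.1). A simplified sketch of that comparison follows; classify is a hypothetical helper, and the real asv decision may additionally account for measurement uncertainty.

```python
def classify(before, after, factor=1.1):
    """Label a benchmark pair by its after/before ratio against `factor`."""
    ratio = after / before
    if ratio > factor:
        return "regression"
    if ratio < 1 / factor:
        return "improvement"
    return "unchanged"


# Values taken from the outputs quoted in this thread (same unit per pair).
print(classify(332, 353))     # time_ix_list_like, this run: ratio ~1.06
print(classify(295, 1100))    # time_ix_list_like, earlier run: ratio ~3.7
print(classify(2950, 18.6))   # ConcatFrames.time_f_ordered_axis1: ratio ~0.01
```

The same benchmark landing on either side of the threshold across two machines is what makes the earlier table hard to interpret.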

@jreback jreback added this to the 0.21.0 milestone Aug 21, 2017
@jreback jreback merged commit eff1f88 into pandas-dev:master Aug 21, 2017
@jreback
Contributor

jreback commented Aug 21, 2017

thanks!

return datetime(self.year, self.month, self.day,
self.hour, self.minute, self.second,
self.microsecond, self.tzinfo)
ts = convert_to_tsobject(self, self.tzinfo, None, 0, 0)
Member

maybe we should add a comment here that self is converted to a TSObject for performance reasons (faster attribute access) for future reference, as I agree you wouldn't necessarily think this

@jbrockmendel jbrockmendel deleted the less_dts branch August 22, 2017 20:34
rs2 added a commit to rs2/pandas that referenced this pull request Aug 30, 2017
* consolidated the duplicate definitions of NA values (in parsers & IO) (pandas-dev#16589)

* GH15943 Fixed defaults for compression in HDF5 (pandas-dev#16355)

* DOC: add header=None to read_excel docstring (pandas-dev#16689)

* TST: Test against python-dateutil master (pandas-dev#16648)

* BUG: .iloc[:] and .loc[:] return a copy of the original object pandas-dev#13873 (pandas-dev#16443)

closes pandas-dev#13873

* TST: Add test of building frame from named Series and columns (pandas-dev#9232) (pandas-dev#16700)

* DOC: fix wrongly placed versionadded (pandas-dev#16702)

* DOC: pin sphinx to version 1.5 (pandas-dev#16704)

* CI: restore np 113 in ci builds (pandas-dev#16656)

* Revert "BLD: fix numpy on 3.6 build as 1.13 was released but no deps are built for it (pandas-dev#16633)"

This reverts commit dfebd8a.

closes pandas-dev#16634

* BUG: Fix regression for RGB(A) color arguments (pandas-dev#16701)

* Add test

* Pass tuples that are RGB or RGBA like in list

* Update what's new

* change whatsnew to reflect regression fix

* Add test for RGBA as well

* CI: pin jemalloc=4.4.0 (pandas-dev#16727)

* MAINT: Drop Categorical.order & sort (pandas-dev#16728)

Deprecated back in 0.18.1

xref pandas-devgh-12882

* Fix reading Series with read_hdf (pandas-dev#16610)

* Added test to reproduce issue pandas-dev#16583

* Fix pandas-dev#16583 by adding an explicit `mode` argument to `read_hdf`

kwargs which are meant for the opening of the HDFStore should be filtered out
before passing the remaining kwargs to the `select` function to load the data.

* Noted fix for pandas-dev#16583 in WhatsNew

* DOC: typo (pandas-dev#16733)

* whatsnew v0.21.0.txt typos (pandas-dev#16742)

* whatsnew v0.20.3 edits (pandas-dev#16743)

* BUG: do not raise UnsortedIndexError if sorting is not required

closes pandas-dev#16734

Author: Pietro Battiston <me@pietrobattiston.it>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <jeff.reback@twosigma.com>

Closes pandas-dev#16736 from toobaz/index_what_you_can and squashes the following commits:

f77e2b3 [Pietro Battiston] BUG: do not raise UnsortedIndexError if sorting is not required

* DOC: whatsnew typos

* Test for pandas-dev#16726. unittest that ensures datetime is understood (pandas-dev#16744)

* Test for pandas-dev#16726. unittest that ensures datetime is understood

* Corrected the test as suggested by @TomAugspurger

* Fixed flake8 errors and warnings

* DOC: some rst fixes (pandas-dev#16763)

* DOC: Update Sphinx Deprecated Directive (pandas-dev#16512)

* MAINT: Drop Index.sym_diff (pandas-dev#16760)

Deprecated in 0.18.1

xref pandas-devgh-12591, pandas-devgh-12594

* MAINT: Drop pd.options.display.mpl_style (pandas-dev#16761)

Deprecated in 0.18.0

xref pandas-devgh-12190

* DOC: remove section on Panel4D support in HDF io (pandas-dev#16783)

* DOC: add section on data validation and library engarde (pandas-dev#16758)

* TST: register slow marker (pandas-dev#16797)

* TST: register slow marker

* Update setup.cfg

* BUG: Load data from a CategoricalIndex for dtype comparison, closes #… (pandas-dev#16738)

* BUG: Load data from a CategoricalIndex for dtype comparison, closes pandas-dev#16627

* Enable is_dtype_equal on CategoricalIndex, fixed some doc typos, added ordered CategoricalIndex test

* Flake8 windows suggestion

* Fixed some documentation/formatting issues, clarified the purpose of the test case.

* Bug in pd.merge() when merge/join with multiple categorical columns (pandas-dev#16786)

closes pandas-dev#16767

* BUG: Fix read of py3 PeriodIndex DataFrame HDF made in py2 (pandas-dev#16781) (pandas-dev#16790)

In Python3, reading a DataFrame with a PeriodIndex from an HDF file
created in Python2 would incorrectly return a DataFrame with an
Int64Index.

* BUG: Fix Series doesn't work in pd.astype(). Now treat Series as dict. (pandas-dev#16725)

* FIX: Allow aggregate to return dictionaries again pandas-dev#16741 (pandas-dev#16752)

* BUG: fix to_latex bold_rows option (pandas-dev#16708)

* Revert "CI: pin jemalloc=4.4.0 (pandas-dev#16727)" (pandas-dev#16731)

This reverts commit 09d8c22.

* CI: use dist/trusty rather than os/linux (pandas-dev#16806)

closes pandas-dev#16730

* TST: Verify columns entirely below chop_threshold still print (pandas-dev#6839) (pandas-dev#16809)

* BUG: clip dataframe column-wise pandas-dev#15390 (pandas-dev#16504)

* TST: Verify that positional shifting works with duplicate columns (pandas-dev#9092) (pandas-dev#16810)

* BUG: render dataframe as html do not produce duplicate element id's (pandas-dev#16780) (pandas-dev#16801)

* BUG: when rendering dataframe as html do not produce duplicate element id's pandas-dev#16780

* CLN: removing spaces in code causes pylint check to fail

* DOC: moved whatsnew comment to 0.20.3 release from 0.21.0

* fix BUG: ValueError when performing rolling covariance on multi indexed DataFrame (pandas-dev#16814)

* fix multi index names

* fix line length to pep8

* added what's new entry and reference issue number in test

* Update test_multi.py

* Update v0.20.3.txt

* BUG: rolling.cov with multi-index columns should preserve the MI (pandas-dev#16825)

xref pandas-dev#16814

* use network decorator on additional tests (pandas-dev#16824)

* BUG: TimedeltaIndex raising ValueError when slice indexing (pandas-dev#16637) (pandas-dev#16638)

* Bug issue 16819 Index.get_indexer_not_unique inconsistent return types vs get_indexer (pandas-dev#16826)

* TST: Verify that float columns stay float after pivot (pandas-dev#7142) (pandas-dev#16815)

* BUG/MAINT: Change default of inplace to False in pd.eval (pandas-dev#16732)

* BUG: kind parameter on categorical argsort (pandas-dev#16834)

* DOC: Updated cookbook to show usage of Grouper instead of TimeGrouper… (pandas-dev#16794)

* BUG: allow empty multiindex (fixes .isin regression, GH16777) (pandas-dev#16782)

* BUG: fix missing sort keyword for PeriodIndex.join (pandas-dev#16586)

* COMPAT: 32-bit compat for testing of indexers (pandas-dev#16849)

xref pandas-dev#16826

* BUG: fix infer frequency for business daily (pandas-dev#16683)

* DOC: Whatsnew updates (pandas-dev#16853)

[ci skip]

* TST/PKG: Move test HDF5 file to legacy (pandas-dev#16856)

It wasn't being picked up in our package data otherwise

* COMPAT: moar 32-bit compat for testing of indexers (pandas-dev#16861)

xref pandas-dev#16826

* MAINT: Drop the get_offset_name method (pandas-dev#16863)

Deprecated since 0.18.0

xref pandas-devgh-11834

* DOC: Fix missing parentheses in documentation (pandas-dev#16862)

* BUG: rolling.quantile does not return an interpolated result (pandas-dev#16247)

* ENH - Modify Dataframe.select_dtypes to accept scalar values (pandas-dev#16860)

* COMPAT: moar 32-bit compat for testing of indexers (pandas-dev#16869)

xref pandas-dev#16826

* Confirm that select was *not* clearer in 0.12 (pandas-dev#16878)

* Added tests for _get_dtype (pandas-dev#16845)

* BUG: Series.isin fails for categoricals (pandas-dev#16858)

* COMPAT with dateutil 2.6.1, fixed ambiguous tz dst behavior (pandas-dev#16880)

* fix wrongly named method (pandas-dev#16881)

* TST/PKG: Removed pandas.util.testing.slow definition (pandas-dev#16852)

* MAINT: Remove unused mock import (pandas-dev#16908)

We import it, set it as an attribute, and then don't use it.

* Let _get_dtype accept Categoricals and CategoricalIndex  (pandas-dev#16887)

* Fixes for pandas-dev#16896(TimedeltaIndex indexing regression for strings) (pandas-dev#16907)

* Fix for pandas-dev#16909(DeltatimeIndex.get_loc is not working on np.deltatime64 data type) (pandas-dev#16912)

* DOC: Recommend sphinx 1.5 for now (pandas-dev#16929)

For the SciPy sprint tomorrow, until the cause of the doc-building slowdown is fully identified.

* BUG: Allow value labels to be read with iterator (pandas-dev#16926)

Allow value labels to be read before the iterator has been used.
Fix issue where categorical data was incorrectly reformatted when
write_index was False.

closes pandas-dev#16923

* DOC: Update flake8 command instructions (pandas-dev#16919)

* TST: Don't assert that a bug exists in numpy (pandas-dev#16940)

Better to ignore the warning from the bug, rather than assert the bug is still there

After this change, numpy/numpy#9412 _could_ be backported to fix the bug

* CI: add .pep8speakes.yml

* CLN16668: remove OrderedDefaultDict (pandas-dev#16939)

* Change "pls" to "please" in error message (pandas-dev#16947)

* BUG: MultiIndex sort with ascending as list (pandas-dev#16937)

* DOC: Improving docstring of pop method (pandas-dev#16416) (pandas-dev#16520)

* PEP8

* WARN: add stacklevel to to_dict() UserWarning (pandas-dev#16927) (pandas-dev#16936)

* ERR: add stacklevel to to_dict() UserWarning (pandas-dev#16927)

* TST: Add warning testing to to_dict()

* Fix warning assertion on to_dict() test

* Add github issue to documentation on to_dict() warning test

* CI: fix pep8speaks .yml file

* DOC: whatsnew 0.21.0 edits

* CI: disable codecov reporting

* MAINT: Move series.remove_na to core.dtypes.missing.remove_na_arraylike

Closes pandas-devgh-16935

* Support non unique period indexes on join and merge operations (pandas-dev#16949)

* Support non unique period indexes on join and merge operations

* Add frame assertion on tests and release notes

* Explicitly use dtype int64 on arange

* BUG: Set secondary axis font size for `secondary_y` during plotting

The parameter was not being respected for `secondary_y`.

Closes pandas-devgh-12565

* DOC: more whatsnew fixes

* DOC: Reset index examples

closes pandas-dev#16416

Author: aernlund <awe220@nyumc.org>

Closes pandas-dev#16967 from aernlund/reset_index_docs and squashes the following commits:

3c6a4b6 [aernlund] DOC: added examples to reset_index
4838155 [aernlund] DOC: added examples to reset_index
2a51e2b [aernlund] DOC: added examples to reset_index

* channel from pandas to conda-forge (pandas-dev#16966)

* BUG: coercing of bools in groupby transform (pandas-dev#16895)

* DOC: misspelling in DatetimeIndex.indexer_between_time [CI skip] (pandas-dev#16963)

* CLN: some residual code removed, xref to pandas-dev#16761 (pandas-dev#16974)

* ENH: Create a 'Y' alias for date_range yearly frequency

Closes pandas-devgh-9313

* Revert "ENH: Create a 'Y' alias for date_range yearly frequency" (pandas-dev#16976)

This reverts commit 9c096d2, as it was prematurely made.

* DOC: behavior when slicing with missing bounds (pandas-dev#16932)

closes pandas-dev#16917

* TST: Add test for sub-char in read_csv (pandas-dev#16977)

Closes pandas-devgh-16893.

* DEPR: deprecate html.border option (pandas-dev#16970)

* DOC: document convention argument for resample() (pandas-dev#16965)

* DOC: document convention argument for resample()

* DOC: Clarify 'it' in aggregate doc (pandas-dev#16989)

Closes pandas-devgh-16988.

* CLN/COMPAT: for various py2/py3 in doc/bench scripts (pandas-dev#16984)

* PERF: SparseDataFrame._init_dict uses intermediary dict, not DataFrame (pandas-dev#16883)

Closes pandas-devgh-16773.

* MAINT: Drop line_width and height from options (pandas-dev#16993)

Deprecated since 0.11 and 0.12 respectively.

* COMPAT: Add back remove_na for seaborn (pandas-dev#16992)

Closes pandas-devgh-16971.

* COMPAT: np.full not available in all versions, xref pandas-dev#16773 (pandas-dev#17000)

* DOC, TST: Clarify whitespace behavior in read_fwf documentation (pandas-dev#16950)

Closes pandas-devgh-16772

* API: add infer_objects for soft conversions (pandas-dev#16915)

* API: add infer_objects for soft conversions

* doc fixups

* fixups

* doc

* BUG: np.inf now causes Index to upcast from int to float (pandas-dev#16996)

Closes pandas-devgh-16957.

* DOC: Make highlight functions match documentation (pandas-dev#16999)

Closes pandas-devgh-16998.

* BUG: Large object array isin

closes pandas-dev#16012

Author: Morgan Stuart <morgansstuart243@gmail.com>

Closes pandas-dev#16969 from Morgan243/large_array_isin and squashes the following commits:

31cb4b3 [Morgan Stuart] Removed unneeded details from whatsnew description
4b59745 [Morgan Stuart] Linting errors; additional test clarification
186607b [Morgan Stuart] BUG pandas-dev#16012 - fix isin for large object arrays

* BUG: reindex would throw when a categorical index was empty pandas-dev#16770

closes pandas-dev#16770

Author: ri938 <r_irv938@hotmail.com>
Author: Jeff Reback <jeff@reback.net>
Author: Tuan <tuan.d.tran@hotmail.com>
Author: Forbidden Donut <forbdonut@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <jeff@reback.net>

Closes pandas-dev#16820 from ri938/bug_issue16770 and squashes the following commits:

0e2d315 [ri938] Merge branch 'master' into bug_issue16770
9802288 [ri938] Update v0.20.3.txt
1f2865e [ri938] Update v0.20.3.txt
83fd749 [ri938] Update v0.20.3.txt
eab3192 [ri938] Merge branch 'master' into bug_issue16770
7acc09f [ri938] Minor correction to previous submit
6e8f1b3 [ri938] Minor corrections to previous submit (pandas-dev#16820)
9ed80f0 [ri938] Bring documentation into line with master branch.
26e1a60 [ri938] Move documentation of change to the next major release 0.21.0
59b17cd [Jeff Reback] BUG: rolling.cov with multi-index columns should preserve the MI (pandas-dev#16825)
5362447 [Tuan] fix BUG: ValueError when performing rolling covariance on multi indexed DataFrame (pandas-dev#16814)
800b40d [ri938] BUG: render dataframe as html do not produce duplicate element id's (pandas-dev#16780) (pandas-dev#16801)
a725fbf [Forbidden Donut] BUG: Fix read of py3 PeriodIndex DataFrame HDF made in py2 (pandas-dev#16781) (pandas-dev#16790)
8f8e3d6 [ri938] TST: register slow marker (pandas-dev#16797)
0645868 [ri938] Add backticks in documentation
0a20024 [ri938] Minor correction to previous submit
69454ec [ri938] Minor corrections to previous submit (pandas-dev#16820)
3092bbc [ri938] BUG: reindex would throw when a categorical index was empty pandas-dev#16770

* BUG: Don't error with empty Series for .isin (pandas-dev#17006)

Empty Series initializes to float64, even when the data type is object for .isin,
leading to an error with membership.

Closes pandas-devgh-16991.
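
The behavior this fix targets can be sketched as follows. This is a minimal illustration with made-up data, not the test added in the commit; the empty Series is given an explicit `object` dtype so the example does not depend on the default dtype of an empty Series.

```python
import pandas as pd

# Illustrative data only; not taken from the pandas test suite.
values = pd.Series(["a", "b", "c"])

# Explicit dtype, so the example does not rely on how an empty Series is typed by default.
empty = pd.Series([], dtype=object)

# Nothing can be a member of an empty collection, so every entry is False.
result = values.isin(empty)
print(result.tolist())
```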

* ENH: Use 'Y' as an alias for end of year (pandas-dev#16978)

Closes pandas-devgh-9313
Redo of pandas-devgh-16958

* DOC: infer_objects doc fixup (pandas-dev#17018)

* Fixes SparseSeries initiated with dictionary raising AttributeError (pandas-dev#16960)

* DOC: Improving docstring of reset_index method (pandas-dev#16416) (pandas-dev#16975)

* DOC: add warning to append about inefficiency (pandas-dev#17017)

* DOC : Remove redundant backtick (pandas-dev#17025)

* DOC: Document business frequency aliases (pandas-dev#17028)

Follow-up to pandas-devgh-16978.

* DOC: Fix double back-tick in 'Reshaping by Melt' section (pandas-dev#17030)

See current stable docs for the issue: https://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-melt

The double ` is causing the entire paragraph to be fixed width until the next double `. This commit removes the extra "`"

* Define DataFrame plot methods in DataFrame (pandas-dev#17020)

* CLN: move safe_sort from core.algorithms to core.sorting (pandas-dev#17034)

COMPAT: safe_sort will only coerce list-likes to object, not a numpy string type

xref: pandas-dev#17003 (comment)

* DOC: Fixed Minor Typo (pandas-dev#17043)

Cocumentation to Documentation

* BUG: do not cast ints to floats if inputs of crosstab are not aligned (pandas-dev#17011)

closes pandas-dev#17005

* BUG in merging categorical dates

closes pandas-dev#16900

Author: Dave Willmer <dave.willmer@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <jeff@reback.net>

Closes pandas-dev#16986 from dwillmer/cat_fix and squashes the following commits:

1ea1977 [Dave Willmer] Minor tweaks + comment
21a35a0 [Dave Willmer] Merge branch 'cat_fix' of https://github.com/dwillmer/pandas into cat_fix
04d5404 [Dave Willmer] Update tests
3cc5c24 [Dave Willmer] Merge branch 'master' into cat_fix
5e8e23b [Dave Willmer] Add whatsnew item
b82d117 [Dave Willmer] Lint fixes
a81933d [Dave Willmer] Remove unused import
218da66 [Dave Willmer] Generic solution to categorical problem
48e7163 [Dave Willmer] Test inner join
8843c10 [Dave Willmer] Fix TypeError when merging categorical dates

* BUG: __setitem__ with a tuple induces NaN with a tz-aware DatetimeIndex (pandas-dev#16889) (pandas-dev#16897)

* Added test for _get_dtype_type. (pandas-dev#16899)

* BUG/API: dtype inconsistencies in .where / .setitem / .putmask / .fillna (pandas-dev#16821)

* CLN/BUG: fix ndarray assignment may cause unexpected cast

supersedes pandas-dev#14145
closes pandas-dev#14001

* API: This fixes a number of inconsistencies and API issues
w.r.t. dtype conversions.

This is a reprise of pandas-dev#14145 & pandas-dev#16408.

This removes some code from the core structures & pushes it to internals,
where the primitives are made more consistent.

This should allow us to be a bit more consistent for pandas2 type things.

closes pandas-dev#16402
supersedes pandas-dev#14145
closes pandas-dev#14001

CLN: remove unneeded code in internals; use split_and_operate when possible

* BUG: Improved thread safety for read_html() GH16928 (pandas-dev#16930)

* Fixed 'add_methods' when the 'select' argument is specified. (pandas-dev#17045)

* TST: Fix error message check in np.argsort comparison (pandas-dev#17051)

Closes pandas-devgh-17046.

* TST: Move some Series ctor tests to SharedWithSparse (pandas-dev#17050)

* BUG: Made SparseDataFrame.fillna() fill all NaNs

A continuation of pandas-dev#16178
closes pandas-dev#16112
closes pandas-dev#16178

Author: Kernc <kerncece@gmail.com>
Author: keitakurita <kris337jbn@yahoo.co.jp>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <jeff@reback.net>

Closes pandas-dev#16892 from kernc/sparse-fillna and squashes the following commits:

c1cd33e [Kernc] fixup! BUG: Made SparseDataFrame.fillna() fill all NaNs
2974232 [Kernc] fixup! BUG: Made SparseDataFrame.fillna() fill all NaNs
4bc01a1 [keitakurita] BUG: Made SparseDataFrame.fillna() fill all NaNs

* BUG: Use size_t to avoid array index overflow; add missing malloc of error_msg

Fix a few locations where a parser's `error_msg` buffer is written to
without having been previously allocated. This manifested as a double
free during exception handling code making use of the `error_msg`.
Additionally, use `size_t/ssize_t` where array indices or lengths will
be stored. Previously, int32_t was used and would overflow on columns
with very large amounts of data (i.e. greater than INTMAX bytes).

xref pandas-dev#14696
closes pandas-dev#16798
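
The overflow described above can be illustrated without the C parser. The sketch below uses `ctypes` only to mimic fixed-width integers; it is not the parser code, and the wraparound shown is how `ctypes` truncates the value (in C itself, signed overflow is undefined behavior).

```python
import ctypes

INT32_MAX = 2**31 - 1

# One past the largest 32-bit signed value, e.g. a byte offset in a very large column.
offset = INT32_MAX + 1

# ctypes does no overflow checking: the value is truncated to 32 bits and wraps negative.
wrapped = ctypes.c_int32(offset).value

# A 64-bit integer holds the same offset without loss.
wide = ctypes.c_int64(offset).value

print(wrapped, wide)
```

A negative index of this kind is what makes a 32-bit counter unsafe for buffer lengths; the commit list above notes that the final patch settled on `int64_t` rather than `size_t` for portability.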

Author: Jeff Knupp <jeff.knupp@enigma.com>
Author: Jeff Knupp <jeff@jeffknupp.com>

Closes pandas-dev#17040 from jeffknupp/16790-core-on-large-csv and squashes the following commits:

6a1ba23 [Jeff Knupp] Clear up prose
a5d5677 [Jeff Knupp] Fix linting issues
4380c53 [Jeff Knupp] Fix linting issues
7b1cd8d [Jeff Knupp] Fix linting issues
e3cb9c1 [Jeff Knupp] Add unit test plus '--high-memory' option, *off by default*.
2ab4971 [Jeff Knupp] Remove debugging code
2930eaa [Jeff Knupp] Fix line length to conform to linter rules
e4dfd19 [Jeff Knupp] Revert printf format strings; fix more comment alignment
3171674 [Jeff Knupp] Fix some leftover size_t references
0985cf3 [Jeff Knupp] Remove debugging code; fix type cast
669d99b [Jeff Knupp] Fix linting errors re: line length
1f24847 [Jeff Knupp] Fix comment alignment; add whatsnew entry
e04d12a [Jeff Knupp] Switch to use int64_t rather than size_t due to portability concerns.
d5c75e8 [Jeff Knupp] BUG: Use size_t to avoid array index overflow; add missing malloc of error_msg

* TST: remove some test warnings in parser tests (pandas-dev#17057)

TST: move highmemory test to proper location in c_parser_only

xref pandas-dev#16798

* DOC: Add more examples for reset_index (pandas-dev#17055)

* MAINT: Add dash in high memory message

Follow-up to pandas-devgh-17057.

* MAINT: kwards --> kwargs in parsers.pyx

* CLN: Cleanup comments in before_install_travis.sh

envars.sh doesn't exist anymore.  In fact, it's been gone for a while.

* MAINT: Remove duplicate Series sort_index check

Duplicate boolean validation check for sort_index in series/test_validate.py

* BLD: Pin pyarrow=0.4.1 (pandas-dev#17065)

Addresses pandas-devgh-17064.

Also add some additional build information when calling `pd.show_versions`

* ENH: provide "inplace" argument to set_axis()

closes pandas-dev#14636

Author: Pietro Battiston <me@pietrobattiston.it>

Closes pandas-dev#16994 from toobaz/set_axis_inplace and squashes the following commits:

8fb9d0f [Pietro Battiston] REF: adapt NDFrame.set_axis() calls to new signature
409f502 [Pietro Battiston] ENH: provide "inplace" argument to set_axis(), change signature

* BUG: Fix parser field type compatibility on 32-bit systems. (pandas-dev#17071)

Closes pandas-devgh-17063

* COMPAT: rename isnull -> isna, notnull -> notna (pandas-dev#16972)

closes pandas-dev#15001

* BUG: Thoroughly dedup columns in read_csv (pandas-dev#17060)

* ENH: Add skipna parameter to infer_dtype (pandas-dev#17066)

Currently defaults to False for backwards compatibility.  Will default to True in the future.

Closes pandas-devgh-17059.
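
A short sketch of what the `skipna` flag changes, using `pandas.api.types.infer_dtype` with illustrative input. `skipna` is passed explicitly in both calls so the example does not depend on which default a given pandas version uses.

```python
import numpy as np
from pandas.api.types import infer_dtype

# Illustrative input: strings with one missing value.
data = ["a", np.nan, "b"]

# With skipna=False the NaN takes part in inference alongside the strings.
with_nan = infer_dtype(data, skipna=False)

# With skipna=True the NaN is ignored and only the strings are considered.
without_nan = infer_dtype(data, skipna=True)

print(with_nan, without_nan)
```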

* MAINT: Remove unused variable in test_scalar.py

The "expected" variable is unused at the end of a test in indexing/test_scalar.py

* TST: Add tests/indexing/ and reshape/ to setup.py (pandas-dev#17076)

Looks like we just forgot about them.  Oops.

* CI: partially revert pandas-dev#17065, un-pin pyarrow on some builds

* DOC: whatsnew typos

* TST: Check more error messages in tests (pandas-dev#17075)

* BUG: Respect dtype when calling pivot_table with margins=True

closes pandas-dev#17013

This fix actually exposed an occurrence of pandas-dev#17035 in an existing test
(as well as in one I added).

Author: Pietro Battiston <me@pietrobattiston.it>

Closes pandas-dev#17062 from toobaz/pivot_margin_int and squashes the following commits:

2737600 [Pietro Battiston] Removed now obsolete workaround
956c4f9 [Pietro Battiston] BUG: respect dtype when calling pivot_table with margins=True

* MAINT: Add missing space in parsers.pyx

"2< heuristic" --> "2 < heuristic"

* MAINT: Add missing paren around print statement

Stray verbose print statement in parsers.pyx was bare without any parentheses.

* DOC: fix typos in missing.rst

xref pandas-dev#16972

* DOC: further clean-up null/na changes (pandas-dev#17113)

* BUG: Allow pd.unique to accept tuple of strings (pandas-dev#17108)

* BUG: Allow Series with same name with crosstab (pandas-dev#16028)

Closes pandas-devgh-13279

* COMPAT: make sure use_inf_as_null is deprecated (pandas-dev#17126)

closes pandas-dev#17115

* CI: bump version of xlsxwriter to 0.5.2 (pandas-dev#17142)

* DOC: Clean up instructions in ISSUE_TEMPLATE (pandas-dev#17146)

* Add missing space to the NotImplementedError's message for compound dtypes (pandas-dev#17140)

* DOC: (de)type the return value of concat (pandas-dev#17079) (pandas-dev#17119)

* BUG: Thoroughly dedup column names in read_csv (pandas-dev#17095)

* DOC: Additions/updates to documentation (pandas-dev#17150)

* ENH: add to/from_parquet with pyarrow & fastparquet (pandas-dev#15838)

* DOC: doc typos, xref pandas-dev#15838

* TST: test for categorical index monotonicity (pandas-dev#17152)

* correctly determine bottleneck version

* tests for categorical index monotonicity

* fix Index.is_monotonic to point to Index.is_monotonic_increasing directly

* MAINT: Remove non-standard and inconsistently-used imports (pandas-dev#17085)

* DOC: typos in whatsnew

* DOC: whatsnew 0.21.0 fixes

* BUG: Fix CSV parsing of singleton list header (pandas-dev#17090)

Closes pandas-devgh-7757.

* ENH: Support strings containing '%' in add_prefix/add_suffix (pandas-dev#17151) (pandas-dev#17162)

* REF: repr - allow block to override values that get formatted (pandas-dev#17143)

* MAINT: Drop unnecessary newlines in issue template

* remove direct import of nan

Author: Brock Mendel <jbrockmendel@gmail.com>

Closes pandas-dev#17185 from jbrockmendel/dont_import_nan and squashes the following commits:

ee260b8 [Brock Mendel] remove direct import of nan

* use == to test String equality (pandas-dev#17171)

* ENH: Add warning when setting into nonexistent attribute (pandas-dev#16951)

closes pandas-dev#7175
closes pandas-dev#5904

* DOC: added string processing comparison with SAS  (pandas-dev#16497)

* CLN: remove unused get methods in internals (pandas-dev#17169)

* Remove unused get methods that would raise AttributeError if called

* Remove unnecessary import

* TST: Partial Boolean DataFrame Indexing (pandas-dev#17186)

Closes pandas-devgh-17170

* CLN: Reformat docstring for IPython fixture

* Define Series.plot and Series.hist in class definition (pandas-dev#17199)

* BUG: support pandas objects in iloc with old numpy versions (pandas-dev#17194)

closes pandas-dev#17193

* Implement _make_accessor classmethod for PandasDelegate (pandas-dev#17166)

* Create ABCDateOffset (pandas-dev#17165)

* BUG: resample and apply modify the index type for empty Series (pandas-dev#17149)

* DOC: Updated NDFrame.astype docs (pandas-dev#17203)

* MAINT: Minor touch-ups to GitHub PULL_REQUEST_TEMPLATE (pandas-dev#17207)

Remove leading space from task-list so that tasks aren't nested.

* CLN: replace %s syntax with .format in core.computation (pandas-dev#17209)

* Bugfix for multilevel columns with empty strings in Python 2 (pandas-dev#17099)

* CLN/ASV clean-up frame stat ops benchmarks (pandas-dev#17205)

* BUG: Rolling apply on DataFrame with Datetime index returns NaN (pandas-dev#17156)

* CLN: Remove import exception handling (pandas-dev#17218)

Imports should succeed on all versions of Python that pandas supports.

* MAINT: Remove extra the's in deprecation messages (pandas-dev#17222)

* DOC: Patch docs in _decorators.py

* CLN: replace %s syntax with .format in pandas.util (pandas-dev#17224)

* Add 'See also' sections (pandas-dev#17223)

* move pivot_table doc-string to DataFrame (pandas-dev#17174)

* Remove import of pandas as pd in core.window (pandas-dev#17233)

* TST: Move more frame tests to SharedWithSparse (pandas-dev#17227)

* REF: _get_objs_combined_axis (pandas-dev#17217)

* ENH/PERF: Remove frequency inference from .dt accessor (pandas-dev#17210)

* ENH/PERF: Remove frequency inference from .dt accessor

* BENCH: Add DatetimeAccessor benchmark

* DOC: Whatsnew

* Fix apparent typo in tests (pandas-dev#17247)

* COMPAT: avoid calling getsizeof() on PyPy

closes pandas-dev#17228

Author: mattip <matti.picus@gmail.com>

Closes pandas-dev#17229 from mattip/getsizeof-unavailable and squashes the following commits:

d2623e4 [mattip] COMPAT: avoid calling getsizeof() on PyPy

* CLN: replace %s syntax with .format in pandas.core.reshape (pandas-dev#17252)

Replaced %s syntax with .format in pandas.core.reshape.  Additionally, made some of the existing positional .format code more explicit.
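
The kind of change this cleanup makes can be sketched as follows. The message text and variable names are hypothetical and not taken from `pandas.core.reshape`.

```python
# Hypothetical values and message, for illustration only.
name, count = "values", 3

# Old style: %-interpolation.
old_style = "Length of %s must be %d" % (name, count)

# Positional .format, the direct replacement.
positional = "Length of {} must be {}".format(name, count)

# Explicit keyword fields, the "more explicit" variant the commit mentions.
explicit = "Length of {name} must be {count}".format(name=name, count=count)

print(explicit)
```

All three produce the same string; the keyword form makes it clearer which argument fills which slot when a message has several fields.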

* ENH: Infer compression from non-string paths (pandas-dev#17206)

* Fix bugs in IntervalIndex.is_non_overlapping_monotonic (pandas-dev#17238)

* BUG: Fix behavior of argmax and argmin with inf (pandas-dev#16449) (pandas-dev#16449)

Closes pandas-dev#13595

* CLN: Remove have_pytz (pandas-dev#17266)

Closes pandas-devgh-17251

* CLN: replace %s syntax with .format in core.dtypes and core.sparse (pandas-dev#17270)

* Replace imports of * with explicit imports (pandas-dev#17269)

xref pandas-dev#17234

* TST: pytest deprecation warnings GH17197 (pandas-dev#17253)

Test parameters with marks are updated according to the updated API of
Pytest.
https://docs.pytest.org/en/latest/changelog.html#pytest-3-2-0-2017-07-30
https://docs.pytest.org/en/latest/parametrize.html

* Handle more date/datetime/time formats (pandas-dev#15871)

* DOC: add example on json_normalize (pandas-dev#16438)

* BUG: Have object dtype for empty Categorical.categories (pandas-dev#17249)

* BUG: Have object dtype for empty Categorical ctor

Previously we had a `Float64Index`, which is inconsistent with, e.g., the
regular Index constructor.

* TST: Update tests in multi for new return

Previously these worked around the return type by wrapping list-likes
in `np.array` and relying on that to cast to float. These workarounds are no
longer necessary.

* TST: Update union_categorical tests

This relied on `NaN` being a float and empty being a float. Not a necessary
test anymore.

* TST: set object dtype
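
The consistency the empty-Categorical change above aims for can be checked with a few lines. This is a minimal sketch, assuming a pandas version that includes the fix.

```python
import pandas as pd

# An empty Categorical has no categories to infer a dtype from.
empty = pd.Categorical([])
categories_dtype = empty.categories.dtype

# The regular Index constructor, for comparison.
plain_index_dtype = pd.Index([]).dtype

print(categories_dtype, plain_index_dtype)
```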

* CLN: replace %s syntax with .format in pandas.tseries (pandas-dev#17290)

* TST: parameterize consistency tests for rolling/expanding windows (pandas-dev#17292)

* FIX: define `DataFrame.items` for all versions of python (pandas-dev#17214)

* PERF: Update ASV publish config (pandas-dev#17293)

Stricter cutoffs for considering regressions

[ci skip]

* DOC: Expand docstrings for head / tail methods (pandas-dev#16941)

* MAINT: Use set literal for unsupported + depr args

Initializes unsupported and deprecated argument sets with set literals instead of the set constructor in pandas/io/parsers.py, as the former is slightly faster than the latter.
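
A small sketch of the two spellings, with placeholder option names rather than the actual argument sets in `pandas/io/parsers.py`. The timing lines are there to measure the difference locally; no particular numbers are assumed.

```python
import timeit

# Placeholder names; the real sets in parsers.py hold parser argument names.
literal = {"option_a", "option_b"}
constructed = set(["option_a", "option_b"])

# Both spellings build the same set.
print(literal == constructed)

# The literal avoids building a temporary list and calling set().
t_literal = timeit.timeit("{'option_a', 'option_b'}", number=100_000)
t_constructed = timeit.timeit("set(['option_a', 'option_b'])", number=100_000)
print(t_literal, t_constructed)
```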

* DOC: Add proper docstring to maybe_convert_indices

Patches several spelling errors and expands current doc to a proper doc-string.

* DOC: Improving docstring of take method (pandas-dev#16948)

* BUG: Fixed regex in asv.conf.json (pandas-dev#17300)

In pandas-dev#17293 I messed up the syntax. I
used a glob instead of a regex. According to the docs at
http://asv.readthedocs.io/en/latest/asv.conf.json.html#regressions-thresholds we
want to use a regex. I've actually manually tested this change and verified that
it works.

[ci skip]

* Remove unnecessary usage of _TSObject (pandas-dev#17297)

* BUG: clip should handle null values

closes pandas-dev#17276

Author: Michael Gasvoda <mgasvoda@mercatus.gmu.edu>
Author: mgasvoda <mgasvoda01@gmail.com>

Closes pandas-dev#17288 from mgasvoda/master and squashes the following commits:

a1dbdf2 [mgasvoda] Merge branch 'master' into master
9333952 [Michael Gasvoda] Checking output of tests
4e0464e [Michael Gasvoda] fixing whatsnew text
c442040 [Michael Gasvoda] formatting fixes
7e23678 [Michael Gasvoda] formatting updates
781ea72 [Michael Gasvoda] whatsnew entry
d9627fe [Michael Gasvoda] adding clip tests
9aa0159 [Michael Gasvoda] Treating na values as none for clips

* BUG: fillna returns frame when inplace=True if value is a dict (pandas-dev#16156) (pandas-dev#17279)

* CLN: Index.append() refactoring (pandas-dev#16236)

* DEPS: set min versions (pandas-dev#17002)

closes pandas-dev#15206, numpy >= 1.9
closes pandas-dev#15543, matplotlib >= 1.4.3
scipy >= 0.14.0

* CLN: replace %s syntax with .format in core.tools, algorithms.py, base.py (pandas-dev#17305)

* BUG: Fix strange behaviour of Series.iloc on MultiIndex Series (pandas-dev#17148) (pandas-dev#17291)

* DOC: Add module doc-string to tseries/api.py

* MAINT: Clean up docs in pandas/errors/__init__.py

* CLN: replace %s syntax with .format in missing.py, nanops.py, ops.py (pandas-dev#17322)

Replaced %s syntax with .format in missing.py, nanops.py, ops.py. Additionally, made some of the existing positional .format code more explicit.

* Make pd.Period immutable (pandas-dev#17239)

* Bug: groupby multiindex levels equals rows (pandas-dev#16859)

closes pandas-dev#16843

* BUG: Cannot use tz-aware origin in to_datetime (pandas-dev#16842)

closes pandas-dev#16842

Author: step4me <prosikeffect@gmail.com>

Closes pandas-dev#17244 from step4me/step4me-feature and squashes the following commits:

09d051d [step4me] BUG: Cannot use tz-aware origin in to_datetime (pandas-dev#16842)

* Replace usage of total_seconds compat func with timedelta method (pandas-dev#17289)

* CLN: replace %s syntax with .format in core/indexing.py (pandas-dev#17357)

Progress toward issue pandas-dev#16130. Converted old string formatting to new string formatting in core/indexing.py.

* DOC: Point to dev-docs in issue template (pandas-dev#17353)

[ci skip]

* CLN: remove total_seconds compat from json (pandas-dev#17341)

* CLN: Move test_intersect_str_dates (pandas-dev#17366)

Moves test_intersect_str_dates from tests/indexes/test_range.py to tests/indexes/test_base.py.

* BUG: Respect dups in reindexing CategoricalIndex (pandas-dev#17355)

When the indexer is identical to the elements.
We should still return duplicates when the indexer
contains duplicates.

Closes pandas-devgh-17323.

* Unify Index._dir_* with Series implementation (pandas-dev#17117)

* BUG: make order of index from pd.concat deterministic (pandas-dev#17364)

closes pandas-dev#17344

* Fix typo that causes several NaT methods to have incorrect docstrings (pandas-dev#17327)

* CLN: replace %s syntax with .format in io/formats/format.py (pandas-dev#17358)

Progress toward issue pandas-dev#16130. Converted old string formatting to new string formatting in io/formats/format.py.

* PKG: Added pyproject.toml for PEP 518 (pandas-dev#16745)

Declaring build-time requirements: https://www.python.org/dev/peps/pep-0518/

* DOC: Update Overview page in documentation (pandas-dev#17368)

* Update Overview page in documentation

* DOC Revise Overview page

* DOC Make further revisions in Overview webpage

* Update overview.rst

Remove references to Panel

* API: Have MultiIndex constructors always return a MI (pandas-dev#17236)

* API: Have MultiIndex constructors return MI

This removes the special case for MultiIndex constructors returning
an Index if all the levels are length-1. Now this will return a
MultiIndex with a single level.

This is a backwards-incompatible change, with no clear method for
deprecation, so we're making a clean break.

Closes pandas-dev#17178

* fixup! API: Have MultiIndex constructors return MI

* Update for comments
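
The MultiIndex constructor change described above can be seen in a short sketch, assuming a pandas version that includes it. The labels are illustrative.

```python
import pandas as pd

# Single-element tuples: only one level's worth of labels.
from_tuples = pd.MultiIndex.from_tuples([("a",), ("b",)])

# The same labels supplied as a single array.
from_arrays = pd.MultiIndex.from_arrays([["a", "b"]])

# Both constructors keep the MultiIndex type instead of collapsing to a flat Index.
print(type(from_tuples).__name__, from_tuples.nlevels)
print(type(from_arrays).__name__, from_arrays.nlevels)
```

Code that relied on the old special case can still obtain a flat Index explicitly, for example with `get_level_values(0)`.
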
jowens pushed a commit to jowens/pandas that referenced this pull request Sep 20, 2017
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
Labels
Clean Internals Related to non-user accessible pandas implementation

4 participants