BUG: agg with dictlike and non-unique col will return wrong type #52115

Merged
merged 18 commits into from
Apr 11, 2023

Conversation

luke396
Contributor

@luke396 luke396 commented Mar 22, 2023

@luke396 luke396 marked this pull request as draft March 22, 2023 13:04
@luke396 luke396 marked this pull request as ready for review March 28, 2023 08:04
@luke396 luke396 changed the title BUG: agg in non-unique col BUG: agg with dictlike and non-unique col will return wrong type Mar 28, 2023
@mroeschke mroeschke requested a review from rhshadrach March 31, 2023 17:26
@mroeschke mroeschke added the Apply Apply, Aggregate, Transform, Map label Mar 31, 2023
Member

@rhshadrach rhshadrach left a comment

I think we can at least partially avoid some of the perf regression here; suggestions below. Can you run the DataFrame / GroupBy ASVs on this?

@luke396
Contributor Author

luke396 commented Apr 1, 2023

I've made some changes, but I'm sorry, I'm not familiar with ASVs yet. I'm still learning and may need some time.

> Can you run DataFrame / Groupby ASVs on this.

@luke396
Contributor Author

luke396 commented Apr 2, 2023

Running asv continuous -f 1.1 upstream/main bug-agg-nonunique-col -b ^groupby (where 'bug-agg-nonunique-col' is the current branch name) gives:

       before           after         ratio
     [d69eaae7]       [71ec2d63]
     <main>           <bug-agg-nonunique-col>
+        819±50μs         945±50μs     1.15  groupby.Categories.time_groupby_sort
+         546±9μs        625±100μs     1.14  groupby.GroupByMethods.time_dtype_as_field('int', 'rank', 'transformation', 1)
+     1.16±0.04ms      1.32±0.04ms     1.14  groupby.GroupByMethods.time_dtype_as_group('float', 'min', 'transformation', 5)
+     1.29±0.04ms      1.47±0.06ms     1.14  groupby.GroupByMethods.time_dtype_as_group('datetime', 'value_counts', 'direct', 1)
+        783±20μs         885±30μs     1.13  groupby.GroupByMethods.time_dtype_as_group('datetime', 'cummin', 'transformation', 5)
+        328±10μs         371±10μs     1.13  groupby.GroupByMethods.time_dtype_as_group('int', 'skew', 'transformation', 1)
+     1.41±0.02ms      1.59±0.07ms     1.13  groupby.GroupByMethods.time_dtype_as_field('datetime', 'last', 'transformation', 5)
+         510±5μs         575±20μs     1.13  groupby.GroupByMethods.time_dtype_as_group('int', 'median', 'transformation', 1)
+         432±9μs         485±50μs     1.12  groupby.GroupByMethods.time_dtype_as_field('object', 'count', 'transformation', 1)
+     1.01±0.02ms      1.13±0.02ms     1.12  groupby.GroupByMethods.time_dtype_as_group('int16', 'first', 'transformation', 5)
+         317±2μs          356±8μs     1.12  groupby.GroupByMethods.time_dtype_as_field('uint', 'cummin', 'transformation', 1)
+     1.48±0.04ms      1.66±0.08ms     1.12  groupby.GroupByMethods.time_dtype_as_field('float', 'value_counts', 'direct', 1)
+     1.03±0.03ms      1.16±0.04ms     1.12  groupby.GroupByMethods.time_dtype_as_group('int', 'first', 'transformation', 5)
+     1.18±0.04ms      1.31±0.07ms     1.12  groupby.GroupByMethods.time_dtype_as_field('uint', 'cummax', 'transformation', 5)
+        928±30μs      1.03±0.04ms     1.11  groupby.GroupByMethods.time_dtype_as_group('uint', 'bfill', 'transformation', 5)
+     3.66±0.09ms       4.05±0.2ms     1.11  groupby.Categories.time_groupby_ordered_nosort
+        471±20μs          521±9μs     1.11  groupby.GroupByMethods.time_dtype_as_group('int16', 'any', 'transformation', 1)
+        335±20μs          369±5μs     1.10  groupby.GroupByMethods.time_dtype_as_field('uint', 'skew', 'transformation', 1)
+        652±10μs         718±20μs     1.10  groupby.GroupByMethods.time_dtype_as_group('object', 'rank', 'direct', 5)
+        926±40μs      1.02±0.03ms     1.10  groupby.RankWithTies.time_rank_ties('float64', 'average')
+         431±3μs         475±20μs     1.10  groupby.GroupByMethods.time_dtype_as_group('object', 'any', 'transformation', 1)
-        601±30μs          544±8μs     0.91  groupby.GroupByMethods.time_dtype_as_field('int16', 'sum', 'transformation', 1)
-        487±30μs          441±3μs     0.90  groupby.GroupByMethods.time_dtype_as_field('float', 'any', 'transformation', 1)
-     1.46±0.03ms      1.31±0.02ms     0.90  groupby.GroupByMethods.time_dtype_as_field('float', 'any', 'transformation', 5)
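As a reading aid for the table above, here is a minimal sketch of how the ratio column and the +/- markers relate to the -f 1.1 factor. It assumes the ratio is simply after/before and that a row is flagged when the ratio exceeds the factor or falls below its reciprocal; asv additionally weighs statistical significance, which this sketch ignores. The `classify` helper is hypothetical and not part of asv.

```python
def classify(before_us, after_us, factor=1.1):
    """Return (ratio, marker) for one benchmark row.

    Assumption: ratio = after / before; '+' means slower than the factor
    allows, '-' means faster, ' ' means within the threshold.
    """
    ratio = after_us / before_us
    if ratio > factor:
        marker = "+"
    elif ratio < 1 / factor:
        marker = "-"
    else:
        marker = " "
    return round(ratio, 2), marker


# Two rows taken from the table above (times in microseconds).
rows = [
    ("groupby.Categories.time_groupby_sort", 819.0, 945.0),
    ("groupby.GroupByMethods.time_dtype_as_field('int16', 'sum', 'transformation', 1)", 601.0, 544.0),
]
for name, before, after in rows:
    ratio, marker = classify(before, after)
    print(f"{marker} {ratio:.2f}  {name}")
```

With a factor of 1.1, anything between roughly 0.91 and 1.10 would not be listed at all, which is why every reported ratio sits at or beyond those bounds.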

Running asv continuous -f 1.1 upstream/main bug-agg-nonunique-col -b ^frame reports no significant changes.

Given the above, further performance improvement appears to be needed.

> Instead of storing results in a dictionary, can you store them in two lists: result_index, result_data. Then you can do Series(result_data, index=result_index) and you don't need separate logic here.

So does this mean we should replace the dict-like results with result_data and result_index throughout the code related to agg_dict_like?
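To illustrate why two lists help with non-unique column labels, here is a minimal pure-Python sketch. The labels and values are made up for illustration, and this is not the code from the PR; only the names result_index and result_data come from the review suggestion.

```python
# Hypothetical per-column results for a frame whose label "A" appears twice.
labels = ["A", "A", "B"]
values = [3, 30, 6]

# Dict storage: the second "A" silently overwrites the first.
as_dict = {}
for label, value in zip(labels, values):
    as_dict[label] = value

# Two parallel lists: every result survives, duplicates included.
result_index = []
result_data = []
for label, value in zip(labels, values):
    result_index.append(label)
    result_data.append(value)

print(as_dict)        # {'A': 30, 'B': 6}
print(result_index)   # ['A', 'A', 'B']
print(result_data)    # [3, 30, 6]
```

Because a pandas Series accepts a non-unique index, the two lists can be passed directly as Series(result_data, index=result_index) without a special case for duplicated labels.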

Additionally, as a further improvement to the code, it may help to use list comprehensions and extend (rather than append).
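A small sketch of the extend-versus-append idea, assuming each requested key can produce several results (one per matching column). The input data and variable names are hypothetical and only meant to show that both styles build the same lists.

```python
# Hypothetical input: each key maps to one result per matching column.
per_key_results = [("A", [3, 30]), ("B", [6])]

# Nested loop with append: one method call per element.
index_append, data_append = [], []
for key, results in per_key_results:
    for res in results:
        index_append.append(key)
        data_append.append(res)

# Same outcome with extend: one method call per key.
index_extend, data_extend = [], []
for key, results in per_key_results:
    index_extend.extend([key] * len(results))
    data_extend.extend(results)

# A nested list comprehension flattens the data in a single expression.
flat_data = [res for _, results in per_key_results for res in results]

print(index_extend, data_extend, flat_data)
```

Whether this is measurably faster depends on the sizes involved; for a handful of columns the difference is likely to be lost in benchmark noise.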

cc @rhshadrach

@luke396
Contributor Author

luke396 commented Apr 3, 2023

Currently, asv continuous -f 1.1 upstream/main bug-agg-nonunique-col -b ^groupby returns the results below, and asv continuous -f 1.1 upstream/main bug-agg-nonunique-col -b ^frame shows no significant changes.

       before           after         ratio
     [bcc5160b]       [ade2efe5]
     <main>           <bug-agg-nonunique-col>
-     1.09±0.04ms         989±20μs     0.91  groupby.RankWithTies.time_rank_ties('float64', 'max')
-        10.2±1ms      9.27±0.05ms     0.91  groupby.Nth.time_frame_nth_any('float32')
-        98.6±3μs       89.4±0.5μs     0.91  groupby.GroupByMethods.time_dtype_as_field('int', 'prod', 'direct', 1)
-     1.18±0.04ms      1.07±0.01ms     0.91  groupby.GroupByMethods.time_dtype_as_group('int16', 'sum', 'transformation', 5)
-        462±30μs          419±6μs     0.91  groupby.GroupByMethods.time_dtype_as_group('int16', 'bfill', 'transformation', 1)
-        597±20μs          541±5μs     0.91  groupby.GroupByMethods.time_dtype_as_field('object', 'any', 'transformation', 1)
-        268±10μs          242±5μs     0.90  groupby.GroupByMethods.time_dtype_as_field('datetime', 'diff', 'direct', 1)
-      1.24±0.1ms      1.12±0.02ms     0.90  groupby.GroupByMethods.time_dtype_as_group('int', 'median', 'transformation', 5)
-         340±6μs          306±6μs     0.90  groupby.GroupByMethods.time_dtype_as_field('int', 'shift', 'transformation', 1)
-        519±10μs          468±8μs     0.90  groupby.GroupByMethods.time_dtype_as_field('int', 'std', 'transformation', 1)
-         103±6μs       93.0±0.7μs     0.90  groupby.GroupByMethods.time_dtype_as_group('float', 'max', 'direct', 1)
-        900±60μs          810±4μs     0.90  groupby.GroupByMethods.time_dtype_as_group('int', 'cumprod', 'transformation', 5)
-        572±40μs          515±5μs     0.90  groupby.GroupByMethods.time_dtype_as_group('uint', 'sem', 'transformation', 1)
-     1.35±0.03ms      1.21±0.04ms     0.90  groupby.GroupByMethods.time_dtype_as_field('int', 'cummax', 'transformation', 5)
-     1.09±0.06ms         981±20μs     0.90  groupby.GroupByMethods.time_dtype_as_group('int16', 'bfill', 'transformation', 5)
-         160±7μs          143±2μs     0.90  groupby.GroupByMethods.time_dtype_as_field('int', 'cumcount', 'direct', 1)
-      4.32±0.2ms      3.88±0.04ms     0.90  groupby.Categories.time_groupby_ordered_nosort
-        362±20μs          324±6μs     0.90  groupby.GroupByMethods.time_dtype_as_field('int16', 'nunique', 'direct', 5)
-         135±3μs          121±3μs     0.90  groupby.GroupByMethods.time_dtype_as_group('float', 'std', 'direct', 5)
-     1.46±0.05ms      1.31±0.04ms     0.90  groupby.GroupByMethods.time_dtype_as_field('int', 'count', 'transformation', 5)
-     1.11±0.02ms         994±20μs     0.90  groupby.GroupByMethods.time_dtype_as_group('int16', 'any', 'transformation', 5)
-      5.92±0.3ms      5.30±0.04ms     0.89  groupby.Apply.time_scalar_function_single_col(5)
-        486±30μs          434±4μs     0.89  groupby.GroupByMethods.time_dtype_as_group('uint', 'count', 'transformation', 1)
-        508±20μs         454±20μs     0.89  groupby.GroupByMethods.time_dtype_as_field('int16', 'mean', 'transformation', 1)
-        553±10μs          494±9μs     0.89  groupby.GroupByMethods.time_dtype_as_group('int', 'any', 'transformation', 1)
-     1.08±0.03ms         968±10μs     0.89  groupby.GroupByMethods.time_dtype_as_group('int', 'ffill', 'transformation', 5)
-     1.65±0.05ms      1.47±0.01ms     0.89  groupby.GroupByMethods.time_dtype_as_field('int16', 'sum', 'transformation', 5)
-     1.03±0.03ms         922±20μs     0.89  groupby.GroupByMethods.time_dtype_as_group('datetime', 'ffill', 'transformation', 5)
-     1.21±0.05ms      1.08±0.02ms     0.89  groupby.GroupByMethods.time_dtype_as_group('datetime', 'any', 'transformation', 5)
-        894±90μs         796±10μs     0.89  groupby.GroupByMethods.time_dtype_as_group('uint', 'shift', 'transformation', 5)
-        522±10μs         464±10μs     0.89  groupby.GroupByMethods.time_dtype_as_field('uint', 'last', 'transformation', 1)
-        695±20μs          618±7μs     0.89  groupby.GroupByMethods.time_dtype_as_group('datetime', 'rank', 'transformation', 1)
-     1.27±0.05ms      1.12±0.04ms     0.89  groupby.GroupByMethods.time_dtype_as_group('int16', 'diff', 'transformation', 5)
-     1.27±0.05ms      1.12±0.01ms     0.89  groupby.GroupByMethods.time_dtype_as_group('datetime', 'last', 'transformation', 5)
-     1.60±0.08ms      1.42±0.02ms     0.89  groupby.GroupByMethods.time_dtype_as_field('datetime', 'rank', 'transformation', 5)
-      9.64±0.5ms       8.54±0.1ms     0.89  groupby.ApplyNonUniqueUnsortedIndex.time_groupby_apply_non_unique_unsorted_index
-      3.47±0.6ms      3.08±0.09ms     0.89  groupby.Categories.time_groupby_extra_cat_nosort
-         314±7μs          278±4μs     0.88  groupby.GroupByMethods.time_dtype_as_group('uint', 'diff', 'direct', 5)
-        673±30μs         595±20μs     0.88  groupby.GroupByMethods.time_dtype_as_group('object', 'nunique', 'transformation', 1)
-     1.38±0.04ms      1.22±0.03ms     0.88  groupby.GroupByMethods.time_dtype_as_field('object', 'value_counts', 'direct', 1)
-        568±20μs          501±8μs     0.88  groupby.GroupByMethods.time_dtype_as_field('datetime', 'any', 'transformation', 1)
-       581±100μs          513±4μs     0.88  groupby.GroupByMethods.time_dtype_as_group('uint', 'min', 'transformation', 1)
-     1.21±0.06ms      1.07±0.02ms     0.88  groupby.GroupByMethods.time_dtype_as_group('uint', 'diff', 'transformation', 5)
-     1.06±0.06ms         935±20μs     0.88  groupby.GroupByMethods.time_dtype_as_group('int', 'pct_change', 'transformation', 1)
-        784±40μs         689±30μs     0.88  groupby.GroupByMethods.time_dtype_as_group('datetime', 'quantile', 'transformation', 1)
-     1.36±0.04ms      1.19±0.05ms     0.88  groupby.GroupByMethods.time_dtype_as_group('float', 'max', 'transformation', 5)
-        930±20μs         814±10μs     0.88  groupby.GroupByMethods.time_dtype_as_group('int', 'cumsum', 'transformation', 5)
-        564±50μs         493±10μs     0.87  groupby.GroupByMethods.time_dtype_as_group('int16', 'var', 'transformation', 1)
-        371±20μs          324±4μs     0.87  groupby.GroupByMethods.time_dtype_as_field('int16', 'cummin', 'transformation', 1)
-     1.14±0.03ms          997±9μs     0.87  groupby.GroupByMethods.time_dtype_as_group('uint', 'ffill', 'transformation', 5)
-        553±10μs          483±4μs     0.87  groupby.GroupByMethods.time_dtype_as_group('float', 'cummax', 'transformation', 1)
-        469±40μs          409±8μs     0.87  groupby.GroupByMethods.time_dtype_as_field('int', 'diff', 'transformation', 1)
-        537±20μs         467±20μs     0.87  groupby.GroupByMethods.time_dtype_as_field('uint', 'prod', 'transformation', 1)
-        531±20μs          463±6μs     0.87  groupby.GroupByMethods.time_dtype_as_field('int16', 'max', 'transformation', 1)
-        984±50μs         855±30μs     0.87  groupby.Categories.time_groupby_extra_cat_sort
-        29.4±2ms       25.5±0.6ms     0.87  groupby.GroupByMethods.time_dtype_as_group('int16', 'describe', 'direct', 1)
-     1.18±0.05ms      1.02±0.02ms     0.87  groupby.GroupByMethods.time_dtype_as_group('uint', 'all', 'transformation', 5)
-        687±30μs          596±7μs     0.87  groupby.GroupByMethods.time_dtype_as_group('int', 'rank', 'transformation', 1)
-     1.16±0.04ms      1.00±0.01ms     0.87  groupby.GroupByMethods.time_dtype_as_group('uint', 'bfill', 'transformation', 5)
-        347±20μs          300±4μs     0.87  groupby.GroupByMethods.time_dtype_as_field('int16', 'shift', 'transformation', 1)
-        519±40μs          449±6μs     0.86  groupby.GroupByMethods.time_dtype_as_group('datetime', 'diff', 'transformation', 1)
-      1.69±0.1ms      1.46±0.02ms     0.86  groupby.GroupByMethods.time_dtype_as_field('int16', 'prod', 'transformation', 5)
-     1.21±0.03ms      1.04±0.01ms     0.86  groupby.GroupByMethods.time_dtype_as_group('int16', 'std', 'transformation', 5)
-        738±40μs         635±20μs     0.86  groupby.GroupByMethods.time_dtype_as_group('float', 'min', 'transformation', 1)
-        965±50μs         829±20μs     0.86  groupby.GroupByMethods.time_dtype_as_group('datetime', 'cummin', 'transformation', 5)
-        397±10μs          340±7μs     0.86  groupby.GroupByMethods.time_dtype_as_field('uint', 'skew', 'transformation', 1)
-     1.25±0.04ms      1.08±0.02ms     0.86  groupby.GroupByMethods.time_dtype_as_group('float', 'ffill', 'transformation', 5)
-        630±50μs          540±5μs     0.86  groupby.GroupByMethods.time_dtype_as_group('datetime', 'max', 'transformation', 1)
-      4.01±0.7ms      3.44±0.05ms     0.86  groupby.Transform.time_transform_ufunc_max
-        971±40μs         832±20μs     0.86  groupby.GroupByMethods.time_dtype_as_group('int16', 'cummin', 'transformation', 5)
-        971±40μs         831±20μs     0.86  groupby.GroupByMethods.time_dtype_as_group('int16', 'cumprod', 'transformation', 5)
-      1.62±0.1ms      1.39±0.04ms     0.86  groupby.GroupByMethods.time_dtype_as_field('uint', 'all', 'transformation', 5)
-        577±50μs          493±7μs     0.85  groupby.GroupByMethods.time_dtype_as_group('int16', 'std', 'transformation', 1)
-     1.41±0.04ms      1.20±0.01ms     0.85  groupby.GroupByMethods.time_dtype_as_field('int16', 'cummin', 'transformation', 5)
-        534±60μs          455±6μs     0.85  groupby.GroupByMethods.time_dtype_as_group('int16', 'cumcount', 'transformation', 1)
-        773±20μs          658±5μs     0.85  groupby.GroupByMethods.time_dtype_as_group('float', 'first', 'transformation', 1)
-     1.25±0.05ms      1.06±0.02ms     0.85  groupby.GroupByMethods.time_dtype_as_group('int16', 'sem', 'transformation', 5)
-        607±40μs          515±7μs     0.85  groupby.GroupByMethods.time_dtype_as_group('uint', 'sum', 'transformation', 1)
-        545±40μs          462±7μs     0.85  groupby.GroupByMethods.time_dtype_as_group('object', 'first', 'transformation', 1)
-     1.43±0.06ms      1.21±0.01ms     0.85  groupby.GroupByMethods.time_dtype_as_group('float', 'std', 'transformation', 5)
-        325±10μs          274±5μs     0.84  groupby.GroupByMethods.time_dtype_as_group('int16', 'diff', 'direct', 5)
-     1.41±0.07ms      1.18±0.02ms     0.84  groupby.GroupByMethods.time_dtype_as_group('float', 'last', 'transformation', 5)
-      34.3±0.9ms         28.7±1ms     0.84  groupby.GroupByMethods.time_dtype_as_group('uint', 'describe', 'direct', 1)
-     1.20±0.06ms      1.01±0.01ms     0.84  groupby.GroupByMethods.time_dtype_as_group('int16', 'count', 'transformation', 5)
-     1.07±0.04ms         894±30μs     0.84  groupby.Categories.time_groupby_ordered_sort
-        569±60μs          475±4μs     0.83  groupby.GroupByMethods.time_dtype_as_field('uint', 'sum', 'transformation', 1)
-        427±30μs         355±20μs     0.83  groupby.GroupByMethods.time_dtype_as_group('int16', 'skew', 'transformation', 1)
-        902±50μs         749±10μs     0.83  groupby.FillNA.time_df_bfill
-        637±40μs         529±20μs     0.83  groupby.GroupByMethods.time_dtype_as_group('int16', 'first', 'transformation', 1)
-     1.08±0.07ms         893±10μs     0.83  groupby.Categories.time_groupby_sort
-        968±10μs         801±10μs     0.83  groupby.GroupByMethods.time_dtype_as_group('int16', 'cummax', 'transformation', 5)
-        464±40μs          383±4μs     0.82  groupby.GroupByMethods.time_dtype_as_group('datetime', 'cummin', 'transformation', 1)
-        441±70μs          362±6μs     0.82  groupby.GroupByMethods.time_dtype_as_group('int16', 'cumsum', 'transformation', 1)
-      3.08±0.6ms      2.53±0.02ms     0.82  groupby.CountMultiDtype.time_multi_count
-        973±40μs         794±20μs     0.82  groupby.FillNA.time_df_ffill
-        494±40μs          403±6μs     0.82  groupby.GroupByMethods.time_dtype_as_group('uint', 'ffill', 'transformation', 1)
-      1.50±0.2ms      1.22±0.01ms     0.82  groupby.GroupByMethods.time_dtype_as_group('float', 'sem', 'transformation', 5)
-        453±30μs          369±6μs     0.82  groupby.GroupByMethods.time_dtype_as_group('int16', 'cummin', 'transformation', 1)
-     1.28±0.06ms      1.04±0.01ms     0.81  groupby.GroupByMethods.time_dtype_as_group('uint', 'min', 'transformation', 5)
-      1.04±0.2ms         842±20μs     0.81  groupby.GroupByMethods.time_dtype_as_group('int16', 'cumsum', 'transformation', 5)
-      1.28±0.3ms      1.04±0.01ms     0.81  groupby.GroupByMethods.time_dtype_as_group('uint', 'var', 'transformation', 5)
-       847±100μs          678±7μs     0.80  groupby.GroupByMethods.time_dtype_as_group('uint', 'quantile', 'transformation', 1)
-        555±30μs         443±10μs     0.80  groupby.GroupByMethods.time_dtype_as_group('int16', 'count', 'transformation', 1)
-       611±100μs         488±10μs     0.80  groupby.GroupByMethods.time_dtype_as_group('uint', 'var', 'transformation', 1)
-      2.55±0.6ms      2.03±0.05ms     0.80  groupby.GroupByMethods.time_dtype_as_group('uint', 'quantile', 'transformation', 5)
-      1.11±0.4ms         858±10μs     0.78  groupby.GroupByMethods.time_dtype_as_group('uint', 'skew', 'transformation', 5)
-      1.88±0.5ms      1.43±0.02ms     0.76  groupby.GroupByMethods.time_dtype_as_group('uint', 'value_counts', 'direct', 1)
-       709±200μs         536±10μs     0.76  groupby.GroupByMethods.time_dtype_as_field('datetime', 'min', 'transformation', 1)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@luke396 luke396 requested a review from rhshadrach April 3, 2023 06:21
@rhshadrach
Member

@luke396 - I don't understand how the changes here could increase performance. When I run ASVs on this branch in groupby, I get:

       before           after         ratio
     [ebe484a6]       [2cba11a3]
     <clean_gb_tests_1~2>       <bug-agg-nonunique-col>
+      25.8±0.5ms       28.8±0.5ms     1.12  groupby.Groups.time_series_indices('int64_large')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

Can you confirm?

@luke396
Contributor Author

luke396 commented Apr 10, 2023

@rhshadrach - It is possible that the differing ASV results are due to the use of different commit versions. I apologize for any confusion caused by my imprecise language, as I am not very familiar with this topic.

I compared two commits of this branch (ade2efe and 2cba11a) against the main branch (3c44745). The results differed considerably: ade2efe showed minimal differences from main, while 2cba11a showed a significant decrease (even larger than the differences shown below). However, I do not believe that this performance change can be attributed solely to the code changes made in this PR. It could be due to changes made in the main branch's code.

In any case, comparing the main branch (3c44745) to the latest commit of this branch (5daa349) gives the differences below.

       before           after         ratio
     [3c447454]       [5daa349d]
     <main>           <bug-agg-nonunique-col>
+         327±4μs         447±40μs     1.37  groupby.GroupByMethods.time_dtype_as_group('int', 'shift', 'transformation', 1)
+        470±10μs        628±100μs     1.34  groupby.GroupByMethods.time_dtype_as_group('int', 'prod', 'transformation', 1)
+        980±20μs      1.18±0.02ms     1.20  groupby.GroupByMethods.time_dtype_as_group('int', 'prod', 'transformation', 5)
+        984±30μs      1.14±0.05ms     1.15  groupby.GroupByMethods.time_dtype_as_group('datetime', 'last', 'transformation', 5)
+         555±8μs         639±60μs     1.15  groupby.GroupByMethods.time_dtype_as_group('float', 'std', 'transformation', 1)
+     1.08±0.02ms       1.23±0.1ms     1.14  groupby.GroupByMethods.time_dtype_as_field('int', 'cummin', 'transformation', 5)
+     1.07±0.02ms       1.21±0.1ms     1.13  groupby.GroupByMethods.time_dtype_as_group('float', 'std', 'transformation', 5)
+      93.9±0.5μs          104±2μs     1.11  groupby.GroupByMethods.time_dtype_as_group('int', 'size', 'direct', 1)
+        438±10μs         484±20μs     1.10  groupby.GroupByMethods.time_dtype_as_group('uint', 'any', 'transformation', 1)
-        499±20μs          453±5μs     0.91  groupby.GroupByMethods.time_dtype_as_group('uint', 'first', 'transformation', 1)
-        527±20μs         468±10μs     0.89  groupby.GroupByMethods.time_dtype_as_group('int', 'max', 'transformation', 1)
-        341±50μs          302±3μs     0.89  groupby.GroupByMethods.time_dtype_as_group('datetime', 'ffill', 'transformation', 1)
-        852±30μs         755±20μs     0.89  groupby.GroupByMethods.time_dtype_as_group('datetime', 'cummin', 'transformation', 5)
-     1.12±0.03ms         976±30μs     0.87  groupby.GroupByMethods.time_dtype_as_group('uint', 'diff', 'transformation', 5)
-     1.44±0.06ms      1.25±0.07ms     0.87  groupby.GroupByMethods.time_dtype_as_field('int', 'any', 'transformation', 5)
-      1.04±0.2ms         876±10μs     0.84  groupby.RankWithTies.time_rank_ties('float64', 'min')
-      1.06±0.1ms         880±20μs     0.83  groupby.RankWithTies.time_rank_ties('int64', 'max')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@luke396
Contributor Author

luke396 commented Apr 10, 2023

The main changes I've made are replacing the dictionary of results with two lists, result_data and result_index, and using list comprehensions instead of some for-loops. These changes resulted in a performance increase, at least when I tested last week.

@rhshadrach
Member

rhshadrach commented Apr 11, 2023

> These changes resulted in a performance increase, at least indeed in last week when I tested.

I don't believe any of the results from the ASVs posted here hit the code that is changing in this PR. Correct me if you think this is wrong.

In order to get accurate results, you should not have any other processes running in the foreground (e.g. a web browser) when running the ASVs.
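One rough way to see how noise shows up in the tables posted in this thread is to check whether the before and after timings overlap once their ± spreads are taken into account. The sketch below is a crude illustration only: `intervals_overlap` is a hypothetical helper, not asv's significance test, and even non-overlapping intervals can come from machine state rather than from the code change.

```python
def intervals_overlap(before, before_err, after, after_err):
    """True if [before ± before_err] and [after ± after_err] share any point."""
    return (before - before_err) <= (after + after_err) and (
        after - after_err
    ) <= (before + before_err)


# 709±200μs vs 536±10μs (a row reported with ratio 0.76 earlier in the thread)
print(intervals_overlap(709, 200, 536, 10))  # True: the spread swallows the gap

# 327±4μs vs 447±40μs (a row reported with ratio 1.37 earlier in the thread)
print(intervals_overlap(327, 4, 447, 40))    # False
```

A large ± on either side, as in the first row, is a hint that something else was competing for the machine during the run.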

Member

@rhshadrach rhshadrach left a comment

Looks good - just some minor requests.

@luke396
Contributor Author

luke396 commented Apr 11, 2023

> In order to get accurate results, you should not have any other processes running in the foreground (e.g. a web browser) when running the ASVs.

I think you're right! I didn't notice that other processes running might be influencing the results. I definitely need to learn more about ASVs in depth.

Do you think we should run a new ASV analysis instead of relying on the previous one I posted? It's possible that the previous analysis (the last one) wasn't accurate either.

@rhshadrach
Member

@luke396 - I'm satisfied with the ASVs I posted; they were run without any other foreground processes.

Member

@rhshadrach rhshadrach left a comment

lgtm

@luke396
Contributor Author

luke396 commented Apr 11, 2023

> @luke396 - I'm satisfied with the ASVs I posted; they were run without any other foreground processes.

Fine! 😄

@mroeschke mroeschke added this to the 2.1 milestone Apr 11, 2023
@mroeschke mroeschke merged commit 8111099 into pandas-dev:main Apr 11, 2023
@mroeschke
Member

Thanks @luke396

@luke396 luke396 deleted the bug-agg-nonunique-col branch April 12, 2023 00:27
Labels
Apply Apply, Aggregate, Transform, Map
Development

Successfully merging this pull request may close these issues.

BUG: apply/agg with dictlike and non-unique columns
3 participants