-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
index is included in memory usage by default #11867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -544,6 +544,8 @@ def memory_usage(self, deep=False): | |||
v += lib.memory_usage_of_objects(self.values) | |||
return v | |||
|
|||
__sizeof__ = com._sizeof |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can put this in PandasObject
I think
@@ -7693,6 +7693,10 @@ def test_info_memory_usage(self): | |||
DataFrame(1,index=pd.MultiIndex.from_product([['a'],range(1000)]),columns=['A']).index.nbytes | |||
DataFrame(1,index=pd.MultiIndex.from_product([['a'],range(1000)]),columns=['A']).index.values.nbytes | |||
|
|||
# sys.getsizeof works the same as memory_usage with defaults, albeit with some | |||
# GC overhead | |||
self.assertAlmostEqual(sys.getsizeof(df), df.memory_usage().sum(), delta=100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test this on series/Index/Categorical
as well. (for Index
put a test in test_index on the Base
which will run for all indexes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added onto the existing tests in IndexOps
- lmk if that's not OK
c462b19
to
c99cda7
Compare
self.assertEqual(o.memory_usage(index=False) + o.index.memory_usage(), | ||
o.memory_usage(index=True)) | ||
|
||
# sys.getsizeof will call the .memory_usage defaults, and add on some GC overhead | ||
self.assertAlmostEqual(res, sys.getsizeof(o), delta=100) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't use this
use tm.assert_almost_equal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we disabled the test routines that one shouldn't use - in tm.TestCase
but maybe that's an open issue (or if not can you create one)
we are dropping 2.6 soon - but still like for there to be 1 way to do testing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - although the functionality is materially worse than the nosetests one (actually kinda peculiar with the bool argument for 3 vs 5):
check_less_precise : bool, default False
Specify comparison precision.
5 digits (False) or 3 digits (True) after decimal points are compared.
51de6c4
to
36d75d8
Compare
@jreback |
@@ -108,6 +108,9 @@ Other enhancements | |||
- A simple version of ``Panel.round()`` is now implemented (:issue:`11763`) | |||
- For Python 3.x, ``round(DataFrame)``, ``round(Series)``, ``round(Panel)`` will work (:issue:`11763`) | |||
- ``Dataframe`` has gained a ``_repr_latex_`` method in order to allow for automatic conversion to latex in a ipython/jupyter notebook using nbconvert. Options ``display.latex.escape`` and ``display.latex.longtable`` have been added to the configuration and are used automatically by the ``to_latex`` method.(:issue:`11778`) | |||
- sys.getsizeof(obj) returns the memory usage of a pandas object including the (non-object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually I think this should call with deep=True
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have enough context to know whether deep
should be True or False by default, but I can see the logic behind the default for .memory_usage
should being the same as that for sys.getsizeof
, given the similar use cases.
Or why are the use cases different? Because a user calling from the system wants accuracy vs. speed but a user calling from pandas wants speed vs. accuracy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using __sizeof__
should have deep=True
which gives the 'best' report of memory used. This is not the default for .memory_usage()
because its expensive to compute (potentially). see #11595 where it can be somewhat slow (though the cython impl helps).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. if someone is calling sys.getsizeof(df)
then I think its appropriate to give the most accurate (if maybe somewhat non-performant) answer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, if it's a material performance drag and a different enough set of use cases, sobeit
3c02600
to
6045ae7
Compare
@jreback updated & green |
# we could check the kwarg with inspect module, but different methods for Py versions | ||
try: | ||
mem = self.memory_usage(deep=True) | ||
except TypeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nothing should raise a TypeError
, what does? rather than catching this error, need to fix the underlying.
@jreback updated & green |
@MaximilianR lgtm. can you run: ping when green. |
Done (phew...) |
import pandas.util.testing as tm | ||
from pandas import Categorical, Index, Series, DataFrame, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, though we usually do this with parens
lgtm. ping when green. |
|
||
# sys.getsizeof will call the .memory_usage with | ||
# deep=True, and add on some GC overhead | ||
diff = df.memory_usage().sum() - sys.getsizeof(df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how did this pass before?
this will never be true. as deep=True
is >> non-deep with object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that df
isn't an object dtype, I don't think this was affected the results - I will fix regardless
merged via 60cacab thanks! |
...and sys.getsizeof returns correct value. Closes #11597
Is this the best implementation for a common method across classes?