Description
Alex_Kotlar said:
Could you help me understand when (and why) NA != NA in Hail? This came up in a discussion with a user, where a table join was conducted, and yielded no NA matches from the joined-on column.
This may be to follow numpy/pandas behavior, but seems pretty confusing, as NaN != NaN is IEEE standard, while None != None is a numpy/pandas change from Python’s handling. Since Python is the closest user-facing interface, this seems magical.
May relate to: pandas-dev/pandas#20442
In particular:
In [1]: import pandas as pd
In [2]: A = pd.Series(['a', None, 2])
In [3]: B = pd.Series(['a', None, 2])
In [4]: all(A == B)
Out[4]: False
In [5]: all(lambda a, b: a == b for a, b in zip(A, B))
Out[5]: True
In [5]: all(A.values == B.values)
Out[5]: True
In [6]: all(lambda a, b: a == b for a, b in zip(A.values, B.values))
Out[6]: True
tpoterba said:
Oops, sorry to have missed this!
Could you help me understand when (and why) NA != NA in Hail
This is a really hard point to get right. I’ll do my best to list missingness treatment in various places.
- In Hail expressions,
NA == NA => NA
andNaN == NaN => False
. Top-level missingness always bubbles up, with only a few operations able to circumvent that rule:hl.is_defined
,hl.is_missing
, as well as short-circuiting in|
and&
. Note that in the current system, comparisons of nested objects will treat nested missingness as equal:
In [7]: hl.eval(hl.null('int') == hl.null('int'))
<None>
In [8]: hl.eval(hl.float('nan') == hl.float('nan'))
Out[8]: False
You could easily argue that this is a bug:
In [11]: hl.eval(hl.array([hl.null('int')]) == hl.array([hl.null('int')]))
Out[11]: True
- In joins,
NA == NA => no match
andNaN == NaN => no match
.
In [1]: ht = hl.utils.range_table(2).key_by(x = hl.null('int'), y = hl.float('nan'))
In [3]: ht.join(ht, 'outer').show()
+-------+---------+-------+-------+
| x | y | idx | idx_1 |
+-------+---------+-------+-------+
| int32 | float64 | int32 | int32 |
+-------+---------+-------+-------+
| NA | NaN | 0 | NA |
| NA | NaN | 1 | NA |
| NA | NaN | NA | 0 |
| NA | NaN | NA | 1 |
+-------+---------+-------+-------+
- In some other operations that use table keys,
NA == NA => True
andNaN == NaN => True
:
In [19]: ht = hl.utils.range_table(2).key_by(x = hl.null('int'), y = hl.float('nan'))
In [20]: ht.show()
+-------+-------+---------+
| idx | x | y |
+-------+-------+---------+
| int32 | int32 | float64 |
+-------+-------+---------+
| 0 | NA | NaN |
| 1 | NA | NaN |
+-------+-------+---------+
In [21]: ht.distinct().show()
+-------+-------+---------+
| idx | x | y |
+-------+-------+---------+
| int32 | int32 | float64 |
+-------+-------+---------+
| 0 | NA | NaN |
+-------+-------+---------+
Alex_Kotlar said:
NA == NA and NaN != NaN both sound perfectly correct, and evaluating objects of the same shape / value as equal also seems right.
The user’s query joined on a field that was occasionally missing. In those cases, the join, which was a left join, failed to return the right operand’s row in which the joined-on key was NA in both the left and right operands.
So, x.key = NA y.key = NA, join failed to return y. Is that expected behavior when x.key and y.key are int (or in any other case)? I assumed this was due to numpy-like silent conversion to NaN.
tpoterba said:
we definitely don’t convert to NaN – I think we explicitly filter out missing keys in left/right joins.
You’ve definitely turned up inconsistencies, though. We should get that fixed.
cseed said:
You could easily argue that this [lists of NAs comparing equal] is a bug
I improved some comparisons (<, etc.) recently, but related to NaNs, not missingness. We now have:
>>> a = hl.array([hl.float('nan')])
>>> hl.eval(a == a)
False
I agree, although (1) this will be another behavior change, and (2) this will require making ExtendedOrdering return java.lang.Boolean instead of Boolean (and the analogous change for the C++ and CodeOrdering comparisons).
NA == NA sound[s] perfectly correct
This is not the path we’ve taken so far. We think of NA not as a value, but as a value we don’t know. So comparing two unknown values shouldn’t be True, it should also be unknown. While this diverges from the Python’s behavior with respect to None (and one might argue our NA shouldn’t be converted into Python None), it does match SQL and R, for example:
$ R
> NA == NA
[1] NA
and the join behavior you’re seeing is the standard SQL join behavior.
cseed said:
One more comment:
In some other operations that use table keys,
NA == NA => True
andNaN == NaN => True
How did this come about? There are two sets of comparisons: the normal <, <=, etc. that are not a total comparison (e.g. NaN compares false with everything) and a total comparison on every type (implemented via compare: Int
). To reliably order tables, we need to use the total comparison. That then got used for various operations like distinct.
I’d think this is a bug, but surprisingly, SQL SELECT DISTINCT matches our current behavior (NAs are equal).