None != None (NA != NA)

<h2>Alex_Kotlar said:</h2>
<p>Could you help me understand when (and why) NA != NA in Hail? This came up in a discussion with a user, where a table join was conducted, and yielded no NA matches from the joined-on column.</p>
<p>This may be to follow numpy/pandas behavior, but seems pretty confusing, as NaN != NaN is IEEE standard, while None != None is a numpy/pandas change from Python’s handling. Since Python is the closest user-facing interface, this seems magical.</p>
<p>May relate to: <a href="https://github.com/pandas-dev/pandas/issues/20442" rel="nofollow noopener">https://github.com/pandas-dev/pandas/issues/20442</a></p>
<p>In particular:</p>


```python
In [1]: import pandas as pd
In [2]: A = pd.Series(['a', None, 2])
In [3]: B = pd.Series(['a', None, 2])

In [4]: all(A == B)
Out[4]: False
In [5]: all(lambda a, b: a == b for a, b in zip(A, B))
Out[5]: True

In [5]: all(A.values == B.values)
Out[5]: True
In [6]: all(lambda a, b: a == b for a, b in zip(A.values, B.values))
Out[6]: True

```



<h2>tpoterba said:</h2>
<p>Oops, sorry to have missed this!</p>
<blockquote>
<p>Could you help me understand when (and why) NA != NA in Hail</p>
</blockquote>
<p>This is a really hard point to get right. I’ll do my best to list missingness treatment in various places.</p>
<ol>
<li>In Hail expressions, <code>NA == NA => NA</code> and <code>NaN == NaN => False</code>. Top-level missingness always bubbles up, with only a few operations able to circumvent that rule: <code>hl.is_defined</code>, <code>hl.is_missing</code>, as well as short-circuiting in <code>|</code> and <code>&</code>. Note that in the current system, comparisons of nested objects will treat nested missingness as equal:</li>
</ol>


```python
In [7]: hl.eval(hl.null('int') == hl.null('int'))
<None>

```




```python
In [8]: hl.eval(hl.float('nan') == hl.float('nan'))
Out[8]: False

```


<p>You could easily argue that this is a bug:</p>


```python
In [11]: hl.eval(hl.array([hl.null('int')]) == hl.array([hl.null('int')]))
Out[11]: True

```


<ol start="2">
<li>In joins, <code>NA == NA => no match</code> and <code>NaN == NaN => no match</code>.</li>
</ol>


```python
In [1]: ht = hl.utils.range_table(2).key_by(x = hl.null('int'), y = hl.float('nan'))

In [3]: ht.join(ht, 'outer').show()
+-------+---------+-------+-------+
|     x |       y |   idx | idx_1 |
+-------+---------+-------+-------+
| int32 | float64 | int32 | int32 |
+-------+---------+-------+-------+
|    NA |     NaN |     0 |    NA |
|    NA |     NaN |     1 |    NA |
|    NA |     NaN |    NA |     0 |
|    NA |     NaN |    NA |     1 |
+-------+---------+-------+-------+

```


<ol start="3">
<li>In some other operations that use table keys, <code>NA == NA => True</code> and <code>NaN == NaN => True</code>:</li>
</ol>


```python
In [19]: ht = hl.utils.range_table(2).key_by(x = hl.null('int'), y = hl.float('nan'))

In [20]: ht.show()
+-------+-------+---------+
|   idx |     x |       y |
+-------+-------+---------+
| int32 | int32 | float64 |
+-------+-------+---------+
|     0 |    NA |     NaN |
|     1 |    NA |     NaN |
+-------+-------+---------+


In [21]: ht.distinct().show()
+-------+-------+---------+
|   idx |     x |       y |
+-------+-------+---------+
| int32 | int32 | float64 |
+-------+-------+---------+
|     0 |    NA |     NaN |
+-------+-------+---------+

```



<h2>Alex_Kotlar said:</h2>
<p>NA == NA and NaN != NaN both sound perfectly correct, and evaluating objects of the same shape / value as equal also seems right.</p>
<p>The user’s query joined on a field that was occasionally missing. In those cases, the join, which was a left join, failed to return the right operand’s row in which the joined-on key was NA in both the left and right operands.</p>
<p>So, x.key = NA y.key = NA, join failed to return y. Is that expected behavior when x.key and y.key are int (or in any other case)? I assumed this was due to numpy-like silent conversion to NaN.</p>

<h2>tpoterba said:</h2>
<p>we definitely don’t convert to NaN – I think we explicitly filter out missing keys in left/right joins.</p>
<p>You’ve definitely turned up inconsistencies, though. We should get that fixed.</p>

<h2>cseed said:</h2>
<blockquote>
<p>You could easily argue that this [lists of NAs comparing equal] is a bug</p>
</blockquote>
<p>I improved some comparisons (<, etc.) recently, but related to NaNs, not missingness.  We now have:</p>


```python
>>> a = hl.array([hl.float('nan')])
>>> hl.eval(a == a)
False

```


<p>I agree, although (1) this will be another behavior change, and (2) this will require making ExtendedOrdering return java.lang.Boolean instead of Boolean (and the analogous change for the C++ and CodeOrdering comparisons).</p>
<blockquote>
<p>NA == NA sound[s] perfectly correct</p>
</blockquote>
<p>This is not the path we’ve taken so far.  We think of NA not as a value, but as a value we don’t know.  So comparing two unknown values shouldn’t be True, it should also be unknown.  While this diverges from the Python’s behavior with respect to None (and one might argue our NA shouldn’t be converted into Python None), it does match SQL and R, for example:</p>


```python
$ R
> NA == NA
[1] NA

```


<p>and the join behavior you’re seeing is the standard SQL join behavior.</p>

<h2>cseed said:</h2>
<p>One more comment:</p>
<blockquote>
<p>In some other operations that use table keys, <code>NA == NA => True</code> and <code>NaN == NaN => True</code></p>
</blockquote>
<p>How did this come about? There are two sets of comparisons: the normal <, <=, etc. that are not a total comparison (e.g. NaN compares false with everything) and a total comparison on every type (implemented via <code>compare: Int</code>).  To reliably order tables, we need to use the total comparison.  That then got used for various operations like distinct.</p>
<p>I’d think this is a bug, but surprisingly, SQL SELECT DISTINCT matches our current behavior (NAs are equal).</p>



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

None != None (NA != NA) #2215

Alex_Kotlar said:

tpoterba said:

Alex_Kotlar said:

tpoterba said:

cseed said:

cseed said:

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

None != None (NA != NA) #2215

Description

Alex_Kotlar said:

tpoterba said:

Alex_Kotlar said:

tpoterba said:

cseed said:

cseed said:

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions