
Add more tests for sizeAdd node. #4258


Merged (14 commits into master on Jan 11, 2023)

Conversation

vanbasten23 (Collaborator):

  • Test SizeAdd::Lower function.
  • Fix an issue with another existing test "test_simple_expand".

# Exercise SizeAdd::Lower.
t4 = t3.expand(dyn_size)
self.assertEqual(t4.size(0), 3)
print(torch_xla._XLAC._get_xla_tensors_text([t4]))
vanbasten23 (Collaborator Author):

I wasn't able to test the SizeAdd::ToString() method. Let me look into it further.

Collaborator:

What's the output of torch_xla._XLAC._get_xla_tensors_text([t4]) here? We want to do a self.assertIn on the IR here.

vanbasten23 (Collaborator Author):

The output of torch_xla._XLAC._get_xla_tensors_text([t4]) here is

IR {
  %0 = s32[] prim::Constant(), value=2
  %1 = f32[] prim::Constant(), value=1
  %2 = f32[1]{0} aten::view(%1), output_size=(1)
  %3 = f32[] prim::Constant(), value=0
  %4 = f32[5,2]{1,0} aten::expand(%3), size=(5, 2)
  %5 = f32[1,2]{1,0} xla::generic_slice(%4), base_indices=(3, 0), sizes=(1, 2)
  %6 = f32[2]{0} aten::view(%5), output_size=(2)
  %7 = f32[2]{0} xla::update_slice(%6, %2), base_indices=(0)
  %8 = f32[1,2]{1,0} aten::view(%7), output_size=(1, 2)
  %9 = f32[5,2]{1,0} xla::update_slice(%4, %8), base_indices=(3, 0)
  %10 = (s32[<=10,2]{1,0}, s32[]) aten::nonzero(%9), num_outputs=2
  %11 = s32[] aten::size(%10.0)
  %12 = s64[] aten::add(%11, %0)
  %13 = f32[] prim::Constant(), value=1
  %14 = f32[1]{0} aten::expand(%13), size=(1)
  %15 = f32[<=12]{0} aten::expand(%14, %12), size=(12), dynamic_dims=(1), ROOT=0
}

It seems hard to test this in the Python code. How about I test it in

TEST_F(IrTest, TestSizeAddNode) {

Collaborator:

You can just assert that f32[<=12]{0} aten::expand is in the output of torch_xla._XLAC._get_xla_tensors_text([t4]).
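
For reference, a minimal sketch of that assertion, assuming a unittest.TestCase method in the style of the test_sizeAdd_shortened test quoted later in this thread (torch, torch_xla, and dev come from test_dynamic_shapes.py; the method name is illustrative, and the asserted substring is taken from the IR dump above):

  def test_sizeAdd_ir_contains_expand(self):
    t1 = torch.zeros([5, 2], device=dev)
    t1[3][0] = 1
    t2 = torch.nonzero(t1)  # t2 has dynamic size [<=10, 2]
    dyn_size = t2.shape[0] + t2.shape[1]
    t3 = torch.ones(1, device=dev)
    t4 = t3.expand(dyn_size)  # exercises SizeAdd::Lower
    ir_text = torch_xla._XLAC._get_xla_tensors_text([t4])
    # The expand over the dynamic size shows up in the IR dump above as
    # "f32[<=12]{0} aten::expand", so assert on that substring.
    self.assertIn('f32[<=12]{0} aten::expand', ir_text)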

vanbasten23 (Collaborator Author):

What I am trying to test is SizeAdd::ToString() rather than expand. I set SizeAdd::ToString() to return "SizeAdd for op aten::add", but as you can see the string does not appear in torch_xla._XLAC._get_xla_tensors_text([t4]).

So I don't think torch_xla._XLAC._get_xla_tensors_text([t4]) will exercise SizeAdd::ToString(). That's why I suggest testing it in test/cpp/test_ir.cpp in this PR.

@vanbasten23 marked this pull request as ready for review on December 1, 2022 01:20
@vanbasten23 (Collaborator Author):

OK, now the test that I added, TestDynamicShapes.test_sizeAdd, passes on TPU but fails on CPU. I was able to repro the failure on my local cloudtop machine, and the failure is:

RuntimeError: Error while lowering: SizeAdd for op aten::add
XLA builder error: INVALID_ARGUMENT: Binary op add with different element types: s32[] and s64[].: 

The HLO and the IR graph don't reveal the problem either, and the full error message doesn't show any more useful info.

Via gdb, I was able to break at

auto input2 = loctx->GetOutputOp(operand(1));
and find out that input1 has type s32 and input2 has type s64, hence the type mismatch. However, I couldn't find where input1 and input2 come from, because p input1 doesn't reveal any useful info.

So my questions are:

  1. Why didn't the IR graph or the HLO graph show the type mismatch?
  2. What do you suggest for debugging further?

@JackCaoG

@JackCaoG (Collaborator) left a comment:

Let's dig a bit into the following error:

======================================================================
ERROR: test_sizeAdd (__main__.TestDynamicShapes)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/pytorch/xla/test/test_dynamic_shapes.py", line 48, in test_sizeAdd
    self.assertEqual(t4.size(0), 3)
  File "/opt/conda/lib/python3.7/unittest/case.py", line 852, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/opt/conda/lib/python3.7/unittest/case.py", line 842, in _baseAssertEqual
    if not first == second:
  File "/opt/conda/lib/python3.7/site-packages/torch/__init__.py", line 212, in __bool__
    return self.node.bool_()
RuntimeError: Error while lowering: SizeAdd for op aten::add
XLA builder error: INVALID_ARGUMENT: Binary op add with different element types: s32[] and s64[].:

We need to identify where the s32 is coming from. Dumping the IR of t4 should help. Maybe we should make SizeAdd handle the casting.

@vanbasten23 (Collaborator Author):

Dumping the IR of t4 should help. Maybe we should make SizeAdd handle the casting.

I already did. Here is the IR of t4; specifically:

%0 = s32[] prim::Constant(), location=to_node@symbolic_shapes.py:135, value=2
%11 = s32[] aten::size(%10.0)
%12 = s64[] aten::add(%11, %0)

Both operands of aten::add are type s32. Via gdb, I was able to find #4258 (comment).

@miladm added the dynamism (Dynamic Shape Features) and testing (Testing and coverage related issues) labels on Dec 13, 2022
@vanbasten23 (Collaborator Author):

@JackCaoG I followed your suggestion to create a shorter repro, test_sizeAdd_shortened, and tried to examine the HLO dump in order to find where the s32/s64 mismatch comes from. But I couldn't get an HLO graph. I tried print(torch_xla._XLAC._get_xla_tensors_hlo([t4])) and $ debug_flag save_ir XLA_DUMP_HLO_GRAPH=1 python3 pytorch/xla/test/test_dynamic_shapes.py TestDynamicShapes.test_sizeAdd_shortened. Neither of them generates the HLO for me. Do you know if I'm missing anything?

@vanbasten23 requested a review from JackCaoG on December 20, 2022 06:01
@vanbasten23 mentioned this pull request on Dec 20, 2022
@JackCaoG (Collaborator):

I will try to repro sometime today.

@JackCaoG (Collaborator):

I am able to repro the error; I will post my debugging steps tonight.

@JackCaoG (Collaborator):

After verifying that expand is the source of the error, I dumped the IR of t4:

(Pdb) print(torch_xla._XLAC._get_xla_tensors_text([t4]))
IR {
  %0 = s64[] prim::Constant(), location=to_node@symbolic_shapes.py:139, value=2
  %1 = f32[] prim::Constant(), location=test_sizeAdd_shortened@test_dynamic_shapes.py:94, value=1
  %2 = f32[1]{0} aten::view(%1), location=test_sizeAdd_shortened@test_dynamic_shapes.py:96, output_size=(1)
  %3 = f32[] prim::Constant(), location=test_sizeAdd_shortened@test_dynamic_shapes.py:93, value=0
  %4 = f32[5,2]{1,0} aten::expand(%3), location=test_sizeAdd_shortened@test_dynamic_shapes.py:93, size=(5, 2)
  %5 = f32[1,2]{1,0} xla::generic_slice(%4), location=test_sizeAdd_shortened@test_dynamic_shapes.py:96, base_indices=(3, 0), sizes=(1, 2)
  %6 = f32[2]{0} aten::view(%5), location=test_sizeAdd_shortened@test_dynamic_shapes.py:96, output_size=(2)
  %7 = f32[2]{0} xla::update_slice(%6, %2), location=test_sizeAdd_shortened@test_dynamic_shapes.py:96, base_indices=(0)
  %8 = f32[1,2]{1,0} aten::view(%7), location=test_sizeAdd_shortened@test_dynamic_shapes.py:96, output_size=(1, 2)
  %9 = f32[5,2]{1,0} xla::update_slice(%4, %8), location=test_sizeAdd_shortened@test_dynamic_shapes.py:96, base_indices=(3, 0)
  %10 = (s64[<=10,2]{1,0}, s64[]) aten::nonzero(%9), num_outputs=2, location=test_sizeAdd_shortened@test_dynamic_shapes.py:96
  %11 = s64[] aten::size(%10.0)
  %12 = s64[] aten::add(%11, %0)
  %13 = f32[] prim::Constant(), location=test_sizeAdd_shortened@test_dynamic_shapes.py:104, value=1
  %14 = f32[1]{0} aten::expand(%13), location=test_sizeAdd_shortened@test_dynamic_shapes.py:104, size=(1)
  %15 = f32[<=12]{0} aten::expand(%14, %12), location=test_sizeAdd_shortened@test_dynamic_shapes.py:106, size=(12), dynamic_dims=(1), ROOT=0
}

Weirdly, both inputs seem to be s64, and this is the only explicit add in the IR.

(Pdb) t4
*** RuntimeError: Error while lowering: aten::add_size
XLA builder error: INVALID_ARGUMENT: Binary op add with different element types: s32[] and s64[].: 

One possibility is that the add that errors out here does not come from aten::add explicitly but is part of the lowering of some other op, for example expand. I am just going to use gdb to verify.

(pytorch) root@65d79ef78e4c:/pytorch/xla# dynamic cgdb python


(gdb) catch throw
Catchpoint 1 (throw)
(gdb) r test/test_dynamic_shapes.py TestDynamicShapes.test_sizeAdd_shortened

and I saw

 82│ LoweringContext::LoweringContext(
 83│     const std::string& name, torch::lazy::BackendDevice device,
 84│     c10::ArrayRef<const torch::lazy::Node*> post_order,
 85│     torch::lazy::Util::EmissionMap emit_status)
 86│     : torch::lazy::LoweringContext(name, device, {}, emit_status),
 87│       builder_(name) {
 88│   for (auto node : post_order) {
 89├>    LowerNode(node);
 90│   }
 91│ }

and if I do

(gdb) p node->ToString()
$1 = "aten::add_size"

So I was wrong; it is actually coming from aten::add_size. Now I need to go back to the lowering code of SizeAdd and figure out why the types match at the IR level but fail during the actual lowering.

@JackCaoG (Collaborator):

  %11 = s64[] aten::size(%10.0)

looks suspicious. If I have to guess, GetDimensionSize also returns s32, but we incorrectly flag it as s64.

@JackCaoG (Collaborator):

OK, confirmed. If I do the following (which avoids SizeAdd and uses SizeNode directly, which won't crash):

(Pdb) t4 = t3.expand(shape0)
(Pdb) print(torch_xla._XLAC._get_xla_tensors_hlo([t4]))

I am able to see

...
  %get-dimension-size.53 = s32[] get-dimension-size(s64[<=10,2]{1,0} %set-dimension-size.52), dimensions={0}, metadata={op_type="aten__size" op_name="aten__size" source_file="test_sizeAdd_shortened@test_dynamic_shapes.py" source_line=100}
...

which suggests the size is actually an s32 even on CPU, but we force it to be S64 in

xla::PrimitiveType GetShapeDimensionType(
    const torch::lazy::BackendDevice* device) {
  torch::lazy::BackendDevice xla_device = GetDeviceOrCurrent(device);
  XlaDeviceType hw_type = static_cast<XlaDeviceType>(xla_device.type());
  return hw_type == XlaDeviceType::CPU ? xla::PrimitiveType::S64
                                       : xla::PrimitiveType::S32;
}
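
As a quick sanity check of that claim from Python, here is a minimal sketch that reuses the private torch_xla._XLAC._get_xla_tensors_hlo call and the t4 = t3.expand(shape0) graph from the pdb session above (the asserted substring comes from the HLO dump above):

# Sketch: with t4 = t3.expand(shape0) built as above (avoiding the crashing
# SizeAdd path), the HLO emits get-dimension-size with an s32 result even
# though the IR flags the size value as s64.
hlo_text = torch_xla._XLAC._get_xla_tensors_hlo([t4])
assert 's32[] get-dimension-size' in hlo_text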

@JackCaoG (Collaborator):

with

diff --git a/torch_xla/csrc/tensor_util.cpp b/torch_xla/csrc/tensor_util.cpp
index 4329fa0f..27e30982 100644
--- a/torch_xla/csrc/tensor_util.cpp
+++ b/torch_xla/csrc/tensor_util.cpp
@@ -1252,10 +1252,7 @@ bool RequiresRawTypeCasting(at::ScalarType scalar_type,
 
 xla::PrimitiveType GetShapeDimensionType(
     const torch::lazy::BackendDevice* device) {
-  torch::lazy::BackendDevice xla_device = GetDeviceOrCurrent(device);
-  XlaDeviceType hw_type = static_cast<XlaDeviceType>(xla_device.type());
-  return hw_type == XlaDeviceType::CPU ? xla::PrimitiveType::S64
-                                       : xla::PrimitiveType::S32;
+  return xla::PrimitiveType::S32;
 }

the issue is gone. That being said, SizeAdd needs to be able to handle type casting. Please change the lowering to

return node.ReturnOp(XlaHelpers::PromotedAdd(op0, op1), loctx);

@JackCaoG (Collaborator):

Here is the full diff

diff --git a/torch_xla/csrc/ops/dynamic_ir.cpp b/torch_xla/csrc/ops/dynamic_ir.cpp
index 01c20462..f1de87f0 100644
--- a/torch_xla/csrc/ops/dynamic_ir.cpp
+++ b/torch_xla/csrc/ops/dynamic_ir.cpp
@@ -7,6 +7,9 @@
 #include "torch_xla/csrc/tensor.h"
 #include "torch_xla/csrc/tensor_util.h"
 #include "torch_xla/csrc/xla_graph_executor.h"
+#include "torch_xla/csrc/convert_ops.h"
+#include "torch_xla/csrc/helpers.h"
+
 
 namespace torch_xla {
 
@@ -88,9 +91,11 @@ int64_t SizeAdd::getDynamicValue() const {
 std::string SizeAdd::ToString() const { return "aten::add_size"; }
 
 XlaOpVector SizeAdd::Lower(LoweringContext* loctx) const {
-  auto input1 = loctx->GetOutputOp(operand(0));
-  auto input2 = loctx->GetOutputOp(operand(1));
-  return ReturnOp((input1 + input2), loctx);
+  static xla::PrimitiveType dim_type = GetShapeDimensionType(/*device=*/nullptr);
+  xla::XlaOp input1 = loctx->GetOutputOp(operand(0));
+  xla::XlaOp input2 = loctx->GetOutputOp(operand(1));
+  xla::XlaOp res = XlaHelpers::PromotedAdd(input1, input2);
+  return ReturnOp(MaybeConvertTo(res, dim_type), loctx);
 }
 
 SizeEq::SizeEq(torch::lazy::Value a, torch::lazy::Value b)
diff --git a/torch_xla/csrc/tensor_util.cpp b/torch_xla/csrc/tensor_util.cpp
index 4329fa0f..27e30982 100644
--- a/torch_xla/csrc/tensor_util.cpp
+++ b/torch_xla/csrc/tensor_util.cpp
@@ -1252,10 +1252,7 @@ bool RequiresRawTypeCasting(at::ScalarType scalar_type,
 
 xla::PrimitiveType GetShapeDimensionType(
     const torch::lazy::BackendDevice* device) {
-  torch::lazy::BackendDevice xla_device = GetDeviceOrCurrent(device);
-  XlaDeviceType hw_type = static_cast<XlaDeviceType>(xla_device.type());
-  return hw_type == XlaDeviceType::CPU ? xla::PrimitiveType::S64
-                                       : xla::PrimitiveType::S32;
+  return xla::PrimitiveType::S32;
 }
 
 }  // namespace torch_xla

@vanbasten23 (Collaborator Author):

Thanks for looking into it. So to confirm: the s32 comes from SizeNode, and the s64 comes from %0 = s64[] prim::Constant(), location=to_node@symbolic_shapes.py:139, value=2. Regarding "looks suspicious, if I have to guess, GetDimensionSize also returns s32, but we incorrectly flag it as s64", we don't actually need to do anything there. Is that correct?

@JackCaoG (Collaborator):

My patch will force all sizes to be s32. As long as we make sure SizeAdd and the other size ops cast their results to s32, we should be good.

@vanbasten23 (Collaborator Author):

looks suspicious, if I have to guess, GetDimensionSize also returns s32, but we incorrectly flag it as s64.

But regarding your comment "looks suspicious, if I have to guess, GetDimensionSize also returns s32, but we incorrectly flag it as s64", we don't have to change the part that flags it as s64, right?

@JackCaoG (Collaborator):

No, we need to make sure the dimension type is always s32.

However, for the case of SizeAdd, if we have an s32 + s64, I think the result will be s64, hence we need MaybeConvertTo to cast the result back to s32.

@vanbasten23 (Collaborator Author):

Recap:
The failing test is

  def test_sizeAdd_shortened(self):
    self.assertNotEqual(os.environ['XLA_EXPERIMENTAL'], '')
    size1 = 5
    size2 = 2
    t1 = torch.zeros([size1, size2], device=dev)
    t1[3][0] = 1
    # t2 has size [<=10, 2]
    t2 = torch.nonzero(t1)

    shape0 = t2.shape[0]
    shape1 = t2.shape[1]
    dyn_size = shape0 + shape1
    t3 = torch.ones(1, device=dev)

    # Exercise SizeAdd::Lower.
    t4 = t3.expand(dyn_size)
    print(torch_xla._XLAC._get_xla_tensors_hlo([t4]))
    self.assertEqual(t4.size(0), 3)  # the exception is thrown here

What we concluded is that size is actually an s32 even on CPU, but we force it to be S64 in GetShapeDimensionType.

To confirm:

  1. To fix:
  • Change SizeAdd's Lower() to return node.ReturnOp(XlaHelpers::PromotedAdd(op0, op1), loctx);.
  • Cast SizeAdd's result to s32. Is that done by changing this line to s32?
  2. Since you mentioned "Size is actually a s32 even on CPU, but we force it to be S64 in XLA" in your comment, should we also change the return type of getDynamicValue from int64 to int32 as well?
    Also, should we change

xla::PrimitiveType GetShapeDimensionType(
    const torch::lazy::BackendDevice* device) {
  torch::lazy::BackendDevice xla_device = GetDeviceOrCurrent(device);
  XlaDeviceType hw_type = static_cast<XlaDeviceType>(xla_device.type());
  return hw_type == XlaDeviceType::CPU ? xla::PrimitiveType::S64
                                       : xla::PrimitiveType::S32;
}

just so that it always returns s32 for size operations?

@vanbasten23 force-pushed the addMoreTestsForSizeAdd branch from 89475b0 to 70a9a79 on January 10, 2023 22:56
@JackCaoG (Collaborator) left a comment:

mostly lgtm, minor nits

@vanbasten23 merged commit 0062e45 into master on Jan 11, 2023
Labels: dynamism (Dynamic Shape Features), testing (Testing and coverage related issues)