
Add more tests for sizeAdd node. #4258


Merged (14 commits into master on Jan 11, 2023)

Conversation

vanbasten23 (Collaborator):

  • Test SizeAdd::Lower function.
  • Fix an issue with another existing test "test_simple_expand".

# Exercise SizeAdd::Lower.
t4 = t3.expand(dyn_size)
self.assertEqual(t4.size(0), 3)
print(torch_xla._XLAC._get_xla_tensors_text([t4]))
vanbasten23 (Collaborator Author):

I wasn't able to test the SizeAdd::ToString() method. Let me look into it further.

Collaborator:

What's the output of torch_xla._XLAC._get_xla_tensors_text([t4]) here? We want to do a self.assertIn on the IR here.

vanbasten23 (Collaborator Author):

The output of torch_xla._XLAC._get_xla_tensors_text([t4]) here is

IR {
  %0 = s32[] prim::Constant(), value=2
  %1 = f32[] prim::Constant(), value=1
  %2 = f32[1]{0} aten::view(%1), output_size=(1)
  %3 = f32[] prim::Constant(), value=0
  %4 = f32[5,2]{1,0} aten::expand(%3), size=(5, 2)
  %5 = f32[1,2]{1,0} xla::generic_slice(%4), base_indices=(3, 0), sizes=(1, 2)
  %6 = f32[2]{0} aten::view(%5), output_size=(2)
  %7 = f32[2]{0} xla::update_slice(%6, %2), base_indices=(0)
  %8 = f32[1,2]{1,0} aten::view(%7), output_size=(1, 2)
  %9 = f32[5,2]{1,0} xla::update_slice(%4, %8), base_indices=(3, 0)
  %10 = (s32[<=10,2]{1,0}, s32[]) aten::nonzero(%9), num_outputs=2
  %11 = s32[] aten::size(%10.0)
  %12 = s64[] aten::add(%11, %0)
  %13 = f32[] prim::Constant(), value=1
  %14 = f32[1]{0} aten::expand(%13), size=(1)
  %15 = f32[<=12]{0} aten::expand(%14, %12), size=(12), dynamic_dims=(1), ROOT=0
}

It seems hard to test this in the Python code. How about I test it in

TEST_F(IrTest, TestSizeAddNode) {

Collaborator:

You can just assert that f32[<=12]{0} aten::expand is in the output of torch_xla._XLAC._get_xla_tensors_text([t4]).
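
For reference, a minimal sketch of that assertion, assuming a unittest.TestCase method in the style of the test_sizeAdd_shortened test quoted later in this thread (torch, torch_xla, and dev come from test_dynamic_shapes.py; the method name is illustrative, and the asserted substring is taken from the IR dump above):

  def test_sizeAdd_ir_contains_expand(self):
    t1 = torch.zeros([5, 2], device=dev)
    t1[3][0] = 1
    t2 = torch.nonzero(t1)  # t2 has dynamic size [<=10, 2]
    dyn_size = t2.shape[0] + t2.shape[1]
    t3 = torch.ones(1, device=dev)
    t4 = t3.expand(dyn_size)  # exercises SizeAdd::Lower
    ir_text = torch_xla._XLAC._get_xla_tensors_text([t4])
    # The expand over the dynamic size shows up in the IR dump above as
    # "f32[<=12]{0} aten::expand", so assert on that substring.
    self.assertIn('f32[<=12]{0} aten::expand', ir_text)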

vanbasten23 (Collaborator Author):

What I am trying to test is SizeAdd::ToString() rather than expand. I set SizeAdd::ToString() to return "SizeAdd for op aten::add", but as you can see the string does not appear in torch_xla._XLAC._get_xla_tensors_text([t4]).

So I don't think torch_xla._XLAC._get_xla_tensors_text([t4]) will exercise SizeAdd::ToString(). That's why I suggest testing it in test/cpp/test_ir.cpp in this PR.

@vanbasten23 marked this pull request as ready for review on December 1, 2022 01:20
@vanbasten23 (Collaborator Author):

OK, now the test that I added, TestDynamicShapes.test_sizeAdd, passes on TPU but fails on CPU. I was able to repro the failure on my local cloudtop machine, and the failure is:

RuntimeError: Error while lowering: SizeAdd for op aten::add
XLA builder error: INVALID_ARGUMENT: Binary op add with different element types: s32[] and s64[].: 

The HLO and the IR graph don't reveal the problem either, and the full error message doesn't show any more useful info.

Via gdb, I was able to break at

auto input2 = loctx->GetOutputOp(operand(1));
and find out that input1 has type s32 and input2 has type s64, hence the type mismatch. However, I couldn't find where input1 and input2 come from, because p input1 doesn't reveal any useful info.

So my questions are:

  1. Why didn't the IR graph or the HLO graph show the type mismatch?
  2. What do you suggest for debugging further?

@JackCaoG

@JackCaoG (Collaborator) left a comment:

Let's dig a bit into the following error:

======================================================================
ERROR: test_sizeAdd (__main__.TestDynamicShapes)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/pytorch/xla/test/test_dynamic_shapes.py", line 48, in test_sizeAdd
    self.assertEqual(t4.size(0), 3)
  File "/opt/conda/lib/python3.7/unittest/case.py", line 852, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/opt/conda/lib/python3.7/unittest/case.py", line 842, in _baseAssertEqual
    if not first == second:
  File "/opt/conda/lib/python3.7/site-packages/torch/__init__.py", line 212, in __bool__
    return self.node.bool_()
RuntimeError: Error while lowering: SizeAdd for op aten::add
XLA builder error: INVALID_ARGUMENT: Binary op add with different element types: s32[] and s64[].:

We need to identify where the s32 is coming from. Dumping the IR of t4 should help. Maybe we should make SizeAdd handle the casting.

@vanbasten23 (Collaborator Author):

Dumping the IR of t4 should help. Maybe we should make SizeAdd handle the casting.

I already did. Here is the IR of t4; specifically:

%0 = s32[] prim::Constant(), location=to_node@symbolic_shapes.py:135, value=2
%11 = s32[] aten::size(%10.0)
%12 = s64[] aten::add(%11, %0)

Both operands of aten::add are type s32. Via gdb, I was able to find #4258 (comment).

@miladm added the dynamism (Dynamic Shape Features) and testing (Testing and coverage related issues) labels on Dec 13, 2022
@vanbasten23 (Collaborator Author):

@JackCaoG I followed your suggestion to create a shorter repro, test_sizeAdd_shortened, and tried to examine the HLO dump in order to find where the s32/s64 mismatch comes from. But I couldn't get an HLO graph. I tried print(torch_xla._XLAC._get_xla_tensors_hlo([t4])) and $ debug_flag save_ir XLA_DUMP_HLO_GRAPH=1 python3 pytorch/xla/test/test_dynamic_shapes.py TestDynamicShapes.test_sizeAdd_shortened. Neither of them generates the HLO for me. Do you know if I'm missing anything?

@vanbasten23 requested a review from JackCaoG on December 20, 2022 06:01
@vanbasten23 mentioned this pull request on Dec 20, 2022
@JackCaoG (Collaborator):

I will try to repro sometime today.

@JackCaoG (Collaborator):

I am able to repro the error; I will post my debugging steps tonight.

@JackCaoG (Collaborator):

After verifying that expand is the source of the error, I dumped the IR of t4:

(Pdb) print(torch_xla._XLAC._get_xla_tensors_text([t4]))
IR {
  %0 = s64[] prim::Constant(), location=to_node@symbolic_shapes.py:139, value=2
  %1 = f32[] prim::Constant(), location=test_sizeAdd_shortened@test_dynamic_shapes.py:94, value=1
  %2 = f32[1]{0} aten::view(%1), location=test_sizeAdd_shortened@test_dynamic_shapes.py:96, output_size=(1)
  %3 = f32[] prim::Constant(), location=test_sizeAdd_shortened@test_dynamic_shapes.py:93, value=0
  %4 = f32[5,2]{1,0} aten::expand(%3), location=test_sizeAdd_shortened@test_dynamic_shapes.py:93, size=(5, 2)
  %5 = f32[1,2]{1,0} xla::generic_slice(%4), location=test_sizeAdd_shortened@test_dynamic_shapes.py:96, base_indices=(3, 0), sizes=(1, 2)
  %6 = f32[2]{0} aten::view(%5), location=test_sizeAdd_shortened@test_dynamic_shapes.py:96, output_size=(2)
  %7 = f32[2]{0} xla::update_slice(%6, %2), location=test_sizeAdd_shortened@test_dynamic_shapes.py:96, base_indices=(0)
  %8 = f32[1,2]{1,0} aten::view(%7), location=test_sizeAdd_shortened@test_dynamic_shapes.py:96, output_size=(1, 2)
  %9 = f32[5,2]{1,0} xla::update_slice(%4, %8), location=test_sizeAdd_shortened@test_dynamic_shapes.py:96, base_indices=(3, 0)
  %10 = (s64[<=10,2]{1,0}, s64[]) aten::nonzero(%9), num_outputs=2, location=test_sizeAdd_shortened@test_dynamic_shapes.py:96
  %11 = s64[] aten::size(%10.0)
  %12 = s64[] aten::add(%11, %0)
  %13 = f32[] prim::Constant(), location=test_sizeAdd_shortened@test_dynamic_shapes.py:104, value=1
  %14 = f32[1]{0} aten::expand(%13), location=test_sizeAdd_shortened@test_dynamic_shapes.py:104, size=(1)
  %15 = f32[<=12]{0} aten::expand(%14, %12), location=test_sizeAdd_shortened@test_dynamic_shapes.py:106, size=(12), dynamic_dims=(1), ROOT=0
}

Weirdly, both inputs seem to be s64, and this is the only explicit add in the IR.

(Pdb) t4
*** RuntimeError: Error while lowering: aten::add_size
XLA builder error: INVALID_ARGUMENT: Binary op add with different element types: s32[] and s64[].: 

One possibility is that the add that errors out here does not come from aten::add explicitly but is part of the lowering of some other op, for example expand. I am just going to use gdb to verify.

(pytorch) root@65d79ef78e4c:/pytorch/xla# dynamic cgdb python


(gdb) catch throw
Catchpoint 1 (throw)
(gdb) r test/test_dynamic_shapes.py TestDynamicShapes.test_sizeAdd_shortened

and I saw

 82│ LoweringContext::LoweringContext(
 83│     const std::string& name, torch::lazy::BackendDevice device,
 84│     c10::ArrayRef<const torch::lazy::Node*> post_order,
 85│     torch::lazy::Util::EmissionMap emit_status)
 86│     : torch::lazy::LoweringContext(name, device, {}, emit_status),
 87│       builder_(name) {
 88│   for (auto node : post_order) {
 89├>    LowerNode(node);
 90│   }
 91│ }

and if I do

(gdb) p node->ToString()
$1 = "aten::add_size"

So I was wrong; it is actually coming from aten::add_size. Now I need to go back to the lowering code of SizeAdd and figure out why the types match at the IR level but fail during the actual lowering.

@JackCaoG (Collaborator):

  %11 = s64[] aten::size(%10.0)

looks suspicious. If I have to guess, GetDimensionSize also returns s32, but we incorrectly flag it as s64.

@JackCaoG (Collaborator):

OK, confirmed. If I do the following (which avoids SizeAdd and uses SizeNode directly, which won't crash):

(Pdb) t4 = t3.expand(shape0)
(Pdb) print(torch_xla._XLAC._get_xla_tensors_hlo([t4]))

I am able to see

...
  %get-dimension-size.53 = s32[] get-dimension-size(s64[<=10,2]{1,0} %set-dimension-size.52), dimensions={0}, metadata={op_type="aten__size" op_name="aten__size" source_file="test_sizeAdd_shortened@test_dynamic_shapes.py" source_line=100}
...

which suggests the size is actually an s32 even on CPU, but we force it to be S64 in

xla::PrimitiveType GetShapeDimensionType(
    const torch::lazy::BackendDevice* device) {
  torch::lazy::BackendDevice xla_device = GetDeviceOrCurrent(device);
  XlaDeviceType hw_type = static_cast<XlaDeviceType>(xla_device.type());
  return hw_type == XlaDeviceType::CPU ? xla::PrimitiveType::S64
                                       : xla::PrimitiveType::S32;
}
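
As a quick sanity check of that claim from Python, here is a minimal sketch that reuses the private torch_xla._XLAC._get_xla_tensors_hlo call and the t4 = t3.expand(shape0) graph from the pdb session above (the asserted substring comes from the HLO dump above):

# Sketch: with t4 = t3.expand(shape0) built as above (avoiding the crashing
# SizeAdd path), the HLO emits get-dimension-size with an s32 result even
# though the IR flags the size value as s64.
hlo_text = torch_xla._XLAC._get_xla_tensors_hlo([t4])
assert 's32[] get-dimension-size' in hlo_text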

@JackCaoG (Collaborator):

with

diff --git a/torch_xla/csrc/tensor_util.cpp b/torch_xla/csrc/tensor_util.cpp
index 4329fa0f..27e30982 100644
--- a/torch_xla/csrc/tensor_util.cpp
+++ b/torch_xla/csrc/tensor_util.cpp
@@ -1252,10 +1252,7 @@ bool RequiresRawTypeCasting(at::ScalarType scalar_type,
 
 xla::PrimitiveType GetShapeDimensionType(
     const torch::lazy::BackendDevice* device) {
-  torch::lazy::BackendDevice xla_device = GetDeviceOrCurrent(device);
-  XlaDeviceType hw_type = static_cast<XlaDeviceType>(xla_device.type());
-  return hw_type == XlaDeviceType::CPU ? xla::PrimitiveType::S64
-                                       : xla::PrimitiveType::S32;
+  return xla::PrimitiveType::S32;
 }

the issue is gone. That being said, SizeAdd needs to be able to handle type casting. Please change the lowering to

return node.ReturnOp(XlaHelpers::PromotedAdd(op0, op1), loctx);

@JackCaoG (Collaborator):

Here is the full diff

diff --git a/torch_xla/csrc/ops/dynamic_ir.cpp b/torch_xla/csrc/ops/dynamic_ir.cpp
index 01c20462..f1de87f0 100644
--- a/torch_xla/csrc/ops/dynamic_ir.cpp
+++ b/torch_xla/csrc/ops/dynamic_ir.cpp
@@ -7,6 +7,9 @@
 #include "torch_xla/csrc/tensor.h"
 #include "torch_xla/csrc/tensor_util.h"
 #include "torch_xla/csrc/xla_graph_executor.h"
+#include "torch_xla/csrc/convert_ops.h"
+#include "torch_xla/csrc/helpers.h"
+
 
 namespace torch_xla {
 
@@ -88,9 +91,11 @@ int64_t SizeAdd::getDynamicValue() const {
 std::string SizeAdd::ToString() const { return "aten::add_size"; }
 
 XlaOpVector SizeAdd::Lower(LoweringContext* loctx) const {
-  auto input1 = loctx->GetOutputOp(operand(0));
-  auto input2 = loctx->GetOutputOp(operand(1));
-  return ReturnOp((input1 + input2), loctx);
+  static xla::PrimitiveType dim_type = GetShapeDimensionType(/*device=*/nullptr);
+  xla::XlaOp input1 = loctx->GetOutputOp(operand(0));
+  xla::XlaOp input2 = loctx->GetOutputOp(operand(1));
+  xla::XlaOp res = XlaHelpers::PromotedAdd(input1, input2);
+  return ReturnOp(MaybeConvertTo(res, dim_type), loctx);
 }
 
 SizeEq::SizeEq(torch::lazy::Value a, torch::lazy::Value b)
diff --git a/torch_xla/csrc/tensor_util.cpp b/torch_xla/csrc/tensor_util.cpp
index 4329fa0f..27e30982 100644
--- a/torch_xla/csrc/tensor_util.cpp
+++ b/torch_xla/csrc/tensor_util.cpp
@@ -1252,10 +1252,7 @@ bool RequiresRawTypeCasting(at::ScalarType scalar_type,
 
 xla::PrimitiveType GetShapeDimensionType(
     const torch::lazy::BackendDevice* device) {
-  torch::lazy::BackendDevice xla_device = GetDeviceOrCurrent(device);
-  XlaDeviceType hw_type = static_cast<XlaDeviceType>(xla_device.type());
-  return hw_type == XlaDeviceType::CPU ? xla::PrimitiveType::S64
-                                       : xla::PrimitiveType::S32;
+  return xla::PrimitiveType::S32;
 }
 
 }  // namespace torch_xla

@vanbasten23 (Collaborator Author):

Thanks for looking into it. So to confirm: the s32 comes from SizeNode, and the s64 comes from %0 = s64[] prim::Constant(), location=to_node@symbolic_shapes.py:139, value=2. Regarding "looks suspicious, if I have to guess, GetDimensionSize also returns s32, but we incorrectly flag it as s64", we don't actually need to do anything there. Is that correct?

@JackCaoG (Collaborator):

My patch will force all sizes to be s32. As long as we make sure SizeAdd and the other size ops cast their results to s32, we should be good.

@vanbasten23 (Collaborator Author):

looks suspicious, if I have to guess, GetDimensionSize also returns s32, but we incorrectly flag it as s64.

But regarding your comment "looks suspicious, if I have to guess, GetDimensionSize also returns s32, but we incorrectly flag it as s64", we don't have to change the part that flags it as s64, right?

@JackCaoG (Collaborator):

No, we need to make sure the dimension type is always s32.

However, for the case of SizeAdd, if we have an s32 + s64, I think the result will be s64, hence we need MaybeConvertTo to cast the result back to s32.

@vanbasten23 (Collaborator Author):

Recap:
The failing test is

  def test_sizeAdd_shortened(self):
    self.assertNotEqual(os.environ['XLA_EXPERIMENTAL'], '')
    size1 = 5
    size2 = 2
    t1 = torch.zeros([size1, size2], device=dev)
    t1[3][0] = 1
    # t2 has size [<=10, 2]
    t2 = torch.nonzero(t1)

    shape0 = t2.shape[0]
    shape1 = t2.shape[1]
    dyn_size = shape0 + shape1
    t3 = torch.ones(1, device=dev)

    # Exercise SizeAdd::Lower.
    t4 = t3.expand(dyn_size)
    print(torch_xla._XLAC._get_xla_tensors_hlo([t4]))
    self.assertEqual(t4.size(0), 3)  # the exception is thrown here

What we concluded is that size is actually an s32 even on CPU, but we force it to be S64 in GetShapeDimensionType.

To confirm:

  1. To fix:
  • Change SizeAdd's Lower() to return node.ReturnOp(XlaHelpers::PromotedAdd(op0, op1), loctx);.
  • Cast SizeAdd's result to s32. Is that done by changing this line to s32?
  2. Since you mentioned "Size is actually a s32 even on CPU, but we force it to be S64 in XLA" in your comment, should we also change the return type of getDynamicValue from int64 to int32 as well?
    Also, should we change

xla::PrimitiveType GetShapeDimensionType(
    const torch::lazy::BackendDevice* device) {
  torch::lazy::BackendDevice xla_device = GetDeviceOrCurrent(device);
  XlaDeviceType hw_type = static_cast<XlaDeviceType>(xla_device.type());
  return hw_type == XlaDeviceType::CPU ? xla::PrimitiveType::S64
                                       : xla::PrimitiveType::S32;
}

just so that it always returns s32 for size operations?

@vanbasten23 force-pushed the addMoreTestsForSizeAdd branch from 89475b0 to 70a9a79 on January 10, 2023 22:56
@JackCaoG (Collaborator) left a comment:

mostly lgtm, minor nits

@vanbasten23 merged commit 0062e45 into master on Jan 11, 2023
Labels: dynamism (Dynamic Shape Features), testing (Testing and coverage related issues)