fix: Prevent in-place mutation in collective ops and simplify metric sync #3607
aaishwarymishra wants to merge 1 commit into pytorch:master from
Conversation
ignite/metrics/metric.py
Outdated
    # Backends should not mutate the caller's tensor; idist implementations
    # are normalized to operate on a cloned tensor. We can therefore pass
    # the original tensor and rely on the distributed layer to avoid
    # in-place modifications and always return the reduced value.
Once we have done that we can remove the comment, IMO
    # Work on a cloned tensor so backend implementations that perform in-place
    # operations do not mutate the original tensor passed by the caller.
    tensor = tensor.clone()
This method is used by collective ops like all_reduce and all_gather. all_gather does not do an in-place op, so it is unnecessary to copy the input there. The copy should only be done for all_reduce, IMO.
    if fn is self._do_all_reduce:
        tensor = tensor.clone()
I think this is more subtle than that:
1) idist.all_reduce can work on a float or a torch.Tensor. If the input is a float, a new tensor is created from it, so the copy is redundant.
2) idist.all_reduce works with various backends: pytorch nccl, gloo, horovod, xla. PyTorch distributed and XLA only do an in-place all-reduce op, while Horovod's allreduce is not an in-place op: https://horovod.readthedocs.io/en/stable/api.html#horovod.torch.allreduce
I get it. I can just apply this clone logic in the `reduce_all` method itself: if it's a tensor, clone it; if it's a float, do nothing.
As you mentioned, certain backends are doing the in-place ops, so I can just add an `isinstance(self, _XlaDistModel)` check to clone tensors only for those classes. Does that sound good?
I think it is more appropriate to clone the tensor inside the dispatched _do_all_reduce methods of Native and XLA backends.
Yeah, but the tensor reaching `_do_all_reduce` could be the original tensor, or, if a float was sent to `reduce_all`, a tensor freshly converted from it. So cloning inside `_do_all_reduce` can create a redundant tensor too, when the input was a float.
Also, to be sure: the comment said only `reduce_all` can modify the tensor in-place on native and xla, right?
> I don't think it will be BC breaking, as originally we were sending a cloned tensor to all_reduce, so all the in-place operations were always happening in the cloned tensor. I could be wrong though.

If that is the case, why do we want this PR then?
Please try locally and see. This is easily emulated on CPU, for example using scripts like:
https://colab.research.google.com/github/pytorch-ignite/playground/blob/main/index.ipynb#scrollTo=ddMXyp4ek1DO
In distributed/comp_models/horovod.py, at line 195, hvd.allreduce returns a new tensor, while dist.all_reduce in native.py and xm.all_reduce in xla.py operate in-place. That is dual behaviour, so the main objective here is to pick one: either mutate the tensor or return a new one.
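The dual behaviour described above can be demonstrated with a toy example that uses no real backend. The two function names below are stand-ins for Horovod's out-of-place allreduce versus the in-place dist.all_reduce / xm.all_reduce, using plain lists in place of tensors:

```python
def allreduce_out_of_place(values, world_size=2):
    # like hvd.allreduce: builds and returns a new result, input untouched
    return [v * world_size for v in values]


def allreduce_in_place(values, world_size=2):
    # like dist.all_reduce / xm.all_reduce: mutates the caller's data
    for i in range(len(values)):
        values[i] *= world_size
    return values


a = [1, 2]
b = [1, 2]
result = allreduce_out_of_place(a)
allreduce_in_place(b)
print(a, b)  # a is unchanged, b was mutated in place
```

A caller that reuses its input after the collective op will see different values depending on which backend was active, which is exactly the inconsistency under discussion.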
> yeah but _do_all_reduce tensor could be the original tensor and if the float was sent to reduce_all it gets converted to tensor, so cloning inside _do_all_reduce can create redundant tensor too if the input was float.

Yes, you are right about that, my bad.

> Also to be sure comment said only `reduce_all` can modify the tensor in-place right on native and xla?

There is no `reduce_all` method, but `all_reduce` is the one in-place op for the native and xla backends. On the other hand, I'm thinking that if we fix this behavior, it will be BC-breaking for people relying on that in-place behavior with the native pytorch distributed backend. We should put a note about that in the docs.
You are right, we would also have to refactor code in the project itself that uses all_reduce. The main question is: should we change the behaviour at all? And if yes, should it return a new tensor or mutate in place?
Sorry, I think we were not on the same page before.
Personally, I wouldn't change the behavior, for several reasons:
- if we adopt either of the two options, it will be BC breaking.
- among the supported backends, torch native is the most useful, as horovod and xla are mostly sunsetting. I prefer not to change the working part.
Force-pushed: 4ad94cf to 048b5c9, then b3739c8 to 4bb7da1.
…effects in native and XLA distributed models
This pull request improves the handling of tensors in distributed operations to ensure that backend implementations do not inadvertently modify the original tensors passed by callers. The changes focus on preventing in-place mutations and standardizing the behavior across distributed backends.
Distributed tensor mutation prevention:
In ignite/distributed/comp_models/base.py, the tensor is now cloned before applying backend operations, ensuring that any in-place modifications by backend implementations do not affect the original tensor.

Standardization of distributed reduction behavior:
In ignite/metrics/metric.py, the wrapper no longer clones tensors before passing them to idist.all_reduce, relying on the distributed layer to avoid in-place mutations and always return the reduced value. This change aligns the behavior across backends and simplifies the code.
…tric sync_all_reduce