
Commit 29d439a

[DPO] average_log_prob when loss is IPO (huggingface#1265)
* average_log_prob when loss is IPO
* updated docs with the fix
1 parent 5760e5d commit 29d439a

File tree

* docs/source/dpo_trainer.mdx
* trl/trainer/dpo_trainer.py

2 files changed: +2 -2 lines changed


docs/source/dpo_trainer.mdx

Lines changed: 1 addition & 1 deletion
```diff
@@ -86,7 +86,7 @@ Given the preference data, we can fit a binary classifier according to the Bradl
 
 The [RSO](https://arxiv.org/abs/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://arxiv.org/abs/2305.10425) paper. The `DPOTrainer` can be switched to this loss via the `loss_type="hinge"` argument and the `beta` in this case is the reciprocal of the margin.
 
-The [IPO](https://arxiv.org/abs/2310.12036) authors provide a deeper theoretical understanding of the DPO algorithms and identify an issue with overfitting and propose an alternative loss which can be used via the `loss_type="ipo"` argument to the trainer.
+The [IPO](https://arxiv.org/abs/2310.12036) authors provide a deeper theoretical understanding of the DPO algorithms, identify an issue with overfitting, and propose an alternative loss which can be used via the `loss_type="ipo"` argument to the trainer. Note that the `beta` parameter is the reciprocal of the gap between the log-likelihood ratios of the chosen and rejected completion pair, so the smaller the `beta`, the larger this gap is. As per the paper, the loss is averaged over the log-likelihoods of the completion (unlike DPO, where they are summed).
 
 The [cDPO](https://ericmitchell.ai/cdpo.pdf) is a tweak on the DPO loss where we assume that the preference labels are noisy with some probability that can be passed to the `DPOTrainer` via the `label_smoothing` argument (between 0 and 0.5), after which a conservative DPO loss is used. Use the `loss_type="cdpo"` argument to the trainer to use it.
 
```
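For context, enabling this loss from user code only takes the `loss_type` argument; nothing else about the training setup changes. A minimal, hypothetical sketch (the `model`, `ref_model`, `training_args`, `train_dataset`, and `tokenizer` objects are placeholders, and the `DPOTrainer` constructor has evolved across TRL releases):

```python
from trl import DPOTrainer

# Sketch only: every object below is assumed to be set up elsewhere.
trainer = DPOTrainer(
    model,                        # policy model to optimize
    ref_model,                    # frozen reference model
    args=training_args,
    beta=0.1,                     # inverse of the target log-likelihood-ratio gap
    loss_type="ipo",              # after this commit, also enables log-prob averaging
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```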

trl/trainer/dpo_trainer.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -966,7 +966,7 @@ def concatenated_forward(
         all_logps = self.get_batch_logps(
             all_logits,
             concatenated_batch["concatenated_labels"],
-            average_log_prob=False,
+            average_log_prob=self.loss_type == "ipo",
             is_encoder_decoder=self.is_encoder_decoder,
             label_pad_token_id=self.label_pad_token_id,
         )
```
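The toggled flag controls whether per-token log-probabilities of the labeled completion are summed (DPO) or averaged over completion tokens (IPO). A simplified, self-contained sketch of the computation `get_batch_logps` performs (the real method also handles encoder-decoder models; names and shapes here are illustrative):

```python
import torch

def batch_logps(logits: torch.Tensor, labels: torch.Tensor,
                label_pad_token_id: int = -100,
                average_log_prob: bool = False) -> torch.Tensor:
    """Return per-sequence log-probs: summed (DPO) or averaged (IPO)."""
    # Shift so the token at position t is predicted from the logits at t - 1.
    labels = labels[:, 1:].clone()              # (batch, seq_len - 1)
    logits = logits[:, :-1, :]                  # (batch, seq_len - 1, vocab)
    loss_mask = labels != label_pad_token_id

    # Replace padded positions with a dummy id so gather() stays in range.
    labels = labels.masked_fill(~loss_mask, 0)
    per_token_logps = torch.gather(
        logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)
    ).squeeze(2)

    if average_log_prob:
        # IPO: length-normalized mean over completion tokens.
        return (per_token_logps * loss_mask).sum(-1) / loss_mask.sum(-1)
    # DPO: plain sum over completion tokens.
    return (per_token_logps * loss_mask).sum(-1)
```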

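These length-averaged log-probs then feed the IPO objective, which regresses the gap between the policy and reference log-likelihood ratios toward 1 / (2 * beta), making `beta` inversely related to the target gap, as the docs change above notes. A sketch of that loss with simplified names (per-pair losses; the trainer averages them over the batch):

```python
import torch

def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Squared-error IPO loss on (length-averaged) log-prob gaps, a sketch."""
    # Gap between policy and reference log-likelihood ratios.
    h = (policy_chosen_logps - policy_rejected_logps) - (
        ref_chosen_logps - ref_rejected_logps
    )
    # IPO regresses h toward 1 / (2 * beta): smaller beta, larger target gap.
    return (h - 1 / (2 * beta)) ** 2
```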