# Judges

> [!WARNING]
> TRL Judges is an experimental API which is subject to change at any time. As of TRL v1.0, judges have been moved to the `trl.experimental.judges` module.

TRL provides judges to easily compare two completions.

Make sure to have installed the required dependencies by running:

```bash
pip install trl[judges]
```

## Using the provided judges

TRL provides several judges out of the box. For example, you can use the [experimental.judges.HfPairwiseJudge](/docs/trl/v1.0.0rc1/en/judges#trl.experimental.judges.HfPairwiseJudge) to compare two completions using a pre-trained model from the Hugging Face model hub:

```python
from trl.experimental.judges import HfPairwiseJudge

judge = HfPairwiseJudge()
judge.judge(
    prompts=["What is the capital of France?", "What is the biggest planet in the solar system?"],
    completions=[["Paris", "Lyon"], ["Saturn", "Jupiter"]],
)  # Outputs: [0, 1]
```

## Define your own judge

To define your own judge, we provide several base classes that you can subclass. For rank-based judges, you need to subclass [experimental.judges.BaseRankJudge](/docs/trl/v1.0.0rc1/en/judges#trl.experimental.judges.BaseRankJudge) and implement the [experimental.judges.BaseRankJudge.judge()](/docs/trl/v1.0.0rc1/en/judges#trl.experimental.judges.BaseRankJudge.judge) method. For pairwise judges, you need to subclass [experimental.judges.BasePairwiseJudge](/docs/trl/v1.0.0rc1/en/judges#trl.experimental.judges.BasePairwiseJudge) and implement the [experimental.judges.BasePairwiseJudge.judge()](/docs/trl/v1.0.0rc1/en/judges#trl.experimental.judges.BasePairwiseJudge.judge) method. If you want to define a judge that doesn't fit into these categories, you need to subclass [experimental.judges.BaseJudge](/docs/trl/v1.0.0rc1/en/judges#trl.experimental.judges.BaseJudge) and implement the [experimental.judges.BaseJudge.judge()](/docs/trl/v1.0.0rc1/en/judges#trl.experimental.judges.BaseJudge.judge) method.

As an example, let's define a pairwise judge that prefers shorter completions:

```python
from trl.experimental.judges import BasePairwiseJudge

class PrefersShorterJudge(BasePairwiseJudge):
    def judge(self, prompts, completions, shuffle_order=False):
        return [0 if len(completion[0]) < len(completion[1]) else 1 for completion in completions]
```

You can then use this judge as follows:

```python
judge = PrefersShorterJudge()
judge.judge(
    prompts=["What is the capital of France?", "What is the biggest planet in the solar system?"],
    completions=[["Paris", "The capital of France is Paris."], ["Jupiter is the biggest planet in the solar system.", "Jupiter"]],
)  # Outputs: [0, 1]
```

## Provided judges

### PairRMJudge[[trl.experimental.judges.PairRMJudge]]

#### trl.experimental.judges.PairRMJudge[[trl.experimental.judges.PairRMJudge]]

[Source](https://github.com/huggingface/trl/blob/v1.0.0rc1/trl/experimental/judges/judges.py#L200)

LLM judge based on the PairRM model from AllenAI.

This judge uses the PairRM model to rank pairs of completions for given prompts. It's designed for pairwise
comparison of language model outputs. The PairRM model is loaded using the llm-blender library and runs on the
default Accelerator device.

**Attributes:**

blender (`llm_blender.Blender`) : An instance of the Blender class from llm-blender.

**Example:**
```python
>>> from trl.experimental.judges import PairRMJudge

>>> pairrm_judge = PairRMJudge()
>>> prompts = ["Translate 'hello' to French", "What's the capital of Japan?"]
>>> completions = [["Bonjour", "Salut"], ["Kyoto", "Tokyo"]]
>>> results = pairrm_judge.judge(prompts, completions)
>>> print(results)  # [0, 1] (first completion preferred for the first prompt, second for the second)
```

> [!TIP]
> This class requires the llm-blender library to be installed. Install it with: `pip install llm-blender`.

##### judge[[trl.experimental.judges.PairRMJudge.judge]]

[Source](https://github.com/huggingface/trl/blob/v1.0.0rc1/trl/experimental/judges/judges.py#L242)

`judge(prompts: list, completions: list, shuffle_order: bool = True, return_scores: bool = False, temperature: float = 1.0)`

Judge the completion pairs for the given prompts using the PairRM model.

> [!NOTE]
> Unlike llm-blender, ranks are 0-indexed (`0` means the first completion is preferred).

**Parameters:**

prompts (`list[str]`) : List of prompts to judge.

completions (`list[list[str]]`) : List of completion pairs for each prompt.

shuffle_order (`bool`, *optional*, defaults to `True`) : Whether to shuffle the order of the completions to avoid positional bias.

return_scores (`bool`, *optional*, defaults to `False`) : If `True`, return probability scores of the first completion instead of ranks (i.e. a *soft-judge*).

temperature (`float`, *optional*, defaults to `1.0`) : Temperature for scaling logits if `return_scores` is `True`.

**Returns:**

`list[int | float]`

If `return_scores` is `False`, returns a list of ranks (`0` or `1`) for each prompt, indicating which
completion is preferred. If `return_scores` is `True`, returns softmax probabilities for the first
completion.

**Raises:**

`ValueError` : If the number of completions per prompt is not exactly 2.
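
For instance, the soft-judge mode can be requested with `return_scores=True`. The following is a sketch; the commented score values are illustrative, not real model outputs:

```python
>>> from trl.experimental.judges import PairRMJudge

>>> pairrm_judge = PairRMJudge()
>>> prompts = ["Translate 'hello' to French", "What's the capital of Japan?"]
>>> completions = [["Bonjour", "Salut"], ["Kyoto", "Tokyo"]]
>>> scores = pairrm_judge.judge(prompts, completions, return_scores=True)
>>> # scores holds softmax probabilities that the first completion wins, e.g., [0.92, 0.15]
```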

### HfPairwiseJudge[[trl.experimental.judges.HfPairwiseJudge]]

#### trl.experimental.judges.HfPairwiseJudge[[trl.experimental.judges.HfPairwiseJudge]]

[Source](https://github.com/huggingface/trl/blob/v1.0.0rc1/trl/experimental/judges/judges.py#L309)

Pairwise judge based on the Hugging Face API with chat completion.

This judge is relevant for assessing the quality of chat models, where the completion is a response to a given prompt.

**Parameters:**

model (`str`, *optional*, defaults to `"meta-llama/Meta-Llama-3-70B-Instruct"`) : Model to use for the judge.

token (`str`, *optional*) : Hugging Face API token to use for the `huggingface_hub.InferenceClient`.

system_prompt (`str`, *optional*) : The system prompt to be used for the judge. If not provided, a default prompt is used. Note that the system prompt should contain the following placeholders: `{prompt}`, `{response0}`, and `{response1}`. Also, inference is called with `max_tokens=1`, so the system prompt should ask for a single-token response.
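
For example, a custom system prompt can be passed at construction time. This is a minimal sketch; the prompt wording is illustrative, and the placeholders are filled in by the judge:

```python
from trl.experimental.judges import HfPairwiseJudge

# Illustrative custom system prompt: it must keep the {prompt}, {response0}, and
# {response1} placeholders and ask for a single-token answer, since inference is
# called with max_tokens=1.
system_prompt = (
    "Given the prompt:\n{prompt}\n\n"
    "Which response is better?\n"
    "Response 0: {response0}\n"
    "Response 1: {response1}\n"
    "Reply with exactly one character: 0 or 1."
)
judge = HfPairwiseJudge(system_prompt=system_prompt)
```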

### OpenAIPairwiseJudge[[trl.experimental.judges.OpenAIPairwiseJudge]]

#### trl.experimental.judges.OpenAIPairwiseJudge[[trl.experimental.judges.OpenAIPairwiseJudge]]

[Source](https://github.com/huggingface/trl/blob/v1.0.0rc1/trl/experimental/judges/judges.py#L365)

Judge based on the OpenAI API.

This judge is relevant for assessing the quality of chat models, where the completion is a response to a given prompt.

**Parameters:**

model (`str`, *optional*, defaults to `"gpt-4-turbo-preview"`) : Model to use for the judge.

system_prompt (`str`, *optional*) : System prompt to be used for the judge. If not provided, a default prompt is used. Note that the system prompt should contain the following placeholders: `{prompt}`, `{response0}`, and `{response1}`. Also, inference is called with `max_tokens=1`, so the system prompt should ask for a single-token response.

max_requests (`int` or `None`, *optional*, defaults to `1000`) : Maximum number of requests to make to the OpenAI API. If set to `None`, there is no limit.
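
For instance, the judge can be instantiated without a request cap. This sketch assumes a valid OpenAI API key is configured in the environment, as the underlying OpenAI client expects:

```python
from trl.experimental.judges import OpenAIPairwiseJudge

# max_requests=None removes the request cap; the default stops after 1000 requests.
judge = OpenAIPairwiseJudge(model="gpt-4-turbo-preview", max_requests=None)
judge.judge(
    prompts=["What is the capital of France?"],
    completions=[["Paris", "Lyon"]],
)  # e.g., [0] (illustrative)
```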

### AllTrueJudge[[trl.experimental.judges.AllTrueJudge]]

#### trl.experimental.judges.AllTrueJudge[[trl.experimental.judges.AllTrueJudge]]

[Source](https://github.com/huggingface/trl/blob/v1.0.0rc1/trl/experimental/judges/judges.py#L440)

Unify the decision of multiple [experimental.judges.BaseBinaryJudge](/docs/trl/v1.0.0rc1/en/judges#trl.experimental.judges.BaseBinaryJudge) instances.

Returns `1` only if all inner binary judges return `1`. If any judge returns `0`, it returns `0`. If any judge
returns `-1`, indicating a failure in its process, this judge will also return `-1`.

Implements the Mixture of Judges as described in the [CGPO paper](https://huggingface.co/papers/2409.20370).

**Parameters:**

judges (`list` of [experimental.judges.BaseBinaryJudge](/docs/trl/v1.0.0rc1/en/judges#trl.experimental.judges.BaseBinaryJudge)) : A list of [experimental.judges.BaseBinaryJudge](/docs/trl/v1.0.0rc1/en/judges#trl.experimental.judges.BaseBinaryJudge) instances whose decisions will be unified.
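
As a sketch, two binary judges can be combined like this; the judge classes and constraints below are hypothetical illustrations, not part of TRL:

```python
from trl.experimental.judges import AllTrueJudge, BaseBinaryJudge

# Hypothetical constraint judges for illustration.
class IsShortJudge(BaseBinaryJudge):
    def judge(self, prompts, completions, gold_completions=None, shuffle_order=True):
        return [1 if len(completion) < 100 else 0 for completion in completions]

class EndsWithPeriodJudge(BaseBinaryJudge):
    def judge(self, prompts, completions, gold_completions=None, shuffle_order=True):
        return [1 if completion.endswith(".") else 0 for completion in completions]

judge = AllTrueJudge(judges=[IsShortJudge(), EndsWithPeriodJudge()])
judge.judge(
    prompts=["What is the capital of France?"],
    completions=["The capital of France is Paris."],
)  # [1] because both constraints are satisfied
```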

## Base classes

### BaseJudge[[trl.experimental.judges.BaseJudge]]

#### trl.experimental.judges.BaseJudge[[trl.experimental.judges.BaseJudge]]

[Source](https://github.com/huggingface/trl/blob/v1.0.0rc1/trl/experimental/judges/judges.py#L77)

Base class for judges. The subclasses of this class should implement the `judge` method.
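
For a judge that fits neither the rank nor the pairwise interface, only a `judge` method is required. The following is a hypothetical sketch, assuming the signature mirrors the other judges:

```python
import random

from trl.experimental.judges import BaseJudge

class RandomJudge(BaseJudge):
    """Hypothetical baseline judge that picks a completion uniformly at random."""

    def judge(self, prompts, completions, shuffle_order=True):
        # One index per prompt, drawn from that prompt's list of candidates.
        return [random.randrange(len(candidates)) for candidates in completions]
```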

### BaseBinaryJudge[[trl.experimental.judges.BaseBinaryJudge]]

#### trl.experimental.judges.BaseBinaryJudge[[trl.experimental.judges.BaseBinaryJudge]]

[Source](https://github.com/huggingface/trl/blob/v1.0.0rc1/trl/experimental/judges/judges.py#L160)

Base class for binary judges.

##### judge[[trl.experimental.judges.BaseBinaryJudge.judge]]

[Source](https://github.com/huggingface/trl/blob/v1.0.0rc1/trl/experimental/judges/judges.py#L165)

`judge(prompts: list, completions: list, gold_completions: list[str] | None = None, shuffle_order: bool = True)`

Judge the completion for a given prompt. Used to assess if a completion satisfies a constraint.

This base class should be used to implement binary evaluations as done in section 4.1.4 of the [CGPO
paper](https://huggingface.co/papers/2409.20370). It is relevant for assessing whether a prompt-completion pair
satisfies a specific constraint.

> [!NOTE]
> If the judge returns `-1` for any prompt, it indicates that the inner process used to compute the preference
> has failed. For instance, this could occur if the underlying language model or rule-based constraint
> returned an invalid answer. In such cases, the caller should handle these invalid indices appropriately,
> possibly by implementing fallback logic or error handling.

**Parameters:**

prompts (`list[str]`) : List of prompts.

completions (`list[str]`) : List of completions.

gold_completions (`list[str]`, *optional*) : List of gold completions, if available.

shuffle_order (`bool`) : Whether to shuffle the order of the completions to avoid positional bias.

**Returns:**

`list[int]`

A list of binary labels:
- 1 indicates that the completion satisfies the evaluated constraint.
- 0 indicates that the completion does not satisfy the evaluated constraint.
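
A concrete binary judge might look like the following sketch (the class name and constraint are hypothetical), including the `-1` convention for failures:

```python
from trl.experimental.judges import BaseBinaryJudge

class MatchesGoldJudge(BaseBinaryJudge):
    """Hypothetical judge: 1 if the gold completion appears verbatim in the completion."""

    def judge(self, prompts, completions, gold_completions=None, shuffle_order=True):
        if gold_completions is None:
            # No reference to check against: signal failure for every prompt.
            return [-1] * len(completions)
        return [
            1 if gold in completion else 0
            for completion, gold in zip(completions, gold_completions)
        ]
```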

### BaseRankJudge[[trl.experimental.judges.BaseRankJudge]]

#### trl.experimental.judges.BaseRankJudge[[trl.experimental.judges.BaseRankJudge]]

[Source](https://github.com/huggingface/trl/blob/v1.0.0rc1/trl/experimental/judges/judges.py#L87)

Base class for LLM ranking judges.

**Example:**
```python
from trl.experimental.judges import BaseRankJudge

class MyRankJudge(BaseRankJudge):
    def judge(self, prompts, completions, shuffle_order=True):
        return ...  # Your ranking logic here

judge = MyRankJudge()
judge.judge(
    prompts=["The capital of France is", "The capital of Germany is"],
    completions=[[" Paris", " Marseille", " Lyon"], [" Munich", " Berlin"]],
)  # [[0, 1, 2], [1, 0]]
```

##### judge[[trl.experimental.judges.BaseRankJudge.judge]]

[Source](https://github.com/huggingface/trl/blob/v1.0.0rc1/trl/experimental/judges/judges.py#L106)

`judge(prompts: list, completions: list, shuffle_order: bool = True)`

Judge the completion for the given prompts and return the ranks of each completion.

**Parameters:**

prompts (`list[str]`) : List of prompts.

completions (`list[list[str]]`) : List of completions list, where each element is a list of completions for the corresponding prompt.

shuffle_order (`bool`, *optional*, defaults to `True`) : Whether to shuffle the order of the completions to avoid positional bias.

**Returns:**

`list[list[int]]`

List of lists of indices, where each list gives the ranking of the completions for the corresponding
prompt. E.g., `[1, 2, 0]` means that the second completion (`idx=1`) is the best, followed by the
third, and then the first.

### BasePairwiseJudge[[trl.experimental.judges.BasePairwiseJudge]]

#### trl.experimental.judges.BasePairwiseJudge[[trl.experimental.judges.BasePairwiseJudge]]

[Source](https://github.com/huggingface/trl/blob/v1.0.0rc1/trl/experimental/judges/judges.py#L128)

Base class for pairwise judges.

##### judge[[trl.experimental.judges.BasePairwiseJudge.judge]]

[Source](https://github.com/huggingface/trl/blob/v1.0.0rc1/trl/experimental/judges/judges.py#L133)

`judge(prompts: list, completions: list, shuffle_order: bool = True)`

Judge the completion pairs for the given prompts.

> [!NOTE]
> If the judge returns `-1` for any prompt, it indicates that the inner process used to compute the
> preference has failed. For instance, this could occur if the underlying language model returned an invalid
> answer. In such cases, the caller should handle these invalid indices appropriately, possibly by
> implementing fallback logic or error handling.

**Parameters:**

prompts (`list[str]`) : List of prompts.

completions (`list[list[str]]`) : List of completions pairs, where each element is a pair of completions for the corresponding prompt.

shuffle_order (`bool`, *optional*, defaults to `True`) : Whether to shuffle the order of the completions to avoid positional bias.

**Returns:**

`list[int]`

List of indices, where each index identifies the preferred completion for the corresponding prompt. E.g., `1`
means that the second completion (`idx=1`) is preferred.
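
In practice, a caller can filter failed comparisons before using the results. The following is a sketch; `judge`, `prompts`, and `completions` are assumed to be defined as in the earlier examples:

```python
ranks = judge.judge(prompts, completions)
# Keep only the examples the judge could rank; -1 marks a failed comparison.
valid = [
    (prompt, pair, rank)
    for prompt, pair, rank in zip(prompts, completions, ranks)
    if rank != -1
]
```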

