Pairwise String Comparison
Often you will want to compare predictions of an LLM, Chain, or Agent for a given input. The StringComparison
evaluators facilitate this so you can answer questions like:
- Which LLM or prompt produces a preferred output for a given question?
- Which examples should I include for few-shot example selection?
- Which output is better to include for fine-tuning?
The simplest and often most reliable automated way to choose a preferred prediction for a given input is to use the labeled_pairwise_string
evaluator.
With References
The ChatAnthropic example later on this page requires the @langchain/anthropic package:
- npm: npm install @langchain/anthropic
- Yarn: yarn add @langchain/anthropic
- pnpm: pnpm add @langchain/anthropic
import { loadEvaluator } from "langchain/evaluation";
const chain = await loadEvaluator("labeled_pairwise_string", {
criteria: "correctness",
});
const res = await chain.evaluateStringPairs({
prediction: "there are three dogs",
predictionB: "4",
input: "how many dogs are in the park?",
reference: "four",
});
console.log(res);
/*
{
reasoning: 'Both responses attempt to answer the question about the number of dogs in the park. However, Response A states that there are three dogs, which is incorrect according to the reference answer. Response B, on the other hand, correctly states that there are four dogs, which matches the reference answer. Therefore, Response B is more accurate.Final Decision: [[B]]',
value: 'B',
score: 0
}
*/
API Reference:
- loadEvaluator from langchain/evaluation
Methods
The pairwise string evaluator can be called with the evaluateStringPairs method, which accepts:
- prediction (string) – The predicted response of the first model, chain, or prompt.
- predictionB (string) – The predicted response of the second model, chain, or prompt.
- input (string) – The input question, prompt, or other text.
- reference (string) – (Only for the labeled_pairwise_string variant) The reference response.
They return a dictionary with the following values:
- value: 'A' or 'B', indicating whether prediction or predictionB is preferred, respectively
- score: Integer 0 or 1 mapped from the 'value', where a score of 1 means that the first prediction is preferred, and a score of 0 means predictionB is preferred
- reasoning: String "chain of thought" reasoning from the LLM, generated prior to creating the score
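For example, a minimal end-to-end call might look like the following sketch (the two outputs are hardcoded here for illustration; in practice they would come from two different models or prompts):
import { loadEvaluator } from "langchain/evaluation";
// Hypothetical outputs from two competing prompts or models.
const outputA = "The capital of France is Paris.";
const outputB = "Paris";
const evaluator = await loadEvaluator("labeled_pairwise_string");
const result = await evaluator.evaluateStringPairs({
  prediction: outputA,
  predictionB: outputB,
  input: "What is the capital of France?",
  reference: "Paris",
});
// score is 1 if outputA (prediction) is preferred, 0 if outputB is.
console.log(result.value, result.score);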
Without References
When references aren't available, you can still predict the preferred response. The results will reflect the evaluation model's preference, which is less reliable and may result in preferences that are factually incorrect.
import { loadEvaluator } from "langchain/evaluation";
const chain = await loadEvaluator("pairwise_string", {
criteria: "conciseness",
});
const res = await chain.evaluateStringPairs({
prediction: "Addition is a mathematical operation.",
predictionB:
"Addition is a mathematical operation that adds two numbers to create a third number, the 'sum'.",
input: "What is addition?",
});
console.log({ res });
/*
{
res: {
reasoning: 'Response A is concise, but it lacks detail. Response B, while slightly longer, provides a more complete and informative answer by explaining what addition does. It is still concise and to the point.Final decision: [[B]]',
value: 'B',
score: 0
}
}
*/
API Reference:
- loadEvaluator from langchain/evaluation
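Because unlabeled comparisons reflect only the evaluation model's preference, it is common to aggregate results over many inputs rather than trust a single comparison. Below is a minimal sketch using a hypothetical two-example dataset:
import { loadEvaluator } from "langchain/evaluation";
// Hypothetical inputs paired with outputs from two competing prompts.
const examples = [
  {
    input: "What is addition?",
    a: "Addition is a mathematical operation.",
    b: "Addition combines two numbers into their sum.",
  },
  {
    input: "What is subtraction?",
    a: "Subtraction is a mathematical operation.",
    b: "Subtraction finds the difference between two numbers.",
  },
];
const evaluator = await loadEvaluator("pairwise_string", {
  criteria: "conciseness",
});
let aWins = 0;
for (const example of examples) {
  const { score } = await evaluator.evaluateStringPairs({
    prediction: example.a,
    predictionB: example.b,
    input: example.input,
  });
  aWins += score; // score is 1 when prediction (A) is preferred
}
console.log(`A preferred in ${aWins} of ${examples.length} comparisons`);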
Defining the Criteria
By default, the LLM is instructed to select the 'preferred' response based on helpfulness, relevance, correctness, and depth of thought. You can customize the criteria by passing in a criteria argument, which can take any of the following forms:
- Criteria: use one of the default criteria and its description
- Constitutional principle: use any of the constitutional principles defined in LangChain (see the sketch just below)
- Dictionary: a list of custom criteria, where the key is the name of the criterion and the value is its description
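For instance, to compare responses against one of LangChain's built-in constitutional principles, a sketch might look like this (at the time of writing, PRINCIPLES is exported from langchain/chains):
import { loadEvaluator } from "langchain/evaluation";
import { PRINCIPLES } from "langchain/chains";
// Use a single constitutional principle as the comparison criterion.
const chain = await loadEvaluator("pairwise_string", {
  criteria: PRINCIPLES.harmful1,
});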
Below is an example of the dictionary form, determining preferred writing responses based on a custom style.
import { loadEvaluator } from "langchain/evaluation";
const customCriterion = {
simplicity: "Is the language straightforward and unpretentious?",
clarity: "Are the sentences clear and easy to understand?",
precision: "Is the writing precise, with no unnecessary words or details?",
truthfulness: "Does the writing feel honest and sincere?",
subtext: "Does the writing suggest deeper meanings or themes?",
};
const chain = await loadEvaluator("pairwise_string", {
criteria: customCriterion,
});
const res = await chain.evaluateStringPairs({
prediction:
"Every cheerful household shares a similar rhythm of joy; but sorrow, in each household, plays a unique, haunting melody.",
predictionB:
"Where one finds a symphony of joy, every domicile of happiness resounds in harmonious, identical notes; yet, every abode of despair conducts a dissonant orchestra, each playing an elegy of grief that is peculiar and profound to its own existence.",
input: "Write some prose about families.",
});
console.log(res);
/*
{
reasoning: "Response A is simple, clear, and precise. It uses straightforward language to convey a deep and universal truth about families. The metaphor of joy and sorrow as music is effective and easy to understand. Response B, on the other hand, is more complex and less clear. It uses more sophisticated language and a more elaborate metaphor, which may make it harder for some readers to understand. It also includes unnecessary words and details that don't add to the overall meaning of the prose.Both responses are truthful and sincere, and both suggest deeper meanings about the nature of family life. However, Response A does a better job of conveying these meanings in a simple, clear, and precise way.Therefore, the better response is [[A]].",
value: 'A',
score: 1
}
*/
API Reference:
- loadEvaluator from langchain/evaluation
Customize the LLM
By default, the loader uses gpt-4
in the evaluation chain. You can customize this when loading.
import { loadEvaluator } from "langchain/evaluation";
import { ChatAnthropic } from "@langchain/anthropic";
const model = new ChatAnthropic({ temperature: 0 });
const chain = await loadEvaluator("labeled_pairwise_string", { llm: model });
const res = await chain.evaluateStringPairs({
prediction: "there are three dogs",
predictionB: "4",
input: "how many dogs are in the park?",
reference: "four",
});
console.log(res);
/*
{
reasoning: 'Here is my assessment:Response B is more correct and accurate compared to Response A. Response B simply states "4", which matches the ground truth reference answer of "four". Meanwhile, Response A states "there are three dogs", which is incorrect according to the reference. In terms of following instructions and directly answering the question "how many dogs are in the park?", Response B gives the precise numerical answer, while Response A provides an incomplete sentence. Overall, Response B is more accurate and better followed the instructions to directly answer the question.[[B]]',
value: 'B',
score: 0
}
*/
API Reference:
- loadEvaluator from langchain/evaluation
- ChatAnthropic from @langchain/anthropic
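Any chat model integration should work the same way. For example, a sketch using ChatOpenAI (assuming the @langchain/openai package is installed):
import { loadEvaluator } from "langchain/evaluation";
import { ChatOpenAI } from "@langchain/openai";
// Swap a different chat model into the evaluation chain.
const model = new ChatOpenAI({ modelName: "gpt-3.5-turbo", temperature: 0 });
const chain = await loadEvaluator("labeled_pairwise_string", { llm: model });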
Customize the Evaluation Prompt
You can use your own custom evaluation prompt to add more task-specific instructions or to instruct the evaluator to score the output.
Note: If you use a prompt that generates a result in a unique format, you may also need to pass in a custom output parser (outputParser: yourParser()) instead of the default PairwiseStringResultOutputParser; a sketch of such a parser follows the example below.
import { loadEvaluator } from "langchain/evaluation";
import { PromptTemplate } from "@langchain/core/prompts";
const promptTemplate = PromptTemplate.fromTemplate(
`Given the input context, which do you prefer: A or B?
Evaluate based on the following criteria:
{criteria}
Reason step by step and finally, respond with either [[A]] or [[B]] on its own line.
DATA
----
input: {input}
reference: {reference}
A: {prediction}
B: {predictionB}
---
Reasoning:
`
);
const chain = await loadEvaluator("labeled_pairwise_string", {
chainOptions: {
prompt: promptTemplate,
},
});
const res = await chain.evaluateStringPairs({
prediction: "The dog that ate the ice cream was named fido.",
predictionB: "The dog's name is spot",
input: "What is the name of the dog that ate the ice cream?",
reference: "The dog's name is fido",
});
console.log(res);
/*
{
reasoning: 'Helpfulness: Both A and B are helpful as they provide a direct answer to the question.Relevance: Both A and B refer to the question, but only A matches the reference text.Correctness: Only A is correct as it matches the reference text.Depth: Both A and B are straightforward and do not demonstrate depth of thought.Based on these criteria, the preferred response is A. ',
value: 'A',
score: 1
}
*/
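If your prompt asks for output in a different format than the default, the parser you pass in might look something like the following sketch. The class name and regex are illustrative, not part of the library, and passing the parser through chainOptions is an assumption based on the chain accepting LLM chain fields:
import { loadEvaluator } from "langchain/evaluation";
import { BaseOutputParser } from "@langchain/core/output_parsers";
// Hypothetical parser for prompts that end with a [[A]] or [[B]] verdict.
class BracketVerdictParser extends BaseOutputParser<{
  reasoning: string;
  value: string;
  score: number;
}> {
  lc_namespace = ["custom", "evaluation"];
  getFormatInstructions(): string {
    return "Reason step by step, then respond with [[A]] or [[B]] on its own line.";
  }
  async parse(text: string) {
    const match = text.match(/\[\[(A|B)\]\]/);
    if (!match) {
      throw new Error(`No [[A]] or [[B]] verdict found in: ${text}`);
    }
    return {
      reasoning: text.trim(),
      value: match[1],
      score: match[1] === "A" ? 1 : 0,
    };
  }
}
// Pass the parser alongside the custom prompt when loading the evaluator.
const chain = await loadEvaluator("labeled_pairwise_string", {
  chainOptions: {
    prompt: promptTemplate, // the PromptTemplate defined in the example above
    outputParser: new BracketVerdictParser(),
  },
});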
API Reference:
- loadEvaluator from langchain/evaluation
- PromptTemplate from @langchain/core/prompts