April 30, 2025
LLM for Search Evaluation
In 2025, Large Language Models (LLMs) are not only powerful text generation tools but also critical instruments for search evaluation. Their ability to accurately reproduce user preferences and assess information utility is radically transforming how search engines and recommendation systems are tested and improved.

Several foundational studies published between 2023 and 2025 reveal both the capabilities and limitations of current LLMs in this domain. In this article, we review three of the main research works:

• Thomas et al. (Microsoft Research) — on LLMs’ ability to predict user preferences (arXiv:2309.10621)

• Dewan et al. (University of Waterloo) — on task-aware usefulness evaluation (arXiv:2504.14401)

• Zhao et al. (Tsinghua University) — on personalization and user preference following (arXiv:2502.09597)

Using LLMs for Automatic Relevance Judgments

Source: Thomas, P., Spielman, S., Craswell, N., Mitra, B. (Microsoft Research)

The study by Thomas et al. aimed to determine whether LLMs could replace human annotators in assessing the relevance of search results. The authors prompted models with carefully designed instructions and compared their outputs against gold labels — relevance judgments derived from real user preferences.

Methodology

• Data: User preference datasets based on pairwise document comparisons.

• Models: GPT-4 and custom Microsoft LLMs.

• Evaluation: Accuracy in matching gold labels.
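
To make this setup concrete, here is a minimal sketch of such a pipeline: an LLM is asked to pick the preferred document in each pair, and the resulting labels are scored for agreement with the gold preferences. The prompt wording and the `call_llm` helper are illustrative assumptions for this sketch, not the prompts or pipeline used by Thomas et al.

```python
# Minimal sketch of pairwise LLM relevance judging (not the exact prompt or
# pipeline from Thomas et al.). `call_llm` is an assumed placeholder for any
# chat-style LLM client that takes a prompt string and returns text.
from typing import Callable, List, Tuple

PAIRWISE_PROMPT = """You are a search quality rater.
Query: {query}

Document A:
{doc_a}

Document B:
{doc_b}

Which document better satisfies the query? Answer with exactly "A" or "B"."""


def judge_pair(call_llm: Callable[[str], str], query: str, doc_a: str, doc_b: str) -> str:
    """Ask the LLM which document it prefers; returns 'A' or 'B'."""
    answer = call_llm(PAIRWISE_PROMPT.format(query=query, doc_a=doc_a, doc_b=doc_b))
    return "A" if answer.strip().upper().startswith("A") else "B"


def agreement_with_gold(call_llm: Callable[[str], str],
                        pairs: List[Tuple[str, str, str, str]]) -> float:
    """pairs holds (query, doc_a, doc_b, gold) tuples, with gold in {'A', 'B'}."""
    hits = sum(judge_pair(call_llm, q, a, b) == gold for q, a, b, gold in pairs)
    return hits / len(pairs)
```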

Key Results

• LLM annotations achieved 82–85% agreement with gold labels, outperforming traditional crowd-sourced annotators (typically around 75–78%).

• LLMs showed lower variance in judgments, indicating more consistent evaluation behavior.

• Producing 1,000 LLM-based judgments cost approximately 10 times less than obtaining the same judgments from human annotators.

Conclusion

On some metrics, LLMs are not merely approaching but surpassing human-level quality in relevance assessment, particularly in consistency and cost-efficiency.

Task-Aware Usefulness Evaluation (TRUE Benchmark)

Source: Dewan, M. et al. (University of Waterloo)

Dewan et al. extended research beyond relevance by focusing on the usefulness of documents in the context of user tasks (task-aware evaluation).

TRUE Methodology

• Search sessions are defined by user tasks (e.g., “compare smartphone prices” vs. “find technical specifications”).

• LLMs are provided additional session context before evaluating documents.

• A new metric, TRUE score (Task-aware Rubric-based Usefulness Evaluation), is introduced.

Key Results

• Incorporating task context improved alignment with real user satisfaction by 12–18%.

• LLMs trained with TRUE prompting provided more accurate usefulness labels than those relying solely on topic relevance.

• The researchers released a dataset of 50,000 query-task-document triplets to support further research.

Example TRUE Prompt

User task: “Select a laptop for video editing.”

Document: A laptop description emphasizing strong GPU performance.

LLM task: Judge the document’s usefulness not merely by topical keyword overlap but by how well it supports the video-editing task.
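
A rough sketch of how such task context could be injected into an automatic usefulness judgment is shown below. The 0–3 scale, rubric wording, and prompt are illustrative assumptions rather than the actual TRUE rubric, and `call_llm` again stands in for any LLM client.

```python
# Illustrative task-aware usefulness prompt (the 0-3 rubric and wording are
# assumptions for this sketch, not the TRUE benchmark's actual rubric).
from typing import Callable

USEFULNESS_PROMPT = """You are evaluating search results for a user session.
User task: {task}
Query: {query}

Document:
{document}

Rate how useful this document is for completing the user's task (not just
whether it mentions the query terms), on a scale of 0-3:
0 = not useful, 1 = marginally useful, 2 = useful, 3 = highly useful.
Answer with a single digit."""


def usefulness_score(call_llm: Callable[[str], str],
                     task: str, query: str, document: str) -> int:
    """Return a 0-3 task-aware usefulness score, or -1 if the reply is unparsable."""
    answer = call_llm(USEFULNESS_PROMPT.format(task=task, query=query, document=document))
    digits = [c for c in answer if c.isdigit()]
    return int(digits[0]) if digits and int(digits[0]) <= 3 else -1


# Example, mirroring the laptop case above:
# usefulness_score(call_llm,
#                  task="Select a laptop for video editing.",
#                  query="laptop with strong GPU",
#                  document="This laptop ships with a discrete GPU suited to rendering ...")
```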

Personalization and LLMs’ Ability to Follow User Preferences

Source: Zhao, S. et al. (Tsinghua University)

Zhao et al. evaluated LLMs’ ability to adapt to individual user preferences using PrefEval — a specialized benchmark for personalized evaluation in dialogue contexts.

Methodology

• PrefEval requires models to infer and apply user preferences throughout a multi-turn conversation.

• Models are assessed based on accuracy in satisfying personalized user needs over long contexts.
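
As a simplified illustration of what such a benchmark has to measure, the sketch below checks whether a model's later reply still honors a preference stated earlier in the conversation, using an LLM-as-judge prompt. The judge prompt and session schema are assumptions for this sketch, not PrefEval's actual protocol.

```python
# Simplified preference-following check in the spirit of PrefEval (the judge
# prompt and session schema are assumptions, not the benchmark's own).
from typing import Callable, Dict, List

JUDGE_PROMPT = """Earlier in the conversation the user stated this preference:
"{preference}"

The assistant's latest reply was:
"{reply}"

Does the reply respect the stated preference? Answer "yes" or "no"."""


def follows_preference(call_llm: Callable[[str], str], preference: str, reply: str) -> bool:
    """LLM-as-judge check: does a single reply honor the stated preference?"""
    verdict = call_llm(JUDGE_PROMPT.format(preference=preference, reply=reply))
    return verdict.strip().lower().startswith("yes")


def preference_accuracy(call_llm: Callable[[str], str],
                        sessions: List[Dict[str, str]]) -> float:
    """sessions: dicts with 'preference' and 'final_reply' keys (assumed schema)."""
    hits = sum(follows_preference(call_llm, s["preference"], s["final_reply"])
               for s in sessions)
    return hits / len(sessions)
```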

Key Results

• LLMs exhibited a 15–20% accuracy drop in personalized settings compared to non-personalized relevance tasks.

• Errors often occurred when user preferences conflicted or evolved during the session.

• Dynamic online fine-tuning helped mitigate errors but came with significant computational costs.

Conclusion

While LLMs excel at general relevance prediction, personalization remains a challenge, particularly in maintaining preference continuity in long interactions.

Comparative Table of Results

Study | Focus Area | Key Findings
Thomas et al. (Microsoft) | Relevance | 82–85% agreement with gold labels; 10× lower annotation costs
Dewan et al. (Waterloo) | Usefulness (task-aware) | +12–18% better alignment with user goals using TRUE prompts
Zhao et al. (Tsinghua) | Personalization | 15–20% performance drop in personalized evaluations

Final Remarks

Large Language Models are becoming pivotal in automating search quality evaluation. They outperform traditional approaches in relevance and task-oriented usefulness assessments, though they still face challenges in personalization. At Nibelung, we actively adopt this LLM-driven evaluation paradigm to benchmark and refine search experiences across diverse contexts, ensuring both scalability and precision.

Future research directions include:

• Advancing task-aware prompting strategies.

• Developing memory mechanisms for long-term user preference tracking.

• Designing hybrid systems that combine LLM reasoning with traditional recommender logic.

LLMs are moving beyond text generation into the core of search and recommendation system quality assurance, offering both opportunities and challenges for the next generation of AI-driven applications.

References:

• Thomas, P., Spielman, S., Craswell, N., Mitra, B. Large Language Models Can Accurately Predict Searcher Preferences. Microsoft Research. arXiv:2309.10621

• Dewan, M., et al. LLM-Driven Usefulness Judgment for Web Search Evaluation. University of Waterloo. arXiv:2504.14401

• Zhao, S., et al. Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs. Tsinghua University. arXiv:2502.09597