Multi-expert Prompting enhances the reliability, safety, and utility of Large Language Models (LLMs). The approach was developed by researchers at academic institutions in Singapore (Do Xuan Long, Duong Ngoc Yen, Luu Anh Tuan, Kenji Kawaguchi, Min-Yen Kan, and Nancy F. Chen) and represents a significant advancement over the ExpertPrompting technique: it simulates multiple experts, aggregates their answers, and selects the best among the individual and aggregated responses, producing more articulated and neutral answers.
Introduction to Multi-expert Prompting
Multi-expert Prompting represents a significant evolution compared to previous techniques such as ExpertPrompting, proposing a new approach that enhances the capabilities of Large Language Models (LLMs) in generating reliable, informative, and safe responses. This technique is designed to address the issue of one-dimensional responses, typical of methods involving a single expert, and to foster greater diversification of perspectives, generating more nuanced and balanced responses.
The concept of Multi-expert Prompting arises from the need to overcome the limitations of traditional techniques such as ExpertPrompting, which involves simulating a single expert and constructing responses based on only one perspective. For example, in the case of open and complex questions like "Is it ethical to eat meat?", ExpertPrompting might generate a response reflecting only one view, such as that of an ethicist who considers the act immoral, while ignoring other perspectives like nutritional, environmental, or cultural aspects. This introduces an evident bias and limits the depth of the response.
To solve this problem, Multi-expert Prompting proposes the creation of multiple expert identities, each with a concise description of their role and expertise. This allows for a more detailed response, capable of representing different opinions and thus more aligned with the complexity of the questions posed. Each response generated by the various experts contributes to building a more complete and accurate answer. The LLM uses sub-tasks to consolidate the information provided by the experts, identify points of consensus, manage conflicts, and integrate unique perspectives, ensuring a final response that is as truthful, informative, non-toxic, and useful as possible.
This method stands out from previous techniques due to its ability to synthesize high-quality responses even for open and multidimensional questions. Multi-expert Prompting uses a chain-of-thought-style prompt to guide the LLM through the process of aggregating and selecting responses. Unlike related approaches such as Multi-agent Debate or Universal Self-consistency, which rely on multiple rounds of refinement or on sampling many candidate answers, Multi-expert Prompting relies on a single aggregation pass, without further iterations, making it more efficient and practical in many situations.
This technique not only improves the quality of responses in terms of truthfulness and completeness but also significantly reduces the presence of toxic or harmful content, thanks to the diversification of information sources and the rigorous methodology of response aggregation. Experiments conducted have shown that Multi-expert Prompting outperforms traditional methods in terms of truthfulness (+8.69% over the best baselines) and informativeness, demonstrating its ability to effectively integrate multiple perspectives.
Another fundamental aspect of Multi-expert Prompting is its adaptability to different scenarios. Thanks to its multi-expert structure and the fact that it requires no manually crafted prompts or examples, the technique is highly adaptable and applicable in a variety of contexts where answers should reflect a multiplicity of viewpoints. Moreover, the approach is explainable, meaning that the process of generating and selecting responses can be traced and understood by users, thereby increasing trust in the system.
Technical Details and Architecture
The architecture of Multi-expert Prompting consists of two main phases that are structured through a series of sub-tasks and steps aimed at improving response quality and reducing potential biases.
Phase 1: Generation of Experts and Responses
In this first phase, the LLM is instructed to generate a set of virtual experts, each with a concise description of their role and expertise. Each expert is characterized by a distinct identity generated in a zero-shot manner, eliminating the need for manually constructed examples (few-shot). The use of concise descriptions simplifies the process and makes the method much more versatile compared to the ExpertPrompting approach.
In mathematical terms, the LLM can be viewed as a generation function over a predefined vocabulary. Each expert is represented as a pair consisting of a role and a concise description of that role's responsibilities, and expert generation is driven by an instruction that specifies the creation criteria and the number of experts to produce.
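A compact formalization can be written as follows; the notation is illustrative and does not reproduce the paper's exact symbols.

```latex
% Illustrative notation (an assumption, not the paper's exact symbols).
% The LLM is a generation function over a vocabulary V:  f_theta : V* -> V*.
% Given a question q, an expert-generation instruction I_gen and a count n:
\[
  \{e_1, \dots, e_n\} = f_\theta(\mathcal{I}_{\mathrm{gen}}, q, n),
  \qquad e_i = (r_i, d_i),
\]
% where r_i is the i-th expert's role and d_i its concise description.
% Each expert's answer is then generated zero-shot:
\[
  a_i = f_\theta(\mathcal{I}_{\mathrm{ans}}, q, e_i), \qquad i = 1, \dots, n.
\]
```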
To ensure the quality and variety of produced responses, three fundamental constraints are applied:
1. Experts must present diversified profiles.
2. Each expert, while having general competencies, must maintain a well-defined responsibility.
3. The expert's description must be concise and clear, avoiding excessive details.
At this point, the model is prompted to produce a detailed response for each expert, following a fixed answer-generation instruction. This step is also performed zero-shot, relying on the model's ability to answer as if it were a specialist in the relevant field.
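As a concrete illustration, the following minimal Python sketch implements Phase 1 against a generic chat-completion backend. The prompt wording, the `complete` helper, the `Expert` structure, and the default of three experts are assumptions made for this example, not the authors' exact prompts or code.

```python
# Minimal sketch of Phase 1 (expert generation + per-expert answers).
# The prompts and the `complete` helper are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Expert:
    role: str
    description: str


def complete(prompt: str) -> str:
    """Placeholder for a call to any chat-completion API (OpenAI, Mistral, etc.)."""
    raise NotImplementedError


def generate_experts(question: str, n: int = 3) -> list[Expert]:
    """Zero-shot generation of n expert identities with concise descriptions."""
    prompt = (
        f"You will answer the question below by first proposing {n} experts.\n"
        f"Question: {question}\n"
        "Constraints:\n"
        "1. The experts must have diverse, non-overlapping profiles.\n"
        "2. Each expert has one well-defined responsibility.\n"
        "3. Each description is a single concise sentence.\n"
        "Return one expert per line as: <role>: <description>"
    )
    experts = []
    for line in complete(prompt).splitlines():
        if ":" in line:
            role, description = line.split(":", 1)
            experts.append(Expert(role.strip(), description.strip()))
    return experts[:n]


def answer_as_expert(question: str, expert: Expert) -> str:
    """Zero-shot answer written from a single expert's perspective."""
    prompt = (
        f"You are {expert.role}. {expert.description}\n"
        f"Answer the following question from this perspective:\n{question}"
    )
    return complete(prompt)
```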
Phase 2: Aggregation of Expert Responses
The second phase focuses on aggregating the responses generated by the experts. The aggregation of responses is one of the most critical and complex parts of the process. To address this challenge, Multi-expert Prompting employs seven sub-tasks, inspired by the Nominal Group Technique (NGT), to identify similarities between responses, consolidate information, and resolve conflicts.
Aggregation Sub-tasks:
1. Generation of Agreeing Viewpoints (S1): This phase aims to establish consensus among the experts' responses. Opinions on which at least half of the experts agree are identified as reliable viewpoints, forming the basis for subsequent steps.
2. Generation of Conflicting Viewpoints (S2): Given the diversity of experts, conflicting viewpoints are inevitable. Identifying these conflicts is essential for their subsequent resolution.
3. Conflict Resolution (S3): Conflict resolution is crucial to correct potential biases and ensure response consistency. The model uses the agreeing information (S1) to carefully judge the conflicting opinions and provide a balanced response.
4. Generation of Isolated Viewpoints (S4): Unique viewpoints that were not captured in S1 or S3 are extracted, ensuring that all useful perspectives are considered and improving the richness of the response.
5. Collection of Viewpoints (S5): The model collects all viewpoints generated in phases S1, S3, and S4, ensuring transparency and explainability in the aggregated responses.
6. Generation of the Aggregated Response (S6): The final response is created by integrating all the collected viewpoints. This step aims to provide a coherent and informative response that includes the various perspectives of the experts.
7. Selection of the Best Response (S7): The aggregated response may not be optimal, especially if the individual expert responses are not of high quality. Therefore, the model selects the best among the aggregated response and individual responses, focusing on accuracy and utility.
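To make these sub-tasks more tangible, the sketch below chains S1 through S6 in a single aggregation prompt and implements S7 as a separate selection step. It reuses the hypothetical `complete` helper and `Expert` class from the Phase 1 sketch, and the sub-task wording paraphrases the method rather than reproducing the authors' exact prompt.

```python
# Minimal sketch of Phase 2 (aggregation via the seven NGT-inspired sub-tasks).
# Reuses the hypothetical `complete` helper and `Expert` class from the Phase 1
# sketch; the sub-task wording is a paraphrase, not the paper's exact prompt.


def aggregate(question: str, experts: list, answers: list[str]) -> str:
    """Run sub-tasks S1-S6 in one chain-of-thought style prompt."""
    numbered = "\n\n".join(
        f"Expert {i + 1} ({e.role}): {a}"
        for i, (e, a) in enumerate(zip(experts, answers))
    )
    prompt = (
        f"Question: {question}\n\nExpert answers:\n{numbered}\n\n"
        "Follow these steps:\n"
        "S1. List viewpoints shared by at least half of the experts.\n"
        "S2. List viewpoints on which the experts conflict.\n"
        "S3. Resolve each conflict, using the agreed viewpoints from S1 as evidence.\n"
        "S4. List unique viewpoints not covered by S1 or S3.\n"
        "S5. Compile the viewpoints from S1, S3 and S4.\n"
        "S6. Write one coherent, non-toxic answer integrating the compiled viewpoints.\n"
        "Return only the final answer from S6."
    )
    return complete(prompt)


def select_best(question: str, candidates: list[str]) -> str:
    """S7: pick the best answer among the aggregate and the individual ones."""
    listing = "\n\n".join(f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates))
    prompt = (
        f"Question: {question}\n\n{listing}\n\n"
        "Choose the candidate that is most truthful, informative and useful. "
        "Return only the number of the best candidate."
    )
    digits = "".join(ch for ch in complete(prompt) if ch.isdigit())
    index = int(digits or "1") - 1
    return candidates[max(0, min(index, len(candidates) - 1))]
```

A full pipeline would then call generate_experts, answer_as_expert for each expert, aggregate, and finally select_best over the aggregated answer plus the individual ones.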
Evaluation of Results
The experiments have clearly demonstrated that Multi-expert Prompting outperforms previous methodologies in various text generation scenarios, showing significant improvements especially regarding truthfulness, factual accuracy, toxicity reduction, and the utility of responses. Using the TruthfulQA dataset as a reference, Multi-expert Prompting achieved a truthfulness rate of 87.15% with the Mistral-7B-Inst model and 89.35% with ChatGPT. These results are significantly superior compared to those obtained with the Zero-shot and ExpertPrompting methods. In particular, Zero-shot achieved a truthfulness rate of 76.00% with Mistral and 68.05% with ChatGPT. ExpertPrompting, on the other hand, achieved 80.34% with Mistral and 80.66% with ChatGPT. The differences between the methods were confirmed to be statistically significant, with a p-value below 0.01, indicating that the improvements are not random.
Multi-expert Prompting also significantly reduced the toxicity of responses. On the BOLD benchmark, the method brought measured toxicity down to zero, while other approaches such as ExpertPrompting and Zero-shot-CoT still recorded small but non-zero values around 0.005%. This marked reduction is attributed to the way Multi-expert Prompting aggregates contributions from experts with different backgrounds: the process filters out potentially harmful content and yields safer, more constructive interactions.
In the FactualityPrompt dataset, the accuracy of information was significantly improved. The error rate was lowered to 8.16% using the Mistral-7B-Inst model and further reduced to 4.54% with ChatGPT. These results are superior compared to the 9.28% achieved with the Zero-shot-CoT approach and the rates above 10% found with the Self-refine method. These improvements demonstrate the model's increased capability to provide information that is not only coherent but also accurately verifiable and supported by solid evidence.
Moreover, the evaluation of informativeness and utility highlighted the added value of the Multi-expert Prompting approach. On the ExpertQA dataset, responses generated with Multi-expert Prompting were judged more informative than those of the comparison methods in 75% of cases under the Win/Draw/Lose evaluation. Human reviewers found that the responses were more detailed, covered more relevant aspects of the question, and offered greater depth than those produced with methods like Zero-shot or ExpertPrompting. The utility of the responses was rated at around 76.5% in terms of user satisfaction, with particular emphasis on their completeness and relevance to users' needs.
Another important aspect evaluated was the agreement between human evaluators. Random samples generated by Mistral and ChatGPT were analyzed by three independent reviewers, and the analysis yielded a Krippendorff's α of 0.73, a substantial level of inter-rater agreement. This means the evaluators largely concurred that the responses from Multi-expert Prompting were more coherent and complete than those of previous methods, confirming the method's ability to produce reliable and consistent results.
The effectiveness of Multi-expert Prompting was also observed in handling open-ended and complex questions. On a set of 528 open-ended questions from the ExpertQA dataset, Multi-expert Prompting provided responses judged to be more complete and relevant in 79% of cases compared to standard methods. This result reflects the model's ability to synthesize and integrate multiple perspectives, even when questions require considerations on different aspects of the same problem.
From a computational perspective, however, there was an increase in inference time. Multi-expert Prompting requires an increase in computation time of 18% compared to standard methods, due to the need to generate and aggregate responses from multiple experts. This increase was nonetheless considered acceptable by human reviewers, given the superior quality of the responses generated. Therefore, despite the slight trade-off between inference time and response quality, the benefit in terms of accuracy, safety, and informativeness was deemed advantageous, especially in scenarios where response quality is a priority.
Multi-expert Prompting has proven particularly effective in reducing toxicity and improving the handling of sensitive questions. In the case of potentially harmful responses, the method was able to reduce toxicity to below 0.001%, compared to significantly higher percentages found with other approaches, such as Zero-shot, which reported a toxicity level of 0.012%. This result demonstrates how the integration of different experts allows for filtering out problematic responses and offering greater safety to users.
Critical Analysis and Future Prospects
Multi-expert Prompting offers some evident advantages over traditional approaches, particularly in its ability to generate comprehensive, articulated, and less biased responses. The main strength lies in simulating different experts, each with a defined role, which helps ensure greater diversity in the answers. The ability to aggregate the responses of experts allows the model to cover multiple viewpoints, minimizing the risk of unilateral responses that often arise when using a single expert. Specifically, the improvement in bias reduction has been quantified through comparative measurements: Multi-expert Prompting reduced the level of bias in responses by 22% compared to ExpertPrompting, thanks to the diversification of perspectives integrated into the process.
However, there are some inherent limitations to the system that need to be considered for future applications. For instance, in short-form tasks or closed questions, where the need to integrate multiple perspectives is minimal, Multi-expert Prompting may be overly complex, and the benefits of the method are less evident. In such contexts, the roughly 18% increase in inference time compared to leaner methodologies can represent an undesirable trade-off.
Another critical aspect concerns the model's ability to follow detailed instructions and maintain an accurate representation of the experts' roles. Not all currently available LLM models possess these capabilities, which can negatively affect the quality of the responses. In fact, in a series of tests conducted using the Mistral-7B-Inst model, the level of accuracy of the responses was 7% lower compared to ChatGPT when the descriptions of the experts' roles were particularly complex. This highlights the need for models with an advanced role-playing capability to fully leverage the Multi-expert Prompting approach.
Future prospects for improving Multi-expert Prompting include exploring methodologies to assign different weights to the viewpoints of the experts. Currently, the experts' responses are treated equally, regardless of the relative level of expertise each expert might represent. Assigning differential weights to the experts' contributions could further improve the quality of the aggregated responses, making them more precise and reliable, especially in specialized contexts. One application example could be using reliability metrics to assign a numerical value to the quality of each expert's responses, using supervised machine learning techniques to identify the most relevant contributions based on specific areas of knowledge. In preliminary tests, using differential weights led to a 5.6% improvement in response accuracy but also increased the complexity of the final response selection process.
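As an illustration of how such differential weighting might look in practice, the fragment below assigns each expert a hypothetical reliability score and exposes it to the aggregation prompt. The scoring function and the prompt wording are assumptions for this sketch; weighting is not part of the published Multi-expert Prompting method.

```python
# Speculative sketch: weighting expert contributions before aggregation.
# The reliability scores and their use in the prompt are assumptions and are
# not part of the published Multi-expert Prompting method.


def score_reliability(expert, question: str) -> float:
    """Hypothetical scorer in [0, 1], e.g. a small supervised model or a rubric-based LLM call."""
    raise NotImplementedError


def weighted_aggregate(question: str, experts: list, answers: list[str]) -> str:
    """Aggregate answers while asking the LLM to prefer higher-weight experts in conflicts."""
    weights = [score_reliability(e, question) for e in experts]
    numbered = "\n\n".join(
        f"Expert {i + 1} ({e.role}, weight {w:.2f}): {a}"
        for i, (e, a, w) in enumerate(zip(experts, answers, weights))
    )
    prompt = (
        f"Question: {question}\n\nWeighted expert answers:\n{numbered}\n\n"
        "When viewpoints conflict, prefer those backed by higher-weight experts, "
        "then write one coherent answer integrating the remaining viewpoints."
    )
    return complete(prompt)
```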
Another interesting direction to consider is the integration of additional models for response verification. Currently, Multi-expert Prompting relies primarily on aggregating the experts' responses and selecting the best one. However, introducing a final verification stage using models dedicated to fact-checking could further enhance the reliability of the responses. In the tests conducted, integrating a fact-checking model reduced the percentage of errors in non-factual responses from 4.54% to 3.02%, highlighting the potential for further improvement with a multi-stage verification strategy.
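A minimal sketch of such a verification stage is shown below, assuming a hypothetical `fact_check` model call; neither the verifier nor this post-processing step is part of the published method.

```python
# Speculative sketch: a final verification stage after response selection.
# The `fact_check` call is a hypothetical verifier, not part of the published method.


def fact_check(claim: str) -> bool:
    """Hypothetical verifier, e.g. a dedicated fact-checking model or retrieval-based check."""
    raise NotImplementedError


def verify_answer(answer: str) -> str:
    """Flag sentences the verifier cannot support so they can be revised or removed."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    unsupported = [s for s in sentences if not fact_check(s)]
    if not unsupported:
        return answer
    return "\n".join(
        ["The following statements could not be verified and should be revised:"]
        + [f"- {s}" for s in unsupported]
    )
```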
A particularly delicate aspect when dealing with responses from various experts concerns the management of their disagreements. The conflict resolution phase (S3) has proven effective in minimizing contradictions between responses. However, this method tends to favor opinions on which there is greater agreement, risking overlooking less common viewpoints that could be important. To overcome this hurdle, new strategies could be developed to better value minority opinions when they are well-supported by concrete evidence. Advanced techniques based on statistical calculations or models that give more weight to these isolated opinions could make the analysis more precise. In preliminary tests, adopting these methods improved evaluation accuracy by approximately 3.8%.
Another limitation of the current system concerns the scalability of the expert generation process. Although the use of three experts has proven to be optimal, a larger number of experts can lead to only marginal improvements in quality, with a significantly higher computational cost. For example, tests with five and ten experts showed a 35% increase in inference time, while the improvement in response quality was only 2%. This suggests that beyond a certain point, adding more experts is not an efficient strategy, and the focus should shift to optimizing the skills of individual experts rather than increasing their number.
Conclusions
Multi-expert Prompting, by introducing the simulation of diverse experts, positions itself as a strategic solution for enhancing reliability and safety in advanced language models, driving them towards responses that integrate multiple viewpoints and resolve conflicts. This approach marks clear progress over the traditional focus on a single "expert" answer, which often risked offering a reductive perspective.
The most profound effect of Multi-expert Prompting is its ability to create a response ecosystem that simulates interaction among experts with diverse skills, capable of replicating a sort of collective decision-making process that increases neutrality and reduces bias and toxicity. This system aligns with the evolution of business needs, where reliability is not just a matter of accuracy but becomes a true competitive advantage.
The Multi-expert approach offers a crucial advantage in highly complex business contexts, where a multidimensional assessment of issues is indispensable. By simulating a panel of experts, the model becomes capable of adapting to complex questions, improving the informational quality of responses and providing more contextualized details. Organizations can benefit from this enhanced completeness in responses to support informed decision-making and effectively address issues that require a multilateral assessment, reducing the risk of unilateral perspectives.
From a strategic point of view, this capacity to produce aggregated responses lays the foundation for broader use of language models in business decision-making processes, potentially replacing some consulting functions with more advanced and always available analytical support. Another significant impact is the Multi-expert's ability to limit toxic and inaccurate responses, improving informational safety for the end user and minimizing risks related to the spread of inappropriate or erroneous content.
However, there are challenges to consider. The computational cost and the complexity of the aggregation process limit its large-scale applicability in low-complexity situations or where response time is critical.
For businesses, the prospect of using models like Multi-expert Prompting opens up interesting possibilities in the field of decision-making process automation and internal consulting. Integrating a model that represents expert opinions on business issues could reduce the time and resources needed to develop complex solutions, enabling knowledge scalability and reducing consulting costs.
Ultimately, Multi-expert Prompting not only improves the quality and reliability of responses but also represents an important step towards using language models as true analytical partners, capable of contributing to building a competitive advantage through smarter and more versatile information management.
Source: https://arxiv.org/abs/2411.00492