Andrea Viliotti

Generative Agent Simulations: Modeling Human Behavior through Qualitative Interviews

The study “Generative Agent Simulations of 1,000 People,” conducted by Joon Sung Park, Carolyn Q. Zou, Aaron Shaw, and colleagues, with support from institutions such as Stanford University, Northwestern University, the University of Washington, and Google DeepMind, explores how generative agent simulations built from qualitative interviews can replicate human behavior using large language models. The research focuses on how in-depth qualitative interviews can provide the essential data for constructing generative agents capable of accurately replicating the responses of more than a thousand people in sociological and experimental contexts. The overarching goal is to understand whether these simulations can offer a virtual laboratory for testing theories and policies in the social domain.


Generative Agent Simulations: Data Insights and Research Goals

This study, focusing on Generative Agent Simulations, aligns with a sociological tradition that models human behavior through abstract agents, typically anchored to mathematical rules or simplified assumptions about decision-making processes. While this approach is useful for testing basic theories, it often struggles to capture the real-world complexity of everyday life. In “Generative Agent Simulations of 1,000 People,” the challenge is different: leveraging the power of large language models to build agents generated from qualitative interview transcripts. The research team aimed to collect extensive and detailed information about the lives of over a thousand individuals, with the goal of creating a wide array of agents capable of providing coherent answers to diverse questions, stimuli, and situations.


The selection of the human sample was based on demographic stratification criteria that took into account age, geographic area, gender, education level, and political orientation, among other factors. The aim was to obtain a representative sample of the U.S. population, avoiding models that would only be valid for specific subgroups. Each participant took part in a two-hour interview conducted through an AI system acting as a “virtual interviewer.” This choice helped maintain a certain uniformity in style and expertise when posing follow-up questions, so as to extract personal and complex information.


The interviews included both general questions—on life history and the perception of social issues—as well as more personal inquiries, such as educational paths, family relationships, political values, and work-related aspects. A protocol inspired by the American Voices Project, a well-established sociological initiative in the United States, was adopted to capture the wide variety of nuances through which people describe their lives. It is important to note that the interview questions were not specifically tailored to subsequent tests (General Social Survey, Big Five, or experimental games), thereby reducing the risk of unintentionally “training” participants to respond in line with those tests.


The breadth of the thematic coverage, coupled with the freedom granted to the interviewees, produced very extensive transcripts: on average, about 6,491 words per person, with some interviews far exceeding this threshold. These data form the “memory” of each generative agent. Essentially, a large language model such as GPT-4 was fed the full transcript of each participant. When a researcher wants to query the agent that represents a specific individual, the model receives the interview as a prompt, along with certain internal reflection mechanisms that help identify the most relevant content to deliver.
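To make this mechanism more concrete, the following is a minimal sketch of how such an agent query could be assembled. It assumes a hypothetical `llm_complete` helper standing in for whatever language model client is used; the function names and prompt wording are illustrative and are not the authors' actual implementation.

```python
def llm_complete(prompt: str) -> str:
    """Placeholder for a call to a large language model (e.g., GPT-4)."""
    raise NotImplementedError("Plug in your preferred LLM client here.")


def query_agent(transcript: str, reflections: list[str], question: str) -> str:
    """Answer a question as the interviewed participant plausibly would.

    The full interview transcript and a list of synthesized reflections are
    injected into the prompt, and the model is asked to respond consistently
    with the participant's statements.
    """
    reflections_text = "\n".join(f"- {r}" for r in reflections)
    prompt = (
        "You will answer as the person described by the interview below.\n\n"
        f"Interview transcript:\n{transcript}\n\n"
        f"Reflections about this person:\n{reflections_text}\n\n"
        f"Question: {question}\n"
        "Answer consistently with the interview's statements:"
    )
    return llm_complete(prompt)
```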


A crucial point involves verifying how closely these simulations reflect the real behavior of the interviewees. It is not enough to confirm that the agent responds coherently; a quantitative comparison is needed between the answers provided by the real participants and the answers from the agents in follow-up surveys. To this end, each subject was asked to complete four types of tests: the core part of the General Social Survey (GSS), the Big Five questionnaire (BFI-44), a battery of well-known economic games (such as the Dictator Game, the Trust Game, and the Public Goods Game), and some social psychology experiments already replicated at a large scale. The participants completed these tests twice: once immediately after the interview and once two weeks later, to measure inconsistencies in their own responses. In other words, if a person contradicts themselves easily, it becomes more difficult for the agent to replicate their behavior. From this arises the concept of normalized accuracy, calculated by dividing the agent's accuracy by the participant's demonstrated consistency:

normalized accuracy = (agent accuracy) / (participant's internal replication)
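As a minimal illustration of this metric (the function name is mine, not the paper's), the normalization can be expressed in a few lines of Python:

```python
def normalized_accuracy(agent_accuracy: float, participant_consistency: float) -> float:
    """Divide the agent's accuracy by the participant's own two-week replication rate.

    Both arguments are proportions in (0, 1]. A value near 1.0 means the agent
    matches the participant about as well as the participant matches themselves.
    """
    if participant_consistency <= 0:
        raise ValueError("Participant consistency must be positive.")
    return agent_accuracy / participant_consistency


# Example with the GSS figures reported in the study:
print(normalized_accuracy(0.6885, 0.8125))  # ~0.847, i.e. roughly 0.85
```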


The research also highlights the privacy and data security measures adopted, such as name redaction, de-identification of transcripts, and the possibility of revoking consent. Along with these safeguards, the authors devised an “agent bank” infrastructure to allow other scholars to test hypotheses and query these agents under an ethical framework that respects data protection regulations.


Essentially, this first phase aims to understand how the depth and variety of topics covered in the interview can give rise to generative agents for each individual, potentially capable of answering questions on any topic: political, social, or even experimental. The use of broad-ranging interviews addresses the need to go beyond traditional models based on a few demographic variables, thereby reducing the risk of falling into stereotypes. The presence of a rich and personal data set should allow the agent to approximate what the interviewee actually thinks or does.

 

Generative Architecture: Advancing Precision and Reliability

One of the distinctive features of this study is the method used to transform interview transcripts into true agents. Specifically, each time a query is made, the entire transcript is “injected” into the language model prompt. A text-based memory of synthetic reflections, often generated automatically, is also included to help the model retrieve the relevant information that emerged during the conversation. Practically speaking, when one asks an agent, “What do you think about a hypothetical new public health law?”, the model scans the corresponding participant’s interview and “expert” reflections to produce a plausible response that is consistent with the positions expressed by the original interviewee.
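A rough idea of what such a reflection step could look like is sketched below. This is an assumption-laden illustration that reuses the hypothetical `llm_complete` helper from the earlier example; it is not the authors' actual reflection module.

```python
def generate_reflections(transcript: str, num_reflections: int = 5) -> list[str]:
    """Ask the model to distill high-level observations from the interview.

    These synthetic reflections are stored alongside the transcript and help
    the agent surface the most relevant material when answering a new question.
    """
    prompt = (
        f"Interview transcript:\n{transcript}\n\n"
        f"Write {num_reflections} high-level observations about this person's "
        "values, habits, and likely opinions, one per line:"
    )
    response = llm_complete(prompt)
    return [line.strip("- ").strip() for line in response.splitlines() if line.strip()]
```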


This approach differs significantly from classic agent-based models that use rigid rules or abstract utility functions. The project relies on the assumption that large language models incorporate general knowledge of human behavior and that combining them with individual testimonies could enable the creation of agents capable of reproducing specific personalities. However, to confirm whether this actually happens in an accurate way, the authors opted for a direct comparison between each agent’s responses and the real individual’s answers to the relevant survey or experiment.


A first level of analysis involves the General Social Survey. This includes 177 core questions with categorical or ordinal responses and 6 questions of a numeric type. It was calculated that the average consistency among participants—i.e., the degree to which each individual replicated their own answers after two weeks—was about 81.25%, whereas the agents’ raw accuracy on the same responses was around 68.85%. If one normalizes 68.85% by dividing it by 81.25%, the result is about 0.85. In other words, the agent approaches 85% of the consistency that a real person has with themselves. This result is deemed more than satisfactory, especially compared to “brief description” alternatives (demographic data or short self-written portraits), which produced normalized accuracy values around 0.70–0.71.


A second level of analysis concerns the Big Five Personality Inventory, composed of 44 items aimed at assessing traits such as openness, conscientiousness, extraversion, agreeableness, and emotional stability. Here, accuracy was evaluated using correlation and Mean Absolute Error, since the answers follow a continuous scale. Comparing the agent’s responses with the participant’s self-replication yielded a normalized correlation of about 0.80 for interview-informed agents, suggesting a solid ability to capture personality structure without falling into stereotypes. Again, agents constructed only from demographic data showed lower correlations.
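For continuous scales of this kind, the two agreement measures mentioned above can be computed directly; the snippet below is a generic sketch with toy numbers rather than the study's evaluation code.

```python
import numpy as np


def continuous_agreement(agent_scores: np.ndarray, human_scores: np.ndarray) -> tuple[float, float]:
    """Return (Pearson correlation, mean absolute error) between two score vectors.

    Suitable for continuous scales such as Big Five trait scores, where exact
    categorical matching would be too strict a criterion.
    """
    correlation = float(np.corrcoef(agent_scores, human_scores)[0, 1])
    mae = float(np.mean(np.abs(agent_scores - human_scores)))
    return correlation, mae


# Toy example with five trait scores on a 1-5 scale:
agent = np.array([3.8, 2.9, 4.1, 3.2, 3.6])
human = np.array([4.0, 3.1, 3.9, 3.0, 3.7])
print(continuous_agreement(agent, human))
```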


A third level of analysis involved five economic games: the Dictator Game, the Trust Game (as both first and second player), the Public Goods Game, and the Prisoner's Dilemma. These tests introduce monetary incentives along with dynamics of cooperation or trust. The average correlation between the agents' choices and the actual participants' choices was around 0.66, and the normalized value was similar (0.66). Unlike the GSS and Big Five, the statistical advantage over other agent types is less pronounced here, although the interview agents generally perform better. One point raised by the authors is that economic behaviors can be more variable and subject to contextual factors that do not necessarily surface during the interview: a participant may decide to be altruistic on one day and more selfish on another, which reduces even their own internal consistency.


From a technical standpoint, the generative architecture also employs a so-called “reflection module” to extract high-level inferences and allow the model to focus on crucial portions of the transcript. In addition, a specific effort is made to reduce biases by introducing more behavioral descriptions, rather than labeling individuals by race, gender, or ideology. In fact, one of the most interesting findings is a reduction in accuracy disparities across political or racial subgroups. For instance, with political groups, using interview-based agents reduces the accuracy gap between ideological extremes from about 12.35% to 7.85%. This suggests that agents based on rich, personal information are less prone to the typical generalizations made by simple demographic agents.

 

Assessing Results: GSS, Big Five, and Economic Games

After discussing the architecture and general goals, it is helpful to delve into the evaluation methodologies employed in the study, focusing on the accuracy and consistency measures, as well as the reasons behind the choice of the GSS, Big Five, and certain classic economic games.


The General Social Survey is one of the most long-standing and respected sociological surveys, covering a wide spectrum of questions ranging from social and political considerations to matters of religiosity, family customs, and perceptions of institutions. In the research, the authors specify that they used 177 core questions and excluded those with more than 25 response options or open-ended answers that could not be compared. Through these items, participants reveal their positions on topics ranging from support for specific public policies to their level of trust in institutions. The agent, in turn, must select among the same options the one that best reflects the original participant’s viewpoint, as gleaned from the interview. All of this is compared with the actual answers the individual provided in the questionnaires.
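A simple way to picture the scoring of these categorical items (a sketch under my own naming, not the paper's code) is an exact-match accuracy over the shared response options:

```python
def categorical_accuracy(agent_answers: list[str], participant_answers: list[str]) -> float:
    """Fraction of GSS-style categorical items where the agent picked the
    same response option as the participant."""
    if len(agent_answers) != len(participant_answers):
        raise ValueError("Answer lists must have the same length.")
    matches = sum(a == p for a, p in zip(agent_answers, participant_answers))
    return matches / len(participant_answers)
```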


One of the most challenging aspects of this process lies in the fact that humans themselves are not always consistent in their opinions. Numerous studies have shown that, over time, a person may give somewhat different responses when taking the same survey again, owing to mood changes, new information, or even a slightly different interpretation of the question. This is why the study introduced internal replication of each participant after two weeks. For example, if an individual confirms 80% of their previous responses, an agent that hits 70% of the same answers actually achieves a performance of (70% / 80%) = 0.875, i.e., a normalization of 0.875.


Moving on to the Big Five Personality Inventory, the choice of this scale is strategic for two reasons. First, personality traits have a strong foundation in the literature and tend to remain relatively stable over time, at least for adults. Second, these trait scores are derived from multiple questions, which, when summed in an index, help reduce statistical noise. The use of Likert scales with continuous values requires correlation calculations and Mean Absolute Error (MAE) to measure the distance between answers. Here as well, participant consistency is not guaranteed, so the researchers evaluated the correlation between the initial session and the one two weeks later. The interview agent showed a correlation with human scores that, in numerical terms, yields a normalized value of about 0.80. According to the authors, these figures are higher than those for agents fed only demographic information or brief “person-based” descriptions.


The economic games add a different behavioral dimension: they are no longer just verbal preferences but involve choices with real monetary costs and benefits. The Dictator Game, for instance, measures a person’s willingness to share (or not) a sum of money with another player. The Trust Game focuses on trust and repayment, while the Public Goods Game examines how multiple players contribute to a collective good. Finally, the Prisoner’s Dilemma is a classic for exploring strategic cooperation or defection. The paper mentions that real monetary incentives were used, encouraging participants to choose sincerely.

Results show that the correlation between the agents’ actions and the participants’ actual choices is about 0.66 for interview-based agents—a figure considered noteworthy, given the chance variability typical of such games. The challenge here is not just interpreting the interview and guessing someone’s personality but also anticipating strategic choices, possibly influenced by emotional factors.


In summary, the evaluations covering the GSS, Big Five, and economic games span a broad range of attitudes, beliefs, and practical behaviors. The agents excel particularly in replicating responses to sociopolitical questionnaires and in identifying personality traits. Meanwhile, their performance in strategic games, though still interesting, is more modest. This suggests that, while the interview provides a significant information repository, certain aspects of behavior may not be fully captured by mere autobiographical narratives.

 

Experimental Insights: Simulations and Treatment Effects

A further step that characterizes the study is the verification of the agents’ capacity to predict treatment effects in experimental contexts. Social research often uses experiments in which subjects are split into control and treatment groups to test hypotheses about reactions to artificial situations, moral vignettes, or scenario manipulations. The paper describes five experiments from a large-scale replication project (the Mechanical Turk Replication Project), involving scenarios such as perceived harm based on intentionality, the relevance of fairness in decisions, and the role of dehumanization in someone’s willingness to harm another.


In short, the real participants successfully replicated four of the five studies and failed to replicate one, an outcome that is not surprising in the scientific literature, since replications do not always confirm the originally reported effects. The novelty lies in the fact that the interview-based agents produced the same pattern of results: they detected a significant effect in the same four studies and a null result in the fifth. Even more striking is the correlation between the effect sizes observed among real participants and those produced by the agents, which nearly reaches 0.98, indicating an almost perfect alignment between the effects measured in the two groups. Essentially, this suggests that the agents not only reproduce individual behaviors but also mirror group-level dynamics, showing the same effects observed in the experimental conditions applied to the real participants.
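To make that comparison concrete, one could imagine computing a per-study effect for humans and for agents and then correlating the two series across studies. The sketch below simplifies the effect estimate to a difference in group means, which is an assumption of mine and not necessarily the authors' exact estimator.

```python
import numpy as np


def effect_size(treatment: np.ndarray, control: np.ndarray) -> float:
    """Simplified effect estimate: difference between treatment and control means."""
    return float(np.mean(treatment) - np.mean(control))


def effect_alignment(human_effects: list[float], agent_effects: list[float]) -> float:
    """Pearson correlation between human and agent effect sizes across studies."""
    return float(np.corrcoef(human_effects, agent_effects)[0, 1])


# With five studies, a value near 0.98 would indicate near-perfect alignment
# between the effects measured on participants and those produced by agents.
```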


Each of the five experiments had a slightly different design. In one, for example, people were asked to judge whether the culprit of a harmful act had acted intentionally or by mistake, and how this influenced the need for punishment. In another, the effect of a sense of power on the level of trust in a potential exchange partner was tested. For each scenario, the agents received the same instructions and conditions (text or images) and, just like the participants, provided their response.


According to the authors, the fact that group-level differences match real-world outcomes on a population scale could open new possibilities. One might imagine conducting a pilot study on a thousand agents—each anchored to a real person’s interview—to “probe” the expected effect of an intervention before investing in an expensive human experiment. Caution is advised, however: the idea is not to completely replace real participants, because even the most accurate model cannot update itself on events occurring after the interview. Also, if significant changes occur or if they concern areas not mentioned during the interview, the simulation may be incomplete.


The paper also highlights the risks of using these agents superficially in policy-making contexts. For instance, if one wanted to test a new public health awareness campaign, the agents could offer a glimpse of how various population segments might react. But one must remember that the agents cannot exceed the limits of the data they contain: if the interview failed to address crucial aspects, their responses might be arbitrary. Nonetheless, the high correlation coefficient between the treatments experienced by participants and those produced by the agents shows that, with proper controls and a thorough interview protocol, these simulation systems can serve as a useful and stimulating virtual laboratory.

 

Addressing Biases and Data Access in Generative Simulations

A well-known issue in artificial intelligence is the presence of bias during training or in defining agent profiles. Models relying on simple demographic labels often fall into stereotypes, disadvantaging minority groups or underrepresented categories. Encouragingly, the study shows that agents generated from interviews exhibit a smaller performance gap than those relying solely on demographic attributes. Looking at the Demographic Parity Difference (DPD), which measures the disparity in accuracy between the most and least favored group, interview-based models significantly shrink the gap—for instance, from 12.35 percentage points at ideological extremes down to about 7.85. A similar pattern is observed for racial variables, although the degree of improvement can vary in some cases.
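As an illustration of the metric as described here (with a function name of my own choosing), the DPD can be computed directly from per-group accuracies:

```python
def demographic_parity_difference(group_accuracies: dict[str, float]) -> float:
    """Gap between the best- and worst-served group's accuracy.

    Smaller values indicate that the agents serve all subgroups more evenly.
    """
    values = list(group_accuracies.values())
    return max(values) - min(values)


# Hypothetical per-group accuracies, gap of roughly 8 percentage points:
print(demographic_parity_difference({"group_a": 0.78, "group_b": 0.70}))  # ~0.08
```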


This finding can be explained by the very nature of qualitative interviews, which enable the agent to draw from a wide range of personal content, rather than relying on a “typical profile.” In the case of agents built on basic categories like gender, age, or ideological stance, the language model tends to reproduce typical images that are necessarily incomplete and fail to capture individual complexity. Conversely, if a person from a certain minority group shares a specific life experience in the interview, the text-based agent will remember that experience, reducing the risk of broad generalizations.


The study also introduces an “agent bank” system designed to make these virtual profiles available to the scientific community. The idea is to provide access at two levels: a more open level with aggregated data, allowing researchers to explore general trends without violating participants’ privacy; and a more restricted level with specific permissions for open-ended queries and access to individual responses. This second level would be useful for those needing to run particularly detailed simulations or test new experimental protocols, which require interacting with individual agents in a personalized way. However, oversight procedures, control logs, and restrictions on commercial use would be necessary to safeguard participants’ rights.


On an application level, the prospects appear varied. In the social sciences, simulating a thousand individuals anchored to real interviews could help formulate hypotheses about how different population segments might react to a particular event, such as a new legislative proposal or a health crisis. One could analyze how a group of agents behaves on virtual social networks, exploring opinion polarization or information spread. In marketing and market research contexts, a company might want to “question” the agents to grasp purchasing trends, with the understanding that these agents represent a snapshot in time rather than a dynamic update.


At the same time, the research invites caution. Although the results are promising and show a strong alignment between agents and real participants, no simulation can entirely replace real-world studies on human samples, particularly in evolving social and informational contexts. The value of the “Generative Agent Simulations of 1,000 People” approach is to offer a preliminary testing ground for research hypotheses—a virtual lab where one can explore, at lower cost and in less time, the impact of certain inputs. Yet the authors maintain that any significant conclusions must be backed up by field verification and ongoing reassessment of the timeliness and validity of the interview data.


A further strategic consideration is the opportunity to expand the study to other populations or to specialized interviews on niche topics. If the interview protocol targeted a specific group—for instance, individuals working in a particular medical field—the resulting agents could provide very detailed projections about hospital policies. Conversely, maximum diversity (like the general population in this study) offers a broader perspective but one that is less deeply specialized. In any case, the richness of these agents hinges on the comprehensiveness of their interviews, which must be meticulously designed to capture the complexities of human life without resorting to excessive redundancies.

 

Conclusions

The findings of “Generative Agent Simulations of 1,000 People” indicate that combining in-depth qualitative interviews with large-scale language models can yield highly plausible human simulation scenarios. Accuracy reaches notable levels in questionnaire responses, such as those in the GSS, and in detecting personality traits. Even in economic games and social psychology experiments, the collective coherence of the agents is strikingly close to that observed in actual participants. However, from a managerial or entrepreneurial standpoint, one should not expect these agents to become a perfect substitute for field surveys. The social context evolves, and interview data are ultimately a static snapshot that will age over time.


The strategic reflection, then, revolves around the possibility of using these agents as an initial testing ground for communication or policy strategies. If a company or institution wished to gauge how a certain population segment might respond to a new product, it could run a preliminary simulation with hundreds of “personalized” agents, gathering insights on potential reactions, conflicts, or divergences. Subsequently, a more targeted and smaller-scale traditional experiment could be conducted, optimizing time and costs. One could also investigate group dynamics, for example, the formation of opinion clusters, in a virtual environment. While this approach is certainly feasible, it should be accompanied by ongoing scrutiny of the data’s origin and relevance: if the interview materials are outdated or incomplete, the simulation’s outcomes will likewise be limited or biased.


Comparing this method with other existing simulation tools reveals that generative agents offer a remarkably greater level of granularity, as each agent is anchored to a real individual rather than to a generic construct. Still, open questions remain regarding how participants’ personalities and choices might change over time—an issue not handled by static models. Moreover, simpler techniques already exist that address “similar tasks,” such as traditional preference modeling in marketing or electoral behavior simulators. However, those solutions rarely integrate such a rich textual component that could reflect responses to complex proposals and scenarios. The study at hand thus introduces new possibilities but also requires caution, enhanced ethical oversight, and continual data maintenance.


Over the longer term, one could imagine extending this approach internationally or integrating additional data collection methods, such as face-to-face interviews or biometric information and social media history—provided there is explicit consent. This would be another step toward simulations that more closely reflect real people, with the caveat of heightened privacy concerns. The authors of this work underscore the importance of a responsible governance system—one that balances data transparency, the protection of interviewees, and the need for innovative research. The path is clear: using interview-based agents could lead to deeper analyses of human dynamics, yielding rapid feedback and lowering certain logistical barriers. Nonetheless, any simulation must be handled with awareness of its limitations and the inherent uncertainty in forecasting real human behavior.


 
