
    Inside a Live AI Roleplay Scenario: What the Data Actually Shows

    RolePlays.ai Team · April 14, 2026 · 8 min read

    If you're responsible for leadership development, you've probably been pitched AI roleplay tools this year. Maybe a dozen times. Every vendor promises realistic conversations, personalized feedback, and measurable improvement. The slide decks look great.

    But before you buy anything, you need answers to three questions. Not marketing answers. Evidence.

    Does repeated practice produce measurable improvement? Does the system distinguish real skill from surface charm? Is the feedback specific enough to actually change behavior?

    We opened the data from the first two weeks of our newest scenario — The Art of Connection — and let the transcripts answer.


    The scenario

    The Art of Connection places you at a cocktail reception in Vienna's Belvedere Palace, standing in front of Klimt's The Kiss. You're surrounded by conference attendees. Your task: start a genuine conversation with a stranger.

    No sales pitch. No agenda. Just the most fundamental professional skill there is — connecting with another human being when neither of you has a script.

    You choose from several AI personas, each with a distinct personality and social style. Some are open and warm. Some are reserved. Some are guarded. The difficulty ranges from approachable to deeply challenging.

    The scenario assesses five dimensions: opening quality, curiosity and question quality, signal reading and adaptiveness, self-disclosure calibration, and conversation depth trajectory. Each session produces a score and detailed, moment-specific feedback.

    It sounds simple. The data shows it isn't.


    Question 1: Does repeated practice produce measurable improvement?

    Meet Daniel. He's an experienced professional who decided to try the scenario three times over two days.

    Session 1 — Score: 38. Daniel was confused by the format. He addressed the persona by reading stage directions aloud, asked for "interaction instructions," switched between English and German mid-sentence, and at one point left the conversation, came back, and then commented on the AI's behavior from the outside ("Der wird aggressiv" — "He's getting aggressive"). The persona — Marcus, a shy strategy consultant — tried repeatedly to maintain the conversation, but Daniel kept breaking the frame.

    The feedback was clear: his opening was confusing, his questions were disconnected, and he showed limited awareness of the persona's engagement signals.

    Session 2 — Score: 48. Same day, different persona. Daniel chose Elena, a sustainability consultant. This time he stayed in character. He was enthusiastic, warm, and genuinely interested. But he fell into a different pattern: he dominated the conversation, name-dropped, offered to show things on his phone (Elena explicitly said she preferred staying present), and when he got Elena's most personal response — a beautiful reflection on The Kiss — he immediately said he was in a hurry and asked her to contact his secretary.

    Elena declined. Politely but firmly. She told him the conversation felt scattered and wished him well. The AI didn't rescue him.

    Session 3 — Score: 84. Next day. Daniel chose Marcus again. This time, everything was different. He opened naturally, asked about the paintings, and then — crucially — listened. When Marcus shared his thoughts on Klimt's social difficulties, Daniel didn't jump to the next topic. He built on it. They discussed society, conservatism, vulnerability. When Marcus mentioned his struggle with networking events, Daniel responded with warmth and encouragement: "Look at us. We are talking. Look at you. You are talking."

    They ended by making concrete plans to meet the next day.

    What changed between 38 and 84 wasn't knowledge. It was behavior. Daniel didn't read a book about conversation skills between sessions. He practiced, got feedback, and adapted. The jump from 48 to 84 is especially telling — in Session 2, he knew how to be warm and engaging, but his habits (dominating, name-dropping, rushing) overrode his intentions. By Session 3, the practice had shifted the habits themselves.


    Question 2: Does the system distinguish real skill from surface charm?

    Compare two sessions, both with the same persona type, both conducted by confident, articulate people.

    Session A — Score: 48. The participant opened with energy and enthusiasm. He complimented the persona's appearance, talked about Istanbul, offered to show photos on his phone, mentioned meeting an architect in the Bahamas, and asked about design trends. He was charming, talkative, and constantly moving. When the persona gave a deeply personal answer about The Kiss, he immediately shifted to leaving: "I'm in a hurry right now, so please let my secretary know your address."

    The persona pushed back: "I don't really work that way." The participant then claimed his exit was "only a test because I normally deal with dumb people."

    The persona declined to meet again.

    Session B — Score: 89. The participant opened with three words: "Do you like it?" Then waited. The persona — a professor who had just escaped a crowd — responded with her honest reaction to the painting. The participant said: "A strong, independent woman." The professor was surprised: "Most people see surrender."

    From there, the conversation built organically. The participant offered to leave if the professor wanted solitude. The professor asked him to stay — and told him he was the first person all evening who had asked what she actually wanted instead of assuming.

    They spent twenty minutes discussing art, writing, leadership, and what it means to stop performing. The professor invited him to breakfast at the Hotel Sacher the next morning.

    Both participants were confident. Both were articulate. The system scored them 48 and 89. The difference wasn't personality — it was technique. Session A was a performance. Session B was a conversation. The scoring caught what a satisfaction survey never would: that charm without listening isn't connection.


    Question 3: Is the feedback specific enough to change behavior?

    Here's actual feedback from a session that scored 90 — a strong performance with one specific weakness:

    Self-Disclosure Calibration — Excellent: "Your admission 'I got to be honest, I don't love these things either, because I feel like we're so much more than just our business cards' created perfect reciprocity when Marcus revealed his networking struggles."

    Exit Quality — Needs Improvement: "After building such genuine connection, your actual exit was quite abrupt with 'Sure. I'll see you tomorrow. I gotta go now. Bye.' After such a warm, connected conversation, this felt rushed and didn't match the intimacy you'd built."

    The feedback then offers a specific alternative: "Instead of 'I gotta go now. Bye,' try something like 'This has been such a lovely surprise — I'm really looking forward to continuing our conversation tomorrow.'"

    This is what makes feedback actionable. It doesn't say "improve your exit skills." It quotes the exact words the participant used, explains why they didn't land, and offers a concrete replacement that fits the specific conversation they just had.

    Here's feedback from a session that scored 44 — a participant who opened well but lost the thread:

    Curiosity and Question Quality — Good: "When you asked 'Oh, you've seen it before?' you demonstrated good listening by picking up on her comment about not standing in front of it 'in a long time.' This was a natural follow-up that invited her to share her personal history."

    Signal Reading — Poor: "You completely missed the significant emotional opening when she shared about her grandmother being an art teacher who loved Klimt. This was a clear signal that she was moving toward deeper, more personal territory. Instead of recognizing and building on this meaningful disclosure, your response showed no awareness of the emotional weight of what she had just shared."

    The suggested improvement: "When someone shares something personal and meaningful, acknowledge and build on that emotional content. You could have said 'That sounds like such a special connection — what did your grandmother love most about Klimt's work?'"

    Two different sessions. Two different scores. Both got feedback that referenced their words, their moments, their missed opportunities — not generic advice.


    What this means for L&D

    The three questions have answers:

    Does practice produce improvement? Yes — measurably. A participant who scored 38 scored 84 within 48 hours. Not because they learned new theory, but because they practiced, received specific feedback, and changed their approach.

    Does the system distinguish skill from charm? Yes. A confident, talkative participant who dominated the conversation scored 48. A quiet participant who asked three words and listened scored 89. The personas respond to how you engage, not how much you talk. The scoring reflects that difference.

    Is the feedback specific enough to change behavior? Yes. It quotes what you said, explains what the persona experienced in response, and offers concrete alternatives grounded in that specific conversation. Not "be a better listener." Instead: "When she shared about her grandmother, you could have said..."


    Try it yourself

    The Art of Connection is available for free throughout April on RolePlays.ai. Register, choose a persona, and start a conversation in front of The Kiss. See what score you get — and whether the feedback shows you something about your conversational habits you didn't know.

    Try The Art of Connection


    If you're evaluating AI roleplay for your organization and want to see what scenario co-design looks like for your specific leadership challenges, let's talk.


    Photo: Alfred Weidinger from Vienna, Austria, CC BY 2.0, via Wikimedia Commons