and engaging in social conversation throughout the process. Building an instructional chatbot traditionally requires multiple large datasets and extensive model training, yet it is one of the tasks for which a near-trivial prompt provided to GPT-3 (“Walk the user through the following steps: <steps>”) alone yields a strong baseline. Among other uses, merely enabling non-AI experts to customize and improve instructional chatbots has the potential to revolutionize customer service bots, one of the most common chatbot use cases. The explosion of interest in LLM-based ChatGPT [1] demonstrates that chat-based interactions with LLMs can provide a powerful engine for a wide variety of tasks, including joke-writing, programming, writing college-level essays, medical diagnoses, and more; see [41] for a summary.
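As a concrete illustration of that near-trivial baseline, the sketch below sends the “walk the user through” prompt, plus the running conversation, to a GPT-3-era completions endpoint. This is a minimal sketch, not the authors’ actual setup: the step list, decoding parameters, and helper names are illustrative assumptions.

```python
import openai  # GPT-3-era OpenAI SDK (pre-1.0 interface)

openai.api_key = "sk-..."  # assumes a valid API key

# Illustrative step list; the paper leaves <steps> unspecified.
STEPS = (
    "1. Scoot to the front of your chair. "
    "2. Plant both feet flat on the floor. "
    "3. Push up slowly to a standing position."
)

def bot_reply(history):
    """One chatbot turn: the near-trivial prompt plus the dialogue so far."""
    prompt = (
        f"Walk the user through the following steps: {STEPS}\n\n"
        + "\n".join(history)
        + "\nBot:"
    )
    response = openai.Completion.create(
        model="text-davinci-003",  # a GPT-3-era completions model
        prompt=prompt,
        max_tokens=150,
        temperature=0.7,
        stop=["User:"],  # stop before the model invents the user's next turn
    )
    return response.choices[0].text.strip()

history = ["User: Ok hang on while I get a chair"]
history.append("Bot: " + bot_reply(history))
print(history[-1])
```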

Toward this goal, we created a no-code LLM-based chatbot design tool, BotDesigner, that (1) allows users to create an LLM-based chatbot solely through prompts, and (2) encourages iterative design and evaluation of effective prompt strategies. Using this tool as a design probe, we observed how 10 participants without substantial prompt design experience executed a chatbot design task using BotDesigner, to explore a few different pieces of this larger question. Our findings suggest that, while end-users can explore prompt designs opportunistically, they struggle to make robust, systematic progress. Their struggles echo many well-known struggles observed in end-user programming systems (EUPS) and among non-expert users of interactive machine learning (iML) systems. Additional barriers to effective prompt design stem from limited conceptions of LLMs’ prompt understanding and execution abilities, and from participants’ understandable inclination to design prompts that resemble human-to-human instructions. We discuss these observations’ implications for designing effective end-user-facing LLM-based design tools, implications for education that improves LLM-and-prompt literacy among programmers and the general public, and opportunities for further research.

This paper makes three contributions. First, it describes a novel, no-code LLM-and-prompt-based chatbot design tool that encourages iterative design and evaluation of robust prompt strategies (rather than opportunistic experimentation). Second, it offers a rare rich description of how non-experts intuitively approached prompt design, and where, how, and why they struggled. Finally, it identifies opportunities for non-expert-facing prompt design tools and open research questions in making LLM-powered design innovation accessible.

2 RELATED WORK

2.1 The Promises of Non-Expert Prompt Design

Today’s chatbot design practice—i.e., designing multi-turn conversational interactions—follows a well-established workflow [12, 13, 24, 25, 39, 44]. Designers first (i) identify the chatbot’s functionality or persona and draft ideal user-bot conversations, for example, through Wizard-of-Oz studies or by having experts draft scripts; (ii) create a dialogue flow template (e.g., “(1) greeting message; (2) questions to collect user intention; (3) ...”); (iii) fill the template with supervised NLP models (e.g., a user intent classifier, a response generator, etc.), as sketched below; and finally (iv) iterate on these components to achieve a desired conversational experience. In this “supervised learning” paradigm, designers make NLP models generate their desired interactions by improving the models’ training data and feature design, tasks that require substantial machine learning and programming knowledge [50].
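For contrast with the prompt-based paradigm discussed next, here is a minimal sketch of steps (ii) and (iii) of that supervised workflow: a trained intent classifier routing user utterances into a hand-authored dialogue flow template. The intents, training utterances, and canned responses are invented for illustration; a production system would need far more labeled data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Step (iii): labeled examples a designer must collect, clean, and maintain.
utterances = [
    "hi there", "hello!",
    "I want to book a table", "reserve a table for two",
    "thanks, goodbye", "bye",
]
intents = ["greet", "greet", "book", "book", "farewell", "farewell"]

intent_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
intent_clf.fit(utterances, intents)

# Step (ii): the dialogue flow template, filled with canned responses.
FLOW_TEMPLATE = {
    "greet": "Hello! How can I help you today?",
    "book": "Sure. For how many people, and at what time?",
    "farewell": "Thanks for visiting. Goodbye!",
}

def respond(user_utterance):
    intent = intent_clf.predict([user_utterance])[0]
    return FLOW_TEMPLATE[intent]

print(respond("hello there"))            # greeting branch
print(respond("can I reserve a table"))  # booking branch
```

Improving this bot means improving the training data or the template, which is exactly the machine learning and programming burden the prompt paradigm promises to remove.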

The emergent “pre-train, prompt, predict”¹ paradigm in NLP promises to lower the entry barrier for non-experts innovating on conversational interactions [30]. In this paradigm, designers can create conversational agents with little to no training data, programming skills, or even NLP knowledge. Leveraging pre-trained large language models such as ChatGPT [1], designers can create a general-purpose conversational system with passable, though sometimes problematic, performance. Next, they can improve the LLM outputs using natural language prompts (see Table 1 for an example) and/or model fine-tuning. In this paradigm, people without programming skills or NLP knowledge can nonetheless make NLP models generate desired interactions by crafting effective prompt strategies (e.g., natural language instructions, examples, and templates) [3, 15].

Prompt Strategy       | Resulting Human-GPT-3 Conversation
----------------------|---------------------------------------------------
No prompt (baseline)  | User: Ok hang on while I get a chair
                      | Bot: Scoot to the front of your chair [...]
----------------------|---------------------------------------------------
Explicit instruction  | Prompt design: If the user asks you to wait,
                      |   explain that this is not a problem [...]
                      | User: Ok hang on while I get a chair
                      | Bot: Once you have your chair, scoot to the
                      |   front of it [...]

Table 1: An example of how designers can directly improve chatbot interactions by modifying prompt strategies. Note the changes in the bot’s response to the user’s statement.
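In code, the difference between Table 1’s two rows is simply what gets prepended to the conversation before each model call. The sketch below shows one way that assembly could look; the helper names and the instruction’s full wording are assumptions (this excerpt does not show BotDesigner’s actual prompt assembly), and the elided <steps> placeholder is kept as-is.

```python
BASE_PROMPT = "Walk the user through the following steps: <steps>"

# Table 1's two strategies: an empty addition (baseline) vs. an explicit
# instruction targeting the "please wait" failure.
PROMPT_STRATEGIES = {
    "baseline": "",
    "explicit instruction": (
        "If the user asks you to wait, explain that this is not a problem "
        "and continue once they are ready."
    ),
}

def build_prompt(strategy, history):
    """Assemble the full text sent to the model for the next bot turn."""
    parts = [BASE_PROMPT]
    if PROMPT_STRATEGIES[strategy]:
        parts.append(PROMPT_STRATEGIES[strategy])
    parts.extend(history)
    parts.append("Bot:")
    return "\n".join(parts)

history = ["User: Ok hang on while I get a chair"]
print(build_prompt("explicit instruction", history))
```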

2.2 Known Challenges in Prompt Design

While prompting can appear as easy as instructing a human, crafting effective and generalizable prompt strategies is a challenging task. How a prompt or a prompt strategy directly impacts model outputs, and how prompts modify LLMs’ billions of parameters during re-training, are both active areas of NLP research [30, 42]. Moreover, established prompt design workflows do not yet exist. Even for NLP experts, prompt engineering requires extensive trial and error: iteratively experimenting with and assessing the effects of various prompt strategies on concrete input-output pairs, before assessing them more systematically on large datasets, as in the sketch below.
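That trial-and-error loop can be made slightly more systematic even without a large dataset: run a fixed set of concrete test inputs through each candidate strategy and compare the outputs side by side. This is a minimal harness sketch, not an established workflow; `complete` is a hypothetical stand-in for any LLM call (for instance, the completions sketch earlier).

```python
def complete(prompt):
    """Stand-in for an actual LLM call; plug in your own model here."""
    return "<model output>"

# Concrete inputs a designer has seen trip up earlier prompt versions.
TEST_INPUTS = [
    "Ok hang on while I get a chair",
    "Wait, can you say that again?",
    "I'm ready for the next step",
]

STRATEGIES = {
    "baseline": "Walk the user through the following steps: <steps>",
    "explicit instruction": (
        "Walk the user through the following steps: <steps>\n"
        "If the user asks you to wait, explain that this is not a problem."
    ),
}

# Print each strategy's outputs side by side for manual comparison.
for name, prefix in STRATEGIES.items():
    print(f"=== {name} ===")
    for user_turn in TEST_INPUTS:
        reply = complete(f"{prefix}\nUser: {user_turn}\nBot:")
        print(f"  User: {user_turn}")
        print(f"  Bot:  {reply}")
```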

That said, ongoing NLP research does offer some hints toward effective prompt design strategies. Notably, these prompt strategies are effective at improving LLMs’ performance across a range of NLP tasks and conversational contexts in aggregate; it remains unclear to what extent they can improve any particular conversational interaction or context.

¹Predict here refers to generating model outputs, called “prediction” because model outputs are probabilistic predictions of the words (tokens) that might follow the prompt.