Voice interface design is about good writing, not just good design
April 16, 2020

I wrote a book about voice content and how to build content-driven voice interfaces. Voice Content and Usability is A Book Apart’s first-ever voice design book. You can learn more about what’s in the book or sign up for preorders.
Every element of a voice interface can be reduced to two designed entities: a conversational flow (made up of consecutive intents, or desired actions) that allows the user to access various areas of the interface, and language in the form of user- and machine-generated utterances. I’ve long discussed both of these in previous talks (most notably in Utrecht) and writing, including more fundamental concerns like how conversational interfaces differ from others, information architecture, conversational design, content strategy, usability testing, and affordance and wayfinding.
While conversational flows are analogous to interaction flows in other interfaces and are familiar to designers working in visual and physical interface design, conversational interface language is unique to conversational design and therefore a more novel concept to designers coming from other media. And as writing becomes increasingly integral to design as a practice, good writing for conversational interfaces is accordingly becoming a pressing concern for designers everywhere.
In this article, I explore the primary designed elements of voice design and how best to employ flows and language to articulate a robust voice interface, not only from the standpoint of navigability but also from the perspective of usability. In short, voice design is not solely about crafting good interaction flows; it is also about good writing, because machine-generated utterances must elicit appropriate user-generated utterances in response, and the machine must then effectively capture the intent those user utterances articulate. In the subsequent paragraphs I also describe why language is more essential to voice interfaces than flows and how thinking in analogies can help us not only design but also write better voice interfaces.
Conversational design and conversational writing
As I’ve written previously in my prior content about effective conversational design, “One of the most crucial differences between web or graphic design and conversational design is the notion that while traditional designers focus primarily on visual and spatial elements in a physical space, conversational designers must reorient themselves to emphasize aural and verbal elements in either a visual space (for chatbots and other messaging bots) or an auditory environment (for voice assistants).”
In short, conversational design is about creating the same types of signifiers and indicators that websites and web applications convey easily and rendering them appropriate for a voice audience, which is no small feat. But with the correct approach to writing, not just designing, voice interfaces, we can reduce the mental distance between voice interfaces and their more physical, tactile counterparts.
Hall’s key moments in conversational design
In my previous article on the topic, I cite Erika Hall’s seminal book Conversational Design in discussing philosopher H. Paul Grice’s conversational maxims, or principles that characterize what makes a conversation effective. Based on Grice’s characterization of the optimal conversation, Hall articulated four key moments during such conversations that not only approximate authentic human conversation but also reflect the best-written conversational interfaces:
Introduction. Invite interest and get the user excited about the conversation. Encourage trust and help the user feel at ease. This is the first impression they will have.
Orientation. For users who are unfamiliar with conversational interfaces, an initial stage of orientation is important to present the system options and the actions that can be undertaken to achieve a goal. This dovetails with information architecture.
Action. At each stage in the conversation, the user should be presented with a set of options that represent tasks to move the conversation forward as well as user controls.
Guidance. The interface should endeavor to provide clear and concise instructions when the user is expected to perform an action and feedback at the end of every interaction.
These summaries of Hall’s key moments are taken from my previous article on best practices in conversational design.
Conversational writing in written and spoken form
If we take a closer look at Hall’s key moments in conversational interfaces, we can see two types of concerns when it comes to conversational writing: the “system options” and “actions that can be undertaken,” which represent the flows or intents a user can follow to reach their goal; and the “orientation” or “instructions,” which reflect the type of language that must be deployed so that users feel not only welcome in the interface but also in control of the interaction.
While I’ve written extensively about the ways in which Hall’s key moments can guide the writing of effective language for conversational interfaces, I’ve not considered the fundamental differences that distinguish written language for conversational interfaces trafficking in text from spoken language for voice interfaces. As we know intimately from misinterpreted intentions in our conversations over text message, written language can be a problematic medium. This is compounded even further by the fact that to the untrained eye, text written by a machine (such as a Slackbot) can be indistinguishable from text written by a human.
Voice interface writing has the unenviable distinction of relying on speech synthesis to convey meaning to the user. This places even more onus on the voice designer to ensure that positive emotions are conveyed effectively and faithfully during the introduction stage of Hall’s key moments as well as during the guidance stage. In addition, the designer must scrutinize how the speech synthesizer handles certain uncommon words (in Alexa, this is possible with a glossary) or disambiguates similar-sounding terminology.
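As a rough sketch, consider how SSML markup, which Alexa’s synthesizer accepts, lets a designer steer pronunciation around an uncommon acronym or a pair of identically spelled, differently pronounced words. The wording and terms below are hypothetical; the <sub> and <phoneme> tags themselves are standard SSML.

```python
# A minimal sketch of SSML steering a speech synthesizer. <sub> expands a
# potentially confusing acronym; <phoneme> disambiguates two words that are
# spelled the same but pronounced differently. All wording is hypothetical.
INTRO_SPEECH = """
<speak>
  Welcome to <sub alias="ask me anything">AMA</sub> Hardware.
  Are you asking about <phoneme alphabet="ipa" ph="lɛd">lead</phoneme> paint,
  or who should take the <phoneme alphabet="ipa" ph="liːd">lead</phoneme>
  on your project?
</speak>
"""
```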
Two designed elements of voice design
Generally speaking, voice designers work with two design elements when it comes to the voice context: flows, which map user intent to navigable trajectories articulated in the form of virtual interaction patterns; and language, which guides users along that trajectory.
Flows: The paths our users take
Without a flow to handle a given user intent, there is no way for that intent to be realized in a satisfactory manner. In short, flows represent the wayfinding and signposting of our voice interface, the role played on websites by sitemaps and navigation menus, or by option selection in a takeout ordering form. In other words, flows are the paths that our users take; they are users’ means of navigation and routing—a switchboard that the user can manipulate to conduct the transaction or find the information they require. There is one significant exception to these similarities: in voice, interaction flows are necessarily abstract and invisible to the human user; there is no way to use visual elements to suggest or approximate a concrete interaction flow that offers users granular wayfinding.
Moreover, conversational flows in voice interfaces are not simply about the user clicking figurative buttons or proceeding through theoretical steps in an imaginary physical interface. Flows also represent, from the user’s standpoint, the directions the eyes take when first envisaging an experience and the ways we move our mouse to discover new elements in the interface. Conversational flows encompass all of these possible interactions, not just those that result in a submission of information or state change in the application. Flows should also enable users in the same way that design techniques like top-down, left-to-right orientation facilitate better eye navigation or that hover actions intensify certain affordances. There are voice equivalents of all of these typically web-based interactions; the trick is to find their proper analogues appropriate for your use cases.
In this way, conversational flows should aim to approximate a human conversation as closely as possible—much like the conversation-centric or conversation-first approach encouraged by many conversational designers as a best practice—with the added caveat that these conversations should always endeavor to reach a certain goal. Grice captured this mission-driven approach in describing his cooperative principle. But just as human conversation has the tendency to meander serpentinely without moving to the next step, so too should voice interface flows reflect web users’ tendencies to fixate on particular areas of the page or consider intended next steps before clicking to proceed.
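To make this concrete, here is a minimal sketch of a conversational flow modeled as a state machine; the states, intents, and fallback “clarify” state are all hypothetical, not drawn from any particular framework. Note how the flow permits looping and backtracking, just as human conversation meanders:

```python
# A conversational flow as a state machine: each (state, intent) pair maps
# to the next state in the conversation. States and intents are hypothetical.
FLOW = {
    ("welcome", "OrderPizzaIntent"): "choose_size",
    ("choose_size", "SizeGivenIntent"): "choose_toppings",
    ("choose_toppings", "ToppingGivenIntent"): "choose_toppings",  # loop: users meander
    ("choose_toppings", "DoneIntent"): "confirm_order",
    ("confirm_order", "YesIntent"): "goodbye",
    ("confirm_order", "NoIntent"): "choose_size",  # backtrack, as conversation does
}

def next_state(state: str, intent: str) -> str:
    # Fall back to a repair state when the intent doesn't fit the flow,
    # rather than stranding the user at a dead end.
    return FLOW.get((state, intent), "clarify")
```

The fallback state does the quiet work here: it is what distinguishes a forgiving, conversational flow from one that strands the user.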
Language: The fabric of the interface
From a superficial perspective, it may seem that language in voice interfaces is merely help text. In fact, utterances comprise the very fabric of the interface: the complex lattice of words that ensures users understand where they are and how they should respond. If flows are the voice interface’s navigation system, language is its fabric.
Spoken language in voice interfaces also serves the purpose of an interface itself, as utterances are the only method by which flows can be accessed and traversed. More importantly, however, language provides the way for users to offer their input to the conversational interface at important points during the interaction flow. To draw a similar analogy to visual user interface elements, language and utterances in voice interfaces can be considered akin to the elements we click and focus, the modals we dismiss, and the fields we fill in on web forms.
Though both flows and language are essential to the success of a complex voice interface, language is more deeply fundamental to the functioning of voice assistants and voice-driven content. Consider, for instance, that the forms we encounter in conversational interfaces with visual components generally occupy the same expected locations as user-provided utterances do in voice interfaces without visual components. Like the fields we fill in on a conversational form, utterances emitted by the machine outline our own human input and provide structure to our interactions.
Language without flows, flows without language
Whereas a voice interface can be entirely language-based or utterance-based, without flows, it’s not possible to design a voice interface that lacks effective language and utterances. Consider, for instance, a voice interface whose sole concern is issuing the daily weather forecast upon request, a ubiquitous feature in voice assistants. Such an interface does not necessarily need to bring the user down a variety of interaction flows, unless it allows deeper access to other weather information. This one-track voice interface approaches idempotence in that responses of the same nature are always issued; there are no state changes, no changes in intent, and the interaction is identical every time.
Even a simple question-and-answer (Q&A) conversation bot implemented in a voice assistant lacks substantial complexity when it comes to designing flows, as any interactions would not deviate from the typical question–response pattern. On the other hand, it isn’t possible in the voice context to implement a voice interface that leverages flows without language, given once again that language comprises the fabric that ties the interface together gracefully.
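To make the weather example concrete, here is a sketch of roughly the entire logic such a one-track interface needs, using the Alexa Skills Kit SDK for Python; the intent name and hard-coded forecast are hypothetical stand-ins:

```python
from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.handler_input import HandlerInput
from ask_sdk_core.utils import is_intent_name
from ask_sdk_model import Response


class DailyForecastHandler(AbstractRequestHandler):
    """Handles the interface's single intent; there is no flow to traverse."""

    def can_handle(self, handler_input: HandlerInput) -> bool:
        # "DailyForecastIntent" is a hypothetical intent name.
        return is_intent_name("DailyForecastIntent")(handler_input)

    def handle(self, handler_input: HandlerInput) -> Response:
        # A real skill would fetch this from a weather API; hard-coded here.
        forecast = "Today will be sunny with a high of seventy-two degrees."
        # Speak the forecast and end the session: no follow-up is expected.
        return (handler_input.response_builder
                .speak(forecast)
                .set_should_end_session(True)
                .response)
```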
How to write effective utterances
Whereas many conversational interfaces, namely those that are text-based or have the luxury of a visual component, deploy language in the form of written entreaties and responses to the user, utterances are by necessity the primary unit of language in voice interfaces. And because utterances, emitted by any interlocutor, whether human or machine, represent the form elements, buttons, modals, and other interface components that are key to the voice user experience, they need to be well-written, and particularly so for the voice context.
While utterances in the form of language are essential from the standpoint of the voice interface, utterances generated by the user are perhaps even more fundamental to the proper functioning of voice interfaces, because they limn the intents that applications then employ to return a machine-readable response. These intents often contain slots for crucial pieces of information, such as dates and times, that receive handling distinct from general speech recognition and natural language processing.
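As a sketch of how intents and slots are typically declared, here is the shape of an intent from an Alexa interaction model, expressed as a Python dict. AMAZON.DATE is a real built-in slot type; the intent name and sample utterances are hypothetical:

```python
# The shape of an intent declaration in an Alexa interaction model, expressed
# here as a Python dict. The {statementDate} slot is typed as AMAZON.DATE,
# a built-in slot type that resolves spoken phrases like "last Tuesday" into
# ISO dates before skill code ever sees them. Names and samples hypothetical.
GET_STATEMENT_INTENT = {
    "name": "GetStatementIntent",
    "slots": [
        {"name": "statementDate", "type": "AMAZON.DATE"},
    ],
    "samples": [
        "read me my statement from {statementDate}",
        "what was my balance on {statementDate}",
    ],
}
```

Because the platform resolves the slot before the application runs, date handling never has to burden the conversational writing itself.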
Eliciting a desired user response
Machine-generated utterances that outline the manner in which a human user should interact with a voice interface should be extremely well-written. Not only should designers consider potential linguistic ambiguities between words that, rendered through synthesized speech, could prove confusing for listeners; they should also ensure that such utterances serve the purpose of soliciting an adequate user response and outlining context for the user. This is of particular importance in an environment where the user lacks visual cues (apart from light indicators on devices like Amazon Alexa) and cannot glean guidance from visual feedback or derive possible interactions from affordances that are not aural.
Based on a taxonomy of conversational interaction styles from “Conversational UX Design: An Introduction” by Robert J. Moore and Raphael Azar in the anthology Studies in Conversational UX Design (excluding those that are irrelevant to voice), here are a few examples of how well-written machine utterances can ensure an appropriate response from the user:
Welcome! What question can I help answer for you today? [User responds with a question (content-centric).]
Please tell me your phone number. [User responds with a recitation of their phone number (system-centric).]
Hey. What’s on your mind? [User responds with an addition to the conversation à la Grice (conversation-centric).]
Outlining context and eliciting a contextualized response
While these preceding examples serve their purposes of suggesting how the user should respond, they may not offer the full landscape of context the user needs to understand where they are within the interface. In this way, the utterances that our mechanical conversation partners yield also serve the important purpose of encapsulating and scoping the user-generated responses that are returned to the interface.
Utterances articulating the desired specifications for a user’s response, for instance, should clarify at the outset whether a one-word answer is sufficient or a more open-ended response is required for the interface to proceed. Our voice interfaces must explain how a user should format their response in much the same way humans would seek an abridged response, or one adhering to a particular form, from their counterparts. Voice interfaces can thus adhere richly to the cooperative principle—as long as we design them to do so effectively.
Consider also the following examples of machine utterances that serve to limn the contextual milieu for the user (note the scoping language added to each) so that the interface can correctly capture the user’s intent and avoid a switch in context within the user’s mind:
Welcome! What question about our services can I help answer for you today? [User responds with a question about available services (content-centric).]
Please tell me your phone number so we can call you if something happens with your order. [User responds with a recitation of their phone number (system-centric).]
Tell me what you would like to do next with your checking account. [User responds requesting a bank action (system-centric).]
What toppings would you like on that first large pizza? [User responds with a topping from a set list (system-centric).]
In this way, machine-generated utterances can shape and mold the resulting user utterance. By providing additional context and clearly articulating how the user should respond, we can make conversational context much less of an issue.
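To sketch how such a contextualized prompt might be issued in code, here is a handler using the Alexa Skills Kit SDK for Python. The intent name and menu wording are hypothetical; the key move is that the .ask() reprompt restates the scoping context if the user hesitates or stays silent:

```python
from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.handler_input import HandlerInput
from ask_sdk_core.utils import is_intent_name
from ask_sdk_model import Response


class ToppingsPromptHandler(AbstractRequestHandler):
    """Prompts for pizza toppings with the context baked into the wording."""

    def can_handle(self, handler_input: HandlerInput) -> bool:
        # "SizeGivenIntent" is a hypothetical intent name.
        return is_intent_name("SizeGivenIntent")(handler_input)

    def handle(self, handler_input: HandlerInput) -> Response:
        # The prompt itself scopes the expected answer: which pizza, and
        # what kind of response (a topping) is wanted.
        prompt = "What toppings would you like on that first large pizza?"
        # The reprompt restates the context if the user hesitates.
        reprompt = ("You can name any topping, like pepperoni or mushrooms, "
                    "for your first large pizza.")
        return (handler_input.response_builder
                .speak(prompt)
                .ask(reprompt)
                .response)
```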
Human considerations for writing machine utterances
As we know from human spoken language, utterances are not simply about expressing meaning. There is considerable subtext in any utterance, and speech synthesizers have long toiled to provide a suitable emotional approach to how utterances are issued by voice interfaces that have no feelings and no soul. Among other aspects of human language, the machines we are now tasking with interpreting the vagaries of the spoken word need to consider elements of human speech such as style and tone, word choice, and dialect.
Is your interface intended to approximate the experience of conversing with a hotel concierge, a check-in agent at the airport, or a health provider? Depending on the nature of the conversation in a normal social context, voice interfaces can adjust how they approach or respond to users based on the level of formality and honorifics they deploy, the colloquial and slangy registers they may use, and, in situations where the speech synthesizer’s dialect can be controlled, regionalisms that may be appropriate for a certain user base. Consider, for instance, the distance between the following two hypothetical greetings:
Welcome to Delta, where the safety of our customers and crew is our first priority. [Large company represented by a formal register and direct, businesslike tone.]
¡Hola! and welcome to Café Oaxaca, where we treat you like family! [Family-owned restaurant represented by a colloquial register and friendly, gregarious tone.]
Nonetheless, it is critical to recognize that certain characteristics applied to speech synthesizers may in fact intensify or perpetuate biases that voice users have from structural and institutional systems of oppression in society. For instance, it is no accident that most voice assistants employ a feminine synthesized voice as well as a neutral General American dialect. In recent years, navigational interfaces like Waze have begun allowing users to provide their own recordings that replace the default voice recordings, but this phenomenon of voice customization has not yet reached voice assistants in a way that both facilitates inclusive design and shrouds the synthesized nature of machine-generated utterances.
Scrutinize, too, how users respond to the personality of your voice interface, whether it greets them with “Hey y’all” or “Welcome! How may I help you?” Designers should always consider the implications of the language they write in terms of how it is presented in the form of a synthesized voice. Examining how your voice interface adheres to and perpetuates particular stereotypes of voice assistants is an important means of ensuring that your voice design is as inclusive as it is usable.
Eliciting or discouraging user responses in later interactions
As we proceed to the later stages of the voice interaction beyond the initial greeting, we can use Hall’s key moments in Conversational Design as guidelines for how to write machine utterances that not only outline context for users but also illustrate the expected response from users, where one is needed. In many cases, utterances can be employed not only to elicit a response but also to suggest that a user-generated response is not necessary, particularly in the case of mere guidance or feedback.
Consider the following examples, in which the interface deploys a strong distinction between questions that solicit user responses and statements that indicate status or context changes and encourage the user to wait for the next utterance:
Welcome to Main Street Credit Union, your neighborhood credit union! [User waits for system main menu options to be presented (introduction).]
Would you like to check your balance, check your last statement, pay your credit card bill, or learn about other services? [User responds with a choice of system main menu option (orientation).]
One thousand two hundred seventy-eight dollars and twenty-five cents is your current balance. Would you like me to repeat or go back? [User responds with a selection of system action (action).]
Sure! Just one moment. [User waits for next machine utterance (guidance).]
We can see from these machine utterances that in addition to keeping Hall’s key moments in mind, the other two essential objectives of a voice interface’s utterances are to maintain the user’s ability to comprehend their context and to elicit responses adhering to certain characteristics.
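As a sketch of how this question/statement distinction can surface in code, consider two responses built with the Alexa Skills Kit SDK for Python: calling .ask() leaves the microphone open for a reply, while omitting it and ending the session signals that no response is expected. The wording follows the hypothetical credit union above:

```python
# Two responses sketching the question/statement distinction. `builder` is
# a handler's response_builder; wording follows the hypothetical credit
# union examples above.

def orientation_response(builder):
    # A question: .ask() supplies a reprompt and keeps the session open,
    # signaling that a user response is expected.
    return (builder
            .speak("Would you like to check your balance, check your last "
                   "statement, pay your credit card bill, or learn about "
                   "other services?")
            .ask("You can say balance, statement, bill pay, or services.")
            .response)

def closing_response(builder):
    # A statement: no reprompt, and the session ends, so the user knows
    # not to answer.
    return (builder
            .speak("Thanks for banking with Main Street Credit Union. "
                   "Goodbye!")
            .set_should_end_session(True)
            .response)
```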
Conclusion
Voice design is a uniquely challenging arena for designers and user experience specialists not solely due to the novelty of its aural canvas and verbally manipulated elements. The fact that voice design requires designers to leverage a wholly distinct skill set, that of writing, in the context of architecting interfaces and conceiving cohesive experiences, means that the vast majority of the design tropes we are accustomed to in visual and physical interfaces must go out the window.
Instead of form elements and buttons, we must think of how utterances, whether machine-generated guidance or user-generated intents, can best smooth interactions. In lieu of navigation menus and sitemaps, the traditional wayfinding tools of the web, we must consider how our flows lend themselves to better discoverability and greater unidirectionality. In the same way that we must avoid writing incoherently and eschew garden-path sentences in crucial utterances, flows must ensure that the user is never waylaid and never encounters a dead end in the voice interface.
Ultimately, voice interfaces can be extraordinarily rewarding for organizations, not just from the standpoint of digital innovation but also from the perspective of clear differentiation when it comes to user experience and accessibility. Nonetheless, designers interested in pursuing voice as a medium must become well-versed writers. To design the optimal conversational interface, we must wield language and flows. We must put down our well-worn pixels and signposts and immerse ourselves in their conversational counterparts: the aural affordances and abstract flows that voice interfaces truly require.