Register, diglossia, and why it's important to distinguish spoken from written conversational interfaces

June 2, 2021

Preorders are now available for my new book Voice Content and Usability, coming June 22nd! Want to know more about A Book Apart’s first-ever title on voice interface design? Preorder the book, learn what’s inside, and subscribe for more insights like this.

One of the more captivating and joyous developments in the conversation design landscape has been the advent of cross-platform frameworks that allow designers to create a single conversational interface that then manifests as an Amazon Alexa skill, a written chatbot, and a smartphone textbot. These conveniences allow users to pick any device of their choice to interact with largely the same conversational experience. But are there unintended side effects from what at face value appears to confer immense improvements in efficiency and user experience?

The biggest risk of deploying nearly identical conversational interfaces for the spoken and written mediums simultaneously is glossing over the subtle nuances that separate voice interfaces from their textual counterparts. As conversationalists, we seldom use the same words and vernacular in written conversation via text message as we do in spoken conversation. No one says “LOL” out loud except ironically, and most English speakers write “literally” far less often than they say it aloud.

While cross-platform conversational interfaces can accelerate the launch timetable and facilitate a more seamless experience for users across a variety of settings, they run the risk of washing away the crucial but underrecognized distinctions between how we use spoken and written language. More importantly, however, the distance between spoken and written language can be even more pronounced in languages other than English, a phenomenon known as diglossia. By implicitly mandating cross-platform approaches, we may in fact be making it more challenging for conversation designers around the world to succeed, especially those working in highly diglossic languages.

In this article, I want to cover some of the key differences between spoken and written conversational interfaces, which I discuss with more depth in my new book Voice Content and Usability, A Book Apart’s first-ever book on voice and the first and only book about voice content, including voice content strategy and voice content design. Preorders are open, but you can also check out what’s inside and subscribe to my newsletter for even more insights like this.

Untangling types of conversational interfaces

Before we delve into why understanding their differences is so important, we need to ensure a common understanding of what spoken and written interfaces entail in the first place. Conversational interfaces are interfaces that operate on the medium of conversation to facilitate user interactions, whether written, spoken, or both. They can be part of larger multimodal interfaces (such as a dismissible chat window on a marketing website) or contain smaller visual interfaces (such as conversational forms).

Instead of leveraging tools that are fundamentally artificial, like keyboards, mice, and touchscreens, conversational interfaces stand alone in their embrace of natural human communication as the primary means of interaction. But voice interfaces take things a step further by winnowing away all of the physical elements—buttons, icons, affordances—that typically mediate our interactions with the computer.

Written conversational interfaces

Written conversational interfaces, which run the gamut from web-based chatbots and SMS (text messaging) bots to Slackbots and Facebook Messenger bots, usually require a keyboard as an input method, speech-to-text dictation notwithstanding. Though written chatbots engage in natural human conversation, they always have the unique privilege over speech-based interfaces of a visual fallback—namely, written text.

Normal spoken conversation differs in striking ways from typical written conversation. In our instant messages and text-message conversations, we have access to a running backscroll of previous context and chat history as convenient references, an ironclad archive of past and present interactions. As a result, a user conversing with an airline about a cancelled flight on Facebook Messenger will have a wholly different experience from someone chatting with an air carrier’s Alexa skill instead.

The permanent staying power of written conversation over the fleeting transience of spoken conversation lends textbots, chatbots, and other written conversational interfaces considerable privilege over their spoken counterparts. This is why the greater the distance between your spoken and written conversational experiences, the more context-switching the user will need to do as they toggle between them. Hence the desire for a more unified approach to implementing conversational interfaces that enables cross-platform implementations.

Multimodal conversational interfaces

Written text isn’t the only possible fallback for conversational interfaces. Whether it’s a series of radio buttons to derive user-generated input or a simple set of buttons to allow the user to select one option among three, conversational forms interpolated into chatbots represent not only a crutch for conversational designers but also an arguable intrusion of “artificial” interactions like form-filling and button-clicking into an organic, human-led conversation. Nonetheless, conversational form elements and other visual interface components can certainly enrich and deepen how people converse with your chatbot.

But just as we rarely resort to visual aids in social conversation among friends, written conversational interfaces do themselves a potential disservice by resorting to these “familiar” fallbacks, which reflect artificial interaction rather than natural interlocution. Such a fusion of natural and artificial elements in an interface can also limit that chatbot’s ability to easily transform into a speech-based interface without substantial refactoring.

As visual elements continue to infiltrate chatbots, they’re entering voice interfaces as well, most notably in devices like the Amazon Echo Show, which adds a screen to Alexa. Such multimodal interfaces with both spoken and visual components can address many usability issues associated with typical voice interfaces, as Michael Cohen, James Giangola, and Jennifer Balogh write in Voice User Interface Design. Even a minuscule screen containing some visual elements can alleviate the user’s concerns about the ephemerality of speech.

Voice interfaces

At their core, voice interfaces, also called voice user interfaces (VUIs), employ speech to support the user in reaching their goals. According to Randy Harris in Voice Interaction Design, pure voice interfaces, limited from the user’s standpoint to speech recognition (input) and speech synthesis (feedback), in addition to underlying business logic, are distinct from multimodal voice interfaces like the Amazon Echo Show. While my book Voice Content and Usability, available to preorder and preview, covers voice interfaces writ large, it focuses most of its attention on pure voice interfaces.

Moreover, because chatbots, conversational forms, and the Amazon Echo Show are, at their core, graphical conversational interfaces, users can consume much more information displayed simultaneously rather than relying on their more tenuous working memory. Like written conversational interfaces, multimodal interfaces are also capable of preserving more context thanks to semipermanent archival methods like chat histories and backscrolls within their visual screens.

Spoken conversational interfaces differ from written conversational interfaces

It isn’t necessary to belabor the point that conversational interfaces operate on a fundamentally different dimension from the interfaces we work with on a daily basis on desktops and smartphones. But less clear-cut is the notion that written conversational interfaces and their spoken counterparts require different approaches to the language within, because how we speak differs at a profound level from how we write, even if it’s a matter of writing an informal e-mail or thank-you card.

Chatbots, Slackbots, textbots, Facebook Messenger bots, and WhatsApp bots all have the privilege of leveraging written text and, potentially, conversational forms or visual components that reduce the reliance on dialogue exclusively. Whereas written interfaces can display a form or use visual affordances, spoken interfaces must deploy spoken dialogue only, not just to give the user much-needed feedback and information but also to facilitate the navigation of the interface itself. In this way, the utterances machines wield and the language designers write have far more importance in voice interfaces that lack a backscroll and chat history than in written interfaces that provide a running archive of a long-running conversation.
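
One way to picture this difference is a renderer that turns the same question and options into either a written message with buttons or a single spoken utterance. What follows is only a minimal sketch under stated assumptions: the function names `render_choice` and `join_spoken`, the `medium` values, and the shape of the returned dictionary are all hypothetical and not taken from any particular framework.

```python
from typing import Dict, List


def join_spoken(options: List[str]) -> str:
    """Join options into a phrase that can be read aloud, e.g. 'A, B, or C'."""
    if not options:
        return ""
    if len(options) == 1:
        return options[0]
    if len(options) == 2:
        return f"{options[0]} or {options[1]}"
    return ", ".join(options[:-1]) + f", or {options[-1]}"


def render_choice(medium: str, question: str, options: List[str]) -> Dict[str, object]:
    """Render one choice prompt for a hypothetical 'written' or 'spoken' medium."""
    if medium == "written":
        # A chatbot can show the question and let buttons carry the options.
        return {"text": question, "buttons": list(options)}
    if medium == "spoken":
        # A pure voice interface has no visual fallback, so the options
        # have to be carried by the dialogue itself.
        speech = f"{question} You can say {join_spoken(options)}."
        return {"speech": speech, "buttons": []}
    raise ValueError(f"unknown medium: {medium}")
```

In this sketch the written rendering can stay terse because the buttons do the navigational work, while the spoken rendering has to fold navigation into the utterance, which is why every extra option makes the voice version longer to listen to.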

This trait also places even more pressure on voice designers to scrutinize foibles unique to speech such as unusual pronunciations of neologisms or uncommon words, as well as disambiguation of homophones or near-homophones for terms that sound similar. In addition, conversation designers need to consider whether certain words come across as a bit too formal for voice interfaces but sound perfect for a written chatbot. I examine both of these areas in my new book Voice Content and Usability, available for preorder or preview. And this is where we turn our attention next: the sticky issues of register and diglossia, which are established topics in linguistics but less well-known in conversation design.

Why conversation designers need to know about register and diglossia

In linguistics, register is the term given to a variety of language that’s used for a particular setting or situation, like delivering a lecture at a university or having a punny chat in an in-group environment. Many people switch between a variety of registers on a regular basis, and we do it in English all the time when we toggle between speaking to audiences, speaking with colleagues, and speaking with our families at home. For some communities, especially immigrant, bilingual, and marginalized communities, code-switching between registers or even entire languages is a common occurrence.

Register is particularly important for conversation designers to understand, because how we expect our conversational interfaces to speak to us is a complex calculus that depends on our surrounding community and the sort of conversation we’re having at that moment. Although many voice interfaces default to a friendly, approachable register when it comes to the words they use, certain kinds of informality or colloquialisms may not be appropriate for their purpose at all.

This becomes an especially dangerous thicket of problems when it comes to the delicate balance conversation designers must strike when crafting interfaces for languages that exhibit a high degree of diglossia, another linguistic term that describes a scenario in which two remarkably different dialects or languages are used by a single community in socially distinct situations. Diglossia is when there is such a significant gap between registers that they have become varieties in their own right, typically one that’s more literary or formal and another that is more colloquial or informal.

In many cases, one variety is used when writing, sometimes even in informal writing, and the other is used in speech and seldom written. In diglossia, it isn’t enough to learn a single grammar or lexicon; learning only the colloquial variety isn’t enough to communicate at the level of fluency that societies often demand, meaning you need to keep two forms of language top of mind at all times depending on the situation.

Languages like Greek, Arabic, and Brazilian Portuguese all exhibit a strong degree of diglossia and necessitate a careful and well-considered approach to conversation design. Consider these two examples of Brazilian Portuguese, the first demonstrating the spoken variety and the second the written variety. Both mean the same thing in English: “We can’t wait to welcome you again!”

Spoken: A gente não vê a hora de te receber de novo.
Written: Não vemos a hora de recebê-lo novamente.

For instance, many Brazilian Portuguese conversational interfaces still use formal modes of speech in day-to-day situations where a more informal register might be more appropriate. This exposes a limitation of the rigid paradigm imported from Anglophone interfaces that expects most differences between written and spoken language to be relatively minor. In these diglossic languages, what reads well written might not sound so great when recited, putting at risk the hard work designers have done to build once, deploy everywhere, regardless of whether the destination is a chatbot, voice interface, or textbot.
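
To make the build-once problem concrete, here is a minimal sketch of a message catalog that keeps separate spoken and written variants for diglossic languages while letting other languages share a single string. It reuses the two Brazilian Portuguese sentences above; the catalog layout, the `get_message` helper, the `"any"` fallback key, and the locale tags are assumptions made purely for illustration, not the API of any real cross-platform framework.

```python
from typing import Dict, Tuple

# Keys are (locale, medium). The pt-BR strings are the two examples above;
# the key scheme and the "any" fallback are hypothetical.
MESSAGES: Dict[str, Dict[Tuple[str, str], str]] = {
    "welcome_back": {
        ("pt-BR", "spoken"): "A gente não vê a hora de te receber de novo.",
        ("pt-BR", "written"): "Não vemos a hora de recebê-lo novamente.",
        ("en", "any"): "We can't wait to welcome you again!",
    }
}


def get_message(key: str, locale: str, medium: str) -> str:
    """Return the variant for this locale and medium, else a shared fallback."""
    variants = MESSAGES[key]
    # Prefer a medium-specific variant, then one shared across mediums.
    for candidate in ((locale, medium), (locale, "any")):
        if candidate in variants:
            return variants[candidate]
    raise KeyError(f"no {key!r} variant for locale {locale!r} and medium {medium!r}")
```

With a structure along these lines, a voice channel asking for `get_message("welcome_back", "pt-BR", "spoken")` would get the colloquial sentence and a chatbot asking for the `"written"` medium would get the formal one, while the English entry serves both. Raising an error for a missing variant, rather than silently reusing the other medium's text, is a deliberate choice here so that gaps in a diglossic language surface during design instead of in front of users.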

Diglossia isn’t a main focus of my book Voice Content and Usability (preorder or preview), which emphasizes how voice interfaces operate in English, perhaps to a fault. Nonetheless, conversation designers who work in multinational corporations or who expect their interface to be translated into languages other than English should cater their approaches more tightly to the preferences of language communities by considering how diglossia impacts their users. Whether it means a register used solely by queer communities or a “high” variety used even in informal writing, the worst-case scenario is to end up with a conversational interface that sounds strange both in written chatbot form and in spoken voice form due to the lack of distinction between them.

Conclusion

In many ways, conversation designers working predominantly with written conversational interfaces have a deep and enduring advantage thanks to the permanence of the written word, the set-in-stone ink of today’s devices. Pure voice interfaces, on the other hand, have no textual equivalent and operate much like the user having a phone call with an invisible interlocutor. But simply catering to the unique technical distinctions between spoken and written conversational interfaces isn’t enough.

In conversational interfaces, linguistic register plays an outsized role in determining how users will respond to a given conversation and whether they’ll end up feeling it’s too stilted or too intimate for the experience they expect. And complicating matters even further is the widespread phenomenon of diglossia in languages besides English, a tongue in which many conversational interfaces still find themselves deeply rooted. If you’re catering to an international audience, or a diverse set of communities, paying proper attention to matters of register and diglossia will go a long way toward ensuring the success of your conversational interface regardless of where it lives.

For more insights like this, preorder my new book Voice Content and Usability from A Book Apart, check out what’s in the book, and subscribe to my newsletter.

preston.so