Can voice assistants displace screen readers?

May 19, 2021
Preorders are now available for my new book Voice Content and Usability, coming June 22nd! Want to know more about A Book Apart’s first-ever title on voice interface design? Preorder the book, learn what’s inside, and subscribe for more insights like this article.
Since the first voice assistants were released to much fanfare, futurists have waxed poetic about the potential for smart speakers and disembodied voices like Amazon Alexa, Google Home, and Apple Siri to revolutionize how we interact with brands as customers through our most human medium: our own voices. But comparatively little attention has been paid to another significant reinvention that could take place in the realm of accessibility: the prospect of voice assistants outperforming screen readers at their own job.
Blind users of websites not only have to contend with missing alternative text and inaccessible markup from the get-go; they also have to wrangle frequently clunky and meandering screen reader output that tests their patience and offers a second-class experience. After all, when it comes to content, the web is a predominantly written, not spoken, conduit for copy and media, which disadvantages users who consume their content solely through aural and verbal means.
Though there is a distinct possibility for voice assistants to usurp screen readers as the primary means of interacting with content on the web, there is still a long way to go before voice assistants are authentically capable of outmoding the screen readers that have long cemented themselves as fixtures of the accessible web. In this article, I want to cover three of the factors standing in the way: the usability of screen readers, the lack of conversation-centric capabilities, and finally, the intrinsic bias that disadvantages not only disabled individuals but also other marginalized groups.
I cover all of these topics at length in my new book Voice Content and Usability, which is the first book from A Book Apart on voice interface design and the first-ever title to hit bookshelves about voice content, including comprehensive coverage of voice content strategy and voice content design, with an eye trained on inclusion and equity. For my full discussion on this topic, preorder the book today, find out what’s inside, and subscribe to my newsletter for even more insights.
Screen readers can be clunky for written content
For Blind and low-vision users, the primary method of interacting with written content and visual media has been the screen reader, which is responsible for transcribing the contents of a website into synthesized speech—text read aloud. But there's one big problem with screen readers: they're talky and seldom deliver the information Blind and low-vision users need right away, because many visually rooted obstacles stand in their way, like "Skip to main content" links and other puzzling pronouncements that waylay disabled users.
To illustrate this, accessibility advocate and voice engineer Chris Maury shared his experience interacting with screen readers:
"From the beginning, I hated the way that Screen Readers [sic] work. Why are they designed the way they are? It makes no sense to present information visually and then, and only then, translate that into audio. All of the time and energy that goes into creating the perfect user experience for an app is wasted, or even worse, adversely impacting the experience for blind users."
If designed appropriately, voice interfaces optimized for the spoken word from the outset could be much more powerful and efficient than screen readers, which have multiple challenges to contend with when it comes to voice content. Without the navigational elements and other motifs of the web standing in their way, Blind and low-vision users can home in on the content they need quickly, without having to brush aside breadcrumbs or lunge past links that are much quicker to navigate in a visual browser than through a verbal screen reader.
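To make the contrast concrete, here's a minimal sketch of the difference between the two models. The page structure, element roles, and matching logic are hypothetical simplifications for illustration, not any real screen reader's or voice assistant's behavior: a screen reader announces a page's elements in linear order, navigation and all, while a content-first voice query can jump straight to the answer.

```python
# Hypothetical, simplified model of a page as a linear list of elements,
# roughly the order a screen reader would announce them in.
page = [
    ("link", "Skip to main content"),
    ("nav", "Home"),
    ("nav", "About"),
    ("breadcrumb", "Home > Services > Licenses"),
    ("heading", "Renewing your license"),
    ("content", "You can renew your license online or by mail."),
]

def screen_reader_output(elements):
    """Announce every element in document order, navigation included."""
    return [f"{role}: {text}" for role, text in elements]

def voice_answer(elements, query):
    """Skip straight to content elements whose text overlaps the query."""
    words = set(query.lower().split())
    for role, text in elements:
        if role == "content" and words & set(text.lower().split()):
            return text
    return "Sorry, I couldn't find that."

# The screen reader user hears six announcements before reaching the answer;
# the voice interface user gets the content in a single turn.
print(len(screen_reader_output(page)))         # 6 announcements
print(voice_answer(page, "renew my license"))  # the content, directly
```

The toy example only shows the shape of the argument: every element the screen reader must announce is one more obstacle between a Blind user and the content itself.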
This is another example of the ways in which spoken content and written content differ so considerably, and it illustrates why we as voice interface designers and web designers alike need to pay close attention to the unique differences between conversational interfaces that leverage the spoken word and those that employ the written word. As I wrote in my article introducing my new book Voice Content and Usability, voice assistants aren't like other conversational interfaces, as much as conversation designers would like to unify the experiences they design into a single format appropriate for both.
Are voice assistants optimized for spoken content?
Though it stands to reason that voice assistants can spirit users to their desired content more quickly than screen readers ever could, there's something odd about reducing the entirety of a website's richness to a series of isolated back-and-forths with a voice interface, beyond the distinctions between written and spoken conversation. Any voice assistant that delivers a website's content in spoken fashion has to be capable of traversing the entire corpus of that web-based information.
There’s one key difference between screen readers and voice assistants that many proponents of voice assistants gloss over: While it’s completely clear to users that screen readers are limited to the jurisdiction of the website’s content itself, voice assistants like Amazon Alexa and Google Home in fact compete with one another on the basis of their ability to access the entire trove of information available on the web. In other words, screen readers are inherently very clear about the borders they cannot cross; voice assistants, meanwhile, argue that those borders shouldn’t exist in the first place.
One of the biggest barriers facing web designers and content strategists looking to make the leap into spoken content and give their content a voice is that their purview doesn't extend into the furthest reaches of the information web. Our perspective is usually limited to the content we manage, the terms we index, and the breadcrumbs we lay, not the billions of pages that may tangentially relate to our area of focus.
Conversation designers have long been excited about the potential for the conversational singularity, the point in time when those boundaries that are very real to screen reader users but arbitrary to voice assistant users simply disappear. Reaching a state where we can feasibly have a conversation about anything even remotely related to our topic of discussion, however, is still a distant proposition, because voice assistants still use “installable skills” or “apps” to draw lines in the sand between themes like ordering takeout or playing Jeopardy.
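Those lines in the sand can be sketched as a simple intent router. The skill names, vocabularies, and matching logic below are hypothetical illustrations, not any platform's real API: each "skill" only understands utterances within its own narrow domain, and anything outside every skill's boundary falls through to a generic fallback, which is exactly why a freewheeling conversation about anything remains out of reach.

```python
# Hypothetical sketch of skill-based routing: each "skill" claims a narrow
# domain via a keyword vocabulary, mirroring how today's assistants
# partition conversation into installable apps.
SKILLS = {
    "takeout": {"order", "pizza", "delivery", "takeout"},
    "trivia": {"play", "quiz", "trivia", "jeopardy"},
}

def route(utterance):
    """Return the skill whose domain the utterance falls into, if any."""
    words = set(utterance.lower().split())
    for skill, vocabulary in SKILLS.items():
        if words & vocabulary:
            return skill
    # Anything outside every skill's boundary: the conversation dead-ends.
    return "fallback"

print(route("order a pizza"))     # routed to the takeout skill
print(route("play some trivia"))  # routed to the trivia skill
print(route("renew my license"))  # fallback: no skill claims this domain
```

Real assistants use trained language models rather than keyword sets, but the boundary problem is the same: an utterance that no skill claims simply has nowhere to go.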
I explore futuristic forecasts like the arrivals of the conversational singularity and true conversation-centric design (that moment when voice assistants are capable of truly extemporaneous dialogue) in my new book Voice Content and Usability. Preorder a copy for you and your team, see what people are saying about it, and sign up for more insights about these topics.
But there are other bumps—and biases—in the way
It’s all well and good to drive toward a future where voice assistants outperform screen readers at their own game of delivering written content, but at what cost? The breathless discourse currently overtaking the world of conversation design and voice interface design touts the impending conversational singularity and victory of voice assistants, but are these newfangled devices actually deepening exclusion rather than advancing equity? For one, voice assistants aren’t usable by Deaf or hard-of-hearing individuals.
Voice assistants today occupy a strange position in our social consciousness, a place that screen readers have never found themselves. When you think of a disembodied voice like Apple Siri, Amazon Alexa, or Microsoft Cortana, who is the person you picture in your mind? The way our voice assistants are inherently portrayed in our minds as secretarial women is a step backwards from the configurable voices on screen readers and their relative lack of personification. In other words, we don’t imbue JAWS or ChromeVox with the same human identities that are now inextricably linked with Alexa, Siri, and Cortana.
So this raises a fundamental question about our responsibility as designers and technologists. Even as we advance innovation in voice assistants and make our content more equitable for disabled users who are frustrated by screen readers, are we reversing progress in equity in other areas of our society? Improving user experiences to be more accessible for one marginalized group should not come at the expense of another oppressed minority, even if the links between them aren’t as clear-cut.
The discussion about the impact of voice assistants on our deeply held biases and the potential revolution in accessibility that could come thanks to voice assistants is just beginning, but I’m excited to participate in this important conversation about how accessibility and equity are far more multifaceted than simply granting access to content in more efficient ways. We must widen access without entrenching exclusion elsewhere.
For my full view on the problems less talked about in voice interface design, I tackle these topics head-on at length in Voice Content and Usability.
Though it’s clear we still have a long way to go before voice assistants truly outpace screen readers in their own race when it comes to enabling accessibility, there are some intriguing forays in the right direction. My new book Voice Content and Usability delves into Ask GeorgiaGov, the first-ever voice interface built for the residents of the state of Georgia, and its progression from idea to prototype to fully fledged voice assistant. It was just one facet of Digital Services Georgia’s ongoing efforts to empower all Georgians with the ability to access the content they need, however they feel most comfortable.
My new book Voice Content and Usability covers the entire journey of voice interface design and voice content strategy, from planning your corpus of content all the way to implementing it and testing it in a smart speaker or voice assistant. It also comes packed with illuminating insights from our voyage designing and building Ask GeorgiaGov, which was also one of the first-ever content-driven voice interfaces. In my next article, I’ll share more about how informational and transactional voice interfaces differ and why it’s important to understand the difference.