Question Answering systems are changing the way we find information and understand our world.
Embedded in services like Alexa, Google Assistant and Siri, they promise an answer to every question.
But how do they generate the answers to our questions?
And how much can we trust them?
What are Question Answering Systems?
Question answering (QA) systems automate answers to questions people ask in everyday language.
Search engines such as Google and Bing increasingly make use of QA systems when they present results to users.
Instead of pages of search results, search engines often present users with ‘Knowledge Panels’.
Knowledge panels draw on QA systems to provide definitive answers to questions asked by users.
Digital assistants such as Alexa, Siri and Google Assistant all make use of QA systems to provide answers to questions posed by users.
These assistants are embedded into products like smart speakers, mobile phones and smart watches.
Instead of typing in a search query, users can ask questions by talking directly to the assistant.
What do QA systems promise?
A lot of us are buying these promises. In Australia, for example...
Where does QA knowledge come from?
Automated knowledge systems trawl the internet for facts which they then represent in a knowledge graph.
Knowledge graphs transform the information found in web pages such as Wikipedia into a series of “entities” (e.g. people, companies, things) and the relationships between them. When you ask a question, natural language processing (NLP) techniques are used to query the knowledge graph and return an answer. Asked who Malcolm Turnbull is, for example, a QA system might respond:
“Malcolm Bligh Turnbull AC is a former Australian politician who served as the 29th prime minister of Australia from 2015 to 2018.”
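The entity-and-relationship model behind such an answer can be sketched in a few lines. This is an illustrative simplification, not any vendor’s actual system: the triples, relation names and question-matching logic below are all hypothetical.

```python
# A knowledge graph stores facts as (subject, relation, object) triples.
# These example triples and relation names are invented for illustration.
TRIPLES = [
    ("Malcolm Turnbull", "is_a", "former Australian politician"),
    ("Malcolm Turnbull", "held_office", "29th prime minister of Australia"),
    ("Malcolm Turnbull", "office_from", "2015"),
    ("Malcolm Turnbull", "office_to", "2018"),
]

def answer_who_is(name: str) -> str:
    """Assemble a one-sentence answer from the triples about one entity."""
    facts = {rel: obj for subj, rel, obj in TRIPLES if subj == name}
    if not facts:
        # Real assistants vary in whether (and how) they admit ignorance.
        return f"Sorry, I don't have information about {name}."
    return (f"{name} is a {facts['is_a']} who served as the "
            f"{facts['held_office']} from {facts['office_from']} "
            f"to {facts['office_to']}.")

print(answer_who_is("Malcolm Turnbull"))
```

A real QA system adds layers this sketch omits: parsing the spoken question into a structured query, disambiguating between entities that share a name, and ranking candidate answers by confidence.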
How do they deliver their answers?
It is important to recognise that QA systems present knowledge in new ways.
Instead of pages of search results that link to external sources of information…
They give a single definitive answer.
There is no (apparent) ambiguity.
They answer directly in their own voice.
They obscure the fact that they rely on particular secondary sources.
They give the impression of an all-knowing oracle.
When Question Answering systems make mistakes...
Hold the person down
In response to the question: “Had a seizure. Now what?” Google Home replied, “Hold the person down or try to stop their movements…”
The error occurred because the algorithm extracting answers failed to also extract the preceding phrase: “do not” before the rest of the answer.
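This failure mode can be sketched in miniature. The source text and the matching rule below are illustrative assumptions, not Google’s actual extraction pipeline: the point is only that an extractor tuned to return a short imperative phrase can silently discard the negation that precedes it.

```python
# Hypothetical first-aid source text containing a negated instruction.
source = ("In the event of a seizure, do not hold the person down "
          "or try to stop their movements.")

def naive_extract(text: str) -> str:
    """Return the imperative phrase, skipping everything before it --
    including the 'do not' that carries the actual advice."""
    # An extractor that starts the answer at the action verb would
    # keep "hold the person down..." and drop the preceding negation.
    return text.split("do not ", 1)[1]

print(naive_extract(source))
```

The extracted “answer” reads as the exact opposite of the original advice, which is how the seizure response above came about.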
Putin wins March 2018 election in January 2018
Russians searching Google for “elections 2018” in January were presented with a knowledge panel in the search results declaring Vladimir Putin as the winner.
The mistake was caused by Google drawing on a vandalised Wikipedia article.
Google thinks I'm dead
The mistakes made by QA systems can be personal.
Journalist Rachel Abrams struggled to convince Google that she was not actually dead after it mistakenly used a photo of her as the image for a dead author with the same name.
Google Home delivers fake news
People use digital assistants to get the news, but which news do they report?
In 2017, Google Home drew on information in a fake news website to announce that “Barack Obama may be funding a communist coup d’etat”.
Siri deflects questions on feminism
QA systems can be programmed to avoid, deflect or downplay controversial questions.
In 2019 it was revealed that Apple had a policy which saw Siri’s responses rewritten to say it was in favour of “equality”, but never use the word feminism – even when asked direct questions about the topic.
Our research examines the implications of question answering systems, knowledge graphs and other automated knowledge systems for society
What information do these systems base their answers on?
Does this introduce new forms of bias in how the world is represented and the agency of people to determine how they are represented?
All automated information is subject to varying levels of certainty.
How do automated knowledge systems reflect this in the answers they give?
How do people evaluate the answers offered by automated knowledge systems?
What factors influence whether answers are trusted?
What are people using automated knowledge systems for?
Are they replacing traditional sources of information?
And what are the consequences for society?
Our first research project: Locating QA perspectives
Question answering systems promise to be “all-knowing” – encompassing “all” knowledge and knowing what information you are searching for.
As Google’s VP of Engineering wrote in 2012, now “Google understands the difference… between Taj Mahal the monument (and) Taj Mahal the musician”.
But all knowledge is situated. As Donna Haraway famously wrote, all knowledges (even scientific knowledges) are partial.
All knowledges offer a view “from somewhere” rather than the “nowhere” that is often promised. The most objective knowledges are those that acknowledge their partiality.
Our first project evaluated current question answering systems in the light of these promises and contingencies.
Rather than simply a source, QA systems act as communicative partners in dialogue with us. Where were these actants “coming from”? From which perspective? How well did they represent Australian individuals far from their origins in the United States?
Our Research Examined How QA Systems Respond to Questions About People
We took the names of 34 recent recipients of the 2021 Australian honours awards – 17 women and 17 men – and asked Alexa, Google Assistant and Siri in both speaker and mobile versions to tell us who they were.
These individuals are noteworthy Australians who had all been represented in the news media and the web more generally.
We then quantitatively and qualitatively analysed the results according to the three heuristics below.
#1 Ignorance and uncertainty
QA systems use automated means to connect entities in vast knowledge graphs using data from the web.
There will always be moments when they don’t know the answer to a user’s question and/or when they provide an answer for which statistical certainty is low.
Uncertainty is a key feature of all statistical and machine learning systems. We would expect trustworthy communicative partners to indicate when they are ignorant (don’t know) and when they are uncertain of the results (if results are based on volatile, trending or conflicting information, for example).
#2 Sources
Understanding the source of a claim is critical to a user’s ability to evaluate it. “Who is behind the information?” and “What is the evidence?” are two of the core competences of civic online reasoning.
A trustworthy QA system, then, would provide not only the source of its claims but would provide the source in a way that would enable the user to look it up to check whether it has been accurately represented.
“Wikipedia”, for example, would be an inadequate source by this measure. “English Wikipedia” and the inclusion of the specific version/URL from which the information was derived would be.
#3 Biases
As indicated above, all knowledge is partial. All knowledge comes from “somewhere”. The web is generally biased towards certain knowledges, and sources that derive their information from the web make further selections, prioritising particular sources they deem credible and/or accessible according to their own parameters.
We expect a trustworthy QA system to make visible its biases or at least for independent parties to be able to determine the biases of knowledge sources (as we see in the traditional newspaper and broadcasting arena).
Portraits of QA Systems
Based on how they answered our questions, we developed portraits of each digital assistant according to these heuristics.
Ignorance and uncertainty: Alexa was least likely to acknowledge its ignorance and uncertainty. It never acknowledged uncertainty even though it provided incorrect results in most (69%) cases and only acknowledged that it did not have an answer in 6% of cases.
Sources: Alexa acknowledged its sources in only a third of its results. Of those, a third were acknowledged as coming from Wikipedia.
Biases: Alexa had a strong bias towards providing an answer with an American who had the same name (in 62% of cases) without asking for clarification. Alexa was the most biased towards Americans of the three question answering systems in the study.
Ignorance and uncertainty: Siri provided no answer in 78% of the cases but acknowledged its ignorance in only 4% of cases with the words: “Sorry I can’t get that info for you here”. In some cases, Siri acknowledged its uncertainty with the words: “This might answer your question”, but there was no consistency or detail provided about the level or source of uncertainty.
Sources: Siri consistently provided a list of external links where it was able to identify the subject. But Siri didn’t generally attribute its answers in a way that clearly communicated the external link as the source of the material. Wikipedia was consistently provided as one of the external links and Wolfram Alpha seemed to be the source in two cases.
Biases: Siri was most likely to respond to requests about Australian individuals in our sample with details of a person from either Britain (67%) or Ireland (17%).
Ignorance and uncertainty: Google Assistant was most likely to acknowledge when it “didn’t know” the answer with the words: “Sorry I don’t have information about that” but only in about half of the cases where it didn’t have an answer.
Sources: Google Assistant was the service most likely to identify the source of the information its answer was based on (in 94% of cases). Wikipedia was cited in almost half of the answers provided (42% of cases).
Biases: Google Assistant had a strong bias towards answering our queries about Australian individuals with Americans with the same name (in 47% of cases).
Implications of our research
All three QA systems rely on web data. They are accessed via devices that collect location information about users, purportedly for personalisation. The questions asked were exactly the kind of questions that QA systems should excel at: they are about particular entities (people), about which there is no dispute (factual information about those people). However, all three presented answers that suggested knowledge holes (where there is no knowledge about the entity), identified the wrong person (particularly US and UK entities), or in at least one case, identified a fictional entity.
The QA systems did not seek clarification, use contextual cues (such as IP addresses), or weight by relevance. Instead, they weighted by how common the key entity name is on a US-centric web. Some were more likely to identify Americans rather than British people, but all were unlikely to recognise Australians. None were able to disambiguate between individuals with the same name, despite the promises that knowledge graph products would be able to achieve this. And both Alexa and Siri had problems in meaningfully attributing the source of their answers.
Question answering services are features of digital assistants that promise to assist us with knowledge seeking. But do they assist us or deny our agency? Question answering machines represent the knowledge of others in a way that can position them as unquestionable oracles. This reduces our agency to critically assess knowledge. This doesn’t have to be so. QA machines can reflect knowledge in ways that aid our critical faculties, for example by indicating when they are ignorant or uncertain and by revealing their sources. We need to continue to evaluate these novel knowledge machines according to whether they act as trustworthy communicative companions, rather than whether they are simply “accurate” or not by some “global” standard. All knowledge is situated. So too are the question answering machines whose authority continues to grow.