The accuracy dilemma: trading search for speed

Midjourney prompt: AI as a search engine

Are we trading the ease of natural language interaction for less accurate results?

Everyone has read that AI is changing how we search the internet, promising intelligent answers to our routine questions asked in natural language.

When testing the accuracy of results, it is usual to ask straightforward questions and check that the answer matches what we expect. What is 2+2? If the answer is not four, then something is wrong. Yet basic question-and-answer sets such as these do not truly test the complexities of an AI response mechanism.

Don't worry. This article is not about to become a scientific study of testing AI algorithms. Instead, it will focus on a general question with a specific element: time. 

Automated systems may produce accurate-sounding results, yet, because of how they learn, they tend to favour the consistency of previous responses over recency.

2+2=4 today, yesterday, tomorrow, and last year. It is a constant. But what if you ask who won last night's Eurovision Song Contest? Search engines typically respond immediately with the most popular and most recent results. Accuracy, relevance, and recency all factor into how results are presented to users. With AI, that is no longer guaranteed.

Some engines have fixed time cut-offs, with no data after a particular date, and may state this in their responses. Yet not knowing anything after a specific date still does not prevent them from providing an accurate-sounding answer.

Others engage across the live internet, but their learning models place different weightings on recent information versus the larger volume of older data.

Let's look at the English Premier League, the biggest football league in the world. As of 14th May 2023, the top of the league table was:

  1. Manchester City 82 points

  2. Arsenal 81 points

  3. Newcastle 66 points

  4. Manchester United 66 points

  5. Liverpool 62 points

Premier League table, taken from the Premier League website at Premier League Table, Form Guide & Season Archives

If we enter "What are the top 4 teams in the Premier League table" in a 'traditional' search engine, then we should, and do, get page links showing the same league table from BBC Sport, Sky Sports, and similar sites. The top four teams are Manchester City, Arsenal, Newcastle, and Manchester United.

Now, what happens if we ask the same question of an AI-powered search engine or a transformer-based LLM?

The first answer:

"As of 14th May 2023, the current top four teams in the Premier League are:

  1. Manchester City

  2. Liverpool

  3. Chelsea

  4. Arsenal

These teams have qualified for the Champions League next season."

Another responded with:

"The current top four teams in the Premier League are Manchester City, Manchester United, Liverpool and Chelsea"

This second example included a "Learn More" link and listed 20 websites. Any user would assume that those 20 websites supported this statement of the current top-four standings.

Click on those links, and you will find the first page dates from August 2021, as that model only referenced data up to that point; however, that was never made clear in the response.

As a Liverpool fan, I was very excited to see my team shoot from 5th to 2nd overnight. Also, being a Liverpool fan, I knew this was a completely wrong statement, but one delivered entirely convincingly.

It is possible that a natural language query such as "Who are the top teams in the Premier League?" led to a confused answer. Whilst Arsenal and Newcastle may be in the top four now, they are not, historically, "top" Premier League teams. Chelsea and Liverpool may own those credentials based on their long-term success in the league, at least in some opinions. The AI may be providing a view over a period of time rather than at a specific moment.

Not so: the word "current" clearly placed the time reference at today, 14th May, and the query about the table should have been recognised as a specific question, just as the 'traditional' search engines treated it.

This easily tested question was not asking for an opinion but rather for an accurate response at a defined moment.

Therefore, users need greater caution with more complicated questions. A football fan would quickly spot that Liverpool's season has been (relatively) terrible and that they are not in the top four of the table.

Would a non-football fan know the same thing? How often do people use a search engine or, increasingly, an AI system precisely because they do NOT know the answer, or do not know enough about a subject to assess whether a response is right or wrong? That dilemma is the basis of most search engine queries: tell me something I do not know.

Is this a catastrophic problem? Probably not. AI search development is still early, yet already available for general use. AI search will learn and adapt its responses. The mere act of my querying, challenging, and asking about the Premier League is probably already leading those systems to at least question themselves on this subject. Clearly, the future of search is AI-empowered.

Another query, which country won Eurovision 2023, generates more consistent results: "Sweden's Loreen" is the response from both traditional search and AI search.

However, it reinforces a critical rule about using Generative AI and Large Language Models. The responses generated to your queries are not always facts, but opinions shaped by bias in the underlying data, the tool's algorithm, or your question.

Worse, they will often be presented as facts and, worryingly, accompanied by items that look like supporting evidence but do not actually reinforce the answer.

As such, AI-powered search may require more human review and interaction rather than reducing human effort and work, especially if the answer is important or humans will be making decisions based on it.

GenAI is regularly "100% confident, yet only 80% accurate" 

This will improve, but when using AI search for anything important (like predicting whether Liverpool will play in next season's Champions League or Europa League), review any answer provided and, ideally, run your query through more than one GenAI toolset to compare answers. If there is a difference, then research further.
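For readers who want to make that comparison a habit, here is a minimal sketch of the idea in Python. Everything in it is illustrative: ask_tool_a and ask_tool_b are hypothetical placeholders (their canned replies are the two AI answers quoted above), and in practice you would replace their bodies with calls to whichever GenAI APIs you actually use.

```python
# A rough sketch of the cross-checking habit described above. The two
# ask_* functions are hypothetical stand-ins for real GenAI APIs; here
# they simply return the two answers quoted earlier in this article.

def ask_tool_a(query: str) -> str:
    # Placeholder: replace with a real call to your first GenAI tool.
    return "Manchester City, Liverpool, Chelsea, Arsenal"

def ask_tool_b(query: str) -> str:
    # Placeholder: replace with a real call to your second GenAI tool.
    return "Manchester City, Manchester United, Liverpool, Chelsea"

def cross_check(query: str) -> None:
    """Ask the same question of several tools and flag any disagreement."""
    answers = {fn.__name__: fn(query) for fn in (ask_tool_a, ask_tool_b)}
    for name, answer in answers.items():
        print(f"{name}: {answer}")
    # Crude agreement test: normalise whitespace and case before comparing.
    if len({" ".join(a.lower().split()) for a in answers.values()}) > 1:
        print("Answers differ - research further before relying on either.")
    else:
        print("Answers agree - still verify anything important.")

cross_check("What are the top 4 teams in the Premier League table?")
```

Even this crude string comparison would have flagged the Premier League discrepancy above; a more careful version might compare extracted team names rather than raw text, but the principle of "two sources, then investigate differences" is the point.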
