Building Confidence: A Case Study in How to Create Confidence Scores for GenAI Applications

December 12, 2024 Published by Molly Zhu (Senior Engineering Manager), Xin Xu (Senior ML Engineer), and Kathy Gao (Senior Data Scientist)

TL;DR Getting a response from GenAI is quick and straightforward. But what about the confidence level for that response? In certain applications, especially in the financial domain, confidence scores are required. In a document-parsing task related to financial automation, we tested three approaches to address confidence level: calibrator models, logarithmic probabilities (logprobs), and majority voting. Majority voting turned out to be the most performant technique. Although the idea is relatively simple, the devil is in the details.

Introduction

The introduction of GenAI technology has proven to be revolutionary in the current business landscape, including its improvements to internal business efficiency. Repetitive tasks that were once burdensome to humans can be automated by GenAI. Compared to traditional deterministic methods or machine learning (ML) models, using GenAI has the following benefits:

  • Fast development: Rapid training, testing, and implementation reduce development time.
  • Scalability and flexibility: GenAI can easily scale to incorporate new cases with simple prompt adjustments. 
  • Maintainability: In many cases, GenAI-powered solutions are easier to maintain than traditional ML models, making them more cost-effective over time.

Despite these advantages, GenAI faces significant challenges in accuracy and reliability: it does not natively produce confidence scores, and it can hallucinate.

Confidence levels are crucial for building trust and informing decisions, but they are not built into GenAI models. In this post, we will explore how we created confidence scores in GenAI applications for a financial automation use case, detailing the approaches, important implementation details, and challenges and limitations.

Assessing confidence in a financial automation application

Our team sits in the Financial Engineering org at Spotify and aims to enhance efficiency in financial processes by automation. Recently we worked on automating invoice parsing, which was part of a larger initiative to streamline invoice processing. The invoices, from a large distribution of vendors globally, come in various languages, formats, and structures. This complexity makes deterministic models inadequate, as they struggle with high numbers of edge cases and with ambiguous and incomplete data.

GenAI, on the other hand, has proven to be more versatile and adaptable. It can also make adequate inferences for complex tasks such as invoice parsing. However, a significant challenge arises while employing GenAI models in this task: confidence scores must be produced to support human-in-the-loop decisions and to meet regulatory requirements like IT general controls (ITGC) and the Sarbanes–Oxley Act (SOX). Specific thresholds need to be crossed for the models’ outputs to be trusted. To address the need, we researched the following approaches and implemented a reliable way to generate confidence scores for GenAI model outputs.

Approaches evaluated

Calibrator

“Calibrator” refers to using a separate single GenAI model to evaluate the outputs generated by other models and assign confidence scores. This approach offers an independent and unbiased evaluation of the outputs from other models. The calibrator model is also trainable by learning from feedback and can potentially improve over time.

However, confidence scores generated by calibrators are difficult to interpret and sometimes counterintuitive. Besides, the scores are not consistent across multiple runs. In financial applications where confidence in technology is crucial, unclear scoring and inconsistencies are not acceptable.

Logprobs

Logprobs, or logarithmic probabilities, are the log-transformed probabilities for each token (a unit of text, such as a word or part of a word) in a generated output. These probabilities indicate how confident the model is in choosing a token. A higher (less negative) logprob means the model is more confident in the choice. However, the methodology is not always transparent: it varies by provider, and for many models it is not disclosed how the probabilities are calculated.

The confidence of an entire response can be deduced by combining the logprobs of its tokens. We averaged the logprobs instead of summing to normalize for output length. The confidence score was then calculated by exponentiating the average logprobs of the tokens. 
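The calculation above can be sketched in a few lines; the per-token logprob values here are illustrative (in practice they come from the provider's API response):

```python
import math

def logprob_confidence(token_logprobs):
    """Confidence score from per-token logprobs: exponentiate the
    mean logprob, so the score is normalized for output length."""
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

# Illustrative per-token logprobs for a short extracted field
score = logprob_confidence([-0.05, -0.20, -0.10])  # exp(-0.35/3) ~= 0.89
```

Summing instead of averaging would penalize longer outputs, since each extra token adds another non-positive term to the exponent.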

We tested the approach for different extraction fields in our application. We plotted the results for one numeric field (invoice total) and one text field (invoice number) against accuracy in the figure below. We found no clear correlation between the logprob-based confidence score and accuracy, and the conclusion holds for other text and numeric fields as well. This lack of correlation suggests that averaging logprobs is not a reliable measure of overall confidence.

Figure 1: Accuracy versus logprob-based confidence score for invoice total and invoice number extraction.

Majority voting

Majority voting is an ensemble method that selects the final output by choosing the most common response from multiple prompts or among multiple GenAI models. For example, if five models are asked to classify an image and four suggest “cat” while one suggests “dog,” the final output would be “cat.” The confidence score can be calculated based on the proportion of agreeing models, in this case 80% (four out of five models). 
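The cat/dog example above can be sketched as a small function (the responses are illustrative):

```python
from collections import Counter

def majority_vote(responses):
    """Return the most common response and the fraction of models
    that agree with it, which serves as the confidence score."""
    counts = Counter(responses)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(responses)

# Four of five models say "cat" -> output "cat" with 80% confidence
answer, confidence = majority_vote(["cat", "cat", "cat", "cat", "dog"])
```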

We tested the same application using five GenAI models and majority voting and observed a strong positive correlation between confidence score and accuracy. The same two fields were plotted below, and the conclusion holds for other fields as well.

Figure 2: Accuracy versus majority-voting-based confidence score for invoice total and invoice number extraction.

The careful art of majority voting

Based on this evaluation, we concluded that majority voting was the only suitable choice for our application: it is the only method that showed a strong positive correlation with accuracy, and it is relatively consistent and interpretable. While the concept is straightforward, in practice we realized that the implementation requires careful consideration of many factors.

Deciding the number of models

The optimal number of models depends on factors such as task complexity, model diversity, available resources, and specific project goals and constraints. Larger ensembles typically offer greater stability and accuracy by reducing individual model errors. However, they increase computational complexity and may yield diminishing returns if the models are too similar. Smaller ensembles are more efficient but less stable.

Literature often suggests using four to seven models in an optimal ensemble. In our use cases, we leveraged five or six different LLMs. We found that this number provided sufficient model diversity while keeping speed and cost manageable.

Assigning weights to voting

To calculate the final score, a weighted majority voting approach was implemented. The weight for each model was based on its accuracy and then normalized so that the weights sum to one. Compared to an unweighted vote, this method minimizes the chances of a tie in the voting process. It also improves accuracy by giving greater influence to better-performing models.

We evaluated both linear weights (where weights are linearly correlated with model accuracy) and exponential weights (where top-performing models get weights close to or exceeding half). Both methods showed similar performance. We opted to use linear weights for our application, as it created more balanced and consistent outcomes and was easier to explain. 

Calibrating confidence score

While there is a positive correlation between confidence and accuracy, the relationship is not one-to-one and can vary across different fields. To correct over- or under-confidence in various fields, we applied Platt scaling, a technique commonly used to calibrate probabilistic outputs. This calibration adjusts the raw confidence scores to better align with accuracy, as shown in the two examples in the chart below:
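Platt scaling fits a logistic function P(correct) = 1 / (1 + exp(a·s + b)) to pairs of raw scores and correct/incorrect labels. A minimal pure-Python sketch using gradient descent follows; the training data is illustrative, and in practice a library implementation of logistic regression would typically be used instead:

```python
import math

def fit_platt(scores, labels, lr=0.1, epochs=5000):
    """Fit a, b in P(correct) = 1 / (1 + exp(a*s + b)) by minimizing
    log loss with batch gradient descent."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(a * s + b))
            # gradient of log loss with respect to a and b
            grad_a += (p - y) * (-s)
            grad_b += (p - y) * (-1.0)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrate(score, a, b):
    """Map a raw confidence score to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(a * score + b))

# Illustrative (raw score, was-the-output-correct) history for one field
raw_scores = [0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
correct = [0, 0, 0, 1, 1, 1, 1]
a, b = fit_platt(raw_scores, correct)
```

The fitted parameters are stored per field, so each field gets its own correction for over- or under-confidence.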

Figure 3: Accuracy versus majority-voting-based confidence scores before and after calibration for invoice total and invoice number extraction.

Majority voting — limitations, challenges, and explorations 

Long text fields

Majority voting works effectively for numeric and short text responses but faces challenges with long text fields, such as addresses or item descriptions. Due to variations in phrasing, it is less likely to get model agreement using direct string matching. As an attempt to address this issue, we explored two methods to get a majority response in these scenarios:

  1. Embedding similarity, which groups similar text responses into clusters based on cosine similarity of their embeddings. The largest cluster is the majority vote. This approach requires careful tuning of distance thresholds.
  2. GenAI selection, where a separate GenAI model selects the majority vote from multiple responses. This approach requires extensive prompt engineering to ensure the selector prioritizes factual accuracy and consistency.
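The embedding-similarity idea can be sketched as greedy clustering over precomputed embeddings. The vectors and the 0.9 threshold below are illustrative, and, as noted next, this approach did not make it to production:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cluster_majority(texts, embeddings, threshold=0.9):
    """Greedy clustering: a response joins the first cluster whose
    representative embedding is within the similarity threshold.
    The largest cluster is the majority vote."""
    clusters = []  # each entry: (representative embedding, member texts)
    for text, emb in zip(texts, embeddings):
        for rep, members in clusters:
            if cosine(rep, emb) >= threshold:
                members.append(text)
                break
        else:
            clusters.append((emb, [text]))
    rep, members = max(clusters, key=lambda c: len(c[1]))
    return members[0], len(members) / len(texts)
```

The threshold is the delicate part: set it too loose and near-identical strings with materially different characters (the "0" versus "O" case) fall into the same cluster.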

During our testing, both approaches were able to identify similar outputs for long text fields. However, they failed to catch some spelling differences — for example, not distinguishing between “0” and “O,” which can lead to significant errors in a financial application. Therefore, we decided not to implement them for production. 

As a workaround, we broke down long text fields into smaller, more manageable components (e.g., splitting addresses into street, city, state, and zip code). It improved the likelihood of model agreement during majority voting. However, developing a more robust solution for the long text field remains a key area for further research.
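The workaround amounts to voting on each component independently rather than on the full string. A sketch, where the address fields and responses are illustrative and the minimum per-component agreement is one conservative (assumed) way to roll up an overall confidence:

```python
from collections import Counter

def vote_by_component(parsed_responses):
    """parsed_responses: one dict per model, e.g.
    {"street": ..., "city": ..., "state": ..., "zip": ...}.
    Vote on each component independently; the overall confidence here
    is the minimum per-component agreement (a conservative choice)."""
    result, agreements = {}, []
    n = len(parsed_responses)
    for key in parsed_responses[0]:
        counts = Counter(resp[key] for resp in parsed_responses)
        value, votes = counts.most_common(1)[0]
        result[key] = value
        agreements.append(votes / n)
    return result, min(agreements)
```

Splitting the field means a single model's divergent street phrasing no longer blocks agreement on the city, state, and zip code.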

The granularity problem

Given the limited number of models in an ensemble, majority voting brings a granularity problem. With seven models (the upper end of the normal range), each additional vote results in a significant ~14% step change in confidence level. This isn’t granular enough for applications requiring fine-grained confidence figures (e.g., needing 95% confidence to pass a check). 

To improve granularity, we experimented with the permutation approach: by using multiple prompts per model, the total responses increased from X to X * Y (where X is the number of models and Y is the number of prompts per model). We tested with seven GenAI models and five different prompts, and 35 responses were generated. The outcome is summarized in the table below. 

| Approach | # of GenAI Models | # of Prompts | Total Responses | # of Agreed Responses | Output | Confidence |
| --- | --- | --- | --- | --- | --- | --- |
| Original | 7 | 1 | 7 | 6 | Majority voting output is correct. | 86% (6/7) |
| Permutation | 7 | 5 | 35 | 33 | Majority voting output is correct. | 94% (33/35) |
Table 1: Comparison of majority voting with and without permutation.

The permutation method could potentially increase the pass rate for our system. If the confidence threshold was 90%, the original approach would result in a false rejection, while the permutation approach would successfully return the correct output.
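The threshold comparison above reduces to simple arithmetic, sketched here with the 90% threshold from the example:

```python
def passes_threshold(agreed, total, threshold=0.90):
    """Return (confidence, passed) for a majority-voting outcome."""
    confidence = agreed / total
    return confidence, confidence >= threshold

# Original approach: 6 of 7 models agree -> ~86%, below the threshold
conf_orig, ok_orig = passes_threshold(6, 7)
# Permutation approach: 33 of 35 responses agree -> ~94%, passes
conf_perm, ok_perm = passes_threshold(33, 35)
```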

However, it significantly increases the number of model runs: cost grows linearly with the number of responses, end-to-end latency is bounded by the slowest of the 35 calls, and the chance of at least one failure rises. Although the experiments provided valuable insights and guided further discussions with the business regarding thresholds and their implications, future research is still warranted for a more cost-effective long-term solution.

Conclusion

In this post, we examined a unique challenge that arises when using GenAI in financial applications — determining the confidence score. We discussed the various approaches we explored, culminating in a successfully implemented technique. However, as always, the devil is in the details, and many aspects of implementation require careful consideration. Finally, there are still challenges and limitations in our approach. We explored a few solutions, but some problems remain the topic of future research. We are optimistic that new developments will continue to evolve this groundbreaking technology, and we hope our explorations inspire not only our teams at Spotify but also practitioners across the industry.