How accurate is your AI? Why you must understand accuracy, precision, recall and F1 scores

Lawtomated
11 min read · Oct 10, 2019

AI software vendors routinely champion the superior accuracy of their tech vs. the equivalent human effort it seeks to replace. But what is “accuracy”? Likewise, when buyers ask “how accurate is your system at X?” is that the correct question to consider?

As this article will explain, accuracy is not a single metric. In fact, accuracy is but one of four potential metrics.

The other three metrics are precision, recall and F1 score.

Each metric measures something different about the system’s performance. For this reason, it is often desirable to optimise, and therefore prioritise, one metric over the others. Which metric to optimise depends on the context and objectives of the system. Therefore — spoiler alert — asking “how accurate is your system at X?” is the wrong question to ponder! Whether that question (and any answer to it) is the right one depends on the context of the business problem.

It’s a tricky topic, but a critical one. This guide is for non-technical individuals, and primarily lawyers at law firms, in-house legal teams and other legal service providers. That said, it should be useful for any interested business professional dabbling in artificial intelligence or data science (so we hope!).

🎓 What you’ll learn

You’ll learn:

  1. Why “accuracy” is not a single metric, but merely one of four useful metrics, and often the least useful.
  2. What do “accuracy”, “precision”, “recall” and “F1 score” mean, how are they different and what do they tell us about a system’s performance?
  3. When to prioritise precision over recall and vice versa?
  4. How this applies to AI in legaltech, particularly for AI-assisted contract review software such as Kira Systems, Seal, iManage Extract, Eigen Technologies, Luminance, e-Brevia, Diligen etc or their toolkit competitors such as Google Document Understanding AI.

But first, a non-legal but real-life example: AI-assisted tumour diagnosis for cancer detection.

🏥 Diagnosing cancer with AI

The vendor’s product

A vendor offers an AI-powered tumour testing system. This system uses a type of supervised machine learning to build a classifier.

A classifier is an algorithm that learns how to detect whether something belongs to one class or another. In this case, whether a tumour scan is either:

  1. benign (non-cancerous); or
  2. malignant (cancerous).

The AI learnt this ability by training on a dataset of tumour images. This dataset is known as the Training Dataset.

Each tumour image in the Training Dataset represents an input-output pair, i.e. an image of a tumour (input) plus a label (output). Each label — benign or malignant — was applied to each image by oncologists (cancer specialist doctors), and verified as true via further medical testing.

In other words, for each tumour image in the Training Dataset, we know with 100% certainty whether that tumour is benign or malignant: we simply look at the image’s label. This is the source of truth against which the system was trained.

The vendor then uses an additional, separate dataset to assess the system’s ability to replicate this behaviour, i.e. tumour classification. This further dataset is known as the Test Dataset. Like the Training Dataset, it is a collection of tumour images human-labelled with their corresponding tumour classifications, i.e. benign or malignant. Unlike the Training Dataset, this data is held back and not used to train the system. A copy of the Test Dataset is fed into the system, but this time the human-applied labels have been removed beforehand.

The sole purpose of the Test Dataset is to test the system. It’s essentially a blind test for the machine. After the system finishes classifying the unlabelled images from the Test Dataset, we compare the system’s classifications against the human-applied labels for the same images.
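For readers comfortable with a little code, here is a minimal sketch of that train-then-blind-test workflow in Python, using synthetic stand-in data and a generic scikit-learn classifier rather than the vendor’s actual system (everything here is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 1,000 "scans", each reduced to 5 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))

# Heavily imbalanced labels: only 10 of the 1,000 cases are "malignant" (1), the rest benign (0).
y = np.zeros(1000, dtype=int)
y[:10] = 1

# Hold back 20% of the labelled examples as the Test Dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)          # the system only ever learns from the Training Dataset

predictions = model.predict(X_test)  # blind test: the Test Dataset labels are withheld here
# Performance is then judged by comparing `predictions` against `y_test`.
```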

Let’s explore these results.

😉 The vendor’s claim

The vendor claims their system is “99% accurate”. Sounds great, but what does it mean? Let’s break this down with the aid of their data.

1. The Test Dataset

The “99%” figure is based on the below Test Dataset. These are the actual labels, applied by human reviewers:

  • 1,000,000 tumours in total
  • 999,000 out of 1,000,000 are actually benign
  • 1,000 out of 1,000,000 are actually malignant

This is the same Test Dataset we described above, but to be clear, is distinct from the much larger Training Dataset.

2. The Vendor’s Performance Data

Having been fed the Test Dataset, the system’s classifications for each tumour are summarised in the grid below. The vendor’s “99% accurate” claim derives directly from these results:

| | Actually malignant | Actually benign |
|---|---|---|
| Predicted malignant | 990 (True Positive) | 9,990 (False Positive) |
| Predicted benign | 10 (False Negative) | 989,010 (True Negative) |

A confusion (or confusing) matrix

The above grid is a confusion matrix. A confusion matrix is a table used to describe the performance of a classifier on a set of test data for which the true values are known.

The True Positive and True Negative cells summarise the correct classifications made by the system, and the False Positive and False Negative cells summarise the incorrect classifications made by the system.

Another way to represent these results is in terms of the following four categories:

  1. True Positive: tumour predicted malignant + actually malignant
  2. False Positive: tumour predicted malignant + actually benign
  3. True Negative: tumour predicted benign + actually benign
  4. False Negative: tumour predicted benign + actually malignant

Correct performance is when the system produces either True Positives or True Negatives. Incorrect performance occurs when the system produces False Positives or False Negatives.

In summary, we want the system to be correct at labelling positive results for cancer (malignant tumours) and negative results for cancer (benign tumours).

Anything else is incorrect.
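To make the four categories concrete, here is a short, purely illustrative Python sketch of how they are tallied from a set of actual labels and system predictions (the labels below are made up; this is not the vendor’s code or data):

```python
# Count the four outcomes from paired lists of actual labels and system predictions.
# "malignant" is treated as the positive class, "benign" as the negative class.
def confusion_counts(actual_labels, predicted_labels, positive="malignant"):
    tp = fp = tn = fn = 0
    for actual, predicted in zip(actual_labels, predicted_labels):
        if predicted == positive and actual == positive:
            tp += 1   # True Positive: predicted malignant, actually malignant
        elif predicted == positive:
            fp += 1   # False Positive: predicted malignant, actually benign
        elif actual != positive:
            tn += 1   # True Negative: predicted benign, actually benign
        else:
            fn += 1   # False Negative: predicted benign, actually malignant
    return tp, fp, tn, fn

# Tiny made-up example: 5 tumours, of which the system gets 3 right.
actual    = ["malignant", "benign", "benign", "malignant", "benign"]
predicted = ["malignant", "malignant", "benign", "benign", "benign"]
print(confusion_counts(actual, predicted))  # (1, 1, 2, 1)
```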

3. So how does the system perform?

When data scientists and machine learning engineers talk about “accuracy” what they mean — or should be talking about — is the relationship and applicability of four metrics:

  1. Precision;
  2. Recall;
  3. Accuracy; and
  4. F1 Scores.

Let’s break down each of these, both in general terms and in their specific application to the cancer detection system, and lastly to legal scenarios.

3.1 Precision

Precision is the ratio of system results that correctly predicted positive observations (tumour = malignant + in fact malignant, i.e. a True Positive) to the system’s total predicted positive observations, both correct (True Positives) and incorrect (tumour = malignant + in fact, benign, i.e. False Positives).

In other words, precision answers the following question:

How many of those tumours labelled by the system as malignant are actually malignant?

In formula, the precision ratio is this:

Precision = True Positives / (True Positives + False Positives)

Or slightly simplified:

Precision = TP / (TP + FP)

Applied to our cancer example we get this result:

Precision = 990 / (990 + 9,990) = 990 / 10,980 = 0.09 = 9%
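For those who like to check the maths, here is the same calculation in Python, using the figures from the confusion matrix above:

```python
true_positives = 990        # predicted malignant and actually malignant
false_positives = 9_990     # predicted malignant but actually benign

precision = true_positives / (true_positives + false_positives)
print(round(precision, 2))  # 0.09, i.e. roughly 9%
```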

Ouch! 9% isn’t 99% and doesn’t sound great… but we will come back to this and explain why it is neither a disaster nor evidence of a vendor falsehood.

3.2. Recall (aka Sensitivity)

Recall is the ratio of system results that correctly predicted positive observations (tumour = malignant + in fact malignant, i.e. True Positive) to all observations in the actual malignant class (i.e. all tumours that are actually malignant).

In other words, recall answers the following question:

Of all the tumours that are malignant, how many of those did the system correctly classify as malignant?

In formula, the recall ratio is this:

Recall = True Positives / (True Positives + False Negatives)

Or slightly simplified:

Recall = TP / (TP + FN)

Applied to our cancer example we get this result:

Recall = 990 / (990 + 10) = 990 / 1,000 = 0.99 = 99%
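Again, the same calculation in Python:

```python
true_positives = 990    # predicted malignant and actually malignant
false_negatives = 10    # predicted benign but actually malignant

recall = true_positives / (true_positives + false_negatives)
print(recall)           # 0.99, i.e. 99%
```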

3.3. Accuracy

Accuracy is the most intuitive performance measure. It’s what most people are taught at school, often in isolation and without consideration of precision, recall and F1 score.

Accuracy is simply a ratio of the correctly predicted classifications (both True Positives + True Negatives) to the total Test Dataset.

In other words, accuracy answers the following question:

How many tumours did the system correctly classify (i.e. as True Positives or True Negatives) out of all the tumours?

Accuracy is a great measure, but only when you have symmetric datasets, i.e. where the positive and negative classes are roughly evenly balanced. That is not the case in this example, where there are vastly more True Negatives than True Positives.

In formula, the accuracy ratio is this:

Accuracy = (True Positives + True Negatives) / Total Predictions

Or slightly simplified:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Applied to our cancer example we get this result:

Accuracy = (990 + 989,010) / 1,000,000 = 0.99 = 99%
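And once more in Python:

```python
true_positives = 990
true_negatives = 989_010
total_predictions = 1_000_000

accuracy = (true_positives + true_negatives) / total_predictions
print(accuracy)         # 0.99, i.e. 99%
```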

3.4. F1 Score

The F1 Score is the harmonic mean of Precision and Recall. This score therefore takes both False Positives and False Negatives into account, striking a balance between Precision and Recall.

So what is the difference between F1 Score and Accuracy?

An accuracy percentage can be heavily inflated by a large number of True Negatives. For instance, consider briefly this separate example: say there are 1,000 cases, only 2 of which are actually positive. The system correctly flags 1 of them (a True Positive), misses the other (a False Negative), raises no false alarms (0 False Positives) and correctly clears the remaining 998 (True Negatives).

In this example, the system is 99.9% accurate (999 correct calls out of 1,000). It’s also 100% precise (every positive prediction was right). Sounds amazing. But its recall is only 50%: it missed half of the actual positives. Does that matter? Well, it depends. What if the missed positives are terrorists or an individual carrying a zombie plague? The system starts to look pretty dire in this context. Just one False Negative and the cost could be gargantuan and irrecoverable.

The point of this separate example is that in most business circumstances, we do not focus on True Negatives whereas we are almost always concerned about False Negatives and False Positives, which usually have business costs (tangible and intangible).

Thus F1 Score might be a better measure vs. accuracy if we need to seek a balance between Precision and Recall AND there is an uneven class distribution, e.g. a large number of True Negatives as in the above mini example. For completeness, the F1 Score for the above example is 67%.

In formula, the F1 score ratio is this:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Returning to our cancer example we get this result:

F1 = 2 × (0.09 × 0.99) / (0.09 + 0.99) = 0.1782 / 1.08 ≈ 0.165 = 16.5%
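In Python, deriving the F1 Score from the precision and recall figures calculated earlier:

```python
precision = 990 / (990 + 9_990)   # ≈ 0.09
recall = 990 / (990 + 10)         # 0.99

f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))               # ≈ 0.165, i.e. roughly 16.5%
```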

4. So what does the above mean?

Based on the above we can say the cancer diagnosis system has high recall and high accuracy, but low precision. If asked:

“what is the chance that a tumour identified by the system as malignant is in fact malignant” what would you say?

If you said 99% you’d be wrong.

The answer is only 9%! But 9% sounds terrible… or does it? It means the majority of tumours the system identifies as malignant will actually turn out to be benign. Surely the system is garbage?

Not so fast. The system is actually very good when considered in light of its business objectives. This is because the system is designed to prioritise recall over precision. Another way of describing this is to say the system is deliberately over-inclusive, i.e. it statistically errs on the side of classifying a tumour as malignant.

But why might this make sense in the context of tumour classification?

🎯 So what to prioritise when?

Which metric to prioritise depends on your business objective and the relative costs of False Positives vs. False Negatives. Let’s examine this in more detail.

🤔 When to prioritise Recall over Precision?

Recall should be optimised over precision when there is a high cost associated with a False Negative, i.e. when the system predicts benign but the tumour is in fact malignant.

Our cancer detection scenario is a good example. If a patient’s tumour is actually malignant yet the system incorrectly predicts it as benign (a False Negative), that patient may very easily die if their cancer goes undetected and untreated. There is a high, and in this case irreversible, cost to getting the diagnosis wrong (death), so we accept the lower cost that False Positives entail: worry and further tests before cancer is ruled out.

⚖️ Applying this to Legal AI

When using AI-powered contract analysis software such as Kira Systems it is possible to adjust configurations and prioritise recall over precision and vice versa. Users will want to optimise recall over precision when using such tools for due diligence, i.e. to analyse contracts and flag clauses undesirable for their client such as “indemnities” or “termination at will”.

This is because there is a high cost associated with a False Negative in such circumstances, i.e. the system failing to red flag such provisions, leading the lawyers to miss key information that adversely impacts their client’s position.

Another example is eDiscovery and predictive coding. If the system fails to identify something as responsive to the litigation (i.e. a False Negative) there is a high cost associated with it — losing the case to the extent that evidence is a smoking gun. To avoid that you prioritise recall over precision and accept the lesser cost of having to wade through more False Positives (things labelled responsive to the litigation but in fact unresponsive).

🤔 When to prioritise Precision over Recall?

Precision should be optimised over recall when there is a high cost associated with a False Positive. Spam detection is a classic example.

In email spam detection, if an email is actually non-spam yet the system incorrectly predicts it as spam (a False Positive), that email will be sent to the spam folder and / or possibly deleted without the addressee’s knowledge. Because the email user might lose important emails if precision is too low, there is a high cost associated with a False Positive in this scenario.

⚖️ Applying this to Legal AI

Returning to the AI-assisted contract analysis example, assume the software is instead being used to extract clauses for a clause library. The clause library will collate clauses deemed by lawyers to be on market, that is clauses worded in a manner generally accepted by the legal market for that contract or transaction type.

Unlike in due diligence, there is here a high cost associated with a False Positive — a clause labelled as on market when it is, in fact, the opposite, i.e. off-market.

The cost is high because a junior lawyer may inadvertently rely upon this off-market clause by virtue of it being listed in the clause library as on market, which could have adverse consequences. Those adverse consequences could include inserting a non-standard contract provision that damages the client’s position.

🙌 Conclusion

If you made it this far, give yourself a pat on the back. This topic isn’t a walk in the park. There are a lot of similar-sounding terms to take in, and even some basic maths. Hopefully, you now understand:

  1. Why “accuracy” is not a single metric, but one of four useful metrics
  2. What do “accuracy”, “precision”, “recall” and “F1 score” mean and what do they tell us about a system’s performance?
  3. When to prioritise precision over recall and vice versa?
  4. How this applies to AI in legaltech, particularly for AI-assisted contract review software such as Kira Systems, Seal, iManage Extract, Eigen Technologies, Luminance, e-Brevia, Diligen etc or their toolkit competitors such as Google Document Understanding AI.

Remember: accuracy is in the eye of the beholder!

Originally published at lawtomated.
