Supervised or Unsupervised Learning — which is better? (A Lawyer’s Guide)
We’ll explain:
- what each of supervised and unsupervised learning means;
- how they work, plus an example of each in a legal context;
- when to use each, and which is better; and
- the “out of the box = unsupervised learning” misconception.
Supervised Learning
Learning from labels
Supervised learning requires labelled data.
That data is typically labelled by a domain expert, i.e. someone skilled at identifying which labels belong with which data.
In the legal context, this will be a lawyer or legally trained individual.
In the consumer space, this is often you! For instance, Facebook is great at automatically tagging your friends in photos.
Why is that?
It is because of the historical training you provided — and continue to provide — when manually tagging photos of your friends. Over time, with more examples of your friends in different conditions (lighting, angles and obscuring detail), Facebook’s algorithms learn how to tag photo A as “Arnold” and photo B as “Linda”.
Legal A.I. systems identifying and extracting clauses (or intra-clause data, e.g. a financial number such as rent amount) also achieve this via supervised learning.
For example, a legal A.I. due diligence tool may extract governing law from SPAs, NDAs or loan agreements. To do so, either vendor or user provides the system with labelled examples of governing law clauses.
This process is known as training. During training, a supervised machine learning algorithm uses the labelled examples to generate a predictive model.
A predictive model is a mathematical function that maps a given input to the desired output: in this case, its predicted classification, i.e. the correct governing law.
The model is predictive because it relies on statistical and probabilistic techniques to predict the correct governing law based on historical data.
A basic workflow describing the above process for the governing law example is shown below:
The above generates a predictive model mathematically optimised to predict whether a given combination of words is more or less likely to belong to a particular label, e.g. “English law” or “Spanish law” etc.
In machine learning terms, this type of supervised learning is known as classification, because we are building a system to classify something into one of two or more classes (here, governing laws).
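To make this concrete, here is a minimal sketch of supervised classification in Python: a tiny Naive Bayes classifier trained on hypothetical labelled governing law snippets. The snippets, labels and class name are all invented for illustration; real legal A.I. tools use far richer features and models, but the shape of the workflow (labelled examples in, predictive model out) is the same:

```python
from collections import Counter, defaultdict
import math

def tokenize(text):
    return text.lower().replace(".", "").split()

class NaiveBayesClassifier:
    """Minimal multinomial Naive Bayes over bag-of-words features."""

    def fit(self, examples):
        # examples: list of (text, label) pairs supplied by a domain expert
        self.label_counts = Counter(label for _, label in examples)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in examples:
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        self.total = sum(self.label_counts.values())

    def predict(self, text):
        scores = {}
        for label in self.label_counts:
            # log prior: how common is this label in the training data?
            score = math.log(self.label_counts[label] / self.total)
            n = sum(self.word_counts[label].values())
            for word in tokenize(text):
                # Laplace-smoothed log likelihood of each word given the label
                score += math.log(
                    (self.word_counts[label][word] + 1) / (n + len(self.vocab))
                )
            scores[label] = score
        return max(scores, key=scores.get)

# Hypothetical labelled training snippets (labels supplied by a lawyer)
training = [
    ("This agreement shall be governed by the laws of England and Wales", "English law"),
    ("This deed is governed by and construed in accordance with English law", "English law"),
    ("Este contrato se rige por la ley espanola", "Spanish law"),
    ("This agreement shall be governed by Spanish law and the courts of Madrid", "Spanish law"),
]

clf = NaiveBayesClassifier()
clf.fit(training)
print(clf.predict("The contract is governed by the laws of England"))  # predicts "English law"
```

Note the model never "understands" the clauses: it only counts word frequencies per label and picks the statistically most likely class, which is exactly the maths-not-minds point above.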
Accurate though it might become, the model understands neither the labels nor what it is labelling. As we always like to stress at lawtomated, machine learning is maths, not minds.
If you are interested in digging deeper, check out our forthcoming guide to training, testing and cross-validation of machine learning systems, which are each fundamental concepts in any machine learning system, albeit usually abstracted or unavailable to the users via the UI of legal A.I. systems.
Unsupervised Learning
Pattern spotting
Unlike supervised learning, unsupervised learning does not require labelled data.
This is because unsupervised learning techniques serve a different purpose:
They are designed to identify patterns inherent in the structure of the data.
A typical non-legal use case is to use a technique called clustering. This is used to segment customers into groups by distinct characteristics (e.g. age group) to better assign marketing campaigns, product recommendations or prevent churn.
A common legal use case for this same technique is diagrammed below in the case of A.I. powered contract due diligence:
As the above illustrates we start with a disorganised bag of governing law clauses. An unsupervised technique such as clustering can be used to identify statistical patterns inherent in the data, clustering similar governing law clause formulations together and separating each cluster from dissimilar items.
In this example, the data scientist — or in some cases the end-user to the extent such controls are exposed via a UI — can adjust the similarity threshold, typically a value between 0 and 1.
If set to 1 the algorithm will cluster together only identical items, i.e. identifying duplicates. This turns data — random clauses — into information we can use, i.e. we now understand the dataset contains duplicate data, which in turn may be a valuable insight.
If set to 0 the algorithm requires no similarity at all, so even entirely distinct items will be clustered together, telling us little about the dataset.
A setting between 0 and 1 will cluster the data into varying group sizes and compositions. For example, a setting of 0.8 clusters together clauses that are at least 80% similar. Users might use this to detect near-duplicates, i.e. documents that are virtually, but not entirely, identical.
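The effect of the similarity threshold can be sketched in a few lines of Python. The Jaccard word-overlap measure and the greedy grouping rule below are deliberate simplifications chosen for illustration; real due diligence tools use more sophisticated text representations, but the threshold behaves the same way:

```python
def jaccard(a, b):
    """Similarity of two clauses as the overlap of their word sets (0.0 to 1.0)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cluster(clauses, threshold):
    """Greedy clustering: a clause joins the first cluster whose
    representative (first member) is at least `threshold` similar."""
    clusters = []
    for clause in clauses:
        for group in clusters:
            if jaccard(clause, group[0]) >= threshold:
                group.append(clause)
                break
        else:
            clusters.append([clause])
    return clusters

clauses = [
    "This agreement is governed by English law",
    "This agreement is governed by English law",         # exact duplicate
    "This agreement is governed solely by English law",  # near-duplicate
    "Este contrato se rige por la ley espanola",
]

print(len(cluster(clauses, 1.0)))  # 3 clusters: only exact duplicates merge
print(len(cluster(clauses, 0.8)))  # 2 clusters: near-duplicates merge too
```

Notice no labels were supplied anywhere: the groupings fall out of the statistical structure of the text itself, which is what makes this unsupervised.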
Which is better: supervised or unsupervised?
You’re asking the wrong question
Here’s a helpful analogy for the supervised vs. unsupervised learning question.
Ask yourself: which is better, screwdriver or hammer?
The answer is neither.
They serve similar but different purposes, albeit they sometimes work hand in hand (literally) to achieve a bigger outcome, e.g. a set of shelves.
In the same way, when people ask, “Which is better: supervised or unsupervised learning?”, the answer is neither, albeit the two are often combined to achieve a result.
An example
For example, unsupervised learning is sometimes used to automatically preprocess data into logical groupings based on the distribution of the data, such as in the clause clustering example above.
This might result in groupings based on the type of paperwork used for a contract type, e.g. all the contracts stemming from template A may fall into one cluster vs. those falling into a separate cluster. This turns data into useful information to the extent such insights were not previously known, nor immediately identifiable (or verifiable), by a human reviewer.
This may, in turn, assist human domain experts with their dataset labelling, e.g. by identifying which documents will most likely contain representative examples of the data points they wish to label at a more granular level and those which won’t.
The subsequent labelling will then feed into a supervised learning algorithm that produces the final result, e.g. a due diligence report summary of red flag clauses in an M&A data room.
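As a rough sketch of that combined workflow (the clauses, similarity measure and labels below are all hypothetical; the point is the shape of the pipeline, not the specifics):

```python
def similarity(a, b):
    """Word-overlap (Jaccard) similarity between two clauses."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cluster(clauses, threshold=0.6):
    """Greedy grouping of clauses by similarity to each group's first member."""
    clusters = []
    for clause in clauses:
        for group in clusters:
            if similarity(clause, group[0]) >= threshold:
                group.append(clause)
                break
        else:
            clusters.append([clause])
    return clusters

unlabelled = [
    "Governed by English law and subject to the English courts",
    "Governed by English law and subject to English jurisdiction",
    "Se rige por la ley espanola",
    "Se rige exclusivamente por la ley espanola",
]

# Step 1 (unsupervised): group similar clauses without any labels
groups = cluster(unlabelled)

# Step 2 (human expert): label just one representative per group
expert_labels = {0: "English law", 1: "Spanish law"}  # hypothetical lawyer input

# Step 3: propagate each label across its group, yielding a labelled
# training set ready for the supervised learning step
training_set = [(clause, expert_labels[i])
                for i, group in enumerate(groups) for clause in group]
print(len(training_set))  # all 4 clauses labelled from just 2 expert decisions
```

The design point is leverage: the unsupervised step reduces how many items the domain expert must touch, while the supervised step still does the final predictive work.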
Out of the box vs. Unsupervised Learning
Good vendors distinguish, bad vendors disguise
Any legal team buying an A.I. system (or anyone buying an “A.I” product) will want to know which is best for them. Vendors in the crowded A.I. contract due diligence space typically provide one or both of two features:
- OOTB Extractors: these are product features pre-trained by the vendor to identify and extract popular contract provisions, e.g. governing law, termination, indemnity etc.
- Self-trained extractors: these are product features the user can train to generate a user-specific predictive model for a contract provision of their choosing and design.
In either case, someone has to train the system with labelled data.
This is because both features use supervised learning techniques of the sort described above.
Unfortunately, some vendors deliberately, or by omission, (mis)lead people (media, buyers and users) to believe that because something comes ready and working “out of the box” (aka “OOTB”), it must use unsupervised learning.
This is patently false: it will have been trained by the vendor if it is performing a classification task such as extracting clauses from contracts.
By extension, conflating OOTB Extractors with unsupervised learning is usually intended to suggest a solution is superior to products without such features, i.e. because it “requires no training” or, worse, implies the system “just learns by itself”. Again, this is inaccurate and misleading.
OOTB Extractors vs. Self-trained Extractors
Another bake-off!
Flowing from the above, and as with the earlier point about whether supervised or unsupervised learning is better, the same answer applies to the question of OOTB Extractors vs. Self-trained Extractors: neither is objectively better.
Recall both are supervised learning techniques. The differences, however, are these:
Who, What & How
Pros & Cons
Conclusion
Hopefully, you’ve learnt:
- What supervised learning is.
- What unsupervised learning is.
- How each of the above works (at a high level).
- A basic use case example for each.
- The key difference for most legal use cases: that supervised learning requires labelled data to predict labels for new data objects whereas unsupervised learning does not require labels and instead mathematically infers groupings.
- That neither supervised learning nor unsupervised learning is objectively better — each serves different purposes, albeit can be (and often are) used in combination to achieve a larger goal.
- That unsupervised learning and OOTB pre-trained extractors are not the same, that the latter is supervised learning (albeit trained by the vendor) and doesn’t simply “learn by itself”!
- The who, what, how, pros and cons of OOTB pre-trained extractors vs. self-trained extractors.