A.I. Technical: Machine vs Deep Learning
There’s lots of confusion surrounding machine learning vs deep learning: what each means and which is better. To set the record straight, we will explain the difference between: (1) machine learning and (2) deep learning.
Note this article is principally aimed at non-techies, i.e. legal professionals wanting to understand these terms and their application to their domain. It therefore necessarily involves abstraction and simplification.
Like “A.I.”, machine and deep learning get misused and abused. To help out we’ll explain:
- What is machine learning?
- What is deep learning?
- How do the two differ from one another?
- Their relationship with A.I.
- Which of machine or deep learning is better?
The good news is that none of this is complicated. Understanding these terms and their relationships enhances your ability to make informed decisions about:
- A.I. vendors;
- the A.I. debate in general; and
- whether the needs of your business case map to an ML / DL solution, or neither!
What is machine learning?
Machine learning (“ML”) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying instead on patterns and inference derived from data.
ML is nothing new
The term was coined in 1959 by Arthur Samuel, an early pioneer of computer gaming and A.I.
Tom M. Mitchell, another computer scientist, later provided a widely quoted formal definition of ML:
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E”
Mitchell, T. (1997). Machine Learning. McGraw Hill.
In practical terms this means designing algorithms that:
- consume data (the “E” in the above);
- apply statistical techniques to detect patterns; and
- thereby “learn” how to perform some task (the “T” in the above).
This process is iterative. Each iteration’s attempt at the particular task is measured against the correct outcome for that task (the “P” in the above). If performance improves, the algorithm keeps adjusting itself in the direction that produced the improvement. If performance worsens, the algorithm adjusts in the opposite direction.
One method by which many ML / DL techniques achieve the aforementioned optimisation process is called gradient descent, which we will cover in later posts given it is a significant topic in and of itself.
Lastly, depending on the progress of the above process, the algorithm is tuned by feeding it more data and / or making specific tweaks to its configuration until the desired performance is achieved.
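To make that loop concrete, here is a toy sketch of the iterate, measure, adjust cycle using gradient descent on a deliberately simple one-parameter model. The data, learning rate and iteration count are invented purely for illustration:

```python
# Experience "E": input/output pairs the algorithm learns from.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # roughly y = 2x

w = 0.0              # initial guess at the model's single weight
learning_rate = 0.05

for iteration in range(100):
    # Performance "P": mean squared error of predictions w * x vs. the truth y.
    error = sum((w * x - y) ** 2 for x, y in data) / len(data)

    # The gradient tells us which direction (and how far) to adjust w
    # to reduce the error: the "adjustment" step in the loop.
    gradient = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * gradient  # step "downhill"

print(f"learned w = {w:.2f}, final error = {error:.4f}")  # w approaches ~2
```

The same idea scales up: real systems adjust millions of weights at once, but each adjustment follows the same measure-then-nudge logic.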
Garbage in, garbage out
Because ML relies upon data, it’s very much garbage in, garbage out (“GIGO”). Lacking access to sufficient quantities of quality data severely limits ML performance, or indeed the general suitability of your problem to a ML solution.
Major Types
ML has three major types:
- supervised learning;
- unsupervised learning; and
- reinforcement learning.
01. Supervised Learning
An algorithm uses training data and feedback from humans to learn the relationship of given inputs to desired outputs.
The training data is labelled by humans, e.g. photo X = cat, photo Y = potato etc. The labelling, together with the human-influenced feedback loop to improve the machine-generated results, explains why we term this type of ML “supervised”.
All supervised learning tasks fall into one of two categories:
- classification problems; or
- regression problems.
Classification is used to predict discrete responses, i.e. outputs limited to fixed values, e.g. the number of students in a class (you can’t have half a student). In a legal context, ML classification algorithms are used to classify whether clause X describes French governing law or English governing law.
Regression is used for predicting continuous responses, that is a value within a range, e.g. the height of students in a class is not fixed but rather a sliding scale of all possible human heights. In a legal context, ML regression algorithms could be used to predict the ideal fee quote for a matter with X, Y and Z variables.
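To make the distinction concrete, here is a minimal sketch of both tasks using the popular scikit-learn library. The clause snippets, fee figures and feature choices below are invented toy examples, not a production approach:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression

# --- Classification: predict a discrete label (governing law) ---
clauses = [
    "This Agreement is governed by the laws of England and Wales.",
    "Ce contrat est régi par le droit français.",
    "The courts of England shall have exclusive jurisdiction.",
    "Le présent accord est soumis au droit français.",
]
labels = ["english", "french", "english", "french"]

vectoriser = CountVectorizer()               # turn text into word counts
X = vectoriser.fit_transform(clauses)
classifier = LogisticRegression().fit(X, labels)

test = vectoriser.transform(["This contract is governed by English law."])
print(classifier.predict(test))              # likely ['english']

# --- Regression: predict a continuous value (a fee quote) ---
# Hypothetical features per matter: [estimated hours, number of documents]
matters = [[10, 5], [40, 20], [80, 60], [120, 90]]
fees = [5_000.0, 18_000.0, 45_000.0, 70_000.0]

regressor = LinearRegression().fit(matters, fees)
print(regressor.predict([[60, 30]]))         # a continuous fee estimate
```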
02. Unsupervised Learning
Unlike supervised learning, unsupervised learning does not require labelled data. This is because unsupervised learning techniques are designed to identify patterns inherent in the structure of the data.
For instance, in a legal context, you might use an unsupervised learning algorithm to identify logical groupings of contracts based on their shared syntax. Upon further human inspection, these groupings might reveal useful insights, e.g. documents involving certain counterparties being more similar to one another than to documents involving other counterparties.
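As a hedged illustration, here is what that grouping exercise might look like using k-means clustering from scikit-learn. The document snippets are invented, and a real system would need far more data and care:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Master services agreement between Acme Corp and the supplier",
    "Acme Corp purchase order terms covering delivery and acceptance",
    "Non-disclosure agreement between Beta Ltd and the consultant",
    "Beta Ltd consultancy agreement with confidentiality obligations",
]

# No labels are supplied: the algorithm groups documents purely by
# the similarity of their wording.
X = TfidfVectorizer().fit_transform(documents)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Human inspection of each cluster (here, plausibly the Acme vs. Beta
# documents) is where the useful insight comes from.
print(kmeans.labels_)  # e.g. [0, 0, 1, 1]
```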
03. Reinforcement Learning
An algorithm learns to perform a task simply by trying to maximise the rewards it receives for its actions.
For example:
1. The algorithm takes an action on the environment, e.g. makes a move in chess.
2. It receives a reward if the action brings the machine closer to maximising the total reward (i.e. winning the game, not merely making a single good move), or a penalty if it takes it further away.
3. The algorithm optimises for the best series of actions by updating its policies to reflect whether it received a reward or penalty after step (2), in turn improving its ability to achieve the desired outcome over time.
This technique is most often used in game-like situations, e.g. playing games such as Go, self-driving cars, trading strategies, balancing electricity grid loads or optimising auction pricing in real time.
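To give a feel for that reward-and-penalty loop, below is a toy sketch of Q-learning, one classic reinforcement learning technique, applied to an invented “corridor” game. The environment, rewards and parameters are purely illustrative:

```python
import random

# The game: the agent starts at position 0 and wins by reaching
# position 4; every intermediate step incurs a small penalty.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                     # move left or move right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration

for episode in range(500):
    state = 0
    while state != GOAL:
        # Step (1): take an action (occasionally a random one, to explore).
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state = min(max(state + action, 0), N_STATES - 1)

        # Step (2): receive a reward for winning, or a small penalty.
        reward = 1.0 if next_state == GOAL else -0.01

        # Step (3): update the policy (here, a table of action values)
        # to reflect the reward or penalty just received.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# After training, the learned policy should walk right towards the goal.
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)])  # e.g. [1, 1, 1, 1]
```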
See also
For more on the distinction between supervised and unsupervised learning, including which is better for what plus common points of confusion, please see our detailed explainer here.
We will do a deeper dive into reinforcement learning at some point in the future: stay tuned!
What is deep learning?
Deep learning (“DL”) is a subtype of machine learning. DL can process a wider range of data resources, requires less data preprocessing by humans (e.g. feature labelling), and can sometimes produce more accurate results than traditional ML approaches (although it requires a larger amount of data to do so).
However, it is more expensive computationally: in execution time, hardware costs and the quantities of data required.
DL is also nothing new
Like ML, DL is not particularly new. Indeed the principal component of DL systems — artificial neural networks — began to take shape in the 1940s, seeing significant breakthroughs in the 1960s and each decade thereafter.
DL’s use has accelerated in recent decades. This is due to:
- the availability of ever-cheaper yet increasingly powerful computer hardware; and
- the crowdsourcing of rich datasets via the internet, which helps create, capture and curate the necessary labelled datasets at massive scale.
It’s all about neural networks
In deep learning, interconnected layers of software-based calculators known as “neurons” form a neural network. The idea is to replicate, in abstracted form, how we believe the human brain might process similar information and learn from its surroundings and sensory input.
The network ingests vast amounts of input data, processing it through multiple layers of neurons that learn increasingly complex features of the data at each layer.
The network can then:
- make a determination about the data, e.g. photo X = dog;
- learn if its determination is correct, i.e. is photo X in fact a dog; and
- use what it has learned to make determinations about new data, i.e. photo Y = dog but photo Z = not dog.
Neural networks are a huge and complex topic. In later articles, we will try to break these down into more detail given the huge interest and mysticism regarding their abilities.
It’s not that hard to build one from scratch, and even easier using pre-built code libraries, but the underlying mathematical concepts can take some time to get your head around. At the end of the day, neural networks combine matrices, linear algebra and some other clever maths, but they aren’t brains. As always, it’s maths not minds that drive today’s A.I.
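To back up that claim, here is a minimal from-scratch sketch: a tiny neural network, assuming only NumPy, learning the classic XOR function (output 1 only when exactly one input is 1). The layer sizes, learning rate and iteration count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

# Two layers of weights and biases: just matrices and vectors.
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
lr = 1.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # Forward pass: each layer is a matrix multiply plus a squashing function.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Backward pass (backpropagation): work from the output error back
    # through the layers to see how each weight contributed to it.
    output_error = (output - y) * output * (1 - output)
    hidden_error = (output_error @ W2.T) * hidden * (1 - hidden)

    # Gradient descent: nudge every weight slightly "downhill".
    W2 -= lr * hidden.T @ output_error
    b2 -= lr * output_error.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ hidden_error
    b1 -= lr * hidden_error.sum(axis=0, keepdims=True)

print(output.round(2).ravel())  # should approach [0, 1, 1, 0]
```

The whole “brain” above is two matrix multiplications and a bit of calculus.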
Check out the resources at the end of this article if you want to dive deeper (pun intended).
How do machine and deep learning differ?
means and ends
DL is a subfield of ML. In this sense ML and DL share many characteristics, including their ability to detect patterns from data and learn how to perform a specific task with improving performance over time given further inputs.
There are however many differences between ML and DL both in means and application.
The main differences that distinguish DL from ML are fourfold:
- data consumption;
- dedicated hardware;
- feature extraction; and
- use cases.
01. Data Consumption
DL requires a vast number of labelled samples for it to succeed. For this reason, the explosion of data over the past few decades has enabled DL as a viable technique (along with cheaper and more powerful hardware, of which see below).
However, the quantity of data isn’t in and of itself enough: it needs to be of the right quality, i.e. labelled. Not all data collected is labelled, labelled correctly or in a manner suitable for DL. Nor is such data always publicly accessible.
Where this is the case you (or someone on your behalf) must undertake a data labelling exercise, which is costly in terms of time and money and often requires a defined and rigorous set of procedures, quality controls and domain expertise.
Unfortunately this fact, and its impact on DL’s utility for real-world problems, is often downplayed in discussions concerning DL (and to a similar degree in discussions regarding ML).
02. Dedicated Hardware
The training phase of DL systems typically requires dedicated hardware such as Graphics Processing Units (GPUs) to reduce execution time to something manageable, i.e. hours, days or weeks vs. years. These systems, although increasingly cheaper, are still expensive vs. the needs of simpler ML set-ups.
03. Feature Extraction
Feature extraction (aka feature engineering) is the process of putting domain knowledge into the creation of feature extractors to reduce the complexity of the data and make patterns more visible to learning algorithms.
This process is difficult and expensive in terms of time and expertise.
This is best explained via an example:
- Assume you are building a system that will learn to classify images as either Car or Not Car.
- In classical ML, the algorithmic approach will use data to learn whether the image is Car or Not Car. To help this along, a human might have labelled constituent features indicative of Car (e.g. wheels) in the images, thereby providing extra features the system can use in its assessment.
- By contrast, a DL solution will itself attempt to determine which parts of the image make up the car, e.g. wheels, wing mirrors, headlamps, windscreen etc.
- As a result, DL can reduce the amount of hard coding humans have to apply to define features in datasets. This is the difference between having to label the image as Car vs. Not Car and having to do that plus label other data about the image such as wheels, windscreen, wing mirror etc that indicate Car or Not Car.
A legal example might be contract due diligence software. Vendors often offer “pre-trained” or “out of the box” provision models that can identify common clauses without the user specifically training the system (as we note in this article, that doesn’t mean it’s magically unsupervised learning as some vendors suggest!).
In that scenario the vendor will have employed domain experts — i.e. lawyers — to label clauses by type, e.g. French governing law, English governing law and so on.
The vendor might also ask the domain experts to label features about those clauses that relate to the overall label. For instance, the presence of the words “French”, “governing” and “law” might be additional features worth extracting to improve the algorithm’s performance, given the presence of these words (i.e. features) within a clause strongly suggests the label might be “French Governing Law”.
A DL approach would ideally learn that these features are important and extract them along with the attendant classification label.
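As a hedged sketch of the difference, here is what hand-crafted feature extraction for that governing law example might look like in code. The keyword list, clauses and labels are all invented for illustration:

```python
from sklearn.linear_model import LogisticRegression

# The classical ML route: a human decides which signals matter. Here,
# simple "does this keyword appear?" indicators.
KEYWORDS = ["french", "english", "governing", "law", "jurisdiction"]

def extract_features(clause: str) -> list[int]:
    words = clause.lower().split()
    return [int(keyword in words) for keyword in KEYWORDS]

clauses = [
    "This agreement is subject to English governing law",
    "French governing law applies to this agreement",
    "The English courts have exclusive jurisdiction",
    "All disputes are resolved under French law",
]
labels = ["english", "french", "english", "french"]

X = [extract_features(c) for c in clauses]
model = LogisticRegression().fit(X, labels)
print(model.predict([extract_features("This clause selects French governing law")]))
```

A DL approach would instead consume the raw clause text and, given enough labelled examples, learn for itself which words (and combinations of words) signal each label.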
04. Use Cases
DL is more expensive in terms of:
- time, both to set up and to run;
- hardware; and
- data.
As a result we can generalise and say that DL and ML are best used as follows:
| | Machine Learning | Deep Learning |
| --- | --- | --- |
| Data Volume | Better than DL on small datasets. | Better than ML on large datasets. |
| Computational Cost | Cheaper than DL in execution time and hardware. | More expensive than ML in execution time and hardware. |
| Adaptability | Domain-specific and application-specific techniques and feature engineering are required to build high-performance models, so the resulting models are less adaptable, even within similar domains. | DL techniques adapt to different domains and applications far more easily. For example, once one understands the underlying deep learning theory for speech recognition, learning how to apply deep networks to natural language processing isn’t too challenging, since the baseline knowledge is quite similar. |
| Feature Engineering | Often requires complex feature engineering, which is costly in time and in hiring or contracting domain expertise. | Can eliminate or reduce the need for complex feature extraction, cutting the time and cost of that step, albeit potentially in exchange for greater hardware and execution-time costs. |
| Interpretability | Thanks to feature engineering and simpler models, generally easier to interpret: it is easier to understand how and why the algorithm arrived at an outcome. This can be incredibly helpful, and sometimes necessary (e.g. in regulated environments), when unwinding and correcting a system that produces incorrect results in unexpected circumstances. | Less interpretable. Often seen as “black box” systems that researchers struggle to unwind in order to explain how and why a particular outcome was reached. That said, there continue to be significant developments opening up the black box, so this distinction may reduce over time. |
What is the relationship of ML and DL with A.I.?
Finally, let’s be clear: DL is a subtype of ML and each is a subtype of A.I. When we talk about A.I. today we are really talking about ML and DL.
ML and DL are the major techniques (along with rules and search) via which today’s A.I. happens. There is no “other” A.I. to speak of in practical terms, although there are plenty of theoretical means by which we might create A.I. (including biological or hybrid biological means as explored in Nick Bostrom’s excellent book, Superintelligence).
As we’ve covered in a previous article, another way to categorise today’s A.I. is by its ability to generalise or not.
By this, we mean an A.I.’s ability to learn how to perform any given task characteristic of human intelligence vs. being a specialist at one task and one task only. In the case of the former, we say the A.I. is able to generalise its ability from one domain to another whereas in the latter this simply isn’t possible.
Through that lens (i.e. the ends), all of today’s A.I. — whether ML or DL is used (i.e. the means) — are Artificial Narrow Intelligence (“ANI”).
This is because they do one thing well and one thing only. Google DeepMind’s AlphaGo can only play Go; it can’t play chess and is therefore an ANI despite its incredibly impressive technical feats.
A nice way to put this all together is the below diagram:
So hopefully you now have a better understanding of ML and DL, the differences between them and their relationship with A.I. In turn, this will assist your understanding and further exploration of A.I. systems, whether for your business problem or general interest!
Bonus Material
nerd into neural networks
In the meantime, if you want to learn more about deep learning and neural networks but can’t wait for our follow-up articles on those topics, check out the below video, which is a friendly and visual way to understand the key components and intuitions of DL:
After watching that, provided you want to learn more, check out the additional videos below by Grant Sanderson at the incomparable 3Blue1Brown (see also their YouTube channel). These videos are probably the best visual introductions to neural networks and some of the maths behind them.
Each video gets progressively deeper (pun intended) in understanding and complexity. Seeing neural networks visually really helps you understand what’s going on, how they work and why. We suggest watching the videos in the following order:
01. But what *is* a Neural Network?
This one is a gentle top-down introduction to neural networks. For a lot of people, this will be as much as you need / want to know in a legal context.
If you want to understand the underlying mechanics in more detail, progress to videos 2–4 below.
02. Gradient descent, how neural networks learn
Gradient descent is a fundamental technique in ML and DL. Essentially, it is how the algorithm decides what adjustments to make after each iteration in order to progress towards the desired level of performance at a given task.
Bear in mind this is where things start to get more mathsy!
03. What is backpropagation really doing?
Backpropagation is a tricky topic. If you can understand it, you’re miles ahead of most people. But don’t feel you need to understand this topic unless you are a data scientist or ML engineer, in which case you must.
Essentially, backpropagation is how a neural network works backwards from a given output to determine what adjustments to make to the various “weights” in the network, i.e. the values that control how much significance each feature (and the relationships between features) carries, so that inputs produce the correct outputs.
04. Backpropagation calculus
Same focus as 3, but goes into greater detail regarding the underlying calculus. Be warned, also very mathsy.
Originally published at lawtomated.