Structured Data vs. Unstructured Data: what are they and why care?

13 min read · Apr 7, 2019

Many legaltech products talk about turning unstructured data into structured data, or at least being able to work with unstructured data. Similarly, in April 2019 Google announced a play for the contract extraction space with its Document Understanding AI. That product’s marketing and positioning explicitly describe it in these terms.

So what does that mean? What’s the difference between the two types of data and what examples can we identify in a legal context? Also, why should we care about transforming unstructured data into structured data? What problem is that solving?

Structured Data
20% of all data

What is it?

Structured data resides in relational databases: a database structured to recognise relations between stored items of data. Databases of this type are typically managed via a relational database management system (“RDBMS”).

This is usually what people think of when they think of a database, i.e. a table of rows and columns containing related information. There are of course many different flavours of database, which we will cover in subsequent articles.

For now, it’s easiest to think of something like this:

ID  Forename   Surname         Age
0   Arnold     Schwarzenegger  71
1   Sylvester  Stallone        72
2   Chuck      Norris          79

An RDBMS uses structured query language (“SQL”) to access and manipulate the items it stores. In SQL and in general RDBMS terminology, the features of the table above are described as follows:

SQL     RDBMS              Description
Row     Tuple / Record     A data set representing a single item, e.g. Arnold Schwarzenegger’s data described above
Column  Attribute / Field  A specific and labelled element of a row, e.g. “Age”
Table   Relation           A set of rows and columns sharing the same attributes, i.e. organising the same information about a set of data objects

The benefit of structured data is that it is labelled to describe its attributes and its relationships with other data. This makes it easily searchable using a human-written or algorithmically generated query.
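To make the table and query idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name "actors" and the column names are assumptions chosen for illustration; only the three rows come from the example table above.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")

# The schema labels every attribute (column) up front.
# "actors" is a hypothetical table name used only for this illustration.
conn.execute(
    "CREATE TABLE actors ("
    "id INTEGER PRIMARY KEY, forename TEXT, surname TEXT, age INTEGER)"
)
conn.executemany(
    "INSERT INTO actors (id, forename, surname, age) VALUES (?, ?, ?, ?)",
    [
        (0, "Arnold", "Schwarzenegger", 71),
        (1, "Sylvester", "Stallone", 72),
        (2, "Chuck", "Norris", 79),
    ],
)

# Because "age" is a labelled attribute, a query can target it directly.
over_71 = conn.execute(
    "SELECT forename, surname FROM actors WHERE age > ? ORDER BY age", (71,)
).fetchall()
print(over_71)  # [('Sylvester', 'Stallone'), ('Chuck', 'Norris')]
```

The query never has to guess where the age lives in each record, because the schema has already said so. That is the practical meaning of "structured".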

Unstructured Data
80% of all data

What is it?

Unstructured data is everything else. Unstructured data:

has an internal structure (i.e. bits and bytes)
but is not structured via pre-defined data models or schema, i.e. not organised and labelled to identify meaningful relationships between data

It may be textual / non-textual. It may be human / machine-generated. It might also be stored within a non-relational (NoSQL) database.

Human generated unstructured data

Typical human-generated unstructured data includes:

Text files: word processing files, spreadsheets, presentations, emails.
Email: largely text, but has some internal structure thanks to its metadata (e.g. including the visible “to”, “from”, “date / time”, “subject” entered to send an email) but also mixes in unstructured data via the message body. For this reason, email is also referred to as semi-structured data.
Social Media: like email, this is often semi-structured data, containing unstructured data (e.g. a Tweet) but also structured data (e.g. the number of “Likes”, “retweets”, “date”, “author” etc).
Websites: YouTube, Instagram etc contain lots of unstructured data, but also much structured data, e.g. as described above for social media.
Mobile data: text messages, locations.
Communications: IMs, dictaphone recordings.
Media: MP3, digital photos, audio recordings and video files.
Business applications: MS Office documents, PDFs and similar.
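The semi-structured nature of email described in the list above can be seen with Python's standard email module. This is only a sketch: the addresses, subject and body text are invented for illustration.

```python
from email import message_from_string

# A made-up message: headers first, then a blank line, then free text.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Draft SPA\n"
    "Date: Mon, 01 Apr 2019 09:30:00 +0000\n"
    "\n"
    "Hi Bob, please see the attached draft. The indemnity cap is still open.\n"
)

msg = message_from_string(raw)

# Structured part: labelled header fields that can be looked up by name.
structured = {field: msg[field] for field in ("From", "To", "Subject", "Date")}

# Unstructured part: the body is just a string with no labels inside it.
body = msg.get_payload()

print(structured["Subject"])  # Draft SPA
print(body)
```

The headers can be filtered and sorted like database columns. Nothing in the body, by contrast, tells a machine that "indemnity cap" is a clause topic or that it is unresolved.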

Machine generated unstructured data

Common types of machine-generated unstructured data include:

Satellite imagery: weather data, geographic forms, military movements.
Scientific data: oil and gas exploration, space exploration, seismic imagery and atmospheric data.
Digital surveillance: CCTV.

Unstructured legal data

In the legal context, unstructured data is common across the following areas:

Document / Email Management: although the organisation of the DMS is structured (e.g. basic metadata (data about data): file names, doc IDs, version numbers, creation / edit / read dates etc) the most valuable content is unstructured, i.e. the contents of the constituent documents and emails. For this reason, it is often a pain to search and analyse this data in a meaningful manner, e.g. finding a specific clause wording requires finding target document types, opening those and scrolling around inside, because there is no structured data about the content of that document (i.e. down to the clause or intra-clause level), only its basic metadata. Unfortunately, it is precisely that type of data which is most useful, but least accessible, to a lawyer.
eDiscovery: most of the content under review is email and email attachments (i.e. MS Office docs, images, PDFs and sometimes voice), which naturally suffer from the same limitations described for document and email management.
Legal Due Diligence: the content is almost exclusively MS Word and PDF docs but also sometimes spreadsheets and slide decks — again, like the above it’s all unstructured beyond the basic metadata.

The split of structured and unstructured data
80:20

On average unstructured data makes up 80%+ of today’s enterprise data, with the remaining 20% being structured data.

Not only does unstructured data account for the majority of enterprise data, but the amount of unstructured data is also growing at an average rate of 55% — 65% per year. Granted these are both generalizations but each illustrates the general problem: unstructured data is a challenge and one which continues to grow. But why is this?

Unstructured data has grown, and continues to grow, because of:

decreasing costs of data storage and processing power;
ever-widening use of technology to create and manage work product (accelerated by minicomputers, then PCs and now mobile and IoT devices etc); and
the internet and ever-increasing interconnectedness of devices and data.

All of the above means it’s never been easier (or cheaper) to create and capture data, whether deliberately or through our interactions with the various systems of our daily lives.

The challenge of unstructured data in legal

Often, but not always, it requires a significant degree of human labour to create and maintain structured data. This challenge is no different in the law firm or in-house legal context.

Overly manual transformation

Transforming unstructured data into structured data is common within a legal context but labour intensive.

One example is the creation and curation of a deal capture report, which meaningfully labels and relates the contents of a contract to its context, i.e. the underlying transaction. This information is typically captured by the lawyer who worked on the deal and / or subsequently verified by a knowledge management lawyer who specialises in cataloguing the firm’s knowledge.

In either case, very little structured data is captured automatically via technology alone. At best the version and edit history for the document can be pulled from the document management system.

More or less all other useful data about the document and the transaction must be manually recorded, or collated from other sources, including:

filename
deal value
parties
covenant types
key clauses / key features
client industry
transaction type, subtype and so on
law firm lawyers involved
law firm role
the identity / role of each opposing law firm
client-matter number
corresponding billing data

It’s time-consuming but hugely valuable to any legal organisation. Legal organisations are their know-who (experts) and know-how (expertise), and the former are unarmed without the latter.
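As a rough illustration of what a deal capture report looks like once it is structured, the sketch below models a few of the fields listed above as a Python dataclass and reports which ones are still blank. The class name, field names and sample values are assumptions made for this example, not a description of any real KM system.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DealCaptureRecord:
    """Hypothetical record covering a subset of the fields listed above."""
    filename: str
    client_matter_number: str
    deal_value: Optional[float] = None
    parties: Optional[List[str]] = None
    transaction_type: Optional[str] = None
    client_industry: Optional[str] = None
    law_firm_role: Optional[str] = None


def missing_fields(record: DealCaptureRecord) -> List[str]:
    """Return the names of fields that nobody has filled in yet."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# A half-completed report: only what the DMS and the lawyer supplied so far.
record = DealCaptureRecord(
    filename="spa_final.pdf",
    client_matter_number="12345-0001",
    parties=["Buyer Ltd", "Seller plc"],
)

print(missing_fields(record))
# ['deal_value', 'transaction_type', 'client_industry', 'law_firm_role']
```

Once the report has a defined shape, the gaps become visible and countable. That is a precondition for chasing or automating them.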

Hopefully, this underscores the importance of unstructured data to your legal organisation, and the need to build better processes and systems to automate where possible and augment everything else in between regarding its creation or capture.

Another illustrative example is a contract due diligence or eDiscovery exercise. In either scenario, much effort is expended (even with machine learning and search techniques) sorting, tagging and organising data into relevant subsets capable of interpretation and resultant advice.

Volume

Combine the above with huge volume (as is the case for KM, DD and eDiscovery) and it becomes nigh on impossible to sensibly manage and make the best use of a firm’s (or a client’s) unstructured data via traditional means alone without compromising in some material aspect, e.g. capturing less data or capturing data less frequently.

Relevance: correlation vs. causation

Machine learning and data science techniques can augment, and in some cases automate away, human efforts to transform data. However, such techniques often run into the classic correlation is not causation dilemma due to their statistical and probabilistic underpinnings.

One example is clause libraries. A lot of vendors talk about using their A.I. due diligence system to create a clause library.

Simplistically this is doable. But is the result useful? Often not. Why is this? The answer is that these techniques usually:

Group and extract clauses based on syntactic but not semantic similarity. This is a blunt tool in the legal context where vastly different meaning might, and often does, hinge on the presence or absence of a negative (e.g. “not”) or punctuation mark, making a syntactic comparison alone somewhat ineffective and potentially misleading.
Fail to recognise or anticipate that similar clauses may be more or less relevant depending on whether they are friendlier to one side of the contract than the other, e.g. buyer vs. seller friendly termination provisions.
Fail to appreciate that a clause sitting in a signed document on a firm’s document management system is not necessarily a “gold standard” clause to be re-used. This misunderstands negotiation, whereby it is perfectly sensible and often necessary to agree a worse position on clause A to secure a better position on clause B if the latter matters more than the former to your client.
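The first point in the list above, that syntactic similarity can hide a reversal of meaning, is easy to demonstrate. The sketch below uses difflib.SequenceMatcher from the Python standard library as a simple stand-in for a syntactic comparison. It is not the method used by any particular vendor, and the two clauses are invented for illustration.

```python
from difflib import SequenceMatcher

# Two invented clauses that differ only by the word "not".
clause_a = (
    "The Seller shall indemnify the Buyer against all losses "
    "arising from a breach of warranty."
)
clause_b = (
    "The Seller shall not indemnify the Buyer against all losses "
    "arising from a breach of warranty."
)

# ratio() returns a character-level similarity score between 0 and 1.
similarity = SequenceMatcher(None, clause_a, clause_b).ratio()
print(f"Character-level similarity: {similarity:.2f}")
```

The score comes out well above 0.9, even though the two clauses have opposite legal effect. A tool that groups clauses on this kind of measure alone would happily file them together.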

As you can see, confusing correlation with causation undermines relevance, the ultimate arbiter of whether such systems are useful or merely technologically clever. In some ways these systems become solutions in search of a problem, having solved the wrong problem to begin with!

That said, being able to surface 100 change of control provisions that are syntactically similar is a better starting point than 100 documents to be separately opened and scrolled / searched to find relevant clauses. But the point remains: such solutions are a foundation toward a better structure, not the end-to-end solution, without the deeper understanding of the problem described above. The good news is that tools able to search for clauses based on semantic meaning are gradually emerging, although in many cases they have a long way to go before they are robust enough for legal use.

Quality

By nature, a large volume of unstructured data is unverified and / or incomplete.

There are plenty of jokes about “Instagram lives,” in which a person’s Instagram updates are more fantasy than reality. The same goes for enterprise data, which is frequently incomplete (e.g. the associate that half completed a deal capture report) or entered incorrectly and awaiting verification that may never arrive (e.g. that same associate making mistakes due to exhaustion after several all-nighters in the office).

On an enterprise level, making business decisions based on inaccurate or incomplete data is at best a massive inconvenience in terms of having the right information at the right time, e.g. when negotiating a document you need to find that precedent you remember from weeks or months back with just the right wording. At worst, decisions based on inaccurate or incomplete data can be extremely costly if they lead to mistakes.

In particular, for legal contexts, the physical quality of documents can be a further unstructured data blocker. PDFs are used to lock down an authoritative “final” version of the signed contract for evidential reasons. However, whilst it is possible to PDF the final contract and insert into that PDF the PDF signed signature pages and thereby preserve the text layer for the body of the document, this best practice is not often followed.

Instead, the fully signed contract is more often scanned through a scanner, turning it into an image, thereby removing any machine-readable text layer previously present in the document. Scanning also introduces other data integrity issues, e.g. text obscuring features such as speckling, shadowing, marks, manuscript elements, stamps, watermarks and stains.

Attempts to use optical character recognition (“OCR”) to turn that image into (or back into) a machine-readable text document will be lossy, i.e. the mechanically recovered data will be incomplete, incorrect and potentially unverifiable if a human cannot confidently discern by eye what text should have been identified.

Unfortunately, this is the theoretically avoidable (but in practice unavoidable) starting point for most A.I. tools used in contract due diligence and eDiscovery. These challenges shall remain so long as contracts live and die in the PDF format alongside poor practices surrounding the PDFing of docs. Creating and maintaining contracts in a structured format from cradle to grave would massively expedite the use of A.I. in legal contexts for documentary data, whether for eDiscovery, due diligence or KM.

The opportunity for unstructured data in legal

If we’ve done our job correctly, it should go without saying by now that better creation, capture and maintenance of unstructured data (or simply ensuring more data is structured, or at least semi-structured, to begin with) supercharges the opportunities to do more with that information! These include the following:

Search

The more structured the data the easier it is to search, filter and sort. This is why listing websites require listing agents to complete large volumes of data in a structured format via a form, e.g. for a hotel this might include filling out a structured form to capture the address, hotel type, number of rooms, facility types, distance from town centre etc.

Hence, the resulting search abilities allow you to be very specific about the results that matter most to you, e.g. a hotel that has:

a swimming pool
a gym
included breakfast
free wifi
less than 1 mile from the town centre
suitable for couples
in-room hot-tub
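The hotel search above can be sketched as a filter over structured records. The hotels, field names and values below are invented for illustration, and only some of the listed criteria are modelled to keep the example short.

```python
# Each hotel is a record with the same labelled fields (all data is made up).
hotels = [
    {"name": "Hotel A", "pool": True, "gym": True, "breakfast": True,
     "free_wifi": True, "miles_to_centre": 0.4},
    {"name": "Hotel B", "pool": True, "gym": False, "breakfast": True,
     "free_wifi": True, "miles_to_centre": 0.2},
    {"name": "Hotel C", "pool": True, "gym": True, "breakfast": True,
     "free_wifi": True, "miles_to_centre": 2.5},
]


def matches(hotel):
    """True if the hotel meets every criterion the searcher ticked."""
    return (
        hotel["pool"]
        and hotel["gym"]
        and hotel["breakfast"]
        and hotel["free_wifi"]
        and hotel["miles_to_centre"] < 1
    )


results = [hotel["name"] for hotel in hotels if matches(hotel)]
print(results)  # ['Hotel A']
```

Every criterion maps onto a labelled field, so combining them is trivial. The same search over free-text hotel descriptions would need someone, or something, to read each description.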

Capturing this type of data about the contents of documents (including down to the clause and intra-clause level), whether manually, via an augmented process to guide the user, or via an automated one, can significantly enhance opportunities to use and interpret the underlying data.

KM, negotiation and just in time information

Flowing from the above, this exercise also enhances KM. The more you capture about documents, the better your ability to manage and find that data at just the right time. This is known as just in time information.

In an ideal legal world, an example might be receiving a marked-up contract from the other side’s lawyers. If your contract drafting / review tool can highlight similarly worded clauses to the changed wording you’ve received and relate that to the context, e.g. you acting for the buyer and the other lawyers being firm X, then you are in a better position to understand what might be acceptable changes based on historical data in similar scenarios. But as noted above, correlation is not causation: lawyer skills are still required but it cuts down on the wasted time searching for the last X type of clause wording in Y type of doc negotiated by firm Z in a deal of type A.

Analytics & data driven decisions

This is really an extension of, or overlap with, the foregoing point. In addition to having just in time information at critical negotiation points, it becomes possible to analyse your data to inform how you develop templates and precedent wording, and potentially also how you provide active advice and thought leadership on market trends for contract drafting.

It might also be possible to identify where you are spending inordinate time negotiating clauses only to end up 5% off of where you began. Equally, it might highlight clauses in your standard form that are always deleted or virtually entirely amended through negotiation. In either case, that might suggest:

the offending clause(s) needs removal or significant revision to adjust to market practice;
a change in the law has not yet been reflected in the drafting; and / or
changes to the underlying mechanics of the commercial bargain dictate this type of wording no longer makes sense.
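As a minimal sketch of this kind of analysis, assume each negotiated deal has been logged as (clause, outcome) pairs. The log, the outcome labels and the 80% threshold below are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical log of how standard-form clauses fared across past deals.
negotiation_log = [
    ("exclusivity", "deleted"),
    ("exclusivity", "deleted"),
    ("exclusivity", "heavily amended"),
    ("governing law", "accepted"),
    ("governing law", "accepted"),
    ("governing law", "accepted"),
    ("liability cap", "heavily amended"),
    ("liability cap", "accepted"),
    ("liability cap", "accepted"),
]

# Outcomes treated as a rejection of the standard wording (an assumption).
REJECTED = {"deleted", "heavily amended"}

totals = defaultdict(int)
rejected = defaultdict(int)
for clause, outcome in negotiation_log:
    totals[clause] += 1
    if outcome in REJECTED:
        rejected[clause] += 1

# Share of deals in which each clause did not survive negotiation.
rejection_rate = {clause: rejected[clause] / totals[clause] for clause in totals}

# Flag clauses rejected in at least 80% of deals as candidates for redrafting.
flagged = sorted(c for c, rate in rejection_rate.items() if rate >= 0.8)
print(flagged)  # ['exclusivity']
```

The arithmetic is trivial. The hard part, as the rest of this article argues, is getting clause-level outcomes recorded in a structured form in the first place.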

This information might also be used to your advantage, e.g. to overstate a clause’s importance to the other side knowing it is a bargaining chip to be traded for something more valuable elsewhere in the contract.

Automation & augmentation

Having more structured data from the outset makes it easier to populate and interrelate that data with other systems via application programming interfaces (“APIs”).

For instance, if contracts are created in a structured format they interoperate more easily with trade and other regulatory reporting tools, which typically require users to manually fill out tens to hundreds of form fields with discrete data or tags based on the wording in a contract (i.e. in a similar way to the KM deal capture example described above).

Structuring this data can help automate or at least augment that process, e.g. if the system cannot be 100% accurate at populating a deal capture report it might nevertheless be able to capture 80% well enough that it significantly reduces manual population and verification. Having a way to tag data down to the clause and intra-clause level as documents are being created and maintained would aid this exercise.
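One way to picture this "augment rather than fully automate" approach is a confidence threshold: values the system is reasonably sure about are filled in automatically and the rest are routed to a human. The sketch below assumes a hypothetical extraction step that has already produced a value and a confidence score for each field. The field names, values, scores and the 0.8 cut-off are all invented.

```python
# Assumed cut-off; in practice this would be tuned per field and per risk.
CONFIDENCE_THRESHOLD = 0.8

# Hypothetical output of an extraction tool: field -> (value, confidence).
extracted = {
    "deal_value": ("GBP 25,000,000", 0.95),
    "parties": ("Buyer Ltd; Seller plc", 0.91),
    "transaction_type": ("share purchase", 0.62),
    "governing_law": ("England and Wales", 0.88),
    "law_firm_role": ("buyer's counsel", 0.40),
}

# Populate the report automatically where confidence is high enough.
auto_filled = {
    field: value
    for field, (value, confidence) in extracted.items()
    if confidence >= CONFIDENCE_THRESHOLD
}

# Everything else goes to a lawyer or KM specialist for checking.
needs_review = sorted(
    field
    for field, (_, confidence) in extracted.items()
    if confidence < CONFIDENCE_THRESHOLD
)

print(sorted(auto_filled))  # ['deal_value', 'governing_law', 'parties']
print(needs_review)         # ['law_firm_role', 'transaction_type']
```

Here three of five fields are populated without manual effort and human attention is focused on the two uncertain ones.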

And yes… that also means blockchain and smart contract technologies might usefully be integrated to the extent there is a problem to be solved that can’t be solved via other extant means. Although, as with A.I. (see next), these technologies are overhyped, misunderstood and are frequently solutions in search of a problem.

A.I.

Lastly, A.I. relies on machine learning today (and often also rules and search techniques). To work well, machine learning needs good quantities of good quality data. Supervised learning, in particular, requires large volumes of well-labelled data, i.e. semi-structured or structured data, e.g. not just examples of clauses, but clauses labelled to identify their type and potentially other metadata such as whether they are buyer or seller friendly.

Again, having solutions to capture and curate this data easily and at scale can be a massive enabler of the suitability and success or failure of these projects, and of the potential for meaningful, scalable ROI. We will cover this in more detail via subsequent articles.
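To show what "well-labelled" means in practice, here is a small sketch of labelled clause data of the kind a supervised learning project would need. The clause texts, label names and label values are invented, and a real training set would be far larger.

```python
from collections import Counter

# Each example pairs raw clause text with labels a human has assigned.
labelled_clauses = [
    {
        "text": "Either party may terminate this Agreement on 30 days' written notice.",
        "clause_type": "termination",
        "favours": "neutral",
    },
    {
        "text": "The Buyer may terminate this Agreement immediately if the Seller breaches any warranty.",
        "clause_type": "termination",
        "favours": "buyer",
    },
    {
        "text": "The Seller's total liability shall not exceed the Purchase Price.",
        "clause_type": "limitation_of_liability",
        "favours": "seller",
    },
]

# Inputs and target labels, in the shape a supervised model would be trained on.
texts = [example["text"] for example in labelled_clauses]
labels = [example["clause_type"] for example in labelled_clauses]

# A quick check of how many examples exist per label.
label_counts = Counter(labels)
print(label_counts["termination"])  # 2
```

Without the "clause_type" and "favours" labels, the same three sentences would just be more unstructured text, and a supervised model would have nothing to learn from.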

Conclusion

Hopefully you now understand:

The difference between (a) structured data and (b) unstructured data.
Examples of each type, both in general and in legal.
The challenges and opportunities for unstructured data in legal.

This should help you understand and navigate these terms and their impact when assessing vendor solutions that talk to these concerns.

Notably, we’ve not described in detail the solutions necessary to deliver on the opportunities identified above, in particular projects and products trying to create documents as structured objects. That’s deliberate to keep this post succinct. We will follow up with a subsequent piece to that end!

In the meantime, sit back, relax and enjoy this neat graphic summarising the key points from this article:

https://www.igneous.io/blog/structured-data-vs-unstructured-data

Originally published at lawtomated.

Written by Lawtomated
