At a Records and Information Management conference, I recently watched a demo of the IBM Watson natural language processing (NLP) tool that showed how police use it for criminal investigations. I was struck by the thought that this fascinating tool may have led some Records and Information Managers in the audience to believe that this was the new, disruptive way to manage records.
CAN YOU RUN A RECORDS AND INFORMATION MANAGEMENT PROGRAM ON POST-CLASSIFICATION AND MACHINE LEARNING?
IBM Watson can be viewed as a post-classification tool, representing a whole group of NLP tools often marketed under the Big Data umbrella. Post-classification tools use algorithms to produce a classification for an information asset after the asset is created or stored. In fact, every major supplier and the open source community offer similar tools. The sheer number of options can seem overwhelming, even for those in the eDiscovery space.
The demo was really impressive. It was almost as if Watson went through the test documents by magic, identifying people, organizations, places, phone numbers and frequently used terms. A click on one of the filtered terms lists its occurrences across various documents. The tool can also produce graphics that connect interrelated terms visually. This is, of course, quite a departure from the rather minimalistic user interfaces typically found in electronic records management systems (ERMS).
What is attractive about potentially using such an analytics tool for Records and Information Management is its ability to reduce the complexity of information management processes, in particular, capturing metadata. It could:
- save us from lengthy discussions with information asset creators about providing the right metadata, and about how and in what format to deliver it
- eliminate the need to design metadata structures for document types, simplifying forms management and the optical character recognition (OCR) process during scanning
- allow users to search for criteria originally not available for lookup
- provide the ability to search for important emails in the email archive or unstructured documents.
No doubt, such a tool can look like a silver bullet to Records and Information Managers.
However, two things should be kept in mind: with a tool like Watson, you (a) buy into algorithms and (b) require these algorithms to build the structure that has not been created and provided upfront, i.e. during an on-boarding analysis process. Both issues come down to one question: what sort of structure is actually used or produced by such algorithms?
NLP and text mining algorithms (with impressive names like Latent Semantic Indexing or Naïve Bayes) work on a mathematical-statistical basis, building models from the frequency and proximity of terms in a given document or set of documents. Typically, they don’t work with pre-existing ontologies, and they don’t really understand what they recognize. The algorithms are able to identify amounts in documents, but what sort of amounts are they? This is left to human review to sort out, but at least these tools can give us hints.
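To make the point about amounts concrete, here is a minimal sketch. It uses only a word count and a regular expression, which is far simpler than Latent Semantic Indexing or Naïve Bayes, and the sample text is invented for illustration. It shows that a purely statistical or pattern-based pass can find terms and amounts without knowing what they mean.

```python
import re
from collections import Counter

# Hypothetical sample text, invented for this illustration.
text = (
    "Invoice 4711 for client 20931: amount due 1,250.00 EUR. "
    "Previous payment of 300.00 EUR received. Client 20931 confirmed the invoice."
)

# Frequency: count how often each lowercase word occurs.
words = re.findall(r"[a-z]+", text.lower())
term_counts = Counter(words)

# Pattern matching: pick up anything shaped like a decimal amount.
amounts = re.findall(r"\d{1,3}(?:,\d{3})*\.\d{2}", text)

print(term_counts.most_common(3))
print(amounts)
```

The pattern finds both amounts, but nothing in the result says which one is the amount due and which one is a past payment. That distinction is exactly the semantics a human reviewer, or an upfront metadata definition, has to supply.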
PRE-CLASSIFICATION IS STILL THE WAY TO GO FOR RECORDS AND INFORMATION MANAGEMENT
The classification structure for Watson is determined by its algorithms, whose outcomes are somewhat unpredictable. Tools like Watson are widespread in the forensics and e-discovery communities, as these domains are often interested in unknown bits of information in datasets. But for Records and Information Management usage, this approach is simply not good enough. Precision and recall seem too arbitrary to truly satisfy the findability requirement stipulated in ARMA’s Generally Accepted Recordkeeping Principles, for example.
For simplicity, I am defining a record as something that is created by a business process. Consequently, inasmuch as business processes are defined and executed in a controlled environment, the properties found in records and their metadata are also well known, and their possible values are actually restricted. Here are some of those properties:
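For readers less familiar with the two measures, the following sketch shows how precision and recall are computed for a single search. The record IDs are hypothetical and the function name is my own; it is an illustration of the standard definitions, not of how any particular product reports them.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search result.

    precision = share of retrieved records that are relevant
    recall    = share of relevant records that were retrieved
    """
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical record IDs, invented for this illustration.
relevant = {"R1", "R2", "R3", "R4"}   # records that should be found
retrieved = {"R1", "R2", "R9"}        # records the tool actually returned

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this made-up case, half of the records that should have been found are missing from the result. For an investigator looking for leads that may be acceptable; for a records manager who must produce every record on request, it is not.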
- client ID
- transaction number
- transaction type
- privacy level
It is fairly intuitive what the metadata terms above mean. The size of their value sets (i.e. possible values) may differ dramatically, from a handful (privacy level) to perhaps billions (transaction numbers), but both are restricted value sets and not free, unstructured text. There are semantics behind these properties. Additionally, client IDs and transaction numbers may overlap in format and value, creating serious problems for post-classification algorithms.
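The overlap problem can be sketched in a few lines. The formats below are assumptions made up for this example (both identifiers as plain 8-digit numbers); real systems will differ, but the effect is the same whenever two properties share a format.

```python
import re

# Assumed, illustrative formats: both identifiers are plain 8-digit numbers.
CLIENT_ID = re.compile(r"\d{8}")
TRANSACTION_NUMBER = re.compile(r"\d{8}")

def candidate_properties(token):
    """Return every property whose format the token satisfies."""
    patterns = {
        "client ID": CLIENT_ID,
        "transaction number": TRANSACTION_NUMBER,
    }
    return [name for name, pattern in patterns.items() if pattern.fullmatch(token)]

# A number found somewhere in a document, with no metadata attached.
matches = candidate_properties("10482295")
print(matches)
```

Looking at the value alone, the token qualifies as both a client ID and a transaction number. Only the producing application knows which one it is, which is why the meaning has to travel with the value.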
Pre-classification, or classifying information before creating or storing it, is clearly still the way to go for Records and Information Management professionals. Declaring the semantics of each required metadata property that describes a record and promotes finding it, and having the business process owner and the producing application supply the correct values, remains an indispensable activity even in times of “machine learning”.
That said, I would not encourage you to dismiss NLP or text mining. These tools can be very helpful if you need to classify, say, millions of office documents on shared drives or SharePoint sites. But make sure it is you, the information professional, who supplies the terms, taxonomies, ontologies and document types of the knowledge domain to the analytics tool so the classification does not become arbitrary. It’s all about creating well-understood, intended structure.
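As a rough sketch of what declaring semantics upfront can look like, the example below defines the four properties listed earlier as a typed structure with restricted value sets. The class name, the allowed transaction types, the privacy levels and the digits-only rule are all assumptions chosen for illustration, not a reference to any particular ERMS.

```python
from dataclasses import dataclass
from enum import Enum

class PrivacyLevel(Enum):
    # Assumed value set; a real program would define its own levels.
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"

# Assumed value set for transaction types.
TRANSACTION_TYPES = {"deposit", "withdrawal", "transfer"}

@dataclass(frozen=True)
class RecordMetadata:
    """Metadata the producing application must supply with each record."""
    client_id: str
    transaction_number: str
    transaction_type: str
    privacy_level: PrivacyLevel

    def __post_init__(self):
        # Reject values outside the declared value sets at capture time.
        if not self.client_id.isdigit():
            raise ValueError("client_id must contain digits only")
        if not self.transaction_number.isdigit():
            raise ValueError("transaction_number must contain digits only")
        if self.transaction_type not in TRANSACTION_TYPES:
            raise ValueError(f"unknown transaction type: {self.transaction_type}")
        if not isinstance(self.privacy_level, PrivacyLevel):
            raise ValueError("privacy_level must be a PrivacyLevel")

record = RecordMetadata("10482295", "77000123", "transfer", PrivacyLevel.INTERNAL)
print(record)
```

Because each value arrives in a named field, the same 8-digit number is never ambiguous, and a value outside the declared set is rejected when the record is created rather than discovered later during a search.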