Keith Clouston, Head of Data Management SGX [SGX: S68]
The variety and volume of data collected by organisations today are greater than they have ever been. In addition, advances in technologies such as machine learning and predictive analytics offer immense potential to leverage and add value to that data. For the financial industry, however, properly harnessing data in a timely manner is an ever-growing challenge.
All too often, you will hear of data lakes having become “data swamps” – usually a result of growing too quickly and without proper controls in place. Another common problem is that data scientists spend approximately 80 per cent of their time finding, cleaning and reorganising data before doing any actual analysis.
At Singapore Exchange (SGX), data management serves as the bridge between raw, unprocessed data and the analytics created by data science teams. Data management processes ensure that data is of a quality suitable for use, that it is easy to understand and that proper data access is in place. This ultimately spares data scientists much of that upfront effort.
Understanding the Data
Most data sets used for analysis will come from a system of some form. This often means the data has been designed for an application to read and is heavily enumerated or encoded. A data scientist without firsthand knowledge of the systems involved will often need extensive data exploration before the true meaning of the data can be extracted.
In short, understanding data is a challenge – even for the data-savvy.
This problem has less to do with technology than it has to do with knowledge management. To properly address it, robust processes need to be put in place to understand the data at a technical level, catalogue it concisely, and then translate it to usable business context. In data management terms at SGX, this is done through the use of a data dictionary and a business glossary.
Defining data at the technical layer is not new to databases. Data dictionaries have been around for years and, when consolidated across multiple data sets, can be an effective way to define data lineage, acceptable value ranges and data format. However, this same document often gets used directly to derive business meaning and that is where problems start to occur. Business definitions should be decoupled and maintained separately in a dedicated business glossary.
A business glossary, as its name implies, is a list of terms key to the core business units. These business terms are owned by a respective business owner and each term comes with a formal description as well as detailed business rules. The glossary is treated as an evolving entity and undergoes regular review to ensure accurate representation of business needs as things change. Alongside business definitions, we also provide a technical description of how each business term is derived. This allows for traceability between the business glossary and the data dictionary.
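The relationship described above – a glossary term with a business owner, a formal description, business rules and traceability back to the data dictionary – can be sketched as a simple data structure. This is an illustrative model only; the field and table names below are hypothetical, not SGX's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class DictionaryField:
    """A technical data dictionary entry (names are illustrative)."""
    table: str
    column: str
    data_type: str
    allowed_values: list[str] = field(default_factory=list)

@dataclass
class GlossaryTerm:
    """A business glossary term, owned by a business unit."""
    name: str
    owner: str                           # accountable business owner
    description: str                     # formal business definition
    business_rules: list[str]            # detailed rules governing the term
    derived_from: list[DictionaryField]  # traceability to the data dictionary

# Hypothetical example term
settlement_date = GlossaryTerm(
    name="Settlement Date",
    owner="Post-Trade Operations",
    description="Date on which a trade is settled.",
    business_rules=[
        "Must fall on a valid business day",
        "Must not precede the trade date",
    ],
    derived_from=[DictionaryField("trades", "settle_dt", "DATE")],
)
```

The `derived_from` link is what allows the glossary to evolve with the business while still tracing each term back to the technical layer.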
As it evolves, the glossary will eventually mean more to the business and to data scientists than the data dictionary, as it will be set in a context directly relevant to the business and will remove much of the guesswork in finding and extracting data.
The data dictionary and business glossary are made available across all units in the organisation, with the aim of getting everyone speaking a common vocabulary and reporting to the same benchmarks.
Ensuring Data Quality
Data management has traditionally played a big part in ensuring an organisation’s data quality.
For businesses where data was generated directly from human input (form submission, for example), data quality management (DQM) teams would often take up a more operational role in organising, cleaning and monitoring the data that had been input, and ensuring minimum standards were met for processing. Newer technologies such as natural language processing (NLP) are helping reduce the need for manual effort and are allowing systems to perform the necessary monitoring.
At SGX, a large volume of our data is system-generated. This allows us to monitor data through automated data quality processes. DQM as a practice has been around for some time, so we can draw on established best practice for the automated measurement of key data quality attributes: accuracy, legitimacy, consistency, timeliness, completeness, availability and granularity.
What we have found is that these practices work very well for measuring the quality of data from a technical and even operational perspective, but they start to fall short when viewed from a business context. That is where the business glossary previously mentioned plays a key role in enhancing the measurement of data quality. Using business rules stipulated in the glossary, data quality processes can be created which report the health of data based directly on key business terms. The better the quality of data from a business context, the easier it will be to analyse and to gain insights from.
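A quality process driven by glossary rules might look like the following minimal sketch. The record layout, rule names and the "Settlement Date" rules are hypothetical examples, not SGX's actual checks; each rule is a predicate taken from the glossary, and the report gives the share of records that pass it.

```python
from datetime import date

# Hypothetical record layout; field names are illustrative.
records = [
    {"trade_id": "T1", "trade_dt": date(2024, 1, 8), "settle_dt": date(2024, 1, 10)},
    {"trade_id": "T2", "trade_dt": date(2024, 1, 9), "settle_dt": None},              # incomplete
    {"trade_id": "T3", "trade_dt": date(2024, 1, 9), "settle_dt": date(2024, 1, 5)},  # breaks a rule
]

# Rules as stipulated in the business glossary: each returns True for a healthy record.
rules = {
    "settlement date present":
        lambda r: r["settle_dt"] is not None,
    "settlement not before trade date":
        lambda r: r["settle_dt"] is None or r["settle_dt"] >= r["trade_dt"],
}

def quality_report(records, rules):
    """Report, per business rule, the fraction of records that pass."""
    return {name: sum(rule(r) for r in records) / len(records)
            for name, rule in rules.items()}

report = quality_report(records, rules)
```

Because the rules are phrased in business terms rather than technical ones, the resulting report speaks directly to the health of data as the business understands it.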
Managing Data Access
Data access management presents an interesting challenge, particularly for the financial industry. Data has fast become an invaluable asset that needs to be protected, with confidentiality among the chief concerns. In the past, the simplest way to achieve this was to lock down data sets and keep access limited to a need-to-know basis. More recently, there has been growing demand for data democracy: for data to be made freely accessible to the people entitled to see it.
It was previously enough to grant access to data at a table, object or schema level. Today, data access frameworks have to be flexible enough that permissions can be granted at a field level as well. Furthermore, there is a need to specify the level of access granted. For example, the same field in a table may need to be seen in the clear by an operations staff member, but anonymised for analysis by data scientists. Data access also needs to be defined at a function level, rather than by individual.
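The field-level, role-based model described above can be sketched as follows. The policy table, roles and field names are hypothetical; the anonymisation here is a simple one-way hash, chosen so that data scientists can still group and join on the field without seeing the underlying value.

```python
import hashlib

# Hypothetical policy: what each functional role may see, per field.
FIELD_POLICY = {
    "account_id":  {"operations": "plain", "data_science": "anonymised"},
    "trade_value": {"operations": "plain", "data_science": "plain"},
}

def anonymise(value: str) -> str:
    """One-way hash: stable for grouping, but not reversible to the original."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def view_record(record: dict, role: str) -> dict:
    """Apply the field-level policy for a functional role (not an individual)."""
    out = {}
    for fld, value in record.items():
        access = FIELD_POLICY.get(fld, {}).get(role)
        if access == "plain":
            out[fld] = value
        elif access == "anonymised":
            out[fld] = anonymise(str(value))
        # fields with no grant for this role are dropped entirely
    return out

record = {"account_id": "ACC-001", "trade_value": 15000}
ops_view = view_record(record, "operations")    # sees account_id in the clear
ds_view = view_record(record, "data_science")   # sees a hashed account_id
```

Granting by role rather than by individual keeps the policy table small and reviewable, and means access follows the function rather than the person.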
We use the business glossary and data dictionary to complement data access management by identifying the business and technical owners of data, making it easier to know who to approach for approval and who has the final say on access rights.
Data Management: the Journey Ahead
Effective data management is a people and process challenge. We have focused on changes in culture and working practice, rather than buying in costly tools. Data management standards are introduced through small, unintrusive process changes. Where possible, this is done through automation via existing tools and technologies with which the business and wider organisation are already familiar. Our main aim is to demonstrate the benefits of the changes made and win buy-in from major stakeholders. Over time, data management will evolve into a more formal operating model before a move to specialised tools is considered.
Having said that, there is no perfect data management solution, whether built or bought. Whatever path is taken, it is of utmost importance to keep standards and best practice in mind, and to build processes that are flexible and responsive to change.