Artificial Intelligence is Only as Good as Data Labeling

Data Labeling with SynerScope

Recent events in my home country inspired me to write this blog. Every day we hear stories about businesses and government organizations struggling to sufficiently understand individual files or cases. Knowledge gaps and lack of access to good information hurt individual and organizational well-being. Sometimes, the prosperity of society itself is affected, as with large-scale financial crime and remediation cases in banking, insurance, and government, or with pandemics.

We simply have too little understanding of our data, which means AI and analytics are set up to fail. In addition, it’s difficult to see what data we can or may collect to run the human-computer processes that extract the relevant information needed to solve those issues.

Unlimited Data with No Application

The COVID-19 pandemic shows not only how difficult it is to generate the right data, but also how difficult it is to use existing data. As a result, data-driven decision-making often reveals gaps in our understanding of data.

Banks spend billions on technology and people in KYC, AML, and customer remediation processes. Yet, they’re still not fully meeting desired regulatory goals.

Governments also show signs of having difficulties with data. For example, recent scandals in the Dutch tax office, such as the Toeslagenaffaire, show how difficult it is to handle tens of thousands of cases in need of remediation. And the Dutch Ministry of Economic Affairs is struggling to determine individual compensation in Groningen, where earthquakes caused by gas extraction have damaged homes.

Today, the world is digitized to an unbelievable extent. So society, from citizens to the press to politicians and the legal system, overestimates the capability of organizations to get the right information from data that is so abundantly available.

After all, those organizations, their data scientists, IT teams, cloud vendors, and scholars have promised a world of well-being and benevolence based on data and AI. Yet their failure to deliver on those promises is certainly not a sign that conspiracy theories are true. Rather, it shows the limits of AI in a world where organizations understand less than half of the data they have, because that data is not in a state ready for machine processing. After all, if you don’t know what you have, you can’t tell what data you’re missing.

Half of All Data is Dark Data

Gartner coined the term “Dark Data” to refer to the half of all data that we know nothing about. And if Dark Matter influences so much in our universe, could Dark Data not have a similar impact on our ability to extract information and knowledge from data?

We have come to believe too much in the dream of AI. But what if dark data behaves like dark matter? When the possibilities of data-driven decision-making are overestimated, people may come to believe that the powers that be are manipulating the data.

SynerScope’s driving concept is built on our technology for assessing Dark Data within organizations. By better understanding our dark data, we can better understand our world and get better results from human and computer intelligence (AI) combined.

Algorithms Rely on Labeled Datasets

Today’s AI, DL (deep learning), and ML (machine learning) need data to learn – and lots of it. Data bias is a real problem for that process. The better the training data, the better the model performs. So the quality and quantity of training data have as much impact on the success of an AI project as the algorithms themselves.

Unfortunately, unstructured data, and even some well-structured data, is not labeled in a way that makes it suitable as a training set for models. For example, sentiment analysis requires labels for slang and sarcasm. Chatbots require entity extraction and careful syntactic analysis, not just raw language. An AI designed for autonomous driving requires street images labeled with pedestrians, cyclists, street signs, and so on.
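To make this concrete, here is a minimal sketch of what such labeled training records might look like. The field names and values are purely illustrative assumptions, not any specific tool’s or dataset’s schema:

```python
# Illustrative only: hypothetical label schemas for two of the use cases above.
# All field names and values are assumptions for the sake of the example.

sentiment_example = {
    "text": "Oh great, another outage. Just what I needed.",
    "labels": {
        "sentiment": "negative",
        "sarcasm": True,   # sarcasm must be labeled explicitly
        "slang": [],       # slang terms would be listed here
    },
}

street_scene_example = {
    "image": "frame_000123.jpg",
    "labels": [
        # each object: a class plus a bounding box (x, y, width, height in pixels)
        {"class": "pedestrian",  "bbox": [412, 180, 60, 140]},
        {"class": "cyclist",     "bbox": [700, 200, 90, 150]},
        {"class": "street_sign", "bbox": [150, 40, 50, 50]},
    ],
}
```

The point is that the raw text or image alone is not enough: every record carries extra annotations that encode exactly the distinctions the model is expected to learn.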

Great models require solid data as a strong foundation. But how do we label the data that could help us improve that foundation: for chatbots, for self-driving vehicles, and for the mechanisms behind customer remediation, fraud prevention, government support programs, pandemic response, and accounting under IFRS?

Regulation and pandemics appear in the same sentence because, from a data perspective, they are similar. Both represent a sudden or undetected arrival that requires us to extract new information from existing data. Extracting that new information is only manageable for AI if the training data has been labeled with that goal in mind.

Let me explain with a simple example from self-driving vehicles. Today, training data is labeled for pedestrians, bicycles, cars, trucks, road signs, prams, and so on. What if, tomorrow, we decide that the AI must also adapt to the higher speed of electric bikes? You will need a massive operation to collect new data and retrain models on it, as the current models would be unlikely to perform well on this new demand.
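A tiny sketch shows why a new class is so disruptive. If labels come from a fixed class list, appending a class is trivial, but no existing record carries it, so the model has nothing to learn from until data is re-labeled. The class names below are illustrative:

```python
# Illustrative sketch: appending a class to a fixed label set.
old_classes = ["pedestrian", "bicycle", "car", "truck", "road_sign", "pram"]
new_classes = old_classes + ["e_bike"]  # the new demand: faster e-bikes

# Existing data was labeled against old_classes only...
existing_labels = ["car", "bicycle", "pedestrian", "bicycle"]

# ...so no record carries the new class yet.
e_bike_examples = [lbl for lbl in existing_labels if lbl == "e_bike"]
print(len(e_bike_examples))  # 0 -> new data must be collected and labeled

# Index encodings stay stable because the class was appended, but every
# e-bike in the old data is hidden inside "bicycle" and must be re-labeled
# before a model can tell the two apart.
old_index = {c: i for i, c in enumerate(old_classes)}
new_index = {c: i for i, c in enumerate(new_classes)}
```

In other words, the schema change is cheap; the re-labeling of the data behind it is the massive operation.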

Companies using software systems with pre-existing metadata models or business glossaries face the same boundaries. Those systems work by selecting and applying labels without deriving any label from the content itself. The alternative is labeling by hand, which is labor- and time-intensive – often too much so under the pressure of large-scale scandals and crises.

Automatic Data Labeling and SynerScope

The need to adapt data for sudden crises does not allow for manual labeling. Instead, automatic labeling is the better choice. But, as we know from failures by organizations and governments, AI alone is not accurate enough to take individual content into account.

For SynerScope, content itself should always drive descriptive labeling, and labeling methodology should always evolve with the content. That’s why we use a combination of algorithmic automation and human supervision, bringing the best of both worlds together for fast and efficient data labeling.
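One common pattern for combining automated labeling with human supervision – sketched here generically, not as SynerScope’s actual implementation – is to auto-accept high-confidence machine labels and route low-confidence items to a human reviewer. The threshold, file names, and scores below are invented for illustration:

```python
# Generic human-in-the-loop labeling sketch; all values are invented.
CONFIDENCE_THRESHOLD = 0.9

def model_label(item):
    """Stand-in for an automatic labeler returning (label, confidence)."""
    # In reality this would be a trained model; here it is a toy lookup.
    scores = {
        "invoice_2021.pdf": ("invoice", 0.97),
        "letter_scan.png": ("correspondence", 0.55),
    }
    return scores.get(item, ("unknown", 0.0))

def label_batch(items, human_review):
    auto, queued = {}, []
    for item in items:
        label, confidence = model_label(item)
        if confidence >= CONFIDENCE_THRESHOLD:
            auto[item] = label      # accept the machine label automatically
        else:
            queued.append(item)     # route to a human reviewer
    for item in queued:
        auto[item] = human_review(item)  # the human decision wins
    return auto

labels = label_batch(
    ["invoice_2021.pdf", "letter_scan.png"],
    human_review=lambda item: "correspondence",  # simulated reviewer
)
# "invoice_2021.pdf" is auto-labeled; "letter_scan.png" goes to the human.
```

The design choice is that automation handles the bulk of clear-cut content, while the scarce human attention is spent only where the machine is unsure.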

If you want to learn more about how our labeling works, feel free to contact us.

Handling Redress and Remediation


No organization wants to move into a redress and remediation process. But once you do, time is of the essence. A redress investigation may be launched suddenly, or it may involve slower planning. In either case, you suddenly have very different needs for organizational data than in business-as-usual processes. In some cases, you may even need access to data that normally sits in the dark or on low-priority servers, which completely changes how your organization must access that data.

Redress and Remediation Processes are High Priority

If you’re facing the need to redress, remediate, or provide compensation, you likely have pressing reasons to do so. For example, your organization may be facing dwindling customer satisfaction, supplier de-listing, legal action, regulatory action, or damage to your organization’s reputation.

Redress and remediation processes bring individual case and file details to the forefront. Resolving those details is of high importance. However, without an immediate overview, or a way to create high-quality comparisons of individual or group cases quickly and efficiently, little can be done. You must first review cases manually to see which require redress, and deciding what redress, remediation, or compensation should apply remains difficult. Without those overviews, you could be providing too much or too little compensation.

Resolving this means making data a central part of the process. You must implement processes to direct redress and remediation actions. You also have to keep regulatory stakeholders informed enough so they do not escalate or start proceedings against you.

You Have to Act Fast, but Systems Aren’t Designed for Redress Processes

The default response to a redress and remediation process is to put people to work. Unfortunately, many of those people are called in ad hoc, without the information and data they need to act.

Getting started means creating in-depth overviews of each case, with enough context from similar cases to guide decisions. Putting that into a control framework allows people to get started, while avoiding the risk of overcompensating individual cases or approving fraudulent claims.

Yet shifting data management from everyday operations to a full investigation of minute data is not something IT systems and support are normally designed for. Instead, you must combine data in new ways to resolve individual cases quickly and fairly. That’s especially true when your cases demand bulk data access and processing, as remediation cases do. Remediation never starts out as a trickle of cases; you always need to address all of them, all at once. How well you can handle that bulk data determines how much damage you can mitigate, how much work and rework is necessary, and how quickly you can finalize the project to the satisfaction of customers, internal and external stakeholders, and regulatory or legal stakeholders.

External Organizations Can’t Work Without Data

Large organizations often rely on third parties, whether specialized service providers, lawyers, consultants, or subject matter experts, to help manage these processes. Often, these include data and IT services as well. However, those consultants still need access to data, which your own IT systems must supply. Further, when you bring in consultants for IT design and implementations, their aim is to provide and build efficient solutions and applications – usually with the goal of running and supporting daily processes inside the organization.

Redress Demands Scaling Up Data-Handling Capabilities

Redress and remediation situations demand support in a very different way: you have to greatly enhance your capacity to manage data. Think of an airplane during an emergency landing. People don’t exit the plane in an orderly fashion using the stairs. Instead, they use emergency slides, which greatly increase the capacity to empty the plane quickly while getting people safely to the ground.

You cannot afford to lose time preparing data or building up IT solutions for support during redress and remediation. Ad-hoc tools built on query writing and spreadsheets often don’t help either. Instead, they can add to the confusion and make problems bigger, allowing individual cases to slip through the cracks.

If you need an immediate solution for remediation and redress processes, SynerScope is here to help. Our tooling installs quickly into your Azure tenant, with data kept under your governance, so you can quickly sort, label, and review cases with the microscopic level of detail needed to ensure proper handling. And, with no changes in governance, you can implement the solution quickly and get your redress and remediation program running.

Customer Case: Stedin: MDM remediation