
Artificial Intelligence is Only as Good as Data Labeling

Data Labeling with SynerScope

Recent events in my home country inspired me to write this blog. Every day we hear stories about businesses and government organizations struggling to sufficiently understand individual files or cases. Knowledge gaps and lack of access to good information hurt individual and organizational well-being. Sometimes the prosperity of society itself is affected, as with large-scale financial crime and remediation cases in banking, insurance, and government, or with pandemics.

We simply understand too little of our data, which means AI and analytics are set up to fail. It is also difficult to see what data we can, or may, collect to run the human-computer processes that extract the relevant information needed to solve those issues.

Unlimited Data with No Application

The COVID-19 pandemic shows not only how difficult it is to generate the right data, but also how difficult it is to use existing data. As a result, data-driven decision-making often exposes gaps in our understanding of data.

Banks spend billions on technology and people for KYC (Know Your Customer), AML (Anti-Money Laundering), and customer remediation processes. Yet they are still not fully meeting the desired regulatory goals.

Governments also show signs of having difficulties with data. For example, recent scandals in the Dutch tax office, such as the Toeslagenaffaire, show how difficult it is to handle tens of thousands of cases in need of remediation. And the Dutch Ministry of Economic Affairs is struggling to determine individual compensation in Groningen, where earthquakes caused by gas extraction have damaged homes.

Today, the world is digitized to an unbelievable extent. So society, from citizens to the press to politicians and the legal system, overestimates the capability of organizations to get the right information from data that is so plentifully available.

After all, those organizations, their data scientists, IT teams, cloud vendors, and scholars have promised a world of well-being and benevolence based on data and AI. Their failure to deliver on those promises is certainly not a sign that conspiracy theories are true. Rather, it shows the limits of AI in a world where organizations understand less than half of the data they hold, because that data is not in a machine-processable state. After all, if you don't know what you have, you can't tell what data you're missing.

Half of All Data is Dark Data

Gartner coined the term "dark data" to refer to the half of all data that we know nothing about. And if dark matter influences so much in our universe, could dark data not have a similar impact on our ability to extract information and knowledge from data?

We have come to believe in the dream of AI too much; what if dark data behaves like dark matter? When people overestimate what is possible with data-driven decision-making, they may come to believe that the powers that be are manipulating the data.

SynerScope's driving concept is based on our technology to assess dark data within organizations. By better understanding our dark data, we can better understand our world and get better results from human and computer intelligence (AI) combined.

Algorithms Rely on Labeled Datasets

Today's AI, DL (Deep Learning), and ML (Machine Learning) need data to learn, and lots of it. Data bias is a real problem for that process. The better the training data, the better the model performs. So the quality and quantity of training data have as much impact on the success of an AI project as the algorithms themselves.

Unfortunately, unstructured data, and even some well-structured data, is not labeled in a way that makes it suitable as a training set for models. For example, sentiment analysis requires labels for slang and sarcasm. Chatbots require entity extraction and careful syntactic analysis, not just raw language. An AI designed for autonomous driving requires street images labeled with pedestrians, cyclists, street signs, and so on. A minimal, hypothetical sketch of such task-specific labels is shown below.
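
The field names and examples here are purely illustrative, not drawn from any real dataset or from SynerScope's tooling; they simply show how the required label depends on the task the model must learn.

```python
# Hypothetical labeled training examples: the same kind of raw content
# needs different labels depending on the task.

# Sentiment analysis: free text annotated with sentiment, including sarcasm.
sentiment_examples = [
    {"text": "Great, my flight is delayed again. Just perfect.",
     "label": "negative", "sarcasm": True},
    {"text": "The support team fixed my issue within an hour.",
     "label": "positive", "sarcasm": False},
]

# Chatbot training: an utterance labeled with an intent and its entities.
chatbot_example = {
    "text": "Book a table for two in Amsterdam on Friday",
    "intent": "book_restaurant",
    "entities": [
        {"span": "two", "type": "party_size"},
        {"span": "Amsterdam", "type": "city"},
        {"span": "Friday", "type": "date"},
    ],
}

# Autonomous driving: an image labeled with bounding boxes per object class.
street_image_example = {
    "image": "frame_000123.png",
    "boxes": [
        {"bbox": [412, 220, 470, 360], "class": "pedestrian"},
        {"bbox": [610, 240, 700, 330], "class": "cyclist"},
        {"bbox": [120, 80, 160, 140], "class": "street_sign"},
    ],
}
```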

Great models require solid data as a strong foundation. But how do we label the data that could help us improve that foundation: for chatbots, for self-driving vehicles, and for the mechanisms behind customer remediation, fraud prevention, government support programs, pandemics, and accounting under IFRS?

Regulation and pandemics appear in the same sentence because, from a data perspective, they are similar. Both arrive suddenly or go undetected, and both require us to extract new information from existing data. Extracting that new information is only manageable for AI if the training data has been labeled with that goal in mind.

Let me explain with a simple example from self-driving vehicles. Today, training data is labeled for pedestrians, bicycles, cars, trucks, road signs, prams, and so on. What if, tomorrow, we decide that the AI must also adapt to the higher speed of electric bikes? You will need a massive operation to collect new data and retrain on it, as the current models are unlikely to perform well on this new demand.

Companies using software systems with pre-existing metadata models or business glossaries face the same limitations. Those systems work by selecting and applying labels without deriving any label from the content; otherwise, they must label by hand, which is so labor and time intensive that it rarely fits the pressure of large-scale scandals and crises.

Automatic Data Labeling and SynerScope

The need to adapt data for sudden crises does not allow for manual labeling; automatic labeling is a better choice. But, as we know from failures by organizations and governments, AI alone is not accurate enough to take individual content into account.

For SynerScope, content itself should always drive descriptive labeling, and labeling methodology should always evolve with the content. That's why we use a combination of algorithmic automation and human supervision, bringing the best of both worlds together for fast and efficient data labeling. A minimal sketch of such a propose-and-verify loop follows.
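
Purely as an illustration (not SynerScope's implementation): an automatic classifier can suggest a label with a confidence score, and a human reviewer confirms or corrects the suggestion only when confidence is low. The propose_label stub below stands in for whatever model does the automatic part.

```python
# Illustrative propose-and-verify labeling loop, not SynerScope's implementation.
# The machine proposes; a person verifies or corrects low-confidence proposals,
# so humans review labels instead of writing every one by hand.

def propose_label(document: str) -> tuple[str, float]:
    """Stub for an automatic classifier returning (label, confidence)."""
    if "amount due" in document.lower():
        return "invoice", 0.92
    return "uncategorized", 0.40

def label_with_review(documents, confidence_threshold=0.8):
    labeled = []
    for doc in documents:
        label, confidence = propose_label(doc)
        if confidence < confidence_threshold:
            # Low confidence: ask a human to accept the proposal or type a correction.
            answer = input(f"Proposed '{label}' for {doc[:50]!r}. Enter to accept, or type a label: ")
            label = answer.strip() or label
        labeled.append((doc, label))
    return labeled
```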

If you want to learn more about how our labelling works, feel free to contact us at info@synerscope.com

Using Dynamic Data Labelling to Drive Business Value

Dynamic Data Labelling with Ixivault

Before deriving any value from data, you need to find and retrieve the relevant data. Search allows you to achieve that goal. However, for search to work, we need two things: a human must define a search term, and the data must be indexed so the computer can find it cheaply and quickly enough to keep the user engaged. In terms of both cost and response time, search efficiency breaks down under the sheer scale of all available data and the presence of dark data, which carries no indexes or labels at all. The toy example below shows why indexing matters.
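
This toy inverted index is illustrative only: once documents are indexed by term, answering a query is a cheap lookup instead of a scan over every file, and anything that never made it into the index, such as unlabeled dark data, simply cannot be returned.

```python
from collections import defaultdict

# Toy inverted index: map each term to the set of documents containing it.
documents = {
    "doc1": "customer remediation case for account 4711",
    "doc2": "quarterly risk report with premium rates",
    "doc3": "customer due diligence checklist",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(term: str) -> set[str]:
    # A dictionary lookup, not a scan over every document.
    return index.get(term.lower(), set())

print(search("customer"))  # e.g. {'doc1', 'doc3'}
```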

Technologies like enterprise search never took off for this exact reason. Without labels, it is ineffective to ask a system to select results from the data. At the moment of creation, the creator knows exactly what a file contains, but as time passes our memories fail, and other people may be tasked with finding and retrieving the data long after we have moved on. Searching data in enterprise applications often means painstakingly looking up each recorded subject or object; for end-user applications like MS Office, we lack even that possibility. The people who come after, and the programs we create to manage the data, cannot perform the same mental hat trick of pulling meaning from unsorted data. Without good labels, search and retrieval are nearly impossible.

At SynerScope we offer a solution to easily recover data that was either lost over time or vaguely defined from the start. We first lift such 'unknown' data into an automated, AI-based sorting machine. Once sorted, we involve a human data specialist, who can then work with sub-groups of data rather than individual files. Again unsupervised, our solution presents the user with the discerning words that distinguish each sub-group from the others. In essence, the AI presents the prime label options for the files and content in each sub-group, whatever its size in files, pages, or paragraphs. The human reviewer only has to select and verify a label option, rather than taking on the heavy lifting of generating labels. A generic sketch of this cluster-then-suggest approach is shown below.
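
One generic way to approximate that sorting step, sketched here with scikit-learn rather than as a description of Ixivault's internals, is to cluster documents by content and surface each cluster's highest-weighted terms as candidate labels for a human reviewer to confirm.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def candidate_labels(documents, n_clusters=5, terms_per_cluster=5):
    """Cluster documents by content and return the most distinctive terms per
    cluster as label suggestions for a human reviewer (generic sketch)."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(documents)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(matrix)
    terms = vectorizer.get_feature_names_out()

    suggestions = {}
    for cluster in range(n_clusters):
        # The highest-weighted terms in the cluster centroid act as the
        # "discerning words" that set this sub-group apart from the others.
        top = np.argsort(model.cluster_centers_[cluster])[::-1][:terms_per_cluster]
        suggestions[cluster] = [terms[i] for i in top]
    return model.labels_, suggestions
```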

Thus labeled, the data is ready for established enterprise data processes. Cataloging, access management, analysis, AI, machine learning, and remediation are common end goals for data after SynerScope Ixivault generates metadata and labels.

SynerScope also allows for ongoing, dynamic relabeling of data as new needs appear. That’s important in this age of fast digital growth, with a constant barrage of new questions and digital needs. Ixivault’s analysis and information extraction capabilities can evolve and adapt to future requirements with ease, speed, and accuracy.

How Does Unlabeled Data Come About?

Data is constantly created and collected. When employees capture or create data, they add to files and logs. Humans are also very good at mentally categorizing data: we can navigate with ease through our most recent data, unsorted and all. Whether that means a stack of papers or nested folders, our associative brain remembers the general idea of what is in each pile, as long as that pile doesn't move. But we are very limited in the scale we can handle. We have mental pictures of scholars and professors working in rooms where papers are piled to the ceiling and little cleaning was ever allowed. That paradigm doesn't hold for digital data in enterprises: collaboration, analysis, AI needs, and regulations all put too much pressure on knowing where data is.

Catalogs and classification solutions can help, but the level of automation for populating them is too low. That leads to gaps and backlogs in labeling data. The AI for fully automatic labeling isn't there yet; cataloging and classifying business documentation is even harder than classifying digital images and video footage.

Digital Twinning and Delivering Value with Data

Before broadband, there was no such thing as a digital twin for people, man-made objects, or natural objects. Only the necessary information was stored in application-based data silos. In 2007, the arrival of the iPhone and the mobile revolution it triggered changed that. Everyone and everything was online, all the time, constantly generating data. The digital twin, a collection of data representing a real person or a natural or man-made object, was born.

In most organizations, these digital twins remain largely in the dark. Organizations collect vast quantities of data on clients, customer cases, accounts, and projects, but it stays in the dark because it is compiled, stored, and used in silos. When the people who created the data retire or move to another company, its meaning and content fade quickly, because no one else knows what is there or why. And without proper labels, your systems will have a hard time handling any of it.

GDPR, HIPAA, CCPA, and similar regulations force organizations to understand what data they hold about real people, and they demand the same for historic data stored from the days before those regulations existed.

Regulations evolve, technologies evolve, markets evolve, and your business evolves, all driving highly dynamic changes in what you need to know from your data. If you want to keep up, ensuring that you can use that data to drive business value while avoiding undue risk from business, data privacy, and security regulations, you must be able to search your data. Failing that, you could get caught in a chaotic remediation procedure, where unsorted data does not reduce the turmoil but adds to the chaos.

Dynamic Data Labelling with Ixivault

Ixivault helps you match data to new realities in a flexible, efficient way, with a dynamic, weakly supervised system for data labeling. The application installs in your own secure Microsoft Azure client tenant, using the very data stores you set up and control, so all data always remains securely under your governance. Our solution and its data-sorting power help your entire workforce, from line of business to IT, to categorize, classify, and label data by content, essentially lifting it out of the dark.

Your data is then accessible for all your digital processes. Ixivault shows situations and objects grouped by similarity of documentation and image recordings and lets you compare groups for differences in content. This simplifies and speeds up the task of assigning labels to the data. Any activity that requires comparison between cases, objects, situations, or data, or a check against set standards, is made simple. Ixivault also improves the quality of data selection, which helps in applications ranging from Know Your Customer and Customer Due Diligence to analytics and AI-based predictions using historical data.

For example, insurance companies can use that data to find comparable cases, match them to risks and premium rates, and thereby identify outliers, allowing the company to act on pricing, underwriting, binding, or all three. A simple illustration of that idea follows.
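
As a hypothetical illustration (the feature inputs, threshold, and function name are invented for this example), one way to spot such outliers is to compare each policy's premium against the median premium of its most similar cases:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def flag_premium_outliers(case_features, premiums, n_neighbors=5, tolerance=0.3):
    """Flag cases whose premium deviates strongly from the median premium of
    their most similar cases (hypothetical sketch; inputs are NumPy arrays)."""
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(case_features)
    _, neighbor_idx = nn.kneighbors(case_features)

    outliers = []
    for i, neighbors in enumerate(neighbor_idx):
        peers = [j for j in neighbors if j != i][:n_neighbors]  # drop the case itself
        peer_median = np.median(premiums[peers])
        if abs(premiums[i] - peer_median) > tolerance * peer_median:
            outliers.append(i)
    return outliers
```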

SynerScope's dynamic labelling creates opportunities to match any data, fast and flexibly. As perceptions and the cultural applications of data change over time, you can match data to evolving needs for information extraction, change labels as data contexts change, and continue driving value from the data at your disposal.

If you want to know more about Ixivault or its dynamic matching capabilities in your organization, contact us for personalized information.