Tag Archive for: machine learning

Artificial Intelligence is Only as Good as Data Labeling

Data Labeling with SynerScope

Recent events in my home country inspired me to write this blog. Every day we hear stories about businesses and government organizations struggling to sufficiently understand individual files or cases. Knowledge gaps and lack of access to good information hurt individual and organizational well-being. Sometimes, the prosperity of society itself is affected, for example in large-scale financial crime and remediation cases in banking, insurance, and government, or in pandemics.

We simply have little understanding of the data, which means AI and analytics are set up to fail. In addition, it’s difficult to see what data we can or may collect to run the human-computer processes that extract the relevant information needed to solve those issues.

Unlimited Data with No Application

The COVID-19 pandemic shows not only how difficult it is to generate the right data, but also how difficult it is to use existing data. Data-driven decision-making therefore often exposes gaps in our understanding of the data.

Banks spend billions on technology and people in KYC, AML, and customer remediation processes. Yet, they’re still not fully meeting desired regulatory goals.

Governments also show signs of having difficulties with data. For example, recent scandals in the Dutch tax office, such as the Toeslagenaffaire, show how difficult it is to handle tens of thousands of cases in need of remediation. And the Dutch Ministry of Economic Affairs is struggling to determine individual compensation in Groningen, where earthquakes caused by gas extraction have damaged homes.

Today, the world is digitized to an unbelievable extent. As a result, society, from citizens to the press to politicians and the legal system, overestimates the ability of organizations to get the right information from data that is so plentifully available.

After all, those organizations, their data scientists, IT teams, cloud vendors, and scholars have promised a world of well-being and benevolence based on data and AI. Yet, their failure to deliver on those promises is certainly not a sign that conspiracy theories are true. Rather, it shows the limits of AI in a world where organizations understand less than half of their data when it is not in a machine-processable state. After all, if you don’t know what you have, you can’t tell what data you’re missing.

Half of All Data is Dark Data

Gartner coined the term “Dark Data” to refer to that half of all data that we know nothing about. And, if Dark Matter influences so much in our universe, could Dark Data not have a similar impact on our ability to extract information and knowledge from the data?

We have come to believe too much in the dream of AI. What if dark data behaves like dark matter? When people overestimate what is possible with data-driven decision-making, they may come to believe that the powers that be are manipulating the data.

SynerScope’s driving concept is based on our technology to assess Dark Data within organizations. By better understanding our dark data, we can better understand our world and get better results from human and computer intelligence (AI) combined.

Algorithms Rely on Labeled Datasets

Today’s AI, DL (Deep Learning), and ML (Machine Learning) need data to learn, and lots of it. Data bias is a real problem for that process. The better the training data is, the better the model performs. So, the quality and quantity of training data have as much impact on the success of an AI project as the algorithms themselves.

Unfortunately, unstructured data, and even some well-structured data, is not labeled in a way that makes it suitable as a training set for models. For example, sentiment analysis requires slang and sarcasm labels. Chatbots require entity extraction and careful syntactic analysis, not just raw language. An AI designed for autonomous driving requires street images labeled with pedestrians, cyclists, street signs, etc.

Great models require solid data as a strong foundation. But how do we label the data that could help us improve that foundation for chatbots, for self-driving vehicles, and for the mechanisms behind customer remediation, fraud prevention, government support programs, pandemics, and accounting under IFRS?

Regulation and pandemics appear in the same sentence because, from a data perspective, they’re similar. They both represent a sudden or undetected arrival that requires us to extract new information from existing data. Extracting that new information is only manageable for AI if training data has been labeled with that goal in mind.

Let me explain with an easy example from self-driving vehicles. Today, training data is labeled for pedestrians, bicycles, cars, trucks, road signs, prams, etc. What if, tomorrow, we decide that the AI also must adapt to the higher speed of electric bikes? You will need a massive operation to collect and label new data and to retrain the models on it, as the current models would be unlikely to perform well for this new demand.

Companies using software systems with pre-existing metadata models or business glossaries face the same boundaries. These systems work by selecting and applying existing labels without deriving any label from the content. The alternative is to label by hand, which is so labor- and time-intensive that it is often not feasible under the pressure of large-scale scandals and crises.

Automatic Data Labeling and SynerScope

The need to adapt data for sudden crises does not allow for manual labeling. Instead, automatic labeling is a better choice. But, as we know from failures by organizations and by government, AI alone is not accurate enough to take individual content into account.

For SynerScope, content itself should always drive descriptive labeling. Labeling methodology should always evolve with the content. That’s why we use a combination of algorithm automation and human supervision, to bring the best of both worlds together – for fast and efficient data labeling.

If you want to learn more about how our labeling works, feel free to contact us at info@synerscope.com.

Ixivault Helps Labeling and Categorizing Dark Data in the Azure Cloud

Ixivault, a managed app on Microsoft Azure

Your organization’s dark data presents challenges when you move to the cloud. Yet, leaving it in its current location is also not the solution.

Dark data includes digital data which is stored but never mobilized for analysis or to deliver information. If you have dark data, your organization is already missing opportunities to derive value from it. However, if you don’t take dark data with you to the cloud, it drifts even further from your other data assets. Meanwhile, the flexible computation and memory infrastructure of the cloud offers a very cost-effective solution to mobilizing that data. Most importantly, it does so at any scale your organization needs.

However, there are still challenges here, such as overcoming the risks of governance and compliance, increased storage costs, and storage tiering choices. For example, do you choose to store data in close proximity so it can synchronize with other data, but at a higher storage cost?

Migrating Dark Data to the Azure Cloud

For most organizations, failure to create and execute a dark data plan as part of the cloud transition is undesirable at best and a breach of data compliance at worst. SynerScope delivers the tools to analyze and “unlock” that data during the transition, making efficient use of cloud computing, while keeping data in your full control. This means no additional risks arise for compliance, security, etc.

SynerScope also helps you mobilize dark data, using a combination of machine learning, AI, and human expertise. Unlocking dark data is essential for most organizations. That remains true whether you’re shifting from legacy systems to Azure, are reducing your governance footprint, or are pressed into unlocking data for compliance or a regulatory audit. SynerScope’s Ixivault comes into play at any point where you need detailed and broad overviews of complex data. It achieves this by sorting, categorizing, and revealing patterns, and by giving domain experts the tools to label categories at speed and with high accuracy.

Your Data, Your Azure Tenant

Ixivault is a managed app on Microsoft Azure. When you deploy the tool, it installs on top of your Azure Blob or ADLS, where the data stays in your control. We power Ixivault on Azure computing, meaning that it dynamically scales up computing power to meet the size and complexity of the data you direct to it for scanning and computation. At no point does the data leave your Azure tenant or any assigned secured storage used before separating sensitive data out. SynerScope’s design suits the most stringent demands for compliance and governance. Ixivault feels and operates like a SaaS but does so in your tenant, without any proprietary back-end for storing your data assets. Therefore, SynerScope allows you to categorize, sort, and label your dark data without introducing additional regulatory complexities. Your data stays in your cloud, the process is fully transparent, and you control and monitor your tenant for all matters related to data sovereignty.

That applies whether you’re importing data to Azure for the first time to inspect it before deciding where to store it, already have data in a Blob or ADLS that you must inspect, or want to open up data on legacy infrastructure.

Sorting and Categorizing Dark Data

Ixivault leverages AI and machine learning for sorting and text extraction. Here, visual displays offer domain experts rich and discerning context from which to choose the most suitable labels of descriptive metadata. Our technology is a weakly supervised system: first, unsupervised computing handles the data in bulk; then a human operator validates the labels and the bulk-sorted data categories. The system works on raw data inputs directly, without training. Using raw data sets with human validation to add labels means we can make the system smarter over time. Future raw data sets are automatically checked for similarities with previously processed data sets. So, high value can be achieved from day one, but the system also learns over time.
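To make the idea of checking new data against previously validated data sets concrete, here is a minimal sketch. It is not Ixivault's implementation: the function names, the keyword sets, the Jaccard similarity measure, and the 0.5 threshold are all assumptions chosen for illustration. A new stack inherits a suggested label only when its keywords overlap enough with an already validated stack; otherwise it is left for a human to label.

```python
def jaccard(a, b):
    """Overlap between two keyword collections, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_label(new_keywords, validated_stacks, threshold=0.5):
    """Suggest the label of the most similar validated stack, or None.

    validated_stacks maps a human-approved label to the keywords of its stack.
    The threshold is a hypothetical tuning knob, not a documented setting.
    """
    best_label, best_score = None, 0.0
    for label, keywords in validated_stacks.items():
        score = jaccard(new_keywords, keywords)
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= threshold:
        return best_label, best_score
    return None, best_score  # no confident match: route to a human reviewer


# Labels a domain expert validated earlier (made-up example data).
validated = {
    "invoice": {"invoice", "amount", "due", "payment"},
    "contract": {"contract", "clause", "parties", "law"},
}

known = suggest_label({"invoice", "amount", "payment", "vat"}, validated)
unknown = suggest_label({"survey", "response"}, validated)
print(known)
print(unknown)
```

The human stays in the loop in both branches: a suggested label still needs confirmation, and an unmatched stack becomes a candidate for a new label.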

Ixivault abstracts data to hypervectors and compares the similarity between data algorithmically. Using algorithms, the AI can accurately sort data into “Stacks” of similar files. The format, layout, and content of documents are all used by the algorithms to separate common business documents, e.g., contracts, letters, offers, invoices, emails, brochures, claims, and different tables. Our algorithms also separate sub-groups according to the actual content within each of these. Our language extraction presents distinctive groups of words from each “Stack”, allowing humans to select the most appropriate labels. The same extracted words can also be matched to business glossaries and data catalogs already available to your organization. Hypervectors allow our algorithms to detect similarities across documents ‘holistically’, at a scale beyond unaided human capacity. The resulting merge of rich ontologies and semantic knowledge is reusable throughout the organization and the many applications it runs.
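As a rough illustration of how hypervector similarity can sort files into stacks, the sketch below uses a generic textbook-style encoding: each word gets a random high-dimensional vector of +1/-1 values, a document is the element-wise sum of its word vectors, and documents with a high cosine similarity land in the same stack. This is an assumption-laden toy, not SynerScope's algorithm: it looks only at words (not format or layout), and the dimension, threshold, function names, and sample documents are all invented for the example.

```python
import hashlib
import math
import random

DIM = 2048  # hypervector width; an arbitrary choice for this sketch


def token_vector(token):
    """Return a deterministic pseudo-random vector of +1/-1 values for a token."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [1 if rng.random() < 0.5 else -1 for _ in range(DIM)]


def encode(text):
    """Bundle (sum element-wise) the token vectors of a document."""
    total = [0] * DIM
    for token in text.lower().split():
        for i, value in enumerate(token_vector(token)):
            total[i] += value
    return total


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def stack_documents(docs, threshold=0.4):
    """Greedy stacking: a file joins the first stack whose first member it resembles."""
    stacks = []  # list of (vector of first member, [file names])
    for name, text in docs.items():
        vector = encode(text)
        for seed_vector, members in stacks:
            if cosine(seed_vector, vector) >= threshold:
                members.append(name)
                break
        else:
            stacks.append((vector, [name]))
    return [members for _, members in stacks]


# Made-up sample documents.
docs = {
    "invoice_a": "invoice number 1001 total amount due payment terms 30 days",
    "invoice_b": "invoice number 1002 total amount due payment terms 14 days",
    "contract_a": "contract between parties agreement term termination clause governing law",
    "contract_b": "contract agreement between parties termination clause liability governing law",
}
stacks = stack_documents(docs)
print(stacks)
```

Because unrelated random vectors in a high-dimensional space are nearly orthogonal, documents that share few words score close to zero while near-duplicates score close to one, which is what makes a simple threshold workable here.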

Machine Learning with Human Context

Ixivault creates outputs that allow your data experts to step in at maximum velocity and scale. The application displays a dashboard showing the stack of data, visual imaging of what’s in this stack, and keywords or tags pulled from that data and metadata. Where descriptive metadata is lacking or absent, our system presents new candidates for labels. The system supports users in running fast and powerful data discovery cycles, which link search, sorting, natural language processing, and labeling. The output is knowledge about your organization’s dark data which can be used and reused by other users and software systems.

This approach allows data experts to look at files and keywords and very quickly add tags. More importantly, it creates room for human expertise to recognize when data is outside of the norm, e.g., when files relate to a special circumstance. That is something machines simply cannot do reliably. The result is a powerful, fast, and flexible system, usable with a variety of data.

Once you select the machine proposed labels, you only have to individually inspect a small number of the actual files to confirm the labeling for an entire group of sorted files.
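The spot-check workflow described here can be sketched in a few lines. This is a hypothetical illustration rather than Ixivault's actual logic: the function name, the sample size of five, and the rule that every sampled file must pass are assumptions. The `reviewer` callback stands in for the human expert's judgment.

```python
import random


def confirm_stack_label(files, proposed_label, reviewer, sample_size=5, seed=0):
    """Spot-check a random sample and label the whole stack only if every check passes.

    reviewer(file_name, label) returns True when the human agrees with the label.
    """
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    sample = rng.sample(files, min(sample_size, len(files)))
    approved = all(reviewer(name, proposed_label) for name in sample)
    labels = {name: proposed_label for name in files} if approved else {}
    return approved, sample, labels


# A stack of 200 files the sorter grouped together (made-up names).
stack = [f"invoice_{i:03d}.pdf" for i in range(200)]

# Stand-in for a human reviewer; here every file really is an invoice.
approved, sample, labels = confirm_stack_label(
    stack, "invoice", reviewer=lambda name, label: name.startswith(label)
)
print(approved, len(sample), len(labels))
```

In this example, five inspections label 200 files. A rejected sample returns no labels, which would send the stack back for re-sorting or manual review.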

Unlocking Dark Data as You Move to the Cloud

Moving to Azure forces most organizations to do something with, or certainly think about, their dark data. You can’t move untold amounts of data to the cloud without knowing what’s in it. You would not be able to extract enough additional value from such a blind move. Directing data to the right storage solutions for easy governance, compliance, and management demands knowledge of its content. With that knowledge you can, for example, prioritize data for further processing and computation, or save on storage for less valuable content. Data intelligence can mostly be paid for by decreasing ‘dark storage’. Meanwhile, your organization can improve its governance footprint and ensure compliance.

SynerScope can deliver the potential value in dark data by increasing knowledge and helping with retention, access management, discovery, data cleansing efforts, data privacy protection measures, and compliance. Most importantly, dark data mining gives organizations the information needed to make business as well as IT and compliance decisions with that data, because data sits at the intersection of all three.

To learn more about SynerScope’s software and our approach, contact us to schedule a demo and see the software in action.

Delving into Dark Data on Azure – Data Governance in the Cloud

For most organizations, dark data is a vague concept: the knowledge that, somewhere, you have vast amounts of stored data, and you have no real idea what it is. Gartner coined the term to refer to data which organizations collect but fail to use or monetize, and eventually lose track of.

That data, which is stored in network file shares, collaboration tools (e.g., SharePoint), online storage services like Drive and Dropbox, old PCs, and backups, is dark because most people in the organization have no idea what’s in it. In fact, often that data is stored in legacy systems or placed on drives by people who have since left the organization. But, as organizations move to the cloud and must choose whether to leave data where it is or move it to an Azure Blob, it becomes more of an issue – not just for the potential of business value but for regulatory compliance.

Dark Data can include Private Data

Dark data offers no promises in terms of delivering business value. Yet, organizations cannot ignore it. Often, dark data contains everything from personally identifiable information to HR data, legal contracts, security and access information, and other confidential or proprietary information. This presents real liabilities in information governance, especially in industries such as finance and the public sector. And, for global companies, it becomes increasingly crucial that data analytics and governance be addressed simultaneously to meet data privacy laws across the EU and USA.

Knowing your enterprise data and being able to search for it would be the ideal. However, the absence of labels, categories, and metadata in general makes it hard to choose what to send to AI for analysis and discovery, who receives access to what data, and what data to keep (and where to keep it). Most businesses have dark data specifically because it takes too much manual effort to sort and label. But dark data presents unknown potential and risks. Without understanding its contents, no organization can make the best decisions about what to do with it.

A Significant Governance Footprint

Both structured and unstructured data can be part of dark data. More unstructured than structured data resides in the dark.

Why? Unstructured data makes computerized processing more difficult, and much of this data requires significant manual processing. Azure cloud compute and storage use elasticity and scale to offer options for processing all data in a resource-efficient and cost-efficient way. This option is obviously not readily available in on-premises data centers. With SynerScope positioned on top of the customer’s Azure object store (Blob or ADLS), enterprises can quickly and economically see what content they have. More importantly, they can use this information to take action.

For example, the underlying contracts and correspondences for 10-year-old invoices cannot be handled without proper governance. In the Azure cloud, you can generate that data. Yet, if there are multiple back-ends from different SaaS suppliers, moving dark data to the cloud is impaired from a governance and risk perspective. That’s why SynerScope’s SaaS-like application uses the storage on the customer’s Azure tenant. As a result, all data protection and security are governed by the single contract between the customer and Microsoft Azure. This simplicity allows the enterprise to confidently move data to the cloud, knowing that responsibilities and liabilities are clearly defined.

Categorizing Dark Data in the Azure Cloud

At SynerScope we deliver the tools to unlock dark data using machine learning for sorting by content, whilst your domain experts add context. Our AI sorts data visually, “stacking” content based on visual similarity and highlighting keywords and descriptors pulled from the stack. Your domain expert can use that to add context to the stack, quickly identifying whether something is an invoice, a mortgage receipt, a single customer’s banking data, etc.

The software installs into your Azure tenant, leaving data in a system structure, only governed by your Azure contract. SynerScope runs similarly to an Azure module; we bring data to cache memory, it is computed, and newly generated metadata augments the original data. These data artefacts are moved into the storage, which you, as a client, set up and manage. We provide the support for you to:

  • Find relevant structured and unstructured data, open it for control, data governance, and maintainability for GDPR compliance
  • Find and structure data for governance to meet compliance requirements in finance, public sector, etc.
  • Improve triage for files to be inspected in KYC, CDD, PDD, and AML investigations

Most importantly, this applies both to stored dark data and to the massive quantities of data churned out by CMS, self-service, surveys, and specifics like KYC programs and security. SynerScope delivers tooling that makes the move to the cloud possible with dark data analysis, so that the organization implements proper governance on all data as it moves to the cloud while creating structure and insight into new data.

Granular Insight into Big Data

SynerScope gives massive insight into not just dark data, but any data. By mapping data visually and relying on data experts to create connections, we speed up data analysis across nearly any type of data.

In a specific example, KYC is incredibly important for banks and other financial organizations. Automatic alert systems can have a false positive rate of 5% or more, and each alert requires manual review. If each manual file review takes 4+ hours, a 5% false positive rate is a massive burden on the company. But SynerScope’s machine learning, which uses AI to categorize and sort data, speeds up this manual review by as much as 20x.

As data continues to accumulate in the cloud, SynerScope’s role in making day-to-day compliance and governance decisions will grow. That applies to retrieving data, deciding where to store it, and deciding whether to keep that data in the first place.
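To put those figures in perspective, here is a small back-of-the-envelope calculation. The 5% false positive rate, the 4 hours per review, and the 20x speed-up come from the paragraph above; the volume of 100,000 screened cases is purely an assumption for illustration, as is treating the speed-up as a uniform reduction in review time.

```python
def review_hours(cases, false_positive_rate, hours_per_review, speedup=1.0):
    """Hours spent manually reviewing false-positive alerts."""
    false_alerts = cases * false_positive_rate
    return false_alerts * hours_per_review / speedup


CASES = 100_000  # hypothetical number of screened cases

baseline = review_hours(CASES, 0.05, 4)
accelerated = review_hours(CASES, 0.05, 4, speedup=20)

print(f"Without assistance: {baseline:,.0f} hours")
print(f"With a 20x speed-up: {accelerated:,.0f} hours")
```

Under these assumptions, 5,000 false alerts cost about 20,000 review hours, which a 20x speed-up would reduce to about 1,000.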

If you would like to see how it works, contact us for a demo or pilot.