Data Discovery for Securing Sensitive Data on the Cloud


As organizations expand their data footprint on the cloud, many executives responsible for data security and risk are encountering a challenge they have not had to navigate before: discovering sensitive data on the cloud.

In fact, the security of sensitive information is what most executives worry about when it comes to hosting data on the cloud. One survey found that “71% of organizations report that the majority of their cloud-resident data is sensitive.”

Therefore, many CIOs and CTOs are looking for answers on how to discover and protect sensitive data on the cloud and what the traits of a good data discovery tool are. This article addresses both questions.

Why discover sensitive data on the cloud?

In reality, sensitive and personal information that is regulated for compliance often ends up in the wrong place on the cloud. For instance, as the payment card industry evolves, payment data, and credit and debit card data in particular, is found outside the formally defined boundaries of the Cardholder Data Environment (CDE). Unfortunately, such cases often come to light only during breach investigations or, at best, during a PCI DSS assessment by a Qualified Security Assessor (QSA). Card data discovery tools are designed to address this problem.

In addition, employees working remotely collaborate freely across multi-cloud environments without the visibility or protection of traditional on-premises security. Sensitive data today may reside across multiple apps, databases and personal devices, and without proper visibility, enterprises run the risk of compliance failures, security vulnerabilities and data breaches. The Ghimob malware, for example, can spy on 153 Android mobile applications, so it is important that sensitive data is located and secured before any mishap takes place.

Almost overnight, organizations roll out new lines of business, resulting in major data changes on their cloud infrastructure. At this pace of digital agility and scalability, sensitive data moves in almost any direction across the enterprise architecture. The following are the key reasons data discovery on the cloud has become a necessity for businesses.

  • Availability of space: Whether the current requirement is a 10 GB or a 100 GB cloud bucket, additional space is always available on demand. Because storage can be expanded as soon as a limit is reached, there is no pressure to delete old, unused data, and that data ends up dispersed beyond the business’s control.
  • Huge volumes of data: Whether it is user transactions, emails sent to a new lead list, or new modules and features tested with “dummy” data, a large amount of data can flow into an enterprise’s cloud buckets. Some of it is sensitive, some less so, and some not at all. Tracking, securing, and purging data becomes more difficult at this scale.
  • Multiple sources and requirements: Huge amounts of data are generated by software, people, and organizations through web logs, event collection, social media, ERP systems, databases, etc. Such data is often used by more than one entity, which makes it difficult for any one of them to delete or change it without affecting the others. Instead, each entity creates its own copy of the data for processing and stores it in the bucket, where the old data is forgotten.

If you find yourself thinking about the best ways to ensure cloud security, you might consider watching our webinar 5 Power Tips to Ensure Cloud Security in 2021.

How to discover sensitive data on the cloud?

  • Gather and classify the data: Not all data hosted in the cloud is equally sensitive, so all of it needs to be gathered and classified. High-risk sensitive data, for example, is any information that, when lost or leaked, can lead to legal liabilities or damage to an organization’s reputation. The importance and sensitivity of data depend on factors ranging from user access levels to the integrations between cloud apps. Define policies to classify and label the data (“confidential”, “important”, “sensitive”, “private”, etc.).
  • Analyze the data: Once all the data is in a manageable environment, it’s time to analyze it. At this stage, it’s important to separate data that is sensitive (e.g., cardholder data, health insurance information, account numbers) from data that is necessary yet not sensitive (e.g., order history). It is also critical to determine what data needs to be retained (for SOX compliance, other regulations, or business purposes) and what data can be discarded.
  • Remediate or purge: Once analyzed, all unnecessary data should be purged from the cloud platform. A policy should be set so that data is purged as soon as it is identified as unnecessary. When executing data discovery on the cloud, it is also vital to keep a record of the data that is remediated.
  • Set up DLP policies: While out-of-the-box DLP templates cover a wide variety of standards to identify PII, PHI, or financial data, information security specialists should also create custom templates to take advantage of regular expressions and keywords. Look for a solution that supports Optical Character Recognition (OCR) to scan images for sensitive data violations, such as credit card numbers, social security numbers, and other personally identifiable information.
  • Assess security posture: Assess and report on historical file scans, along with the number of outstanding non-compliance findings. The admin can then remediate them in line with the organization’s data security policies.
  • Schedule the data discovery scans: Cloud data discovery scans should be performed periodically, as either full or incremental scans, to determine data compliance with GDPR, CCPA, HIPAA, and other regulations. Scheduling data discovery on the cloud helps surface out-of-compliance data so that it can be triaged immediately or remediated.
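
As a rough illustration of the custom DLP templates mentioned above, the sketch below combines regular expressions, a keyword list, and a Luhn checksum to label a piece of text. The patterns, the keyword list, the label names, and the function names (`luhn_valid`, `classify`) are all assumptions made for this example; a production discovery tool would use far more thorough templates.

```python
import re

# Hypothetical patterns for illustration only, not a complete DLP template.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")   # 13-19 digits, optional separators
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")     # US SSN written as 123-45-6789
KEYWORDS = ("confidential", "cardholder", "ssn")  # assumed keyword list


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def classify(text: str) -> str:
    """Assign an illustrative sensitivity label to a piece of text."""
    # The Luhn check filters out digit runs that merely look like card numbers.
    has_card = any(luhn_valid(m.group()) for m in CARD_RE.finditer(text))
    if has_card or SSN_RE.search(text):
        return "sensitive"
    if any(keyword in text.lower() for keyword in KEYWORDS):
        return "confidential"
    return "unclassified"


if __name__ == "__main__":
    print(classify("Card on file: 4111 1111 1111 1111"))
    print(classify("Order history: 3 items"))
```

Pairing the card pattern with a checksum is one common way to reduce false positives, since many long digit strings (order IDs, timestamps) match the pattern but fail the check.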


What are the traits of a good data discovery tool for cloud data security?

Cloud security vulnerabilities aside, a data discovery tool for the cloud, especially one with regulatory capabilities such as GDPR data discovery and PII data discovery, helps the organization scan its buckets, find sensitive data scattered across them, and answer the critical question: ‘how do you find the data you need to secure?’

To deliver the most value to an organization, a good data discovery tool for the cloud may offer the following features:

  • A centralized tool for securing all the data in your environment
  • Ability to scan AWS Cloud, Google Drive, O365, SharePoint, etc.
  • Option to scan all folders or only a selected folder or drive
  • Support for scanning email servers (Gmail, Yahoo, Outlook, custom or business mail)
  • Option to resume a scan from where it last stopped
  • Scanning of audio, PDFs and images, including handwritten text, for sensitive data
  • Automation of future scans through monthly or quarterly scheduling
  • Option to scan only recently modified files
  • Remediation options such as masking, truncation, deletion, and encryption
  • AI and machine learning capabilities to identify complex data files, increase accuracy, and reduce false positives
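
Two of the features above, remediation by masking or truncation and scanning only modified files, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names (`mask_pan`, `truncate_pan`, `files_to_scan`), the "keep last four" and "first six plus last four" formats, and the `(path, modified_at)` tuple layout are all hypothetical choices for the example, not the behaviour of any particular tool.

```python
from datetime import datetime, timezone


def mask_pan(pan: str, keep_last: int = 4, mask_char: str = "*") -> str:
    """Mask every digit except the last `keep_last`, preserving separators."""
    total_digits = sum(ch.isdigit() for ch in pan)
    seen = 0
    out = []
    for ch in pan:
        if ch.isdigit():
            seen += 1
            # Only the trailing `keep_last` digits stay readable.
            out.append(ch if seen > total_digits - keep_last else mask_char)
        else:
            out.append(ch)
    return "".join(out)


def truncate_pan(pan: str) -> str:
    """Keep only the first six and last four digits (assumed format)."""
    digits = "".join(ch for ch in pan if ch.isdigit())
    return digits[:6] + digits[-4:]


def files_to_scan(files, last_scan):
    """Return paths whose modification time is later than the last scan.

    `files` is an iterable of (path, modified_at) tuples, a hypothetical
    layout standing in for whatever listing a storage API returns.
    """
    return [path for path, modified_at in files if modified_at > last_scan]


if __name__ == "__main__":
    print(mask_pan("4111 1111 1111 1111"))  # **** **** **** 1111
    last_scan = datetime(2021, 1, 1, tzinfo=timezone.utc)
    listing = [
        ("reports/q4.csv", datetime(2020, 12, 15, tzinfo=timezone.utc)),
        ("exports/leads.csv", datetime(2021, 2, 3, tzinfo=timezone.utc)),
    ]
    print(files_to_scan(listing, last_scan))  # ['exports/leads.csv']
```

Masking keeps the stored value's shape intact for display, while truncation discards the middle digits altogether; which one fits depends on the organization's remediation policy.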


Sensitive data discovery isn’t a single-headed monster, and a risk-based approach to robust, holistic data discovery needs to cover the relevant regulatory standards comprehensively. When was the last time you searched your infrastructure for sensitive data? How does sensitive data discovery fit into your cloud and hybrid models?
