Unsuspected Data Leaks: A Healthtech Nightmare
Why it is difficult to detect data leaks
Google and Facebook provide developers with analytics platforms to understand their users' behavior and earn revenue through targeted ads. To use them, developers need to integrate the platforms' third-party SDKs into their apps. Data is shared through these SDKs and sent to Google’s and Facebook’s servers. Once it is there, organizations and developers have very little control over how it is used to target ads at their users.
How it could have been prevented
To solve this problem, organizations need an automated mechanism that analyzes how data flows across their applications and out of the company to third parties. Such a tool helps organizations monitor and flag privacy issues as soon as developers commit new code. The best way to achieve this is to integrate the tool into the CI/CD pipelines of the development workflow.
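To make the idea of a commit-time check concrete, here is a minimal sketch of the comparison such a CI step could perform: it takes the data flows approved in a baseline and the data flows found in the latest scan, and flags any new flow that reaches a third party. The `PrivacyGate` class, the `DataFlow` record, and the sample destinations are hypothetical names chosen for illustration; they are not part of Privado or any other tool.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical CI gate: flags third-party data flows that are absent from an approved baseline. */
public class PrivacyGate {

    /** One data flow: a data element reaching a destination (all names are illustrative). */
    record DataFlow(String dataElement, String destination, boolean thirdParty) {}

    /** Returns "element -> destination" for each third-party flow in current that baseline lacks. */
    static Set<String> newThirdPartyFlows(List<DataFlow> baseline, List<DataFlow> current) {
        Set<String> flagged = new TreeSet<>();
        for (DataFlow flow : current) {
            // Records compare by value, so contains() matches identical flows.
            if (flow.thirdParty() && !baseline.contains(flow)) {
                flagged.add(flow.dataElement() + " -> " + flow.destination());
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        List<DataFlow> baseline = List.of(
            new DataFlow("Email Address", "internal-db", false));
        List<DataFlow> current = List.of(
            new DataFlow("Email Address", "internal-db", false),
            new DataFlow("Medical Certificate", "ads-sdk", true));

        Set<String> flagged = newThirdPartyFlows(baseline, current);
        for (String flow : flagged) {
            System.out.println("New third-party flow: " + flow);
        }
        // A real pipeline step would fail the build here so the change gets reviewed.
        System.out.println(flagged.isEmpty() ? "PASS" : "REVIEW REQUIRED");
    }
}
```

In practice the two lists would come from the scanner's output for the main branch and for the incoming commit, rather than being hard-coded.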
Creating data flows with Privado
Let’s walk through an example of shifting monitoring and detection closer to the development workflow. We will use the HealthPlus repository, a healthcare practice management application that handles sensitive health data about its users. First, we will map the data flows in the existing repository to see how the various data elements move through the code. To do that, we will use the open-source Privado code scanner. To install Privado, follow these steps:
Once Privado is installed, follow these steps to scan the repository:
The scan will take about a minute, after which we can see the various data elements and data flows in the repository. An example data flow of Medical Certificate data is shown below:
Now, let’s assume that a developer needs to add Facebook Ads SDK to the repository. To simulate such conditions, I’ve created an Example.java file with the following contents.
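As a rough illustration of the pattern being simulated, the sketch below shows patient data being handed to an ads SDK. It is not the actual file and does not use the real Facebook API: `AdsEventLogger` is a hypothetical stand-in for a third-party SDK client, and the `Patient` fields and event name are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: patient data handed to a stand-in for a third-party ads SDK. */
public class Example {

    /** An event as it would be handed to the SDK. */
    record Event(String name, Map<String, String> params) {}

    /** Stand-in for a third-party SDK event logger; NOT the real Facebook API. */
    static class AdsEventLogger {
        final List<Event> sentEvents = new ArrayList<>();

        void logEvent(String eventName, Map<String, String> params) {
            // A real SDK would transmit this payload to the vendor's servers.
            sentEvents.add(new Event(eventName, params));
        }
    }

    /** Illustrative patient model; field names are assumptions. */
    record Patient(String firstName, String email, String diagnosis) {}

    static void trackSignup(AdsEventLogger logger, Patient patient) {
        // Personal and health data leaves the application at this call.
        logger.logEvent("patient_signup", Map.of(
            "email", patient.email(),
            "diagnosis", patient.diagnosis()));
    }

    public static void main(String[] args) {
        AdsEventLogger logger = new AdsEventLogger();
        trackSignup(logger, new Patient("Jane", "jane@example.com", "asthma"));
        System.out.println("Events sent to third party: " + logger.sentEvents.size());
    }
}
```

The call inside `trackSignup` is the kind of source-to-sink flow a code scanner looks for: a sensitive data element passed as an argument to a third-party SDK method.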
If a developer then sends any data to this SDK for marketing purposes, the Privado scanner will detect it and show it in the dashboard. Let’s scan the repository again and see the results:
As we can see, the scanner was able to detect the addition of the Facebook SDK in the code. We can also look up the Code Analysis section to view a detailed line-by-line flow of the data to the SDK, as displayed below:
In this way, we can move privacy and data security assessments closer to the developer workflow and detect flaws and violations early, saving time and reducing the risk of a regulatory violation.
As a side note, while scanning the repository I came across an interesting data element, categorized as “Religion / Religious Beliefs”. I wanted to know why such a data element was being used in a healthcare repository. By navigating the Code Analysis tool for the Religion data element, I was able to pinpoint exactly where it was initialized and to follow its entire journey through the code, including five log leakages. This is useful from a privacy engineer’s perspective: privacy engineers often cannot scan the entire codebase manually and have to resort to sending assessments to developers to map out the data flows.
You can check out the tool yourself on GitHub. Feel free to leave a comment and share your experience of creating data flows and mapping the data elements used in your repositories.