The revolution in the data-driven healthcare and life sciences market
Find out more about how synthetic data is surging in response to growing demand.
Accelerating innovation, reducing time-to-data, enabling data collaborations and saving research costs: these are among the benefits to be expected from the use of synthetic data.
Synthetic data is artificial data that is generated from original data and a model that is trained to reproduce the characteristics and structure of the original data. This means that synthetic data and original data should deliver very similar results when undergoing the same statistical analysis. The generation process, also called synthesis, can be performed using different techniques, such as decision trees, deep learning algorithms and statistical models that replicate the patterns, characteristics and relationships found in real-world data. These are data that are not collected from an interaction with the real world but look in every way similar to original personal data ("source data").
More precisely, through data synthesis artificial databases are produced "in a test tube" that have statistical properties that are extremely similar, if not identical, to those of the source data even though these data do not fall under the definition of personal data.
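As a minimal sketch of the idea that synthetic and source data should give very similar results under the same statistical analysis, the following example fits a simple statistical model (a bivariate normal distribution) to a toy dataset and samples new records from it. The variables (age and blood pressure), the sample size and the helper names are illustrative assumptions, not taken from any real synthesis tool; production systems typically rely on richer models.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_model(xs, ys):
    """Summarise the source data by its means, spreads and correlation."""
    return {
        "mean_x": statistics.fmean(xs),
        "mean_y": statistics.fmean(ys),
        "sd_x": statistics.stdev(xs),
        "sd_y": statistics.stdev(ys),
        "rho": pearson(xs, ys),
    }


def synthesize(model, n, rng):
    """Draw n artificial records from the fitted bivariate normal model."""
    rho = model["rho"]
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(model["mean_x"] + model["sd_x"] * z1)
        # Mix z1 and z2 so that the second variable keeps the correlation rho.
        ys.append(model["mean_y"]
                  + model["sd_y"] * (rho * z1 + math.sqrt(1 - rho ** 2) * z2))
    return xs, ys


rng = random.Random(0)

# Toy "source data" (hypothetical): age and systolic blood pressure.
source_age = [rng.gauss(55, 12) for _ in range(2000)]
source_bp = [100 + 0.5 * age + rng.gauss(0, 8) for age in source_age]

model = fit_model(source_age, source_bp)
synth_age, synth_bp = synthesize(model, 2000, rng)

# Running the same analysis on both datasets should give similar figures.
synth_model = fit_model(synth_age, synth_bp)
for key in model:
    print(f"{key}: source={model[key]:.2f} synthetic={synth_model[key]:.2f}")
```

None of the synthetic records corresponds to a record in the source list, yet the means, standard deviations and correlation are close. Whether such records qualify as non-personal data in a given case remains a legal assessment that this sketch does not address.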
In this respect, it is worth noting that the draft EU Artificial Intelligence Act 1 suggests the use of "anonymised, synthetic or other non-personal data" in place of personal data, thus treating the former categories as equivalent to one another and distinguishing them from personal data. Therefore, synthetic data makes it possible to use equivalent datasets that are not subject to the rules on personal data processing, as they fall outside the material scope of the General Data Protection Regulation n. 2016/679 2 ("GDPR").
The potential applications of synthetic data are wide and varied, as it is gaining traction within the machine learning domain. Our work is primarily focused on the training of AI systems and on scientific research in clinical trials.
In 2022, the worldwide market for synthetic data generation reached USD 288.5 million, with an anticipated Compound Annual Growth Rate (CAGR) of 31.1% from 2023 to 2030 3. The industry is experiencing growth driven by increased AI penetration, leading to a rise in the generation of synthetic data.
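To give a sense of what a 31.1% CAGR implies, the compounding arithmetic can be sketched as follows. The choice of 2022 as the base year and of eight compounding periods up to 2030 is an assumption made for illustration; the cited report may use a different base value, so the result is not a figure from that report.

```python
def project(value, cagr, years):
    """Compound a starting value at a constant annual growth rate."""
    return value * (1 + cagr) ** years


base_2022 = 288.5   # USD million, figure quoted in the text
cagr = 0.311        # 31.1% per year, figure quoted in the text

# Assumption: growth compounds once per year from 2022 to 2030.
projection_2030 = project(base_2022, cagr, 2030 - 2022)
print(f"Illustrative 2030 market size: USD {projection_2030:,.0f} million")
```

Under these assumptions the market would grow to roughly USD 2.5 billion by 2030, almost nine times its 2022 size.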
In data-driven markets where the urgency to adhere to privacy laws can be a barrier to innovation, in particular due to the many variations introduced by local legislation, this cutting-edge privacy-enhancing technology redefines how we approach data privacy. Unlike traditional methods that involve handling raw personal data directly, synthetic data mining makes it possible to construct artificial datasets from which all relevant insights can be extracted, even though the information they contain does not correspond to real individuals, i.e. it is not personal data as defined by the GDPR.
For instance, with reference to clinical trials, this innovative approach not only safeguards privacy but also allows for meaningful analysis without compromising sensitive details, using artificial yet statistically equivalent data, preserving individuals' privacy while advancing medical solutions. In essence, this technology offers the accuracy of raw data and the protection offered by anonymised data. To this end, in vitro trial data can be used to fulfil all or part of the information requirements that would otherwise require data obtained from experiments on living organisms (in vivo tests). In other words, conducting an in vitro trial leveraging synthetic data means performing it outside of a living organism, thus not requiring the use of a "whole" organism: the research can be carried out using synthetic yet statistically similar data, without the need to administer the drug to a person who voluntarily undergoes the study.
It is worth noting that on 10 March 2016, the European Parliament made a recommendation to the European Medicines Agency to take into account alternative methods in the evaluation of medical products, and in particular in silico methodologies. Similarly, the Food and Drug Administration ("FDA") considers these methodologies to be powerful tools that complement traditional methods for gathering evidence - including bench-top (in vitro) testing, and animal or clinical (in vivo) studies - about products regulated by the FDA or for developing FDA policy.
Moreover, it must be highlighted that the recent rapid advancements of Generative AI ("GenAI") systems have unveiled unprecedented capabilities, pushing the boundaries of what we thought possible. However, this has sparked a global discussion on how to strike a balance between the rights of intellectual property holders and the interests of AI developers. Some IP holders and representative organizations have initiated legal actions against developers of GenAI tools 4, alleging that the training process and, in certain instances, the output of these tools violate their intellectual property rights. In response, the concept of "synthetic data" has emerged as a potential remedy. Synthetic data is data artificially generated by AI models, reducing the need for real-world data and, theoretically, offering a way to avoid the risks of IP infringement. Properly developed synthetic data should closely mimic real-world data, making it technically and statistically indistinguishable for the purpose of training AI models. This will clearly be food for thought for IP lawyers in the near future.
Let's try to shed light on the privacy-related and contractual matters which may arise from the massive use of in silico data, and on current and future developments.
Impact on privacy principles
When delving into the potential impacts of synthetic data on privacy and data protection, a nuanced examination reveals both positive and slippery dimensions.
On the positive side, as a privacy-enhancing technology (PET), this innovation holds the key to reshaping how we manage and share information in the digital age, while also providing an innovative solution to the conundrum arising from the need to abide by the fragmented patchwork of data privacy regulation across the globe. In particular, under a privacy by design approach, synthetic data could be a significant stride in enhancing privacy, since it adds an extra layer of protection for individuals. Crucially, this approach can have a role in:
Protecting personal data and minimizing the risk of data breaches where a specific processing activity does not require raw data (e.g., in R&D) and it is therefore decided to create a synthetic dataset reproducing the same statistics. For instance, in the healthcare industry, protecting patient privacy is pivotal and the adoption of synthetic data can be seen as a step toward solving this problem. By generating artificial data that closely resembles source data, but without containing identifying information, the risk of data breaches and unauthorized access is greatly reduced.
Sharing non-personal synthetic data with third parties without the need to seek a legal basis or to provide appropriate transparency under Articles 13 and 14 GDPR. For instance, this would help accommodate the growing data demands of clinical trials by greatly simplifying the gathering of the data necessary for scientific research from different sources.
Moreover, synthetic data holds promise in addressing biases within artificial intelligence models. By integrating fair synthetic datasets during the training phase of the AI, the technology aims to mitigate biases and present a more accurate reflection of the world. This is particularly relevant in the context of reducing gender-based or racial discrimination, signalling a positive step toward aligning datasets with societal ideals. Proper validation and verification of synthetic data generation algorithms is critical to ensure that the data reflect the true underlying distribution.
On the flip side, complexities emerge in terms of output control, especially within intricate datasets. Ensuring accuracy and consistency necessitates a comparative analysis between synthetic and source data. To this end, the European Data Protection Supervisor suggests conducting a privacy assurance assessment to ensure that the resulting synthetic data is not actual personal data. This privacy assurance evaluates the extent to which data subjects can be identified in the synthetic data and how much new data about those data subjects would be revealed upon successful identification.
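One simple heuristic that is sometimes used as part of such a comparison is to measure how close each synthetic record lies to its nearest source record: a synthetic record that coincides (or nearly coincides) with a real one may effectively disclose it. The sketch below illustrates this check on a toy dataset. The records, the threshold and the function names are hypothetical, and the check is only one possible ingredient of a privacy assurance assessment, not the method prescribed by any authority.

```python
import math


def nearest_distance(record, dataset):
    """Euclidean distance from a record to its closest neighbour in dataset."""
    return min(math.dist(record, other) for other in dataset)


def flag_close_records(synthetic, source, threshold):
    """Return the synthetic records lying within threshold of a source record."""
    return [s for s in synthetic if nearest_distance(s, source) <= threshold]


# Toy records (hypothetical): (age, systolic blood pressure).
source = [(34.0, 120.0), (51.0, 135.0), (67.0, 150.0)]
synthetic = [(34.0, 120.0), (45.2, 128.7), (70.1, 149.0)]

# Assumption: 0.5 is an arbitrary threshold chosen for this toy example.
flagged = flag_close_records(synthetic, source, threshold=0.5)
print(flagged)  # the first synthetic record is an exact copy of a source record
```

In practice the features would be normalised first and the threshold calibrated, for example against the typical distance between source records themselves.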
Another challenge lies in the difficulty of identifying outliers. Synthetic data, though mimicking real-world data, may not comprehensively capture outliers present in the original dataset. This limitation is noteworthy as outliers often hold significant importance in specific applications, posing a potential drawback for certain use cases.
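The limitation regarding outliers can be illustrated with a deliberately simple synthesizer. In the sketch below, a normal distribution is fitted to a toy dataset containing a few extreme values; the synthetic sample reproduces the bulk of the distribution, but the extreme values are very unlikely to reappear. The data and the choice of a single fitted normal distribution are assumptions made for illustration; more sophisticated generators may capture tails better, at the cost of a higher re-identification risk for those rare records.

```python
import random
import statistics

rng = random.Random(1)

# Toy source data: 1,000 ordinary measurements plus three extreme outliers.
source = [rng.gauss(100, 10) for _ in range(1000)] + [400.0, 400.0, 400.0]

# Deliberately simple synthesizer: a normal distribution fitted to the source.
mean = statistics.fmean(source)
sd = statistics.stdev(source)
synthetic = [rng.gauss(mean, sd) for _ in range(len(source))]

source_max = max(source)
synthetic_max = max(synthetic)
print(f"largest source value:    {source_max:.1f}")
print(f"largest synthetic value: {synthetic_max:.1f}")
```

The largest synthetic value stays far below the outliers present in the source, so an analysis that depends on those rare cases would not be supported by this synthetic dataset.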
In essence, the impact of synthetic data on data protection is a multifaceted consideration, with material advantages and potential challenges that necessitate careful examination and thoughtful implementation. To this end, international industry standardization of the generation process, also called synthesis, would boost and simplify the adoption of this privacy-enhancing technology.
Common standards give communities a basis for mutual understanding and information exchange, and they are indispensable for collaborative work. Data from different sources and recorded at different times must be integrated in order to set up models. Consistent documentation of data, models and simulation results based on standards ensures that the data and corresponding metadata (data describing the data and its context), as well as models, methods and visualizations, are structured in an interoperable manner.
For instance, this is the goal of EU-STANDS4PM, the Coordination and Support Action funded under the Horizon2020 framework programme of the European Commission, which conducted an EU-wide mapping process to assess and evaluate strategies for data-driven in silico modelling approaches. In particular, with reference to pharmacodynamic modelling, this study states that in silico models offer a unique possibility to integrate personalised omics data in a whole‐body context. It is hence a plausible expectation that in silico models will in the future be used to represent personalised patient data in digital twins and to optimize therapeutic outcomes of drug treatments. Such approaches may thereby help to overcome the currently prevailing "one‐size‐fits‐all" paradigm in drug treatment through model‐informed precision dosing. This includes in particular tailored patient‐specific therapies with maximum efficacy yet minimum adverse side effects.
Biobanks: an example of practical impact of synthetic data
Biobanks are non-profit service units that aim to collect, process, store and distribute human biological samples and related data for research and diagnosis. They represent a key emerging biomedical research infrastructure, bringing together a multitude of data on people, including health and lifestyle data, and providing useful data for observational epidemiological studies on populations.
In Iceland, the deCODE project has been running since 2001. It involves collecting DNA samples, biographical data and clinical data from the entire population within a national biobank consisting of a collection of biological samples stored indefinitely. Biobanks therefore reflect the shift towards greater global sharing of genomic and health-related data, which is considered by many to be an ethical and scientific imperative. The collective interests lie in improving the health and welfare of individuals, communities, and populations; improving health and welfare requires access to, and use of, widely dispersed quality data. However, sharing these individual and familial personal data requires in turn that due thought be given to the ethical and legal interests at stake. Most critically, data sharing must occur in an environment whereby privacy interests are safeguarded throughout the lifecycle of biobank initiatives, and regardless of the locations where the data are stored, to which they are sent, and where they are ultimately processed.
In particular, from a privacy perspective, it must be highlighted that every tissue stored is a potential source of genetic information (DNA). Genetic biobanks therefore require special precautions, because a true genetic profile of the individual could emerge from the linked dataset, as the "uniqueness" of each person's genome is universally recognized. As of today, the processing of genetic data falls within the scope of the rules on the processing of health-related personal data, without any kind of distinction between "genetic data" and "biological material".
From a regulatory perspective, Directive 98/44/EC does not help either: in regulating biotechnological patents, it defines "biological material" as "material containing genetic information" and subjects the material element and the informational element indiscriminately to the same discipline.
Moreover, the necessary storage, sharing and distribution of data among biobanks and research institutions pose a risk to privacy rights. In particular, the main privacy issues that hinder and limit scientific research and the work of these operators, jeopardising scientific development, mainly concern:
i. seeking an appropriate legal basis under Article 9 GDPR with reference to both the collection of health-related data and the sharing of these with research centres, physicians and clinics;
ii. providing adequate and constantly updated transparency as provided by Articles 13 and 14 GDPR;
iii. complying with specific legislation on cross-border data transfer as well as on data localisation/residency;
iv. implementing appropriate technical and organisational measures which are designed to implement data-protection principles, such as data minimisation, in an effective manner and to integrate the necessary safeguards into the processing in order to meet the requirements of GDPR and protect the rights of data subjects (Article 25 GDPR).
This is also particularly complex in the event of a transfer of the dataset to third parties in the context of merger and acquisition (M&A) deals involving a change of ownership of a database containing sensitive data (e.g., through an asset deal, a merger, etc.). In this regard, with reference to a biobank containing approximately 230,000 biological samples, together with clinical, biochemical, demographic and genealogical information on approximately 11,770 individuals, in 2023 the Italian Data Protection Authority reaffirmed that in the case of the transfer of a dataset, the new data controller must verify that it has an appropriate legal basis, in this case consent under Article 9 GDPR, and provide appropriate transparency under Article 14 GDPR. 5
Given the above, implementing data synthesis technology immediately after the collection of data would take the resulting synthetic dataset outside the scope of the privacy framework, thus removing the risks for data subjects in terms of security, privacy obligations (transparency and legal basis), sharing and distribution, as well as the risk of seeing a business or corporate deal thwarted.
A glimpse into the future: the European Health Data Space. Challenges and opportunities of synthetic data in the healthcare revolution
On 3 May 2022, the European Commission presented a legislative proposal on the European Health Data Space ("EHDS"). The implementation of the EHDS is part of the overarching strategy formulated by European institutions to establish a single data market. The objective is to position the EU as a benchmark in the effective management and sharing of electronic health data. This, in turn, involves leveraging the full potential of such data for primary purposes like delivering care and assistance to citizens and secondary purposes such as conducting research, development activities, and shaping health policy, ultimately encouraging innovation and allowing companies to compete on global markets.
Within the European Health Data Space, where the exchange of health data is paramount, synthetic data shines as a beacon of privacy and scientific development. Healthcare organizations can harness the power of synthetic datasets to advance research, develop models, and improve patient outcomes without compromising either individual privacy or the accuracy of the dataset, in cases where anonymized data cannot effectively fulfil the purpose. This aligns seamlessly with the EU's commitment to fostering innovation in healthcare while upholding the highest standards of data protection. Indeed, under the Commission's EHDS proposal, data access in anonymized form shall be the standard way in which information is provided with regard to secondary use. Only if the purpose of the processing - among those listed by the proposal - cannot be achieved through access to anonymized data will the bodies responsible for access identified by individual Member States be allowed to provide data access in pseudonymized form. In this sense, synthetic data provide the same advantages as raw data, also in terms of statistical analysis of the entire dataset by specific characteristic, while reducing the identifiability of the data subject, thereby reconciling the need to protect the rights and freedoms of natural persons with the peculiarities of this sector.
Next sensitive topics
Beyond the privacy aspects, the main issues that need to be carefully managed in contracts with the vendor providing synthesis services relate to:
The protection of IP rights on the systems and algorithms of synthesis.
The protection of IP rights on the source data feeding the synthesis systems whenever those data are not provided directly by the client. It is worth mentioning that several big tech companies developing AI systems have recently announced that they have introduced an IP waiver for users of their AI applications. Moreover, authorities and courts are showing 6 increasing resistance towards techniques for scraping data from public sources for purposes other than those for which they were published.
The evaluation of secondary potential use of data for training purposes of synthesis algorithms.
Conclusion: synthetic data - an ethical cornerstone for digital transformation
As mentioned above, synthetic data stands at the forefront of privacy-enhancing technologies, presenting a transformative solution for safeguarding sensitive information. In the context of the EU AI Act draft and the European Health Data Space proposal, synthetic data emerges as a cornerstone for digital transformation.
As we navigate the evolving landscape of data-driven technologies, synthetic data paves the way for a future where privacy and progress coexist harmoniously. It also opens up new market opportunities and fosters collaboration and sharing between market players, while supporting data representativeness and addressing biases: more diverse and representative datasets make it possible to train models that are fairer and more accurate, also with respect to minorities.
1 In early December 2023, a provisional agreement on the text of the EU's new AI Act was finally reached between the Council of the EU and the European Parliament. Although some work remains before the text is finalised, its content is, for all practical purposes, now agreed upon. The final text is expected between May and June 2024. The majority of the Act's provisions will apply after a two-year grace period for compliance. However, the regulation's prohibitions will already apply after six months and the obligations for GPAI models will become effective after 12 months.
2 Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation)
3 https://www.fortunebusinessinsights.com/synthetic-data-generation-market-108433
4 Some of the most prominent cases currently pending in U.S. courts include: Doe I v. Github, Inc., No. 4:22-cv-06823 (N.D. Cal. Nov. 3, 2022); Andersen et al. v. Stability AI et al., No. 3:23-cv-00201 (N.D. Cal. Jan. 13, 2023); Getty Images (US), Inc. v. Stability AI, No. 1:23-cv-00135 (D. Del. Feb. 3, 2023); J.L. et al. v. Alphabet, No. 3:23-cv-03440 (N.D. Cal. Jul. 11, 2023); Tremblay et al. v. OpenAI, No. 3:23-cv-03223 (N.D. Cal. June 28, 2023); Silverman et al. v. OpenAI, No. 4:23-cv-03416 (N.D. Cal. July 7, 2023); Kadrey et al. v. Meta Platforms, No. 3:23-cv-03417 (N.D. Cal. July 7, 2023); Chabon et al. v. OpenAI, No. 3:23-cv-04625 (N.D. Cal. Sept. 8, 2023); Chabon et al. v. Meta Platforms, No. 3:23-cv-04663 (N.D. Cal. Sept. 12, 2023); Authors Guild v. OpenAI, No. 1:23-cv-08292 (S.D.N.Y. Sept. 19, 2023); Huckabee et al. v. Meta Platforms et al., No. 1:23-cv-09152 (S.D.N.Y. Oct. 17, 2023); Concord Music Group et al. v. Anthropic PBC, No. 3:23-cv-01092 (M.D. Tenn. Oct. 18, 2023); Sancton v. Open AI, Inc. et al., No. 1:23-10211 (S.D.N.Y. Nov. 21, 2023)
5 Italian Data Protection Authority - Provision of 27 April 2023 - https://www.garanteprivacy.it/web/guest/home/docweb/-/docweb-display/docweb/9898815
6 For instance, the Italian Data Protection Authority with the provision No. 201 of 17th May, 2023, has prohibited the owner of a website from creating and disseminating online a telephone directory formed by "scraping" data through web scraping and has imposed a fine of €60,000. Moreover, on 24 August 2023, 12 international data protection and privacy regulators issued a joint statement (Statement) on their "global expectations of social media platforms and other sites to safeguard against unlawful data scraping". The Statement is a call to action for online platforms and websites, particularly social media companies (SMCs), to address the rise of unlawful data scraping. It sets out expected standards to ensure the protection of personal data and confirms that data protection rules apply to data scraping. While no European data protection regulator is a signatory to the Statement, it is a significant regulatory development for such an array of non-European regulators to come together and issue this joint message. It is a rare occurrence and demonstrates that data scraping, including its impact on data protection rules, is being considered at an international level. This is undoubtedly an area of increased regulatory focus as global policymakers seek to regulate artificial intelligence (AI) technologies.


