The internet is a major distribution platform for web applications, but there are no effective transparency and audit mechanisms in place for the web. Due to the ephemeral nature of web applications, a client visiting a website has no guarantee that the code it receives today is the same as yesterday, or the same as other visitors receive. Despite advances in web security, it is thus challenging to audit web applications before they are rendered in the browser. We propose Accountable JS, a browser extension and opt-in protocol for accountable delivery of active content on a web page. We prototype our protocol, formally model its security properties with the Tamarin Prover, and evaluate its compatibility and performance impact with case studies including WhatsApp Web, AdSense and Nimiq. Accountability is beginning to be deployed at scale, with Meta’s recent announcement of Code Verify available to all 2 billion WhatsApp users, but there has been little formal analysis of such protocols. We formally model Code Verify using the Tamarin Prover and compare its properties to our Accountable JS protocol. We also compare Code Verify’s and the Accountable JS extension’s performance impacts on WhatsApp Web.
On How Zero-Knowledge Proof Blockchain Mixers Improve, and Worsen User Privacy
Wang, Zhipeng, Chaliasos, Stefanos, Qin, Kaihua, Zhou, Liyi, Gao, Lifeng,
Berrang, Pascal, Livshits, Ben, and Gervais, Arthur
In Proceedings of the ACM Web Conference 2023
One of the most prominent and widely used blockchain privacy solutions is the class of zero-knowledge proof (ZKP) mixers operating on top of smart contract-enabled blockchains. ZKP mixers typically advertise their level of privacy through a so-called anonymity set size, similar to k-anonymity, where a user hides among a set of k other users. In reality, however, these anonymity set claims are mostly inaccurate, as we find through empirical measurements of the currently most active ZKP mixers. We propose five heuristics that, in combination, can increase the probability that an adversary links a withdrawer to the correct depositor on average by 51.94% (108.63%) on the most popular Ethereum (ETH) and Binance Smart Chain (BSC) mixer, respectively. Our empirical evidence is hence also the first to suggest a differing privacy-predilection of users on ETH and BSC. We further identify 105 Decentralized Finance (DeFi) attackers leveraging ZKP mixers both as a source of initial funds and as a place to deposit attack revenue (e.g., from phishing scams, hacking centralized exchanges, and blockchain project attacks).
State-of-the-art mixers are moreover tightly intertwined with the growing DeFi ecosystem by offering "anonymity mining" (AM) incentives, i.e., mixer users receive monetary rewards for mixing coins. However, contrary to the claims of related work, we find that AM does not always improve the quality of a mixer’s anonymity set, because AM tends to attract privacy-ignorant users who naively reuse addresses.
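The address-reuse observation above can be illustrated with a toy linking heuristic (synthetic data; an illustrative sketch in the spirit of the paper’s heuristics, not their exact rules):

```python
# Toy address-reuse heuristic: a withdrawal is linked to a deposit when
# the same address appears on both sides of the mixer. Data is synthetic.

def link_by_address_reuse(deposits, withdrawals):
    """deposits/withdrawals: lists of (address, tx_id) tuples.
    Returns {withdraw_tx_id: deposit_tx_id} for reused addresses."""
    deposit_by_addr = {}
    for addr, tx in deposits:
        deposit_by_addr.setdefault(addr, tx)  # keep first deposit per address
    links = {}
    for addr, tx in withdrawals:
        if addr in deposit_by_addr:
            links[tx] = deposit_by_addr[addr]
    return links

deposits = [("0xaaa", "d1"), ("0xbbb", "d2"), ("0xccc", "d3")]
withdrawals = [("0xbbb", "w1"), ("0xddd", "w2")]
print(link_by_address_reuse(deposits, withdrawals))  # {'w1': 'd2'}
```

Real deanonymization heuristics additionally exploit timing, amounts, and transaction-graph structure; this sketch only captures the most naive privacy failure mode the abstract mentions.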
A framework for constructing Single Secret Leader Election from MPC
The emergence of distributed digital currencies has raised the need for a reliable consensus mechanism. In proof-of-stake cryptocurrencies, the participants periodically choose a closed set of validators, who can vote and append transactions to the blockchain. Each validator can become a leader with the probability proportional to its stake. Keeping the leader private yet unique until it publishes a new block can significantly reduce the attack vector of an adversary and improve the throughput of the network. The problem of Single Secret Leader Election (SSLE) was first formally defined by Boneh et al. in 2020.
In this work, we propose a novel framework for constructing SSLE protocols, which relies on secure multi-party computation (MPC) and satisfies the desired security properties. Our framework does not use any shuffle or sort operations and has a computational cost for N parties as low as O(N) basic MPC operations per party. We improve the state-of-the-art for SSLE protocols that do not assume a trusted setup. Moreover, our SSLE scheme efficiently handles weighted elections. That is, for a total weight S of N parties, the associated costs are increased only by a factor of log S. When the MPC layer is instantiated with techniques based on Shamir’s secret-sharing, our SSLE has a communication cost of O(N^2), which is spread over O(log N) rounds, can tolerate up to t < N/2 faulty nodes without restarting the protocol, and its security relies on DDH in the random oracle model. When the MPC layer is instantiated with more efficient techniques based on garbled circuits, our SSLE requires all parties to participate, up to N-1 of which can be malicious, and its security is based on the random oracle model.
Fine-Tuning Is All You Need to Mitigate Backdoor Attacks
Backdoor attacks represent one of the major threats to machine learning models. Various efforts have been made to mitigate backdoors. However, existing defenses have become increasingly complex and often require high computational resources or may also jeopardize models’ utility. In this work, we show that fine-tuning, one of the most common and easy-to-adopt machine learning training operations, can effectively remove backdoors from machine learning models while maintaining high model utility. Extensive experiments over three machine learning paradigms show that fine-tuning and our newly proposed super-fine-tuning achieve strong defense performance. Furthermore, we coin a new term, namely backdoor sequela, to measure the changes in model vulnerabilities to other attacks before and after the backdoor has been removed. Empirical evaluation shows that, compared to other defense methods, super-fine-tuning leaves limited backdoor sequela. We hope our results can help machine learning model owners better protect their models from backdoor threats. Also, it calls for the design of more advanced attacks in order to comprehensively assess machine learning models’ backdoor vulnerabilities.
Data Poisoning Attacks Against Multimodal Encoders
Traditional machine learning (ML) models usually rely on large-scale labeled datasets to achieve strong performance. However, such labeled datasets are often challenging and expensive to obtain. Also, the predefined categories limit the model’s ability to generalize to other visual concepts, as additional labeled data is required. In contrast, the newly emerged multimodal model, which contains both visual and linguistic modalities, learns the concept of images from raw text. This is a promising way to solve the above problems, as such a model can use easy-to-collect image-text pairs to construct the training dataset, and the raw texts contain almost unlimited categories according to their semantics. However, learning from a large-scale unlabeled dataset also exposes the model to the risk of potential poisoning attacks, whereby the adversary aims to perturb the model’s training dataset to trigger malicious behaviors in it. Previous work mainly focuses on the visual modality. In this paper, we instead focus on answering two questions: (1) Is the linguistic modality also vulnerable to poisoning attacks? and (2) Which modality is most vulnerable? To answer these two questions, we conduct three types of poisoning attacks against CLIP, the most representative multimodal contrastive learning framework. Extensive evaluations on different datasets and model architectures show that all three attacks can perform well on the linguistic modality with only a relatively low poisoning rate and limited epochs. We also observe that the poisoning effect differs between modalities, i.e., with lower MinRank in the visual modality and with higher Hit@K when K is small in the linguistic modality. To mitigate the attacks, we propose both pre-training and post-training defenses. We empirically show that both defenses can significantly reduce the attack performance while preserving the model’s utility.
Membership Inference Against DNA Methylation Databases
Biomedical data sharing is one of the key elements fostering the advancement of biomedical research but poses severe risks towards the privacy of individuals contributing their data, as already demonstrated for genomic data. In this paper, we study whether and to which extent DNA methylation data, one of the most important epigenetic elements regulating human health, is prone to membership inference attacks, a critical type of attack that reveals an individual’s participation in a given database. We design and evaluate three different attacks exploiting published summary statistics, among which one is based on machine learning and another is exploiting the dependencies between genome and methylation data. Our extensive evaluation on six datasets containing a diverse set of tissues and diseases collected from more than 1,300 individuals in total shows that such membership inference attacks are effective, even when the target’s methylation profile is not accessible. It further shows that the machine-learning approach outperforms the statistical attacks, and that learned models are transferable across different datasets.
ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models
Machine learning (ML) has become a core component of many real-world applications and training data is a key factor that drives current progress. This huge success has led Internet companies to deploy machine learning as a service (MLaaS). Recently, the first membership inference attack has shown that extraction of information on the training set is possible in such MLaaS settings, which has severe security and privacy implications.
However, the early demonstrations of the feasibility of such attacks make many assumptions about the adversary, such as using multiple so-called shadow models, knowledge of the target model structure, and having a dataset from the same distribution as the target model’s training data. We relax all these key assumptions, thereby showing that such attacks are very broadly applicable at low cost and consequently pose a more severe risk than previously thought. We present the most comprehensive study so far on this emerging and developing threat, using eight diverse datasets that show the viability of the proposed attacks across domains. In addition, we propose the first effective defense mechanisms against this broader class of membership inference attacks that maintain a high level of utility of the ML model.
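The intuition behind relaxing the shadow-model assumption can be illustrated by the simplest baseline: overfitted models tend to output more peaked posteriors on training members, so a plain confidence threshold already yields a membership signal. A minimal numpy sketch with synthetic posteriors (the threshold value and data are illustrative assumptions, not the paper’s exact attack):

```python
import numpy as np

def membership_by_confidence(posteriors, tau=0.9):
    """Predict 'member' when the model's top-class posterior exceeds tau.
    posteriors: (n_samples, n_classes) array of softmax outputs."""
    return posteriors.max(axis=1) > tau

# Synthetic posteriors: training members tend to receive peaked predictions.
members = np.array([[0.97, 0.02, 0.01], [0.95, 0.03, 0.02]])
non_members = np.array([[0.50, 0.30, 0.20], [0.40, 0.35, 0.25]])
print(membership_by_confidence(members).tolist())      # [True, True]
print(membership_by_confidence(non_members).tolist())  # [False, False]
```

This baseline needs no shadow model, no knowledge of the target architecture, and no data from the training distribution, which is why relaxing those assumptions makes the attack broadly applicable.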
NDSS Best Paper Award
MBeacon: Privacy-Preserving Beacons for DNA Methylation Data
The advancement of molecular profiling techniques fuels biomedical research with a deluge of data. To facilitate data sharing, the Global Alliance for Genomics and Health established the Beacon system, a search engine designed to help researchers find datasets of interest. While the current Beacon system only supports genomic data, other types of biomedical data, such as DNA methylation, are also essential for advancing our understanding in the field. In this paper, we propose the first Beacon system for DNA methylation data sharing: MBeacon. As the current genomic Beacon is vulnerable to privacy attacks, such as membership inference, and DNA methylation data is highly sensitive, we take a privacy-by-design approach to construct MBeacon.
First, we demonstrate the privacy threat, by proposing a membership inference attack tailored specifically to unprotected methylation Beacons. Our experimental results show that 100 queries are sufficient to achieve a successful attack with AUC (area under the ROC curve) above 0.9. To remedy this situation, we propose a novel differential privacy mechanism, namely SVT^2, which is the core component of MBeacon. Extensive experiments over multiple datasets show that SVT^2 can successfully mitigate membership privacy risks without significantly harming utility. We further implement a fully functional prototype of MBeacon which we make available to the research community.
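For background, SVT^2 builds on the sparse vector technique from differential privacy. The classic AboveThreshold primitive sketched below is not the paper’s SVT^2, but it conveys the core idea of answering a stream of queries against a noisy threshold (the noise scales follow the standard textbook calibration for sensitivity-1 queries; the query values are synthetic):

```python
import numpy as np

def above_threshold(queries, threshold, epsilon, seed=0):
    """Classic AboveThreshold: scan sensitivity-1 queries and halt at the
    first one whose noisy value exceeds the noisy threshold."""
    rng = np.random.default_rng(seed)
    noisy_t = threshold + rng.laplace(scale=2.0 / epsilon)
    answers = []
    for q in queries:
        if q + rng.laplace(scale=4.0 / epsilon) >= noisy_t:
            answers.append(True)   # first above-threshold query: stop
            break
        answers.append(False)
    return answers

out = above_threshold([0.1, 0.2, 5.0, 0.3], threshold=1.0, epsilon=1.0)
print(out)
```

The key property is that only the single positive answer consumes meaningful privacy budget, which is what makes sparse-vector-style mechanisms attractive for Beacon-like query interfaces.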
Albatross – An optimistic consensus algorithm
Berrang, Pascal, Styp-Rekowsky, Philipp, Wissfeld, Marvin, França, Bruno, and Trinkler, Reto
In Proceedings of the Crypto Valley Conference on Blockchain Technology (CVCBT) 2019
The decreasing costs of molecular profiling have fueled the biomedical research community with a plethora of new types of biomedical data, enabling a breakthrough towards more precise and personalized medicine. Naturally, the increasing availability of data also enables physicians to compare patients’ data and treatments easily and to find similar patients in order to propose the optimal therapy. Such similar patient queries (SPQs) are of utmost importance to medical practice and will be relied upon in future health information exchange systems. While privacy-preserving solutions have been previously studied, those are limited to genomic data, ignoring the different newly available types of biomedical data.
In this paper, we propose new cryptographic techniques for finding similar patients in a privacy-preserving manner with various types of biomedical data, including genomic, epigenomic and transcriptomic data as well as their combination. We design protocols for two of the most common similarity metrics in biomedicine: the Euclidean distance and Pearson correlation coefficient. Moreover, unlike previous approaches, we account for the fact that certain locations contribute differently to a given disease or phenotype by allowing to limit the query to the relevant locations and to assign them different weights. Our protocols are specifically designed to be highly efficient in terms of communication and bandwidth, requiring only one or two rounds of communication and thus enabling scalable parallel queries. We rigorously prove our protocols to be secure based on cryptographic games and instantiate our technique with three of the most important types of biomedical data – namely DNA, microRNA expression, and DNA methylation. Our experimental results show that our protocols can compute a similarity query over a typical number of positions against a database of 1,000 patients in a few seconds. Finally, we propose and formalize strategies to mitigate the threat of malicious users or hospitals.
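Stripped of the cryptographic layer, the two weighted similarity metrics that the protocols compute can be sketched in plaintext as follows (the weights and data are illustrative; the actual protocols evaluate these quantities privately):

```python
import numpy as np

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance over the query-relevant locations."""
    return np.sqrt(np.sum(w * (x - y) ** 2))

def weighted_pearson(x, y, w):
    """Weighted Pearson correlation coefficient over the same locations."""
    w = w / w.sum()
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    return cov / np.sqrt(np.sum(w * (x - mx) ** 2) * np.sum(w * (y - my) ** 2))

x = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 1.0, 2.0])          # per-location relevance weights
print(weighted_euclidean(x, x, w))     # 0.0
print(round(weighted_pearson(x, 2 * x, w), 6))  # 1.0
```

Restricting the sums to disease-relevant locations and choosing the weights per query is what lets a physician tailor the similarity search, as described above.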
Revisiting Membership Inference Attacks Against Machine Learning Models
The decreasing costs of molecular profiling have fueled the biomedical research community with a plethora of new types of biomedical data, enabling a breakthrough towards more precise and personalized medicine. However, the release of these intrinsically highly sensitive data poses a new severe privacy threat. While biomedical data is largely associated with our health, there also exist various correlations between different types of biomedical data, along the temporal dimension, and in-between family members. However, so far, the security community has focused on privacy risks stemming from genomic data, largely overlooking the manifold interdependencies between other biomedical data.
In this paper, we present a generic framework for quantifying the privacy risks in biomedical data, taking into account the various interdependencies between data (i) of different types, (ii) from different individuals, and (iii) at different times. To this end, we rely on a Bayesian network model that allows us to take all aforementioned dependencies into account and run exact probabilistic inference attacks very efficiently. Furthermore, we introduce a generic algorithm for building the Bayesian network, which encompasses expert knowledge for known dependencies, such as genetic inheritance laws, and learns previously unknown dependencies from the data. Then, we conduct a thorough inference risk evaluation with a very rich dataset containing genomic and epigenomic data of mothers and children over multiple years. Besides effective probabilistic inference, we further demonstrate that our Bayesian network model can also serve as a building block for other attacks. We show that, with our framework, an adversary can efficiently identify parent-child relationships based on methylation data with a success rate of 95%.
Identifying Personal DNA Methylation Profiles by Genotype Inference
Since the first whole-genome sequencing, the biomedical research community has made significant steps towards a more precise, predictive and personalized medicine. Genomic data is nowadays widely considered privacy-sensitive and consequently protected by strict regulations and released only after careful consideration. Various additional types of biomedical data, however, are not shielded by any dedicated legal means and are consequently disseminated much less thoughtfully. This in particular holds true for DNA methylation data, one of the most important and well-understood epigenetic elements influencing human health.
In this paper, we show that, in contrast to the aforementioned belief, releasing one’s DNA methylation data causes privacy issues akin to releasing one’s actual genome. We show that a small subset of methylation regions influenced by genomic variants is already sufficient to infer parts of someone’s genome, and to further map this DNA methylation profile to the corresponding genome. Notably, we show that such re-identification is possible with 97.5% accuracy, relying on a dataset of more than 2,500 genomes, and that we can reject all wrongly matched genomes using an appropriate statistical test. We provide means for countering this threat by proposing a novel cryptographic scheme for privately classifying tumors that enables a privacy-respecting medical diagnosis in a common clinical setting. The scheme relies on a combination of random forests and homomorphic encryption, and it is proven secure in the honest-but-curious model. We evaluate this scheme on real DNA methylation data, and show that we can keep the computational overhead at acceptable values for our application scenario.
Privacy in Epigenetics: Temporal Linkability of MicroRNA Expression Profiles
The decreasing cost of molecular profiling tests, such as DNA sequencing, and the consequent increasing availability of biological data are revolutionizing medicine, but at the same time create novel privacy risks. The research community has already proposed a plethora of methods for protecting genomic data against these risks. However, the privacy risks stemming from epigenetics, which bridges the gap between the genome and our health characteristics, have been largely overlooked so far, even though epigenetic data such as microRNAs (miRNAs) are no less privacy sensitive. This lack of investigation is attributed to the common belief that the inherent temporal variability of miRNAs shields them from being tracked and linked over time.
In this paper, we show that, contrary to this belief, miRNA expression profiles can be successfully tracked over time, despite their variability. Specifically, we show that two blood-based miRNA expression profiles taken with a time difference of one week from the same person can be matched with a success rate of 90%. We furthermore observe that this success rate stays almost constant when the time difference is increased from one week to one year. In order to mitigate the linkability threat, we propose and thoroughly evaluate two countermeasures: (i) hiding a subset of disease-irrelevant miRNA expressions, and (ii) probabilistically sanitizing the miRNA expression profiles. Our experiments show that the second mechanism provides a better trade-off between privacy and disease-prediction accuracy.
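Countermeasure (ii) can be sketched as a simple randomized substitution: each expression value is, with some probability, replaced by a draw from the corresponding population values (the replacement probability and data below are illustrative assumptions, not the paper’s tuned mechanism):

```python
import numpy as np

def sanitize_profile(profile, population, p=0.3, seed=0):
    """With probability p, replace each miRNA expression value by a
    random draw from the corresponding column of the population matrix."""
    rng = np.random.default_rng(seed)
    out = np.asarray(profile, dtype=float).copy()
    mask = rng.random(out.shape) < p          # positions to sanitize
    for j in np.where(mask)[0]:
        out[j] = rng.choice(population[:, j])  # draw from population values
    return out

population = np.arange(20.0).reshape(4, 5)  # 4 individuals x 5 miRNAs
profile = np.full(5, 100.0)                 # the profile to be released
sanitized = sanitize_profile(profile, population, p=0.5)
print(sanitized)
```

Raising p blends the released profile further into the population distribution, trading linkability for disease-prediction accuracy, which is the trade-off the abstract evaluates.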
The continuous decrease in cost of molecular profiling tests is revolutionizing medical research and practice, but it also raises new privacy concerns. One of the first attacks against privacy of biological data, proposed by Homer et al. in 2008, showed that, by knowing parts of the genome of a given individual and summary statistics of a genome-based study, it is possible to detect if this individual participated in the study. Since then, a lot of work has been carried out to further study the theoretical limits and to counter the genome-based membership inference attack. However, genomic data are by no means the only or the most influential biological data threatening personal privacy. For instance, whereas the genome informs us about the risk of developing some diseases in the future, epigenetic biomarkers, such as microRNAs, are directly and deterministically affected by our health condition including most common severe diseases.
In this paper, we show that the membership inference attack also threatens the privacy of individuals contributing their microRNA expressions to scientific studies. Our results on real and public microRNA expression data demonstrate that disease-specific datasets are especially prone to membership detection, offering a true-positive rate of up to 77% at a false-negative rate of less than 1%. We present two attacks: one relying on the L1 distance and the other based on the likelihood-ratio test. We show that the likelihood-ratio test provides the highest adversarial success, and we derive a theoretical limit on this success. In order to mitigate the membership inference, we propose and evaluate both a differentially private mechanism and a hiding mechanism. We also consider two types of adversarial prior knowledge for the differentially private mechanism and show that, for relatively large datasets, this mechanism can protect the privacy of participants in miRNA-based studies against strong adversaries without degrading the data utility too much. Based on our findings and given the current number of miRNAs, we recommend releasing only summary statistics of datasets containing at least a couple of hundred individuals.
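Schematically, the two membership statistics can be sketched as follows (a toy version using per-miRNA Gaussian likelihoods and synthetic data; the paper’s exact formulation, calibration, and theoretical limit are not reproduced here):

```python
import numpy as np

def l1_statistic(x, pool_mean, ref_mean):
    """Membership score: is profile x closer (in L1) to the study pool's
    summary statistics than to a reference population's?"""
    return np.sum(np.abs(x - ref_mean)) - np.sum(np.abs(x - pool_mean))

def lrt_statistic(x, pool_mean, ref_mean, sigma=1.0):
    """Log-likelihood ratio under per-miRNA Gaussians with a shared
    (assumed) sigma; normalizing constants cancel in the ratio."""
    log_p_pool = -((x - pool_mean) ** 2) / (2 * sigma ** 2)
    log_p_ref = -((x - ref_mean) ** 2) / (2 * sigma ** 2)
    return np.sum(log_p_pool - log_p_ref)

rng = np.random.default_rng(1)
pool_mean = np.full(50, 0.5)                          # study summary stats
ref_mean = np.zeros(50)                               # reference population
member = pool_mean + 0.1 * rng.standard_normal(50)    # drawn near the pool
outsider = ref_mean + 0.1 * rng.standard_normal(50)
print(lrt_statistic(member, pool_mean, ref_mean) >
      lrt_statistic(outsider, pool_mean, ref_mean))   # True
```

In both statistics, larger values suggest membership; the adversary fixes a decision threshold on held-out data, which is where the true-positive/false-negative trade-off reported above comes from.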
On Epigenomic Privacy: Tracking Personal MicroRNA Expression Profiles over Time
The dramatically decreasing costs of DNA sequencing have led more than a million people to date to have their genotypes sequenced. Moreover, these individuals increasingly make their genomic data publicly available, thereby creating unique privacy threats not only for themselves, but also for their relatives because of their DNA similarities. More generally, an entity that gains access to a significant fraction of sequenced genotypes from a given population might be able to infer even the genomes of unsequenced individuals by relying on the available data.
In this paper, we propose a simulation-based model for quantifying the impact of continuously sequencing and publicizing personal genomic data on a population’s genomic privacy. Our simulation probabilistically models data sharing by individuals and additionally takes into account the influence on genomic privacy of geopolitical events, such as migration, and sociological trends, such as interracial marriage. As an example, we instantiate our simulation with a sample population of 1,000 individuals and evaluate the evolution of privacy under different settings, over either thousands of genomic variants or a subset of variants influencing the phenotype. Our findings notably demonstrate that an increasing sharing rate of genomic data in the future entails a substantial negative effect on the privacy of all older generations. Moreover, we find that mixed populations, due to their large genomic diversity, face a less severe erosion of genomic privacy over time than more homogeneous populations. However, even when no data is shared, the genomic privacy averaged over a large number of variants is already very low, since mere population allele frequencies already reveal a lot of information about the values of the genomic variants. By focusing on a subset of sensitive variants, we observe a higher genetic diversity in the population. Thus, genomic-data sharing can be much more detrimental to the privacy of the most sensitive variants.
From Zoos to Safaris – From Closed-World Enforcement to Open-World Assessment of Privacy