Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection


Abstract

Social platforms such as Reddit host networks of communities built around shared interests, with an abundance of posts and comments from which users’ Personal Information Identifiers (PIIs) can be inferred. While such self-disclosures can lead to rewarding social interactions, they pose privacy risks and the threat of online harms. Research into the identification and retrieval of such risky self-disclosures of PIIs is hampered by the lack of open-source labeled datasets. Important hindrances to sharing high-quality labelled data include high annotation costs and the privacy risks associated with releasing datasets containing self-disclosive text, especially when the users include vulnerable populations.
To foster reproducible research into PII-revealing text detection, we develop a novel methodology to create synthetic equivalents of PII-revealing data that can be safely shared. Our contributions include creating a taxonomy of 19 PII-revealing categories for vulnerable populations and the creation and release of a synthetic PII-labeled multi-text-span dataset generated from 3 text generation Large Language Models (LLMs), Llama2-7B, Llama3-8B and zephyr-7b-beta, with sequential instruction prompting to resemble the original Reddit posts. The utility of our methodology to generate this synthetic dataset is evaluated with three metrics: First, we require reproducibility equivalence, i.e., results from training a model on the synthetic data should be comparable to those obtained by training the same models on the original posts. Second, we require that the synthetic data be unlinkable to the original users, through common mechanisms such as Google Search. Third, we wish to ensure that the synthetic data be indistinguishable from the original, i.e., trained humans should not be able to tell them apart. We release our dataset and code at https://netsys.surrey.ac.uk/datasets/synthetic-self-disclosure/ to foster reproducible research into PII privacy risks in online social media.
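To illustrate the pipeline described above, the sketch below shows one way sequential instruction prompting could be implemented with the Hugging Face transformers library. The prompt wording, the two-stage split, and the decoding settings are illustrative assumptions, not the exact configuration from the paper; the authoritative implementation is in the code release linked below.

    # Minimal sketch of two-stage sequential instruction prompting for
    # synthetic post generation. Prompt wording, staging and decoding
    # settings are illustrative assumptions, not the paper's exact setup.
    from transformers import pipeline

    generator = pipeline(
        "text-generation",
        model="HuggingFaceH4/zephyr-7b-beta",  # or a Llama2-7B / Llama3-8B checkpoint
        device_map="auto",
    )

    def generate_synthetic_post(original_post, pii_categories):
        # Stage 1: paraphrase the post so its surface form diverges
        # from the original (supports unlinkability via web search).
        draft = generator(
            "Paraphrase this Reddit post, keeping its tone and topic:\n"
            + original_post,
            max_new_tokens=512,
            return_full_text=False,
        )[0]["generated_text"]

        # Stage 2: swap each disclosed personal detail for a fictional
        # but realistic equivalent of the same PII category, so the
        # synthetic post still exhibits the labeled disclosure types.
        return generator(
            "Rewrite the post below, replacing any disclosed personal "
            "details (" + ", ".join(pii_categories) + ") with fictional "
            "but realistic equivalents:\n" + draft,
            max_new_tokens=512,
            return_full_text=False,
        )[0]["generated_text"]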

Dataset and Code

An anonymized version of the dataset and the code used in our paper are available for the research community.

  1. Self-disclosure Dataset: A PII-labeled, multi-text-span synthetic dataset generated using LLMs, labelled with 19 PII-revealing categories.

  2. Code: Implementation details and code for synthetic dataset generation are available on GitHub.

You can find the format of the Self-disclosure dataset here.
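For quick orientation, a record in a multi-text-span, PII-labeled dataset of this kind might look like the following. The field names and values are purely hypothetical; the page linked above documents the actual schema.

    # Hypothetical record illustrating multi-text-span PII annotation;
    # field names are assumptions, see the linked format page for the
    # actual schema.
    example_record = {
        "post_id": "synthetic_00042",
        "generator_model": "Llama3-8B",
        "text": "I'm 16 and just moved to a small town near Leeds.",
        "spans": [  # character offsets into "text"
            {"start": 4, "end": 6, "label": "Age"},         # "16"
            {"start": 43, "end": 48, "label": "Location"},  # "Leeds"
        ],
    }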


Contact Us


If you are interested in using this data, please fill in the form to request access to the dataset.


Dataset Terms and Conditions

  1. You will use the data solely for the purpose of non-profit research or non-profit education.

  2. You will respect the privacy of end users and organizations that may be identified in the data. You will not attempt to reverse engineer, decrypt, de-anonymize, derive or otherwise re-identify any individuals or organizations from the data.

  3. You will not distribute the data beyond your immediate research group.

  4. If you create a publication using our dataset, please cite our paper as follows.

            @article{jangra2025synthetic,
              title={Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection},
              author={Shalini Jangra and Suparna De and Nishanth Sastry and Saeed Fadaei},
              journal={arXiv preprint arXiv:xxxx.xxxxx},
              year={2025}
            }
          






©2025 Netsys, Department of Computer Science, University of Surrey. Guildford GU2 7XH, Surrey, United Kingdom.