An anonymized version of the dataset and the code used in our paper are available to the research community.
Self-disclosure Dataset: A synthetic, PII-labeled, multi-text-span dataset generated using LLMs. It is labeled with 19 PII-revealing categories.
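As a rough illustration of what a multi-text-span PII annotation can look like, the sketch below builds a single hypothetical record and extracts its labeled spans. The field names (`text`, `spans`, `start`, `end`, `label`), the example sentence, and the category names are assumptions for illustration only; the authoritative schema is given in the dataset format description linked below.

```python
# Hypothetical example of a multi-span PII annotation record.
# Field names and label names are illustrative assumptions, NOT the
# actual schema of the Self-disclosure dataset.
record = {
    "text": "I just turned 30 and moved to Berlin for a new job.",
    "spans": [
        {"start": 14, "end": 16, "label": "AGE"},       # "30"
        {"start": 30, "end": 36, "label": "LOCATION"},  # "Berlin"
    ],
}

def extract_spans(rec):
    """Return (surface_text, label) pairs for each annotated span."""
    return [(rec["text"][s["start"]:s["end"]], s["label"])
            for s in rec["spans"]]

print(extract_spans(record))  # → [('30', 'AGE'), ('Berlin', 'LOCATION')]
```

Character-offset spans like these allow a single post to carry several overlapping or adjacent disclosures, which is why the dataset is described as multi-text-span.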
Code: Implementation details and code for synthetic dataset generation are available on GitHub.
You can find the format of the Self-disclosure dataset here.
If you are interested in using this data, please fill out the form to confirm that:
You will use the data solely for non-profit research or non-profit education.
You will respect the privacy of end users and organizations that may be identified in the data. You will not attempt to reverse engineer, decrypt, de-anonymize, derive, or otherwise re-identify individuals or organizations from the data.
You will not distribute the data beyond your immediate research group.
If you create a publication using our dataset, please cite our paper as follows.
@article{jangra2025synthetic,
  title={Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection},
  author={Jangra, Shalini and De, Suparna and Sastry, Nishanth and Fadaei, Saeed},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2025}
}