This document provides the details needed to access the eRisk 2021 research collection.

Any scientific publication derived from the use of this collection should explicitly refer to the following CLEF 2016 paper:

The eRisk 2021 collection is available for research purposes under proper user agreements.

Data

The collection contains textual interactions (posts or comments) from multiple users (task1: pathological gamblers and non-pathological gamblers; task2: self-harm and non-self-harm users). For each subject, a (usually long) history of writings (posts or comments from a social networking site) is available. This history is stored as an XML file (one per subject) with the following structure:

<INDIVIDUAL>
<ID> ... </ID>
<WRITING>
<TITLE> ...   </TITLE>
<DATE> ... </DATE>
<INFO> ... </INFO>
<TEXT> ...  </TEXT>
</WRITING>
<WRITING>
<TITLE> ... </TITLE>
<DATE> ... </DATE>
<INFO> ... </INFO>
<TEXT> ... </TEXT>
</WRITING>
....
</INDIVIDUAL>

ID: contains the anonymised id of the subject

TITLE: title of the post if available (if it is a comment then TITLE is empty)

DATE: date of the post or comment

INFO: additional info about the writing (source of the post/comment)

TEXT: body of the post or comment
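
As an illustration of the structure above, the following is a minimal sketch of how one subject file could be read with Python's standard library. It is not part of the official collection tooling. The names parse_subject, Writing and SAMPLE are hypothetical, and the values inside SAMPLE are placeholders, not real data. The sketch assumes each file is well-formed XML and keeps DATE as a raw string, since the date format is not specified here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Writing:
    title: str  # empty string when the writing is a comment
    date: str   # kept as a raw string; the date format is not specified here
    info: str   # source of the post/comment
    text: str   # body of the post or comment


def parse_subject(xml_text: str) -> Tuple[str, List[Writing]]:
    """Parse one <INDIVIDUAL> document into (subject_id, writings)."""
    root = ET.fromstring(xml_text)

    def field(node: ET.Element, tag: str) -> str:
        # findtext returns None when the tag is missing; normalise to "".
        return (node.findtext(tag) or "").strip()

    writings = [
        Writing(
            title=field(w, "TITLE"),
            date=field(w, "DATE"),
            info=field(w, "INFO"),
            text=field(w, "TEXT"),
        )
        for w in root.findall("WRITING")
    ]
    return field(root, "ID"), writings


# Placeholder document that follows the structure shown above (not real data).
SAMPLE = """<INDIVIDUAL>
<ID> subject0000 </ID>
<WRITING>
<TITLE> </TITLE>
<DATE> placeholder-date </DATE>
<INFO> placeholder-source </INFO>
<TEXT> placeholder comment body </TEXT>
</WRITING>
</INDIVIDUAL>"""

if __name__ == "__main__":
    subject_id, writings = parse_subject(SAMPLE)
    print(subject_id, len(writings))
```

To read a file from disk, the same function could be called on the file contents, for example parse_subject(open(path, encoding="utf-8").read()), where path is whichever subject file is being processed. If a file turns out not to be strictly well-formed, ET.fromstring raises ParseError and the text would need cleaning first.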
