A Large-Scale English Multi-Label Twitter Dataset for Cyberbullying and Online Abuse Detection

Semiu Salawu, Jo Lumsden, Yulan He

Abstract

In this paper, we introduce a new English Twitter-based dataset for cyberbullying detection and online abuse. Comprising 62,587 tweets, this dataset was sourced from Twitter using specific query terms designed to retrieve tweets with high probabilities of various forms of bullying and offensive content, including insult, trolling, profanity, sarcasm, threat, porn and exclusion. We recruited a pool of 17 annotators to perform fine-grained annotation on the dataset with each tweet annotated by three annotators. All our annotators are high school educated and frequent users of social media. Inter-rater agreement for the dataset as measured by Krippendorff’s Alpha is 0.67. Analysis performed on the dataset confirmed common cyberbullying themes reported by other studies and revealed interesting relationships between the classes. The dataset was used to train a number of transformer-based deep learning models returning impressive results.
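
The abstract reports agreement as Krippendorff's Alpha with three annotators per tweet. As a rough illustration of what that statistic measures, the sketch below computes the nominal-data form of alpha for a single binary label. The abstract does not say how the labels were combined or which distance metric was used, so the function name `krippendorff_alpha_nominal` and the toy ratings are assumptions for illustration, not the authors' procedure or data.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal ratings.

    `units` is a list of lists; each inner list holds the ratings that
    one item (e.g. one tweet) received. Items with fewer than two
    ratings carry no agreement information and are skipped.
    """
    # Coincidence matrix: every ordered pair of ratings within an item,
    # weighted by 1 / (number of ratings on that item - 1).
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and the overall number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more ratings")

    # Observed and expected disagreement (both scaled by n, which cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one category was ever used
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical ratings: 1 = label applies, 0 = it does not.
    toy = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 0]]
    print(round(krippendorff_alpha_nominal(toy), 3))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which is the scale on which the reported 0.67 should be read.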

Anthology ID:
2021.woah-1.16
Volume:
Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)
Month:
August
Year:
2021
Address:
Online
Editors:
Aida Mostafazadeh Davani, Douwe Kiela, Mathias Lambert, Bertie Vidgen, Vinodkumar Prabhakaran, Zeerak Waseem
Venue:
WOAH
Publisher:
Association for Computational Linguistics
Pages:
146–156
URL:
https://aclanthology.org/2021.woah-1.16
DOI:
10.18653/v1/2021.woah-1.16
Bibkey:
salawu-etal-2021-large
Cite (ACL):
Semiu Salawu, Jo Lumsden, and Yulan He. 2021. A Large-Scale English Multi-Label Twitter Dataset for Cyberbullying and Online Abuse Detection. In Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), pages 146–156, Online. Association for Computational Linguistics.
Cite (Informal):
A Large-Scale English Multi-Label Twitter Dataset for Cyberbullying and Online Abuse Detection (Salawu et al., WOAH 2021)
PDF:
https://aclanthology.org/2021.woah-1.16.pdf

Export citation
  • BibTeX
  • MODS XML
  • Endnote
  • Preformatted
@inproceedings{salawu-etal-2021-large,
    title = "A Large-Scale {E}nglish Multi-Label {T}witter Dataset for Cyberbullying and Online Abuse Detection",
    author = "Salawu, Semiu and Lumsden, Jo and He, Yulan",
    editor = "Mostafazadeh Davani, Aida and Kiela, Douwe and Lambert, Mathias and Vidgen, Bertie and Prabhakaran, Vinodkumar and Waseem, Zeerak",
    booktitle = "Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.woah-1.16",
    doi = "10.18653/v1/2021.woah-1.16",
    pages = "146--156",
    abstract = "In this paper, we introduce a new English Twitter-based dataset for cyberbullying detection and online abuse. Comprising 62,587 tweets, this dataset was sourced from Twitter using specific query terms designed to retrieve tweets with high probabilities of various forms of bullying and offensive content, including insult, trolling, profanity, sarcasm, threat, porn and exclusion. We recruited a pool of 17 annotators to perform fine-grained annotation on the dataset with each tweet annotated by three annotators. All our annotators are high school educated and frequent users of social media. Inter-rater agreement for the dataset as measured by Krippendorff{'}s Alpha is 0.67. Analysis performed on the dataset confirmed common cyberbullying themes reported by other studies and revealed interesting relationships between the classes. The dataset was used to train a number of transformer-based deep learning models returning impressive results.",
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="salawu-etal-2021-large">
  <titleInfo>
    <title>A Large-Scale English Multi-Label Twitter Dataset for Cyberbullying and Online Abuse Detection</title>
  </titleInfo>
  <name type="personal">
    <namePart type="given">Semiu</namePart>
    <namePart type="family">Salawu</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart type="given">Jo</namePart>
    <namePart type="family">Lumsden</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart type="given">Yulan</namePart>
    <namePart type="family">He</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <originInfo>
    <dateIssued>2021-08</dateIssued>
  </originInfo>
  <typeOfResource>text</typeOfResource>
  <relatedItem type="host">
    <titleInfo>
      <title>Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Aida</namePart>
      <namePart type="family">Mostafazadeh Davani</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">editor</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Douwe</namePart>
      <namePart type="family">Kiela</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">editor</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Mathias</namePart>
      <namePart type="family">Lambert</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">editor</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Bertie</namePart>
      <namePart type="family">Vidgen</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">editor</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Vinodkumar</namePart>
      <namePart type="family">Prabhakaran</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">editor</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Zeerak</namePart>
      <namePart type="family">Waseem</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">editor</roleTerm>
      </role>
    </name>
    <originInfo>
      <publisher>Association for Computational Linguistics</publisher>
      <place>
        <placeTerm type="text">Online</placeTerm>
      </place>
    </originInfo>
    <genre authority="marcgt">conference publication</genre>
  </relatedItem>
  <abstract>In this paper, we introduce a new English Twitter-based dataset for cyberbullying detection and online abuse. Comprising 62,587 tweets, this dataset was sourced from Twitter using specific query terms designed to retrieve tweets with high probabilities of various forms of bullying and offensive content, including insult, trolling, profanity, sarcasm, threat, porn and exclusion. We recruited a pool of 17 annotators to perform fine-grained annotation on the dataset with each tweet annotated by three annotators. All our annotators are high school educated and frequent users of social media. Inter-rater agreement for the dataset as measured by Krippendorff’s Alpha is 0.67. Analysis performed on the dataset confirmed common cyberbullying themes reported by other studies and revealed interesting relationships between the classes. The dataset was used to train a number of transformer-based deep learning models returning impressive results.</abstract>
  <identifier type="citekey">salawu-etal-2021-large</identifier>
  <identifier type="doi">10.18653/v1/2021.woah-1.16</identifier>
  <location>
    <url>https://aclanthology.org/2021.woah-1.16</url>
  </location>
  <part>
    <date>2021-08</date>
    <extent unit="page">
      <start>146</start>
      <end>156</end>
    </extent>
  </part>
</mods>
</modsCollection>

%0 Conference Proceedings
%T A Large-Scale English Multi-Label Twitter Dataset for Cyberbullying and Online Abuse Detection
%A Salawu, Semiu
%A Lumsden, Jo
%A He, Yulan
%Y Mostafazadeh Davani, Aida
%Y Kiela, Douwe
%Y Lambert, Mathias
%Y Vidgen, Bertie
%Y Prabhakaran, Vinodkumar
%Y Waseem, Zeerak
%S Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)
%D 2021
%8 August
%I Association for Computational Linguistics
%C Online
%F salawu-etal-2021-large
%X In this paper, we introduce a new English Twitter-based dataset for cyberbullying detection and online abuse. Comprising 62,587 tweets, this dataset was sourced from Twitter using specific query terms designed to retrieve tweets with high probabilities of various forms of bullying and offensive content, including insult, trolling, profanity, sarcasm, threat, porn and exclusion. We recruited a pool of 17 annotators to perform fine-grained annotation on the dataset with each tweet annotated by three annotators. All our annotators are high school educated and frequent users of social media. Inter-rater agreement for the dataset as measured by Krippendorff’s Alpha is 0.67. Analysis performed on the dataset confirmed common cyberbullying themes reported by other studies and revealed interesting relationships between the classes. The dataset was used to train a number of transformer-based deep learning models returning impressive results.
%R 10.18653/v1/2021.woah-1.16
%U https://aclanthology.org/2021.woah-1.16
%U https://doi.org/10.18653/v1/2021.woah-1.16
%P 146-156

Markdown (Informal)

[A Large-Scale English Multi-Label Twitter Dataset for Cyberbullying and Online Abuse Detection](https://aclanthology.org/2021.woah-1.16) (Salawu et al., WOAH 2021)
