Spanish Twitter Dataset for Pride Day (2015–2024)

Ramiro Ortega, María del Mar; Hassan, Samer

doi:10.5281/zenodo.15639492

Published June 11, 2025 | Version v2

Dataset Restricted

1. Harvard University
2. Universidad Complutense de Madrid

English Version:

Two datasets are published as part of my Bachelor's final thesis on hate speech, titled "Hate Speech on Twitter: Analysis of LGBTIQ-phobia Before and After Elon Musk":

Colectivo.csv: This dataset contains 653,000 tweets in Spanish related to the LGBTIQ+ community, collected using specific keywords. The tweets correspond to each June 28th of every year from 2015 to 2024.
Aleatorio.csv: This dataset includes 395,000 random tweets in Spanish, obtained through a selection of keywords. The tweets represent a 6-minute sample from every hour, corresponding to each June 28th from 2015 to 2024.
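The sampling scheme described for Aleatorio.csv (6 minutes out of every hour, on each June 28th from 2015 to 2024) can be sketched as a set of time windows. This is only an illustration: the record does not say which 6 minutes of each hour were sampled or which time zone was used, so MINUTE_OFFSET = 0 and UTC are assumptions, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

YEARS = range(2015, 2025)        # 2015 through 2024, as stated in the description
WINDOW = timedelta(minutes=6)    # sample length per hour, as stated in the description
MINUTE_OFFSET = 0                # ASSUMPTION: each window starts at minute 0 of the hour


def sample_windows():
    """Return (start, end) pairs, one per hour of each June 28th (UTC assumed)."""
    windows = []
    for year in YEARS:
        for hour in range(24):
            start = datetime(year, 6, 28, hour, MINUTE_OFFSET, tzinfo=timezone.utc)
            windows.append((start, start + WINDOW))
    return windows


def in_sample(ts, windows):
    """True if a timezone-aware timestamp falls inside any window (end exclusive)."""
    return any(start <= ts < end for start, end in windows)


windows = sample_windows()
total_minutes = sum((end - start).total_seconds() for start, end in windows) / 60
print(len(windows), total_minutes)
```

Under these assumptions there are 10 × 24 = 240 windows covering 1,440 minutes in total, that is, the equivalent of one full day of tweets spread across ten years.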

Both datasets aim to provide a detailed view of interactions on Twitter on the specified days.

The columns include: id, createdAt, source, lang, retweetCount, replyCount, likeCount, quoteCount, viewCount, bookmarkCount, isReply, conversationId, author_verified, author_blue_verified, author_followers, author_following, author_tweets, author_createdAt, hashtags, author_isAutomated, author_fastFollowersCount, author_favouritesCount, texto_analisis, toxicity, severe_toxicity, identity_attack, insult, profanity, threat.

The 'texto_analisis' column contains the content of the tweet, with all user mentions removed to comply with privacy regulations such as the GDPR.

The 'toxicity', 'severe_toxicity', 'identity_attack', 'insult', 'profanity', and 'threat' columns have values ranging from 0 to 1, where 0 indicates the attribute is not present and 1 indicates it is strongly present.

The 'createdAt' column represents the tweet's publication date.
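As a rough illustration of how these columns might be used, the sketch below strips user mentions from a text and computes the share of rows whose score reaches a threshold. It is not the authors' processing code: the mention pattern, the helper names, the 0.5 threshold, and the two-row sample are all invented for the example, and the loader assumes every score cell is populated.

```python
import csv
import io
import re

# Score columns named in the dataset description, each in the range 0 to 1.
SCORE_COLUMNS = [
    "toxicity", "severe_toxicity", "identity_attack",
    "insult", "profanity", "threat",
]

# ASSUMPTION: a simple approximation of Twitter handles, not the authors' rule.
MENTION_RE = re.compile(r"@\w+")


def strip_mentions(text):
    """Remove @handles and collapse the whitespace left behind."""
    return " ".join(MENTION_RE.sub("", text).split())


def load_rows(handle):
    """Read CSV rows and convert the six score columns to floats."""
    rows = []
    for row in csv.DictReader(handle):
        for col in SCORE_COLUMNS:
            row[col] = float(row[col])
        rows.append(row)
    return rows


def share_above(rows, column, threshold):
    """Fraction of rows whose score in `column` is at least `threshold`."""
    if not rows:
        return 0.0
    return sum(r[column] >= threshold for r in rows) / len(rows)


# Made-up sample containing only a subset of the documented columns.
SAMPLE = """id,texto_analisis,toxicity,severe_toxicity,identity_attack,insult,profanity,threat
1,Feliz Orgullo a todo el mundo,0.02,0.00,0.01,0.01,0.01,0.00
2,texto de ejemplo inventado,0.81,0.30,0.65,0.70,0.40,0.10
"""

rows = load_rows(io.StringIO(SAMPLE))
print(share_above(rows, "toxicity", 0.5))
print(strip_mentions("@usuario hola @otro_user qué tal"))
```

With the real files, `io.StringIO(SAMPLE)` would be replaced by an open file handle for Colectivo.csv or Aleatorio.csv. The threshold is a free parameter; 0.5 is used here only for illustration.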

For further details, you can find the code for processing and analysis in the project's GitHub repository.

Acknowledgements

We would like to acknowledge the use of tools and support provided by twitterapi.io for data extraction, as well as the Perspective API, which played a crucial role in analyzing tweet toxicity. These resources were indispensable for the successful completion of this project.


Files

Restricted

The record is publicly accessible, but files are restricted.

Metric        All versions   This version
Views         320            111
Downloads     267            113
Data volume   34.4 GB        24.5 GB
