Given the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets related to COVID-19 chatter, acquired from the Twitter Stream. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding over 3.3 million tweets a day.
The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (468,169,539 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (115,262,201 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general per-day statistics are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
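To illustrate what the frequent-term, bigram, and trigram files contain, here is a minimal sketch of n-gram counting in Python. It is not the pipeline used to build the released files: the `tokenize` rule, the `top_ngrams` helper, and the sample texts are all assumptions made for this example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word-like tokens.

    This is a deliberately simple, assumed rule; the released files
    may have been produced with different preprocessing.
    """
    return re.findall(r"[a-z0-9#@']+", text.lower())


def top_ngrams(texts, n, k):
    """Return the k most frequent n-grams across an iterable of texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # Zip n shifted copies of the token list to get every window of length n.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


# Made-up sample texts, for illustration only.
sample = [
    "stay home stay safe",
    "please stay home",
    "stay safe everyone",
]

print(top_ngrams(sample, n=2, k=2))
```

Running the same function over the full dataset, where each retweet repeats the original text, would count its n-grams once per retweet, which is why the counts differ between the full and the clean versions.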
The latest version of the dataset and usage instructions can be found on our GitHub page: https://github.com/thepanacealab/covid19_twitter
This dataset is released only for non-commercial research purposes.
Additional data provided by Guanyu Wang (Missouri school of journalism, University of Missouri), Jingyuan Yu (Department of social psychology, Universitat Autònoma de Barcelona), Tuo Liu (Department of psychology, Carl von Ossietzky Universität Oldenburg), Yuning Ding (Language technology lab, Universität Duisburg-Essen), Katya Artemova (NRU HSE) and Elena Tutubalina (KFU)
As part of our normal daily data collection from the publicly available Twitter stream, we get around 4.4 million tweets a day. From that collection we started analyzing the uptick in Coronavirus-related keywords (coronavirus, 2019nCoV) when we first searched for them on February 11th. The tweets we found (that is, their identifiers) are available from January 1st to March 11th. The latter date is when we started collecting tweets specifically using the following data-driven selection of keywords: COVD19, CoronavirusPandemic, COVID-19, 2019nCoV, CoronaOutbreak, coronavirus, WuhanVirus, covid19, coronaviruspandemic, covid-19, 2019ncov, coronaoutbreak, wuhanvirus. Since then, these keywords have been used exclusively to grab tweets from the stream API, yielding around 4.4 million tweets a day and the bulk of the data found in this dataset.
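As a rough illustration of keyword-based selection, the sketch below checks a tweet text against the keyword list above. It is a local approximation only: the stream API applies its own matching rules, which are not reproduced here, and the case-insensitive substring test and the `matched_keywords` helper are assumptions made for this example.

```python
# Keyword list copied from the text above.
KEYWORDS = [
    "COVD19", "CoronavirusPandemic", "COVID-19", "2019nCoV", "CoronaOutbreak",
    "coronavirus", "WuhanVirus", "covid19", "coronaviruspandemic", "covid-19",
    "2019ncov", "coronaoutbreak", "wuhanvirus",
]

# Several entries differ only by letter case, so collapse them first.
UNIQUE_KEYWORDS = sorted({keyword.lower() for keyword in KEYWORDS})


def matched_keywords(text):
    """Return the keywords that occur in the text, ignoring case.

    Assumption: plain substring matching, which is simpler than
    whatever matching the stream API performs.
    """
    lowered = text.lower()
    return [keyword for keyword in UNIQUE_KEYWORDS if keyword in lowered]


# Made-up example texts.
print(matched_keywords("New #COVID19 cases reported today"))
print(matched_keywords("Nice weather today"))
```

Once case is ignored, the thirteen listed keywords reduce to eight distinct strings.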
Figure 1. Number of TOTAL tweets per day on dataset Version 17
Figure 2. Number of cleaned tweets per day on dataset Version 17
As mentioned before, we release two versions of the available data: non-filtered and with retweets removed. While some applications and questions are better served by the full dataset, others, like NLP tasks, might prefer the clean dataset so that the counts of the identified n-grams are less inflated. It is vital to understand that we can only share tweet identifiers per Twitter’s terms of usage, so to get the raw tweets from this dataset the user needs to hydrate them, that is, retrieve the full tweet objects from the Twitter API using the identifiers. For ease of NLP tasks we are also releasing global counts for the top 1000 frequent terms, top 1000 bigrams, and top 1000 trigrams found in the data up to 4/18.
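Before hydrating, the identifiers usually need to be read from the TSV file and grouped into batches. The following is a minimal sketch of that preparation step, not an official tool: the column name `tweet_id`, the helper names, and the sample rows are assumptions, so check the header of the file you download. The default batch size of 100 reflects the limit Twitter's tweet lookup endpoints have accepted per request; verify it against the current API documentation.

```python
import csv
import io


def read_tweet_ids(handle, id_column="tweet_id"):
    """Yield tweet identifiers from a tab-separated file with a header row.

    The column name is an assumption; adjust it to the actual header.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield row[id_column]


def batched(ids, size=100):
    """Group identifiers into lists of at most `size` items."""
    batch = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


# Made-up rows standing in for a few lines of the dataset file.
sample_tsv = io.StringIO(
    "tweet_id\tdate\n"
    "1\t2020-03-11\n"
    "2\t2020-03-11\n"
    "3\t2020-03-12\n"
)

batches = list(batched(read_tweet_ids(sample_tsv), size=2))
print(batches)
```

For a real file, replace the `io.StringIO` sample with `open("full_dataset-clean.tsv", newline="")` and pass each batch to the hydration tool or API client of your choice.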
We will update the GitHub repository every two days with the additional days of data we gather, and will update the release versions every full week.
The following interactive map shows the number of tweets in the dataset that have a Place Location available.
To see the data in tabular format, look below. Notice that this data field is very sparsely populated in tweets.
Looking at how many tweets have geo-location enabled, we find far fewer of them, as we can see in the following interactive map. NOTE: This map might take a bit to load, as it can be zoomed in.
The radius of each circle represents the number of tweets from a given location. In other words, the bigger the circle, the more tweets came from that location.
With regard to the tweet languages available in this dataset, we have created the following plot. Please click on it to see it clearly.
You can also look at the tabular language data below.
Have you generated any insights from this data? Feel free to share the link and we will post it here.
Do you have additional Twitter data? Do you have paid-level Twitter API access?
Do you want to analyze the current data in the dataset?
Any other ideas of what to do with this data?