PanaceaLab - COVID19 Twitter Dataset Homepage

An Open Resource for the Global Research Community

Due to the relevance of the COVID-19 global pandemic, we are releasing our dataset of tweets related to COVID-19 chatter, acquired from the Twitter stream. Since our first release we have received additional data from our new collaborators, allowing this resource to grow to its current size. Dedicated data gathering started on March 11th, yielding about 3.3 million tweets a day.

The data collected from the stream captures all languages, but the most prevalent are English, Spanish, and French. We release all tweets and retweets in the full_dataset.tsv file (990,198,297 unique tweets), and a cleaned version with no retweets in the full_dataset-clean.tsv file (252,342,227 unique tweets). There are several practical reasons for us to keep the retweets; tracing important tweets and their dissemination is one of them. For NLP tasks we provide the top 1000 frequent terms in frequent_terms.csv, the top 1000 bigrams in frequent_bigrams.csv, and the top 1000 trigrams in frequent_trigrams.csv. Some general per-day statistics are included for both datasets in the statistics-full_dataset.tsv and statistics-full_dataset-clean.tsv files.
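As a rough illustration of working with the TSV files described above, the following sketch counts rows per language code. The column names (tweet_id, lang) and the sample rows are assumptions made for this example and are not confirmed by this page; check the header of the file you download before relying on them.

```python
import csv
import io
from collections import Counter


def count_by_language(tsv_file, lang_column="lang"):
    """Count rows per language code in a tab-separated file with a header row.

    `lang_column` is a hypothetical column name; adjust it to the real header.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(row[lang_column] for row in reader)


# Tiny in-memory sample standing in for full_dataset-clean.tsv.
# The header names and tweet IDs below are made up for illustration.
SAMPLE = "tweet_id\tlang\n1001\ten\n1002\tes\n1003\ten\n1004\tfr\n"

counts = count_by_language(io.StringIO(SAMPLE))
for lang, n in counts.most_common():
    print(lang, n)
```

For a real file, the same function could be called on an open file handle, for example `open("full_dataset-clean.tsv", newline="", encoding="utf-8")`. Because the reader streams row by row, the whole file never has to fit in memory.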

The latest version of the dataset and usage instructions can be found on our GitHub page: https://github.com/thepanacealab/covid19_twitter

If you are going to cite or reuse this dataset, please use:

This dataset is released for non-commercial research purposes only.

This dataset is being maintained by Georgia State University's Panacea Lab. Curators: Juan M. Banda, Ramya Tekumalla and Gerardo Chowell-Puente.

Additional data provided by Guanyu Wang (Missouri School of Journalism, University of Missouri), Jingyuan Yu (Department of Social Psychology, Universitat Autònoma de Barcelona), Tuo Liu (Department of Psychology, Carl von Ossietzky Universität Oldenburg), Yuning Ding (Language Technology Lab, Universität Duisburg-Essen), Katya Artemova (NRU HSE) and Elena Tutubalina (KFU).


Data description

As part of our normal daily data collection from the publicly available Twitter stream, we get around 4.4 million tweets a day. We first searched that collection for Coronavirus-related keywords (coronavirus, 2019nCoV) on February 11th and began analyzing the uptick in their use. The tweets we found (that is, their identifiers) are available from January 1st to March 11th. On March 11th we started collecting tweets specifically, using the following data-driven selection of keywords: COVD19, CoronavirusPandemic, COVID-19, 2019nCoV, CoronaOutbreak, coronavirus, WuhanVirus, covid19, coronaviruspandemic, covid-19, 2019ncov, coronaoutbreak, wuhanvirus. Since then these keywords have been used exclusively to grab tweets from the stream API, yielding around 4.4 million tweets a day and the bulk of the data found in this dataset.
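To make the keyword selection concrete, here is a minimal sketch of a local, case-insensitive substring filter over the keyword list above. It is only an approximation for illustration: the function name and the example texts are hypothetical, and the matching rules actually applied by the Twitter stream API may differ.

```python
# Keyword list copied from the data description above.
KEYWORDS = [
    "COVD19", "CoronavirusPandemic", "COVID-19", "2019nCoV",
    "CoronaOutbreak", "coronavirus", "WuhanVirus", "covid19",
    "coronaviruspandemic", "covid-19", "2019ncov", "coronaoutbreak",
    "wuhanvirus",
]

# Lowercasing collapses the case variants into a smaller set.
NORMALIZED = {keyword.lower() for keyword in KEYWORDS}


def matches_keywords(text):
    """Return True if any keyword occurs in the text, ignoring case."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in NORMALIZED)


# Hypothetical example texts, not taken from the dataset.
print(matches_keywords("Stay safe during the #CoronaOutbreak"))
print(matches_keywords("Nice weather today"))
```

Several entries in the list differ only by letter case, so a case-insensitive filter needs just the distinct lowercase forms.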

Figure 1. Number of TOTAL tweets per day on dataset Version 53

Figure 2. Number of cleaned tweets per day on dataset Version 53

As mentioned before, we release two versions of the available data: non-filtered and with retweets removed. While some applications and questions are better served by the full dataset, others, such as NLP tasks, might prefer the clean dataset so that the counts of the identified n-grams are less inflated. It is vital to understand that we can only share tweet identifiers per Twitter’s terms of usage, so to get the raw tweets from this dataset the user needs to hydrate them. For ease of NLP tasks we are also releasing global counts for the top 1000 frequent terms, top 1000 bigrams, and top 1000 trigrams found in the data up to 8/8.
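The effect of retweets on n-gram counts can be seen in a small sketch. The records, the `is_retweet` field, and the whitespace tokenizer below are assumptions for illustration; they are not the format of the hydrated tweets nor the method used to build the released n-gram files.

```python
from collections import Counter


def bigram_counts(texts):
    """Count adjacent word pairs using a naive lowercase whitespace tokenizer."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Made-up records: one original tweet, two retweets of it, one unrelated tweet.
tweets = [
    {"text": "wash your hands", "is_retweet": False},
    {"text": "wash your hands", "is_retweet": True},
    {"text": "wash your hands", "is_retweet": True},
    {"text": "please wash your hands often", "is_retweet": False},
]

full = bigram_counts(t["text"] for t in tweets)
clean = bigram_counts(t["text"] for t in tweets if not t["is_retweet"])

print("full: ", full[("wash", "your")])
print("clean:", clean[("wash", "your")])
```

In this toy sample the bigram count doubles when retweets are kept, which is the kind of inflation the clean version avoids.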

We will be updating the GitHub repo every two days with the additional days of data we gather, and will be updating release versions every full week.

Data Insights

The following interactive map shows the number of tweets in the dataset that have a Place Location available.

To see the data in a tabular format, look below. Notice that this data field is very sparsely populated in tweets.

Looking at how many tweets have geo-location enabled, we find far fewer of them, as we can see in the following interactive map. NOTE: This map might take a bit to load, as it can be zoomed in.

The radius of each circle represents the number of tweets from a given location. In other words, the larger the circle, the more tweets came from that location.

With regard to the tweet languages available in this dataset, we have created the following plot. Please click on it to see it clearly.

You can also look at the tabular language data below.

Bach-corp has released a very cool network visualization of the top 1000 terms linked with the top 1000 bigrams; see it here.

Have you generated any insights from this data? Feel free to share the link and we will post it here.

How to Contribute?

Do you have additional Twitter data? Do you have paid-level Twitter API access?

Feel free to reach out so we can enhance the dataset or collect more data.

Do you want to analyze the current data in the dataset?

Everybody is free to download, hydrate and analyze the dataset. If you find something cool, send it our way and we will share it on the insights section of this website.

Any other ideas of what to do with this data?

Feel free to contact us to discuss.

Disclaimers

Note that the data is provided as-is. More importantly, due to Twitter’s terms of use, we cannot release the full JSON objects from the tweets, just the tweet identifiers. This is not a major hurdle, as you can hydrate them using different sets of tools. One caveat is that any tweet that was deleted by the Twitter user will no longer be available, so your number of hydrated tweets will vary. We do have all JSON objects from these tweets stored on our servers, ready to share under allowed circumstances.
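Because deleted tweets cannot be hydrated, it can help to measure how many identifiers came back. The sketch below compares the identifiers you requested with those your hydration tool returned. The function name, the report fields, and the sample identifiers are hypothetical, and no particular hydration tool is assumed.

```python
def hydration_report(requested_ids, hydrated_ids):
    """Summarize how many requested tweet identifiers were recovered."""
    requested = set(requested_ids)
    recovered = requested & set(hydrated_ids)
    missing = requested - recovered
    return {
        "requested": len(requested),
        "recovered": len(recovered),
        "missing": sorted(missing),
        # Fraction of requested identifiers that were hydrated.
        "yield": len(recovered) / len(requested) if requested else 0.0,
    }


# Made-up identifiers: four requested, one no longer available.
requested = ["1001", "1002", "1003", "1004"]
hydrated = ["1001", "1003", "1004"]

report = hydration_report(requested, hydrated)
print(report)
```

Identifiers are kept as strings here so that very large tweet IDs are never altered by numeric conversion in downstream tools.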

