Retrieve everything you might want from a tweet using the tweepy Python package, and format it into a pandas DataFrame.
Twitter is a fascinating environment for examining different types of users, relationships, and information. It is fertile ground for many kinds of research, mine included.
Past research also means existing datasets, so why bother mining the Twitter data stream yourself? The answer, of course, is that not every shoe fits: an existing dataset might not match your research goals or scope.
In this article, I'll teach you how to create a dataset with at least twice as many features as the datasets and tutorials I found.
No matter what you choose to do with them, allow me to introduce your possibilities. I strongly suggest that you retrieve all the available data for your research in advance, since it might be unavailable in the future (tweets can be deleted).
The gist with the full code presented in this article is attached at the end. The full material for the article, along with a video link, can be found in this GitHub repository.
First things first: make sure you have a Twitter account.
To connect to the Twitter API, you'll need a Twitter developer account and your account credentials. You can access your keys and tokens through Apps and App Details, where you can also regenerate them if needed.
Here are the things that will appear on your screen in the process:
You will need to save your credentials in order to place them in the appropriate place in the code, as you can see in the code screenshot below:
Keep your credentials safe, and please don't share them! They are the access to the Twitter data stream via your account, and while we only retrieve tweets in order to research and analyze them, someone who gets hold of your keys might abuse them.
The second basic step is the installs and imports. The most important package is tweepy, the Python package that connects us to the Twitter data stream with our credentials via the Twitter API. Importing json is also important, since a tweet is structured as a JSON object; pandas helps us parse the JSON into a tabular dataset, which is friendlier to work with. Since the Twitter API is rate-limited, the time package matters too: we use it to insert a fifteen-minute timeout between batches of retrieved tweets.
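As a sketch of that setup (the placeholder strings are of course not real credentials and must be replaced with your own keys and tokens):

```python
# pip install tweepy pandas

import json
import time

import pandas as pd
import tweepy

# Replace the placeholders with the keys and tokens from your
# Twitter developer app (Apps -> App Details).
CONSUMER_KEY = "YOUR_CONSUMER_KEY"
CONSUMER_SECRET = "YOUR_CONSUMER_SECRET"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
ACCESS_TOKEN_SECRET = "YOUR_ACCESS_TOKEN_SECRET"

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# wait_on_rate_limit makes tweepy pause on its own when a limit is hit
api = tweepy.API(auth, wait_on_rate_limit=True)
```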
Here is the syntax for using tweepy.Cursor to connect to the Twitter API:
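A minimal sketch of the call, assuming an authenticated `api` object from the credentials step (note that the search method is named `api.search` in tweepy v3 and `api.search_tweets` in v4; the query string here is only an example):

```python
query = "#python -filter:retweets"  # example query; the filter drops retweets

# tweepy v3 exposes api.search; tweepy v4 renamed it to api.search_tweets
cursor = tweepy.Cursor(
    api.search_tweets,
    q=query,
    tweet_mode="extended",  # ask for the full 280-character text
    lang="en",
).items(100)  # stop after 100 tweets

for tweet in cursor:
    print(tweet.full_text)
```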
tweepy.Cursor receives several arguments within each search; let's view them:
- q — the query: the terms you are searching for
- Filters — what you don't want to retrieve (e.g., appending -filter:retweets to the query)
- tweet_mode — 140 or 280 characters? Pass 'extended' to get the full text
- include_rts — do you want retweets as well?
- lang — what language do you search for?
- items — how many tweets to retrieve
Also, the API has rate limits: approximately 4,000 tweets per 15 minutes, which means our tweet-retrieval function should insert a fifteen-minute timeout between batches using time.sleep().
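One way to do this is a small helper that sleeps between batches; the function and parameter names below are mine, and `fetch_batch` stands for whatever callable wraps your tweepy search:

```python
import time

def retrieve_in_batches(fetch_batch, n_batches, pause_seconds=15 * 60):
    """Call fetch_batch() repeatedly, sleeping between calls so each
    batch stays inside the ~4,000-tweets-per-15-minutes window.
    fetch_batch is any zero-argument callable returning a list of tweets."""
    collected = []
    for i in range(n_batches):
        collected.extend(fetch_batch())
        if i < n_batches - 1:  # no need to sleep after the last batch
            time.sleep(pause_seconds)
    return collected
```

Alternatively, tweepy can handle the waiting for you if you construct the client with `tweepy.API(auth, wait_on_rate_limit=True)`.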
A closer look at an extended tweet
"text": "Can't fit your Tweet into 140 characters? 🤔\n\nWe're trying something new with a small group, and increasing the char… https://t.co/y1rJlHsVB5",
"full_text": "Can't fit your Tweet into 140 characters? 🤔\n\nWe're trying something new with a small group, and increasing the character limit to 280! Excited about the possibilities? Read our blog to find out how it all adds up. 👇\nhttps://t.co/C6hjsB9nbL",
Here is an example tweet taken from my thread:
What would you like to retrieve from this example tweet thread?
- Reply to
What fields would you like to retrieve from a profile?
- Screen name
- Joined at
Below is a list of fields from a tweet object that create our desired data frame (full code at the end of this article):
Pretty gruesome. Even though it's small, you can see little emojis appear a few times in different places; this means that the full text of a tweet can appear in different places in the JSON string. It also means that calling text.append(tweet.full_text), or falling back to tweet.text alone, is not enough. Here is the solution to overcome it:
The full text can be used in all kinds of NLP analyses, such as sentiment analysis or LDA (topic analysis); you can also segment it by the users discussing the topic.
You can retrieve information regarding the tweet itself the following way:
Retrieving information regarding the user can be done by the following:
Retrieving mentions and hashtags data:
In reply to:
Retrieving data about the sides in a conversation allows you to analyze it with graph and SNA (social network analysis) methodologies, using Python packages such as NetworkX.
For geographical analysis, such as plotting tweets based on their location, you can retrieve the location data of a tweet the following way:
Some of the tweets might be sensitive; you might be interested in knowing that Twitter tagged them as such, especially when dealing with sensitive or controversial topics. Here is how to retrieve the “Is it sensitive” tags:
Encoding may create garbage
One of the problems I discovered after retrieving 8GB worth of tweets is that somewhere along the encoding chain, everything became unreadable, full of odd symbols and gibberish-like characters. I suspected that one possible cause was the emojis.
Look at the following example from StackOverflow:
To overcome the encoding problems, I saved emojis as another feature in the data frame. Below is the full code, which includes the demoji Python package for interpreting emojis. More information about demoji and other Python packages that interpret emojis can be found here.
I hope you find this article useful. Here is the full code covered in this article: