Harvesting and Analyzing Tweets

Twitter is a fabulous source for information. Whenever something is happening, people around the world start tweeting away. Often they include hashtags, allowing us to selectively search for tweets about a certain event or thing. Many twitter users also engage in conversations, and looking at these conversations allows us to identify leaders and frequent actors.

In this lesson, we will look how to harvest tweets from Twitter using ScraperWiki and how to analyse them using social network analysis and software called Gephi.

What you will need

Harvesting Tweets using ScraperWiki

The first thing we need to do is to get tweets out of Twitter. Getting full access to everything that is posted to Twitter is hard and mainly used by academics and companies building on top of it. Nevertheless, everyone can get a selection of tweets using the Twitter API and searching for specific keywords or users.

There are various tools to get tweets out of Twitter. By far the easiest to use seems to be ScraperWiki.

Walkthrough: Harvesting tweets with ScraperWiki

  1. Sign in to ScraperWiki.

  2. In your “data hub” (the page you get to after signing in), click the big “create a new dataset” field.

    Create a new dataset

    Create a new dataset

  3. You will be presented with multiple options. Select “Search for Tweets”.

    Search for tweets

    Search for tweets

  4. Now Scraperwiki will start up a simple search interface. Enter a search term and go ahead. I’ll search for #ddj:

    Search for #ddj

    Search for #ddj

    You will need to authorize the app that comes up with Twitter, and ScraperWiki will start downloading all tweets it can find.

  5. Check the Also monitor for future tweets checkbox to create a continuous dataset.

    Create a continuous dataset

    Create a continuous dataset

  6. Now you can see the tweets ScraperWiki grabbed via “View in a Table”.

    View in a table

    View in a table

  7. To effectively work with it, download it as a spreadsheet.

Well done—now you’ve downloaded a dataset full of tweets!

Analyzing Languages

Once we have our dataset, let’s do some analysis with it.

First let’s look at what we can find out. Twitter gives us a wide range of information: the date, who’s tweeting, the language the tweet is in, and so on. Doing a quick pivot table on the downloaded dataset allows us to see which languages are the most frequent. We can also find out whether or not a link was included.

You’ll also notice the mention and hashtag columns. These are quite handy, but you’ll soon realize they only include the first mention and hashtag. If we want to do a full analysis, we’ll have to extract more.

Analyzing social networks from tweets

Let’s do a network analysis based on people tweeting and on people and hashtags mentioned in tweets.

This is going to be a two-step approach. Since the Twitter API only gives us the first mention or hashtag, we have to get the full information out and into the right format. For this we’ll use Refine, since it’s great at bringing data into the shape we need.

Walkthrough: Extracting mentions and hashtags using Refine

  1. Load the dataset you just downloaded into Refine using “create project”.

    We’re interested in getting the data into a format that has two columns. The first column is the screen name, and the second is any hashtag or account that is mentioned in a tweet by that user. Gephi understands that format easily.

    First, let’s extract mentions and hashtags from the tweets. Mentions start with @, and hashtags start with #.

  2. Let’s start by lowercasing all the tweets. This helps to unify hashtags and mentions.

    We’ll do so by selecting “edit cells – common transforms – to lowercase” from the column options menu of the text column (the arrow next to the column name).

    To lowercase

    To lowercase

  3. Great! To get all mentions out, we have to do several steps at once.

    First we need to split all the words. Twitter delineates words by anything that is not a letter, number, dash (-), or underscore (_).

    Select “edit columns – add column based on this column” and choose the column mentions. The expression that will split the column’s text into words is value.split(/[^a-z0-9-_@#]/).

    Add column based on column text

    Add column based on column text

  4. Next we need to filter the list that comes out. We want everything that starts with a @ or #.

    To filter the list, we use the function filter(). The filter() function wants your list first, then the name of a variable to assign to each column, and then something that checks whether or not it should be included.

    If we want to filter for each person mention, for example, we can use the expression:

    filter expression

    filter expression

  5. To get any hashtags, we could use i.startsWith("#") instead. But since we want either mentions or hashtags, we’ll use another formula to connect them: or. We’ll write or(i.startsWith(“#”),i.startsWith(“@”)) as a condition.

    filter expression: or

    filter expression: or

  6. One thing remains to be done, and then we’ll have our giant expression written. We’ll need to join the list (so that Refine can handle it) by appending the .join(",") function. This joins the list into a single string of text by inserting commas.



  7. Great—now let’s add this column.

    Let’s remove everything we don’t need anymore and bring the two columns “screen name” and “mentions” into position.

    Re-order columns

    Re-order columns

    Re-order screen_name and mentions

    Re-order *screen_name* and *mentions*

  8. Great, now you’ll have two columns left. For Gephi, we’ll need to have each user-mention pair in a row. Let’s split the column into several rows.

    Do so with “edit cells – split multi-valued cells”, and split by comma (,).

    Split by comma

    Split by comma

  9. Fantastic. Notice how only the first row has the users tweeting. This is not a problem. Use “edit cells – fill down” to add them to all the empty rows.

    Fill down

    Fill down

  10. There is a difference between the screen_names and the mentions: mentions start with @ (for users) and are in lowercase.

    Let’s lowercase the letters using “edit cells – common transforms – to lowercase”, as above.

  11. Then use “edit cells – Transform” to add the @ in front of the name using "@"+value.

  12. Fantastic! You have now formatted the file for Gephi. Download it as CSV using the “Export” button.

Walkthrough: Social network analysis using Gephi

  1. Start Gephi and choose “new project”.

  2. Open the CSV with “file – open”.

    Open the CSV

    Open the CSV

  3. Select “directed” and leave the defaults.

    Choose "Directed"

    Choose “Directed”

  4. Click “OK” to create a graph.

  5. Since Gephi takes all the rows and columns into account, we’ll have to remove the header.

    Change the view to the “data laboratory” view.

    Data Laboratory

    Data Laboratory

  6. Remove the first two nodes (screen_name and mentions): mark them, then right click and select “delete all”.

  7. Change back to the “overview”.

  8. At first, this graph doesn’t look very special, simply because we haven’t applied any layout yet.

    Let’s do so. In the layout window on the left, select “ForceAtlas 2”. This will apply a force-directed layout.

    ForceAtlas 2

    ForceAtlas 2

  9. ForceAtlas is a very simple algorithm. It groups connected nodes closer together.

    Let it run for a while. Note how many nodes stay in the middle. This is because we searched for #ddj and all the tweets contain it. Let’s remove it from our nodes.

    Doing so results in a much nicer layout. (Press “play” and “stop” as you like.)

  10. But how do we know which dot is which? Let’s enable the labels.

    Enable labels

    Enable labels

  11. The labels right now are not clear to read, and we don’t know which labels are important. We can scale the labels by how many connections (mentions, tweets) a label has.

    In the top right, select label size. Then choose “degree” as a parameter.

    Choose a rank parameter

    Choose a rank parameter

  12. Click on “apply” and play with the parameters (minimum and maximum size) as you see fit.

  13. Okay, we still can’t read a thing! Luckily there is a layout called “label adjust”. This layout will move nodes so the labels don’t overlap. Try this for a while.

  14. Now if you use the zoom (hidden in a menu below the graph), you can get a pretty clear picture of who’s important.



  15. But this is not the only thing we can do. We can check for clusters: people and hashtags that belong together.

    Do so by switching to “statistics” on the tab on the right.



  16. Choose Modularity. Now we can color the labels by modularity class. Select the label color on the left.

    Modularity Class

    Modularity Class

  17. Now you can see which hashtags and people are closer together and which are farther apart.

    Other interesting parameters are “centrality” (who is more central in the network, who is less connected, etc.).

  18. When you’re done, you can either export the data or format the graph for exporting in the “preview” tab. This works slightly differently from the “overview”, so you’ll need to play around for a while to create a nice-looking graph.

Further Analysis

Of course, social network analysis is only one of the analyses possible. I’ve outlined some others below.


Using a service like Wordle, remove all the mentions and hashtags and create a wordcloud. I’d use a spreadsheet and a formula like =Concatenate(text column) to create one giant string from all the tweets.

Keyword analysis

Sometimes you know that certain keywords will be present and you just want to check for them. You could, for example, check for occurrences of certain hashtags together (e.g. how often do #ddj tweets mention “dataviz”?).


The spreadsheet contains latitude and longitude for tweets which have a location. You could easily map them using an online service such as CartoDB.

Time-based analysis

Can you find out how keywords and or hashtags change over time? How does the discussion shift?


A more sophisticated analysis is sentiment analysis. For this, you would either specify keywords for which you say “this means happy” or “this means sad”—or employ machine learning and a training set to determine the mood of the tweets. While this is out of range for a simple tutorial, it is a quite powerful form of analysis.

  • mariano_j

    I did not get past step 2 of your walkthrough. Splitting the text column
    to isolate all values does not work for me. While the preview looks fine
    the espression “value.split(/[^a-z0-9-_@#]/)” ultimately produces an
    error (“€œString value not storable”). Please see files attached.

    I tried this with Google Refine 2.5 stable and Open Refine 2.6 beta with
    Firefox 26.0 and Chromium 31.0.1650.63 in Ubuntu 13.10. What was your
    setting? Do you have any idea what might be causing this? Any advice
    would be appreciated.

    • Michael Bauer

      mariano_j: Did you press OK right there? If yes it is because Refine can’t handle lists. You’ll need to go on and do the rest (filtering and joining) until you get a proper text back again.

      • mariano_j

        Thanks, Michael! After having my face rub in it I realized that the sections 3 to 6 of your walkthrough is not a series of steps, one done after another, but rather a documentation of how to build the one necessary expression, that goes like this:


        I found it quite puzzling that OpenRefine shows a preview of something it can actually not provide an output for, see image attached.

        • Hi, thank you for your hint, marioano_j. That was very useful. I am working with German content. Your Regex-term also cut out all Umlaute (ä,ü,ö) and other special latin characters.. So I wrote this instead to get all the hashtags:


  • Kristina_Ra

    how do you apply this concept with the twitter follower data from scraperwiki? i’m very new at sna but i believe the table is missing something (not sure what) that would allow me to graph/map who follows who among twitter followers

  • Ido Schacham

    ScraperWiki doesn’t provide Twitter tools anymore.

Theme by Anders Norén