Our goal in the earliest stage of the project is to understand as much as we can about the data: which data sources are available, how much data is being produced, how it is captured and transmitted (with what latencies and over what channels), how long it stays available, how secure it is, how accurate it is, and so on. In our case, we need the following types of data:
Once the data scientists have built a basic understanding of the data, they can begin formulating hypotheses about the insights that might be mined from it and about the approach they might use to gain those insights.
The first task is to get the stream of tweets related to a specific movie. We will use the filtering capability of the Twitter Streaming API. Every tweet containing words similar to the movie name is considered movie-related. For example, for the movie “Lights Out” both texts will match. Examples of the data we’ll be working with quickly reveal quality issues we’ll have to deal with:
The data received with every tweet looks like this once captured in JSON format:
{
  "created_at" : "Fri Jul 22 21:34:48 +0000 2016",
  "id" : 756603425161261057,
  "id_str" : "756603425161261057",
  "text" : "i hope light's out is worth my time. \ud83d\ude34",
  "source" : "\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e",
  "truncated" : false,
  "in_reply_to_status_id" : null,
  "in_reply_to_status_id_str" : null,
  "in_reply_to_user_id" : null,
  "in_reply_to_user_id_str" : null,
  "in_reply_to_screen_name" : null,
  "user" : {
    "id" : 2989278494,
    "id_str" : "2989278494",
    "name" : "- q\u03c5een\u0455\u043day \u2728",
    "screen_name" : "yungbarbiex0",
    "location" : "Queens, NY",
    "url" : "http:\/\/Instagram.com\/xobabyshay",
    "description" : "\u03b9'\u043c \u0442\u043da\u0442 \u0432\u03b9\u0442c\u043d yo\u03c5 \u043doe\u0455 \u043da\u0442e. $ | \u2651\ufe0f |",
    "protected" : false,
    "verified" : false,
    "followers_count" : 1226,
    "friends_count" : 551,
    "listed_count" : 7,
    "favourites_count" : 26466,
    "statuses_count" : 40548,
    "created_at" : "Mon Jan 19 04:01:54 +0000 2015",
    "utc_offset" : null,
    "time_zone" : null,
    "geo_enabled" : true,
    "lang" : "en",
    "contributors_enabled" : false,
    "is_translator" : false,
    "profile_background_color" : "C0DEED",
    "profile_background_image_url" : "http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png",
    "profile_background_image_url_https" : "https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png",
    "profile_background_tile" : true,
    "profile_link_color" : "DD2E44",
    "profile_sidebar_border_color" : "000000",
    "profile_sidebar_fill_color" : "000000",
    "profile_text_color" : "000000",
    "profile_use_background_image" : true,
    "profile_image_url" : "http:\/\/pbs.twimg.com\/profile_images\/755988257163317248\/SRdOYbJA_normal.jpg",
    "profile_image_url_https" : "https:\/\/pbs.twimg.com\/profile_images\/755988257163317248\/SRdOYbJA_normal.jpg",
    "profile_banner_url" : "https:\/\/pbs.twimg.com\/profile_banners\/2989278494\/1468891291",
    "default_profile" : false,
    "default_profile_image" : false,
    "following" : null,
    "follow_request_sent" : null,
    "notifications" : null
  },
  "geo" : null,
  "coordinates" : null,
  "place" : null,
  "contributors" : null,
  "is_quote_status" : false,
  "retweet_count" : 0,
  "favorite_count" : 0,
  "entities" : {
    "hashtags" : [],
    "urls" : [],
    "user_mentions" : [],
    "symbols" : []
  },
  "favorited" : false,
  "retweeted" : false,
  "filter_level" : "low",
  "lang" : "en",
  "timestamp_ms" : "1469223288227"
}
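To make the record above more concrete, here is a minimal Python sketch that parses an abbreviated copy of it, pulls out a few fields, and checks whether the text mentions the movie. The helper names (`normalize`, `mentions_movie`, `extract_fields`) and the normalization rule are assumptions made for illustration. They are not the matching logic of the Twitter Streaming API, which does its filtering on the server side.

```python
import json
import re

# Abbreviated copy of the tweet above; only a handful of fields are kept.
RAW_TWEET = r'''
{
  "created_at": "Fri Jul 22 21:34:48 +0000 2016",
  "id_str": "756603425161261057",
  "text": "i hope light's out is worth my time. \ud83d\ude34",
  "user": {
    "screen_name": "yungbarbiex0",
    "location": "Queens, NY",
    "followers_count": 1226
  },
  "coordinates": null,
  "place": null,
  "lang": "en"
}
'''


def normalize(text):
    """Lowercase, drop apostrophes, and collapse everything else
    that is not a letter or digit into single spaces."""
    text = text.lower().replace("'", "").replace("\u2019", "")
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text).split())


def mentions_movie(tweet_text, movie_name):
    """Whole-word phrase match on the normalized strings (an assumed rule)."""
    return f" {normalize(movie_name)} " in f" {normalize(tweet_text)} "


def extract_fields(tweet):
    """Keep only the fields this sketch cares about."""
    user = tweet.get("user") or {}
    return {
        "text": tweet.get("text", ""),
        "location": user.get("location"),
        "followers_count": user.get("followers_count", 0),
        "has_coordinates": tweet.get("coordinates") is not None,
    }


tweet = json.loads(RAW_TWEET)
fields = extract_fields(tweet)
matched = mentions_movie(fields["text"], "Lights Out")
print(fields["location"], fields["followers_count"], matched)
```

Dropping the apostrophe before matching is what lets “light's out” count as a mention of “Lights Out”. The whole-word check keeps an unrelated phrase such as “flights outside” from matching.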
[Screenshot of the tweet]
What potentially valuable data can we see here?
What are the potential issues and challenges with the data?
The most important data, naturally, is the “text” field, which contains the tweet itself. There is also location information in the “location”, “coordinates” and “place” fields. Information reflecting the tweeter’s social reach, such as “followers_count”, could also be interesting. Let’s take a quick look at the distribution of followers, based on a sample of about 220K tweets collected for several movies during July 22-27, 2016.
Quantiles of the “followers number”
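As a rough illustration of how such quantiles might be computed, here is a minimal sketch using Python's standard library. The follower counts are made-up values for demonstration only, not the collected sample, and `follower_percentiles` is a hypothetical helper name.

```python
from statistics import mean, quantiles

# Hypothetical follower counts, for illustration only.
followers = [0, 12, 35, 80, 150, 240, 410, 551, 900, 1226, 3100, 15000, 250000]


def follower_percentiles(counts, percents=(25, 50, 75, 90, 99)):
    """Return selected percentiles of a list of follower counts."""
    # n=100 yields 99 cut points; cut point k-1 is the k-th percentile.
    cuts = quantiles(counts, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in percents}


result = follower_percentiles(followers)
for percent, value in result.items():
    print(f"{percent:>2}%: {value:,.0f}")

# A mean far above the median hints at a heavily skewed distribution.
print(f"mean: {mean(followers):,.0f}")
```

Follower counts tend to have a long right tail, so quantiles usually describe them better than the mean alone.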
An ANOVA (ANalysis Of VAriance) test indicates that the mean “followers” number differs significantly between movies.
Analysis of Variance Table
Response: followers_count
              Df     Sum Sq    Mean Sq F value    Pr(>F)
movie          4 1.2078e+12 3.0195e+11  10.464 1.789e-08 ***
Residuals 219302 6.3280e+15 2.8855e+10
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
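The numbers in the table can be cross-checked by hand. Each mean square is the sum of squares divided by its degrees of freedom, and the F value is the ratio of the two mean squares. The short sketch below redoes that arithmetic with the values copied from the table. The variable names are illustrative, and the share of variance computed at the end is an extra figure derived here, not one reported in the original output.

```python
# Values copied from the ANOVA table above.
df_movie, ss_movie = 4, 1.2078e12
df_resid, ss_resid = 219302, 6.3280e15

# Mean square = sum of squares / degrees of freedom.
ms_movie = ss_movie / df_movie
ms_resid = ss_resid / df_resid

# F statistic = between-group mean square / residual mean square.
f_value = ms_movie / ms_resid

# Degrees of freedom also recover the design of the sample.
n_movies = df_movie + 1             # number of movie groups
n_tweets = df_movie + df_resid + 1  # number of observations

# Share of total variance attributed to the movie factor.
variance_share = ss_movie / (ss_movie + ss_resid)

print(f"Mean Sq (movie):     {ms_movie:.4e}")
print(f"Mean Sq (residuals): {ms_resid:.4e}")
print(f"F value:             {f_value:.3f}")
print(f"movies: {n_movies}, tweets: {n_tweets}")
print(f"variance share of movie factor: {variance_share:.6f}")
```

The degrees of freedom imply five movies and 219,307 tweets, which agrees with the sample of about 220K tweets mentioned earlier. With a sample this large, even a small difference between group means can come out as statistically significant, so the variance share is a useful companion to the p-value.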
The next important aspect of data understanding is the amount of data. Working with the Twitter Streaming API, we get several dozen tweets per second for a stream filtered for one movie name.
Once our data science team had looked at enough data samples, they could summarize the initial findings:
All of this gives us insight into what kind of data we have and stimulates our thinking about hypotheses and directions for further data exploration. It also gives us the grounding we need to select the right dictionary, which is the subject of the next blog post.