Python note3: Stream Tweets Collection

jdzoomer04
Sep 11, 2017

As the last chapter explained, whichever method we use, search or streaming, the first thing we need to do is get a Twitter API key. To do that, we visit dev.twitter.com/apps, log in to a Twitter account, and create a Twitter application there.

So, what is an API?

We can make the concept clearer here.

As Wikipedia explains, an Application Programming Interface (API) is a set of subroutine definitions, protocols, and tools for building application software. An API may be for a web-based system, operating system, database system, computer hardware or software library.

When we data-mine tweets, we use Python to connect to the Twitter API. In practice this is a web-based API that sits in front of Twitter's data.

To collect all the messy data of one complete tweet, we can use the JSON format (JavaScript Object Notation). It is a very common data format used for asynchronous browser–server communication, as a replacement for XML. JSON was derived from JavaScript, but as of 2017 many programming languages include code to generate and parse JSON-format data. The official Internet media type for JSON is application/json, and JSON filenames use the extension .json.
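To make this concrete, here is a small sketch of parsing one tweet-like JSON string with Python's built-in json module. The payload is hand-written and heavily simplified for illustration; a real tweet carries many more fields.

```python
import json

# A hand-written, simplified stand-in for one tweet (not real Twitter output).
raw_tweet = (
    '{"id": 1, "text": "Just watched Dunkirk, wow!", '
    '"user": {"screen_name": "example_user"}}'
)

# JSON text -> Python dict
tweet = json.loads(raw_tweet)
print(tweet["text"])
print(tweet["user"]["screen_name"])

# Python dict -> JSON text, e.g. for saving one tweet per line in a .json file
round_trip = json.dumps(tweet, sort_keys=True)
print(round_trip)
```

Nested objects such as "user" simply become nested dictionaries, so we can reach into them with ordinary key lookups.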

When collecting stream data, we first need to connect Python, run from the terminal, to the Twitter API. Then we import json to parse the tweets and argparse to read command-line options: parser = argparse.ArgumentParser(). We need start and end points for the stream so that we know where to stop collecting tweets for counting.
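One way to pass the start and end points is as command-line options. The sketch below is only an assumption about how that could look: the option names --start and --end, the time format, and the helper in_window are made up for illustration and are not part of any Twitter library.

```python
import argparse
from datetime import datetime


def parse_time(value):
    # Hypothetical time format: "2017-09-11 09:00"
    return datetime.strptime(value, "%Y-%m-%d %H:%M")


def build_parser():
    parser = argparse.ArgumentParser(
        description="Collect streamed tweets inside a time window."
    )
    parser.add_argument("--start", type=parse_time, required=True,
                        help="when to start counting tweets")
    parser.add_argument("--end", type=parse_time, required=True,
                        help="when to stop counting tweets")
    return parser


def in_window(moment, start, end):
    # A tweet counts if it arrives at or after start and before end.
    return start <= moment < end


# Normally parse_args() reads sys.argv; a list is passed here so the
# example runs on its own.
args = build_parser().parse_args(
    ["--start", "2017-09-11 09:00", "--end", "2017-09-11 10:00"]
)
print(args.start, args.end)
```

From the terminal this would be run as something like: python collect.py --start "2017-09-11 09:00" --end "2017-09-11 10:00" (collect.py is a placeholder filename).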

The Python code sends a request from the user to an HTTP server; the server pulls the processed result from the data store and returns a view to the user. The server then opens a streaming connection with Twitter.

After Twitter accepts the connection, the server receives streamed tweets, and processes and stores the result whenever a tweet containing words from a movie name occurs. Those tweets are automatically shown and stored in the database in real time.
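The receive, process and store loop can be sketched without a live connection. In the example below the stream is faked with a list of JSON strings, the database is an in-memory SQLite one, and the movie names, table layout and function names are all assumptions chosen for illustration, not the setup actually used here.

```python
import json
import sqlite3

# Hypothetical movie names to watch for (lower-case for matching).
MOVIE_NAMES = ["dunkirk", "wonder woman"]


def matching_movies(text, movie_names):
    # Return every movie name that appears somewhere in the tweet text.
    lowered = text.lower()
    return [name for name in movie_names if name in lowered]


def store_matching_tweets(lines, conn, movie_names):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets "
        "(id INTEGER PRIMARY KEY, movie TEXT, text TEXT)"
    )
    stored = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines in the stream
        tweet = json.loads(line)
        text = tweet.get("text", "")
        # One row per (tweet, movie) match.
        for movie in matching_movies(text, movie_names):
            conn.execute(
                "INSERT INTO tweets (movie, text) VALUES (?, ?)",
                (movie, text),
            )
            stored += 1
    conn.commit()
    return stored


# A fake stream standing in for what the server would receive.
fake_stream = [
    '{"text": "Dunkirk was intense"}',
    "",
    '{"text": "Lunch time"}',
    '{"text": "Wonder Woman and Dunkirk double bill"}',
]

conn = sqlite3.connect(":memory:")
stored = store_matching_tweets(fake_stream, conn, MOVIE_NAMES)
print(stored)
```

With a real stream, the list would be replaced by lines arriving from the open connection; the matching and storing steps stay the same.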

Filtering tweets by Heuristics

The RP stops a tweet when it detects words from a movie name in it, but movie names are not always unique words, so the matches are mixed with large numbers of rubbish tweets that have nothing to do with the movies. For this reason, filtering methods have been designed according to the different movie names.
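As a rough idea of what such a heuristic could look like (this is an illustrative guess, not the filter actually designed for the project), the sketch below treats some titles as ambiguous and only accepts them when a movie-related context word also appears in the tweet. The title list and context words are invented for the example.

```python
import re

# Hypothetical lists, chosen only for illustration.
CONTEXT_WORDS = {"movie", "film", "cinema", "watched", "trailer"}
AMBIGUOUS_TITLES = {"it", "mother"}


def mentions_movie(text, title):
    lowered = text.lower()
    title = title.lower()
    # The title must appear as a whole word or phrase, not inside another word.
    if not re.search(r"\b" + re.escape(title) + r"\b", lowered):
        return False
    # Distinctive titles are accepted on the name alone.
    if title not in AMBIGUOUS_TITLES:
        return True
    # Ambiguous titles also need at least one movie-related word nearby.
    words = set(re.findall(r"[a-z']+", lowered))
    return bool(words & CONTEXT_WORDS)


print(mentions_movie("Dunkirk was great", "Dunkirk"))
print(mentions_movie("I love it when it rains", "It"))
print(mentions_movie("Watched It at the cinema", "It"))
```

A distinctive name like "Dunkirk" passes on its own, while a common word like "It" needs extra evidence, which is the sense in which the filter depends on the movie name.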