Python Notes4: Extracting small files
Extracting into small files
Before tweets are extracted into small files, the workflow reads the original database of tweets from the Apache server and then filters the tweets with heuristics: movie-irrelevant tweets are deleted by a method customized for each movie.
The extraction step then selects the useful metadata and copies only the needed fields to new files, using a Python module. It writes the JSON-format data into small movie-by-week files named Movie-title-json.0 through Movie-title-json.7, where title is the movie name, suffix 0 is the pre-release week, and suffixes 1 through 7 are the first through seventh weeks after release. Overall, the extraction reduces the data from 1.4TB to 177GB.
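The movie-by-week split described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the module actually used: it assumes the filtered tweets are stored one JSON object per line, that each tweet has a created_at field in Twitter's usual timestamp format, and that the pre-release week is the seven days before the release date. The names KEEP_FIELDS, week_index and extract, and the fields kept, are hypothetical.

```python
import json
from datetime import date, datetime
from pathlib import Path

# Hypothetical subset of fields to keep; the notes do not list the real ones.
KEEP_FIELDS = ("id", "created_at", "text")

# Assumed timestamp layout, e.g. "Fri Jan 01 12:00:00 +0000 2010".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def week_index(tweet_day, release_day):
    """Return 0 for the pre-release week, 1-7 for release weeks, else None."""
    days = (tweet_day - release_day).days
    if -7 <= days < 49:
        # Floor division maps -7..-1 to week 0, 0..6 to week 1, and so on.
        return days // 7 + 1
    return None


def extract(source_path, out_dir, title, release_day):
    """Copy the kept fields of each tweet into Movie-<title>-json.<week>."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = {}  # week number -> open file handle
    try:
        with open(source_path, encoding="utf-8") as source:
            for line in source:
                line = line.strip()
                if not line:
                    continue
                tweet = json.loads(line)
                created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
                week = week_index(created.date(), release_day)
                if week is None:
                    continue  # outside the eight weeks of interest
                if week not in outputs:
                    name = f"Movie-{title}-json.{week}"
                    outputs[week] = open(out_dir / name, "w", encoding="utf-8")
                small = {key: tweet[key] for key in KEEP_FIELDS if key in tweet}
                outputs[week].write(json.dumps(small) + "\n")
    finally:
        for handle in outputs.values():
            handle.close()
    return sorted(outputs)  # week numbers that received at least one tweet
```

Calling extract("filtered.json", "small", "Avatar", date(2009, 12, 18)) would, under these assumptions, create files such as small/Movie-Avatar-json.0; the input file name and output directory here are placeholders. Dropping unused fields and out-of-range tweets is what shrinks the data, so the real size reduction depends on which fields are kept.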