Wrangle Report

In this project, we are going to wrangle WeRateDogs twitter account data. Wrangling process includes gathering data from multiple resources, assessing data qualities and tidiness issues, cleaning data, then using this cleaned data to create interesting and trustworthy analyses and visualizations.

In gathering data procedure, we brought data from two resources as the following:

Udacity Website
Twitter API

From Udacity server, we download twitter-archive-enhanced.csv manually and import the data by using read_csv function. Then, by using python request library, we download programmatically image prediction data from Udacity server and save data in image_predictions.tsv. Finally, we use Tweepy library to retrieve more additional data, tweet’s retweet and favorite counts. Data from twitter API will be stored in tweet_jason.txt file.

Quality Issues:

twitter archived data:

Missing values in (reply to status id, reply to user id, retweeted status id, retweeted status user id, retweeted status timestamp)
Missing dogs stages (Some tweets don't have any dog stage, All four columns have None value doggo,floofer, pupper, puppo)
Missing dogs names (Some dogs have no name, the dog name value of this tweets is None)
Inaccurate dog names (a, the, an,such,quite, this...)
You only want original ratings, so we don't want the tweets that replied to another tweet.
Errorneous datatypes (tweet_id,in_reply_to_status_id, in_reply_to_user_id,retweeted_status_id, retweeted_status_user_id, timestamp, rating_numerator,rating_denominator)

image prediction data:

Missing data for the dog preed pridiction

additional twitter data:

Some tweets have been deleted. ex. missing API data

Tidiness Issues:

In twitter archived data table, The doggo, floofer, pupper, and puppo should all be in one column
Merge three data sets twitter archived, image prediction and additional twitter data into one dataframe

Once we encounter the data qualities and tidiness issues, cleaning data phase starts. In this stage, we create a copy of each data set, then we take each issue and work to solve it by three steps, Define, Code and Test. We start with tidiness issues and then data quality issues. In wrangle_act.ipynd file, you will notice that missing records issues in both image prediction and additional twitter data sets are solved during our work in Tidiness part. After cleaning data phase complete, we reassess the data again and encounter new three data quality issues which are mentioned below:

The following colunms p1,p2,p3, dog_type and tweet_id have wrong data type
Some tweets don't have dog types (type value is None)
Pictures of some tweets aren't dog pictures (tweets with rate < 8 and false NN predictions)
Timestamp should be separated into year, month, day and time

Finally, we store the new cleaned data in twitter_archive_master.csv in order to use it in creating analysis and visualizations.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
Act Report.pdf		Act Report.pdf
README.md		README.md
image_predictions.tsv		image_predictions.tsv
tweet_jason.txt		tweet_jason.txt
twitter-archive-enhanced.csv		twitter-archive-enhanced.csv
twitter_archive_master.csv		twitter_archive_master.csv
wrangle_act.ipynb		wrangle_act.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Wrangle Report

Quality Issues:

twitter archived data:

image prediction data:

additional twitter data:

Tidiness Issues:

About

Releases

Packages

Languages

shadatahayneh/data-wrangling-project

Folders and files

Latest commit

History

Repository files navigation

Wrangle Report

Quality Issues:

twitter archived data:

image prediction data:

additional twitter data:

Tidiness Issues:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages