#!/usr/bin/env python
# coding: utf-8
# # Clean & Analyze Social Media
# The project began by generating random social media data to simulate post categories and like counts. I used Python's pandas, seaborn, and Matplotlib libraries to clean, explore, and visualize the data. Along the way I resolved data-cleaning challenges to keep the final dataset robust for analysis, and I experimented with seaborn color palettes to improve the plots' visual appeal and legibility.
# In[21]:
# Packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
# In[22]:
import random
# In[23]:
# Define categories
categories = ['Food', 'Travel', 'Fashion', 'Fitness', 'Music', 'Culture', 'Family', 'Health']
# Set the number of entries you want (e.g., n = 500)
n = 500
# Generate random data
data = {
'Date': pd.date_range(start='2021-01-01', periods=n),
'Category': [random.choice(categories) for _ in range(n)],
'Likes': np.random.randint(0, 10000, size=n)
}
# Convert the dictionary to a DataFrame
df = pd.DataFrame(data)
# Display the first few rows of the DataFrame
print(df.head())
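# In[ ]:
# A reproducible variant of the generation step above (a sketch, not from the
# original run): seeding both RNGs, random for Category and NumPy for Likes,
# makes reruns comparable. The seed value 42 is an arbitrary choice; the
# categories and n are redefined here so this cell runs standalone.
import random
import numpy as np
import pandas as pd
random.seed(42)
np.random.seed(42)
categories = ['Food', 'Travel', 'Fashion', 'Fitness', 'Music', 'Culture', 'Family', 'Health']
n = 500
df_seeded = pd.DataFrame({
    'Date': pd.date_range(start='2021-01-01', periods=n),
    'Category': [random.choice(categories) for _ in range(n)],
    'Likes': np.random.randint(0, 10000, size=n)
})
print(df_seeded.head())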
# In[24]:
# DataFrame's structure (number of entries, column names, data types, and memory usage)
print(df.info())
# In[25]:
# Summary of numerical columns (count, mean, standard deviation, minimum, quartiles, and maximum)
print(df.describe())
# In[26]:
# Count the frequency of each category (how many times each category appears in the data)
category_counts = df['Category'].value_counts()
print(category_counts)
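# In[ ]:
# An optional follow-up to value_counts() above: passing normalize=True returns
# each category's share of all posts instead of raw counts. Shown on a tiny
# standalone frame (not the notebook's data) so the expected shares are obvious.
import pandas as pd
demo = pd.DataFrame({'Category': ['Food', 'Food', 'Travel', 'Music']})
shares = demo['Category'].value_counts(normalize=True)
print(shares)  # Food 0.50, Travel 0.25, Music 0.25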
# In[27]:
# Remove rows with any null values
df_cleaned = df.dropna()
# In[28]:
# Remove duplicate rows
df_cleaned = df_cleaned.drop_duplicates()
# In[29]:
# Convert 'Date' column to datetime format
df_cleaned['Date'] = pd.to_datetime(df_cleaned['Date'])
# In[30]:
# Convert 'Likes' column to integer format
df_cleaned['Likes'] = df_cleaned['Likes'].astype(int)
# In[31]:
# Display the cleaned DataFrame's information to confirm changes
print(df_cleaned.info())
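# In[ ]:
# A quick sanity check one could add after the cleaning steps (a sketch using a
# tiny standalone frame, not the notebook's data): assert the invariants those
# steps are meant to guarantee: no nulls, no duplicates, datetime dates,
# integer likes.
import pandas as pd
check = pd.DataFrame({
    'Date': pd.to_datetime(['2021-01-01', '2021-01-02']),
    'Category': ['Food', 'Travel'],
    'Likes': [10, 20]
})
assert check.isna().sum().sum() == 0    # no missing values
assert not check.duplicated().any()     # no duplicate rows
assert pd.api.types.is_datetime64_any_dtype(check['Date'])
assert pd.api.types.is_integer_dtype(check['Likes'])
print('cleaning invariants hold')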
# In[32]:
# Display the first few rows of the cleaned DataFrame
print(df_cleaned.head())
# In[33]:
import seaborn as sns
import matplotlib.pyplot as plt
# In[34]:
# Create a histogram of the 'Likes' column
sns.histplot(df_cleaned['Likes'])
plt.xlabel('Likes')
plt.ylabel('Frequency')
plt.title('Distribution of Likes')
plt.show()
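# In[ ]:
# A variant of the histogram above (illustrative, not from the original
# notebook): an explicit bin count plus a KDE overlay makes the shape of the
# distribution easier to read. bins=20 and the small synthetic standalone
# sample are arbitrary choices.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
demo_likes = np.random.randint(0, 10000, size=200)
ax = sns.histplot(demo_likes, bins=20, kde=True)
ax.set_xlabel('Likes')
ax.set_ylabel('Frequency')
ax.set_title('Distribution of Likes (20 bins with KDE)')
plt.show()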
# In[35]:
# Create a Boxplot with a Custom Color Palette
# Pairing palette with hue (and hiding the redundant legend) avoids the
# palette-without-hue deprecation warning in recent seaborn versions
sns.boxplot(x='Category', y='Likes', data=df_cleaned, hue='Category', palette="Pastel1", legend=False)
plt.xlabel('Category')
plt.ylabel('Likes')
plt.title('Likes Distribution Across Categories')
plt.xticks(rotation=45)
plt.show()
# In[36]:
# Calculate and print the overall mean of 'Likes'
mean_likes = df_cleaned['Likes'].mean()
print("Mean of Likes:", mean_likes)
# In[37]:
# Calculate and print the mean 'Likes' for each 'Category'
mean_likes_by_category = df_cleaned.groupby('Category')['Likes'].mean()
print(mean_likes_by_category)
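# In[ ]:
# To pull the single best-performing category out of the group means above,
# idxmax() returns the label and max() the value. Toy standalone data here for
# illustration, not the notebook's generated data.
import pandas as pd
toy = pd.DataFrame({'Category': ['Food', 'Food', 'Travel'],
                    'Likes': [100, 200, 50]})
toy_means = toy.groupby('Category')['Likes'].mean()
print(toy_means.idxmax(), toy_means.max())  # Food 150.0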
# ### Key Findings ###
# The analysis showed a nearly uniform distribution of likes across posts, which is expected given the random generation. Among categories, 'Culture' and 'Food' posts averaged slightly more likes, while 'Fashion' and 'Health' averaged slightly fewer. With randomly generated data these gaps are noise rather than signal, but on real-world data the same comparison could highlight categories worth focusing content on.
# ### This Project ###
# What sets this project apart is the thorough approach to data quality and the commitment to making data-driven insights accessible through clear and visually appealing plots. My focus on data cleaning and aesthetic customization showcases a readiness to adapt and optimize for real-world analytics tasks.
# ### Improvements ###
# To elevate this analysis further, I would consider generating data that reflects real-world social media patterns, potentially including additional metrics like comments or shares. An interactive dashboard could also offer users an engaging way to explore the data dynamically, providing more flexibility in how insights are derived and applied.
# In[ ]: