- Function for calculating the area of a bounding box (see the sketch below).
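The exact formula used for the `bb_area` column is not spelled out in these notes, so this is only a minimal sketch assuming a spherical-Earth approximation and the `min_lon`/`min_lat`/`max_lon`/`max_lat` columns listed below.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def bounding_box_area(min_lon, min_lat, max_lon, max_lat):
    """Approximate area (km^2) of a lon/lat bounding box on a sphere."""
    lat1, lat2 = np.radians(min_lat), np.radians(max_lat)
    dlon = np.radians(max_lon - min_lon)
    # Area of the spherical slice: R^2 * dlon * (sin(lat2) - sin(lat1))
    return EARTH_RADIUS_KM ** 2 * dlon * (np.sin(lat2) - np.sin(lat1))

# Works element-wise on DataFrame columns as well:
# df["bb_area"] = bounding_box_area(df["min_lon"], df["min_lat"], df["max_lon"], df["max_lat"])
```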
- Columns: `['id', 'created_at', 'coordinates', 'co_lon', 'co_lat', 'geo', 'geo_lat', 'geo_lon', 'user_location', 'place_type', 'place_name', 'place_full_name', 'place_country', 'place_bounding_box', 'pb_avg_lon', 'pb_avg_lat', 'min_lon', 'min_lat', 'max_lon', 'max_lat', 'bb_area', 'lang', 'source', 'text']`
- `user->location`: replace `"\n"` with `" "`; replace `"\""` and `"\'"` with `""`. Since the text sometimes contains commas, it should be wrapped in double quotation marks when saving to CSV (see the sketch below).
- `text`: same treatment as `user->location`.
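A minimal sketch of that field cleaning, mirroring the two bullets above (the helper name is arbitrary):

```python
def clean_field(value):
    """Clean a free-text field (user location or tweet text) before writing it to CSV."""
    if value is None:
        return '""'
    value = value.replace("\n", " ")                 # newlines -> single space
    value = value.replace('"', "").replace("'", "")  # drop quotation marks
    return '"' + value + '"'                         # wrap in double quotes because the text may contain commas
```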
- `df = df.drop_duplicates(["id"])`: delete duplicate tweets based on `"id"`.
- `df.dtypes`: get the data types of all columns.
- `df = df.astype({"created_at": "datetime64[ns]"})`: change the data type to datetime.
- `df['co_lon'] = df['co_lon'].astype(float)`: works if all values of `'co_lon'` are numeric.
- `df.source.value_counts()`: get the total number of tweets per source.
- `df.coordinates.value_counts()`: get the total number of tweets with geo-tags.
- `df = df[df['place_country'] == 'Australia']`: keep only tweets from Australia.
- `df = df[df['lang'] == 'en']`: keep only English tweets.
- `re.sub("[^a-zA-Z]", " ", name)`: replace all non-alphabetic characters with a single space.
- `re.sub(r"\s+", " ", name)`: eliminate duplicate whitespace.
- Convert all characters to lower case.
- Delete links like https://github.com/alexbnlee/Tweets-Processing/edit/master/README.md (see the combined sketch below).
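Taken together, those steps can be combined into one text-normalisation helper. This is only a sketch (the function name is arbitrary); the link removal runs first so URLs are not merely reduced to their alphabetic parts:

```python
import re

def normalize_text(text):
    """Normalise tweet text: delete links, keep letters only, squeeze spaces, lower-case."""
    text = re.sub(r"https?://\S+", " ", text)  # delete links
    text = re.sub("[^a-zA-Z]", " ", text)      # non-alphabetic characters -> single space
    text = re.sub(r"\s+", " ", text)           # eliminate duplicate whitespace
    return text.strip().lower()
```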
- Extract flu-related tweets (see the sketch below).
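One simple way to do that is keyword matching on the `text` column; the keyword list below is only illustrative, not the one used in this repo:

```python
# Hypothetical keyword list, for illustration only.
FLU_KEYWORDS = ["flu", "influenza", "fever", "cough", "sore throat"]

pattern = "|".join(FLU_KEYWORDS)
flu_df = df[df["text"].str.contains(pattern, case=False, na=False)]
```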
- Get all results with the name Kingsford.
- Get the suburb named Kingsford.
- Get keywords from articles.
- Stanford NER
- spaCy (see the sketch below)
- NLTK
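Of the three, spaCy is probably the quickest to sketch; this assumes the `en_core_web_sm` model has been installed separately:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model

doc = nlp("Heavy flu season reported in Kingsford, Sydney this week.")
places = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]
print(places)  # e.g. ['Kingsford', 'Sydney']
```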
- Get continuous words starting with a capital letter (two methods; see the sketch below).
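The notes mention two methods; this sketch shows only a regex-based one (the second, token-based alternative is not reproduced here):

```python
import re

def capitalized_phrases(text):
    """Return runs of consecutive words that start with a capital letter."""
    return re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)

capitalized_phrases("Flu cases rise in New South Wales near Kingsford Smith Airport")
# ['Flu', 'New South Wales', 'Kingsford Smith Airport']
```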
- Delete stopwords (see the sketch below).
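A stopword-removal sketch using NLTK (the word list has to be fetched once with `nltk.download("stopwords")`):

```python
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

def remove_stopwords(text):
    """Drop common English stopwords from already lower-cased text."""
    return " ".join(w for w in text.split() if w not in STOPWORDS)
```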
- Weekly tweets (see the sketch below).
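One way to count tweets per week with pandas, assuming `created_at` has already been converted to datetime (see the `astype` step above):

```python
weekly_counts = (
    df.set_index("created_at")
      .resample("W")  # calendar weeks
      .size()
)
```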
- Flu-related tweets.
- Number of users.
- Processing with a progress bar (see the sketch below).
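A progress bar can be added with `tqdm`, for example when applying a per-row function to a column; `str.lower` here is just a stand-in for whatever processing function is actually run:

```python
from tqdm import tqdm

tqdm.pandas()  # registers .progress_apply on pandas objects

df["clean_text"] = df["text"].progress_apply(str.lower)
```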
- Calculate the number of tweets per area (see the sketch below).
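For example, grouping by the place name column listed above:

```python
tweets_per_area = df.groupby("place_name").size().sort_values(ascending=False)
```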
- Function for the distance between two points (see the haversine sketch below).
- Data processing.
- Mean error distance.
- Median error distance.
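The distance function itself is not shown in these notes; a common choice is the haversine formula, sketched below. Which pair of columns defines the "error" is an assumption here (the tweet coordinates versus the bounding-box centre):

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Error distance per tweet, then its mean and median.
errors = haversine_km(df["co_lon"], df["co_lat"], df["pb_avg_lon"], df["pb_avg_lat"])
print("mean error distance (km):", errors.mean())
print("median error distance (km):", errors.median())
```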