
Commit

Merge branch 'main' of github.com:epfl-ada/ada-2024-project-dondata2025
juldib committed Dec 20, 2024
2 parents 3debaf0 + 6c2238f commit c8cab47
Showing 1 changed file with 14 additions and 7 deletions.
21 changes: 14 additions & 7 deletions README.md
@@ -58,17 +58,24 @@ We tried another approach to detect unusual trends in name counts following a ke

### Name detection

To identify the main characters in our movies, we processed the plot_summaries.txt file, which contains plot summaries for 42,306 movies extracted from English-language Wikipedia. Each entry in the file follows a consistent structure:

Wikipedia ID \t Plot Summary \n
To identify the main characters in our movies, we processed the plot_summaries.txt file, which contains plot summaries for 42,306 movies extracted from English-language Wikipedia.
Each line in the file represents one movie, with its Wikipedia ID and plot summary separated by a tab character.

Using this format, we extracted both the Wikipedia ID and the plot summary, linking each movie’s name to its corresponding Wikipedia ID and release year.
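
As an illustration of the tab-separated format described above, here is a minimal sketch of how such a file could be read. The function name `parse_plot_summaries` is hypothetical and not taken from the project code, and keeping the ID as a string is an assumption made here for simplicity.

```python
def parse_plot_summaries(lines):
    """Map each Wikipedia ID (kept as a string) to its plot summary.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    summaries = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        # Split on the first tab only, in case a summary itself contains tabs.
        wiki_id, _, summary = line.partition("\t")
        summaries[wiki_id] = summary.strip()
    return summaries


# Possible usage on the real file (path assumed to be the working directory):
# with open("plot_summaries.txt", encoding="utf-8") as f:
#     summaries = parse_plot_summaries(f)
```

Splitting on the first tab only keeps the whole summary intact even if it happens to contain a tab itself.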

After filtering the data, we proceeded to identify the main characters. For this task, we utilized spaCy, an open-source Natural Language Processing library for Python. We analyzed each plot summary, labeled words in the text, and calculated the frequency of each character’s name. To ensure relevance, we applied a threshold: only characters mentioned at least twice in the plot summary were retained.

This approach allowed us to efficiently detect main characters. Ultimately, we created a DataFrame containing the character names and their respective counts for each movie.
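
The frequency threshold described above can be sketched as follows. This is only an illustration: it takes a list of already-detected name mentions as input rather than running spaCy itself, and the function name `main_characters` and the parameter `min_mentions` are hypothetical.

```python
from collections import Counter


def main_characters(person_mentions, min_mentions=2):
    """Count name mentions and keep names seen at least `min_mentions` times."""
    counts = Counter(person_mentions)
    return {name: n for name, n in counts.items() if n >= min_mentions}


# With spaCy, the mentions for one summary could plausibly be collected like
# this (assumption: an English pipeline whose entity labels include PERSON):
#   doc = nlp(summary)
#   mentions = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
mentions = ["Alice", "Bob", "Alice", "Carol", "Bob", "Alice"]
kept = main_characters(mentions)
```

In this toy input, "Carol" appears only once and is therefore dropped, while "Alice" and "Bob" are retained with their counts.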

### prophet
### Machine learning prediction with Prophet

The metric given in the section above identifies which names are potential candidates, but we still need to find a way to know whether the name was actually influenced.

To do so, we use a technique called Interrupted Time Series. We take the data about a name from before a movie was released and use a machine learning model to estimate what the name's normal evolution would have been.

This leaves us with two curves that represent the name's evolution after the release of the movie: one containing the actual data from the datasets and one predicted from the previous counts. If the actual curve is much higher than the predicted one, we can assume that the movie has influenced this name!

Thus we run this algorithm on all main character names of every movie in the dataset and then define a threshold on the distance between the curves to decide whether a movie influenced a name or not.
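
To make the Interrupted Time Series idea concrete, here is a minimal sketch under stated assumptions. It replaces Prophet with a plain least-squares linear trend so that it runs with the standard library only, treats the release year as part of the "after" period, and measures the distance between the curves as the mean gap. All function names, the sample counts, and the threshold value are hypothetical.

```python
def fit_linear_trend(years, counts):
    """Ordinary least-squares line through (year, count) points."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, counts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def influence_gap(counts_by_year, release_year):
    """Mean of (actual - predicted) counts from the release year onward."""
    pre = {y: c for y, c in counts_by_year.items() if y < release_year}
    post = {y: c for y, c in counts_by_year.items() if y >= release_year}
    # Fit the trend on pre-release data only, then extrapolate it forward.
    slope, intercept = fit_linear_trend(list(pre), list(pre.values()))
    gaps = [c - (slope * y + intercept) for y, c in post.items()]
    return sum(gaps) / len(gaps)


def was_influenced(counts_by_year, release_year, threshold):
    """Flag a name when the actual curve sits well above the predicted one."""
    return influence_gap(counts_by_year, release_year) > threshold


# Made-up counts: steady growth, then a jump at the 2004 release.
counts = {2000: 100, 2001: 110, 2002: 120, 2003: 130, 2004: 200, 2005: 220}
gap = influence_gap(counts, release_year=2004)
flagged = was_influenced(counts, release_year=2004, threshold=30)
```

A linear trend is far cruder than Prophet or SARIMA, which can capture seasonality and provide confidence intervals, but the comparison logic is the same: fit on the pre-release data, extrapolate, and compare against what actually happened.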

## Contribution of group members
- Jeremy :
@@ -80,8 +87,8 @@ This approach allowed us to efficiently detect main characters. Ultimately, we c
- Emile :
- datasets and naïve approach model presentation
- "try it yourself" results display
- Movie influence over time analysis
- Birth of a new name analysis
- "Movie influence over time" analysis
- "Birth of a new name" analysis
- Corentin :
- Predicted name counts using Prophet and SARIMA models, incorporating confidence intervals and metric computation to determine name influence.
- Developed the character name recognition system.
@@ -91,7 +98,7 @@ This approach allowed us to efficiently detect main characters. Ultimately, we c
- Worked on defining what a blockbuster is
- Studied the effect of movie genre on names
- Studied the case of Norwegian names (only in the results notebook)
- Updated the website to sho the findings of the analysis for the datastory
- Updated the website to show the findings of the analysis for the datastory


