Remove duplicates and sort elements #32

Open
wants to merge 1 commit into
base: master

janthmueller

Hi there,

First off, thank you for your amazing work! I’ve been using your API in my project, Daily Stoic Quotes Using GitHub Scheduled Actions, and it has been a fantastic resource.

For a follow-up project, I explored your dataset (quotes.json) and discovered some exact duplicate entries. This PR aims to remove those duplicates and suggests sorting the quotes by author name and text, which should make it easier to identify similar entries in the future. I also found a few quotes that were very similar but not identical—likely due to different translations—so I left them unchanged.
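
The near-identical quotes mentioned above were left untouched, but they could be surfaced for manual review with something like the following. This is only a rough sketch, not part of the cleanup script in this PR: the helper name `find_near_duplicates`, the 0.9 threshold, and the placeholder sample entries are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_near_duplicates(quotes, threshold=0.9):
    """Return (text_a, text_b, ratio) for same-author quotes that are similar but not identical.

    Hypothetical helper: compares every pair, so it is O(n^2) and only
    suitable for small datasets.
    """
    pairs = []
    for a, b in combinations(quotes, 2):
        # Only compare quotes attributed to the same author.
        if a["author"] != b["author"]:
            continue
        ratio = SequenceMatcher(None, a["text"].lower(), b["text"].lower()).ratio()
        # A ratio of 1.0 means identical text, which exact de-duplication already handles.
        if threshold <= ratio < 1.0:
            pairs.append((a["text"], b["text"], ratio))
    return pairs


# Placeholder entries (not real quotes) in the same shape as quotes.json items.
sample_quotes = [
    {"author": "Author A", "text": "Placeholder text that depends upon a single word changing."},
    {"author": "Author A", "text": "Placeholder text that depends on a single word changing."},
    {"author": "Author A", "text": "Completely unrelated placeholder."},
]

near_duplicates = find_near_duplicates(sample_quotes)
for text_a, text_b, ratio in near_duplicates:
    print(f"{ratio:.2f}: {text_a!r} ~ {text_b!r}")
```

The threshold would need tuning against the real data; a lower value catches looser translations at the cost of more false positives.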

Below is the code I used to clean and sort the dataset, along with detailed comments explaining each step:

import json

# Load the quotes data from the JSON file.
with open('quotes.json', encoding='utf-8') as f:
    data = json.load(f)

quotes = data['quotes']

# Track unique quotes (in original order), a set of already-seen quotes, and indices of duplicates.
unique_quotes = []
_seen_quotes = set()
duplicate_indices = []

# Iterate through the list of quotes.
for i, quote in enumerate(quotes):
    # Convert the quote dictionary to a frozenset (immutable and hashable) so it can be stored in a set.
    _quote = frozenset(quote.items())

    # Check if the current quote has already been seen.
    if _quote not in _seen_quotes:
        # If it's unique, add it to the seen set and the list of unique quotes.
        _seen_quotes.add(_quote)
        unique_quotes.append(quote)
    else:
        # If it's a duplicate, record its index for reference.
        duplicate_indices.append(i)

# Sort the unique quotes by author name with its words in reverse order (so the last name
# comes first, e.g. "Marcus Aurelius" -> "aureliusmarcus"), then by quote text, case-insensitively.
sorted_quotes = sorted(
    unique_quotes,
    key=lambda x: ("".join(x['author'].lower().split()[::-1]), x['text'].lower())
)

# Create a new JSON structure for the cleaned and sorted quotes.
new_quotes = {"quotes": sorted_quotes}

# Save the cleaned and sorted quotes to a new file.
with open('new_quotes.json', 'w', encoding='utf-8') as f:
    json.dump(new_quotes, f, indent=2, ensure_ascii=False)

# Output the number of duplicate quotes that were removed.
num_duplicates = len(duplicate_indices)
print(f"Removed {num_duplicates} duplicate quotes.")
# Output example: Removed 43 duplicate quotes.
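
To make the effect of the sort key and the de-duplication easier to check, here is a small self-contained sketch that runs the same logic on an in-memory list. The sample entries use placeholder text rather than real quotes, and the helper names `author_sort_key` and `dedupe` are introduced here for illustration only; they do not appear in the script above.

```python
def author_sort_key(quote):
    # Same key as the script above: the author's name is lowercased, split into
    # words, reversed, and joined, so the last word of the name sorts first.
    return ("".join(quote["author"].lower().split()[::-1]), quote["text"].lower())


def dedupe(quotes):
    # Keep the first occurrence of each exact (author, text) entry, preserving order.
    seen = set()
    unique = []
    for quote in quotes:
        key = frozenset(quote.items())
        if key not in seen:
            seen.add(key)
            unique.append(quote)
    return unique


# Placeholder data in the same shape as the entries in quotes.json.
sample = [
    {"text": "second sample text", "author": "Marcus Aurelius"},
    {"text": "first sample text", "author": "Seneca"},
    {"text": "second sample text", "author": "Marcus Aurelius"},  # exact duplicate
    {"text": "first sample text", "author": "Epictetus"},
]

cleaned = sorted(dedupe(sample), key=author_sort_key)
for quote in cleaned:
    print(author_sort_key(quote)[0], "|", quote["text"])
```

Because the reversed name parts are joined without a separator, single-word names such as "Seneca" sort by the name itself, while "Marcus Aurelius" sorts under "aureliusmarcus".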

Please let me know if you’d like me to adjust anything or if there are any concerns. Thank you again for your fantastic API, and I hope this contribution is helpful.


vercel bot commented Nov 17, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| stoic-quotes | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Nov 17, 2024 1:58pm |
