Remove duplicates and sort elements #32

janthmueller · 2024-11-17T13:58:19Z

Hi there,

First off, thank you for your amazing work! I’ve been using your API in my project, Daily Stoic Quotes Using GitHub Scheduled Actions, and it has been a fantastic resource.

For a follow-up project, I explored your dataset (quotes.json) and discovered some exact duplicate entries. This PR removes those duplicates and sorts the quotes by author name and then by text, which should make it easier to spot similar entries in the future. I also found a few quotes that were very similar but not identical, likely due to different translations, so I left them unchanged.

Below is the code I used to clean and sort the dataset, along with detailed comments explaining each step:

import json

# Load the quotes data from the JSON file (UTF-8, matching the encoding used when writing below).
with open('quotes.json', encoding='utf-8') as f:
    data = json.load(f)

quotes = data['quotes']

# Track the unique quotes (in their original order), a set of already-seen quotes,
# and the indices of duplicates.
unique_quotes = []
_seen_quotes = set()
duplicate_indices = []

# Iterate through the list of quotes.
for i, quote in enumerate(quotes):
    # Convert the quote dictionary to a frozenset of its (key, value) pairs so it is
    # hashable and order-independent; this requires every value to be hashable (e.g. a string).
    _quote = frozenset(quote.items())

    # Check if the current quote has already been seen (a set lookup is O(1) on average).
    if _quote not in _seen_quotes:
        # If it's unique, add it to the seen set and the list of unique quotes.
        _seen_quotes.add(_quote)
        unique_quotes.append(quote)
    else:
        # If it's a duplicate, record its index for reference.
        duplicate_indices.append(i)

# Sort case-insensitively by author name with its words reversed and joined
# (e.g. "Marcus Aurelius" -> "aureliusmarcus", so the last word comes first), then by quote text.
sorted_quotes = sorted(
    unique_quotes,
    key=lambda x: ("".join(x['author'].lower().split()[::-1]), x['text'].lower())
)

# Create a new JSON structure for the cleaned and sorted quotes.
new_quotes = {"quotes": sorted_quotes}

# Save the cleaned and sorted quotes to a new file.
with open('new_quotes.json', 'w', encoding='utf-8') as f:
    json.dump(new_quotes, f, indent=2, ensure_ascii=False)

# Output the number of duplicate quotes that were removed.
num_duplicates = len(duplicate_indices)
print(f"Removed {num_duplicates} duplicate quotes.")
# Output example: Removed 43 duplicate quotes.
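
For the near-identical quotes mentioned above, one possible way to flag them for manual review is sketched below. This is only an illustration and is not what produced the result above: the function name `find_near_duplicates`, the 0.9 threshold, and the sample entries are assumptions, and the sample is not taken from quotes.json. It relies only on the `author` and `text` keys used in the script and on `difflib.SequenceMatcher` from the standard library.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_near_duplicates(quotes, threshold=0.9):
    """Return (ratio, text_a, text_b) for same-author quotes that are similar but not identical.

    `threshold` is a hypothetical cut-off on SequenceMatcher.ratio(); tune it by hand.
    """
    # Group quote texts by author so only quotes from the same author are compared.
    by_author = {}
    for quote in quotes:
        by_author.setdefault(quote["author"], []).append(quote["text"])

    pairs = []
    for texts in by_author.values():
        for a, b in combinations(texts, 2):
            if a == b:
                # Exact duplicates are already handled by the script above.
                continue
            ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if ratio >= threshold:
                pairs.append((round(ratio, 2), a, b))

    # Most similar pairs first.
    return sorted(pairs, reverse=True)


# Illustrative sample only (not entries from quotes.json).
sample = [
    {"author": "Seneca", "text": "We suffer more often in imagination than in reality."},
    {"author": "Seneca", "text": "We suffer more in imagination than in reality."},
    {"author": "Epictetus", "text": "No man is free who is not master of himself."},
]

for ratio, a, b in find_near_duplicates(sample):
    print(ratio, "|", a, "|", b)
```

Comparing every pair within an author is quadratic, which should be acceptable for a dataset of this size; the flagged pairs would still need a human decision about which translation to keep.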

Please let me know if you’d like me to adjust anything or if there are any concerns. Thank you again for your fantastic API, and I hope this contribution is helpful.

vercel · 2024-11-17T13:58:23Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
stoic-quotes	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Nov 17, 2024 1:58pm

Remove duplicates and sort

4186c54

vercel bot deployed to Preview November 17, 2024 13:58 View deployment
