[Feature/Proposal]: Add automatic content capture on infinite loading websites #44

tuxx · 2025-03-19T02:04:28Z

Feature proposal.

This commit introduces a new feature that automatically captures content as users scroll through websites, starting with Reddit as a proof of concept.

The implementation:

- Adds a background script that monitors page navigation to Reddit domains
- Injects content capture functionality that detects Reddit posts in the viewport
- Creates a throttled queue system to prevent excessive API calls
- Maintains a status indicator to show users when content is being captured
- Adds configuration options in the settings UI to enable/disable the feature
- Allows users to specify custom tags for auto-captured content
- Implements automatic domain-based tagging
- Prevents duplicate entries with URL normalization
- Adds visual feedback with a status indicator during capture
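
As an illustration of the duplicate prevention via URL normalization listed above, here is a minimal sketch. It is not the code from this PR: `normalizeUrl`, `shouldCapture`, and the exact rules (dropping the query string, the fragment, and a `www.`/`old.`/`new.` prefix) are assumptions chosen for the example.

```javascript
// Hypothetical helper: reduce a post URL to a canonical key so that the same
// post reached through different links is only captured once.
function normalizeUrl(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return null; // not a parseable absolute URL
  }
  url.hash = '';   // fragments never identify a different post
  url.search = ''; // assumption: query parameters are tracking noise
  url.hostname = url.hostname.replace(/^(www|old|new)\./, '');
  // Remove trailing slashes, but keep "/" for the site root.
  url.pathname = url.pathname.replace(/\/+$/, '') || '/';
  return url.toString();
}

// Keys of everything captured so far in this session.
const seenUrls = new Set();

// Returns true only the first time a given (normalized) URL is offered.
function shouldCapture(rawUrl) {
  const key = normalizeUrl(rawUrl);
  if (key === null || seenUrls.has(key)) return false;
  seenUrls.add(key);
  return true;
}
```

Dropping the whole query string is the bluntest possible rule; a site whose post identity lives in a query parameter would need a narrower filter.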

This feature aims to effortlessly build your personal archive while browsing, capturing valuable content without requiring manual clicks. Reddit serves as the initial implementation, but the architecture can be extended to support other infinite-scrolling sites like Twitter, Facebook, YouTube, etc.

Technical details:

- Uses mutation observers and scroll event listeners to detect new content
- Maintains a processed elements set to avoid duplicate captures
- Implements throttling to manage API request frequency
- Adds user configuration options in the options page
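
The throttling mentioned above could be sketched roughly as follows. This is an assumption about the shape of such a queue, not the PR's implementation: `ThrottledQueue`, `poll`, and the interval value are hypothetical names and numbers. The clock is passed in so the behavior is easy to check without timers.

```javascript
// Hypothetical throttled queue: items are released one at a time, and never
// more often than once per `minIntervalMs`.
class ThrottledQueue {
  constructor(minIntervalMs) {
    this.minIntervalMs = minIntervalMs;
    this.items = [];
    this.lastSentAt = -Infinity; // nothing has been released yet
  }

  enqueue(item) {
    this.items.push(item);
  }

  // Returns the oldest queued item if the interval has elapsed since the last
  // release, otherwise undefined. `now` is a timestamp in milliseconds.
  poll(now = Date.now()) {
    if (this.items.length === 0) return undefined;
    if (now - this.lastSentAt < this.minIntervalMs) return undefined;
    this.lastSentAt = now;
    return this.items.shift();
  }
}

// Usage with explicit timestamps:
const queue = new ThrottledQueue(1000);
queue.enqueue('https://reddit.com/r/a/comments/1');
queue.enqueue('https://reddit.com/r/a/comments/2');
const first = queue.poll(0);     // released immediately
const blocked = queue.poll(500); // too soon, nothing released
const second = queue.poll(1000); // interval elapsed, next item released
```

In an extension, a periodic timer could call `poll()` and hand any returned URL to whatever function submits it to the archive server.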

How this could be improved

This feature can be expanded in several ways:

Additional Website Support:
- Twitter/X for capturing tweets and threads
- YouTube for capturing video details
- Facebook for capturing posts and updates
- Instagram for image posts
- LinkedIn for professional content
- Pinterest for visual content
- News sites with infinite scrolling

Enhanced Configuration Options:
- Per-site toggle controls
- Capture frequency settings (aggressive, normal, minimal)
- Content type filters (e.g., only posts above certain engagement threshold)
- Automatic tagging based on content type/category
- Schedule-based enabling (e.g., only during work hours)
- Domain-specific tag templates

Performance Optimizations:
- Intelligent throttling based on system resources
- Background syncing when browser is idle
- Optional local caching before server sync
- Batch processing of multiple captures
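
To make the batch-processing idea concrete, here is a small sketch of grouping queued captures into fixed-size batches. Nothing like this exists in the PR yet; `toBatches` and the batch size are hypothetical.

```javascript
// Hypothetical helper: split a list of queued items into batches of at most
// `batchSize` items, preserving order.
function toBatches(items, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Five queued URLs with a batch size of 2 give batches of 2, 2, and 1.
const exampleBatches = toBatches(['u1', 'u2', 'u3', 'u4', 'u5'], 2);
```

Each batch could then be sent in one request instead of one request per post, provided the server side accepts multiple URLs at once.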


pirate · 2025-03-19T07:29:58Z

Cool! Can you post a screen recording or GIF of it working on a Reddit page? I can add that to the screenshots in the Chrome Web Store.

tuxx · 2025-03-19T13:29:57Z

Will try to make one tonight.

I noticed this morning that there are some bugs where it did not record. But what else is new, bugs in software 😅

tuxx · 2025-03-19T18:41:59Z

reddit-archivebox.mp4

The screenshots show the config page in the extension and the pages logged through the new feature. The video shows me scrolling Reddit, with the indicator in the bottom right appearing when posts are saved.

tuxx · 2025-03-19T18:50:38Z

I noticed that I had to reload the extension before it started capturing posts correctly. I also got a popup.js warning.

Priority 1 is fixing the bugs. After that, integrating more infinite-scrolling websites would be great.

The great thing about this (imo) is that it does not scrape any HTML, but uses the webRequest permission to get response bodies from the URLs that are loaded in the browser. So in theory, once we add support for more websites, this should be pretty solid. I was working on another idea that scraped the HTML, but that resulted in a lot of headaches.

I'm also curious how it would handle hour-long scrolling sessions, since it stores the URLs it has already saved. Testing is the key here :)

This commit fixes several issues with the Reddit content capture functionality:

1. Storage Access Error: Removed direct chrome.storage.session calls from the injected script context, using window variables instead to avoid "Access to storage is not allowed from this context" errors.
2. Browser Restart Detection: Added setupExistingTabs() function to detect and initialize Reddit tabs that were already open when the extension starts or reloads.
3. Small Window Detection: Improved post visibility detection to use a more forgiving algorithm that captures posts partially visible in the viewport, fixing issues with small browser windows.
4. Added Message Passing: Implemented chrome.runtime.sendMessage for configuration values instead of direct storage access.
5. Improved Status Indicator: Enhanced the status display to show multiple captured posts at once with a counter.
6. Toggleable Debug Logging: Added a DEBUG_MODE constant to easily enable/disable diagnostic logs.

These changes improve reliability of the Reddit capture system across browser restarts and different window sizes while maintaining the original functionality.
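
For readers wondering what a "more forgiving" visibility check might look like, here is one possible sketch. It is not taken from the commit: `visibleHeight`, `isPostVisible`, and the 50-pixel default are assumptions. The functions take a plain rectangle so they can be checked outside a browser.

```javascript
// Hypothetical helpers for partial-visibility detection.
// `rect` needs only `top` and `bottom`, in viewport coordinates, which is
// what Element.getBoundingClientRect() provides.
function visibleHeight(rect, viewportHeight) {
  const top = Math.max(rect.top, 0);
  const bottom = Math.min(rect.bottom, viewportHeight);
  return Math.max(0, bottom - top); // 0 when the element is fully off-screen
}

// A post counts as visible when at least `minVisiblePx` of it is on screen,
// or all of it when the post is shorter than that threshold.
function isPostVisible(rect, viewportHeight, minVisiblePx = 50) {
  const height = rect.bottom - rect.top;
  if (height <= 0) return false;
  const needed = Math.min(minVisiblePx, height);
  return visibleHeight(rect, viewportHeight) >= needed;
}
```

In a content script this could be called as `isPostVisible(el.getBoundingClientRect(), window.innerHeight)`. Requiring a fixed number of pixels rather than the whole post is what lets tall posts register in a small window.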

tuxx · 2025-03-20T00:53:31Z

Fixed a lot of bugs around browser restarts and already-open Reddit tabs, and made the indicator show the last 5 captured posts (for all you fast scrollers out there).
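
A rough sketch of how an indicator could track "the last 5 captured posts" plus a running counter. The class name `RecentCaptures`, its methods, and the status text format are assumptions for illustration, not the code in the branch.

```javascript
// Hypothetical state behind the status indicator: keeps a running total and
// only the most recent `limit` titles.
class RecentCaptures {
  constructor(limit = 5) {
    this.limit = limit;
    this.titles = [];
    this.total = 0;
  }

  add(title) {
    this.total += 1;
    this.titles.push(title);
    // Drop the oldest title once the list exceeds the limit.
    if (this.titles.length > this.limit) this.titles.shift();
  }

  statusText() {
    return `Captured ${this.total} posts. Latest: ${this.titles.join(', ')}`;
  }
}

// Seven captures in quick succession: the counter shows 7, the list shows 5.
const recent = new RecentCaptures();
for (let i = 1; i <= 7; i += 1) recent.add(`post ${i}`);
```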

…intainability

This commit introduces a major architectural overhaul to the content capture system:

- Created a modular site handler system to support multiple sites
- Extracted Reddit-specific logic into dedicated reddit-handler.js
- Implemented memory management with configurable limits
- Added enhanced user controls for site-specific settings
- Improved performance with better throttling and queuing
- Added detailed capture statistics for monitoring
- Enhanced UI with site detection and filtering
- Fixed resource usage issues by limiting stored data
- Improved error handling and recovery
- Fixed syntax errors with proper async/await usage in background.js and options.js
- Fixed pattern escaping issue in manifest.json web_accessible_resources

This refactoring addresses key architectural issues:

1. Tight coupling between components
2. Excessive resource usage
3. Limited user configuration
4. Poor maintainability

Known issues that still need to be addressed:

Security vulnerabilities:
- XSS risks in entries-tab.js and popup.js due to unsanitized string interpolation
- Path traversal risk in reddit-handler.js URL normalization
- Missing secure context verification in site-handlers.js

Performance issues:
- Memory leaks in state.observedPosts and processedUrls sets with insufficient pruning
- Inefficient DOM operations in entries-tab.js causing full re-renders on each filter change
- Redundant storage operations in reddit-handler.js (saving every 50 items)
- Overly broad mutation observer in reddit-content.js

Browser compatibility issues:
- Missing API availability checks for features like navigator.clipboard.writeText()
- Chrome-specific APIs used without fallbacks
- CSS vendor prefixes missing in popup.js

Architecture issues:
- Multiple components with direct dependencies on same storage keys
- Inconsistent error handling across files
- Callback patterns that could be improved with async/await
- HTML structure dependencies without validation

Specific bugs:
- Race conditions in site-handlers.js concurrent operations
- Missing permissions verification in several features
- Unclosed observers in reddit-content.js
- Unsafe URL parsing without proper error handling

The new architecture is more extensible, allowing for easier addition of new site handlers in the future, but these issues will need to be addressed in subsequent commits.
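
To illustrate the two ideas in this commit that are easiest to show in isolation, a handler registry keyed by domain and a size-capped set for pruning, here is a minimal sketch. All names (`BoundedSet`, `registerHandler`, `findHandler`, the handler shape, the limit of 1000) are hypothetical and not taken from site-handlers.js or reddit-handler.js.

```javascript
// Hypothetical size-capped set: once `maxSize` is exceeded, the oldest entry
// is evicted, so long scrolling sessions cannot grow it without bound.
class BoundedSet {
  constructor(maxSize) {
    this.maxSize = maxSize;
    this.set = new Set();
  }
  has(value) {
    return this.set.has(value);
  }
  add(value) {
    if (this.set.has(value)) return;
    this.set.add(value);
    if (this.set.size > this.maxSize) {
      // A Set iterates in insertion order, so the first value is the oldest.
      this.set.delete(this.set.values().next().value);
    }
  }
  get size() {
    return this.set.size;
  }
}

// Hypothetical registry: each handler declares the domains it covers.
const handlers = [];

function registerHandler(handler) {
  handlers.push(handler);
}

// Returns the handler whose domain matches the URL's hostname exactly or as
// a parent domain, or undefined if none matches or the URL cannot be parsed.
function findHandler(rawUrl) {
  let hostname;
  try {
    hostname = new URL(rawUrl).hostname;
  } catch {
    return undefined;
  }
  return handlers.find((h) =>
    h.domains.some((d) => hostname === d || hostname.endsWith('.' + d)));
}

registerHandler({ id: 'reddit', domains: ['reddit.com'], seen: new BoundedSet(1000) });
```

Matching on `'.' + d` rather than a bare suffix keeps a host such as `notreddit.com` from being routed to the Reddit handler. Adding another site would then be one more `registerHandler` call with its own domains.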

pirate · 2025-03-20T04:11:58Z

Awesome, nice work. I'd love to build this for X.com next after we merge this.
