[Feature/Proposal]: Add automatic content capture on infinite loading websites #44
base: master
Conversation
Feature proposal.

This commit introduces a new feature that automatically captures content as users scroll through websites, starting with Reddit as a proof of concept.

The implementation:

- Adds a background script that monitors page navigation to Reddit domains
- Injects content capture functionality that detects Reddit posts in the viewport
- Creates a throttled queue system to prevent excessive API calls
- Maintains a status indicator to show users when content is being captured
- Adds configuration options in the settings UI to enable/disable the feature
- Allows users to specify custom tags for auto-captured content
- Implements automatic domain-based tagging
- Prevents duplicate entries with URL normalization
- Adds visual feedback with a status indicator during capture

This feature aims to effortlessly build your personal archive while browsing, capturing valuable content without requiring manual clicks. Reddit serves as the initial implementation, but the architecture can be extended to support other infinite-scrolling sites like Twitter, Facebook, YouTube, etc.

Technical details:

- Uses mutation observers and scroll event listeners to detect new content
- Maintains a processed elements set to avoid duplicate captures
- Implements throttling to manage API request frequency
- Adds user configuration options in the options page
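The "throttled queue system to prevent excessive API calls" could be as simple as the sketch below. This is an assumed design, not the actual implementation: detected posts are enqueued immediately, and a periodic drain persists at most a fixed number of them per tick.

```javascript
// Hypothetical sketch of the throttled capture queue: posts are enqueued
// as they are detected, and drained at a bounded rate so save calls
// don't flood the backend. Names and limits are assumptions.
class CaptureQueue {
  constructor(save, maxPerInterval = 5) {
    this.save = save;                 // callback that persists one captured post
    this.maxPerInterval = maxPerInterval;
    this.queue = [];
  }

  enqueue(post) {
    this.queue.push(post);
  }

  // Called on a timer (e.g. setInterval in the content script);
  // processes at most maxPerInterval items per tick.
  drain() {
    const batch = this.queue.splice(0, this.maxPerInterval);
    for (const post of batch) this.save(post);
    return batch.length;
  }
}
```

In a content script this would be wired to something like `setInterval(() => queue.drain(), 2000)`, decoupling detection speed from save frequency.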
Cool! Can you post a screen recording or GIF of it working on a Reddit page? I can add that to the screenshots in the Chrome store.
Will try to make one tonight. I noticed this morning that there are some bugs where it did not record. But what else is new, bugs in software 😅
I noticed that I had to reload the extension before it correctly started capturing posts. I also got a popup.js warning. Priority 1 is fixing the bugs; after that, integrating more infinite-scrolling websites would be great. The great thing about this (imo) is that it does not scrape any HTML, but uses the webRequest permission to get response bodies from the URLs that are loaded in the browser. So in theory, once we fix more websites, this should be pretty solid. I was working on another idea that scraped the HTML, but that resulted in a lot of headaches. I'm also curious how it would handle hour-long scrolling sessions, since it keeps track of already-saved URLs. Testing is the key here :)
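The duplicate prevention via URL normalization mentioned in the proposal (and relevant to long sessions that accumulate already-saved URLs) could look like the following sketch. The exact rules are assumptions: Reddit serves the same post under several URL shapes (www/old/new subdomains, tracking query parameters, trailing slashes), so normalizing before the "already saved?" check avoids re-capturing.

```javascript
// Hypothetical URL normalization for duplicate detection; the specific
// rules (strip subdomain prefixes, drop query/hash, trim trailing slash)
// are assumptions, not the actual implementation.
function normalizeUrl(raw) {
  const url = new URL(raw);
  url.hash = '';                                        // fragments never change the post
  url.search = '';                                      // drop tracking parameters
  url.hostname = url.hostname.replace(/^(www|old|new)\./, '');
  url.pathname = url.pathname.replace(/\/+$/, '');      // /comments/abc/ -> /comments/abc
  return url.toString();
}

const seen = new Set();
function isDuplicate(raw) {
  const key = normalizeUrl(raw);
  if (seen.has(key)) return true;
  seen.add(key);
  return false;
}
```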
This commit fixes several issues with the Reddit content capture functionality:

1. Storage Access Error: Removed direct chrome.storage.session calls from the injected script context, using window variables instead to avoid "Access to storage is not allowed from this context" errors.
2. Browser Restart Detection: Added a setupExistingTabs() function to detect and initialize Reddit tabs that were already open when the extension starts or reloads.
3. Small Window Detection: Improved post visibility detection to use a more forgiving algorithm that captures posts partially visible in the viewport, fixing issues with small browser windows.
4. Message Passing: Implemented chrome.runtime.sendMessage for configuration values instead of direct storage access.
5. Improved Status Indicator: Enhanced the status display to show multiple captured posts at once with a counter.
6. Toggleable Debug Logging: Added a DEBUG_MODE constant to easily enable/disable diagnostic logs.

These changes improve the reliability of the Reddit capture system across browser restarts and different window sizes while maintaining the original functionality.
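The "more forgiving" visibility algorithm for small windows could be sketched as below. The function name and the 30% threshold are assumptions: instead of requiring a post to sit fully inside the viewport, it counts a post as visible when enough of its height overlaps the viewport.

```javascript
// Hypothetical partial-visibility check. rect is the { top, bottom }
// pair from element.getBoundingClientRect(); a post counts as visible
// when at least minRatio of its height overlaps the viewport.
function isPartiallyVisible(rect, viewportHeight, minRatio = 0.3) {
  const visibleTop = Math.max(rect.top, 0);
  const visibleBottom = Math.min(rect.bottom, viewportHeight);
  const visiblePx = Math.max(0, visibleBottom - visibleTop);
  const height = rect.bottom - rect.top;
  return height > 0 && visiblePx / height >= minRatio;
}
```

With a strict fully-inside check, a tall post can never qualify in a short window; a ratio-based check fixes exactly that failure mode.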
…maintainability

This commit introduces a major architectural overhaul to the content capture system:

- Created a modular site handler system to support multiple sites
- Extracted Reddit-specific logic into a dedicated reddit-handler.js
- Implemented memory management with configurable limits
- Added enhanced user controls for site-specific settings
- Improved performance with better throttling and queuing
- Added detailed capture statistics for monitoring
- Enhanced UI with site detection and filtering
- Fixed resource usage issues by limiting stored data
- Improved error handling and recovery
- Fixed syntax errors with proper async/await usage in background.js and options.js
- Fixed a pattern escaping issue in manifest.json web_accessible_resources

This refactoring addresses key architectural issues:

1. Tight coupling between components
2. Excessive resource usage
3. Limited user configuration
4. Poor maintainability

Known issues that still need to be addressed:

Security vulnerabilities:

- XSS risks in entries-tab.js and popup.js due to unsanitized string interpolation
- Path traversal risk in reddit-handler.js URL normalization
- Missing secure context verification in site-handlers.js

Performance issues:

- Memory leaks in the state.observedPosts and processedUrls sets due to insufficient pruning
- Inefficient DOM operations in entries-tab.js causing full re-renders on each filter change
- Redundant storage operations in reddit-handler.js (saving every 50 items)
- Overly broad mutation observer in reddit-content.js

Browser compatibility issues:

- Missing API availability checks for features like navigator.clipboard.writeText()
- Chrome-specific APIs used without fallbacks
- Missing CSS vendor prefixes in popup.js

Architecture issues:

- Multiple components with direct dependencies on the same storage keys
- Inconsistent error handling across files
- Callback patterns that could be improved with async/await
- HTML structure dependencies without validation

Specific bugs:

- Race conditions in site-handlers.js concurrent operations
- Missing permissions verification in several features
- Unclosed observers in reddit-content.js
- Unsafe URL parsing without proper error handling

The new architecture is more extensible, allowing for easier addition of new site handlers in the future, but these issues will need to be addressed in subsequent commits.
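The "memory management with configurable limits", together with the noted insufficient pruning of state.observedPosts and processedUrls, suggests replacing plain Sets with a size-capped set that evicts its oldest entries. A hypothetical sketch (names and default limit are assumptions):

```javascript
// Hypothetical bounded set for tracking processed URLs/posts during
// hour-long scrolling sessions. A Map preserves insertion order, so
// evicting its first key gives simple FIFO pruning.
class BoundedSet {
  constructor(maxSize = 1000) {
    this.maxSize = maxSize;
    this.items = new Map();           // value -> true, insertion-ordered
  }

  add(value) {
    this.items.delete(value);         // re-adding refreshes recency
    this.items.set(value, true);
    if (this.items.size > this.maxSize) {
      // evict the oldest entry (first key in insertion order)
      this.items.delete(this.items.keys().next().value);
    }
  }

  has(value) {
    return this.items.has(value);
  }

  get size() {
    return this.items.size;
  }
}
```

The trade-off is that very old posts can be re-captured once evicted, but memory stays bounded regardless of session length.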
Awesome, nice work. I'd love to build this for X.com next after we merge this.
How this could be improved

This feature can be expanded in several ways:

- Additional Website Support
- Enhanced Configuration Options
- Performance Optimizations