
[Feature/Proposal]: Add automatic content capture on infinite loading websites #44


Draft
wants to merge 3 commits into
base: master


@tuxx tuxx commented Mar 19, 2025

Feature proposal.

This commit introduces a new feature that automatically captures content as users scroll through websites, starting with Reddit as a proof of concept.

The implementation:

  • Adds a background script that monitors page navigation to Reddit domains
  • Injects content capture functionality that detects Reddit posts in the viewport
  • Creates a throttled queue system to prevent excessive API calls
  • Maintains a status indicator to show users when content is being captured
  • Adds configuration options in the settings UI to enable/disable the feature
  • Allows users to specify custom tags for auto-captured content
  • Implements automatic domain-based tagging
  • Prevents duplicate entries with URL normalization
  • Adds visual feedback with a status indicator during capture
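To make the duplicate prevention and domain-based tagging bullets more concrete, here is a minimal sketch of how URL normalization could work. The names `normalizeUrl`, `domainTag`, `shouldCapture`, and the `TRACKING_PARAMS` list are assumptions for illustration, not the code in this PR; the actual normalization rules may differ.

```javascript
// Hypothetical helpers: names and rules are illustrative assumptions.

// Query parameters treated as tracking noise (assumed list).
const TRACKING_PARAMS = ["utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"];

// Reduce a URL to a canonical form so that trivially different
// variants of the same post map to the same key.
function normalizeUrl(rawUrl) {
  const url = new URL(rawUrl);
  url.hash = ""; // drop the fragment
  for (const name of TRACKING_PARAMS) url.searchParams.delete(name);
  url.hostname = url.hostname.replace(/^www\./, ""); // drop a leading "www."
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1); // drop the trailing slash
  }
  return url.toString();
}

// Derive a tag such as "reddit.com" from the page URL.
function domainTag(rawUrl) {
  return new URL(rawUrl).hostname.replace(/^www\./, "");
}

// Remember normalized URLs so each post is only queued once.
const capturedUrls = new Set();
function shouldCapture(rawUrl) {
  const key = normalizeUrl(rawUrl);
  if (capturedUrls.has(key)) return false;
  capturedUrls.add(key);
  return true;
}
```

With this sketch, `shouldCapture` returns `true` the first time a post URL is seen and `false` for later variants that differ only in fragment, tracking parameters, `www.` prefix, or trailing slash.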

This feature aims to effortlessly build your personal archive while browsing, capturing valuable content without requiring manual clicks. Reddit serves as the initial implementation, but the architecture can be extended to support other infinite-scrolling sites like Twitter, Facebook, YouTube, etc.

Technical details:

  • Uses mutation observers and scroll event listeners to detect new content
  • Maintains a processed elements set to avoid duplicate captures
  • Implements throttling to manage API request frequency
  • Adds user configuration options in the options page
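The processed-set and throttling points above can be sketched as a small queue. This is an illustrative sketch under assumptions, not the PR's implementation: the `ThrottledQueue` class, its method names, and the idea of passing the current time in explicitly (which keeps the logic easy to test) are all hypothetical.

```javascript
// Hypothetical throttled queue: releases at most one item per interval
// and ignores items it has already seen.
class ThrottledQueue {
  constructor(minIntervalMs) {
    this.minIntervalMs = minIntervalMs; // minimum gap between two sends
    this.pending = [];                  // URLs waiting to be sent
    this.seen = new Set();              // every URL ever enqueued
    this.lastSentAt = -Infinity;        // timestamp of the last release
  }

  // Returns true if the URL was queued, false if it was a duplicate.
  enqueue(url) {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.pending.push(url);
    return true;
  }

  // Returns the next URL to send, or undefined if the queue is empty
  // or the minimum interval has not elapsed yet.
  next(now = Date.now()) {
    if (this.pending.length === 0) return undefined;
    if (now - this.lastSentAt < this.minIntervalMs) return undefined;
    this.lastSentAt = now;
    return this.pending.shift();
  }
}
```

In an extension, a timer could call `next()` periodically and send whatever it returns to the archive server, so a fast scroll fills the queue without producing a burst of API requests.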

How this could be improved

This feature can be expanded in several ways:

  1. Additional Website Support:

    • Twitter/X for capturing tweets and threads
    • YouTube for capturing video details
    • Facebook for capturing posts and updates
    • Instagram for image posts
    • LinkedIn for professional content
    • Pinterest for visual content
    • News sites with infinite scrolling
  2. Enhanced Configuration Options:

    • Per-site toggle controls
    • Capture frequency settings (aggressive, normal, minimal)
    • Content type filters (e.g., only posts above certain engagement threshold)
    • Automatic tagging based on content type/category
    • Schedule-based enabling (e.g., only during work hours)
    • Domain-specific tag templates
  3. Performance Optimizations:

    • Intelligent throttling based on system resources
    • Background syncing when browser is idle
    • Optional local caching before server sync
    • Batch processing of multiple captures

@pirate
Member

pirate commented Mar 19, 2025

Cool! Can you post a screen recording or GIF of it working on a Reddit page? I can add that to the screenshots in the Chrome store.

@tuxx
Author

tuxx commented Mar 19, 2025

Will try to make one tonight.

I noticed this morning that there are some bugs where it did not record. But what else is new, bugs in software 😅

@tuxx
Author

tuxx commented Mar 19, 2025

reddit-archivebox.mp4

2025-03-19_19-40-38
2025-03-19_19-37-39

The screenshots show the config page in the extension and the pages logged through the new feature. The video shows me scrolling Reddit, with the indicator at the bottom right appearing when posts are saved.

@tuxx
Author

tuxx commented Mar 19, 2025

I noticed that I had to reload the extension before it correctly started capturing the posts. I also got a popup.js warning.

Priority 1 is fixing the bugs. After that integrating more infinite scrolling websites would be great.

The great thing about this (imo) is that it does not scrape any HTML, but uses the webRequest permission to get response bodies from the URLs that are loaded in the browser. So in theory, once we support more websites, this should be pretty solid. I was working on another idea that scraped the HTML, but that resulted in a lot of headaches.

I'm also curious how it would handle hour-long scrolling sessions, since it keeps track of already-saved URLs. Testing is the key here :)

This commit fixes several issues with the Reddit content capture functionality:

1. Storage Access Error: Removed direct chrome.storage.session calls from
   the injected script context, using window variables instead to avoid
   "Access to storage is not allowed from this context" errors.

2. Browser Restart Detection: Added setupExistingTabs() function
   to detect and initialize Reddit tabs that were already open
   when the extension starts or reloads.

3. Small Window Detection: Improved post visibility detection to use
   a more forgiving algorithm that captures posts partially visible
   in the viewport, fixing issues with small browser windows.

4. Added Message Passing: Implemented chrome.runtime.sendMessage for
   configuration values instead of direct storage access.

5. Improved Status Indicator: Enhanced the status display to show
   multiple captured posts at once with a counter.

6. Toggleable Debug Logging: Added a DEBUG_MODE constant to easily
   enable/disable diagnostic logs.

These changes improve reliability of the Reddit capture system across
browser restarts and different window sizes while maintaining the
original functionality.
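As an illustration of the "more forgiving" visibility check described in item 3, here is a minimal sketch. The function names and the 0.3 threshold are assumptions, not values taken from this commit. The idea is to measure the visible part of a post against the smaller of the post height and the viewport height, so a tall post in a small window still counts as visible.

```javascript
// Hypothetical visibility check; names and threshold are assumptions.

// rect needs numeric `top` and `bottom` (e.g. from getBoundingClientRect()).
function visibleRatio(rect, viewportHeight) {
  const visibleTop = Math.max(rect.top, 0);
  const visibleBottom = Math.min(rect.bottom, viewportHeight);
  const visibleHeight = Math.max(0, visibleBottom - visibleTop);
  const height = rect.bottom - rect.top;
  // Compare against the smaller of post height and viewport height, so a
  // post taller than the window can still reach a ratio of 1.
  const reference = Math.min(height, viewportHeight);
  return reference > 0 ? visibleHeight / reference : 0;
}

function isPartiallyVisible(rect, viewportHeight, minRatio = 0.3) {
  return visibleRatio(rect, viewportHeight) >= minRatio;
}
```

In a content script this could be called as `isPartiallyVisible(el.getBoundingClientRect(), window.innerHeight)`.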
@tuxx
Author

tuxx commented Mar 20, 2025

Fixed a lot of bugs around browser restarts and already-open Reddit tabs, and made the indicator show the last 5 captured posts (for all you fast scrollers out there).

2025-03-20_01-51-20

…intainability

This commit introduces a major architectural overhaul to the content capture system:

- Created a modular site handler system to support multiple sites
- Extracted Reddit-specific logic into dedicated reddit-handler.js
- Implemented memory management with configurable limits
- Added enhanced user controls for site-specific settings
- Improved performance with better throttling and queuing
- Added detailed capture statistics for monitoring
- Enhanced UI with site detection and filtering
- Fixed resource usage issues by limiting stored data
- Improved error handling and recovery
- Fixed syntax errors with proper async/await usage in background.js and options.js
- Fixed pattern escaping issue in manifest.json web_accessible_resources

This refactoring addresses key architectural issues:
1. Tight coupling between components
2. Excessive resource usage
3. Limited user configuration
4. Poor maintainability

Known issues that still need to be addressed:

Security vulnerabilities:
- XSS risks in entries-tab.js and popup.js due to unsanitized string interpolation
- Path traversal risk in reddit-handler.js URL normalization
- Missing secure context verification in site-handlers.js

Performance issues:
- Memory leaks in state.observedPosts and processedUrls sets with insufficient pruning
- Inefficient DOM operations in entries-tab.js causing full re-renders on each filter change
- Redundant storage operations in reddit-handler.js (saving every 50 items)
- Overly broad mutation observer in reddit-content.js

Browser compatibility issues:
- Missing API availability checks for features like navigator.clipboard.writeText()
- Chrome-specific APIs used without fallbacks
- CSS vendor prefixes missing in popup.js

Architecture issues:
- Multiple components with direct dependencies on same storage keys
- Inconsistent error handling across files
- Callback patterns that could be improved with async/await
- HTML structure dependencies without validation

Specific bugs:
- Race conditions in site-handlers.js concurrent operations
- Missing permissions verification in several features
- Unclosed observers in reddit-content.js
- Unsafe URL parsing without proper error handling

The new architecture is more extensible, allowing for easier addition of
new site handlers in the future, but these issues will need to be addressed
in subsequent commits.
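One possible way to address the insufficient pruning noted for `state.observedPosts` and `processedUrls` is a size-capped set that evicts its oldest entry. The sketch below is an assumption about how this could be done, not code from this PR; `BoundedSet` is a hypothetical name. It relies on the fact that a JavaScript `Set` iterates in insertion order.

```javascript
// Hypothetical size-capped set: once maxSize is exceeded, the entry that
// was added (or refreshed) longest ago is evicted.
class BoundedSet {
  constructor(maxSize) {
    this.maxSize = maxSize;
    this.items = new Set(); // Set iteration follows insertion order
  }

  add(value) {
    // Re-adding an existing value moves it to the "newest" position.
    if (this.items.has(value)) this.items.delete(value);
    this.items.add(value);
    if (this.items.size > this.maxSize) {
      const oldest = this.items.values().next().value;
      this.items.delete(oldest);
    }
    return this;
  }

  has(value) {
    return this.items.has(value);
  }

  get size() {
    return this.items.size;
  }
}
```

The trade-off is that a URL evicted from the set could be captured again later, so server-side duplicate handling would still matter during very long scrolling sessions.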
@pirate
Member

pirate commented Mar 20, 2025

Awesome, nice work. I'd love to build this for X.com next after we merge this.


2 participants