Cache warm-up #15

Closed
chesio opened this issue Jan 30, 2019 · 5 comments

@chesio
Owner

chesio commented Jan 30, 2019

Let an internal crawling mechanism, rather than site visitors, trigger the creation of cache files.

@chesio chesio added the enhancement New feature or request label Jan 30, 2019
@chesio chesio self-assigned this Jan 30, 2019
@chesio
Owner Author

chesio commented Feb 3, 2020

Things to keep in mind:

  1. Crawling should not overload the web server.
  2. Some URLs are more important than others.
  3. Individual cache entries are never removed automatically; they can only be removed "by hand" via the Cache Viewer interface, a WP-CLI command or deletion from disk.
  4. Request variants should be supported as much as possible.

Points 1. and 2. inherently require some priority management (a queue of some sort). I would argue that individual URLs can be grouped as follows (sorted by descending priority):

  1. Front-page - is_front_page()
  2. Blog-page - is_home()
  3. Static pages - is_page()
  4. Posts (including CPT, excluding attachments) - is_single()
  5. Taxonomy pages - is_tax()

Groups 4 and 5 furthermore contain subgroups:

  • in case of group 4: built-in posts and (optional) custom post types
  • in case of group 5: built-in category and tags taxonomy and (optional) custom taxonomies

Subgroups should have priority too, although no obvious default sorting exists here. On a blog website, blog posts would have higher priority than custom post types; on a portfolio website with a news section, the portfolio custom post type would have higher priority than blog posts.

Finally, items within every (sub)group should have priority too. I assume that most of the time this could be a sort order based on particular database columns: date in case of blog posts, menu order in case of pages, etc.
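
To make the three-level ordering concrete, here is a minimal sketch in Python (used purely for illustration, the plugin itself would do this in PHP). The group order is the one listed above; `CrawlItem`, `subgroup_rank` and `item_rank` are hypothetical names, and the sample URLs are made up.

```python
from dataclasses import dataclass

# Group order taken from the list above; a lower number is crawled earlier.
GROUP_PRIORITY = {"front_page": 0, "home": 1, "page": 2, "single": 3, "tax": 4}


@dataclass(frozen=True)
class CrawlItem:
    url: str
    group: str          # one of the GROUP_PRIORITY keys
    subgroup_rank: int  # hypothetical: site-specific rank of a post type or taxonomy
    item_rank: int      # hypothetical: e.g. menu order, or a negated timestamp


def crawl_order(items):
    """Sort by group, then subgroup, then item; the URL breaks remaining ties."""
    return sorted(
        items,
        key=lambda i: (GROUP_PRIORITY[i.group], i.subgroup_rank, i.item_rank, i.url),
    )


items = [
    CrawlItem("/category/news/", "tax", 0, 0),
    CrawlItem("/portfolio/logo/", "single", 1, 0),
    CrawlItem("/2020/02/hello/", "single", 0, 0),
    CrawlItem("/about/", "page", 0, 1),
    CrawlItem("/", "front_page", 0, 0),
]
ordered = [item.url for item in crawl_order(items)]
```

Using the URL as the last sort key is one way to keep the ordering fully deterministic even when two items share all three ranks.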

Summary

Given all the above, the crawling queue can be treated as a static list. To get the next set of URLs to crawl, one would only need to maintain a pointer of some kind, like "Nth static page should be cached next". This pointer would then be reset every time the cache is cleared.
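
The static-list-plus-pointer idea could look roughly like the sketch below (Python for illustration only; `StaticCrawlQueue`, `next_batch` and `reset` are hypothetical names). Fetching a small batch per run is also one possible way to avoid overloading the web server.

```python
class StaticCrawlQueue:
    """A fixed, pre-sorted list of URLs plus a pointer to the next one to crawl."""

    def __init__(self, urls):
        self._urls = list(urls)
        self._pointer = 0

    def next_batch(self, size):
        """Return up to `size` URLs and advance the pointer past them."""
        batch = self._urls[self._pointer:self._pointer + size]
        self._pointer += len(batch)
        return batch

    def reset(self):
        """Start over from the top, e.g. after the whole cache has been cleared."""
        self._pointer = 0


queue = StaticCrawlQueue(["/", "/blog/", "/about/", "/contact/", "/2020/02/hello/"])
first = queue.next_batch(2)
second = queue.next_batch(2)
queue.reset()
after_reset = queue.next_batch(2)
```

In a real plugin the pointer would have to be persisted between runs; where to store it is left open here.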

Since individual cache entries are only removed manually (see point 3. above), I would argue that they can be set for immediate recrawl. Alternatively the interface could provide an option for that (like "Remove and recrawl").

Implementation notes:

  1. The ordering of groups, subgroups and items must be strict.
  2. The ordering of groups, subgroups and items should be filterable.
  3. It should be possible to exclude a certain group, subgroup or item.
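
A rough sketch of notes 2 and 3, again in Python for illustration only: the default group order is passed through a chain of callbacks that may reorder or drop entries. `run_filters` is a hypothetical stand-in for whatever hook mechanism the plugin would use, and both callbacks are made-up examples of site-specific customization.

```python
DEFAULT_GROUP_ORDER = ["front_page", "home", "page", "single", "tax"]


def run_filters(value, callbacks):
    """Pass the value through each callback in turn (stand-in for a hook system)."""
    for callback in callbacks:
        value = callback(value)
    return value


def posts_before_pages(order):
    """Hypothetical filter: crawl posts before static pages."""
    order = list(order)  # copy, so the default list is not mutated
    order.remove("single")
    order.insert(order.index("page"), "single")
    return order


def exclude_taxonomies(order):
    """Hypothetical filter: do not warm up taxonomy pages at all."""
    return [group for group in order if group != "tax"]


group_order = run_filters(DEFAULT_GROUP_ORDER, [posts_before_pages, exclude_taxonomies])
```

The same pattern would apply one level down, to subgroups and to individual items.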

Open questions:

  1. Should there be some indication of crawling progress? Is this feasible?
  2. How to proceed with request variants? Does wp_remote_get() support cookies?

@chesio chesio added this to the 1.8 milestone Feb 3, 2020
@chesio chesio modified the milestones: 1.8, 1.9 May 8, 2020
@chesio
Owner Author

chesio commented May 22, 2020

Another thing to consider: some pages (URLs in general) are excluded from caching (like the WooCommerce cart page). The crawler should handle such URLs gracefully.
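
One possible reading of "gracefully" is to set such URLs aside before any request is made, so they neither cost a request nor get retried forever. A minimal sketch in Python (illustration only); `split_crawlable`, `is_excluded` and the prefixes are hypothetical and not the plugin's actual exclusion rules.

```python
# Hypothetical exclusion rule: paths that must never be cached.
EXCLUDED_PREFIXES = ("/cart/", "/checkout/")


def is_excluded(url):
    """Return True if the URL falls under one of the excluded prefixes."""
    return url.startswith(EXCLUDED_PREFIXES)


def split_crawlable(urls, is_excluded):
    """Partition URLs into those worth crawling and those skipped up front."""
    crawlable, skipped = [], []
    for url in urls:
        (skipped if is_excluded(url) else crawlable).append(url)
    return crawlable, skipped


crawlable, skipped = split_crawlable(
    ["/", "/cart/", "/shop/", "/checkout/pay/"], is_excluded
)
```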

@chesio chesio modified the milestones: 1.9.x, 2.0.x May 29, 2020
@chesio
Owner Author

chesio commented May 29, 2020

Seems to be a large enough feature to warrant a major version bump.

@chesio
Owner Author

chesio commented May 29, 2020

There are third-party solutions to this problem, probably worth further investigation:

@chesio
Owner Author

chesio commented May 29, 2020

Things to keep in mind:

  1. Crawling should not overload the web server.
  2. Some URLs are more important than others.

One elegant solution that fits these criteria is to integrate cache warming with a statistics plugin like Statify and only warm up the cache with the M most-visited URLs from the last N days (where M and N are the respective Statify options).
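
The selection itself is simple once visit records are available. The sketch below (Python, illustration only) works on plain `(url, day)` tuples and makes no assumption about how Statify actually stores or exposes its data; `most_visited` and the sample records are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def most_visited(visits, today, days, limit):
    """Return the `limit` most-visited URLs among visits from the last `days` days."""
    cutoff = today - timedelta(days=days)
    counts = Counter(url for url, day in visits if day > cutoff)
    return [url for url, _ in counts.most_common(limit)]


visits = [
    ("/", date(2020, 5, 28)), ("/", date(2020, 5, 27)), ("/", date(2020, 5, 25)),
    ("/blog/", date(2020, 5, 28)), ("/blog/", date(2020, 5, 26)),
    ("/about/", date(2020, 5, 24)),
    # Popular once, but outside the time window:
    ("/old/", date(2020, 5, 1)), ("/old/", date(2020, 5, 2)),
    ("/old/", date(2020, 5, 3)), ("/old/", date(2020, 5, 4)),
]
to_warm = most_visited(visits, today=date(2020, 5, 29), days=7, limit=2)
```

This covers both criteria at once: the limit M bounds the load on the server, and the visit counts provide the priority.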

@chesio chesio closed this as completed in 2354b4e Aug 29, 2021
chesio added a commit that referenced this issue Aug 29, 2021
Check the response of the HTTP request and if it is anything other than 200,
push the item back to the feeder for a later recrawl.

Fix the handling of request variants (use key instead of name).

See #15.
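
The behaviour described in the commit message can be sketched as follows (Python, illustration only, not the code from the commit). `crawl_once` and `fetch_status` are hypothetical names; the status lookup is injected so the sketch runs without a network.

```python
from collections import deque


def crawl_once(feeder, fetch_status):
    """Visit each item currently in the feeder once; re-queue anything that is not 200."""
    warmed = []
    for _ in range(len(feeder)):  # bounded, so re-queued items wait for the next run
        url = feeder.popleft()
        if fetch_status(url) == 200:
            warmed.append(url)
        else:
            feeder.append(url)  # push back for a later recrawl
    return warmed


# Canned responses standing in for real HTTP requests.
statuses = {"/": 200, "/blog/": 503, "/about/": 200}
feeder = deque(["/", "/blog/", "/about/"])
warmed = crawl_once(feeder, statuses.get)
```

Bounding the loop by the initial queue length means a permanently failing URL is retried at most once per run instead of spinning indefinitely.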