JavaScript files harvested as partial content (HTTP 206) break playback #201
Comments
Interesting. I don't think we have a configuration mechanism to avoid saving 206s at the moment. It might be worth adding such a feature. Alternatively, it might make sense for pywb to have some heuristic to prefer 200s over 206s, or to ignore 206s entirely. Or they could be skipped at indexing time. Lots of options here; curious what you think, @ikreymer?
The issue with skipping all 206s is that many are for a pywb in recording mode actually converts a

More generally, I wanted to mention that part of the reason pywb's recording mode exists is that direct proxying, as warcprox does, does not always lead to replayable content. If the goal is to replay in pywb, using the pywb recording mode is probably the best option; perhaps that could be an option for brozzler? The recording mode does things like this with range requests, fixes resolution for DASH/HLS video, and chooses a full video stream when available. This results in videos from YouTube and other places that are directly replayable without having to use youtube-dl (for example, the issue in #198). I'm not sure how to address this other than to adopt the solutions that pywb has for this...

@karenhanson slightly off-topic, but I am also curious about your use case of archiving Scalar sites, as I've been working on a workflow to automate Scalar capture specifically.
Thank you for the responses so far. I'm fairly new to web archiving, so still trying to wrap my mind around some things. I think this is related to the comment above about

@ikreymer - I believe I have seen a recorded demo for the project you mentioned and it's actually the next thing I plan to look at. The Scalar instance I'm working with is enhanced with some non-standard features, so I'm not sure what will happen. I wanted to see what I could do with Brozzler first, as I've had some promising results with it for another dynamic publishing platform. Rather than add too much unrelated information to this thread, I could email you some information about what I'm working on. I think you may already be in touch with some of our project partners - it's related to this grant.
Apologies, I think my comment may have been overly broad in assuming what the issue is, and in trying to address other possible issues as well! Perhaps it'd be useful to look at some examples of JS files that are 206 responses? Do you have a few examples of the files and/or pages where this happens? A 206 for JS seems quite unusual, so perhaps something else is going on...
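One possible way to pull such examples out of an existing index, sketched in Python. It assumes the index uses pywb's CDXJ layout (a URL key, a timestamp, then a JSON object with fields such as `url`, `mime`, and `status`); the function name `find_partial_js` and the sample rows are made up for illustration.

```python
import json


def find_partial_js(cdxj_lines):
    """Return (timestamp, url) pairs for JavaScript captures indexed with status 206.

    Assumes each line looks like: "<urlkey> <timestamp> <json fields>".
    """
    hits = []
    for line in cdxj_lines:
        line = line.strip()
        if not line:
            continue
        # Split only twice: the JSON part may itself contain spaces.
        _urlkey, timestamp, raw_json = line.split(" ", 2)
        fields = json.loads(raw_json)
        url = fields.get("url") or ""
        mime = fields.get("mime") or ""
        # Treat a record as JS if the MIME type says so or the path ends in .js.
        is_js = "javascript" in mime or url.split("?", 1)[0].endswith(".js")
        if str(fields.get("status")) == "206" and is_js:
            hits.append((timestamp, url))
    return hits


# Hypothetical index rows, for illustration only.
SAMPLE_INDEX = [
    'org,example)/js/app.js 20200101000000 {"url": "http://example.org/js/app.js", "mime": "application/javascript", "status": "200"}',
    'org,example)/js/app.js 20200101000005 {"url": "http://example.org/js/app.js", "mime": "application/javascript", "status": "206"}',
    'org,example)/media/clip.mp4 20200101000010 {"url": "http://example.org/media/clip.mp4", "mime": "video/mp4", "status": "206"}',
]

for timestamp, url in find_partial_js(SAMPLE_INDEX):
    print(timestamp, url)
```

In practice the lines would come from the collection's index file rather than a hard-coded list.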
Ah, that makes sense, yes - would be happy to continue this over email as it's off-topic for this thread! Thanks!
Indeed - I haven't noticed it happening when browsing normally and watching network activity in the developer panel. It's only when looking at the WARC after I run Brozzler. The missing
I have a few behaviors that I've added using
I've been testing Brozzler locally using the brozzler-easy option. I have generated a comprehensive list of URLs to visit for a Scalar publication I'm working on (i.e. 0 hops for each seed). The resulting WARC files have a large number of HTTP 206 partial responses for a portion of the JavaScript files, though each JS file has at least one 200 response. The result is that, on playback in pywb, some pages load the 206 Partial Content response and others load the 200 OK. If the 206 response is loaded by pywb, then a blank page is shown and the console has JS errors. I can fix it by removing the 206 rows from the .cdxj index file so it falls back to the 200 copy; then every page loads fine.

I noticed that some JS files don't seem to have this problem. It looks like it's only the ones where the <script> tag declaring the file does not include the type="text/javascript" attribute, which should be optional. That may be a coincidence, but I tried 2 completely different Scalar sites and they did the same thing. I'm running Brozzler on a Mac with Google Chrome. I suspect a Chrome behavior that has a negative impact on the WARC, but I'm not sure whether Brozzler, Warcprox, pywb, or somewhere else is the best place to handle it. Does this seem like a Brozzler issue?

If needed, I can supply a test configuration file for replicating the problem, but I wanted to check that I'm in the right place and that it's not a known issue or the result of incorrect configuration. Thanks!
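For anyone wanting to script the index workaround described above, here is a minimal sketch in Python. It assumes pywb's CDXJ line layout (URL key, timestamp, then a JSON object with a `status` field); the helper name `drop_redundant_206` and the sample rows are hypothetical. It only drops a 206 row when the same URL key also has a 200 row, so partial captures with no full copy (video ranges, for example) are left in place.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into (urlkey, timestamp, fields dict)."""
    urlkey, timestamp, raw_json = line.strip().split(" ", 2)
    return urlkey, timestamp, json.loads(raw_json)


def drop_redundant_206(lines):
    """Return the index lines without 206 rows that have a 200 sibling.

    A 206 row is kept when no 200 capture exists for the same URL key,
    since it may be the only copy available for replay.
    """
    parsed = [(line, parse_cdxj_line(line)) for line in lines if line.strip()]
    keys_with_200 = {
        urlkey
        for _line, (urlkey, _ts, fields) in parsed
        if str(fields.get("status")) == "200"
    }
    kept = []
    for line, (urlkey, _ts, fields) in parsed:
        if str(fields.get("status")) == "206" and urlkey in keys_with_200:
            continue  # a full 200 copy exists, so replay can fall back to it
        kept.append(line)
    return kept


# Hypothetical index rows, for illustration only.
SAMPLE_ROWS = [
    'org,example)/js/app.js 20200101000000 {"url": "http://example.org/js/app.js", "status": "200"}',
    'org,example)/js/app.js 20200101000005 {"url": "http://example.org/js/app.js", "status": "206"}',
    'org,example)/media/clip.mp4 20200101000010 {"url": "http://example.org/media/clip.mp4", "status": "206"}',
]

for row in drop_redundant_206(SAMPLE_ROWS):
    print(row)
```

The filtered lines would be written to a copy of the index, leaving the WARC files themselves untouched.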