JavaScript files harvested as partial content (HTTP 206) break playback #201

Open

karenhanson opened this issue Jun 10, 2020 · 5 comments

karenhanson commented Jun 10, 2020

I've been testing Brozzler locally using the brozzler-easy option. I have generated a comprehensive list of URLs to visit for a Scalar publication I'm working on (i.e. 0 hops for each seed). The resulting WARC files have a large number of HTTP 206 partial responses for a portion of the JavaScript files, though each JS file has at least one 200 response. The result is that, on playback in pywb, some pages load the 206 Partial Content response while others load the 200 OK response. If pywb loads the 206 response, a blank page is shown and the console shows JS errors. I can fix it by removing the 206 rows from the .cdxj index file so playback falls back to the 200 copy; after that, every page loads fine.
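For reference, the cleanup I'm doing by hand looks roughly like this as a script (just a sketch; it assumes the usual CDXJ layout of urlkey timestamp {json}, with a "status" field in the JSON blob):

import json
import sys

def strip_206(in_path, out_path):
    # Drop 206 entries from a CDXJ index so playback falls back to the
    # 200 captures. Assumes each line is "urlkey timestamp {json}".
    with open(in_path) as infile, open(out_path, "w") as outfile:
        for line in infile:
            try:
                _, _, blob = line.rstrip("\n").split(" ", 2)
                if json.loads(blob).get("status") == "206":
                    continue  # skip partial-content captures
            except ValueError:
                pass  # keep lines that don't match the expected layout
            outfile.write(line)

if __name__ == "__main__":
    strip_206(sys.argv[1], sys.argv[2])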

I noticed that some JS files don't seem to have this problem. It looks like it's only the ones where the <script> tag declaring the file does not include the type="text/javascript" attribute, which should be optional. That may be a coincidence, but I tried 2 completely different Scalar sites and they behaved the same way. I'm running Brozzler on a Mac with Google Chrome. I suspect a Chrome behavior that has a negative impact on the WARC, but I'm not sure whether Brozzler, warcprox, pywb, or somewhere else is the best place to handle it. Does this seem like a Brozzler issue?

If needed, I can supply a test configuration file for replicating the problem, but I wanted to check that I'm in the right place and that this isn't a known issue or the result of incorrect configuration. Thanks!

nlevitt (Contributor) commented Jun 10, 2020

Interesting. I don't think we have a configuration mechanism to avoid saving 206s at the moment. It might be worth adding such a feature. Alternatively, it might also make sense for pywb to have some heuristic to prefer 200s over 206s, or to ignore 206s entirely. Or they could be skipped at indexing time. Lots of options here; curious what you think, @ikreymer?
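To make the replay-side heuristic concrete, here's one purely illustrative sketch (not actual pywb code or API) of preferring 200s when both kinds of capture exist for a URL:

def prefer_full_captures(records):
    # Illustrative only: "records" is assumed to be the candidate index
    # entries for one URL, each a dict with at least a "status" field.
    if any(r.get("status") == "200" for r in records):
        # a full capture exists, so ignore the partial ones entirely
        records = [r for r in records if r.get("status") != "206"]
    # otherwise keep 206s, but sort them after everything else
    return sorted(records, key=lambda r: r.get("status") == "206")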

ikreymer commented

The issue with skipping all 206s is that many are for a bytes=0- request, which is often (though I suppose not necessarily) equivalent to a regular 200.

pywb in recording mode actually converts a bytes=0- request to a regular non-range request, records that, and then adds the range back in when serving the response. If it gets a range request other than bytes=0-, the request is proxied as is but not recorded.
https://github.com/webrecorder/pywb/blob/master/pywb/apps/rewriterapp.py#L248
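For anyone following along, the behavior described above is roughly this (a simplified sketch, not the real pywb code linked above; record() is a hypothetical stand-in for writing to a WARC):

import requests  # plain HTTP client, just for the sketch

def record(resp):
    """Hypothetical placeholder for writing the response to a WARC."""

def fetch_for_recording(url, range_header=None):
    if range_header is None:
        resp = requests.get(url)          # normal request: fetch and record
        record(resp)
        return resp.status_code, resp.content
    if range_header.strip() == "bytes=0-":
        resp = requests.get(url)          # drop the range, fetch the full file
        record(resp)                      # record the complete 200 response
        return 206, resp.content          # add the range back when serving
    # any other range: proxy it through as-is, but do not record it
    resp = requests.get(url, headers={"Range": range_header})
    return resp.status_code, resp.content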

More generally, I wanted to mention that part of the reason pywb's recording mode exists is that direct proxying, as warcprox does it, does not always lead to replayable content. If the goal is to replay in pywb, using the pywb recording mode is probably the best option; perhaps that could be an option for brozzler? The recording mode does things like this with range requests, fixes resolution for DASH/HLS video, and chooses a full video stream when available. This results in videos from YouTube and other places that are directly replayable without having to use youtube-dl (for example, the issue in #198). I'm not sure how to address this other than to adopt the solutions that pywb has for it...

@karenhanson slightly off-topic, but I am also curious about your use case of archiving Scalar sites, as I've been working on a workflow to automate Scalar capture specifically.

karenhanson (Author) commented

Thank you for the responses so far. I'm fairly new to web archiving, so I'm still trying to wrap my mind around some things. I think this is related to the comment above about bytes=0- and not skipping all 206s, but I did note that many audiovisual files are stored only in 206 responses and appear to work fine. The 206 JS files, on the other hand, are clearly cut off mid-file. Favoring 200 over 206 for files referenced by <script>, either during playback or at capture time, seems ideal from my limited perspective.
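In case concrete examples help, this is roughly how the partial JS captures show up when I poke at the WARCs (a quick sketch assuming the warcio library; list_partial_js is just an illustrative name):

from warcio.archiveiterator import ArchiveIterator

def list_partial_js(warc_path):
    # Print every 206 response for a .js URL, along with its Content-Range,
    # e.g. "bytes 0-65535/212992" for a file that was cut off mid-download.
    with open(warc_path, "rb") as stream:
        for rec in ArchiveIterator(stream):
            if rec.rec_type != "response":
                continue
            url = rec.rec_headers.get_header("WARC-Target-URI") or ""
            if ".js" not in url:
                continue
            if rec.http_headers and rec.http_headers.get_statuscode() == "206":
                print(url, rec.http_headers.get_header("Content-Range"))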

@ikreymer - I believe I have seen a recorded demo of the project you mentioned, and it's actually the next thing I plan to look at. The Scalar instance I'm working with is enhanced with some non-standard features, so I'm not sure what will happen. I wanted to see what I could do with Brozzler first, as I've had some promising results with it for another dynamic publishing platform. Rather than add too much unrelated information to this thread, I could email you some information about what I'm working on. I think you may already be in touch with some of our project partners - it's related to this grant.

ikreymer commented

> Thank you for the responses so far. I'm fairly new to web archiving, so I'm still trying to wrap my mind around some things. I think this is related to the comment above about bytes=0- and not skipping all 206s, but I did note that many audiovisual files are stored only in 206 responses and appear to work fine. The 206 JS files, on the other hand, are clearly cut off mid-file. Favoring 200 over 206 for files referenced by <script>, either during playback or at capture time, seems ideal from my limited perspective.

Apologies, I think my comment may have been overly broad in assuming what the issue is, and it tried to address other possible issues as well! Perhaps it'd be useful to look at some examples of JS files that come back as 206 responses? Do you have a few examples of the files and/or pages where this happens? 206 for JS seems quite unusual, so perhaps something else is going on...

> @ikreymer - I believe I have seen a recorded demo of the project you mentioned, and it's actually the next thing I plan to look at. The Scalar instance I'm working with is enhanced with some non-standard features, so I'm not sure what will happen. I wanted to see what I could do with Brozzler first, as I've had some promising results with it for another dynamic publishing platform. Rather than add too much unrelated information to this thread, I could email you some information about what I'm working on. I think you may already be in touch with some of our project partners - it's related to this grant.

Ah, that makes sense, yes - I would be happy to continue this over email, as it's off-topic for this thread! Thanks!

karenhanson (Author) commented

Indeed - I haven't noticed it happening when browsing normally and watching network activity in the developer panel; it's only when looking at the WARC after I run Brozzler. The missing <script type= attribute is the only thing I can see that makes sense - maybe if the browser doesn't instantly know it's JS, it gets part of the file as it would for a video, then immediately retrieves the whole thing once it identifies it as JS. This is total speculation. This very simple test config file replicates it for me, though I think any Scalar page will do:

id: ravenspace-as-i-remember-it
max_claimed_sites: 2
ignore_robots: true
metadata: {}
scope:
  max_hops: 0
seeds:
- url: http://publications.ravenspacepublishing.org/as-i-remember-it/territory

I have a few behaviors that I've added using /usr/local/lib/python3.7/site-packages/brozzler/behaviors.yaml. I'm almost certain they aren't relevant to the issue, but the first one gets rid of the popup and allows the scroll to happen:

- url_regex: '^https?://publications\.ravenspacepublishing\.org/.*$'
  behavior_js_template: umbraBehavior.js.j2
  request_idle_timeout_sec: 30
  default_parameters:
    interval: 1000
    actions:
      - selector: a.button.popup__btn--agree
      - selector: a[rev="scalar:has_note"]
      - selector: div[class="media_tab"]
        closeSelector: a[title="Close"]
