Thinking like a hacker:

Security Considerations for
High-Fidelity Web Archives

Jack Cushman, Perma.cc

Ilya Kreymer, Webrecorder

Why is security a concern in a web archive?

  • Myth: Web archives are just collections of old pages
  • Reality: High-fidelity web archives run untrusted web software
  • Myth: The live site is "safe", so there is nothing to worry about
  • Reality: Web archive replay can pose new security risks
  • Myth: Pages in web archives are always reliable
  • Reality: Replay issues aside, a page loaded from an archive could intentionally deceive

So our challenge:

Possible security threats

  1. Archiving local server files
  2. Hacking the headless browser
  3. Stealing user secrets during capture
  4. Cross-site scripting to steal archive logins
  5. Live web leakage on playback
  6. Showing different page contents when archived
  7. Banner spoofing

Threat: Archiving local content

  • Capture system could have privileged access:
    • Local ports: http://localhost:8080/
    • Network server: http://private-server/
    • Local files: file:///etc/passwd
  • Private resources could be captured into a public archive

Mitigation: Network Filtering + Sandboxing

  • Don't allow capture of local IP ranges
  • Restrict to http(s) protocol
  • Run capture in isolated container/VM
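
The first two bullets above can be sketched as a pre-capture check. This is a minimal illustration, not the filter used by Perma.cc or Webrecorder; the names `is_capture_allowed`, `is_public_address`, and `default_resolve` are hypothetical. It assumes that rejecting every address that is not globally routable is acceptable for the archive.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Only plain web protocols may be captured (blocks file://, ftp://, etc.).
ALLOWED_SCHEMES = {"http", "https"}


def default_resolve(host):
    """Return every IP address the host name resolves to."""
    infos = socket.getaddrinfo(host, None)
    return [info[4][0] for info in infos]


def is_public_address(ip_text):
    """True only for globally routable addresses (not loopback, private, link-local)."""
    try:
        return ipaddress.ip_address(ip_text).is_global
    except ValueError:
        # Unparseable address: refuse rather than guess.
        return False


def is_capture_allowed(url, resolve=default_resolve):
    """Reject non-http(s) URLs and hosts that resolve to any non-public address."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    host = parts.hostname
    if not host:
        return False
    try:
        addresses = resolve(host)
    except OSError:
        # Resolution failed, so there is nothing safe to capture.
        return False
    return bool(addresses) and all(is_public_address(a) for a in addresses)
```

A check like this is only a first layer: the host can resolve differently when the browser actually connects, and redirects can point at local addresses, which is why the slide pairs filtering with an isolated container or VM.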

Threat: Hacking the headless browser

  • Modern captures may use PhantomJS or other browsers on the server
  • Most browsers have known exploits

Mitigation: Sandboxing

  • Run capture system in isolated virtual machine
  • Keep VM up to date

Threat: Stealing user secrets during capture

  • Normal web flow:
    • https://twitter.com/login
    • https://doubleclick.com/evil-ad
  • During Webrecorder interactive capture:
    • https://webrecorder.io/record/https://twitter.com/login
    • https://webrecorder.io/record/https://doubleclick.com/evil-ad
  • Standard cross-domain protections do not apply!

Partial Mitigation: Rewriting

  • Rewrite cookies to exact path only
  • Rewrite JS to intercept cookie access
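
As a rough sketch of the first bullet, a `Set-Cookie` header from the captured site can be rewritten so the cookie is scoped to that site's prefix under the recording URL. This is an illustration only, not Webrecorder's actual rewriter; `rewrite_set_cookie`, `scoped_cookie_path`, and the `/record/` prefix argument are assumptions based on the URLs shown above.

```python
from urllib.parse import urlsplit


def scoped_cookie_path(replay_prefix, original_url, original_path="/"):
    """Build a cookie path that covers only the original site under the prefix."""
    parts = urlsplit(original_url)
    origin = f"{parts.scheme}://{parts.netloc}"
    if not original_path.startswith("/"):
        original_path = "/"
    return f"{replay_prefix}{origin}{original_path}"


def rewrite_set_cookie(header_value, replay_prefix, original_url):
    """Rewrite one Set-Cookie value: drop Domain, re-scope Path, keep other attributes."""
    pieces = [p.strip() for p in header_value.split(";")]
    name_value, attrs = pieces[0], pieces[1:]
    original_path = "/"
    kept = []
    for attr in attrs:
        if not attr:
            continue
        key, _, value = attr.partition("=")
        key = key.strip().lower()
        if key == "path":
            original_path = value.strip() or "/"
        elif key == "domain":
            # The original domain is meaningless on the archive's host.
            continue
        else:
            kept.append(attr)
    new_path = scoped_cookie_path(replay_prefix, original_url, original_path)
    return "; ".join([name_value, f"Path={new_path}"] + kept)
```

For example, a cookie set by https://twitter.com/login with `Path=/` ends up with `Path=/record/https://twitter.com/`, so the browser does not attach it to requests under `/record/https://doubleclick.com/`. Cookie paths are not a security boundary between pages on the same origin, which is consistent with the slide calling this mitigation partial.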

Mitigation: Separate Recording Sessions

  • For Webrecorder, use separate recording sessions when recording credentialed content

Mitigation: Remote browser

  • Record in containerized/proxy mode browser

Threat: Cross-site scripting to steal archive logins

  • http://myarchive.com/login is the main institution login page
  • http://myarchive.com/web/http://evil.com is a web archive
  • Is this safe?

No!

  • Cross-site scripting (XSS): an admin who visits http://myarchive.com/web/http://evil.com has their account taken over

Threat: Cross-site scripting to steal archive logins
... across subdomains

  • http://myarchive.com/login is the main institution login page
  • http://web.myarchive.com/http://evil.com is a web archive
  • Is this safe?

Still no ...

  • In IE10, evil.com might steal the login cookie
  • In all browsers, evil.com can wipe and replace cookies
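
The "wipe and replace" problem follows from how cookie domains match. The sketch below is a simplified model of that matching rule, not browser code; `domain_matches` and `can_overwrite_login_cookie` are hypothetical names, and real browsers also refuse public suffixes and IP addresses, which this omits.

```python
def domain_matches(host, cookie_domain):
    """Simplified cookie domain-match: equal, or host is a subdomain of cookie_domain."""
    host = host.lower().rstrip(".")
    cookie_domain = cookie_domain.lower().lstrip(".")
    return host == cookie_domain or host.endswith("." + cookie_domain)


def can_overwrite_login_cookie(archive_host, login_host, cookie_domain):
    """True if a page on archive_host can set a cookie that login_host will receive."""
    # A page may set Domain=X only if its own host matches X ...
    may_set = domain_matches(archive_host, cookie_domain)
    # ... and the browser then sends that cookie to every host matching X.
    is_received = domain_matches(login_host, cookie_domain)
    return may_set and is_received
```

Under this model, archived content replayed on web.myarchive.com can set a cookie with `Domain=myarchive.com` that is then sent to the login page on myarchive.com, while content on an unrelated domain cannot.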

Mitigation: Run web archive on separate domain

  • Use iframes to isolate web archive content
  • Load web archive app from app domain
  • Load iframe content from content domain
  • Webrecorder example:
    • https://webrecorder.io/ -- app domain
    • https://wbrc.io/ -- content domain
  • Perma.cc example:
    • https://perma.cc/ -- app domain
    • https://perma-archives.org/ -- content domain

Threat: Live web leakage on playback

  • JavaScript can send messages to evil.com and fetch new content
  • ... to mislead, track users, or rewrite history
  • (Bonus for private archives -- any of your captures could export any of your other captures)

Mitigation: Content-Security-Policy header can limit access to the web archive domain
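
One way to picture this mitigation is a helper that builds the header value sent with every replayed response. This is a sketch under assumptions, not the policy Perma.cc or Webrecorder ship; `build_replay_csp` is a hypothetical name, and the choice to allow inline scripts and `eval` is an assumption made because archived pages commonly depend on them.

```python
def build_replay_csp(content_origin):
    """Build a Content-Security-Policy value that keeps loads on the archive's content origin."""
    sources = [
        "'self'",
        "'unsafe-inline'",  # assumption: archived pages often use inline scripts
        "'unsafe-eval'",    # assumption: archived scripts may call eval()
        "data:",
        "blob:",
        content_origin,
    ]
    directives = {
        # Fallback for scripts, images, XHR/fetch, frames, and other loads.
        "default-src": sources,
        # Forms may only submit back to the archive.
        "form-action": ["'self'", content_origin],
    }
    return "; ".join(f"{name} {' '.join(values)}" for name, values in directives.items())
```

The returned string would be sent as the `Content-Security-Policy` response header on replayed pages, so a request from archived JavaScript to evil.com is blocked by the browser instead of reaching the live web.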

Threat: Showing different page contents when archived

  • Pages can tell they're in an archive and act differently

Mitigation: Run archive in containerized/proxy mode browser

Threat: Banner spoofing

  • Pages can dynamically edit the archive's banner

Mitigation: Use iframes for replay

  • Don't inject banner into replay frame
  • Use X-Frame-Options header to limit embedding
  • Serve from separate content domain
  • Use iframe sandbox (more restrictive)
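
The layout these bullets describe can be sketched as a wrapper page served from the app domain: the banner lives in the outer page and the archived content loads in a sandboxed iframe from the content domain. The function name `replay_frame_html`, the element id, and the default sandbox tokens are assumptions for illustration, not the markup any particular archive uses.

```python
from html import escape


def replay_frame_html(content_url, banner_text,
                      sandbox_tokens=("allow-scripts", "allow-forms")):
    """Render the outer (app-domain) page body: banner outside, content inside an iframe."""
    sandbox = " ".join(sandbox_tokens)
    return (
        # The banner is part of the outer document, out of reach of archived scripts.
        f'<div id="archive-banner">{escape(banner_text)}</div>\n'
        # Archived content is loaded from the separate content domain.
        f'<iframe src="{escape(content_url, quote=True)}" '
        f'sandbox="{escape(sandbox, quote=True)}"></iframe>'
    )
```

Calling `replay_frame_html("https://wbrc.io/http://evil.com", "Archived copy")` (an illustrative URL) yields a banner the framed page cannot edit, because the frame has a different origin. Which sandbox tokens to grant is a trade-off: a stricter list breaks more archived pages, and a looser one restricts the content less.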

Mitigation: Run archive in containerized/proxy mode browser

Mitigation: Display rendered DOM and strip JavaScript (archive.is)

What's next?

  • Build tools for web archive security research
  • Challenge researchers to find security issues
  • Introducing: http://warc.games/

Thank you!


Questions?
