Thinking like a hacker:
Security Considerations for
High-Fidelity Web Archives
Jack Cushman, Perma.cc
Ilya Kreymer, Webrecorder
Why is security a concern in a web archive?
- Web archives are just collections of old pages
- The live site is "safe", so there is nothing to worry about
- Pages in web archives are always reliable
- Replay issues aside, a page loaded from an archive could intentionally deceive
So our challenge:
Possible security threats
- Archiving local server files
- Hacking the headless browser
- Stealing user secrets during capture
- Cross-site scripting to steal archive logins
- Live web leakage on playback
- Show different page contents when archived
- Banner spoofing
Threat: Archiving local content
- Capture system could have privileged access:
- Local ports: http://localhost:8080/
- Network server: http://private-server/
- Local files: file:///etc/passwd
- Private resources could be captured into a public archive
Mitigation: Network Filtering + Sandboxing
- Don't allow capture of local IP ranges
- Restrict capture to the http(s) protocols
- Run capture in isolated container/VM
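The filtering rules above can be sketched as a pre-capture check. This is a minimal illustration, not any particular archive's implementation: the function names and the idea of passing in already-resolved addresses are assumptions made for the example. A real capture system would also need to connect to the same address it checked, since a hostname can resolve differently on a second lookup.

```python
import ipaddress
from typing import Iterable
from urllib.parse import urlsplit

# Only plain web protocols may be captured; file://, ftp://, etc. are refused.
ALLOWED_SCHEMES = {"http", "https"}


def is_public_address(ip_text: str) -> bool:
    """Return True only for globally routable addresses.

    Loopback (127.0.0.1), private ranges (10.0.0.0/8, 192.168.0.0/16, ...)
    and link-local addresses are all reported as non-global.
    """
    return ipaddress.ip_address(ip_text).is_global


def is_capture_allowed(url: str, resolved_ips: Iterable[str]) -> bool:
    """Hypothetical pre-capture check.

    `resolved_ips` is assumed to be the list of addresses the capture
    worker obtained when resolving the URL's hostname.
    """
    parts = urlsplit(url)
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        return False
    ips = list(resolved_ips)
    if not parts.hostname or not ips:
        return False
    # Every resolved address must be public; one local address is enough to refuse.
    return all(is_public_address(ip) for ip in ips)
```

With this sketch, the three examples from the threat slide are all refused: the localhost URL and the private server fail the address check, and file:///etc/passwd fails the protocol check.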
Threat: Hacking the headless browser
- Modern captures may use PhantomJS or other browsers on the server
- Most browsers have known exploits
Mitigation: Sandboxing
- Run capture system in isolated virtual machine
- Keep VM up to date
Threat: Stealing user secrets during capture
- Normal web flow:
- https://twitter.com/login
- https://doubleclick.com/evil-ad
- In an archive, both pages are served from the archive's own domain, so standard cross-domain protections do not apply!
Partial Mitigation: Rewriting
- Rewrite cookies to exact path only
- Rewrite JS to intercept cookie access
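One possible reading of "rewrite cookies to exact path only" is sketched below: the cookie's own Domain and Path attributes are dropped and replaced with the replay path of the page that set it, so that a cookie set by one captured site is not sent to every other captured site on the archive domain. The function name and the /web/ prefix are assumptions for illustration, and this is only a partial mitigation, as the slide says.

```python
def rewrite_set_cookie(header_value: str, replay_path: str) -> str:
    """Scope a captured Set-Cookie header to a single replay path.

    `replay_path` is assumed to be the archive path of the page that set
    the cookie, e.g. "/web/https://twitter.com/login".
    """
    parts = [p.strip() for p in header_value.split(";") if p.strip()]
    if not parts:
        return header_value
    name_value, attrs = parts[0], parts[1:]
    # Drop the original scoping attributes; keep flags such as Secure and HttpOnly.
    kept = [
        attr for attr in attrs
        if attr.split("=", 1)[0].strip().lower() not in ("domain", "path")
    ]
    return "; ".join([name_value] + kept + [f"Path={replay_path}"])
```

For example, a header of `sid=abc; Domain=.twitter.com; Path=/; Secure; HttpOnly` set while capturing the Twitter login page would be rewritten so that its only scope is the replay path of that page.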
Mitigation: Separate Recording Sessions
- For Webrecorder, use separate recording sessions when recording credentialed content
Mitigation: Remote browser
- Record in containerized/proxy mode browser
Threat: Cross-site scripting to steal archive logins
- http://myarchive.com/login is the main institution login page
- http://myarchive.com/web/http://evil.com is a web archive
- safe?
No!
- Cross-site scripting (XSS): an admin who visits http://myarchive.com/web/http://evil.com has their account taken over
Threat: Cross-site scripting to steal archive logins
...across subdomains
- http://myarchive.com/login is the main institution login page
- http://web.myarchive.com/http://evil.com is a web archive
- safe?
Still no ...
- In IE10, evil.com might steal the login cookie
- In all browsers, evil.com can wipe and replace cookies
Mitigation: Run web archive on separate domain
- Use iframes to isolate web archive content
- Load web archive app from app domain
- Load iframe content from content domain
Threat: Live web leakage on playback
- JavaScript can send messages to evil.com and fetch new content
- ... to mislead, track users, or rewrite history
- (Bonus for private archives -- any of your captures could export any of your other captures)
Mitigation: A Content-Security-Policy header can limit access to the web archive domain
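As a rough sketch of this mitigation, the helper below builds a Content-Security-Policy value that only permits loads from the archive's own origins. The function name and the example origin are hypothetical. Inline scripts and eval are left enabled on the assumption that replayed pages often depend on them; the goal here is to restrict where the page can reach, not which scripts it can run.

```python
from typing import Iterable


def build_replay_csp(archive_origins: Iterable[str]) -> str:
    """Build a CSP header value that confines replayed pages to the archive.

    `archive_origins` is assumed to list the origins that serve archived
    content, e.g. ["https://content.myarchive.example"].
    """
    origins = " ".join(archive_origins)
    # default-src covers scripts, images, styles, XHR/fetch, frames, and so on.
    fetch_sources = f"'self' 'unsafe-inline' 'unsafe-eval' data: blob: {origins}"
    directives = [
        f"default-src {fetch_sources}",
        # form-action is not covered by default-src, so it is set explicitly.
        f"form-action 'self' {origins}",
    ]
    return "; ".join(directives)
```

The resulting string would be sent as the value of the Content-Security-Policy response header on every replayed page, so that a request from archived JavaScript to a live site such as evil.com is blocked by the browser.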
Threat: Show different page contents when archived
- Pages can tell they're in an archive and act differently
Mitigation: Run archive in containerized/proxy mode browser
Threat: Banner spoofing
- Pages can dynamically edit the archive's banner
Mitigation: Use iframes for replay
- Don't inject banner into replay frame
- Use X-Frame-Options header to limit embedding
- Serve from separate content domain
- Use iframe sandbox (more restrictive)
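The iframe approach can be sketched as follows: the banner lives in the top-level page served from the app domain, and the archived page is loaded in a sandboxed iframe from the content domain, so its scripts cannot reach the banner's DOM. The function name, element id, and choice of sandbox tokens are assumptions for illustration. Leaving out allow-same-origin is the more restrictive option and may break pages that rely on cookies or local storage.

```python
from html import escape

# Scripts and forms are allowed so pages still function; allow-same-origin and
# allow-top-navigation are deliberately omitted (the more restrictive option).
SANDBOX_TOKENS = "allow-scripts allow-forms"


def replay_frame_html(content_url: str, banner_text: str) -> str:
    """Render the top-level replay page body: banner outside, content inside.

    `content_url` is assumed to point at the separate content domain.
    """
    banner = f'<div id="archive-banner">{escape(banner_text)}</div>'
    frame = (
        f'<iframe src="{escape(content_url, quote=True)}" '
        f'sandbox="{SANDBOX_TOKENS}"></iframe>'
    )
    return banner + "\n" + frame
```

Because the banner is never injected into the replay frame, a captured page that tries to edit it has nothing to edit inside its own document.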
Mitigation: Run archive in containerized/proxy mode browser
Mitigation: Display rendered DOM and strip JavaScript (archive.is)
What's next?
- Build tools for web archive security research
- Challenge researchers to find security issues