Webrecorder: Open Source Web Archiving Toolset
Code4Lib, 2019
Ilya Kreymer, Webrecorder Lead Developer
@IlyaKreymer @webrecorder_io
What is Webrecorder?
- A set of FOSS tools for creating and viewing web archives
- A free, hosted service running on https://webrecorder.io/
- Supports anonymous capture and user account system
- Browser-based capture and access focused on high fidelity
- Goal is web archiving for all!
- Stewarded by Rhizome, an arts non-profit in NYC
- Team of six working on Webrecorder
- Supported by two grants from the Mellon Foundation
How is Webrecorder different?
- Traditional web archiving is crawler based
- Crawler loads URLs, starting with a list of 'seeds'
- Parse HTML to find more URLs to crawl
- Store HTTP traffic in lossless format (WARC)
- Easy to parse HTML == Fast!
- Easy to crawl lots of content in bulk!
- Mostly inadequate for modern websites, because
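The crawl loop described in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular crawler is implemented: the `crawl` function, the `LinkExtractor` class, and the in-memory `PAGES` dictionary are all hypothetical names chosen for this example, and `PAGES` stands in for real HTTP fetching so the sketch runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags in static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, limit=100):
    """Breadth-first crawl from the seeds; `fetch` maps a URL to HTML or None."""
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before queueing.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.com/": '<a href="/about">About</a> <a href="/news#top">News</a>',
    "http://example.com/about": '<a href="/">Home</a>',
    "http://example.com/news": "<script>loadStories()</script>",
}

print(crawl(["http://example.com/"], PAGES.get))
```

In this made-up data the `/news` page builds its content from a script, so the HTML-only parser finds no further links on it.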
What's a WARC?
- Standardized (ISO) file format for web archives
- Concatenated byte-level capture of each HTTP (1.x) request and response
- Optional metadata records (no set standard)
- Webrecorder produces standard WARCs
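To make the "concatenated byte-level capture" idea concrete, here is a rough standard-library sketch of what a request/response pair looks like as WARC records. It is a simplification under stated assumptions: the `warc_record` helper is a hypothetical name, the HTTP messages and date are made-up example data, and fields that real tools normally write (such as digests and a `warcinfo` record) are omitted. For real archives, use a library such as warcio.

```python
import gzip
import uuid


def warc_record(rec_type, target_uri, date, block, content_type):
    """Serialize one WARC/1.1 record: header lines, blank line, block, two CRLFs."""
    headers = [
        "WARC/1.1",
        f"WARC-Type: {rec_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {date}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(block)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + block + b"\r\n\r\n"


# Raw HTTP/1.1 messages as they would appear on the wire (made-up example data).
http_request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
http_response = (
    b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 5\r\n\r\nhello"
)

records = [
    warc_record("request", "http://example.com/", "2019-02-20T12:00:00Z",
                http_request, "application/http; msgtype=request"),
    warc_record("response", "http://example.com/", "2019-02-20T12:00:00Z",
                http_response, "application/http; msgtype=response"),
]

# A .warc.gz is conventionally one gzip member per record, concatenated.
warc_gz = b"".join(gzip.compress(r) for r in records)

# Decompressing the concatenation yields the records back to back.
warc = gzip.decompress(warc_gz)
print(warc.decode("utf-8"))
```

Compressing each record as its own gzip member is what lets a reader seek to a single record by byte offset without decompressing the whole file.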
Remote Browser Example Links:
Web archiving != Archiving the entire web
- Web archives can be small
- Web archives can contain bounded objects
- Quality over quantity
- You can run Webrecorder at your institution today
What if I just want to read/write WARC files?
warcio
- package for creating and reading WARC files
- Make a WARC in 4 lines of Python:
from warcio.capture_http import capture_http
import requests  # must be imported after capture_http
with capture_http('example.warc.gz', warc_version='1.1'):
    requests.get('https://example.com/')
Code: warcio
pywb
Python Wayback / Web Archive Toolkit
- Core "engine" powering Webrecorder
- Create and view WARCs through the browser, via rewritten URLs and an HTTP/S proxy
- Docs: pywb.readthedocs.io
- Code: pywb
What if I want to archive through the browser?
- Create a web archive of a page with a 4-line script:
pip install pywb
wb-manager init my-web-archive
wayback --proxy my-web-archive --proxy-record --live
google-chrome http://localhost:8080/my-web-archive/record/http://example.com/
OR
google-chrome --proxy-server=http://localhost:8080 https://example.com/
What if I want to host a wayback machine/provide access?
- View an archive with a 4-line script:
pip install pywb
wb-manager init my-web-archive
wayback --proxy my-web-archive
google-chrome http://localhost:8080/my-web-archive/http://example.com/
OR
google-chrome --proxy-server=http://localhost:8080 https://example.com/
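The record and replay URLs used in the commands above follow a simple pattern: the collection name, an optional `record` segment, then the target URL. The sketch below builds those URLs in Python. The helper names `record_url` and `replay_url` are hypothetical (they are not part of pywb), and the optional 14-digit timestamp segment assumes the Wayback-style URL convention.

```python
def record_url(collection, target, host="http://localhost:8080"):
    """URL that fetches `target` live and records it into `collection`."""
    return f"{host}/{collection}/record/{target}"


def replay_url(collection, target, timestamp="", host="http://localhost:8080"):
    """URL that replays `target` from `collection`, optionally near a timestamp."""
    prefix = f"{timestamp}/" if timestamp else ""
    return f"{host}/{collection}/{prefix}{target}"


print(record_url("my-web-archive", "http://example.com/"))
print(replay_url("my-web-archive", "http://example.com/"))
print(replay_url("my-web-archive", "http://example.com/", timestamp="20190220120000"))
```

In proxy mode no URL rewriting is needed: the browser requests the original URL and the proxy decides which collection serves it.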
What if I want a simple desktop app for users to browse a web archive?
Webrecorder Player
What if I want a specific browser, e.g. with Flash?
Remote Browser System
- Docker containers each containing a web browser
- Originally developed for oldweb.today
- Preserving browsers with Flash, even Java
- Several versions of Chrome, Firefox
- Access via VNC + WebRTC
- Lots of Code: github.com/oldweb-today
- Docs still needed
- Another Browser: demo
What if I want to make a custom behavior?
Webrecorder Behaviors
- Working on an extensible per-site behavior system
- Will provide a JS library for building behaviors
- API and documentation coming soon
- Talk to us about beta-testing!
- Code: wr-behaviors
What if I want to try it all?
webrecorder/webrecorder
- Full system running on webrecorder.io
- Containerized deployment with Docker Compose
- Adds user, collection management, friendly UI
- Integrates Remote Browser System
- API Backend, React Frontend
- Code: webrecorder
Have we solved web archiving?
Unfortunately, many challenges remain:
- Websockets
- HTTP/2
- Dynamic History / Single Page Apps
- Non-deterministic behaviors (e.g. using time)
Want to help? Contributions Welcome!