Webrecorder Open Source Tools

Webrecorder: Open Source Web Archiving Toolset

Code4Lib, 2019

Ilya Kreymer, Webrecorder Lead Developer

@IlyaKreymer @webrecorder_io

What is Webrecorder?

A set of FOSS tools for creating and viewing web archives
A free, hosted service running on https://webrecorder.io/
Supports anonymous capture and a user account system
Browser-based capture and access, focused on high fidelity
Goal is web archiving for all!
Stewarded by Rhizome, an arts non-profit in NYC
Team of six working on Webrecorder
Supported by two grants from the Mellon Foundation

How is Webrecorder different?

Traditional web archiving is crawler based

Crawler loads URLs, starting with a list of 'seeds'
Parses HTML to find more URLs to crawl
Stores HTTP traffic in a lossless format (WARC)
Easy to parse HTML == Fast!
Easy to crawl lots of content in bulk!
Mostly inadequate for modern websites, because

What's a WARC?

Standardized (ISO) file format for web archives
Concatenated byte-level capture of each HTTP (1.x) request and response
Optional metadata records (no set standard)
Webrecorder produces standard WARCs
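To make "concatenated byte-level capture" concrete, here is a minimal sketch of the record layout using only the Python standard library. The helper names build_record and iter_records are hypothetical, invented for this illustration. The sketch assumes uncompressed WARC/1.0 records; real .warc.gz files usually gzip each record separately, and a library such as warcio handles the details.

```python
import uuid
from datetime import datetime, timezone


def build_record(rec_type, target_uri, block):
    """Serialize one WARC/1.0 record: header lines, blank line, block, two CRLFs."""
    headers = [
        ("WARC-Type", rec_type),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=" + rec_type),
        ("Content-Length", str(len(block))),
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % h for h in headers) + "\r\n"
    return head.encode("utf-8") + block + b"\r\n\r\n"


def iter_records(data):
    """Yield (headers, block) for each record in a concatenated, uncompressed WARC."""
    pos = 0
    while pos < len(data):
        head_end = data.index(b"\r\n\r\n", pos)  # end of the WARC header section
        lines = data[pos:head_end].decode("utf-8").split("\r\n")
        headers = dict(line.split(": ", 1) for line in lines[1:])  # lines[0] is the version
        length = int(headers["Content-Length"])
        block_start = head_end + 4
        yield headers, data[block_start:block_start + length]
        pos = block_start + length + 4  # skip the two CRLFs that end the record


# The raw HTTP/1.x request and response bytes are stored as-is in the record blocks.
request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
response = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 5\r\n\r\nhello"

warc = (build_record("request", "http://example.com/", request)
        + build_record("response", "http://example.com/", response))

records = list(iter_records(warc))
for headers, block in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(block))
```

The parser relies on Content-Length rather than searching for a delimiter, because the HTTP block itself contains blank lines.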

Webrecorder Demo!

Remote Browser Example Links:

Web archiving != Archiving the entire web

Web archives can be small
Web archives can contain bounded objects
Quality over quantity
You can run Webrecorder at your institution today

The Webrecorder Stack

Componentized Architecture
Python and JS
Lots of tools for developers
https://github.com/webrecorder

What if I just want to read/write WARC files?

warcio

Python package for creating and reading WARC files
Make a WARC in 4 lines of Python:

from warcio.capture_http import capture_http
import requests  # must be imported after capture_http so its HTTP traffic is captured

with capture_http('example.warc.gz', warc_version='1.1'):
    requests.get('https://example.com/')

Code: warcio

pywb

Python Wayback / Web Archive Toolkit

Core "engine" powering Webrecorder
Create and view WARCs through the browser, via rewritten URLs or an HTTP/S proxy
Docs: pywb.readthedocs.io
Code: pywb

What if I want to archive through the browser?

Create a web archive of a page with a 4-line script:

pip install pywb
wb-manager init my-web-archive
wayback --proxy my-web-archive --proxy-record --live
google-chrome http://localhost:8080/my-web-archive/record/http://example.com/

google-chrome --proxy-server=http://localhost:8080 https://example.com/

What if I want to host a wayback machine/provide access?

View an archive with a 4-line script:

pip install pywb
wb-manager init my-web-archive
wayback --proxy my-web-archive
google-chrome http://localhost:8080/my-web-archive/http://example.com/

google-chrome --proxy-server=http://localhost:8080 https://example.com/
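The rewritten-URL form used in the commands above can be built programmatically. This is a minimal sketch: replay_url and record_url are hypothetical helper names, not part of pywb, and the optional 14-digit timestamp segment is an assumption based on the usual Wayback-style URL layout.

```python
from datetime import datetime


def replay_url(collection, url, when=None, host="http://localhost:8080"):
    """Build a replay URL: <host>/<collection>/[<timestamp>/]<original url>."""
    parts = [host.rstrip("/"), collection]
    if when is not None:
        # Assumed Wayback-style timestamp: YYYYMMDDhhmmss
        parts.append(when.strftime("%Y%m%d%H%M%S"))
    return "/".join(parts) + "/" + url


def record_url(collection, url, host="http://localhost:8080"):
    """Build a recording URL: <host>/<collection>/record/<original url>."""
    return "/".join([host.rstrip("/"), collection, "record"]) + "/" + url


print(replay_url("my-web-archive", "http://example.com/"))
# http://localhost:8080/my-web-archive/http://example.com/
print(replay_url("my-web-archive", "http://example.com/", datetime(2019, 2, 20, 12, 0, 0)))
# http://localhost:8080/my-web-archive/20190220120000/http://example.com/
print(record_url("my-web-archive", "http://example.com/"))
# http://localhost:8080/my-web-archive/record/http://example.com/
```

In proxy mode no rewriting is needed: the browser requests the original URL and the proxy serves it from the archive.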

What if I want a simple desktop app for users to browse a web archive?

Webrecorder Player

Electron Desktop App for OSX, Windows, Linux
Open and browse any WARC file locally, offline
UI consistent with webrecorder.io
Released via GitHub
Code: webrecorder-player

What if I want a specific browser, eg. with Flash?

Remote Browser System

Docker containers each containing a web browser
Originally developed for oldweb.today
Preserving browsers with Flash, even Java
Several versions of Chrome, Firefox
Access via VNC + WebRTC
Lots of Code: github.com/oldweb-today
Docs still needed
Another Browser: demo

What if I want to make a custom behavior?

Webrecorder Behaviors

Working on an extensible per-site behavior system
Will provide a JS library for building behaviors
API and documentation coming soon
Talk to us about beta-testing!
Code: wr-behaviors

What if I want to try it all!

webrecorder/webrecorder

Full system running on webrecorder.io
Containerized deployment with Docker Compose
Adds user, collection management, friendly UI
Integrates Remote Browser System
API Backend, React Frontend
Code: webrecorder

Any other tools?

webrecorder-deploy -- Ansible playbook for Webrecorder deployment
warcit -- turn files on disk into WARCs
har2warc -- convert HAR files into WARCs

Have we solved web archiving?

Unfortunately, many challenges remain:

Websockets
HTTP/2
Dynamic History / Single Page Apps
Non-deterministic behaviors (eg. using time)

Want to help? Contributions Welcome!

Thank you

Q & A

Contact:
support@webrecorder.io
