Webrecorder: Open Source Web Archiving Toolset

Code4Lib, 2019

Ilya Kreymer, Webrecorder Lead Developer

@IlyaKreymer @webrecorder_io

What is Webrecorder?

  • A set of FOSS tools for creating and viewing web archives
  • A free, hosted service running on https://webrecorder.io/
  • Supports anonymous capture and a user account system
  • Browser-based capture and access, focused on high fidelity
  • Goal is web archiving for all!
  • Stewarded by Rhizome, an arts non-profit in NYC
  • Team of six working on Webrecorder
  • Supported by two grants from the Mellon Foundation

How is Webrecorder different?

  • Traditional web archiving is crawler based
  • Crawler loads URLs, starting with a list of 'seeds'
  • Parse HTML to find more URLs to crawl
  • Store HTTP traffic in lossless format (WARC)
  • Easy to parse HTML == Fast!
  • Easy to crawl lots of content in bulk!
  • Mostly inadequate for modern websites
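The crawl loop in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular crawler: `LinkParser`, `crawl`, and the in-memory `PAGES` dictionary are hypothetical names, and `fetch` stands in for a real HTTP client. It also hints at one known limit of the approach: links that only exist after JavaScript runs never appear in the static HTML, so the parser never sees them.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collect href values from <a> tags in static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML text, or None on failure."""
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before de-duplicating.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory "web" standing in for real HTTP fetches.
PAGES = {
    "http://example.com/": '<a href="/about">About</a><a href="#top">Top</a>',
    "http://example.com/about": '<a href="/">Home</a><script>/* JS-built links are invisible here */</script>',
}

visited = crawl(["http://example.com/"], PAGES.get)
print(visited)
```

A real crawler would also store the HTTP traffic for each fetched URL; that part is omitted here.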

What's a WARC?

  • Standardized (ISO) file format for web archives
  • Concatenated byte-level capture of each HTTP (1.x) request and response
  • Optional metadata records (no set standard)
  • Webrecorder produces standard WARCs
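To make "concatenated byte-level capture" concrete, here is a rough sketch that serializes one WARC `response` record by hand using only the standard library. The helper name `build_warc_response_record` is hypothetical, and the sketch omits digests, gzip compression, and request records; in practice a library such as warcio should be used instead.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri, http_bytes):
    """Wrap raw HTTP response bytes in a single WARC 'response' record."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_bytes))),
    ]
    # Version line, named headers, then a blank line before the record block.
    head = "WARC/1.1\r\n" + "".join("%s: %s\r\n" % h for h in headers) + "\r\n"
    # Every record ends with two CRLFs.
    return head.encode("utf-8") + http_bytes + b"\r\n\r\n"


body = b"<html>hello</html>"
http_bytes = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
    b"\r\n" + body
)

record = build_warc_response_record("http://example.com/", http_bytes)

# A WARC file is simply records written one after another.
warc_bytes = record + build_warc_response_record("http://example.com/again", http_bytes)
print(record.decode("utf-8"))
```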

Webrecorder Demo!

Remote Browser Example Links:

Web archiving != Archiving the entire web

  • Web archives can be small
  • Web archives can contain bounded objects
  • Quality over quantity
  • You can run Webrecorder at your institution today

The Webrecorder Stack

What if I just want to read/write WARC files?

warcio

  • Python package for creating and reading WARC files
  • Make a WARC in 4 lines of Python:
  • from warcio.capture_http import capture_http
    import requests  # requests must be imported after capture_http
    
    with capture_http('example.warc.gz', warc_version='1.1'):
        requests.get('https://example.com/')
                        
  • Code: warcio

pywb

Python Wayback / Web Archive Toolkit

  • Core "engine" powering Webrecorder
  • Create and view WARCs through the browser, via rewritten URLs or an HTTP/S proxy
  • Docs: pywb.readthedocs.io
  • Code: pywb

What if I want to archive through the browser?

  • Create a web archive of a page with a 4-line script:
  • pip install pywb
    wb-manager init my-web-archive
    wayback --proxy my-web-archive --proxy-record --live
    google-chrome http://localhost:8080/my-web-archive/record/http://example.com/
                        
  • OR
  • google-chrome --proxy-server=http://localhost:8080 https://example.com/
                        

What if I want to host a wayback machine/provide access?

  • View an archive with a 4-line script:
  • pip install pywb
    wb-manager init my-web-archive
    wayback --proxy my-web-archive
    google-chrome http://localhost:8080/my-web-archive/http://example.com/
                        
  • OR
  • google-chrome --proxy-server=http://localhost:8080 https://example.com/
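The record and replay commands above differ only in the URL handed to the browser. As a small sketch, the helpers below build those URLs for a local pywb instance. The function names and the `host` default are assumptions for illustration; the `/<collection>/record/<url>` and `/<collection>/<url>` shapes are taken from the commands above, and the optional 14-digit timestamp segment follows the usual Wayback-style URL convention.

```python
from datetime import datetime, timezone

DEFAULT_HOST = "http://localhost:8080"  # assumed local pywb address, as in the slides


def record_url(collection, target_url, host=DEFAULT_HOST):
    """URL that asks pywb to fetch the live page and record it into a collection."""
    return f"{host}/{collection}/record/{target_url}"


def replay_url(collection, target_url, when=None, host=DEFAULT_HOST):
    """URL that replays a capture; `when` optionally requests a point in time."""
    if when is None:
        return f"{host}/{collection}/{target_url}"
    timestamp = when.strftime("%Y%m%d%H%M%S")  # Wayback-style 14-digit timestamp
    return f"{host}/{collection}/{timestamp}/{target_url}"


print(record_url("my-web-archive", "http://example.com/"))
print(replay_url("my-web-archive", "http://example.com/"))
print(
    replay_url(
        "my-web-archive",
        "http://example.com/",
        when=datetime(2019, 2, 20, 12, 0, 0, tzinfo=timezone.utc),
    )
)
```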
                        

What if I want a simple desktop app for users to browse a web archive?

Webrecorder Player

What if I want a specific browser, e.g. with Flash?

Remote Browser System

  • Docker containers each containing a web browser
  • Originally developed for oldweb.today
  • Preserving browsers with Flash, even Java
  • Several versions of Chrome, Firefox
  • Access via VNC + WebRTC
  • Lots of Code: github.com/oldweb-today
  • Docs still needed
  • Another Browser: demo

What if I want to make a custom behavior?

Webrecorder Behaviors

  • Working on an extensible per-site behavior system
  • Will provide a JS library for building behaviors
  • API and documentation coming soon
  • Talk to us about beta-testing!
  • Code: wr-behaviors

What if I want to try it all?

webrecorder/webrecorder

  • Full system running on webrecorder.io
  • Containerized deployment with Docker Compose
  • Adds user, collection management, friendly UI
  • Integrates Remote Browser System
  • API Backend, React Frontend
  • Code: webrecorder

Any other tools?

Have we solved web archiving?

Unfortunately, many challenges remain:

  • Websockets
  • HTTP/2
  • Dynamic History / Single Page Apps
  • Non-deterministic behaviors (e.g. using time)
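The time-based case can be illustrated with a small sketch. Pages often append the current time to a URL as a cache-buster, so the URL requested at replay never exactly matches the one captured. One common mitigation is to look captures up by a normalized key that ignores such parameters. The parameter names in `VOLATILE_PARAMS` and the helper `lookup_key` are hypothetical, and this is not presented as the matching logic of any Webrecorder tool.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of query parameters that usually carry a timestamp or nonce.
VOLATILE_PARAMS = {"_", "t", "ts", "cachebust"}


def lookup_key(url):
    """Normalize a URL by dropping volatile query parameters and sorting the rest."""
    parts = urlsplit(url)
    stable = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in VOLATILE_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(sorted(stable)), "")
    )


captured = "https://example.com/api?_=1551400000000&page=2"   # URL at capture time
requested = "https://example.com/api?_=1551499999999&page=2"  # URL at replay time

print(captured == requested)                          # exact match fails
print(lookup_key(captured) == lookup_key(requested))  # normalized keys agree
```

Normalization like this only helps when the volatile value sits in the URL; time-dependent logic inside page scripts needs other techniques.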

Want to help? Contributions Welcome!

Thank you

Q & A


Contact:
support@webrecorder.io