Webrecorder: Open Source Web Archiving Toolset
                Code4Lib, 2019
                Ilya Kreymer, Webrecorder Lead Developer
                @IlyaKreymer @webrecorder_io
            
            
                What is Webrecorder?
                
                    - A set of FOSS tools for creating and viewing web archives
- A free, hosted service running on https://webrecorder.io/
- Supports anonymous capture as well as user accounts
- Browser-based capture and access, focused on high fidelity
- Goal is web archiving for all!
- Stewarded by Rhizome, an arts non-profit in NYC
- Team of six working on Webrecorder
- Supported by two grants from the Mellon Foundation
How is Webrecorder different?
                
                    - Traditional web archiving is crawler based

                    - Crawler loads URLs, starting with a list of 'seeds'
- Parses HTML to find more URLs to crawl
- Stores HTTP traffic in a lossless format (WARC)
- Easy to parse HTML == Fast!
- Easy to crawl lots of content in bulk!
- Mostly inadequate for modern websites
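
The crawler loop described above can be sketched in a few lines of standard-library Python. This is only an illustration, not the code of any particular crawler: `LinkExtractor`, `crawl`, and the in-memory `FAKE_SITE` (which stands in for real HTTP fetches) are hypothetical names.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(seeds, fetch, limit=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "site" standing in for real HTTP fetches.
FAKE_SITE = {
    "http://example.com/": '<a href="/about">About</a> <a href="/news">News</a>',
    "http://example.com/about": '<a href="/">Home</a>',
    # Anything loadFeed() would add at runtime is invisible to the parser.
    "http://example.com/news": '<a href="/about">About</a><script>loadFeed()</script>',
}

order = crawl(["http://example.com/"], FAKE_SITE.get)
print(order)
```

The sketch only sees links present in the fetched HTML. Links or content that a page adds through scripts at runtime never reach the queue.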

                
What's a WARC?
                
                    - Standardized (ISO) file format for web archives
- Concatenated byte-level capture of each HTTP (1.x) request and response
- Optional metadata records (no set standard)
- Webrecorder produces standard WARCs
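
To make the record layout concrete, here is a minimal sketch that assembles a WARC `response` record by hand with only the standard library. `build_warc_response` is a hypothetical helper written for illustration; real tools such as warcio also handle digests, gzip, and other record types.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response(target_uri, http_response):
    """Wrap raw HTTP response bytes in a single WARC 'response' record."""
    warc_headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        # Length of the record block, i.e. the captured HTTP bytes.
        ("Content-Length", str(len(http_response))),
    ]
    head = "WARC/1.1\r\n" + "".join("%s: %s\r\n" % h for h in warc_headers)
    # Version line + headers, blank line, block, then two CRLFs end the record.
    return head.encode("utf-8") + b"\r\n" + http_response + b"\r\n\r\n"


http_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)

record = build_warc_response("http://example.com/", http_response)
# A WARC file is simply records appended one after another.
warc_bytes = record + build_warc_response("http://example.com/other", http_response)
print(record.decode("utf-8"))
```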
                    
            
            
                Web archiving != Archiving the entire web
                
                    - Web archives can be small
- Web archives can contain bounded objects
- Quality over quantity
- You can run Webrecorder at your institution today
What if I just want to read/write WARC files?
                warcio
                
                    - package for creating and reading WARC files
- Make a WARC in 4 lines of Python:
# capture_http must be imported before requests so it can hook HTTP traffic
from warcio.capture_http import capture_http
import requests
with capture_http('example.warc.gz', warc_version='1.1'):
    requests.get('https://example.com/')
                    
                    Code: warcio
                
            
            
                pywb
                Python Wayback / Web Archive Toolkit
                
                    - Core "engine" powering Webrecorder
- Create and view WARCs through the browser, via rewritten URLs or an HTTP/S proxy
- Docs: pywb.readthedocs.io
- Code: pywb
What if I want to archive through the browser?
                
                    - Create a web archive of a page with a 4-line script:
pip install pywb
wb-manager init my-web-archive
wayback --proxy my-web-archive --proxy-record --live
google-chrome http://localhost:8080/my-web-archive/record/http://example.com/
                    
OR
                    
google-chrome --proxy-server=http://localhost:8080 https://example.com/
                    
                
            
            
                What if I want to host a wayback machine/provide access?
                
                    - View an archive with a 4-line script:
pip install pywb
wb-manager init my-web-archive
wayback --proxy my-web-archive
google-chrome http://localhost:8080/my-web-archive/http://example.com/
                    
OR
                    
google-chrome --proxy-server=http://localhost:8080 https://example.com/
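
The rewritten-URL patterns used in the record and replay commands can be sketched as two small helpers. The function names and the default host are assumptions for illustration. The optional 14-digit timestamp follows the usual Wayback-style `/collection/YYYYMMDDhhmmss/url` form.

```python
DEFAULT_HOST = "http://localhost:8080"  # assumed local pywb instance


def record_url(collection, target, host=DEFAULT_HOST):
    """URL that loads `target` live and records it into `collection`."""
    return "%s/%s/record/%s" % (host.rstrip("/"), collection, target)


def replay_url(collection, target, timestamp=None, host=DEFAULT_HOST):
    """URL that replays `target` from `collection`, optionally at a timestamp."""
    parts = [host.rstrip("/"), collection]
    if timestamp:
        parts.append(timestamp)  # e.g. "20190220000000"
    return "/".join(parts) + "/" + target


print(record_url("my-web-archive", "http://example.com/"))
print(replay_url("my-web-archive", "http://example.com/"))
print(replay_url("my-web-archive", "http://example.com/", "20190220000000"))
```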
                    
                
            
            
                What if I want a simple desktop app for users to browse a web archive?
                Webrecorder Player
                 
                
            
                What if I want a specific browser, e.g. with Flash?
                Remote Browser System
                
                    - Docker containers each containing a web browser
- Originally developed for oldweb.today
- Preserving browsers with Flash, even Java
- Several versions of Chrome, Firefox
- Access via VNC + WebRTC
- Lots of Code: github.com/oldweb-today
- Docs still needed
- Another Browser: demo
                
What if I want to make a custom behavior?
                Webrecorder Behaviors
                
                    - Working on an extensible per-site behavior system
- Will provide a JS library for building behaviors
- API and documentation coming soon
- Talk to us about beta-testing!
- Code: wr-behaviors
What if I want to try it all?
                webrecorder/webrecorder
                
                    - Full system running on webrecorder.io
- Containerized deployment with Docker Compose
- Adds user, collection management, friendly UI
- Integrates Remote Browser System
- API Backend, React Frontend
- Code: webrecorder
Have we solved web archiving?
                Unfortunately, many challenges remain:
                
                    - WebSockets
- HTTP/2
- Dynamic History / Single Page Apps
- Non-deterministic behaviors (e.g. using time)
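
As one illustration of the time-based case: a script may append the current time to a request URL, so the URL requested at replay never exactly matches the one recorded. The sketch below shows one possible mitigation, matching on a canonical key that ignores volatile query parameters. The parameter list and `canonical_key` are hypothetical, and this is not presented as how Webrecorder handles the problem.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of query parameters that usually carry timestamps.
VOLATILE_PARAMS = {"_", "t", "ts", "cachebust"}


def canonical_key(url):
    """Drop volatile parameters and sort the rest to get a stable lookup key."""
    parts = urlsplit(url)
    stable = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in VOLATILE_PARAMS
    ]
    query = urlencode(sorted(stable))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


recorded = "https://example.com/api/feed?page=2&_=1551800000000"
replayed = "https://example.com/api/feed?page=2&_=1551899999999"

print(recorded == replayed)
print(canonical_key(recorded) == canonical_key(replayed))
```

This only covers time values that appear in the URL. Time-dependent logic inside the page's own scripts needs other techniques.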
Want to help? Contributions Welcome!