WARC Standard Suggestions

Ilya Kreymer, Webrecorder Lead Developer

WARC Provenance Headers Guidance

  • Designed to support creation of WARCs from existing archives, other WARCs, or files on disk
  • Add two new headers: WARC-Source-URI and WARC-Creation-Date
  • WARC-Source-URI indicates the source that a URL was retrieved from.
    If omitted, it is assumed to be the WARC-Target-URI
  • WARC-Creation-Date indicates the datetime that the WARC was created, if different than the datetime of the resource.
    If omitted, it is assumed to be the WARC-Date
  • Interaction with Memento headers, if a resource served with Memento headers is written to a WARC:
    • The Memento-Datetime should be the WARC-Date
    • The URI-R should be the WARC-Target-URI
    • The URI-M should be the WARC-Source-URI
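
The fallback rules and the Memento mapping above can be sketched as two small helpers. This is only an illustration under stated assumptions: `resolveProvenance` and `fromMemento` are hypothetical names, headers are assumed to be a plain object keyed by header name, and Memento-Datetime is assumed to arrive as an HTTP date string.

```javascript
// Hypothetical helper: apply the proposed fallbacks for the two new headers.
// `headers` is assumed to be a plain object keyed by WARC header name.
function resolveProvenance(headers) {
  return {
    // If WARC-Source-URI is omitted, fall back to WARC-Target-URI.
    sourceUri: headers["WARC-Source-URI"] ?? headers["WARC-Target-URI"],
    // If WARC-Creation-Date is omitted, fall back to WARC-Date.
    creationDate: headers["WARC-Creation-Date"] ?? headers["WARC-Date"],
  };
}

// Hypothetical helper: map Memento values onto WARC headers as proposed.
// Memento-Datetime is assumed to be an HTTP date; the WARC-Date is written
// as a UTC timestamp with second precision.
function fromMemento({ mementoDatetime, uriR, uriM }) {
  const warcDate = new Date(mementoDatetime)
    .toISOString()
    .replace(/\.\d{3}Z$/, "Z"); // drop milliseconds
  return {
    "WARC-Date": warcDate,
    "WARC-Target-URI": uriR,
    "WARC-Source-URI": uriM,
  };
}
```

A reader that calls `resolveProvenance` on every record would treat records written without the new headers exactly as before, which keeps the proposal backwards compatible.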

Example 1: Remote URL extraction: Webrecorder extraction/patching from a remote archive

WARC/1.0
WARC-Type: response
...
WARC-Date: 1996-12-26T18:25:58Z
WARC-Creation-Date: 2018-11-13T11:19:07Z
WARC-Target-URI: http://geocities.com/
WARC-Source-URI: https://web.archive.org/web/19961226182558id_/http://geocities.com/
...
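
In Example 1 the WARC-Date and WARC-Target-URI match the timestamp and original URL embedded in the WARC-Source-URI. A rough sketch of deriving them is shown here; `parseWaybackUrl` is a hypothetical name, and the URL layout (`/web/<14-digit timestamp><optional modifier>/<original URL>`) is an assumption based on the example.

```javascript
// Assumed replay URL layout: .../web/<14-digit timestamp><optional modifier>/<original URL>
const WAYBACK_RE = /\/web\/(\d{14})(?:[a-z]{2}_)?\/(.+)$/;

// Hypothetical helper: derive WARC-Date and WARC-Target-URI from a source URI.
// Returns null when the URI does not follow the assumed layout.
function parseWaybackUrl(sourceUri) {
  const match = WAYBACK_RE.exec(sourceUri);
  if (!match) return null;
  const [, ts, targetUri] = match;
  // Reformat YYYYMMDDhhmmss as YYYY-MM-DDThh:mm:ssZ.
  const warcDate =
    `${ts.slice(0, 4)}-${ts.slice(4, 6)}-${ts.slice(6, 8)}` +
    `T${ts.slice(8, 10)}:${ts.slice(10, 12)}:${ts.slice(12, 14)}Z`;
  return { warcDate, targetUri };
}
```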

Example 2: Local file to WARC: a warcit-created WARC file with resource records from files on disk

WARC/1.0
WARC-Type: resource
...
WARC-Date: 2010-01-01T00:00:00Z
WARC-Creation-Date: 2018-11-13T11:24:35Z
WARC-Source-URI: /path/to/files/example.com/index.html
WARC-Target-URI: http://example.com/
...

Example 3: Remote WARC record: A custom metadata record for a WARC record extracted from another WARC

Adds WARC-Source-Range to indicate the byte range used to fetch the record

A new WARC metadata record is added, pointing to the extracted record

WARC/1.0
WARC-Type: metadata
...
WARC-Date: 2010-01-01T00:00:00Z
WARC-Creation-Date: 2018-11-13T11:24:35Z
WARC-Source-URI: http://myarchive.example.com/path/to/warcs/mywarc.warc.gz
WARC-Source-Range: bytes=1000-1999
WARC-Target-URI: http://example.com/
WARC-Concurrent-To: warc-record-id
...


WARC/1.0
WARC-Record-ID: warc-record-id
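
The WARC-Source-Range value in Example 3 could be turned back into a ranged fetch along these lines. This sketch assumes the header uses the same inclusive `bytes=first-last` form as an HTTP Range header, which the example suggests but the proposal does not spell out; `parseSourceRange` and `rangeRequestHeaders` are hypothetical names.

```javascript
// Hypothetical helper: parse a WARC-Source-Range value such as "bytes=1000-1999".
// Assumes an inclusive first-last byte range, as in an HTTP Range header.
function parseSourceRange(value) {
  const match = /^bytes=(\d+)-(\d+)$/.exec(value.trim());
  if (!match) throw new Error(`Unsupported WARC-Source-Range: ${value}`);
  const start = Number(match[1]);
  const end = Number(match[2]);
  if (end < start) throw new Error(`Invalid WARC-Source-Range: ${value}`);
  // Both ends are inclusive, so the record length is end - start + 1.
  return { start, end, length: end - start + 1 };
}

// Hypothetical helper: build request headers for re-fetching the record
// from the WARC named in WARC-Source-URI.
function rangeRequestHeaders(value) {
  const { start, end } = parseSourceRange(value);
  return { Range: `bytes=${start}-${end}` };
}
```

The resulting headers object could be passed to an HTTP client together with the WARC-Source-URI to retrieve only the bytes of the extracted record.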

Experimental Idea: Saving Dynamic History Records

Intercept and capture pushState() calls

    window.history.pushState({"custom": "state"}, "Title", "/relative/url");
    window.history.pushState({"custom": "another"}, "Another Title", "/another/url/dynamic");
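
One possible way to intercept such calls is to wrap `pushState` on the history object and record the arguments before passing them through. This is a minimal sketch, not Webrecorder's actual implementation: `capturePushState` is a hypothetical name, and `fakeHistory` is a stand-in so the example also runs outside a browser.

```javascript
// Wrap pushState on a history-like object so every call is recorded
// before being passed through. Returns the list of recorded calls.
function capturePushState(historyObj) {
  const states = [];
  const original = historyObj.pushState;
  historyObj.pushState = function (state, title, url) {
    states.push([state, title, url]);
    return original.apply(this, arguments);
  };
  return states;
}

// Stand-in for window.history (an assumption for illustration only).
const fakeHistory = {
  entries: [],
  pushState(state, title, url) {
    this.entries.push(url);
  },
};

const states = capturePushState(fakeHistory);
fakeHistory.pushState({ custom: "state" }, "Title", "/relative/url");
fakeHistory.pushState({ custom: "another" }, "Another Title", "/another/url/dynamic");

// Serialize the recorded calls for storage.
const serialized = JSON.stringify({ states });
```

In a page, the same wrapper could be applied with `capturePushState(window.history)`; the original method is still called, so navigation behaves as before.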

Store as JSON list:

    "states": [
        [{"custom": "state"}, "Title", "/relative/url"],
        [{"custom": "another"}, "Another Title", "/another/url/dynamic"]
    ]

WARC-Type: metadata
WARC-Record-ID: 
WARC-Payload-Digest: sha1:MC3EDZY4B4KAOFMPVBUULAJSSWI454UC
WARC-Block-Digest: sha1:MC3EDZY4B4KAOFMPVBUULAJSSWI454UC
Content-Type: application/vnd.pywb-waypoint+json; charset=utf-8
Content-Length: 1054

{
    "base_url": "https://twitter.com/webrecorder_io",
    "base_timestamp": "20180619152005",
    "states": [
        [{
            "inOverlay": true,
            "rollbackCount": 1
        }, "Webrecorder on Twitter: \"Indeed! We are really thrilled to announce that @johnaberlin will be joining the Webrecorder project as a Senior Backend Developer and bringing his #webarchiving expertise from @WebSciDL to the team here at @rhizome!… https://t.co/F2pq2Da7A2\"", "https://twitter.com/webrecorder_io/status/992176688887820288"]
    ],
    "init_state": [null, "Webrecorder (@webrecorder_io) | Twitter", "https://twitter.com/webrecorder_io"],
    "curr_state": [{
        "inOverlay": true,
        "rollbackCount": 1
    }, "Webrecorder on Twitter: \"Indeed! We are really thrilled to announce that @johnaberlin will be joining the Webrecorder project as a Senior Backend Developer and bringing his #webarchiving expertise from @WebSciDL to the team here at @rhizome!… https://t.co/F2pq2Da7A2\"", "https://twitter.com/webrecorder_io/status/992176688887820288"],
    "final_url": "https://twitter.com/webrecorder_io/status/992176688887820288"
}

WARC-Type: metadata
WARC-Record-ID: 
WARC-Payload-Digest: sha1:XENSDL25R7HGVGD5U4ORVUC3Y3QAJK2S
WARC-Block-Digest: sha1:XENSDL25R7HGVGD5U4ORVUC3Y3QAJK2S
Content-Type: application/vnd.pywb-waypoint+json; charset=utf-8
Content-Length: 1033

{
    "base_url": "https://twitter.com/webrecorder_io",
    "base_timestamp": "20180619152005",
    "states": [
        [{
            "inOverlay": true,
            "rollbackCount": 1
        }, "Webrecorder on Twitter: \"Indeed! We are really thrilled to announce that @johnaberlin will be joining the Webrecorder project as a Senior Backend Developer and bringing his #webarchiving expertise from @WebSciDL to the team here at @rhizome!… https://t.co/F2pq2Da7A2\"", "https://twitter.com/webrecorder_io/status/992176688887820288"],
        [{
            "inOverlay": true,
            "rollbackCount": 2
        }, "Uncle Traveller on Twitter: \"Amazing! Congratulations @johnaberlin!! #digipres #webarchives… \"", "https://twitter.com/beet_keeper/status/992293419127984130"]
    ],
    "init_state": [null, "Webrecorder (@webrecorder_io) | Twitter", "https://twitter.com/webrecorder_io"],
    "curr_state": [{
        "inOverlay": true,
        "rollbackCount": 2
    }, "Uncle Traveller on Twitter: \"Amazing! Congratulations @johnaberlin!! #digipres #webarchives… \"", "https://twitter.com/beet_keeper/status/992293419127984130"],
    "final_url": "https://twitter.com/beet_keeper/status/992293419127984130"
}

WARC-Type: metadata
WARC-Record-ID: 
WARC-Payload-Digest: sha1:PSL7KDE5FBQ2U5RF6TQEDQATNYXILG2O
WARC-Block-Digest: sha1:PSL7KDE5FBQ2U5RF6TQEDQATNYXILG2O
Content-Type: application/vnd.pywb-waypoint+json; charset=utf-8
Content-Length: 1698

{
    "base_url": "https://twitter.com/webrecorder_io",
    "base_timestamp": "20180619152005",
    "states": [
        [{
            "inOverlay": true,
            "rollbackCount": 1
        }, "Webrecorder on Twitter: \"Indeed! We are really thrilled to announce that @johnaberlin will be joining the Webrecorder project as a Senior Backend Developer and bringing his #webarchiving expertise from @WebSciDL to the team here at @rhizome!… https://t.co/F2pq2Da7A2\"", "https://twitter.com/webrecorder_io/status/992176688887820288"],
        [{
            "inOverlay": true,
            "rollbackCount": 2
        }, "Uncle Traveller on Twitter: \"Amazing! Congratulations @johnaberlin!! #digipres #webarchives… \"", "https://twitter.com/beet_keeper/status/992293419127984130"],
        [{
            "inOverlay": true,
            "rollbackCount": 3
        }, "Michael L. Nelson on Twitter: \"buried at the bottom of @johnaberlin's excellent MS thesis summary is this exciting news: after this weekend's graduation he will be a back end developer for @webrecorder_io joining @IlyaKreymer, @despens @AnnaPerricci \n@michael_connor et al. at @rhizome!\n\nhttps://t.co/5KxSLmC0kA\"", "https://twitter.com/phonedude_mln/status/991675218715430912"]
    ],
    "init_state": [null, "Webrecorder (@webrecorder_io) | Twitter", "https://twitter.com/webrecorder_io"],
    "curr_state": [{
        "inOverlay": true,
        "rollbackCount": 3
    }, "Michael L. Nelson on Twitter: \"buried at the bottom of @johnaberlin's excellent MS thesis summary is this exciting news: after this weekend's graduation he will be a back end developer for @webrecorder_io joining @IlyaKreymer, @despens @AnnaPerricci \n@michael_connor et al. at @rhizome!\n\nhttps://t.co/5KxSLmC0kA\"", "https://twitter.com/phonedude_mln/status/991675218715430912"],
    "final_url": "https://twitter.com/phonedude_mln/status/991675218715430912"
}

pywb can be extended to replay such records
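
As a rough idea of what replay could involve, the sketch below turns a parsed record payload into an ordered list of history entries. It is not pywb code: `replaySteps` and `isConsistent` are hypothetical names, and the relationships they rely on (the initial state comes first, `curr_state` is the last recorded state, `final_url` is its URL) are inferred from the example records above.

```javascript
// Hypothetical replay helper: build the ordered list of history entries
// a player would step through, starting with the initial state.
function replaySteps(record) {
  const entries = [record.init_state, ...record.states];
  return entries.map(([state, title, url]) => ({
    state,
    title,
    // Resolve relative URLs against the page the capture started from.
    url: new URL(url, record.base_url).href,
  }));
}

// Sanity check for the relationships visible in the example records:
// curr_state is the last recorded state and final_url is its URL.
// The comparison is a simple JSON string match, so key order matters.
function isConsistent(record) {
  const last = record.states[record.states.length - 1];
  if (!last || !record.curr_state) return false;
  return (
    JSON.stringify(last) === JSON.stringify(record.curr_state) &&
    record.final_url === record.curr_state[2]
  );
}
```

A player could then call `history.pushState(step.state, step.title, step.url)` for each step after the first to restore the recorded navigation sequence.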

(tweet example source)

WARC File Extensions

Webrecorder switched WARC downloads from .warc.gz to .warc

  • Browsers (Safari) uncompress the WARC automatically
  • OSes open .warc.gz files with the default .gz decompression utility
  • These utilities often decompress only the first record
  • Users end up with broken WARC files :(

A Goal? To make WARCs as ubiquitous as PDFs for browsing web content!

WARCs have a .warc extension (regardless of compression)
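
If the extension no longer signals compression, a reader has to inspect the content instead. A minimal sketch of that check follows, assuming the file's first bytes are available as a `Uint8Array`; `isGzipped` and `detectWarcEncoding` are hypothetical names.

```javascript
// Gzip streams start with the two magic bytes 0x1f 0x8b.
function isGzipped(bytes) {
  return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
}

// Hypothetical dispatcher: decide how to read a .warc file from its
// first bytes rather than from its extension.
function detectWarcEncoding(bytes) {
  if (isGzipped(bytes)) return "gzip";
  // An uncompressed WARC record starts with a version line such as "WARC/1.0".
  const head = String.fromCharCode(...bytes.slice(0, 5));
  return head === "WARC/" ? "plain" : "unknown";
}
```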

Users would understand that a file with the .warc extension contains a web page

Viewers to open WARCs on the desktop (e.g. Webrecorder Player)

A standard way to specify entrypoints?
