Web Archive Scraper

Search the Wayback Machine for archived versions of websites. Extract cached pages, customer lists, testimonials, and partner directories from sites that have changed or gone offline. Uses the free CDX API — no API key needed.

Goose by Athina AI
Install
npx gooseworks install --claude

# Then in your agent:
/gooseworks <prompt> --skill web-archive-scraper
About This Skill

Web Archive Scraper

Search the Wayback Machine (Internet Archive) for archived snapshots of websites. Fetch cached page content to find customer lists, testimonials, partner directories, and other information from sites that have changed or shut down.

Quick Start

The only dependency is requests. No API key is needed.

# Find all snapshots of a URL
python3 skills/web-archive-scraper/scripts/search_archive.py \
  --url "https://botkeeper.com/customers"
 
# Search with date range
python3 skills/web-archive-scraper/scripts/search_archive.py \
  --url "https://botkeeper.com" --from 2025-01-01 --to 2026-02-01
 
# Search all pages under a domain (prefix match)
python3 skills/web-archive-scraper/scripts/search_archive.py \
  --url "https://botkeeper.com" --match prefix --limit 50
 
# Fetch the actual archived page content
python3 skills/web-archive-scraper/scripts/search_archive.py \
  --url "https://botkeeper.com/customers" --fetch
 
# Output formats
python3 skills/web-archive-scraper/scripts/search_archive.py --url URL --output json
python3 skills/web-archive-scraper/scripts/search_archive.py --url URL --output csv
python3 skills/web-archive-scraper/scripts/search_archive.py --url URL --output summary

How It Works

  1. CDX API search — Queries web.archive.org/cdx/search/cdx for snapshots matching the URL
  2. Filtering — Filters by date range, HTTP status code, and MIME type
  3. Dedup — Collapses to one snapshot per day by default to avoid redundant results
  4. Content fetch — Optionally fetches the raw archived HTML (using id_ modifier to skip Wayback toolbar)
  5. Text extraction — Strips HTML tags for readable text output when fetching content
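
A minimal sketch of steps 1–5 using only requests. The endpoint, query parameters, and id_ modifier follow the public CDX API; the helper names below are illustrative, not the skill's actual functions:

import re
import requests

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def search_snapshots(url, match="exact", date_from=None, date_to=None, limit=25):
    # Steps 1-3: query the CDX index, keep HTTP 200 responses,
    # and collapse to one snapshot per day (first 8 timestamp digits = YYYYMMDD)
    params = {
        "url": url,
        "matchType": match,          # exact, prefix, host, domain
        "output": "json",
        "filter": "statuscode:200",
        "collapse": "timestamp:8",
        "limit": str(limit),
    }
    if date_from:
        params["from"] = date_from   # YYYYMMDD
    if date_to:
        params["to"] = date_to
    rows = requests.get(CDX_ENDPOINT, params=params, timeout=30).json()
    if not rows:
        return []
    header, data = rows[0], rows[1:]  # first row lists the field names
    return [dict(zip(header, row)) for row in data]

def fetch_raw_text(snapshot):
    # Step 4: the id_ modifier serves the archived bytes without the Wayback toolbar
    raw_url = (f"https://web.archive.org/web/"
               f"{snapshot['timestamp']}id_/{snapshot['original']}")
    html = requests.get(raw_url, timeout=30).text
    # Step 5: crude tag stripping for readable text output
    return re.sub(r"<[^>]+>", " ", html)

Calling search_snapshots("https://botkeeper.com/customers") and passing the newest record to fetch_raw_text roughly mirrors what the script does with --fetch.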

CLI Reference

Flag          Default    Description
--url         required   Target URL to search in the archive
--match       exact      Match type: exact, prefix, host, domain
--from        none       Start date (YYYY-MM-DD)
--to          none       End date (YYYY-MM-DD)
--limit       25         Max number of snapshots to return
--fetch       false      Fetch and display the content of the most recent snapshot
--fetch-all   false      Fetch content of ALL matched snapshots (use with a small --limit)
--status      200        HTTP status filter (set to "any" to include all)
--output      json       Output format: json, csv, summary
--collapse    day        Dedup level: none, day, month, year

Output Schema

{
  "url": "https://botkeeper.com/customers",
  "timestamp": "20250915143022",
  "datetime": "2025-09-15T14:30:22",
  "status_code": "200",
  "mime_type": "text/html",
  "archive_url": "https://web.archive.org/web/20250915143022/https://botkeeper.com/customers",
  "raw_url": "https://web.archive.org/web/20250915143022id_/https://botkeeper.com/customers",
  "content": "..."
}

The content field is only populated when --fetch or --fetch-all is used.
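
If you consume this output from another script, a minimal sketch is shown below. It assumes --output json prints a JSON list of records shaped like the one above to stdout; adjust the parsing if the script emits one object per line:

import json
import subprocess

# Illustrative consumer: run the skill script and read its JSON output
result = subprocess.run(
    ["python3", "skills/web-archive-scraper/scripts/search_archive.py",
     "--url", "https://botkeeper.com/customers", "--output", "json"],
    capture_output=True, text=True, check=True,
)
for snap in json.loads(result.stdout):
    print(snap["datetime"], snap["archive_url"])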

Cost

Free. The Wayback Machine CDX API requires no authentication or API key. Rate limit is ~15 requests/minute.
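
If you fetch many raw_url values yourself rather than through the script, spacing requests about four seconds apart keeps you under that limit. A minimal sketch (the function name is illustrative):

import time
import requests

MIN_INTERVAL = 4.0  # ~15 requests/minute

def fetch_politely(raw_urls):
    # Pause between requests so bulk fetches respect the Wayback Machine rate limit
    pages = []
    for url in raw_urls:
        pages.append(requests.get(url, timeout=30).text)
        time.sleep(MIN_INTERVAL)
    return pages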

Common Use Cases

  • Find customer lists from shut-down companies (e.g., botkeeper.com)
  • Recover testimonials/case studies before a site redesign
  • Track how a competitor's messaging changed over time
  • Find partner directories that have been removed

