Crawl a website's sitemap and blog index to build a complete content inventory. Lists every page with URL, title, publish date, content type, and topic cluster. Groups content by category and topic. Optionally deep-reads top N pages for quality analysis and funnel stage tagging. Use before SEO audits, content gap analysis, or brand voice extraction.
```shell
npx gooseworks install --claude
# Then in your agent:
/gooseworks <prompt> --skill site-content-catalog
```
```shell
# Basic content inventory
python3 skills/site-content-catalog/scripts/catalog_content.py \
  --domain "example.com"

# With deep analysis of top 20 pages
python3 skills/site-content-catalog/scripts/catalog_content.py \
  --domain "example.com" --deep-analyze 20

# Output to specific file
python3 skills/site-content-catalog/scripts/catalog_content.py \
  --domain "example.com" --output clients/acme/research/content-inventory.json
```

| Parameter | Required | Default | Description |
|---|---|---|---|
| `--domain` | Yes | — | Domain to catalog (e.g., "example.com") |
| `--deep-analyze` | No | 0 | Number of top pages to deep-read for content analysis |
| `--output` | No | stdout | Path to save JSON output |
| `--include-non-blog` | No | true | Also catalog landing pages, docs, etc. (not just blog) |
The script attempts multiple methods to find all pages on a site, in order:

1. **Sitemaps:** `https://[domain]/sitemap.xml`, `/sitemap_index.xml`, `/sitemap-index.xml`, `/wp-sitemap.xml`
2. **robots.txt:** parsed for `Sitemap:` directives
3. **Feeds:** `/feed`, `/rss`, `/atom.xml`, `/blog/feed`, etc.
4. **Index pages:** `/blog`, `/resources`, `/insights`, `/news`, `/articles` (following pagination: `/blog/page/2`, `?page=2`, etc.)
5. **Search queries:** `site:[domain]` to estimate total indexed pages, `site:[domain]/blog` to find blog content, `site:[domain] intitle:` to discover page title patterns
6. **Apify fallback:** the `onescales/sitemap-url-extractor` actor
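The sitemap and robots.txt steps can be sketched as follows. This is a simplified illustration, not the script's actual implementation; `parse_sitemap` and `sitemap_candidates` are hypothetical helper names:

```python
import urllib.request
import xml.etree.ElementTree as ET

# Common sitemap locations, tried in order
SITEMAP_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap-index.xml", "/wp-sitemap.xml"]
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text: str) -> list[str]:
    """Extract every <loc> URL from a <urlset> or <sitemapindex> document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]

def sitemap_candidates(domain: str) -> list[str]:
    """Standard sitemap locations plus any Sitemap: directives in robots.txt."""
    urls = [f"https://{domain}{path}" for path in SITEMAP_PATHS]
    try:
        robots = urllib.request.urlopen(
            f"https://{domain}/robots.txt", timeout=10).read().decode("utf-8", "replace")
        urls += [line.split(":", 1)[1].strip()
                 for line in robots.splitlines()
                 if line.lower().startswith("sitemap:")]
    except OSError:
        pass  # no robots.txt, or host unreachable; fall back to the standard paths
    return urls
```

A sitemap index nests child sitemaps in `<loc>` elements just like a `<urlset>` nests page URLs, so the same parser covers both; a real crawler would recurse into index entries.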
Classify based on URL patterns and page titles:
| Type | URL Patterns | Examples |
|---|---|---|
| `blog-post` | `/blog/`, `/posts/`, `/articles/` | How-to guides, opinion pieces |
| `case-study` | `/case-study/`, `/customers/`, `/success-stories/` | Customer stories |
| `comparison` | `/vs/`, `/compare/`, `/alternative/` | X vs Y pages |
| `landing-page` | `/solutions/`, `/use-cases/`, `/for-` | Product marketing pages |
| `docs` | `/docs/`, `/help/`, `/documentation/`, `/api/` | Technical documentation |
| `changelog` | `/changelog/`, `/releases/`, `/whats-new/` | Product updates |
| `pricing` | `/pricing/` | Pricing page |
| `about` | `/about/`, `/team/`, `/careers/` | Company pages |
| `legal` | `/privacy/`, `/terms/`, `/security/` | Legal/compliance |
| `resource` | `/resources/`, `/guides/`, `/ebooks/`, `/webinars/` | Gated/downloadable content |
| `glossary` | `/glossary/`, `/dictionary/`, `/terms/` | SEO glossary pages |
| `integration` | `/integrations/`, `/apps/`, `/marketplace/` | Integration pages |
| `other` | — | Anything else |
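A first-match rule table like the one above is straightforward to sketch. This is an illustration under the assumption of first-match-wins ordering, not the script's actual code:

```python
# Rules mirror the table above; the first matching pattern wins,
# so "/terms/" resolves to legal before glossary is considered.
TYPE_RULES = [
    ("blog-post",    ("/blog/", "/posts/", "/articles/")),
    ("case-study",   ("/case-study/", "/customers/", "/success-stories/")),
    ("comparison",   ("/vs/", "/compare/", "/alternative/")),
    ("landing-page", ("/solutions/", "/use-cases/", "/for-")),
    ("docs",         ("/docs/", "/help/", "/documentation/", "/api/")),
    ("changelog",    ("/changelog/", "/releases/", "/whats-new/")),
    ("pricing",      ("/pricing/",)),
    ("about",        ("/about/", "/team/", "/careers/")),
    ("legal",        ("/privacy/", "/terms/", "/security/")),
    ("resource",     ("/resources/", "/guides/", "/ebooks/", "/webinars/")),
    ("glossary",     ("/glossary/", "/dictionary/", "/terms/")),
    ("integration",  ("/integrations/", "/apps/", "/marketplace/")),
]

def classify_url(url: str) -> str:
    """Return the first content type whose URL pattern matches, else 'other'."""
    path = url.lower()
    for page_type, patterns in TYPE_RULES:
        if any(p in path for p in patterns):
            return page_type
    return "other"
```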
Pages are grouped into topic clusters by extracting topic signals from URL slugs and page titles.
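As an illustration of slug-based signal extraction (a sketch; the stopword list and the `topic_tokens` helper are illustrative, not the script's actual code):

```python
import re

# Words that carry no topical signal in slugs or titles (illustrative list)
STOPWORDS = {"how", "to", "the", "a", "an", "your", "for", "and",
             "of", "in", "with", "what", "is", "why", "by"}

def topic_tokens(url: str, title: str = "") -> list[str]:
    """Split the final URL slug and the title into candidate topic words."""
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    words = re.split(r"[-_\s]+", f"{slug} {title}".lower())
    return [w for w in words if w.isalpha() and w not in STOPWORDS]
```

Counting these tokens across all pages, then merging pages that share the most frequent tokens, gives a rough clustering without any external NLP dependency.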
Publishing cadence (average posts per month, trend, and most recent post) is computed from the dated content, primarily blog posts.
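One way to derive those cadence figures from a list of post dates, as a sketch (the `publishing_cadence` helper is hypothetical):

```python
from collections import Counter
from datetime import date

def publishing_cadence(dates: list[date]) -> dict:
    """Average posts/month, a crude trend, and the most recent post date."""
    if not dates:
        return {"posts_per_month_avg": 0.0, "trend": "unknown", "most_recent": None}
    by_month = Counter((d.year, d.month) for d in dates)
    months = sorted(by_month)
    avg = round(len(dates) / len(months), 1)
    # Trend: compare the later half of active months against the earlier half
    half = len(months) // 2
    if half:
        early = sum(by_month[m] for m in months[:half]) / half
        late = sum(by_month[m] for m in months[half:]) / (len(months) - half)
        trend = "increasing" if late > early else "decreasing" if late < early else "flat"
    else:
        trend = "flat"
    return {"posts_per_month_avg": avg,
            "trend": trend,
            "most_recent": max(dates).isoformat()}
```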
If `--deep-analyze N` is specified, the script fetches the top N pages (prioritizing blog posts) and extracts per-page metrics such as word count, target keyword, funnel stage, content depth, and the presence of images and CTAs. Example JSON output:
```json
{
  "domain": "example.com",
  "crawl_date": "2026-02-25",
  "total_pages": 347,
  "discovery_methods": ["sitemap.xml", "rss"],
  "pages": [
    {
      "url": "https://example.com/blog/reduce-aws-costs",
      "title": "How to Reduce Your AWS Bill by 40%",
      "date": "2025-11-15",
      "type": "blog-post",
      "topic_cluster": "Cloud Cost Optimization",
      "deep_analysis": {
        "word_count": 2100,
        "target_keyword": "reduce aws costs",
        "funnel_stage": "TOFU",
        "content_depth": "deep",
        "has_images": true,
        "has_cta": true
      }
    }
  ],
  "summary": {
    "by_type": {"blog-post": 89, "landing-page": 23, "case-study": 12, ...},
    "by_topic": {"Cloud Cost Optimization": 34, "FinOps": 18, ...},
    "publishing_cadence": {
      "posts_per_month_avg": 4.2,
      "trend": "increasing",
      "most_recent": "2026-02-20"
    }
  }
}
```

# Content Inventory: example.com
**Crawled:** 2026-02-25 | **Total pages:** 347
## Content by Type
| Type | Count | % |
|------|-------|---|
| Blog Posts | 89 | 25.6% |
| Landing Pages | 23 | 6.6% |
| ...
## Content by Topic Cluster
| Topic | Posts | Most Recent |
|-------|-------|-------------|
| Cloud Cost Optimization | 34 | 2026-02-20 |
| ...
## Publishing Cadence
- Average: 4.2 posts/month
- Trend: Increasing (3.1 → 5.4 over last 6 months)
- Most recent: 2026-02-20
## Full Catalog
| # | Date | Type | Topic | Title | URL |
|---|------|------|-------|-------|-----|
| 1 | 2026-02-20 | blog-post | Cloud Cost | How to Reduce... | https://... |

Requirements:
- `requests` library (`pip install requests`)
- `APIFY_API_TOKEN` env var (only for Apify fallback mode)
- Apify actor `onescales/sitemap-url-extractor` (only for Apify fallback mode)