Crawl a website's sitemap and blog index to build a complete content inventory. Lists every page with URL, title, publish date, content type, and topic cluster. Groups content by category and topic. Optionally deep-reads top N pages for quality analysis and funnel stage tagging. Use before SEO audits, content gap analysis, or brand voice extraction.
Install with `npx gooseworks install --claude`, then in your agent run `/gooseworks <prompt> --skill site-content-catalog`.
Crawl a website's sitemap and blog to build a complete content inventory — every page cataloged with URL, title, date, content type, and topic cluster. Groups content by category, identifies publishing patterns, and optionally deep-analyzes top pages.
# Basic content inventory
python3 scripts/catalog_content.py --domain "example.com"
# With deep analysis of top 20 pages
python3 scripts/catalog_content.py --domain "example.com" --deep-analyze 20
# Output to specific file
python3 scripts/catalog_content.py --domain "example.com" --output content-inventory.json

| Parameter | Required | Default | Description |
|---|---|---|---|
| domain | Yes | — | Domain to catalog (e.g., "example.com") |
| deep-analyze | No | 0 | Number of top pages to deep-read for content analysis |
| output | No | stdout | Path to save JSON output |
| include-non-blog | No | true | Also catalog landing pages, docs, etc. (not just blog) |
The script attempts multiple methods to find all pages on a site, in order:
1. Sitemap: `https://[domain]/sitemap.xml`, then `/sitemap_index.xml`, `/sitemap-index.xml`, `/wp-sitemap.xml`
2. `robots.txt` for `Sitemap:` directives
3. Feeds: `/feed`, `/rss`, `/atom.xml`, `/blog/feed`, etc.
4. Index pages: `/blog`, `/resources`, `/insights`, `/news`, `/articles`, including pagination (`/blog/page/2`, `?page=2`, etc.)
5. Site search: `site:[domain]` to estimate total indexed pages, `site:[domain]/blog` to find blog content, `site:[domain] intitle:` to discover page title patterns
6. Apify fallback: `onescales/sitemap-url-extractor`

For each discovered URL, classify by:
Classify based on URL patterns and page titles:
| Type | URL Patterns | Examples |
|---|---|---|
| blog-post | /blog/, /posts/, /articles/ | How-to guides, opinion pieces |
| case-study | /case-study/, /customers/, /success-stories/ | Customer stories |
| comparison | /vs/, /compare/, /alternative/ | X vs Y pages |
| landing-page | /solutions/, /use-cases/, /for-/ | Product marketing pages |
| docs | /docs/, /help/, /documentation/, /api/ | Technical documentation |
| changelog | /changelog/, /releases/, /whats-new/ | Product updates |
| pricing | /pricing/ | Pricing page |
| about | /about/, /team/, /careers/ | Company pages |
| legal | /privacy/, /terms/, /security/ | Legal/compliance |
| resource | /resources/, /guides/, /ebooks/, /webinars/ | Gated/downloadable content |
| glossary | /glossary/, /dictionary/, /terms/ | SEO glossary pages |
| integration | /integrations/, /apps/, /marketplace/ | Integration pages |
| other | — | Anything else |
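
The URL-pattern half of this classification could be sketched roughly as follows. This is an illustration, not the code in `scripts/catalog_content.py`: the names `TYPE_PATTERNS` and `classify_url` are hypothetical, matching is plain substring search on the URL path, and the first matching row wins. That tie-break is an assumption; it means `/terms/` resolves to `legal` rather than `glossary`. The `/for-/` pattern is read here as the prefix `/for-` so that paths such as `/for-startups/` match.

```python
from urllib.parse import urlparse

# (type, path fragments) pairs copied from the table above, in table order.
# Assumption: the first matching row wins.
TYPE_PATTERNS = [
    ("blog-post", ("/blog/", "/posts/", "/articles/")),
    ("case-study", ("/case-study/", "/customers/", "/success-stories/")),
    ("comparison", ("/vs/", "/compare/", "/alternative/")),
    ("landing-page", ("/solutions/", "/use-cases/", "/for-")),
    ("docs", ("/docs/", "/help/", "/documentation/", "/api/")),
    ("changelog", ("/changelog/", "/releases/", "/whats-new/")),
    ("pricing", ("/pricing/",)),
    ("about", ("/about/", "/team/", "/careers/")),
    ("legal", ("/privacy/", "/terms/", "/security/")),
    ("resource", ("/resources/", "/guides/", "/ebooks/", "/webinars/")),
    ("glossary", ("/glossary/", "/dictionary/", "/terms/")),
    ("integration", ("/integrations/", "/apps/", "/marketplace/")),
]


def classify_url(url: str) -> str:
    """Return a content type for a URL based on its path alone."""
    path = urlparse(url).path.lower()
    # Normalize so that "/pricing" and "/pricing/" match the same fragment.
    if not path.endswith("/"):
        path += "/"
    for content_type, fragments in TYPE_PATTERNS:
        if any(fragment in path for fragment in fragments):
            return content_type
    return "other"
```

Page titles are not consulted in this sketch; a fuller version would presumably use them to break ties or to classify URLs that fall through to `other`.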
Group pages into topic clusters by extracting topic signals from URL slugs and titles.
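
As a rough sketch of what extracting topic signals from slugs might look like, the example below tokenizes the last path segment of each URL and keeps tokens shared by several pages. The helper names, the stopword list, and the `min_pages` threshold are all assumptions made for illustration; titles are ignored here, and the actual script may cluster differently.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical stopword list; tune for the site being cataloged.
STOPWORDS = {"a", "an", "and", "by", "for", "how", "in", "of", "the", "to",
             "vs", "with", "your"}


def slug_tokens(url: str) -> list[str]:
    """Split the last path segment of a URL into lowercase keyword tokens."""
    path = urlparse(url).path.rstrip("/")
    slug = path.rsplit("/", 1)[-1].lower()
    return [
        token
        for token in re.split(r"[^a-z0-9]+", slug)
        if token and token not in STOPWORDS and not token.isdigit()
    ]


def topic_signals(urls: list[str], min_pages: int = 2) -> dict[str, int]:
    """Count how many pages mention each token; keep tokens on >= min_pages pages."""
    counts: Counter[str] = Counter()
    for url in urls:
        # set() so a token repeated within one slug counts once per page.
        counts.update(set(slug_tokens(url)))
    return {token: n for token, n in counts.items() if n >= min_pages}
```

Tokens that recur across many slugs are candidates for cluster names; mapping them to readable labels such as "Cloud Cost Optimization" would still need a separate step.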
From the dated content (primarily blog posts), derive the publishing cadence: average posts per month, trend, and most recent publish date.
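If `--deep-analyze N` is specified, fetch the top N pages (prioritizing blog posts) and extract the fields shown under `deep_analysis` in the output: word count, target keyword, funnel stage, content depth, and whether the page has images and a CTA.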
{
"domain": "example.com",
"crawl_date": "2026-02-25",
"total_pages": 347,
"discovery_methods": ["sitemap.xml", "rss"],
"pages": [
{
"url": "https://example.com/blog/reduce-aws-costs",
"title": "How to Reduce Your AWS Bill by 40%",
"date": "2025-11-15",
"type": "blog-post",
"topic_cluster": "Cloud Cost Optimization",
"deep_analysis": {
"word_count": 2100,
"target_keyword": "reduce aws costs",
"funnel_stage": "TOFU",
"content_depth": "deep",
"has_images": true,
"has_cta": true
}
}
],
"summary": {
"by_type": {"blog-post": 89, "landing-page": 23, "case-study": 12, ...},
"by_topic": {"Cloud Cost Optimization": 34, "FinOps": 18, ...},
"publishing_cadence": {
"posts_per_month_avg": 4.2,
"trend": "increasing",
"most_recent": "2026-02-20"
}
}
}

# Content Inventory: example.com
**Crawled:** 2026-02-25 | **Total pages:** 347
## Content by Type
| Type | Count | % |
|------|-------|---|
| Blog Posts | 89 | 25.6% |
| Landing Pages | 23 | 6.6% |
| ...
## Content by Topic Cluster
| Topic | Posts | Most Recent |
|-------|-------|-------------|
| Cloud Cost Optimization | 34 | 2026-02-20 |
| ...
## Publishing Cadence
- Average: 4.2 posts/month
- Trend: Increasing (3.1 → 5.4 over last 6 months)
- Most recent: 2026-02-20
## Full Catalog
| # | Date | Type | Topic | Title | URL |
|---|------|------|-------|-------|-----|
| 1 | 2026-02-20 | blog-post | Cloud Cost | How to Reduce... | https://... |

- `requests` library (`pip install requests`)
- `APIFY_API_TOKEN` env var (only for Apify fallback mode)
- `onescales/sitemap-url-extractor`