Crawl a website's sitemap and blog index to build a complete content inventory. Lists every page with URL, title, publish date, content type, and topic cluster. Groups content by category and topic. Optionally deep-reads top N pages for quality analysis and funnel stage tagging. Use before SEO audits, content gap analysis, or brand voice extraction.
```shell
npx gooseworks install --claude
# Then in your agent:
/gooseworks <prompt> --skill site-content-catalog
```
```shell
# Basic content inventory
python3 skills/site-content-catalog/scripts/catalog_content.py \
  --domain "example.com"

# With deep analysis of top 20 pages
python3 skills/site-content-catalog/scripts/catalog_content.py \
  --domain "example.com" --deep-analyze 20

# Output to specific file
python3 skills/site-content-catalog/scripts/catalog_content.py \
  --domain "example.com" --output clients/acme/research/content-inventory.json
```

| Parameter | Required | Default | Description |
|---|---|---|---|
| `--domain` | Yes | — | Domain to catalog (e.g., "example.com") |
| `--deep-analyze` | No | 0 | Number of top pages to deep-read for content analysis |
| `--output` | No | stdout | Path to save JSON output |
| `--include-non-blog` | No | true | Also catalog landing pages, docs, etc. (not just blog) |
The script attempts multiple methods to find all pages on a site, in order:

1. **Sitemaps:** `https://[domain]/sitemap.xml`, `/sitemap_index.xml`, `/sitemap-index.xml`, `/wp-sitemap.xml`
2. **robots.txt:** parsed for `Sitemap:` directives
3. **Feeds:** `/feed`, `/rss`, `/atom.xml`, `/blog/feed`, etc.
4. **Index pages:** `/blog`, `/resources`, `/insights`, `/news`, `/articles` (following pagination: `/blog/page/2`, `?page=2`, etc.)
5. **Search queries:** `site:[domain]` to estimate total indexed pages, `site:[domain]/blog` to find blog content, `site:[domain] intitle:` to discover page title patterns
6. **Apify fallback:** the `onescales/sitemap-url-extractor` actor
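The sitemap and robots.txt steps can be sketched as follows. This is a simplified illustration, not the script's actual implementation; `parse_sitemap` and `sitemap_candidates` are hypothetical helper names:

```python
import urllib.request
import xml.etree.ElementTree as ET

# Common sitemap locations, tried in order
SITEMAP_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap-index.xml", "/wp-sitemap.xml"]
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text: str) -> list[str]:
    """Extract every <loc> URL from a <urlset> or <sitemapindex> document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]

def sitemap_candidates(domain: str) -> list[str]:
    """Standard sitemap locations plus any Sitemap: directives in robots.txt."""
    urls = [f"https://{domain}{path}" for path in SITEMAP_PATHS]
    try:
        robots = urllib.request.urlopen(
            f"https://{domain}/robots.txt", timeout=10).read().decode("utf-8", "replace")
        urls += [line.split(":", 1)[1].strip()
                 for line in robots.splitlines()
                 if line.lower().startswith("sitemap:")]
    except OSError:
        pass  # no robots.txt, or host unreachable; fall back to the standard paths
    return urls
```

A sitemap index nests child sitemaps in `<loc>` elements just like a `<urlset>` nests page URLs, so the same parser covers both; a real crawler would recurse into index entries.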
Classify based on URL patterns and page titles:
| Type | URL Patterns | Examples |
|---|---|---|
| `blog-post` | `/blog/`, `/posts/`, `/articles/` | How-to guides, opinion pieces |
| `case-study` | `/case-study/`, `/customers/`, `/success-stories/` | Customer stories |
| `comparison` | `/vs/`, `/compare/`, `/alternative/` | X vs Y pages |
| `landing-page` | `/solutions/`, `/use-cases/`, `/for-` | Product marketing pages |
| `docs` | `/docs/`, `/help/`, `/documentation/`, `/api/` | Technical documentation |
| `changelog` | `/changelog/`, `/releases/`, `/whats-new/` | Product updates |
| `pricing` | `/pricing/` | Pricing page |
| `about` | `/about/`, `/team/`, `/careers/` | Company pages |
| `legal` | `/privacy/`, `/terms/`, `/security/` | Legal/compliance |
| `resource` | `/resources/`, `/guides/`, `/ebooks/`, `/webinars/` | Gated/downloadable content |
| `glossary` | `/glossary/`, `/dictionary/`, `/terms/` | SEO glossary pages |
| `integration` | `/integrations/`, `/apps/`, `/marketplace/` | Integration pages |
| `other` | — | Anything else |
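A first-match rule table like the one above is straightforward to sketch. This is an illustration under the assumption of first-match-wins ordering, not the script's actual code:

```python
# Rules mirror the table above; the first matching pattern wins,
# so "/terms/" resolves to legal before glossary is considered.
TYPE_RULES = [
    ("blog-post",    ("/blog/", "/posts/", "/articles/")),
    ("case-study",   ("/case-study/", "/customers/", "/success-stories/")),
    ("comparison",   ("/vs/", "/compare/", "/alternative/")),
    ("landing-page", ("/solutions/", "/use-cases/", "/for-")),
    ("docs",         ("/docs/", "/help/", "/documentation/", "/api/")),
    ("changelog",    ("/changelog/", "/releases/", "/whats-new/")),
    ("pricing",      ("/pricing/",)),
    ("about",        ("/about/", "/team/", "/careers/")),
    ("legal",        ("/privacy/", "/terms/", "/security/")),
    ("resource",     ("/resources/", "/guides/", "/ebooks/", "/webinars/")),
    ("glossary",     ("/glossary/", "/dictionary/", "/terms/")),
    ("integration",  ("/integrations/", "/apps/", "/marketplace/")),
]

def classify_url(url: str) -> str:
    """Return the first content type whose URL pattern matches, else 'other'."""
    path = url.lower()
    for page_type, patterns in TYPE_RULES:
        if any(p in path for p in patterns):
            return page_type
    return "other"
```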
Pages are grouped into topic clusters by extracting topic signals from URL slugs and page titles.
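As an illustration of slug-based signal extraction (a sketch; the stopword list and the `topic_tokens` helper are illustrative, not the script's actual code):

```python
import re

# Words that carry no topical signal in slugs or titles (illustrative list)
STOPWORDS = {"how", "to", "the", "a", "an", "your", "for", "and",
             "of", "in", "with", "what", "is", "why", "by"}

def topic_tokens(url: str, title: str = "") -> list[str]:
    """Split the final URL slug and the title into candidate topic words."""
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    words = re.split(r"[-_\s]+", f"{slug} {title}".lower())
    return [w for w in words if w.isalpha() and w not in STOPWORDS]
```

Counting these tokens across all pages, then merging pages that share the most frequent tokens, gives a rough clustering without any external NLP dependency.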
Publishing cadence (average posts per month, trend, and most recent post) is computed from the dated content, primarily blog posts.
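One way to derive those cadence figures from a list of post dates, as a sketch (the `publishing_cadence` helper is hypothetical):

```python
from collections import Counter
from datetime import date

def publishing_cadence(dates: list[date]) -> dict:
    """Average posts/month, a crude trend, and the most recent post date."""
    if not dates:
        return {"posts_per_month_avg": 0.0, "trend": "unknown", "most_recent": None}
    by_month = Counter((d.year, d.month) for d in dates)
    months = sorted(by_month)
    avg = round(len(dates) / len(months), 1)
    # Trend: compare the later half of active months against the earlier half
    half = len(months) // 2
    if half:
        early = sum(by_month[m] for m in months[:half]) / half
        late = sum(by_month[m] for m in months[half:]) / (len(months) - half)
        trend = "increasing" if late > early else "decreasing" if late < early else "flat"
    else:
        trend = "flat"
    return {"posts_per_month_avg": avg,
            "trend": trend,
            "most_recent": max(dates).isoformat()}
```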
If `--deep-analyze N` is specified, the script fetches the top N pages (prioritizing blog posts) and extracts per-page metrics such as word count, target keyword, funnel stage, content depth, and the presence of images and CTAs. Example JSON output:
```json
{
  "domain": "example.com",
  "crawl_date": "2026-02-25",
  "total_pages": 347,
  "discovery_methods": ["sitemap.xml", "rss"],
  "pages": [
    {
      "url": "https://example.com/blog/reduce-aws-costs",
      "title": "How to Reduce Your AWS Bill by 40%",
      "date": "2025-11-15",
      "type": "blog-post",
      "topic_cluster": "Cloud Cost Optimization",
      "deep_analysis": {
        "word_count": 2100,
        "target_keyword": "reduce aws costs",
        "funnel_stage": "TOFU",
        "content_depth": "deep",
        "has_images": true,
        "has_cta": true
      }
    }
  ],
  "summary": {
    "by_type": {"blog-post": 89, "landing-page": 23, "case-study": 12, ...},
    "by_topic": {"Cloud Cost Optimization": 34, "FinOps": 18, ...},
    "publishing_cadence": {
      "posts_per_month_avg": 4.2,
      "trend": "increasing",
      "most_recent": "2026-02-20"
    }
  }
}
```

# Content Inventory: example.com
**Crawled:** 2026-02-25 | **Total pages:** 347
## Content by Type
| Type | Count | % |
|------|-------|---|
| Blog Posts | 89 | 25.6% |
| Landing Pages | 23 | 6.6% |
| ...
## Content by Topic Cluster
| Topic | Posts | Most Recent |
|-------|-------|-------------|
| Cloud Cost Optimization | 34 | 2026-02-20 |
| ...
## Publishing Cadence
- Average: 4.2 posts/month
- Trend: Increasing (3.1 → 5.4 over last 6 months)
- Most recent: 2026-02-20
## Full Catalog
| # | Date | Type | Topic | Title | URL |
|---|------|------|-------|-------|-----|
| 1 | 2026-02-20 | blog-post | Cloud Cost | How to Reduce... | https://... |

Requirements:
- `requests` library (`pip install requests`)
- `APIFY_API_TOKEN` env var (only for Apify fallback mode)
- Apify actor `onescales/sitemap-url-extractor` (only for Apify fallback mode)