web-pages

Run ingestion-backed workflows for web pages.

Usage

  • worai web-pages classify-types <source> [options]

classify-types

Classify ingested URLs into schema.org types and export a CSV with the columns url, main_type, additional_types, and explanation.
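As a quick sanity check on an exported file, the following sketch counts rows per main_type. The helper name count_main_types is hypothetical and not part of worai. It assumes the CSV starts with a header row and that the url and main_type fields contain no embedded commas, quotes, or newlines; if explanation values are quoted and span several lines, a real CSV parser would be needed instead.

```shell
# Hypothetical helper (not part of worai): count rows per main_type
# in a classify-types CSV export.
# Assumption: the first line is the header, and the first two columns
# (url, main_type) contain no embedded commas, quotes, or newlines.
count_main_types() {
  csv_path=$1
  # Skip the header, tally column 2, then print "<count> <type>"
  # sorted by descending count and then by type name.
  awk -F, 'NR > 1 && $2 != "" { counts[$2]++ }
           END { for (t in counts) print counts[t], t }' "$csv_path" |
    sort -k1,1nr -k2,2
}
```

Calling count_main_types ./types.csv prints one line per type, most frequent first.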

Arguments

Argument | Type | Description
source | string | Source input for ingestion (sitemap URL/path, URL file, sheets URL/ID, or local JSON file).

Options

Option | Type | Default | Description
--output | string | web_pages_type_classification_<profile>_<date>_<seq>.csv | Output CSV path (supports profile template fallback).
--sheet-name | string | none | Sheet tab name when using sheets source.
--service-account | string | none | Google service account path or JSON body (required for sheets source).
--ingest-source | string | auto | `auto
--ingest-loader | string | web_scrape_api | `auto
--ingest-passthrough-when-html / --no-ingest-passthrough-when-html | bool | config/default | Prefer passthrough when HTML is already embedded.
--url-regex | string | none | URL regex filter mapped to SDK URL_REGEX.
--agent-cli | string | auto | Local agent CLI (`claude
--agent-timeout-sec | float | 120.0 | Per-page agent timeout.
--max-markdown-chars | int | 24000 | Markdown truncation limit before classification.
-y, --yes | bool | false | Skip confirmation prompt and proceed.
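Because each classified page consumes agent credits, it can help to preview which URLs a --url-regex value would keep before running the command. The sketch below does this locally for a plain-text URL file. The helper name preview_url_filter is hypothetical, the one-URL-per-line file layout is an assumption, and grep -E (POSIX extended regex, unanchored search) is only an approximation of the SDK's URL_REGEX matching, whose exact semantics are not documented here.

```shell
# Hypothetical helper (not part of worai): list the URLs in a text file
# that match a regular expression.
# Assumptions: the file holds one URL per line, and unanchored
# extended-regex search is close enough to the SDK's URL_REGEX behaviour.
preview_url_filter() {
  url_file=$1
  pattern=$2
  # "--" keeps patterns that start with "-" from being read as options.
  grep -E -- "$pattern" "$url_file"
}
```

For instance, preview_url_filter ./urls.txt "/blog/" prints only the lines containing /blog/. grep exits with status 1 when nothing matches, which a script can use to skip an empty run.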

Credit confirmation

  • The command prompts before execution because each run consumes agent credits.
  • The prompt defaults to yes (Y/n), so pressing Enter continues.
  • Use -y / --yes in automation to skip the prompt.
  • The command shows a single CLI-owned progress bar during execution, driven by the SDK's type_classification.progress.* callbacks.
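For unattended runs such as cron jobs or CI, a small wrapper can pass --yes so the prompt never blocks. This is a minimal sketch: the function name run_classify_unattended and its two-argument interface are hypothetical, and only the --output and --yes flags documented above are used.

```shell
# Hypothetical wrapper (not part of worai): run classify-types without
# the interactive credit confirmation.
run_classify_unattended() {
  source_input=$1
  output_csv=$2
  # Fail early with a clear message if the CLI is not available.
  if ! command -v worai >/dev/null 2>&1; then
    echo "worai not found on PATH" >&2
    return 127
  fi
  # --yes skips the confirmation prompt; the run still consumes credits.
  worai web-pages classify-types "$source_input" --output "$output_csv" --yes
}
```

A job could then call run_classify_unattended https://example.com/sitemap.xml ./types.csv and check the exit status.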

Config Fallbacks

  • profiles.<name>.web_pages.output
  • ingest.source
  • ingest.loader
  • ingest.url_regex
  • ingest.passthrough_when_html
  • profiles.<name>.oauth.service_account

Examples

  • worai web-pages classify-types https://example.com/sitemap.xml --ingest-source sitemap --ingest-loader playwright --url-regex "/blog/" --output ./types.csv
  • worai web-pages classify-types ./urls.txt --ingest-source urls --output ./types.csv
  • worai web-pages classify-types https://docs.google.com/spreadsheets/d/<id>/edit --ingest-source sheets --sheet-name URLs --service-account ./service-account.json --output ./types.csv
  • worai web-pages classify-types https://example.com/sitemap.xml --ingest-source sitemap --output ./types.csv --yes