structured-data

Generate JSON-LD/YARRRML mappings or materialize RDF from YARRRML using Playwright-rendered pages.

Usage

  • worai structured-data create <url> [schema_type] [options]
  • worai structured-data generate <input> --yarrrml <mapping.yarrrml> [options]
  • worai structured-data inventory <source> (--output <file.csv> | --destination-sheet-id <spreadsheet_id> --destination-sheet-name <tab>) [options]

structured-data create

Generate JSON-LD and YARRRML for a rendered web page using Playwright and Agent WordLift.

Arguments

| Argument | Type | Description |
| --- | --- | --- |
| url | string | Target page URL. |
| schema_type | string | Schema.org type to generate (e.g., Review). Required unless provided with --type. |

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --type | string | none | Schema.org type to generate (e.g., Review). Required if schema_type is omitted. |
| --output-dir | string | . | Output directory for generated files. |
| --base-name | string | structured-data | Base output filename. |
| --jsonld | string | none | Write JSON-LD to this file path. |
| --yarrml | string | none | Write YARRRML to this file path. |
| --debug | bool | false | Write agent prompt/response to .structured-data/agent_debug.json and echo to stderr. |
| --headed | bool | false | Run the browser with a visible UI instead of headless. |
| --timeout-ms | int | 30000 | Timeout (ms) for page loads. |
| --max-xhtml-chars | int | 40000 | Max characters to keep in cleaned XHTML sent to the agent. |
| --max-text-node-chars | int | 400 | Max characters per text node in cleaned XHTML. |
| --max-nesting-depth | int | 2 | Max depth for related schema types in the property guide. |
| --verbose | bool | true | Emit progress logs to stderr. |
| --wait-until | choice | networkidle | Playwright wait strategy: domcontentloaded, load, networkidle. |

Notes

  • Requires WORDLIFT_API_KEY (or profiles.<name>.api_key in config) to resolve the dataset URI.
  • Requires yarrrml-parser (npm install -g @rmlio/yarrrml-parser).
  • morph-kgc is included in project dependencies.
  • Each JSON-LD node includes an @id built as <dataset_uri>/<pluralized-type>/<name>-<hash>.
  • YARRRML uses XPath selectors.
  • schema:url is replaced with __URL__ and source paths with __XHTML__ when --reusable is enabled.
  • Intermediate artifacts are stored under <output-dir>/.structured-data/ (HTML, XHTML, cleaned XHTML, mapping, validation reports).
  • The generator rejects hard-coded literals (except schema:url) and checks XPath evidence before accepting a mapping.
  • Missing Google-required properties are reported as warnings (not hard failures).
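
The __URL__ and __XHTML__ placeholders mentioned above suggest that a reusable mapping can be specialised per page by plain text substitution. The following is a minimal sketch of that idea, not the tool's own implementation: the function name fill_reusable_mapping and the mapping fragment are hypothetical and only illustrate the two placeholders.

```python
def fill_reusable_mapping(template: str, page_url: str, xhtml_path: str) -> str:
    """Substitute the two documented placeholders in a reusable YARRRML mapping."""
    return template.replace("__URL__", page_url).replace("__XHTML__", xhtml_path)


# Hypothetical mapping fragment, for illustration only (not actual worai output).
TEMPLATE = """\
sources:
  page: ['__XHTML__~xpath', '/']
mappings:
  review:
    sources: page
    po:
      - [schema:url, '__URL__']
"""

if __name__ == "__main__":
    print(fill_reusable_mapping(TEMPLATE, "https://example.com/article", "./article.xhtml"))
```

Keeping the page-specific values out of the mapping is what lets the same file be applied to many pages of the same template.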

Examples

  • worai structured-data create https://example.com/article Review --output-dir ./structured-data
  • worai structured-data create https://example.com/article --type Review --output-dir ./structured-data
  • worai structured-data create https://example.com/article Review --jsonld ./out/page.jsonld --yarrml ./out/page.yarrml

structured-data generate

Render pages from a sitemap (or a single URL), apply a YARRRML mapping, and emit one RDF file per page.

Arguments

| Argument | Type | Description |
| --- | --- | --- |
| input | string | Sitemap URL/path or a page URL. |

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --yarrrml | string | none | Path to the YARRRML mapping file. |
| --regex | string | .* | Regex to filter URLs (matches full URL). |
| --output-dir | string | . | Output directory for generated RDF. |
| --format | string | ttl | Output format: ttl, jsonld, rdf, nt, nq. |
| --concurrency | string | auto | Worker count or auto to adapt to responses. |
| --headed | bool | false | Run the browser with a visible UI instead of headless. |
| --timeout-ms | int | 30000 | Timeout (ms) for page loads. |
| --wait-until | choice | networkidle | Playwright wait strategy: domcontentloaded, load, networkidle. |
| --max-xhtml-chars | int | 40000 | Max characters to keep in cleaned XHTML. |
| --max-text-node-chars | int | 400 | Max characters per text node in cleaned XHTML. |
| --max-pages | int | none | Max number of pages to process. |
| --verbose | bool | true | Emit progress logs to stderr. |

Notes

  • Requires yarrrml-parser (npm install -g @rmlio/yarrrml-parser).
  • Uses Playwright to render HTML and converts it to XHTML before mapping.
  • Output filenames use {slug}--{hash}.{ext} to avoid collisions.
  • Blank nodes are rejected by default.
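
To illustrate why a {slug}--{hash}.{ext} pattern avoids collisions, here is a small sketch. The slug rule (host plus path, lowercased) and the hash rule (first 8 hex characters of SHA-256 over the full URL) are assumptions made for this example; the documentation does not specify how worai derives either part.

```python
import hashlib
import re
from urllib.parse import urlparse


def output_filename(url: str, ext: str = "ttl") -> str:
    """Build a {slug}--{hash}.{ext} name; slug and hash rules are assumptions."""
    parsed = urlparse(url)
    # Assumed slug rule: lowercase host + path, other characters collapsed to "-".
    slug = re.sub(r"[^a-z0-9]+", "-", (parsed.netloc + parsed.path).lower()).strip("-")
    # Assumed hash rule: first 8 hex characters of SHA-256 over the full URL.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    return f"{slug or 'page'}--{digest}.{ext}"


if __name__ == "__main__":
    # Same path, different query string: the slugs match, the hashes differ.
    print(output_filename("https://example.com/product/a?x=1"))
    print(output_filename("https://example.com/product/a?x=2"))
```

Two URLs that reduce to the same slug still get distinct files because the hash covers the whole URL.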

Examples

  • worai structured-data generate https://example.com/sitemap.xml --yarrrml ./mapping.yarrrml --output-dir ./out
  • worai structured-data generate https://example.com/page --yarrrml ./mapping.yarrrml --format jsonld
  • worai structured-data generate ./sitemap.xml --yarrrml ./mapping.yarrrml --regex "/product/" --concurrency auto
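
The --regex and --max-pages options can be pictured as a filter followed by a cap. The sketch below assumes the pattern may match anywhere in the full URL (search semantics, which is what the "/product/" example suggests) and that the page cap is applied after filtering; both points are assumptions, and filter_urls is a hypothetical helper rather than part of worai.

```python
import re


def filter_urls(urls, pattern=".*", max_pages=None):
    """Keep URLs the pattern matches, then optionally cap the result."""
    compiled = re.compile(pattern)
    # Assumed semantics: the regex is searched against the full URL string.
    kept = [url for url in urls if compiled.search(url)]
    # Assumed order: the cap is applied after filtering.
    return kept if max_pages is None else kept[:max_pages]


if __name__ == "__main__":
    sitemap_urls = [
        "https://example.com/product/1",
        "https://example.com/blog/post",
        "https://example.com/product/2",
    ]
    print(filter_urls(sitemap_urls, "/product/"))
```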

structured-data inventory

Parse all URLs from a sitemap, extract JSON-LD from each page, and export a structured-data inventory.

Arguments

| Argument | Type | Description |
| --- | --- | --- |
| source | string | Sitemap URL/path, local URL list file, or Google Spreadsheet URL/ID containing input URLs. |

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --sheet-name | string | none | Source sheet tab name when source is a Google Spreadsheet (reads the url column). |
| --output | string | none | Write inventory to CSV. |
| --destination-sheet-id | string | none | Google Spreadsheet ID where inventory should be written. |
| --destination-sheet-name | string | none | Destination sheet tab name for inventory output. |
| --client-secrets | string | none | OAuth client secrets JSON path (used when Sheets auth needs re-consent). |
| --token | string | oauth_token.json | OAuth token path (shared token file). |
| --port | int | 8080 | Local redirect port for OAuth flow. |
| --timeout | float | 30.0 | HTTP timeout in seconds for sitemap and page fetches. |
| --concurrency | string | auto | Worker count or auto to adapt to fetch/parse responses. |
| --source-type | string | none | Optional source parser override (e.g., debug-cloud). |
| --ingest-source | string | none | SDK 5 source axis: auto, urls, sitemap, sheets, local. |
| --ingest-loader | string | none | SDK 5 loader axis: auto, simple, proxy, playwright, premium_scraper, web_scrape_api, passthrough. |
| --ingest-passthrough-when-html / --no-ingest-passthrough-when-html | bool | config/default | Prefer passthrough when source records include embedded HTML. |

Output columns

  • url
  • faq_markup (yes/no, based on FAQPage existence)
  • faq_markup_from_graph (yes/no, based on FAQPage.@id being under current account dataset URI)
  • types (comma-separated schema types without schema.org prefixes)
  • structured_data (full combined JSON-LD object with one @graph)

Notes

  • Uses JSON-LD only (<script type="application/ld+json">).
  • Requires WORDLIFT_API_KEY (or profiles.<name>.api_key in config) to resolve account dataset URI.
  • Requires exactly one destination: --output or --destination-sheet-id + --destination-sheet-name.
  • Fetches page content with Playwright using the shared worai default User-Agent.
  • Shows a progress bar while processing source URLs.
  • Supports adaptive concurrency via --concurrency auto.
  • Ingestion precedence:
    • new ingest settings (--ingest-* or ingest_* config) win over legacy when both are set
    • legacy remains supported when new is unset
    • disagreements emit a structured warning event
  • Loader defaults:
    • default and auto loader resolve to web_scrape_api
    • passthrough takes precedence when embedded HTML exists and passthrough-when-html is enabled
  • worai.toml examples:
    [ingest]
    source = "auto"
    loader = "web_scrape_api"
    passthrough_when_html = true

    [profiles.inventory_local]
    api_key = "${WORDLIFT_API_KEY}"
    mapping = "default.yarrrml"
    urls = ["https://example.com/page"]
    ingest_source = "local"
    ingest_loader = "passthrough"
  • Local URL list file support:
    • .txt: one URL per line
    • .csv: requires url column
  • When using a Google Spreadsheet as source, --sheet-name is required.
  • --source-type debug-cloud (legacy alias of --ingest-source local) supports .ttl debug artifacts by reading:
    • URL from http://schema.org/url
    • HTML from https://w3id.org/seovoc/html
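
The ingestion precedence and loader defaults described in the notes above can be summarised in a short sketch. This is one reading of those rules, not worai's implementation: the function names are hypothetical, Python's warnings module stands in for the structured warning event, falling back to "auto" when nothing is set is an assumption, and so is letting passthrough override an explicitly chosen loader.

```python
import warnings

# Legacy --source-type values mapped to the new source axis.
# Only the debug-cloud alias is documented.
LEGACY_SOURCE_ALIASES = {"debug-cloud": "local"}


def resolve_ingest_source(ingest_source=None, source_type=None):
    """New setting wins; legacy is used only when the new one is unset."""
    legacy = LEGACY_SOURCE_ALIASES.get(source_type, source_type) if source_type else None
    if ingest_source:
        if legacy and legacy != ingest_source:
            # Stand-in for the structured warning event.
            warnings.warn(f"ingest source {ingest_source!r} overrides legacy {source_type!r}")
        return ingest_source
    # Assumption: fall back to "auto" when neither setting is present.
    return legacy or "auto"


def resolve_loader(ingest_loader, has_embedded_html, passthrough_when_html):
    """Apply the documented loader defaults."""
    # Assumption: passthrough overrides any configured loader when it applies.
    if has_embedded_html and passthrough_when_html:
        return "passthrough"
    # An unset loader and "auto" both resolve to web_scrape_api.
    if ingest_loader in (None, "auto"):
        return "web_scrape_api"
    return ingest_loader


if __name__ == "__main__":
    print(resolve_ingest_source(None, "debug-cloud"))
    print(resolve_loader("auto", has_embedded_html=False, passthrough_when_html=True))
```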

Examples

  • worai structured-data inventory https://example.com/sitemap.xml --output ./structured-data-inventory.csv
  • worai structured-data inventory ./urls.txt --output ./structured-data-inventory.csv
  • worai structured-data inventory https://docs.google.com/spreadsheets/d/<id>/edit --sheet-name URLs_US --output ./structured-data-inventory.csv
  • worai structured-data inventory https://example.com/sitemap.xml --destination-sheet-id 1AbCdEfGhIjKlMnOp --destination-sheet-name Inventory
  • worai structured-data inventory https://example.com/sitemap.xml --output ./structured-data-inventory.csv --concurrency auto
  • worai structured-data inventory /path/to/debug_cloud/us --source-type debug-cloud --output ./structured-data-inventory.csv
  • worai structured-data inventory /path/to/debug_cloud/us --ingest-source local --ingest-loader passthrough --output ./structured-data-inventory.csv
  • worai structured-data inventory https://example.com/sitemap.xml --ingest-loader web_scrape_api --output ./structured-data-inventory.csv