structured-data

Generate JSON-LD/YARRRML mappings or materialize RDF from YARRRML using Playwright-rendered pages.

Usage

  • worai structured-data create <url> [schema_type] [options]
  • worai structured-data generate <input> --yarrrml <mapping.yarrrml> [options]

structured-data create

Generate JSON-LD and YARRRML for a rendered web page using Playwright and Agent WordLift.

Arguments

| Argument | Type | Description |
| --- | --- | --- |
| url | string | Target page URL. |
| schema_type | string | Schema.org type to generate (e.g., Review). Required unless provided with --type. |

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --type | string | none | Schema.org type to generate (e.g., Review). Required if schema_type is omitted. |
| --output-dir | string | . | Output directory for generated files. |
| --base-name | string | structured-data | Base output filename. |
| --jsonld | string | none | Write JSON-LD to this file path. |
| --yarrml | string | none | Write YARRRML to this file path. |
| --debug | bool | false | Write agent prompt/response to .structured-data/agent_debug.json and echo to stderr. |
| --headed | bool | false | Run the browser with a visible UI instead of headless. |
| --timeout-ms | int | 30000 | Timeout (ms) for page loads. |
| --max-xhtml-chars | int | 40000 | Max characters to keep in cleaned XHTML sent to the agent. |
| --max-text-node-chars | int | 400 | Max characters per text node in cleaned XHTML. |
| --max-nesting-depth | int | 2 | Max depth for related schema types in the property guide. |
| --verbose | bool | true | Emit progress logs to stderr. |
| --wait-until | choice | networkidle | Playwright wait strategy: domcontentloaded, load, networkidle. |

Notes

  • Requires WORDLIFT_KEY (or wordlift.api_key in config) to resolve the dataset URI.
  • Requires yarrrml-parser (npm install -g @rmlio/yarrrml-parser).
  • morph-kgc is included in project dependencies.
  • Each JSON-LD node includes an @id built as <dataset_uri>/<pluralized-type>/<name>-<hash>.
  • YARRRML uses XPath selectors.
  • schema:url is replaced with __URL__ and source paths with __XHTML__ when --reusable is enabled.
  • Intermediate artifacts are stored under <output-dir>/.structured-data/ (HTML, XHTML, cleaned XHTML, mapping, validation reports).
  • The generator rejects hard-coded literals (except schema:url) and checks XPath evidence before accepting a mapping.
  • Missing Google-required properties are reported as warnings (not hard failures).
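
The @id pattern <dataset_uri>/<pluralized-type>/<name>-<hash> from the notes above can be illustrated with a short Python sketch. This is not worai's implementation: the helper names (slugify, pluralize, build_node_id), the naive pluralization rule, the choice of SHA-256 over the page URL, the 8-character hash length, and the dataset URI are all assumptions made for illustration only.

```python
import hashlib
import re


def slugify(text: str) -> str:
    """Lowercase the text and collapse non-alphanumeric runs into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def pluralize(schema_type: str) -> str:
    """Naive, hypothetical pluralization: append 's' unless the word already ends in 's'."""
    word = schema_type.lower()
    return word if word.endswith("s") else word + "s"


def build_node_id(dataset_uri: str, schema_type: str, name: str, source_url: str) -> str:
    """Assemble <dataset_uri>/<pluralized-type>/<name>-<hash>.

    Assumption: the hash is the first 8 hex characters of SHA-256 over the
    page URL. The real tool may hash different input or use another length.
    """
    digest = hashlib.sha256(source_url.encode("utf-8")).hexdigest()[:8]
    return f"{dataset_uri.rstrip('/')}/{pluralize(schema_type)}/{slugify(name)}-{digest}"


node_id = build_node_id(
    "https://data.example.org/dataset",  # placeholder dataset URI, not a real one
    "Review",
    "Great Coffee Maker",
    "https://example.com/article",
)
print(node_id)
```

Under these assumptions the identifier is deterministic, so regenerating markup for the same page and entity yields the same @id.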

Examples

  • worai structured-data create https://example.com/article Review --output-dir ./structured-data
  • worai structured-data create https://example.com/article --type Review --output-dir ./structured-data
  • worai structured-data create https://example.com/article Review --jsonld ./out/page.jsonld --yarrml ./out/page.yarrml

structured-data generate

Render pages from a sitemap (or a single URL), apply a YARRRML mapping, and emit one RDF file per page.

Arguments

| Argument | Type | Description |
| --- | --- | --- |
| input | string | Sitemap URL/path or a page URL. |

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --yarrrml | string | none | Path to the YARRRML mapping file. |
| --regex | string | .* | Regex to filter URLs (matches full URL). |
| --output-dir | string | . | Output directory for generated RDF. |
| --format | string | ttl | Output format: ttl, jsonld, rdf, nt, nq. |
| --concurrency | string | auto | Worker count or auto to adapt to responses. |
| --headed | bool | false | Run the browser with a visible UI instead of headless. |
| --timeout-ms | int | 30000 | Timeout (ms) for page loads. |
| --wait-until | choice | networkidle | Playwright wait strategy: domcontentloaded, load, networkidle. |
| --max-xhtml-chars | int | 40000 | Max characters to keep in cleaned XHTML. |
| --max-text-node-chars | int | 400 | Max characters per text node in cleaned XHTML. |
| --max-pages | int | none | Max number of pages to process. |
| --verbose | bool | true | Emit progress logs to stderr. |
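
To preview which sitemap URLs a --regex value would select, the filtering can be approximated in Python. This sketch assumes the pattern is searched anywhere within the full URL string (consistent with a value such as "/product/") and that --max-pages caps the list after filtering; neither detail is confirmed here, and filter_urls is a hypothetical helper, not part of worai.

```python
import re
from typing import Iterable, List, Optional


def filter_urls(
    urls: Iterable[str],
    pattern: str = ".*",
    max_pages: Optional[int] = None,
) -> List[str]:
    """Keep URLs in which the pattern is found, then truncate to max_pages.

    Assumption: search semantics (match anywhere in the URL), not a
    whole-string match.
    """
    compiled = re.compile(pattern)
    kept = [url for url in urls if compiled.search(url)]
    return kept if max_pages is None else kept[:max_pages]


sitemap_urls = [
    "https://example.com/product/espresso-machine",
    "https://example.com/blog/brewing-guide",
    "https://example.com/product/grinder",
]

selected = filter_urls(sitemap_urls, pattern="/product/", max_pages=1)
print(selected)  # ['https://example.com/product/espresso-machine']
```

Running a check like this against a downloaded sitemap before a large job helps confirm the pattern is neither too broad nor too narrow.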

Notes

  • Requires yarrrml-parser (npm install -g @rmlio/yarrrml-parser).
  • Uses Playwright to render HTML and converts it to XHTML before mapping.
  • Output filenames use {slug}--{hash}.{ext} to avoid collisions.
  • Blank nodes are rejected by default.
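
The {slug}--{hash}.{ext} naming scheme from the notes above can be sketched as follows. How worai derives the slug and the hash is not documented here, so this example assumes the slug comes from the URL path and the hash is a truncated SHA-256 of the full URL; output_filename is a hypothetical helper used only to show why the hash suffix prevents collisions.

```python
import hashlib
import re
from urllib.parse import urlparse


def output_filename(page_url: str, ext: str = "ttl") -> str:
    """Build a {slug}--{hash}.{ext} name from a page URL.

    Assumptions: the slug is the URL path with non-alphanumeric runs turned
    into hyphens, and the hash is the first 8 hex characters of SHA-256 over
    the full URL.
    """
    path = urlparse(page_url).path
    slug = re.sub(r"[^a-z0-9]+", "-", path.lower()).strip("-") or "index"
    digest = hashlib.sha256(page_url.encode("utf-8")).hexdigest()[:8]
    return f"{slug}--{digest}.{ext}"


# Two URLs that share a path but differ in query string get the same slug
# and different hashes, so their output files do not overwrite each other.
a = output_filename("https://example.com/product/grinder")
b = output_filename("https://example.com/product/grinder?color=red")
print(a)
print(b)
```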

Examples

  • worai structured-data generate https://example.com/sitemap.xml --yarrrml ./mapping.yarrrml --output-dir ./out
  • worai structured-data generate https://example.com/page --yarrrml ./mapping.yarrrml --format jsonld
  • worai structured-data generate ./sitemap.xml --yarrrml ./mapping.yarrrml --regex "/product/" --concurrency auto