{"openapi":"3.1.0","info":{"title":"AI-First Scraper","summary":"Ad-free Markdown extraction API designed for LLM and AI agent consumption.","description":"AI-First Scraper fetches any web page or PDF, strips advertising, navigation chrome, trackers, and scripts, and returns the main content as clean Markdown.\n\nIt exists because AI agents (LLM-powered crawlers, autonomous research agents, RAG pipelines) waste tokens parsing ad-laden HTML. This API returns deterministic Markdown so an agent can reason about the actual article in the fewest tokens possible.\n\n### How an AI agent should use this API\n1. **Single URL**: `GET /scrape?url=<target>` (JSON) or `GET /raw?url=<target>` (plain-text Markdown).\n2. **Many URLs**: `POST /batch` with `{\"urls\": [...], \"max_tokens\": N}` to fetch them in parallel.\n3. Feed the `markdown` field directly into your LLM context.\n\nThe scrape endpoints (`/scrape`, `/raw`, and `/batch`) all accept `max_tokens` to cap the response size and protect your prompt budget.","contact":{"name":"ai-first-scraper","url":"https://github.com/yubinkim444/ai-first-scraper"},"license":{"name":"MIT","url":"https://opensource.org/licenses/MIT"},"version":"1.1.0"},"paths":{"/":{"get":{"tags":["meta"],"summary":"Liveness probe.","operationId":"root__get","responses":{"200":{"description":"Successful Response","content":{"application/json":{"schema":{"$ref":"#/components/schemas/HealthResponse"}}}}}}},"/scrape":{"get":{"tags":["scrape"],"summary":"Fetch one URL and return clean Markdown (JSON).","description":"Fetches a single URL (HTML or PDF), removes ads, trackers, navigation, and scripts, and returns Markdown plus metadata (title, word_count, links). 
Use `max_tokens` to cap the size of the returned Markdown.","operationId":"scrape_scrape_get","parameters":[{"name":"url","in":"query","required":true,"schema":{"type":"string","description":"Fully-qualified http(s) URL.","examples":["https://en.wikipedia.org/wiki/Web_scraping"],"title":"Url"},"description":"Fully-qualified http(s) URL."},{"name":"max_tokens","in":"query","required":false,"schema":{"anyOf":[{"type":"integer","minimum":100},{"type":"null"}],"description":"Soft cap on the returned markdown, measured in whitespace-split tokens.","title":"Max Tokens"},"description":"Soft cap on the returned markdown, measured in whitespace-split tokens."}],"responses":{"200":{"description":"Successful Response","content":{"application/json":{"schema":{"$ref":"#/components/schemas/ScrapeResponse"}}}},"422":{"description":"Validation Error","content":{"application/json":{"schema":{"$ref":"#/components/schemas/HTTPValidationError"}}}}}}},"/raw":{"get":{"tags":["scrape"],"summary":"Same as /scrape, but returns the Markdown as plain text (no JSON envelope).","operationId":"raw_raw_get","parameters":[{"name":"url","in":"query","required":true,"schema":{"type":"string","title":"Url"}},{"name":"max_tokens","in":"query","required":false,"schema":{"anyOf":[{"type":"integer","minimum":100},{"type":"null"}],"title":"Max Tokens"}}],"responses":{"200":{"description":"Successful Response","content":{"text/plain":{"schema":{"type":"string"}}}},"422":{"description":"Validation Error","content":{"application/json":{"schema":{"$ref":"#/components/schemas/HTTPValidationError"}}}}}}},"/batch":{"post":{"tags":["scrape"],"summary":"Fetch many URLs in parallel and return per-URL results.","description":"Accepts up to 25 URLs and scrapes them concurrently. Returns an array in the same order as the input; each item has `ok` plus either `data` (on success) or `error` (on failure). 
One failing URL never blocks the others.","operationId":"batch_batch_post","requestBody":{"content":{"application/json":{"schema":{"$ref":"#/components/schemas/BatchRequest"}}},"required":true},"responses":{"200":{"description":"Successful Response","content":{"application/json":{"schema":{"items":{"$ref":"#/components/schemas/BatchItem"},"type":"array","title":"Response Batch Batch Post"}}}},"422":{"description":"Validation Error","content":{"application/json":{"schema":{"$ref":"#/components/schemas/HTTPValidationError"}}}}}}},"/llms.txt":{"get":{"tags":["meta"],"summary":"Machine-readable usage spec for LLMs (llms.txt convention).","operationId":"llms_txt_llms_txt_get","responses":{"200":{"description":"Successful Response","content":{"text/plain":{"schema":{"type":"string"}}}}}}}},"components":{"schemas":{"BatchItem":{"properties":{"url":{"type":"string","title":"Url"},"ok":{"type":"boolean","title":"Ok"},"data":{"anyOf":[{"$ref":"#/components/schemas/ScrapeResponse"},{"type":"null"}]},"error":{"anyOf":[{"type":"string"},{"type":"null"}],"title":"Error"}},"type":"object","required":["url","ok"],"title":"BatchItem"},"BatchRequest":{"properties":{"urls":{"items":{"type":"string"},"type":"array","maxItems":25,"minItems":1,"title":"Urls","description":"Up to 25 URLs to scrape in parallel.","examples":[["https://example.com","https://en.wikipedia.org/wiki/AI"]]},"max_tokens":{"anyOf":[{"type":"integer","minimum":100},{"type":"null"}],"title":"Max Tokens","description":"Per-URL soft cap on the markdown size, measured in whitespace-split tokens. 
Pages larger than this are truncated and `truncated=true` is set."}},"type":"object","required":["urls"],"title":"BatchRequest"},"HTTPValidationError":{"properties":{"detail":{"items":{"$ref":"#/components/schemas/ValidationError"},"type":"array","title":"Detail"}},"type":"object","title":"HTTPValidationError"},"HealthResponse":{"properties":{"status":{"type":"string","title":"Status","examples":["ok"]},"service":{"type":"string","title":"Service","examples":["ai-first-scraper"]},"version":{"type":"string","title":"Version","examples":["1.1.0"]}},"type":"object","required":["status","service","version"],"title":"HealthResponse"},"ScrapeResponse":{"properties":{"url":{"type":"string","maxLength":2083,"minLength":1,"format":"uri","title":"Url","description":"The URL that was scraped."},"title":{"anyOf":[{"type":"string"},{"type":"null"}],"title":"Title","description":"The page <title>, if present."},"word_count":{"type":"integer","title":"Word Count","description":"Number of words in the extracted markdown."},"markdown":{"type":"string","title":"Markdown","description":"The main page content rendered as Markdown. Ads, scripts, styles, nav, footer, aside, iframes, and tracking elements have been removed."},"links":{"items":{"type":"string"},"type":"array","title":"Links","description":"All outbound HTTP(S) links found in the cleaned content, deduplicated and in document order. 
Useful for agents that need to plan the next hop."},"truncated":{"type":"boolean","title":"Truncated","description":"True when the `markdown` was cut off because it exceeded `max_tokens`.","default":false},"content_type":{"type":"string","title":"Content Type","description":"Either 'html' or 'pdf' depending on what the upstream returned.","default":"html"}},"type":"object","required":["url","word_count","markdown"],"title":"ScrapeResponse","description":"Structured response returned by `/scrape` and per-item in `/batch`."},"ValidationError":{"properties":{"loc":{"items":{"anyOf":[{"type":"string"},{"type":"integer"}]},"type":"array","title":"Location"},"msg":{"type":"string","title":"Message"},"type":{"type":"string","title":"Error Type"}},"type":"object","required":["loc","msg","type"],"title":"ValidationError"}}}}