Client

client

Synchronous client for the Scrape.do Async API

Defines the ScrapeDoAsyncAPIClient, a synchronous wrapper over httpx.Client configured against q.scrape.do

Endpoint Mapping

_raise_for_status(resp)

Raises the Async-API exception that matches the status code of a non-2xx httpx.Response object

Shared Helper

Both the ScrapeDoAsyncAPIClient and the AsyncScrapeDoAsyncAPIClient clients use this function

Exception Mapping

Parameters:

Name Type Description Default
resp Response

The raw response received from q.scrape.do

required

Raises:

Type Description
AsyncAPIResponseError

One of the typed subclasses corresponding to the response's status code
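
The status-code routing can be pictured with a small self-contained sketch. The exception class names are taken from this page, but everything else is an assumption made for illustration: the status codes assigned to AsyncAPINotFoundError (404) and AsyncAPINotAcceptableError (406), the FakeResponse stand-in, and the name raise_for_status_sketch. It is not the SDK's actual implementation.

```python
from dataclasses import dataclass


class AsyncAPIResponseError(Exception):
    """Stand-in base class; the real SDK defines its own hierarchy."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"Async API returned HTTP {status_code}")
        self.status_code = status_code


class AsyncAPIBadRequestError(AsyncAPIResponseError): ...
class AsyncAPIAuthError(AsyncAPIResponseError): ...
class AsyncAPINotFoundError(AsyncAPIResponseError): ...
class AsyncAPINotAcceptableError(AsyncAPIResponseError): ...
class AsyncAPIRateLimitError(AsyncAPIResponseError): ...
class AsyncAPIServerError(AsyncAPIResponseError): ...


# Assumed mapping: 400, 401 and 429 appear on this page; 404 and 406 are guesses.
_STATUS_TO_ERROR = {
    400: AsyncAPIBadRequestError,
    401: AsyncAPIAuthError,
    404: AsyncAPINotFoundError,
    406: AsyncAPINotAcceptableError,
    429: AsyncAPIRateLimitError,
}


@dataclass
class FakeResponse:
    """Hypothetical stand-in for httpx.Response (only status_code is used)."""

    status_code: int


def raise_for_status_sketch(resp: FakeResponse) -> None:
    """Raise the typed error for a non-2xx response; do nothing on success."""
    if 200 <= resp.status_code < 300:
        return
    if resp.status_code >= 500:
        raise AsyncAPIServerError(resp.status_code)
    # Unknown 4xx codes fall back to the base class in this sketch.
    error_cls = _STATUS_TO_ERROR.get(resp.status_code, AsyncAPIResponseError)
    raise error_cls(resp.status_code)
```

Because every typed error derives from one base class, a caller can catch AsyncAPIResponseError to handle any API failure, or a specific subclass to handle one status.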

_parse_response(resp, model_cls)

Converts the raw JSON body returned by q.scrape.do into one of the endpoint-specific Async-API pydantic response models. The body is decoded with httpx.Response.json and the result is passed to model_cls.model_validate

Shared Helper

Both the ScrapeDoAsyncAPIClient and the AsyncScrapeDoAsyncAPIClient clients use this function

AsyncAPIUnparsableResponseError

This function raises AsyncAPIUnparsableResponseError whenever a JSONDecodeError or a ValidationError occurs while parsing the raw httpx.Response into a pydantic response model

Why Wrap These Errors
  • q.scrape.do ordinarily returns JSON bodies that match the documented schema for each endpoint

  • Server-side incidents can leak raw HTML or a malformed payload through to the client. In that case, resp.json() would raise JSONDecodeError

  • An unexpected change to the JSON schema returned by one of the endpoints would cause model_cls.model_validate to raise ValidationError

  • Either failure is surfaced as an AsyncAPIUnparsableResponseError so callers see a single, consistent exception type for the cases in which the gateway returned something that the SDK couldn't parse

Parameters:

Name Type Description Default
resp Response

The 2xx response to parse

required
model_cls Type[_ResponseModelT]

The pydantic model class to validate against

required

Returns:

Type Description
_ResponseModelT

The validated model_cls instance

Raises:

Type Description
AsyncAPIUnparsableResponseError

If resp.json() fails or the body doesn't match model_cls
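
The error-wrapping idea can be sketched with the standard library alone. In this sketch json.loads plays the role of resp.json(), and JobModelStandIn and ValidationErrorStandIn are hypothetical stand-ins for a pydantic model and pydantic's ValidationError. Only the pattern is meant to carry over, not the names.

```python
import json
from typing import Any, Type, TypeVar


class ValidationErrorStandIn(ValueError):
    """Stand-in for pydantic.ValidationError."""


class AsyncAPIUnparsableResponseError(Exception):
    """Stand-in for the SDK exception of the same name."""


class JobModelStandIn:
    """Hypothetical model exposing a pydantic-style model_validate."""

    def __init__(self, job_id: str) -> None:
        self.job_id = job_id

    @classmethod
    def model_validate(cls, data: Any) -> "JobModelStandIn":
        if not isinstance(data, dict) or not isinstance(data.get("job_id"), str):
            raise ValidationErrorStandIn("expected an object with a string 'job_id'")
        return cls(data["job_id"])


ModelT = TypeVar("ModelT", bound=JobModelStandIn)


def parse_response_sketch(raw_body: str, model_cls: Type[ModelT]) -> ModelT:
    """Decode and validate a body, wrapping both failure modes in one error."""
    try:
        payload = json.loads(raw_body)  # plays the role of resp.json()
        return model_cls.model_validate(payload)
    except (json.JSONDecodeError, ValidationErrorStandIn) as exc:
        # Chain the original error so the root cause stays inspectable.
        raise AsyncAPIUnparsableResponseError(str(exc)) from exc
```

Chaining with "from exc" keeps the original decoding or validation error available on the __cause__ attribute, so the single exception type does not hide the root cause.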

_build_job_creation_request(**kwargs)

Builds a JobCreationRequest from the JobCreationRequestDict TypedDict kwargs dictionary

Shared Helper

Both the ScrapeDoAsyncAPIClient and the AsyncScrapeDoAsyncAPIClient clients use this function

Render Field Auto-Collection

Every key in RENDER_PARAMETER_FIELDS found in kwargs is collected into a RenderParameters instance and assigned to JobCreationRequest.render

Mixed Render Configuration
  • Providing a pre-built RenderParameters instance via kwargs["render"] AND any key in RENDER_PARAMETER_FIELDS at the same time raises a ValueError

  • The two configuration styles are mutually exclusive

Parameters:

Name Type Description Default
**kwargs Unpack[JobCreationRequestDict]

Any subset of the JobCreationRequestDict keys

{}

Returns:

Type Description
JobCreationRequest

The validated JobCreationRequest model

Raises:

Type Description
ValueError

If both a pre-built render and any flat render field are provided in the same call
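
A minimal sketch of the auto-collection and the mutual-exclusion check, using plain dictionaries. The field names in RENDER_PARAMETER_FIELDS and the targets key are hypothetical placeholders, and a nested dict stands in for the RenderParameters instance the SDK would build.

```python
from typing import Any, Dict

# Hypothetical field names; the real RENDER_PARAMETER_FIELDS is defined by the SDK.
RENDER_PARAMETER_FIELDS = frozenset({"wait_until", "custom_wait"})


def split_render_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs with flat render fields nested under 'render'."""
    flat_render = {k: v for k, v in kwargs.items() if k in RENDER_PARAMETER_FIELDS}
    rest = {k: v for k, v in kwargs.items() if k not in RENDER_PARAMETER_FIELDS}

    # The two render configuration styles are mutually exclusive.
    if flat_render and rest.get("render") is not None:
        raise ValueError(
            "Pass either a pre-built 'render' or flat render fields, not both"
        )
    if flat_render:
        rest["render"] = flat_render  # the SDK would build RenderParameters here
    return rest
```

Rejecting the mixed case outright, rather than merging the two sources, avoids having to decide which value wins when both define the same field.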

ScrapeDoAsyncAPIClient

Synchronous client for the Scrape.do Async API on q.scrape.do

Facilitates interactions with the Scrape.do Async API by managing an httpx.Client instance. It provides strict type-checking for request parameters, endpoint-specific methods, automatic polling, and custom error parsing, while keeping the network configuration as flexible as possible

Features
Concurrency Limit and Server Errors

This client intercepts gateway errors specific to the Scrape.do Async API (429 / 502 / 503 / 504) and automatically applies a customisable retry strategy before the error can reach the application

SDK Event Hooks (event_hooks)

This client implements SDK-specific event hooks mimicking the structure of httpx native event hooks. See AsyncAPIEventHooks for available lifecycle hooks and their required signatures

Additional httpx.Client Configuration

The following httpx.Client parameters can be provided as keyword arguments and will be passed directly to the underlying object

  • verify
  • cert
  • http1
  • http2
  • timeout
  • limits
  • transport
  • default_encoding

Additionally, the following httpx.Client.request parameters can be provided as keyword arguments during request execution

  • timeout (r_timeout)
  • extensions

For more information on their behaviour and default values, please consult the official httpx documentation

Unsupported HTTPX Client Arguments

The underlying httpx.Client object is strictly managed by the instance to prevent invalid configurations from being sent to the Scrape.do Async API. For this reason, arguments not listed in the previous section are intentionally blocked and shouldn't be changed

Parameters:

Name Type Description Default
api_token Optional[str]

The Scrape.do API key. If omitted, falls back to the SCRAPE_DO_API_KEY environment variable

None
max_retries int

Maximum retry attempts on transient gateway errors (429 / 502 / 503 / 504)

3
retry_backoff Optional[Union[float, Callable[[int], float]]]

The strategy used to calculate the delay between retries. Can be a static float (seconds) or a callable that accepts the current attempt number (0-indexed) and returns a float. Defaults to a jittered exponential backoff when set to None

None
event_hooks Optional[AsyncAPIEventHooks]

A dictionary of SDK-native hooks to execute during different points of the Async-API request lifecycle

None
verify Union[SSLContext, str, bool]

Configures SSL certificate verification. Defaults to True (secure)

True
cert Optional[CertTypes]

Client-side certificates for mutual TLS authentication

None
http1 bool

Enable HTTP/1.1

True
http2 bool

Enable HTTP/2 multiplexing

False
timeout TimeoutTypes

Default timeout in seconds applied to every network phase

60.0
limits Limits

Connection pool limits

DEFAULT_LIMITS
transport Optional[BaseTransport]

Custom transport engine

None
default_encoding Union[str, Callable[[bytes], str]]

The fallback text encoding used if a target website omits a charset header

'utf-8'

close()

Closes the underlying HTTPX connection pool.

It is recommended to use the client as a context manager to ensure resources are released automatically.

__enter__()

Initializes the HTTPX connection pool and returns the context manager object

Returns:

Type Description
Self

The ScrapeDoAsyncAPIClient instance with an opened HTTPX connection pool

__exit__(exc_type, exc_val, exc_tb)

Calls the close method to close the underlying HTTPX connection pool without swallowing any exceptions

Parameters:

Name Type Description Default
exc_type Optional[Type[BaseException]]

The type of the exception

required
exc_val Optional[BaseException]

The instance of the exception

required
exc_tb Optional[TracebackType]

The traceback information

required

Returns:

Type Description
Literal[False]

False, since no exceptions are swallowed
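
The context-manager contract described for __enter__, __exit__, and close can be illustrated with a tiny stand-in class. PooledClientSketch is hypothetical and records calls in a list instead of managing a real HTTPX connection pool.

```python
from typing import List, Literal


class PooledClientSketch:
    """Hypothetical stand-in that logs lifecycle calls instead of opening a pool."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def close(self) -> None:
        self.log.append("closed")

    def __enter__(self) -> "PooledClientSketch":
        self.log.append("opened")
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> Literal[False]:
        self.close()
        return False  # never swallow the in-flight exception
```

Returning False from __exit__ means an exception raised inside the with block still propagates to the caller after the resources have been released.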

_sleep(attempt)

Sleeps for the duration dictated by self.retry_backoff

Parameters:

Name Type Description Default
attempt int

The current zero-indexed attempt number

required
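
One way to resolve the three accepted retry_backoff forms into a delay is sketched below. The helper name compute_delay is hypothetical, and the default branch (0.5 s base, 30 s cap, full jitter) is an assumed parameterisation: this page only states that the default is a jittered exponential backoff.

```python
import random
from typing import Callable, Optional, Union

RetryBackoff = Optional[Union[float, Callable[[int], float]]]


def compute_delay(retry_backoff: RetryBackoff, attempt: int) -> float:
    """Return the seconds to wait before retry number `attempt` (0-indexed)."""
    if retry_backoff is None:
        # Assumed default: exponential growth capped at 30 s, with full jitter.
        ceiling = min(30.0, 0.5 * (2 ** attempt))
        return random.uniform(0.0, ceiling)
    if callable(retry_backoff):
        # A callable receives the attempt number and returns the delay.
        return float(retry_backoff(attempt))
    # A plain number is used as a static delay.
    return float(retry_backoff)
```

For example, compute_delay(1.5, 3) yields a fixed 1.5 seconds, while passing a callable such as lambda n: 0.1 * (n + 1) yields a linearly growing delay.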

_request(method, path, *, json_body=None, params=None, r_timeout=USE_CLIENT_DEFAULT, extensions=None)

Sends an HTTP request to a Scrape.do Async API endpoint with retry on transient gateway errors

Usage

This method is used internally by all of the client's endpoint-specific methods

Execution
  • Applies the customisable retry_backoff strategy on retryable statuses

  • Fires the configured AsyncAPIEventHooks (request / response / retry)

  • Uses the _raise_for_status function to raise exceptions on network and API response errors, ensuring that the returned httpx.Response is successful

Parameters:

Name Type Description Default
method HttpMethod

HTTP method

required
path str

Endpoint path relative to API_PATH

required
json_body Optional[Any]

Optional JSON body for POST

None
params Optional[QueryParamsType]

Optional query parameters for httpx client's request method

None
r_timeout Union[TimeoutTypes, UseClientDefault]

A request-specific timeout override

USE_CLIENT_DEFAULT
extensions Optional[RequestExtensions]

Advanced HTTPX extensions for this specific request

None

Returns:

Type Description
Response

The successful httpx.Response

Raises:

Type Description
AsyncAPIResponseError

Any typed Async-API error routed by status code

APIConnectionError

If the underlying network transport fails for max_retries + 1 consecutive attempts
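
The retry behaviour can be sketched as a loop over max_retries + 1 attempts. This is a simplified illustration under several assumptions: send is a hypothetical transport callable that returns a bare status code, ConnectionError stands in for httpx.RequestError, and hooks receive simplified arguments rather than httpx objects.

```python
from typing import Callable, Dict, List, Optional

RETRYABLE_STATUSES = {429, 502, 503, 504}


class APIConnectionError(Exception):
    """Stand-in for the SDK exception of the same name."""


def request_with_retry(
    send: Callable[[], int],
    max_retries: int = 3,
    sleep: Callable[[int], None] = lambda attempt: None,
    hooks: Optional[Dict[str, List[Callable[..., None]]]] = None,
) -> int:
    """Call `send` until it yields a non-retryable status or retries run out."""
    hooks = hooks or {}
    for attempt in range(max_retries + 1):
        status: Optional[int] = None
        error: Optional[Exception] = None
        try:
            status = send()
        except ConnectionError as exc:  # stands in for httpx.RequestError
            error = exc

        if status is not None:
            for hook in hooks.get("response", []):
                hook(status)
            if status not in RETRYABLE_STATUSES:
                return status  # the real client routes non-2xx to typed errors

        if attempt == max_retries:
            if error is not None:
                raise APIConnectionError(str(error)) from error
            assert status is not None
            return status  # retries exhausted; caller maps this to a typed error

        for hook in hooks.get("retry", []):
            hook(attempt, status, error)
        sleep(attempt)
    raise AssertionError("unreachable when max_retries >= 0")
```

In this sketch the retry hooks fire only when another attempt will actually follow, which matches the description that they run when the SDK decides to retry.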

create_job(request=None, *, r_timeout=USE_CLIENT_DEFAULT, extensions=None, **job_kwargs)

Creates a new Async API job

Parameter Configuration

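This method provides smart routing based on the arguments provided. You can configure the request in two ways: with a pre-built JobCreationRequest passed as request (the Pre-Built Parameters configuration), or with flat keyword arguments passed as **job_kwargs (the job_kwargs configuration)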

job_kwargs Additional Configuration

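Since JobCreationRequest accepts a nested pydantic model for its render attribute, job_kwargs offers two ways to configure it: a pre-built RenderParameters instance passed as job_kwargs["render"], or flat keyword arguments whose names appear in RENDER_PARAMETER_FIELDS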

Parameter Restrictions

To prevent silent overwrites and routing ambiguity, the client enforces that only one of the parameter configurations can be used at a time.

  • When using the Pre-Built Parameters configuration, passing any job_kwargs keyword-argument will raise a ValueError

  • When using the job_kwargs configuration, passing a JobCreationRequest to the request argument will raise a ValueError

  • When using the job_kwargs configuration, providing a pre-built RenderParameters instance via job_kwargs["render"] AND any other job_kwargs keyword-argument in RENDER_PARAMETER_FIELDS at the same time raises a ValueError

Parameters:

Name Type Description Default
request Optional[JobCreationRequest]

Pre-built job creation body. Mutually exclusive with **job_kwargs

None
r_timeout Union[TimeoutTypes, UseClientDefault]

A request-specific timeout override

USE_CLIENT_DEFAULT
extensions Optional[RequestExtensions]

Advanced HTTPX extensions

None
**job_kwargs Unpack[JobCreationRequestDict]

Flat kwargs-based configuration

{}

Returns:

Type Description
JobCreationResponse

The parsed JobCreationResponse containing the assigned job_id and per-task task_ids

Raises:

Type Description
ValueError

If both request and **job_kwargs are provided, or if the flat render fields conflict with a pre-built render in job_kwargs

AsyncAPIBadRequestError

On HTTP 400

AsyncAPIAuthError

On HTTP 401

AsyncAPIRateLimitError

On HTTP 429 once retries are exhausted

AsyncAPIServerError

On HTTP 5xx once retries are exhausted

AsyncAPIUnparsableResponseError

If the SDK can't parse a successful response from q.scrape.do
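
The mutual-exclusion rule between request and **job_kwargs can be sketched as a small guard. The function name resolve_job_request and the targets key are hypothetical, and a plain dict stands in for JobCreationRequest.

```python
from typing import Any, Dict, Optional


def resolve_job_request(
    request: Optional[Dict[str, Any]] = None, **job_kwargs: Any
) -> Dict[str, Any]:
    """Return the request body from exactly one of the two configuration styles."""
    if request is not None and job_kwargs:
        raise ValueError("Pass either 'request' or keyword arguments, not both")
    if request is not None:
        return request
    # The SDK would validate these kwargs into a JobCreationRequest instead.
    return dict(job_kwargs)
```

Failing fast here means a caller never has to guess which of two conflicting values was actually sent to the API.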

get_job(job_id, *, r_timeout=USE_CLIENT_DEFAULT, extensions=None)

Fetches the current state of an Async API job

Parameters:

Name Type Description Default
job_id str

UUID of the job to fetch

required
r_timeout Union[TimeoutTypes, UseClientDefault]

A request-specific timeout override

USE_CLIENT_DEFAULT
extensions Optional[RequestExtensions]

Advanced HTTPX extensions

None

Returns:

Type Description
JobDetails

The parsed JobDetails model

Raises:

Type Description
AsyncAPINotFoundError

If the job doesn't exist or has expired

AsyncAPIAuthError

On HTTP 401

AsyncAPIServerError

On HTTP 5xx once retries are exhausted

AsyncAPIUnparsableResponseError

If the SDK can't parse a successful response from q.scrape.do

list_jobs(query=None, *, r_timeout=USE_CLIENT_DEFAULT, extensions=None, **query_kwargs)

Lists Async API jobs filtered / sorted by query

Parameter Configuration

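This method provides smart routing based on the arguments provided. You can configure the request in two ways: with a pre-built JobListQueryParameters passed as query (the Pre-Built Parameters configuration), or with flat keyword arguments passed as **query_kwargs (the query_kwargs configuration)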

Parameter Restrictions

To prevent silent overwrites and routing ambiguity, the client enforces that only one of the parameter configurations can be used at a time.

  • When using the Pre-Built Parameters configuration, passing any query_kwargs keyword-argument will raise a ValueError

  • When using the query_kwargs configuration, passing a JobListQueryParameters to the query argument will raise a ValueError

Parameters:

Name Type Description Default
query Optional[JobListQueryParameters]

Pre-built filter / sort / pagination shape. Mutually exclusive with **query_kwargs

None
r_timeout Union[TimeoutTypes, UseClientDefault]

A request-specific timeout override

USE_CLIENT_DEFAULT
extensions Optional[RequestExtensions]

Advanced HTTPX extensions

None
**query_kwargs Unpack[JobListQueryParametersDict]

Flat kwargs-based configuration

{}

Returns:

Type Description
JobsListResponse

The parsed JobsListResponse model

Raises:

Type Description
ValueError

If both query and **query_kwargs are provided

AsyncAPIServerError

On HTTP 5xx once retries are exhausted

AsyncAPIUnparsableResponseError

If the SDK can't parse a successful response from q.scrape.do

get_task(job_id, task_id, *, r_timeout=USE_CLIENT_DEFAULT, extensions=None)

Fetches the full details of a single task within a job

Parameters:

Name Type Description Default
job_id str

UUID of the parent job

required
task_id str

UUID of the task to fetch

required
r_timeout Union[TimeoutTypes, UseClientDefault]

A request-specific timeout override

USE_CLIENT_DEFAULT
extensions Optional[RequestExtensions]

Advanced HTTPX extensions

None

Returns:

Type Description
TaskDetails

The parsed TaskDetails model

Raises:

Type Description
AsyncAPINotFoundError

If the job / task doesn't exist or has expired

AsyncAPIServerError

On HTTP 5xx once retries are exhausted

AsyncAPIUnparsableResponseError

If the SDK can't parse a successful response from q.scrape.do

cancel_job(job_id, *, r_timeout=USE_CLIENT_DEFAULT, extensions=None)

Cancels an in-flight Async API job

Parameters:

Name Type Description Default
job_id str

UUID of the job to cancel

required
r_timeout Union[TimeoutTypes, UseClientDefault]

A request-specific timeout override

USE_CLIENT_DEFAULT
extensions Optional[RequestExtensions]

Advanced HTTPX extensions

None

Returns:

Type Description
CancelJobResponse

The parsed CancelJobResponse (same shape as JobDetails, with Canceled=True)

Raises:

Type Description
AsyncAPINotFoundError

If the job doesn't exist or has expired

AsyncAPINotAcceptableError

If the job is already in a terminal state and can no longer be canceled

AsyncAPIServerError

On HTTP 5xx once retries are exhausted

AsyncAPIUnparsableResponseError

If the SDK can't parse a successful response from q.scrape.do

get_user_info(*, r_timeout=USE_CLIENT_DEFAULT, extensions=None)

Fetches the current user / account information

Parameters:

Name Type Description Default
r_timeout Union[TimeoutTypes, UseClientDefault]

A request-specific timeout override

USE_CLIENT_DEFAULT
extensions Optional[RequestExtensions]

Advanced HTTPX extensions

None

Returns:

Type Description
UserInformation

The parsed UserInformation model

Raises:

Type Description
AsyncAPIAuthError

On HTTP 401

AsyncAPIServerError

On HTTP 5xx once retries are exhausted

AsyncAPIUnparsableResponseError

If the SDK can't parse a successful response from q.scrape.do

wait_for_job(job_id, *, strategy=None, r_timeout=USE_CLIENT_DEFAULT, extensions=None)

Polls a job until it reaches a terminal status

Strategy Argument
  • None (default) → Uses PollingStrategy() with its documented defaults

  • Custom PollingStrategy Instance → Uses the provided instance with its custom configuration

  • PollingFunction → Uses the provided callable to calculate sleep times between attempts and decide whether or not to stop polling before the job reaches a terminal status

Additional strategy Information
  • For more information on how the default polling strategy works and how to customise it, see the PollingStrategy docstring

  • For more information on how to define a custom polling function, see the PollingFunction docstring

Parameters:

Name Type Description Default
job_id str

UUID of the job to poll

required
strategy Optional[Union[PollingStrategy, PollingFunction]]

How to poll for the job

None
r_timeout Union[TimeoutTypes, UseClientDefault]

A request-specific timeout override applied to every get_job call

USE_CLIENT_DEFAULT
extensions Optional[RequestExtensions]

Advanced HTTPX extensions applied to every get_job call

None

Returns:

Type Description
JobDetails

The terminal JobDetails snapshot

Raises:

Type Description
JobTimeoutError

If the polling strategy raises it
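
The polling loop with a custom callable can be sketched as follows. The callable signature used here (attempt number and latest job snapshot in, seconds to sleep out) is an assumption, since the actual PollingFunction contract is defined in its own docstring. The "success" status name and the dict-based job snapshots are also placeholders.

```python
from typing import Callable, Dict

TERMINAL_STATUSES = {"success", "error", "canceled"}  # "success" is an assumed name


class JobTimeoutError(Exception):
    """Stand-in for the SDK exception of the same name."""


def capped_attempts(
    max_attempts: int, delay: float
) -> Callable[[int, Dict[str, str]], float]:
    """Build a hypothetical polling function: fixed delay, bounded attempts."""

    def strategy(attempt: int, job: Dict[str, str]) -> float:
        if attempt + 1 >= max_attempts:
            raise JobTimeoutError(
                f"job still {job['status']} after {max_attempts} polls"
            )
        return delay

    return strategy


def wait_for_job_sketch(
    get_job: Callable[[], Dict[str, str]],
    strategy: Callable[[int, Dict[str, str]], float],
    sleep: Callable[[float], None],
) -> Dict[str, str]:
    """Poll until the job is terminal or the strategy raises."""
    attempt = 0
    while True:
        job = get_job()
        if job["status"] in TERMINAL_STATUSES:
            return job
        sleep(strategy(attempt, job))  # the strategy may raise to stop polling
        attempt += 1
```

Injecting sleep as a parameter keeps the loop testable without real waiting. In production code it would typically be time.sleep.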

submit_and_wait(request=None, *, strategy=None, r_timeout=USE_CLIENT_DEFAULT, extensions=None, **job_kwargs)

Submits a job, polls until terminal, and fetches every task

Parameter Configuration
  • This method reuses the same smart-routing as the client's job-creation method

  • For more information, see the create_job method's docstring

Polling Configuration
  • This method passes the strategy argument unchanged to the client's polling helper

  • For more information, see the wait_for_job method's docstring

Parameters:

Name Type Description Default
request Optional[JobCreationRequest]

Pre-built job creation body. Mutually exclusive with **job_kwargs

None
strategy Optional[Union[PollingStrategy, PollingFunction]]

How to poll for the job

None
r_timeout Union[TimeoutTypes, UseClientDefault]

A request-specific timeout override applied to every underlying HTTP call (create / poll / fetch)

USE_CLIENT_DEFAULT
extensions Optional[RequestExtensions]

Advanced HTTPX extensions applied to every underlying HTTP call

None
**job_kwargs Unpack[JobCreationRequestDict]

Flat kwargs-based configuration

{}

Returns:

Type Description
JobResult

A JobResult bundling the terminal JobDetails with the fetched List[TaskDetails] (one per task, in input order)

Raises:

Type Description
ValueError

If both request and **job_kwargs are provided, or if mixed render configurations are detected

JobFailedError

If the job reaches terminal status error

JobCanceledError

If the job reaches terminal status canceled

JobTimeoutError

If the polling strategy raises it
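
The three-step composition (create, wait, fetch every task) can be sketched with injected callables in place of the client's methods. JobResultSketch and the dict-based payloads are hypothetical. The job_id and task_ids keys follow the JobCreationResponse description on this page, and the error and canceled status names follow the Raises table above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


class JobFailedError(Exception): ...
class JobCanceledError(Exception): ...


@dataclass
class JobResultSketch:
    """Hypothetical bundle of the terminal job and its fetched tasks."""

    job: Dict[str, Any]
    tasks: List[Dict[str, Any]]


def submit_and_wait_sketch(
    create_job: Callable[[], Dict[str, Any]],
    wait_for_job: Callable[[str], Dict[str, Any]],
    get_task: Callable[[str, str], Dict[str, Any]],
) -> JobResultSketch:
    created = create_job()
    job_id = created["job_id"]
    job = wait_for_job(job_id)
    if job["status"] == "error":
        raise JobFailedError(job_id)
    if job["status"] == "canceled":
        raise JobCanceledError(job_id)
    # Fetch tasks in the order the job-creation response listed them.
    tasks = [get_task(job_id, task_id) for task_id in created["task_ids"]]
    return JobResultSketch(job=job, tasks=tasks)
```

Checking the terminal status before fetching tasks avoids issuing per-task requests for a job that failed or was canceled.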

BASE_URL class-attribute instance-attribute

BASE_URL = 'https://q.scrape.do'

Base URL for the Scrape.do Async API

API_PATH class-attribute instance-attribute

API_PATH = '/api/v1'

API path prefix appended after BASE_URL

AsyncAPIEventHooks

Bases: TypedDict

Configuration dictionary for SDK-native Async-API lifecycle hooks

Differences From The Sync Client's Hooks

Modeled after SyncClientEventHooks, but adapted to the Async-API request lifecycle

poll
  • The poll event hooks are the only ones that receive a custom response model (JobDetails) instead of a raw httpx object

  • This is because all other hooks are executed for all endpoint methods and have distinct request / response structures, while poll hooks are only executed for /api/v1/jobs/{jobID} requests

poll instance-attribute

poll: List[Callable[[int, 'JobDetails'], None]]

Fires on each non-terminal polling iteration of wait_for_job or submit_and_wait. Receives the zero-indexed attempt counter and the latest JobDetails snapshot returned by get_job. Useful for surfacing polling progress

request instance-attribute

request: List[Callable[[Request], None]]

Fires immediately before each HTTP call leaves the client. Receives the prepared httpx.Request. Useful for logging the raw Async-API call about to be sent

response instance-attribute

response: List[Callable[[Response], None]]

Fires immediately after each HTTP call returns, before the status-code error routing in _raise_for_status runs. Receives the raw httpx.Response. Useful for logging every Async-API response (including ones that are about to be raised on)

retry instance-attribute

retry: List[
    Callable[
        [
            int,
            Request,
            Optional[Response],
            Optional[Exception],
        ],
        None,
    ]
]

Fires inside the execution loop ONLY when an Async-API gateway error (429 / 502 / 503 / 504) or an httpx.RequestError occurs and the SDK decides to retry. Receives the current attempt number, the prepared httpx.Request that was retried, and either the failed httpx.Response (when the gateway returned a retryable status) or the underlying Exception that caused the retry. Useful for tracking gateway instability
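
A hooks dictionary with the four keys above might be assembled as in the sketch below. EventHooksSketch is a simplified mirror of AsyncAPIEventHooks in which Any replaces the httpx and JobDetails types. Declaring it with total=False (so keys can be omitted) and the fire helper are assumptions made for illustration.

```python
from typing import Any, Callable, List, Optional, TypedDict


class EventHooksSketch(TypedDict, total=False):
    """Simplified mirror of AsyncAPIEventHooks; Any replaces the real types."""

    request: List[Callable[[Any], None]]
    response: List[Callable[[Any], None]]
    retry: List[Callable[[int, Any, Optional[Any], Optional[Exception]], None]]
    poll: List[Callable[[int, Any], None]]


events: List[str] = []


def log_retry(
    attempt: int, request: Any, response: Optional[Any], error: Optional[Exception]
) -> None:
    # Exactly one of response / error is expected to be set for a retry.
    cause = f"status {response}" if response is not None else f"error {error!r}"
    events.append(f"retry {attempt}: {cause}")


def log_poll(attempt: int, job: Any) -> None:
    events.append(f"poll {attempt}: {job}")


hooks: EventHooksSketch = {"retry": [log_retry], "poll": [log_poll]}


def fire(name: str, *args: Any) -> None:
    """Hypothetical dispatcher: run every hook registered under `name`."""
    for hook in hooks.get(name, []):
        hook(*args)
```

Each key holds a list, so several independent callbacks (for example logging and metrics) can observe the same lifecycle event.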

_ResponseModelT module-attribute

_ResponseModelT = TypeVar(
    "_ResponseModelT", bound=BaseModel
)

TypeVar bound to BaseModel. _parse_response uses it so that callers receive an instance of the concrete pydantic response model class they passed in