Client
Synchronous client for the Scrape.do Async API
Defines the ScrapeDoAsyncAPIClient, a synchronous
wrapper over httpx.Client configured against q.scrape.do
Endpoint Mapping
- POST /api/v1/jobs → create_job
- GET /api/v1/jobs/{jobID} → get_job
- GET /api/v1/jobs/{jobID}/{taskID} → get_task
- GET /api/v1/jobs → list_jobs
- DELETE /api/v1/jobs/{jobID} → cancel_job
- GET /api/v1/me → get_user_info
_raise_for_status(resp)
Raises the relevant Async-API Exception based on the status code of a non-2xx
httpx.Response object
Shared Helper
Both the ScrapeDoAsyncAPIClient and
the AsyncScrapeDoAsyncAPIClient clients
use this function
Exception Mapping
- 400 → AsyncAPIBadRequestError
- 401 → AsyncAPIAuthError
- 404 → AsyncAPINotFoundError
- 406 → AsyncAPINotAcceptableError
- 429 → AsyncAPIRateLimitError
- 5xx → AsyncAPIServerError
- Unmapped → AsyncAPIResponseError
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| resp | Response | The raw response received from | required |
Raises:
| Type | Description |
|---|---|
| AsyncAPIResponseError | One of the typed subclasses corresponding to the response's status code |
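The Exception Mapping above can be sketched as a small lookup. This is only an illustration: the exception classes below are local stand-ins that reuse the documented names, and `error_for_status` is a hypothetical helper, not the SDK's `_raise_for_status`, which takes an `httpx.Response` and raises directly.

```python
class AsyncAPIResponseError(Exception):
    """Local stand-in for the SDK's base response error."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"Async API returned HTTP {status_code}")
        self.status_code = status_code


class AsyncAPIBadRequestError(AsyncAPIResponseError):
    pass


class AsyncAPIAuthError(AsyncAPIResponseError):
    pass


class AsyncAPINotFoundError(AsyncAPIResponseError):
    pass


class AsyncAPINotAcceptableError(AsyncAPIResponseError):
    pass


class AsyncAPIRateLimitError(AsyncAPIResponseError):
    pass


class AsyncAPIServerError(AsyncAPIResponseError):
    pass


# Exact-match statuses taken from the Exception Mapping list.
_EXACT_STATUS_ERRORS = {
    400: AsyncAPIBadRequestError,
    401: AsyncAPIAuthError,
    404: AsyncAPINotFoundError,
    406: AsyncAPINotAcceptableError,
    429: AsyncAPIRateLimitError,
}


def error_for_status(status_code: int) -> AsyncAPIResponseError:
    """Return (not raise) the exception instance matching a non-2xx status."""
    if 500 <= status_code <= 599:
        return AsyncAPIServerError(status_code)
    # Anything without a dedicated class falls back to the base error.
    error_cls = _EXACT_STATUS_ERRORS.get(status_code, AsyncAPIResponseError)
    return error_cls(status_code)
```

Because every typed error in this sketch subclasses the base class, a caller can catch `AsyncAPIResponseError` to handle all of them at once.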
_parse_response(resp, model_cls)
Parses the raw JSON body returned by a request to q.scrape.do into
one of the endpoint-specific Async-API Pydantic response models: the body
is decoded with httpx.Response.json and the result is passed to
model_cls.model_validate
Shared Helper
Both the ScrapeDoAsyncAPIClient and
the AsyncScrapeDoAsyncAPIClient clients
use this function
AsyncAPIUnparsableResponseError
This function raises the AsyncAPIUnparsableResponseError
exception whenever a JSONDecodeError or a ValidationError occurs while parsing
the raw httpx.Response into a pydantic response
model
Why Wrap These Errors
- q.scrape.do ordinarily returns JSON bodies that match the documented schema for each endpoint
- Server-side incidents can leak raw HTML or a malformed payload through to the client. In that case, resp.json() would raise JSONDecodeError
- An unexpected change to the JSON schema returned by one of the endpoints would cause model_cls.model_validate to raise ValidationError
- Either failure is surfaced as an AsyncAPIUnparsableResponseError so callers see a single, consistent exception type for the cases in which the gateway returned something that the SDK couldn't parse
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| resp | Response | The 2xx response to parse | required |
| model_cls | Type[_ResponseModelT] | The pydantic model class to validate against | required |
Returns:
| Type | Description |
|---|---|
| _ResponseModelT | The validated |
Raises:
| Type | Description |
|---|---|
| AsyncAPIUnparsableResponseError | If |
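The error-wrapping idea described under "Why Wrap These Errors" can be sketched with the standard library alone. Everything here is an assumption made for illustration: `parse_body` and `validate_job` are hypothetical names, the `jobID` key is a made-up field, and a plain `ValueError` stands in for pydantic's `ValidationError`.

```python
import json
from typing import Any, Callable, TypeVar

T = TypeVar("T")


class AsyncAPIUnparsableResponseError(Exception):
    """Local stand-in for the SDK exception of the same name."""


def parse_body(raw_body: str, validate: Callable[[Any], T]) -> T:
    """Decode a JSON body and validate it, surfacing one exception type."""
    try:
        payload = json.loads(raw_body)
        return validate(payload)
    # json.JSONDecodeError is itself a ValueError subclass; both are listed
    # to make the two failure modes explicit.
    except (json.JSONDecodeError, ValueError) as exc:
        raise AsyncAPIUnparsableResponseError(str(exc)) from exc


def validate_job(payload: Any) -> dict:
    """Hypothetical validator: requires a dict with a 'jobID' key."""
    if not isinstance(payload, dict) or "jobID" not in payload:
        raise ValueError("payload does not match the expected schema")
    return payload
```

Chaining with `from exc` keeps the original decoding or validation error available on `__cause__` for debugging.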
_build_job_creation_request(**kwargs)
Builds a JobCreationRequest from
keyword arguments typed by the JobCreationRequestDict
TypedDict
Shared Helper
Both the ScrapeDoAsyncAPIClient and
the AsyncScrapeDoAsyncAPIClient clients
use this function
Render Field Auto-Collection
Every key in RENDER_PARAMETER_FIELDS
found in kwargs is collected into a RenderParameters
instance and assigned to JobCreationRequest.render
Mixed Render Configuration
- Providing a pre-built RenderParameters instance via kwargs["render"] AND any key in RENDER_PARAMETER_FIELDS at the same time raises a ValueError
- The two configuration styles are mutually exclusive
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **kwargs | Unpack[JobCreationRequestDict] | Any subset of the | {} |
Returns:
| Type | Description |
|---|---|
| JobCreationRequest | The validated |
Raises:
| Type | Description |
|---|---|
| ValueError | If both a pre-built |
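A minimal sketch of the render field auto-collection and the mixed-configuration check, using plain dictionaries in place of the pydantic models. The field names in `RENDER_PARAMETER_FIELDS` below (`wait_until`, `block_resources`) and the `targets` key are hypothetical placeholders, and `split_render_fields` is an illustrative helper, not the SDK function.

```python
from typing import Any, Dict

# Hypothetical field names; the real RENDER_PARAMETER_FIELDS is defined by the SDK.
RENDER_PARAMETER_FIELDS = frozenset({"wait_until", "block_resources"})


def split_render_fields(**kwargs: Any) -> Dict[str, Any]:
    """Move flat render keys under a single 'render' entry."""
    flat_render = {k: v for k, v in kwargs.items() if k in RENDER_PARAMETER_FIELDS}
    remaining = {k: v for k, v in kwargs.items() if k not in RENDER_PARAMETER_FIELDS}

    # The two configuration styles are mutually exclusive.
    if flat_render and remaining.get("render") is not None:
        raise ValueError(
            "Pass either a pre-built 'render' value or flat render fields, not both"
        )

    if flat_render:
        remaining["render"] = flat_render
    return remaining
```

Rejecting the mixed case up front avoids having to decide which of the two render configurations should win.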
ScrapeDoAsyncAPIClient
Synchronous client for the Scrape.do Async API on q.scrape.do
Aims to facilitate interactions with the Scrape.do Async API by managing
an httpx.Client instance to provide strict type-checking for request
parameters, endpoint-specific methods, automatic polling, and custom error
parsing while keeping the network configurations as flexible as possible
Features
- Pre-flight payload validation via the JobCreationRequest and JobListQueryParameters models
- Status code error routing to specific exceptions (400/401/404/406/429/5xx)
- Customisable retry intervals on transient gateway errors
- Polling helpers (wait_for_job and submit_and_wait) with either a built-in PollingStrategy or a user-supplied PollingFunction
Concurrency Limit and Server Errors
This client intercepts and manages Scrape.do's Async API specific
gateway errors (429 / 502 / 503 / 504),
automatically applying a customisable retry strategy before the error
can reach the application
SDK Event Hooks (event_hooks)
This client implements SDK-specific event hooks mimicking the
structure of httpx native event hooks. See
AsyncAPIEventHooks
for available lifecycle hooks and their required signatures
Additional httpx.Client Configuration
The following httpx.Client parameters can be provided as keyword
arguments and will be passed directly to the underlying object
- verify
- cert
- http1
- http2
- timeout
- limits
- transport
- default_encoding
Additionally, the following httpx.Client.request parameters can be
provided as keyword arguments during request execution
- timeout (r_timeout)
- extensions
For more information on their behaviour and default values, please
consult the official
httpx documentation
Unsupported HTTPX Client Arguments
The underlying httpx.Client object is strictly managed by the
instance to prevent invalid configurations from being sent to the
Scrape.do Async API. For this reason, arguments not listed in the
previous section are intentionally blocked and shouldn't be changed
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| api_token | Optional[str] | The Scrape.do API key. If omitted, falls back to the | None |
| max_retries | int | Maximum retry attempts on transient gateway errors ( | 3 |
| retry_backoff | Optional[Union[float, Callable[[int], float]]] | The strategy used to calculate the delay between retries. Can be a static | None |
| event_hooks | Optional[AsyncAPIEventHooks] | A dictionary of SDK-native hooks to execute during different points of the Async-API request lifecycle | None |
| verify | Union[SSLContext, str, bool] | Configures SSL certificate verification. Defaults to True (secure) | True |
| cert | Optional[CertTypes] | Client-side certificates for mutual TLS authentication | None |
| http1 | bool | Enable HTTP/1.1 | True |
| http2 | bool | Enable HTTP/2 multiplexing | False |
| timeout | TimeoutTypes | Default timeout in seconds applied to every network phase | 60.0 |
| limits | Limits | Connection pool limits | DEFAULT_LIMITS |
| transport | Optional[BaseTransport] | Custom transport engine | None |
| default_encoding | Union[str, Callable[[bytes], str]] | The fallback text encoding used if a target website omits a charset header | 'utf-8' |
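Since retry_backoff is typed as either a float or a Callable[[int], float], one plausible callable is a capped exponential backoff. The function below is a sketch: its name, the `base` and `cap` parameters, and their default values are assumptions chosen for illustration, and it treats `attempt` as zero-indexed.

```python
def exponential_backoff(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Return base * 2**attempt seconds, never exceeding cap."""
    return min(cap, base * (2 ** attempt))
```

With the defaults above the delays grow as 0.5, 1.0, 2.0, 4.0 seconds and stop growing at 30 seconds. A function with this shape could presumably be supplied as the retry_backoff argument, since it matches the documented Callable[[int], float] type.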
close()
Closes the underlying HTTPX connection pool.
It is recommended to use the client as a context manager to ensure resources are released automatically.
__enter__()
Initializes the HTTPX connection pool and returns the context manager object
Returns:
| Type | Description |
|---|---|
| Self | The |
__exit__(exc_type, exc_val, exc_tb)
Calls the close method to close the underlying HTTPX
connection pool without swallowing any exceptions
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| exc_type | Optional[Type[BaseException]] | The type of the exception | required |
| exc_val | Optional[BaseException] | The instance of the exception | required |
| exc_tb | Optional[TracebackType] | The traceback information | required |
Returns:
| Type | Description |
|---|---|
| Literal[False] | |
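The close / __enter__ / __exit__ contract described above follows the standard Python context-manager protocol. The sketch below uses a hypothetical `ManagedResource` class with a boolean flag in place of a real connection pool, to show that returning False from __exit__ releases the resource without swallowing the exception.

```python
from types import TracebackType
from typing import Optional, Type


class ManagedResource:
    """Hypothetical stand-in for a client that owns a connection pool."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        # A real client would close its connection pool here.
        self.closed = True

    def __enter__(self) -> "ManagedResource":
        return self

    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc_val: Optional[BaseException],
        exc_tb: Optional[TracebackType],
    ) -> bool:
        self.close()
        # Returning False tells Python to re-raise any in-flight exception.
        return False
```

Used in a `with` block, the resource is closed on exit whether or not the body raised.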
_sleep(attempt)
Sleeps for the duration dictated by self.retry_backoff
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| attempt | int | The current zero-indexed attempt number | required |
_request(method, path, *, json_body=None, params=None, r_timeout=USE_CLIENT_DEFAULT, extensions=None)
Sends an HTTP request to a Scrape.do Async API endpoint with
retry on transient gateway errors
Usage
This method is used internally by all of the client's endpoint-specific methods
Execution
- Applies the customisable retry_backoff strategy on retryable statuses
- Fires the configured AsyncAPIEventHooks (request / response / retry)
- Uses the _raise_for_status function to raise exceptions on network and API response errors, ensuring that the returned httpx.Response is successful
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| method | HttpMethod | HTTP method | required |
| path | str | Endpoint path relative to | required |
| json_body | Optional[Any] | Optional JSON body for | None |
| params | Optional[QueryParamsType] | Optional query parameters for | None |
| r_timeout | Union[TimeoutTypes, UseClientDefault] | A request-specific timeout override | USE_CLIENT_DEFAULT |
| extensions | Optional[RequestExtensions] | Advanced HTTPX extensions for this specific request | None |
Returns:
| Type | Description |
|---|---|
| Response | The successful |
Raises:
| Type | Description |
|---|---|
| AsyncAPIResponseError | Any typed Async-API error routed by status code |
| APIConnectionError | If the underlying network transport fails for |
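The retry behaviour on transient gateway statuses (429 / 502 / 503 / 504) can be sketched as a plain loop. This is not the SDK's _request implementation: `send_with_retry`, its `send` callable returning a bare status code, and the injectable `sleep` are all assumptions made to keep the example self-contained, and retrying on network exceptions is left out.

```python
import time
from typing import Callable, Optional

RETRYABLE_STATUSES = frozenset({429, 502, 503, 504})


def send_with_retry(
    send: Callable[[], int],
    max_retries: int = 3,
    backoff: Callable[[int], float] = lambda attempt: 0.0,
    sleep: Callable[[float], None] = time.sleep,
    on_retry: Optional[Callable[[int, int], None]] = None,
) -> int:
    """Call send() until it returns a non-retryable status or retries run out."""
    attempt = 0
    while True:
        status = send()
        # Stop on any non-retryable status, or once the retry budget is spent.
        if status not in RETRYABLE_STATUSES or attempt >= max_retries:
            return status
        if on_retry is not None:
            on_retry(attempt, status)  # plays the role of a "retry" hook
        sleep(backoff(attempt))
        attempt += 1
```

In this sketch max_retries counts retries rather than total calls, so max_retries=3 allows up to four calls to `send`. The final status is returned as-is, and a real client would then route it through its status-code error mapping.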
create_job(request=None, *, r_timeout=USE_CLIENT_DEFAULT, extensions=None, **job_kwargs)
Creates a new Async API job
Parameter Configuration
This method provides smart routing based on the arguments provided. You can configure the request in two ways
- job_kwargs → Build a JobCreationRequest implicitly by passing the keyword-arguments accepted by the JobCreationRequestDict TypedDict
- Pre-Built Parameters → Pass a validated JobCreationRequest instance directly to the request argument
job_kwargs Additional Configuration
Since JobCreationRequest accepts
a Nested Pydantic Model for its
render attribute, job_kwargs offers two ways to configure it
- Implicit Construction → Pass every field accepted by the nested model as a flat keyword-argument
- Explicit Construction → Pass a validated RenderParameters instance to the render keyword-argument
Parameter Restrictions
To prevent silent overwrites and routing ambiguity, the client enforces that only one of the parameter configurations can be used at a time.
- When using the Pre-Built Parameters configuration, passing any job_kwargs keyword-argument will raise a ValueError
- When using the job_kwargs configuration, passing a JobCreationRequest to the request argument will raise a ValueError
- When using the job_kwargs configuration, providing a pre-built RenderParameters instance via job_kwargs["render"] AND any other job_kwargs keyword-argument in RENDER_PARAMETER_FIELDS at the same time raises a ValueError
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| request | Optional[JobCreationRequest] | Pre-built job creation body. Mutually exclusive with | None |
| r_timeout | Union[TimeoutTypes, UseClientDefault] | A request-specific timeout override | USE_CLIENT_DEFAULT |
| extensions | Optional[RequestExtensions] | Advanced HTTPX extensions | None |
| **job_kwargs | Unpack[JobCreationRequestDict] | Flat kwargs-based configuration | {} |
Returns:
| Type | Description |
|---|---|
| JobCreationResponse | The parsed |
Raises:
| Type | Description |
|---|---|
| ValueError | If both |
| AsyncAPIBadRequestError | On |
| AsyncAPIAuthError | On |
| AsyncAPIRateLimitError | On |
| AsyncAPIServerError | On |
| AsyncAPIUnparsableResponseError | If the SDK can't parse a successful response to |
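The "only one configuration at a time" rule can be illustrated with a small guard. `resolve_request` is a hypothetical helper that uses dictionaries instead of the pydantic models, and the `targets` key is a made-up field. The text above does not say what happens when neither a pre-built request nor keyword arguments are given, so this sketch simply returns an empty dict in that case.

```python
from typing import Any, Dict, Optional


def resolve_request(
    request: Optional[Dict[str, Any]] = None, **job_kwargs: Any
) -> Dict[str, Any]:
    """Return the single request body chosen by the caller."""
    # Mixing both styles is ambiguous, so it is rejected outright.
    if request is not None and job_kwargs:
        raise ValueError("Pass either `request` or keyword arguments, not both")
    if request is not None:
        return request
    return dict(job_kwargs)
```

Failing early here means a conflicting call never reaches the network.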
get_job(job_id, *, r_timeout=USE_CLIENT_DEFAULT, extensions=None)
Fetches the current state of an Async API job
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| job_id | str | UUID of the job to fetch | required |
| r_timeout | Union[TimeoutTypes, UseClientDefault] | A request-specific timeout override | USE_CLIENT_DEFAULT |
| extensions | Optional[RequestExtensions] | Advanced HTTPX extensions | None |
Returns:
| Type | Description |
|---|---|
| JobDetails | The parsed |
Raises:
| Type | Description |
|---|---|
| AsyncAPINotFoundError | If the job doesn't exist or has expired |
| AsyncAPIAuthError | On |
| AsyncAPIServerError | On |
| AsyncAPIUnparsableResponseError | If the SDK can't parse a successful response to |
list_jobs(query=None, *, r_timeout=USE_CLIENT_DEFAULT, extensions=None, **query_kwargs)
Lists Async API jobs filtered / sorted by query
Parameter Configuration
This method provides smart routing based on the arguments provided. You can configure the request in two ways
- query_kwargs → Build a JobListQueryParameters implicitly by passing the keyword-arguments accepted by the JobListQueryParametersDict TypedDict
- Pre-Built Parameters → Pass a validated JobListQueryParameters instance directly to the query argument
Parameter Restrictions
To prevent silent overwrites and routing ambiguity, the client enforces that only one of the parameter configurations can be used at a time.
- When using the Pre-Built Parameters configuration, passing any query_kwargs keyword-argument will raise a ValueError
- When using the query_kwargs configuration, passing a JobListQueryParameters to the query argument will raise a ValueError
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | Optional[JobListQueryParameters] | Pre-built filter / sort / pagination shape. Mutually exclusive with | None |
| r_timeout | Union[TimeoutTypes, UseClientDefault] | A request-specific timeout override | USE_CLIENT_DEFAULT |
| extensions | Optional[RequestExtensions] | Advanced HTTPX extensions | None |
| **query_kwargs | Unpack[JobListQueryParametersDict] | Flat kwargs-based configuration | {} |
Returns:
| Type | Description |
|---|---|
| JobsListResponse | The parsed |
Raises:
| Type | Description |
|---|---|
| ValueError | If both |
| AsyncAPIServerError | On |
| AsyncAPIUnparsableResponseError | If the SDK can't parse a successful response to |
get_task(job_id, task_id, *, r_timeout=USE_CLIENT_DEFAULT, extensions=None)
Fetches the full details of a single task within a job
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| job_id | str | UUID of the parent job | required |
| task_id | str | UUID of the task to fetch | required |
| r_timeout | Union[TimeoutTypes, UseClientDefault] | A request-specific timeout override | USE_CLIENT_DEFAULT |
| extensions | Optional[RequestExtensions] | Advanced HTTPX extensions | None |
Returns:
| Type | Description |
|---|---|
| TaskDetails | The parsed |
Raises:
| Type | Description |
|---|---|
| AsyncAPINotFoundError | If the job / task doesn't exist or has expired |
| AsyncAPIServerError | On |
| AsyncAPIUnparsableResponseError | If the SDK can't parse a successful response to |
cancel_job(job_id, *, r_timeout=USE_CLIENT_DEFAULT, extensions=None)
Cancels an in-flight Async API job
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| job_id | str | UUID of the job to cancel | required |
| r_timeout | Union[TimeoutTypes, UseClientDefault] | A request-specific timeout override | USE_CLIENT_DEFAULT |
| extensions | Optional[RequestExtensions] | Advanced HTTPX extensions | None |
Returns:
| Type | Description |
|---|---|
| CancelJobResponse | The parsed |
Raises:
| Type | Description |
|---|---|
| AsyncAPINotFoundError | If the job doesn't exist or has expired |
| AsyncAPINotAcceptableError | If the job is already in a terminal state and can no longer be canceled |
| AsyncAPIServerError | On |
| AsyncAPIUnparsableResponseError | If the SDK can't parse a successful response to |
get_user_info(*, r_timeout=USE_CLIENT_DEFAULT, extensions=None)
Fetches the current user / account information
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| r_timeout | Union[TimeoutTypes, UseClientDefault] | A request-specific timeout override | USE_CLIENT_DEFAULT |
| extensions | Optional[RequestExtensions] | Advanced HTTPX extensions | None |
Returns:
| Type | Description |
|---|---|
| UserInformation | The parsed |
Raises:
| Type | Description |
|---|---|
| AsyncAPIAuthError | On |
| AsyncAPIServerError | On |
| AsyncAPIUnparsableResponseError | If the SDK can't parse a successful response to |
wait_for_job(job_id, *, strategy=None, r_timeout=USE_CLIENT_DEFAULT, extensions=None)
Polls a job until it reaches a terminal status
Strategy Argument
- None (default) → Uses PollingStrategy() with its documented defaults
- Custom PollingStrategy Instance → Uses the provided PollingStrategy instance with its custom configuration
- PollingFunction → Uses the provided callable to calculate sleep times between attempts and decide whether or not to stop polling before the job reaches a terminal status
Additional strategy Information
- For more information on how the default polling strategy works and how to customise it, see the PollingStrategy docstring
- For more information on how to define a custom polling function, see the PollingFunction docstring
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| job_id | str | UUID of the job to poll | required |
| strategy | Optional[Union[PollingStrategy, PollingFunction]] | How to poll for the job | None |
| r_timeout | Union[TimeoutTypes, UseClientDefault] | A request-specific timeout override applied to every | USE_CLIENT_DEFAULT |
| extensions | Optional[RequestExtensions] | Advanced HTTPX extensions applied to every | None |
Returns:
| Type | Description |
|---|---|
| JobDetails | The terminal |
Raises:
| Type | Description |
|---|---|
| JobTimeoutError | If |
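A polling loop driven by a user-supplied function might look like the sketch below. Several things are assumed because this page does not specify them: the terminal status strings, the `next_delay(attempt, status)` signature (returning a delay in seconds, or None to give up), and raising a stand-in JobTimeoutError when polling stops early. The real PollingFunction contract is described in its own docstring.

```python
import time
from typing import Callable, Optional

# Hypothetical terminal status values, chosen for illustration only.
TERMINAL_STATUSES = frozenset({"success", "failed", "canceled"})


class JobTimeoutError(Exception):
    """Local stand-in for the SDK exception of the same name."""


def wait_until_terminal(
    fetch_status: Callable[[], str],
    next_delay: Callable[[int, str], Optional[float]],
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until it reports a terminal status."""
    attempt = 0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        # The polling function decides how long to wait, or whether to stop.
        delay = next_delay(attempt, status)
        if delay is None:
            raise JobTimeoutError(f"gave up after {attempt + 1} polls")
        sleep(delay)
        attempt += 1
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.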
submit_and_wait(request=None, *, strategy=None, r_timeout=USE_CLIENT_DEFAULT, extensions=None, **job_kwargs)
Submits a job, polls until terminal, and fetches every task
Parameter Configuration
- This method reuses the same smart-routing as the client's job-creation method
- For more information, see the create_job method's docstring
Polling Configuration
- This method passes the strategy argument unchanged to the client's polling helper
- For more information, see the wait_for_job method's docstring
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| request | Optional[JobCreationRequest] | Pre-built job creation body. Mutually exclusive with | None |
| strategy | Optional[Union[PollingStrategy, PollingFunction]] | How to poll for the job | None |
| r_timeout | Union[TimeoutTypes, UseClientDefault] | A request-specific timeout override applied to every underlying HTTP call (create / poll / fetch) | USE_CLIENT_DEFAULT |
| extensions | Optional[RequestExtensions] | Advanced HTTPX extensions applied to every underlying HTTP call | None |
| **job_kwargs | Unpack[JobCreationRequestDict] | Flat kwargs-based configuration | {} |
Returns:
| Type | Description |
|---|---|
| JobResult | A |
Raises:
| Type | Description |
|---|---|
| ValueError | If both |
| JobFailedError | If the job reaches terminal status |
| JobCanceledError | If the job reaches terminal status |
| JobTimeoutError | If |
BASE_URL
class-attribute
instance-attribute
Base URL for the Scrape.do Async API
API_PATH
class-attribute
instance-attribute
API path prefix appended after BASE_URL
AsyncAPIEventHooks
Bases: TypedDict
Configuration dictionary for SDK-native Async-API lifecycle hooks
Differences From The Sync Client's Hooks
Modeled after SyncClientEventHooks, but adapted to the
Async-API request lifecycle
poll
- The poll event hooks are the only ones that receive a custom response model (JobDetails) instead of a raw httpx object
- This is because all other hooks are executed for all endpoint methods and have distinct request / response structures, while poll hooks are only executed for /api/v1/jobs/{jobID} requests
poll
instance-attribute
Fires on each non-terminal polling iteration of wait_for_job or
submit_and_wait. Receives the zero-indexed attempt counter and
the latest JobDetails snapshot returned
by get_job. Useful for surfacing polling progress
request
instance-attribute
Fires immediately before each HTTP call leaves the client.
Receives the prepared httpx.Request. Useful for logging the raw
Async-API call about to be sent
response
instance-attribute
Fires immediately after each HTTP call returns, before the
status-code error routing in _raise_for_status runs. Receives
the raw httpx.Response. Useful for logging every Async-API
response (including ones that are about to be raised on)
retry
instance-attribute
Fires inside the execution loop ONLY when an Async-API gateway
error (429 / 502 / 503 / 504) or an httpx.RequestError
occurs and the SDK decides to retry. Receives the current attempt
number, the prepared httpx.Request that was retried, and either
the failed httpx.Response (when the gateway returned a retryable
status) or the underlying Exception that caused the retry. Useful
for tracking gateway instability
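The four hook keys described above (poll, request, response, retry) could be dispatched roughly as follows. This sketch assumes each key maps to a list of callables, as httpx's own event_hooks does. The `fire_hooks` helper, the `HookMap` alias, and the plain strings and integers passed in place of httpx objects are all hypothetical.

```python
from typing import Any, Callable, Dict, List

HookMap = Dict[str, List[Callable[..., None]]]


def fire_hooks(hooks: HookMap, event: str, *args: Any) -> None:
    """Call every hook registered for an event, in registration order."""
    for hook in hooks.get(event, []):
        hook(*args)


events: List[str] = []
hooks: HookMap = {
    "request": [lambda request: events.append(f"request {request}")],
    # Retry hooks receive the attempt number, the request, and the cause.
    "retry": [
        lambda attempt, request, cause: events.append(
            f"retry #{attempt} after {cause}"
        )
    ],
}

fire_hooks(hooks, "request", "POST /api/v1/jobs")
fire_hooks(hooks, "retry", 0, "POST /api/v1/jobs", 503)
fire_hooks(hooks, "poll", 0, {"status": "running"})  # no hook registered: no-op
```

Events without registered hooks are skipped, so a caller only needs to supply the keys it cares about.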
_ResponseModelT
module-attribute
TypeVar bound to BaseModel used by
_parse_response to
propagate the concrete pydantic response model that it returns to the
callers