Enable Web Search and Scraping

Your Pusaka agent answers questions based on the knowledge base you give it. But sometimes the information a user needs simply does not exist in your documents — it lives on the open internet, or on a specific website that changes every day. The Web Search and Web Scraping features let your agent reach beyond its static knowledge base and pull live content from the web during a conversation.

What Is Web Search?

Web Search gives your agent the ability to run a real-time internet search during a conversation. When a user asks a question that the agent cannot answer from its knowledge base, the agent can silently query a search engine, retrieve a ranked list of results, and use those results to formulate a grounded, up-to-date response — all within the same chat turn.

Think of it as giving your agent access to a search engine it consults automatically, without the user ever having to leave the conversation.

When to use Web Search

Your agent needs to answer questions about current events, prices, or anything that changes over time
You want to reduce the burden of keeping your knowledge base up to date
Your agent handles a broad set of topics and you cannot anticipate every question

What Is Web Scraping?

Web Scraping goes one step further. Instead of returning a list of search-result snippets, it fetches the full content of a specific webpage and converts it into clean, readable markdown. The agent then uses that extracted content to answer the user's question in detail.

This is especially useful when a user shares a URL or when your agent needs to read an actual page — a product listing, a news article, a documentation page — rather than just knowing that the page exists.

When to use Web Scraping

Users share URLs and expect the agent to summarise or answer questions about that page
You want the agent to read the full text of a page, not just a search-result excerpt
Your agent needs to process structured pages like documentation, product specs, or blog posts

How the Two Features Compare

	Web Search	Web Scraping
What it retrieves	A ranked list of result snippets from a search engine	The full text content of a specific webpage
Triggered by	A question the agent cannot answer from its knowledge base	A URL shared in conversation, or a need for full-page content
Output format	Search result summaries	Markdown-formatted page content
Domain restriction	Whitelist up to 20 domains	Exclude specific paths/file types
Image handling	Not applicable	Configurable size filters
Best for	Real-time factual queries	Deep reading of a specific page

Both features can be active at the same time and complement each other in practice.

Benefits

Web Search benefits

Always up to date — The agent surfaces the latest information without requiring manual knowledge base updates
Broader coverage — Handles questions outside the scope of your curated documents
Source transparency — Responses can include references to where information was found
Domain filtering — You can restrict searches to trusted domains only, keeping answers on-brand and reliable

Web Scraping benefits

Full-content understanding — The agent reads the actual page, not just a summary
URL-aware conversations — Users can paste a link and immediately ask the agent to explain or summarise it
Markdown precision — Content is extracted as clean structured text, reducing noise from ads and navigation elements
Image control — You can include or exclude images and filter by file size to avoid downloading irrelevant assets

Risks and Considerations

Note: Both features make outbound requests to external websites at conversation time. Understand the implications before enabling them in production.

Web Search risks

Risk	Description
Inaccurate sources	Search results may include pages with incorrect or outdated information. Mitigate this by whitelisting only trusted domains.
Off-topic results	Without a domain whitelist, the agent may draw on content that is irrelevant to your use case.
Latency	Each search adds a network round-trip to the response time. Keep Max Web Search Results (`maxWebSearchResults`) low to minimise this.
Data leakage	The user's query is sent to a search engine. Avoid enabling Web Search for agents that process confidential or personally identifiable information.

Web Scraping risks

Risk	Description
Terms of service	Some websites prohibit automated scraping. Ensure the sites your agent accesses permit programmatic content retrieval.
Dynamic content	Pages built entirely with JavaScript may not render correctly within the configured timeout window.
Excessive payload	Large pages with many images can slow down responses. Use the image size filters and the ignored-paths list to control what gets fetched.
Sensitive content exposure	If a user provides a URL to an internal or sensitive page, the agent will attempt to read it. Consider your audience before enabling this feature.

Before You Start

Before configuring either feature, confirm the following:

You have an existing agent. If not, follow Build an Agent from Zero to Live first.
You are signed in as an Owner or Contributor of that agent. Viewers cannot access these settings.
You know which agent you want to configure. Open it from the agent list so it becomes the active agent.

Part 1 — Configure Web Search

Step 1 — Open Web Search Settings

In the left sidebar of your agent's detail page, scroll to the Automate section and click Websearch.

The page has two sections: Web Search Configuration and Domain Whitelist.

Step 2 — Enable Web Search

At the top of the Web Search Configuration section, you will see the Enable Web Search toggle. It is off by default.

Flip the toggle to the on position. The remaining controls on the page will become active.

Tip: If you want to test the feature without affecting real users, enable it on a test agent first.

Step 3 — Set Maximum Search Results

The Max Web Search Results field controls how many search results the agent retrieves per query. The default is 3.

Value	Effect
`1`	Fastest responses, but limited context for the agent to work with
`3` (default)	A good balance of speed and answer quality
Higher values	Richer context but slower responses due to additional data fetching

Adjust this number based on your tolerance for response latency and the complexity of questions your agent handles. For most use cases, keep the value between 3 and 5.

Note: This page also has an Allow User Control toggle. It is currently marked Coming Soon and cannot be enabled. When released, it will let end users turn web search on or off during a conversation.

Step 4 — Restrict to Specific Domains (Optional)

By default, Web Search queries the open internet. If you want to limit results to specific trusted websites, use the Domain Whitelist section.

Click Add New Domain to add an entry. Enter the bare domain name, with no protocol prefix (`https://` or `http://`), no leading `www.`, and no trailing slash.

✅ Correct format	❌ Incorrect format
`wikipedia.org`	`https://wikipedia.org`
`docs.microsoft.com`	`http://docs.microsoft.com/`
`support.qlar.com`	`www.support.qlar.com`

Domain names must contain at least one dot and use only letters, numbers, hyphens, underscores, and dots. You can add up to 20 domains.

When the whitelist is not empty, the agent restricts all searches to the listed domains only. Leaving the whitelist empty allows searches across the full internet.

To remove a domain, click the delete icon to the right of its entry.

Tip: Start with a small whitelist of two or three authoritative sources. You can always expand it later. A curated whitelist dramatically reduces the chance of the agent citing inaccurate or off-topic content.
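
The format rules in this step can be expressed as a small validator. The Python sketch below is illustrative only and is not Pusaka's actual validation code: the function names are hypothetical, rejecting a leading `www.` is inferred from the example table, and treating subdomains of a whitelisted domain as matches is an assumption.

```python
import re
from urllib.parse import urlparse

MAX_DOMAINS = 20  # limit stated in this step

# Letters, numbers, hyphens, underscores, and dots only (rule from this step).
_ALLOWED_CHARS = re.compile(r"[A-Za-z0-9._-]+")


def is_valid_domain(entry: str) -> bool:
    """Return True if `entry` looks like an acceptable whitelist entry."""
    if not _ALLOWED_CHARS.fullmatch(entry):
        return False  # rejects "https://", trailing slashes, spaces
    if "." not in entry:
        return False  # must contain at least one dot
    if entry.startswith("www."):
        return False  # assumption, inferred from the example table
    return True


def filter_by_whitelist(urls, whitelist):
    """Keep only URLs whose host is on the whitelist (hypothetical helper).

    An empty whitelist means no restriction, as described in this step.
    Subdomain matching is an assumption, not documented behaviour.
    """
    if not whitelist:
        return list(urls)
    domains = [d.lower() for d in whitelist[:MAX_DOMAINS]]
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in domains):
            kept.append(url)
    return kept
```

Running candidate entries through a check like `is_valid_domain` before pasting them into the Domain Whitelist can catch stray protocol prefixes and trailing slashes early.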

Step 5 — Save Your Web Search Settings

Click the Save button at the bottom of the page. Your changes take effect for all new conversations immediately.

Part 2 — Configure Web Scraping

Step 1 — Open Web Scraping Settings

In the left sidebar, still under the Automate section, click Webscraping.

The page is divided into three sections: Web Scraping Configuration, Image Size Filters, and Ignored Paths.

Step 2 — Enable Web Scraping

At the top of the Web Scraping Configuration section, flip the Enable Web Scraping toggle to on. All other controls will activate.

Step 3 — Set the Page Load Timeout

The Timeout (seconds) field sets the maximum time the agent will wait for a page to finish loading before giving up. The default is 30 seconds.

Scenario	Recommended value
Fast, text-heavy pages (documentation, blogs)	`15` – `20`
Standard pages	`30` (default)
Heavy pages with many resources	`45` – `60`

The minimum allowed value is 5 seconds and the maximum is 60 seconds. Setting the timeout too low will cause partial or failed fetches on slower pages. Setting it too high will increase response latency when a page is unreachable.
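
One way to picture the timeout is as a hard deadline wrapped around the page fetch. The sketch below assumes a hypothetical async `fetch` callable supplied by the caller; how Pusaka actually loads pages is not documented here, so treat this as an illustration of the trade-off rather than the real implementation.

```python
import asyncio

MIN_TIMEOUT_S = 5       # minimum stated in this step
MAX_TIMEOUT_S = 60      # maximum stated in this step
DEFAULT_TIMEOUT_S = 30  # default stated in this step


async def fetch_with_timeout(fetch, url, timeout_s=DEFAULT_TIMEOUT_S):
    """Await `fetch(url)` but give up after `timeout_s` seconds.

    `fetch` is a hypothetical coroutine function that returns page content.
    Returns None when the deadline passes before the page has loaded.
    """
    if not MIN_TIMEOUT_S <= timeout_s <= MAX_TIMEOUT_S:
        raise ValueError("timeout must be between 5 and 60 seconds")
    try:
        return await asyncio.wait_for(fetch(url), timeout=timeout_s)
    except asyncio.TimeoutError:
        # An unreachable page costs the full timeout before this line runs,
        # which is why a very high value increases worst-case latency.
        return None
```

With this shape, a low value cuts off slow pages early, while a high value makes every unreachable page cost the full wait.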

Step 4 — Configure Image Settings

Web Scraping can include or exclude images from the extracted markdown. This section has three controls.

Include Images toggle

When on, images found on the page are included in the markdown output. When off, all images are stripped. Disable this if your agent only needs the text content of pages — it reduces response size significantly.

Min Image Size (KB)

Images smaller than this threshold are excluded. The default is 5 KB. This filters out icons, tracking pixels, and other decorative micro-images that add noise without providing value.

Max Image Size (MB)

Images larger than this threshold are excluded. The default is 10 MB. This prevents the agent from downloading extremely large images that would slow down extraction without improving the quality of the response.

Setting	Min allowed	Max allowed	Default
Min Image Size	1 KB	10,240 KB (10 MB)	5 KB
Max Image Size	1 MB	50 MB	10 MB

Tip: If image content is not important for your use case, toggle Include Images off. It is the single most effective way to speed up web scraping responses.
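
The three image controls in this step combine into a single keep-or-drop decision per image. The following Python sketch shows one plausible reading of that logic; the class and function names are hypothetical, and it assumes 1 KB = 1,024 bytes (consistent with 10,240 KB = 10 MB in the table) and that an image exactly at a threshold is kept.

```python
from dataclasses import dataclass

KB = 1024
MB = 1024 * KB


@dataclass
class ImageSettings:
    include_images: bool   # the Include Images toggle
    min_size_kb: int = 5   # default Min Image Size from the table
    max_size_mb: int = 10  # default Max Image Size from the table

    def __post_init__(self):
        # Allowed ranges taken from the table in this step.
        if not 1 <= self.min_size_kb <= 10_240:
            raise ValueError("Min Image Size must be 1 to 10,240 KB")
        if not 1 <= self.max_size_mb <= 50:
            raise ValueError("Max Image Size must be 1 to 50 MB")


def keep_image(size_bytes: int, settings: ImageSettings) -> bool:
    """Return True if an image of this size would stay in the output."""
    if not settings.include_images:
        return False  # toggle off: all images are stripped
    if size_bytes < settings.min_size_kb * KB:
        return False  # icons, tracking pixels, decorative micro-images
    if size_bytes > settings.max_size_mb * MB:
        return False  # very large images
    return True
```

For example, with the defaults a 2 KB icon and an 11 MB photo would both be dropped, while a 200 KB diagram would be kept.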

Step 5 — Exclude Specific Paths (Optional)

The Ignored Paths section lets you define path patterns that the scraper should skip. This is useful for excluding static assets, media files, or sections of a site that are not useful for your agent.

Click Add New Path Pattern to add an entry. Each pattern can take one of these forms:

Pattern type	Example	What it excludes
Absolute path	`/icons`	All content under the `/icons/` path
Wildcard by extension	`*.svg`	All SVG files regardless of path
Specific filename	`logo.png`	Any file named `logo.png`

You can add up to 50 patterns. Each pattern must be no longer than 200 characters.

To remove a pattern, click the delete icon to the right of its entry.

Tip: Common patterns to exclude: *.svg, *.ico, *.woff, *.woff2, /cdn-cgi, /wp-content/uploads. These are typically assets that the agent does not need to read.
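
The three pattern types in this step could be matched roughly as follows. This is a sketch under assumptions, not the scraper's real matching code: the function names are hypothetical, absolute paths are treated as prefixes, and wildcard and filename patterns are compared against the last path segment only.

```python
from fnmatch import fnmatchcase
from posixpath import basename
from urllib.parse import urlparse

MAX_PATTERNS = 50         # limit stated in this step
MAX_PATTERN_LENGTH = 200  # limit stated in this step


def validate_patterns(patterns):
    """Raise ValueError if the list breaks the limits stated in this step."""
    if len(patterns) > MAX_PATTERNS:
        raise ValueError("at most 50 patterns are allowed")
    for pattern in patterns:
        if len(pattern) > MAX_PATTERN_LENGTH:
            raise ValueError("pattern longer than 200 characters")


def is_ignored(url: str, patterns) -> bool:
    """Return True if the URL's path matches any ignored-path pattern."""
    path = urlparse(url).path
    filename = basename(path)
    for pattern in patterns:
        if pattern.startswith("/"):
            # Absolute path: match the path itself and everything under it.
            prefix = pattern.rstrip("/")
            if path == prefix or path.startswith(prefix + "/"):
                return True
        elif fnmatchcase(filename, pattern):
            # Covers both "*.svg" wildcards and exact names like "logo.png".
            return True
    return False
```

Testing a draft pattern list against a handful of real URLs from the target site is a quick way to confirm that it skips assets without hiding pages the agent should read.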

Step 6 — Confirm and Save

Click the Save button at the bottom of the page.

Unlike Web Search, Web Scraping shows a confirmation dialog before applying your settings. Review the summary and click Confirm to proceed. This extra step exists because Web Scraping settings change how the agent processes external content, which can noticeably affect response quality and latency.

Your settings take effect immediately for all new conversations.

Using Both Features Together

Web Search and Web Scraping work independently but pair well in conversation flows:

1. A user asks a question the agent cannot answer from its knowledge base
2. Web Search finds the most relevant result URLs
3. Web Scraping fetches the full content of the top result
4. The agent synthesises a detailed, sourced answer from the full page text

Enabling both features together gives your agent the broadest possible access to live information. However, each feature adds latency and external dependency. Start with Web Search only, measure the impact on response times, and then layer in Web Scraping when you need deeper content extraction.
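
The combined flow described in this section can be sketched as a short pipeline. The Python below is a conceptual illustration only: `kb_lookup`, `web_search`, and `scrape` are hypothetical callables passed in by the caller, and nothing here reflects how the agent is implemented internally.

```python
from typing import Callable, List, Optional


def answer_context(
    question: str,
    kb_lookup: Callable[[str], Optional[str]],
    web_search: Callable[[str, int], List[str]],
    scrape: Callable[[str], str],
    max_results: int = 3,
) -> dict:
    """Choose the text the agent would answer from, following the flow above."""
    # Try the knowledge base first.
    kb_text = kb_lookup(question)
    if kb_text is not None:
        return {"source": "knowledge_base", "context": kb_text}

    # Web Search returns ranked result URLs.
    urls = web_search(question, max_results)
    if not urls:
        return {"source": None, "context": ""}

    # Web Scraping fetches the full content of the top result.
    top_url = urls[0]
    page_markdown = scrape(top_url)

    # The agent would then synthesise an answer from this text and cite top_url.
    return {"source": top_url, "context": page_markdown}


if __name__ == "__main__":
    # Stub callables stand in for the real features.
    result = answer_context(
        "What changed today?",
        kb_lookup=lambda q: None,
        web_search=lambda q, n: ["https://example.com/news"][:n],
        scrape=lambda url: "# News\nExample page text",
    )
    print(result["source"])
```

Each web-dependent stage in this sketch is a separate external call, which is why the advice above is to enable Web Search first and add Web Scraping only after measuring response times.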

Common Mistakes to Avoid

Mistake	Why it matters	Fix
Leaving the domain whitelist empty in a business context	The agent may cite unreliable sources	Add a curated list of 2–5 trusted domains
Setting Max Web Search Results above 5	Higher values increase latency noticeably	Keep it at 3–5 unless response depth is critical
Enabling Include Images without tightening the size filters	Large images slow extraction significantly	Set a Max Image Size of 2–5 MB for typical use
Using a timeout below 10 seconds	A short timeout causes frequent scraping failures on slower pages	Use 20–30 seconds as a baseline
Enabling Web Scraping for confidential conversations	Users can share any URL, including internal pages	Only enable scraping when your audience and content are appropriate
Not testing after configuration	Settings may not work as expected on all sites	Test with 3–5 representative URLs before going live

Next Steps

Automate with Plugins — Extend your agent with purpose-built plugins for common operations
Automate with Custom API — Connect your agent to any external system via REST
Observe Feedback and Logs — Monitor how your agent uses web content in real conversations
