
Enable Web Search and Scraping

Your Pusaka agent answers questions based on the knowledge base you give it. But sometimes the information a user needs simply does not exist in your documents — it lives on the open internet, or on a specific website that changes every day. The Web Search and Web Scraping features let your agent reach beyond its static knowledge base and pull live content from the web during a conversation.


What Is Web Search?

Web Search gives your agent the ability to run a real-time internet search during a conversation. When a user asks a question that the agent cannot answer from its knowledge base, the agent can silently query a search engine, retrieve a ranked list of results, and use those results to formulate a grounded, up-to-date response, all within the same chat turn.

Think of it as giving your agent access to a search engine it consults automatically, without the user ever having to leave the conversation.

When to use Web Search

  • Your agent needs to answer questions about current events, prices, or anything that changes over time
  • You want to reduce the burden of keeping your knowledge base up to date
  • Your agent handles a broad set of topics and you cannot anticipate every question

What Is Web Scraping?

Web Scraping goes one step further. Instead of returning a list of search-result snippets, it fetches the full content of a specific webpage and converts it into clean, readable markdown. The agent then uses that extracted content to answer the user's question in detail.

This is especially useful when a user shares a URL or when your agent needs to read an actual page — a product listing, a news article, a documentation page — rather than just knowing that the page exists.

When to use Web Scraping

  • Users share URLs and expect the agent to summarise or answer questions about that page
  • You want the agent to read the full text of a page, not just a search-result excerpt
  • Your agent needs to process structured pages like documentation, product specs, or blog posts

How the Two Features Compare

| | Web Search | Web Scraping |
| --- | --- | --- |
| What it retrieves | A ranked list of result snippets from a search engine | The full text content of a specific webpage |
| Triggered by | A question the agent cannot answer from its knowledge base | A URL shared in conversation, or a need for full-page content |
| Output format | Search result summaries | Markdown-formatted page content |
| Domain restriction | Whitelist up to 20 domains | Exclude specific paths/file types |
| Image handling | Not applicable | Configurable size filters |
| Best for | Real-time factual queries | Deep reading of a specific page |

Both features can be active at the same time and complement each other in practice.


Benefits

Web Search benefits

  • Always up to date — The agent surfaces the latest information without requiring manual knowledge base updates
  • Broader coverage — Handles questions outside the scope of your curated documents
  • Source transparency — Responses can include references to where information was found
  • Domain filtering — You can restrict searches to trusted domains only, keeping answers on-brand and reliable

Web Scraping benefits

  • Full-content understanding — The agent reads the actual page, not just a summary
  • URL-aware conversations — Users can paste a link and immediately ask the agent to explain or summarise it
  • Markdown precision — Content is extracted as clean structured text, reducing noise from ads and navigation elements
  • Image control — You can include or exclude images and filter by file size to avoid downloading irrelevant assets

Risks and Considerations

Note: Both features make outbound requests to external websites at conversation time. Understand the implications before enabling them in production.

Web Search risks

| Risk | Description |
| --- | --- |
| Inaccurate sources | Search results may include pages with incorrect or outdated information. Mitigate this by whitelisting only trusted domains. |
| Off-topic results | Without a domain whitelist, the agent may draw on content that is irrelevant to your use case. |
| Latency | Each search adds a network round-trip to the response time. Set maxWebSearchResults conservatively to minimise this. |
| Data leakage | The user's query is sent to a search engine. Avoid enabling Web Search for agents that process confidential or personally identifiable information. |

Web Scraping risks

| Risk | Description |
| --- | --- |
| Terms of service | Some websites prohibit automated scraping. Ensure the sites your agent accesses permit programmatic content retrieval. |
| Dynamic content | Pages built entirely with JavaScript may not render correctly within the configured timeout window. |
| Excessive payload | Large pages with many images can slow down responses. Use the image size filters and the ignored-paths list to control what gets fetched. |
| Sensitive content exposure | If a user provides a URL to an internal or sensitive page, the agent will attempt to read it. Consider your audience before enabling this feature. |

Before You Start

Before configuring either feature, confirm the following:

  • You have an existing agent. If not, follow Build an Agent from Zero to Live first.
  • You are signed in as an Owner or Contributor of that agent. Viewers cannot access these settings.
  • You know which agent you want to configure. Open it from the agent list so it becomes the active agent.

Part 1 — Configure Web Search

Step 1 — Open Web Search Settings

In the left sidebar of your agent's detail page, scroll to the Automate section and click Websearch.

The page has two sections: Web Search Configuration and Domain Whitelist.


Step 2 — Enable Web Search

At the top of the Web Search Configuration section, you will see the Enable Web Search toggle. It is off by default.

Flip the toggle to the on position. The remaining controls on the page will become active.

Tip: If you want to test the feature without affecting real users, enable it on a test agent first.


Step 3 — Set Maximum Search Results

The Max Web Search Results field controls how many search results the agent retrieves per query. The default is 3.

| Value | Effect |
| --- | --- |
| 1 | Fastest responses, but limited context for the agent to work with |
| 3 (default) | A good balance of speed and answer quality |
| Higher values | Richer context but slower responses due to additional data fetching |

Adjust this number based on your tolerance for response latency and the complexity of questions your agent handles. For most use cases, keep the value between 3 and 5.

Note: There is also an Allow User Control toggle on this page. It is currently marked Coming Soon and cannot be enabled. This feature will let end users toggle web search on or off during a conversation — it is not available yet.


Step 4 — Restrict to Specific Domains (Optional)

By default, Web Search queries the open internet. If you want to limit results to specific trusted websites, use the Domain Whitelist section.

Click Add New Domain to add an entry. Enter the domain name without a protocol prefix.

| ✅ Correct format | ❌ Incorrect format |
| --- | --- |
| wikipedia.org | https://wikipedia.org |
| docs.microsoft.com | http://docs.microsoft.com/ |
| support.qlar.com | www.support.qlar.com |

Domain names must contain at least one dot and use only letters, numbers, hyphens, underscores, and dots. You can add up to 20 domains.
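
The format rules above can be checked before you type entries into the form. The following is a minimal sketch, not Pusaka's actual validation logic: the function names are hypothetical, the regular expression additionally assumes that labels between dots are non-empty, and rejecting a leading "www." is an assumption drawn from the examples table.

```python
import re

MAX_DOMAINS = 20  # limit stated in this guide

# Letters, digits, hyphens, underscores and dots only, with at least one dot.
# Assumption: each label between dots must be non-empty.
_DOMAIN_RE = re.compile(r"^[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)+$")


def is_valid_domain_entry(entry: str) -> bool:
    """Return True if the entry looks like a bare domain such as 'wikipedia.org'."""
    # Assumption: a leading "www." is rejected, mirroring the examples table.
    if entry.lower().startswith("www."):
        return False
    # Protocol prefixes and trailing slashes fail here because ':' and '/'
    # are not in the allowed character set.
    return _DOMAIN_RE.fullmatch(entry) is not None


def invalid_entries(entries: list) -> list:
    """Return the entries that fail validation; raise if the list is too long."""
    if len(entries) > MAX_DOMAINS:
        raise ValueError(f"at most {MAX_DOMAINS} domains are allowed")
    return [e for e in entries if not is_valid_domain_entry(e)]
```

Calling `invalid_entries(["wikipedia.org", "https://wikipedia.org"])` would flag only the second entry, which is a quick way to clean a list before entering it.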

When the whitelist is not empty, the agent restricts all searches to the listed domains only. Leaving the whitelist empty allows searches across the full internet.
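
To make the whitelist behaviour concrete, here is a small sketch of how a result URL could be checked against the list. It is an illustration only: `is_allowed` is a hypothetical helper, and treating subdomains of a listed domain as allowed is an assumption that this guide does not confirm.

```python
from urllib.parse import urlsplit


def is_allowed(url: str, whitelist: list) -> bool:
    """Return True if the URL's host is permitted by the whitelist."""
    # An empty whitelist means the full internet is searchable.
    if not whitelist:
        return True
    host = (urlsplit(url).hostname or "").lower()
    for domain in whitelist:
        domain = domain.lower()
        # Assumption: subdomains of a listed domain are also accepted.
        if host == domain or host.endswith("." + domain):
            return True
    return False
```

With `["wikipedia.org"]` as the whitelist, a result from `en.wikipedia.org` passes under this sketch, while a result from an unrelated host is dropped.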

To remove a domain, click the delete icon to the right of its entry.

Tip: Start with a small whitelist of two or three authoritative sources. You can always expand it later. A curated whitelist dramatically reduces the chance of the agent citing inaccurate or off-topic content.


Step 5 — Save Your Web Search Settings

Click the Save button at the bottom of the page. Your changes take effect for all new conversations immediately.


Part 2 — Configure Web Scraping

Step 1 — Open Web Scraping Settings

In the left sidebar, still under the Automate section, click Webscraping.

The page is divided into three sections: Web Scraping Configuration, Image Size Filters, and Ignored Paths.


Step 2 — Enable Web Scraping

At the top of the Web Scraping Configuration section, flip the Enable Web Scraping toggle to on. All other controls will activate.


Step 3 — Set the Page Load Timeout

The Timeout (seconds) field sets the maximum time the agent will wait for a page to finish loading before giving up. The default is 30 seconds.

| Scenario | Recommended value |
| --- | --- |
| Fast, text-heavy pages (documentation, blogs) | 15–20 |
| Standard pages | 30 (default) |
| Heavy pages with many resources | 45–60 |

The minimum allowed value is 5 seconds and the maximum is 60 seconds. Setting the timeout too low will cause partial or failed fetches on slower pages. Setting it too high will increase response latency when a page is unreachable.
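
If you manage settings in a script or checklist before entering them, the documented range can be enforced with a few lines. This is a sketch under the limits stated above; `resolve_timeout` is a hypothetical name, and raising an error rather than silently clamping is a design choice of this example, not documented product behaviour.

```python
from typing import Optional

MIN_TIMEOUT_S = 5       # minimum stated in this guide
MAX_TIMEOUT_S = 60      # maximum stated in this guide
DEFAULT_TIMEOUT_S = 30  # default stated in this guide


def resolve_timeout(value: Optional[int] = None) -> int:
    """Return a timeout in seconds, falling back to the default when unset."""
    if value is None:
        return DEFAULT_TIMEOUT_S
    if not MIN_TIMEOUT_S <= value <= MAX_TIMEOUT_S:
        raise ValueError(
            f"timeout must be between {MIN_TIMEOUT_S} and {MAX_TIMEOUT_S} seconds"
        )
    return value
```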


Step 4 — Configure Image Settings

Web Scraping can include or exclude images from the extracted markdown. This section has three controls.

Include Images toggle

When on, images found on the page are included in the markdown output. When off, all images are stripped. Disable this if your agent only needs the text content of pages — it reduces response size significantly.

Min Image Size (KB)

Images smaller than this threshold are excluded. The default is 5 KB. This filters out icons, tracking pixels, and other decorative micro-images that add noise without providing value.

Max Image Size (MB)

Images larger than this threshold are excluded. The default is 10 MB. This prevents the agent from downloading extremely large images that would slow down extraction without improving the quality of the response.

| Setting | Min allowed | Max allowed | Default |
| --- | --- | --- | --- |
| Min Image Size | 1 KB | 10,240 KB (10 MB) | 5 KB |
| Max Image Size | 1 MB | 50 MB | 10 MB |
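
The three image controls combine into a single keep-or-drop decision per image. The sketch below illustrates that logic with the defaults from this section; `keep_image` is a hypothetical function, and the 1,024-based unit conversion is inferred from the table's "10,240 KB (10 MB)" entry.

```python
KB = 1024       # bytes per KB, consistent with 10,240 KB = 10 MB in the table
MB = 1024 * KB  # bytes per MB


def keep_image(
    size_bytes: int,
    include_images: bool = True,
    min_kb: int = 5,
    max_mb: int = 10,
) -> bool:
    """Return True if an image of this size would be kept in the output."""
    # With Include Images off, every image is stripped regardless of size.
    if not include_images:
        return False
    # Images smaller than the minimum or larger than the maximum are excluded;
    # sizes exactly at a threshold are kept.
    return min_kb * KB <= size_bytes <= max_mb * MB
```

Under the defaults, a 2 KB icon is dropped, a 300 KB product photo is kept, and a 12 MB banner is dropped.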

Tip: If image content is not important for your use case, toggle Include Images off. It is the single most effective way to speed up web scraping responses.


Step 5 — Exclude Specific Paths (Optional)

The Ignored Paths section lets you define path patterns that the scraper should skip. This is useful for excluding static assets, media files, or sections of a site that are not useful for your agent.

Click Add New Path Pattern to add an entry. Each pattern can take one of these forms:

| Pattern type | Example | What it excludes |
| --- | --- | --- |
| Absolute path | /icons | All content under the /icons/ path |
| Wildcard by extension | *.svg | All SVG files regardless of path |
| Specific filename | logo.png | Any file named logo.png |

You can add up to 50 patterns. Each pattern must be no longer than 200 characters.
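
One way to reason about the three pattern types is to test them against sample URLs before saving. The sketch below is an approximation built on Python's standard `fnmatch` module, not the scraper's real matcher; `is_ignored` and `check_patterns` are hypothetical names, and the exact matching rules (for example case sensitivity) are assumptions.

```python
from fnmatch import fnmatchcase
from posixpath import basename
from urllib.parse import urlsplit

MAX_PATTERNS = 50         # limit stated in this guide
MAX_PATTERN_LENGTH = 200  # limit stated in this guide


def check_patterns(patterns: list) -> None:
    """Raise ValueError if the pattern list breaks the documented limits."""
    if len(patterns) > MAX_PATTERNS:
        raise ValueError(f"at most {MAX_PATTERNS} patterns are allowed")
    for pattern in patterns:
        if len(pattern) > MAX_PATTERN_LENGTH:
            raise ValueError(f"pattern exceeds {MAX_PATTERN_LENGTH} characters")


def is_ignored(url: str, patterns: list) -> bool:
    """Return True if the URL matches any ignored-path pattern."""
    path = urlsplit(url).path
    name = basename(path)
    for pattern in patterns:
        if pattern.startswith("/"):
            # Absolute path: match the path itself or anything below it.
            prefix = pattern.rstrip("/")
            if path == prefix or path.startswith(prefix + "/"):
                return True
        elif fnmatchcase(name, pattern):
            # Wildcards such as *.svg and plain filenames such as logo.png
            # are compared against the last path segment only.
            return True
    return False
```

For instance, with `["/icons", "*.svg", "logo.png"]` this sketch skips `https://example.com/icons/home.png` and `https://example.com/static/logo.svg`, but still fetches `https://example.com/docs/page.html`.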

To remove a pattern, click the delete icon to the right of its entry.

Tip: Common patterns to exclude: *.svg, *.ico, *.woff, *.woff2, /cdn-cgi, /wp-content/uploads. These are typically assets that the agent does not need to read.


Step 6 — Confirm and Save

Click the Save button at the bottom of the page.

Unlike Web Search, Web Scraping will show a confirmation dialog before applying your settings. Review the summary and click Confirm to proceed. This extra step exists because Web Scraping settings affect how the agent processes external content, which can have a meaningful impact on response quality and latency.

Your settings take effect immediately for all new conversations.


Using Both Features Together

Web Search and Web Scraping work independently but pair well in conversation flows:

  1. A user asks a question the agent cannot answer from its knowledge base
  2. Web Search finds the most relevant result URLs
  3. Web Scraping fetches the full content of the top result
  4. The agent synthesises a detailed, sourced answer from the full page text

Enabling both features together gives your agent the broadest possible access to live information. However, each feature adds latency and external dependency. Start with Web Search only, measure the impact on response times, and then layer in Web Scraping when you need deeper content extraction.
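
The four-step flow above can be sketched as a small function. This is purely illustrative: the agent performs these steps internally, and `kb_lookup`, `web_search`, and `scrape` are hypothetical callables supplied by the caller, not Pusaka APIs.

```python
from typing import Callable, Optional


def answer_with_web(
    question: str,
    kb_lookup: Callable[[str], Optional[str]],
    web_search: Callable[[str, int], list],
    scrape: Callable[[str], str],
    max_results: int = 3,
) -> dict:
    """Return the context the agent would answer from, and where it came from."""
    # Step 1: try the knowledge base first.
    kb_answer = kb_lookup(question)
    if kb_answer is not None:
        return {"source": "knowledge_base", "context": kb_answer}
    # Step 2: fall back to Web Search for candidate URLs.
    urls = web_search(question, max_results)
    if not urls:
        return {"source": "none", "context": ""}
    # Step 3: scrape the full content of the top result.
    top_url = urls[0]
    # Step 4: the page text becomes the context for a sourced answer.
    return {"source": top_url, "context": scrape(top_url)}
```

Passing the three steps in as functions keeps the sketch testable with stubs, and it makes visible that each web step is an extra call that adds latency.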


Common Mistakes to Avoid

| Mistake | Why it matters | Fix |
| --- | --- | --- |
| Leaving the domain whitelist empty in a business context | The agent may cite unreliable sources | Add a curated list of 2–5 trusted domains |
| Setting Max Results above 5 | Higher values increase latency noticeably | Keep it at 3–5 unless response depth is critical |
| Enabling Include Images without size filters | Large images slow extraction significantly | Set a Max Image Size of 2–5 MB for typical use |
| Using a timeout below 10 seconds | Fast timeout causes frequent scraping failures | Use 20–30 seconds as a baseline |
| Enabling Web Scraping for confidential conversations | Users can share any URL, including internal pages | Only enable scraping when your audience and content are appropriate |
| Not testing after configuration | Settings may not work as expected on all sites | Test with 3–5 representative URLs before going live |

Next Steps