Enable Web Search and Web Scraping
Your Pusaka agent answers questions based on the knowledge base you give it. But sometimes the information a user needs simply does not exist in your documents — it lives on the open internet, or on a specific website that changes every day. The Web Search and Web Scraping features let your agent reach beyond its static knowledge base and pull live content from the web during a conversation.
What Is Web Search?
Web Search gives your agent the ability to run a real-time internet search during a conversation. When a user asks a question that the agent cannot answer from its knowledge base, the agent can silently query a search engine, retrieve a ranked list of results, and use those results to formulate a grounded, up-to-date response — all within the same chat turn.
Think of it as giving your agent access to a search engine it consults automatically, without the user ever having to leave the conversation.
When to use Web Search
- Your agent needs to answer questions about current events, prices, or anything that changes over time
- You want to reduce the burden of keeping your knowledge base up to date
- Your agent handles a broad set of topics and you cannot anticipate every question
What Is Web Scraping?
Web Scraping goes one step further. Instead of returning a list of search-result snippets, it fetches the full content of a specific webpage and converts it into clean, readable markdown. The agent then uses that extracted content to answer the user's question in detail.
This is especially useful when a user shares a URL or when your agent needs to read an actual page — a product listing, a news article, a documentation page — rather than just knowing that the page exists.
When to use Web Scraping
- Users share URLs and expect the agent to summarise or answer questions about that page
- You want the agent to read the full text of a page, not just a search-result excerpt
- Your agent needs to process structured pages like documentation, product specs, or blog posts
How the Two Features Compare
| | Web Search | Web Scraping |
|---|---|---|
| What it retrieves | A ranked list of result snippets from a search engine | The full text content of a specific webpage |
| Triggered by | A question the agent cannot answer from its knowledge base | A URL shared in conversation, or a need for full-page content |
| Output format | Search result summaries | Markdown-formatted page content |
| Domain restriction | Whitelist up to 20 domains | Exclude specific paths/file types |
| Image handling | Not applicable | Configurable size filters |
| Best for | Real-time factual queries | Deep reading of a specific page |
Both features can be active at the same time and complement each other in practice.
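The "Triggered by" row above can be pictured as a small routing decision. The snippet below is only an illustrative sketch: `choose_feature`, `URL_PATTERN`, and the `kb_can_answer` flag are hypothetical names, not part of Pusaka, and the agent's real decision logic may differ.

```python
import re

# Hypothetical helper, not a Pusaka API: a rough pattern for spotting a shared URL.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def choose_feature(message: str, kb_can_answer: bool) -> str:
    """Return "scrape", "search", or "knowledge_base" for a user message."""
    if URL_PATTERN.search(message):
        # A URL shared in conversation calls for full-page content.
        return "scrape"
    if not kb_can_answer:
        # A question the knowledge base cannot answer calls for a live search.
        return "search"
    return "knowledge_base"
```

For example, `choose_feature("What is today's exchange rate?", kb_can_answer=False)` selects the search route, while a message containing a link selects the scraping route.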
Benefits
Web Search benefits
- Always up to date — The agent surfaces the latest information without requiring manual knowledge base updates
- Broader coverage — Handles questions outside the scope of your curated documents
- Source transparency — Responses can include references to where information was found
- Domain filtering — You can restrict searches to trusted domains only, keeping answers on-brand and reliable
Web Scraping benefits
- Full-content understanding — The agent reads the actual page, not just a summary
- URL-aware conversations — Users can paste a link and immediately ask the agent to explain or summarise it
- Markdown precision — Content is extracted as clean structured text, reducing noise from ads and navigation elements
- Image control — You can include or exclude images and filter by file size to avoid downloading irrelevant assets
Risks and Considerations
Note: Both features make outbound requests to external websites at conversation time. Understand the implications before enabling them in production.
Web Search risks
| Risk | Description |
|---|---|
| Inaccurate sources | Search results may include pages with incorrect or outdated information. Mitigate this by whitelisting only trusted domains. |
| Off-topic results | Without a domain whitelist, the agent may draw on content that is irrelevant to your use case. |
| Latency | Each search adds a network round-trip to the response time. Set Max Web Search Results (maxWebSearchResults) conservatively to minimise this. |
| Data leakage | The user's query is sent to a search engine. Avoid enabling Web Search for agents that process confidential or personally identifiable information. |
Web Scraping risks
| Risk | Description |
|---|---|
| Terms of service | Some websites prohibit automated scraping. Ensure the sites your agent accesses permit programmatic content retrieval. |
| Dynamic content | Pages built entirely with JavaScript may not render correctly within the configured timeout window. |
| Excessive payload | Large pages with many images can slow down responses. Use the image size filters and the ignored-paths list to control what gets fetched. |
| Sensitive content exposure | If a user provides a URL to an internal or sensitive page, the agent will attempt to read it. Consider your audience before enabling this feature. |
Before You Start
Before configuring either feature, confirm the following:
- You have an existing agent. If not, follow Build an Agent from Zero to Live first.
- You are signed in as an Owner or Contributor of that agent. Viewers cannot access these settings.
- You know which agent you want to configure. Open it from the agent list so it becomes the active agent.
Part 1 — Configure Web Search
Step 1 — Open Web Search Settings
In the left sidebar of your agent's detail page, scroll to the Automate section and click Websearch.
The page has two sections: Web Search Configuration and Domain Whitelist.
Step 2 — Enable Web Search
At the top of the Web Search Configuration section, you will see the Enable Web Search toggle. It is off by default.
Flip the toggle to the on position. The remaining controls on the page will become active.
Tip: If you want to test the feature without affecting real users, enable it on a test agent first.
Step 3 — Set Maximum Search Results
The Max Web Search Results field controls how many search results the agent retrieves per query. The default is 3.
| Value | Effect |
|---|---|
| 1 | Fastest responses, but limited context for the agent to work with |
| 3 (default) | A good balance of speed and answer quality |
| Higher values | Richer context but slower responses due to additional data fetching |
Adjust this number based on your tolerance for response latency and the complexity of questions your agent handles. For most use cases, keep the value between 3 and 5.
Note: The Allow User Control toggle on this page is marked Coming Soon and cannot be enabled yet. Once available, it will let end users turn web search on or off during a conversation.
Step 4 — Restrict to Specific Domains (Optional)
By default, Web Search queries the open internet. If you want to limit results to specific trusted websites, use the Domain Whitelist section.
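Click Add New Domain to add an entry. Enter the bare domain name, with no protocol prefix (http:// or https://) and no trailing slash. The examples below also treat a leading www. as incorrect.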
| ✅ Correct format | ❌ Incorrect format |
|---|---|
| wikipedia.org | https://wikipedia.org |
| docs.microsoft.com | http://docs.microsoft.com/ |
| support.qlar.com | www.support.qlar.com |
Domain names must contain at least one dot and use only letters, numbers, hyphens, underscores, and dots. You can add up to 20 domains.
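If you prepare a whitelist outside the UI, for example in a spreadsheet, a quick pre-check can catch formatting mistakes before you type the entries in. The sketch below applies the stated rules. The function names are hypothetical, and the rejection of a leading www. is an assumption inferred from the examples table rather than from the stated character rule.

```python
import re

MAX_DOMAINS = 20  # limit stated in this guide
# Letters, numbers, hyphens, underscores, and dots only.
ALLOWED_CHARS = re.compile(r"[A-Za-z0-9._-]+")


def is_valid_domain_entry(entry: str) -> bool:
    """Check one whitelist entry against the format rules in this guide."""
    if "." not in entry:
        return False  # must contain at least one dot
    if not ALLOWED_CHARS.fullmatch(entry):
        return False  # also rejects "https://..." and trailing "/"
    if entry.lower().startswith("www."):
        return False  # assumption: inferred from the examples table
    return True


def validate_whitelist(entries: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the whitelist passes."""
    problems = []
    if len(entries) > MAX_DOMAINS:
        problems.append(f"too many domains: {len(entries)} > {MAX_DOMAINS}")
    problems.extend(
        f"invalid entry: {entry}"
        for entry in entries
        if not is_valid_domain_entry(entry)
    )
    return problems
```

The actual validation performed by the settings page may be stricter or looser than this sketch.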
When the whitelist is not empty, the agent restricts all searches to the listed domains only. Leaving the whitelist empty allows searches across the full internet.
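The whitelist behaviour described above can be modelled roughly as follows. This is a sketch under assumptions: `is_allowed` is a hypothetical helper, and treating subdomains of a whitelisted domain as allowed is a guess, since the guide does not say how subdomains are handled.

```python
from urllib.parse import urlparse


def is_allowed(url: str, whitelist: list[str]) -> bool:
    """Return True if a search-result URL passes the domain whitelist."""
    if not whitelist:
        # An empty whitelist allows searches across the full internet.
        return True
    host = (urlparse(url).hostname or "").lower()
    # Assumption: a whitelisted domain also covers its subdomains.
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in whitelist
    )
```

With `["wikipedia.org"]` as the whitelist, a result on en.wikipedia.org passes in this model, while a result on example.com does not.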
To remove a domain, click the delete icon to the right of its entry.
Tip: Start with a small whitelist of two or three authoritative sources. You can always expand it later. A curated whitelist dramatically reduces the chance of the agent citing inaccurate or off-topic content.
Step 5 — Save Your Web Search Settings
Click the Save button at the bottom of the page. Your changes take effect for all new conversations immediately.
Part 2 — Configure Web Scraping
Step 1 — Open Web Scraping Settings
In the left sidebar, still under the Automate section, click Webscraping.
The page is divided into three sections: Web Scraping Configuration, Image Size Filters, and Ignored Paths.
Step 2 — Enable Web Scraping
At the top of the Web Scraping Configuration section, flip the Enable Web Scraping toggle to on. All other controls will activate.
Step 3 — Set the Page Load Timeout
The Timeout (seconds) field sets the maximum time the agent will wait for a page to finish loading before giving up. The default is 30 seconds.
| Scenario | Recommended value |
|---|---|
| Fast, text-heavy pages (documentation, blogs) | 15 – 20 |
| Standard pages | 30 (default) |
| Heavy pages with many resources | 45 – 60 |
The minimum allowed value is 5 seconds and the maximum is 60 seconds. Setting the timeout too low will cause partial or failed fetches on slower pages. Setting it too high will increase response latency when a page is unreachable.
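The limits above amount to a simple range check. A minimal sketch, with hypothetical names, of how a value could be validated before it is entered:

```python
from typing import Optional

MIN_TIMEOUT_S = 5       # minimum stated in this guide
MAX_TIMEOUT_S = 60      # maximum stated in this guide
DEFAULT_TIMEOUT_S = 30  # default stated in this guide


def validate_timeout(seconds: Optional[int] = None) -> int:
    """Return a usable timeout, falling back to the default when none is given."""
    if seconds is None:
        return DEFAULT_TIMEOUT_S
    if not MIN_TIMEOUT_S <= seconds <= MAX_TIMEOUT_S:
        raise ValueError(
            f"timeout must be between {MIN_TIMEOUT_S} and {MAX_TIMEOUT_S} seconds"
        )
    return seconds
```

Raising an error rather than silently clamping is a design choice of this sketch; the settings page may handle out-of-range values differently.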
Step 4 — Configure Image Settings
Web Scraping can include or exclude images from the extracted markdown. This section has three controls.
Include Images toggle
When on, images found on the page are included in the markdown output. When off, all images are stripped. Disable this if your agent only needs the text content of pages — it reduces response size significantly.
Min Image Size (KB)
Images smaller than this threshold are excluded. The default is 5 KB. This filters out icons, tracking pixels, and other decorative micro-images that add noise without providing value.
Max Image Size (MB)
Images larger than this threshold are excluded. The default is 10 MB. This prevents the agent from downloading extremely large images that would slow down extraction without improving the quality of the response.
| Setting | Min allowed | Max allowed | Default |
|---|---|---|---|
| Min Image Size | 1 KB | 10,240 KB (10 MB) | 5 KB |
| Max Image Size | 1 MB | 50 MB | 10 MB |
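Taken together, the three image controls behave like a single filter. The sketch below models that filter with a hypothetical `keep_image` function. It assumes 1 KB = 1,024 bytes and 1 MB = 1,024 KB, which matches the 10,240 KB = 10 MB figure in the table, and it assumes Include Images is on, since the guide does not state that toggle's default.

```python
KB = 1024
MB = 1024 * KB


def keep_image(
    size_bytes: int,
    include_images: bool = True,  # assumption: the default is not documented
    min_kb: int = 5,              # default Min Image Size
    max_mb: int = 10,             # default Max Image Size
) -> bool:
    """Return True if an image of this size would be kept in the markdown."""
    if not include_images:
        return False  # all images are stripped when the toggle is off
    # Images smaller than the minimum or larger than the maximum are excluded.
    return min_kb * KB <= size_bytes <= max_mb * MB
```

Under the defaults, a 2 KB icon is dropped, a 200 KB photo is kept, and an 11 MB image is dropped.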
Tip: If image content is not important for your use case, toggle Include Images off. It is the single most effective way to speed up web scraping responses.
Step 5 — Exclude Specific Paths (Optional)
The Ignored Paths section lets you define path patterns that the scraper should skip. This is useful for excluding static assets, media files, or sections of a site that are not useful for your agent.
Click Add New Path Pattern to add an entry. Each pattern can take one of these forms:
| Pattern type | Example | What it excludes |
|---|---|---|
| Absolute path | /icons | All content under the /icons/ path |
| Wildcard by extension | *.svg | All SVG files regardless of path |
| Specific filename | logo.png | Any file named logo.png |
You can add up to 50 patterns. Each pattern must be no longer than 200 characters.
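To check how a set of patterns would apply to real URLs, you can model the three pattern types locally. This is a sketch of plausible matching semantics, not the scraper's actual implementation: `is_ignored` and `check_patterns` are hypothetical helpers, and matching is assumed to be case-sensitive.

```python
from fnmatch import fnmatchcase
from posixpath import basename
from urllib.parse import urlparse

MAX_PATTERNS = 50         # limit stated in this guide
MAX_PATTERN_LENGTH = 200  # limit stated in this guide


def check_patterns(patterns: list[str]) -> bool:
    """Return True if the pattern list respects the documented limits."""
    return len(patterns) <= MAX_PATTERNS and all(
        len(pattern) <= MAX_PATTERN_LENGTH for pattern in patterns
    )


def is_ignored(url: str, patterns: list[str]) -> bool:
    """Return True if the URL matches any ignored-path pattern."""
    path = urlparse(url).path
    filename = basename(path)
    for pattern in patterns:
        if pattern.startswith("/"):
            # Absolute path: skip the path itself and everything under it.
            prefix = pattern.rstrip("/")
            if path == prefix or path.startswith(prefix + "/"):
                return True
        elif fnmatchcase(filename, pattern):
            # Wildcard by extension (*.svg) or specific filename (logo.png).
            return True
    return False
```

In this model, `/icons` skips `/icons/home.png` but not `/iconset/a.png`, because the prefix is matched on whole path segments.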
To remove a pattern, click the delete icon to the right of its entry.
Tip: Common patterns to exclude: *.svg, *.ico, *.woff, *.woff2, /cdn-cgi, /wp-content/uploads. These are typically assets that the agent does not need to read.
Step 6 — Confirm and Save
Click the Save button at the bottom of the page.
Unlike Web Search, Web Scraping will show a confirmation dialog before applying your settings. Review the summary and click Confirm to proceed. This extra step exists because Web Scraping settings affect how the agent processes external content, which can have a meaningful impact on response quality and latency.
Your settings take effect immediately for all new conversations.
Using Both Features Together
Web Search and Web Scraping work independently but pair well in conversation flows:
- A user asks a question the agent cannot answer from its knowledge base
- Web Search finds the most relevant result URLs
- Web Scraping fetches the full content of the top result
- The agent synthesises a detailed, sourced answer from the full page text
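The four steps above can be sketched as a small pipeline. Everything here is illustrative: `answer_with_web` and its four callables are hypothetical stand-ins for the knowledge base lookup, Web Search, Web Scraping, and the agent's answer generation, and the real orchestration inside the agent may differ.

```python
from typing import Callable, Optional


def answer_with_web(
    question: str,
    kb_lookup: Callable[[str], Optional[str]],   # returns None when the KB has no answer
    web_search: Callable[[str], list[str]],      # returns ranked result URLs
    scrape: Callable[[str], str],                # returns page content as markdown
    synthesise: Callable[[str, str, str], str],  # (question, markdown, source URL) -> answer
) -> str:
    """Answer from the knowledge base first, then fall back to search plus scraping."""
    kb_answer = kb_lookup(question)
    if kb_answer is not None:
        return kb_answer
    urls = web_search(question)
    if not urls:
        return "No answer found."
    top_url = urls[0]
    page_markdown = scrape(top_url)
    return synthesise(question, page_markdown, top_url)
```

Each fallback step in this sketch is a separate outbound call, which is why every enabled feature adds to response time.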
Enabling both features together gives your agent the broadest possible access to live information. However, each feature adds latency and external dependency. Start with Web Search only, measure the impact on response times, and then layer in Web Scraping when you need deeper content extraction.
Common Mistakes to Avoid
| Mistake | Why it matters | Fix |
|---|---|---|
| Leaving the domain whitelist empty in a business context | The agent may cite unreliable sources | Add a curated list of 2–5 trusted domains |
| Setting Max Results above 5 | Higher values increase latency noticeably | Keep it at 3–5 unless response depth is critical |
| Enabling Include Images without size filters | Large images slow extraction significantly | Set a Max Image Size of 2–5 MB for typical use |
| Using a timeout below 10 seconds | A short timeout causes partial or failed fetches on slower pages | Use 20–30 seconds as a baseline |
| Enabling Web Scraping for confidential conversations | Users can share any URL, including internal pages | Only enable scraping when your audience and content are appropriate |
| Not testing after configuration | Settings may not work as expected on all sites | Test with 3–5 representative URLs before going live |
Next Steps
- Automate with Plugins — Extend your agent with purpose-built plugins for common operations
- Automate with Custom API — Connect your agent to any external system via REST
- Observe Feedback and Logs — Monitor how your agent uses web content in real conversations