Agents on the Web: Structure and Trends in Autonomous Web Agents
1. From Mobile to the Web
In the previous post, we discussed mobile agents—automated systems operating within mobile environments. These systems are capable of perceiving screen content, planning actions, and executing tasks within mobile apps without user intervention.
In this post, we turn our attention to autonomous web agents. While both types of agents are based on large language models (LLMs) and share a similar perception–planning–action architecture, they differ significantly in the environments in which they operate and in the structure of the input data they process. Web agents, in particular, work through the browser interface and must interpret and act on web page elements such as the DOM structure, text content, and clickable components in real time, which poses a distinct set of technical challenges compared to mobile agents.
2. What Are Autonomous Web Agents?
Autonomous web agents are intelligent systems that perform complex tasks in browser environments without human input. For instance, such agents can search for and order products on e-commerce platforms, open emails and download attachments, or navigate web services to extract useful information.
Technically, these agents require the ability to understand natural language using LLMs, analyze DOM trees, and manipulate the browser through APIs. Some models also employ reinforcement learning or are trained in browser-based simulation environments. Standardized benchmarks such as WebArena and MiniWoB++, as well as browser simulators (similar to AndroidEnv for mobile), are actively being developed to support research in this area.
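To make the browser-control side concrete, here is a minimal sketch using Selenium's Python bindings. It is only an illustration of the perceive-then-act pattern: the URL and the search-field name are hypothetical placeholders, and a real agent would insert an LLM planning step between reading the page and acting.

```python
# Minimal sketch of driving a browser through an automation API (Selenium).
# Assumptions: Chrome and Selenium are installed; the URL and the "q"
# search field are hypothetical placeholders, not a real site.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://shop.example.com")

# Perception: read the rendered page text from the DOM.
page_text = driver.find_element(By.TAG_NAME, "body").text

# Action: type a query into a search box and submit it.
search_box = driver.find_element(By.NAME, "q")
search_box.send_keys("wireless headphones")
search_box.send_keys(Keys.ENTER)

driver.quit()
```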
3. Shared Architecture: Similarities with Mobile Agents
Mobile and web agents share a common architectural pattern. Most systems are composed of the following components:
- Perception: Mobile agents perceive UI elements via screen-captured images, while web agents interpret the DOM tree, HTML structure, and style attributes to build a structured representation of the page.
- Planning: Both types of agents use LLMs to generate plans through prompt-based reasoning or to predict action sequences.
- Action: Agents interact with the environment—tapping or swiping on mobile, and clicking, typing, or moving the mouse on the web, typically via a web driver.
- Memory: Both types of agents maintain context over time, such as tracking state across app or browser sessions.
Ultimately, both agents operate on an agent loop of environment perception → interpretation → planning → execution → result observation → repetition.
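As a rough illustration of that loop, the skeleton below names one object per stage. Every class and method here is an illustrative placeholder rather than a specific framework's API.

```python
# Skeleton of the shared agent loop; env, llm, and memory are illustrative
# placeholder objects, not a specific framework's API.
def agent_loop(env, llm, memory, goal, max_steps=20):
    for _ in range(max_steps):
        observation = env.observe()          # perception (screen or DOM)
        state = memory.update(observation)   # interpretation + memory
        action = llm.plan(goal, state)       # planning with the LLM
        result = env.execute(action)         # execution (tap, click, type)
        memory.record(action, result)        # result observation
        if result.get("done"):               # stop once the task is complete
            break
```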
4. Key Differences: Input Structure and Complexity
One of the most critical differences between mobile and web agents lies in the structure of the input data.
Mobile agents primarily process visual input. Since app UIs are rendered graphically, agents must extract meaningful information from screen-captured images using vision models or OCR-based UI parsers. This makes multimodal modeling a core requirement.
In contrast, web agents operate in explicitly structured environments. A web page is represented by a Document Object Model (DOM), which exposes element text, attributes, positions, and more. Instead of inferring structure visually, agents can reason directly over structured data, which enables cleaner state-action mappings in policy learning or reinforcement learning.
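As one way to picture this, the sketch below flattens a page's interactive elements into a compact list of records that a policy or an LLM can consume. BeautifulSoup is used only as a convenient HTML parser, and the chosen tags and field names are our own illustration rather than a standard schema.

```python
# Sketch: serializing the DOM into a structured state representation.
# BeautifulSoup is one convenient parser; the tags and field names
# are our own illustration, not a standard schema.
from bs4 import BeautifulSoup

def extract_state(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    elements = []
    for node in soup.find_all(["a", "button", "input"]):
        elements.append({
            "tag": node.name,                   # element type
            "text": node.get_text(strip=True),  # visible text
            "id": node.get("id"),               # id attribute, if any
            "type": node.get("type"),           # e.g. "submit" for inputs
        })
    return elements
```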
The web also offers mature browser-automation tools (e.g., Selenium, Puppeteer) for controlling the browser, which makes it relatively straightforward to execute actions accurately. However, agents must still handle browser-specific challenges such as non-determinism, loading delays, and network latency, which introduce instability into action execution.
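A common way to absorb loading delays is an explicit wait before each action. The snippet below shows the idea with Selenium's WebDriverWait; the URL and CSS selector are made-up placeholders.

```python
# Sketch: waiting for an element to become clickable before acting,
# to absorb loading delays. The URL and CSS selector are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://shop.example.com/item/123")

# Block for up to 10 seconds until the button is present and clickable.
button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "button.add-to-cart"))
)
button.click()
driver.quit()
```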
5. Learning Strategy Differences Driven by Input Formats
Differences in input structure lead to significant divergence in learning strategies. Mobile agents, which must map images and text to actions, typically rely on multimodal transformers, vision-language models (VLMs), and OCR-enhanced input pipelines. On the other hand, web agents benefit from structured state representations, making them suitable for reinforcement learning (RL), behavior cloning (BC), and policy optimization methods.
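To make the structured-state advantage concrete, here is a toy behavior-cloning sketch: demonstration states are encoded as fixed-size feature vectors, and the policy learns to pick the element a demonstrator acted on. The feature dimensions, model size, and random "demonstrations" are invented purely for illustration.

```python
# Toy behavior-cloning sketch over structured states (PyTorch).
# Feature dimensions, model size, and the random "demonstrations"
# are invented for illustration only.
import torch
import torch.nn as nn

states = torch.randn(64, 128)           # 64 demo steps, 128-dim DOM features
actions = torch.randint(0, 10, (64,))   # index of the element the demonstrator chose

policy = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):                    # imitate the demonstrated actions
    optimizer.zero_grad()
    loss = loss_fn(policy(states), actions)
    loss.backward()
    optimizer.step()
```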
For prompt-based agents, the inference burden in mobile environments is often heavier because the model must interpret visual context. For web agents, the primary challenge is mapping natural-language instructions to the relevant DOM elements. This has led to the development of HTML-aware prompts, multi-turn prompt structures, and even XML-like task specification formats.
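One simple form of an HTML-aware prompt is to enumerate candidate elements with indices so the model can ground the instruction in a concrete element. The prompt layout and example elements below are one possible illustration, not an established format.

```python
# Sketch: building an "HTML-aware" prompt that lists candidate DOM
# elements so the LLM can ground the instruction in one of them.
# The prompt layout and the example elements are illustrative only.
def build_prompt(instruction: str, elements: list[dict]) -> str:
    lines = [
        f"[{i}] <{e['tag']}> {e['text']!r} id={e.get('id')}"
        for i, e in enumerate(elements)
    ]
    return (
        "You control a web browser. Choose ONE element to act on.\n"
        f"Task: {instruction}\n"
        "Elements:\n" + "\n".join(lines) + "\n"
        "Answer with the element index and the action (click or type)."
    )

prompt = build_prompt(
    "Add the blue running shoes to the cart",
    [
        {"tag": "button", "text": "Add to cart", "id": "add-btn"},
        {"tag": "a", "text": "Checkout", "id": "checkout-link"},
    ],
)
```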
6. Summary: Same Architecture, Different Constraints
Mobile agents and autonomous web agents are both LLM-based intelligent systems built around the perception–planning–action–memory loop. However, the environments in which they operate demand very different technical implementations.
Mobile agents must process visual inputs and rely heavily on multimodal reasoning to extract actionable UI elements from rendered screens. Web agents, by contrast, work with structured DOM input, shifting the focus toward semantic reasoning and accurate mapping between user intent and HTML components.
These input differences shape how agents are designed and trained. While mobile agents favor multimodal models and visual understanding, web agents excel with policy learning, structured prompts, and HTML-level reasoning.
In essence, while both agents share the same architectural backbone, they operate under very different constraints—providing a valuable comparative lens for understanding how LLM-based agents adapt to diverse real-world interfaces. This comparison offers meaningful insights into the design of more general, cross-platform autonomous agents in the future.