Crawling – PortSwigger




The crawl phase of a scan involves navigating around the application, following links, submitting forms, and logging in where necessary, to catalog the application’s content and the navigation paths within it. This seemingly simple task presents a variety of challenges that Burp’s crawler is able to overcome, to create an accurate map of the application.

Basic approach

By default, Burp’s crawler navigates around the target application using a built-in browser, clicking links and submitting forms where possible. It builds a map of the application’s content and functionality in the form of a directed graph, representing the different locations in the application and the links between those locations:

The crawler does not make any assumptions about the structure of the URL used by the application. Locations are identified (and re-identified later) based on their content, not the URL that was used to reach them. This allows the crawler to reliably handle modern applications that place ephemeral data, such as CSRF tokens or cache busters, in URLs. Even though the entire URL for each link changes each time, the crawler still builds an accurate map:
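The idea of identifying a location by its content rather than its URL can be illustrated with a minimal Python sketch. This is only an illustration of the general technique, not Burp’s actual fingerprinting algorithm:

```python
import hashlib

# Hypothetical sketch: identify a location by a fingerprint of its
# response content, so ephemeral URL data (CSRF tokens, cache busters)
# does not create spurious duplicate locations in the site map.
def location_fingerprint(response_body: str) -> str:
    # Normalize whitespace so trivial rendering differences do not
    # change the fingerprint.
    normalized = " ".join(response_body.split())
    return hashlib.sha256(normalized.encode()).hexdigest()

# Two fetches of the "same" page, reached via different ephemeral URLs,
# produce the same fingerprint and are treated as one location:
page_a = "<p>Welcome   to the\n shop</p>"
page_b = "<p>Welcome to the shop</p>"
assert location_fingerprint(page_a) == location_fingerprint(page_b)
```

A real crawler would also discount volatile page regions before hashing, but the principle is the same: the identity of a location comes from what it contains, not how it was reached.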

An app with ephemeral URLs that change at every opportunity

The approach also allows the crawler to manage apps that use the same URL to reach different locations based on the state of the app or the user’s interaction with it:

An app that uses the same URL to reach different locations, depending on the state of the app or the user's interaction with it

As the crawler navigates and builds up coverage of the target application, it follows the edges of the graph that have not yet been completed. These represent links (or other navigation transitions) that have been observed in the application but not yet visited. The crawler never “jumps” to a pending link and visits it out of context. Instead, it either navigates directly from its current location or returns to the start location and navigates from there. This replicates as closely as possible the actions of a normal user browsing the site:

Return to the start location of the exploration

Crawling in a way that makes no assumptions about URL structure is highly effective at dealing with modern web applications, but can potentially cause problems by seeing “too much” content. Modern websites often contain a mass of superfluous navigation paths (via footers, burger menus, etc.), meaning that everything is directly linked to everything else. Burp’s crawler uses a variety of techniques to address this problem: it builds fingerprints of links to previously visited locations to avoid visiting them redundantly; it crawls in a breadth-first order that prioritizes the discovery of new content; and it has configurable cutoffs that constrain the extent of the crawl. These measures also allow “infinite” applications, such as calendars, to be handled correctly.
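The combination of a visited set, breadth-first ordering, and a depth cutoff can be sketched in a few lines of Python. This is a simplified illustration under assumed structures (the `site` map and `max_depth` parameter are hypothetical), not Burp’s implementation:

```python
from collections import deque

# Hypothetical sketch: breadth-first crawl with a visited set (standing
# in for link fingerprints) and a configurable depth cutoff.
def crawl(site, start, max_depth=3):
    visited = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        location, depth = queue.popleft()
        order.append(location)
        if depth == max_depth:
            continue  # cutoff constrains "infinite" content such as calendars
        for link in site.get(location, []):
            if link not in visited:  # avoid redundant revisits
                visited.add(link)
                queue.append((link, depth + 1))
    return order

site = {
    "home": ["about", "shop"],
    "shop": ["item", "cart"],
    "item": ["cart"],
}
assert crawl(site, "home") == ["home", "about", "shop", "item", "cart"]
```

Breadth-first ordering means shallow, previously unseen locations are discovered before the crawler descends deep into any one branch.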

Managing sessions

As Burp’s crawler navigates through a target application using a built-in browser, it is able to automatically handle virtually any session management mechanism that modern browsers can use. There is no need to record macros or configure session management rules that tell Burp how to get a session or verify that the current session is valid.

The crawler uses multiple crawler “agents” to parallelize its work. Each agent represents a distinct user of the application browsing with their own browser. Each agent has its own cookie jar, which is updated whenever the application issues a cookie. When an agent returns to the start location to begin crawling from there, its cookie jar is cleared, to simulate a completely fresh browser session.
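Per-agent cookie isolation can be sketched as follows. The `Agent` class here is hypothetical, purely for illustration of the behavior described above:

```python
# Hypothetical sketch: each crawler agent keeps its own cookie jar,
# and clears it whenever it returns to the start location, simulating
# a brand-new browser session.
class Agent:
    def __init__(self, name):
        self.name = name
        self.cookies = {}

    def receive_set_cookie(self, name, value):
        # Called when the application issues a Set-Cookie header.
        self.cookies[name] = value

    def return_to_start(self):
        # Fresh session for the next crawl path.
        self.cookies.clear()

a, b = Agent("agent-1"), Agent("agent-2")
a.receive_set_cookie("session", "abc123")
b.receive_set_cookie("session", "xyz789")
assert a.cookies != b.cookies            # agents are isolated from each other
a.return_to_start()
assert a.cookies == {} and b.cookies["session"] == "xyz789"
```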

The requests that the crawler makes while browsing are constructed dynamically based on the preceding response, so CSRF tokens in URLs or form fields are handled automatically. This allows the crawler to navigate correctly through functions that use complex session handling, without any configuration by the user:
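The essence of this mechanism is extracting a token from the previous response and carrying it into the next request. A minimal sketch, assuming a hidden form field named `csrf` (the field name and token value are invented for illustration):

```python
import re

# Hypothetical sketch: build the next request dynamically from the
# previous response by extracting a CSRF token from a hidden form field.
def extract_csrf_token(html: str):
    match = re.search(r'<input[^>]*name="csrf"[^>]*value="([^"]+)"', html)
    return match.group(1) if match else None

previous_response = (
    '<form action="/transfer" method="POST">'
    '<input type="hidden" name="csrf" value="tok-91f2c4">'
    '</form>'
)
token = extract_csrf_token(previous_response)
assert token == "tok-91f2c4"

# The crawler would then include the fresh token in the form it submits:
form_body = {"amount": "10", "csrf": token}
```

Because the token is re-read from each response rather than cached, the submitted requests stay valid even when the application rotates tokens on every page load.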

Automatic management of session tokens during the crawl

Detection of application state changes

Modern web applications are very dynamic and it is common for the same application function to return different content on different occasions, as a result of actions taken by the user in the meantime. Burp’s crawler is able to detect changes in the state of the application resulting from the actions it has taken while crawling.

In the example below, walking the path BC transitions the application from state 1 to state 2. Link D leads to a logically different location in state 1 than it does in state 2. So the path AD reaches the empty cart, while ABCD reaches the full cart. Rather than simply concluding that link D is non-deterministic, the crawler is able to identify the state-changing path that the destination of link D depends on. This allows the crawler to reliably reach the full-cart location again in the future, in order to access the other functionality that is available from there:
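The state dependence in this example can be modeled in a short sketch, where the destination of link D is a function of the state accumulated along the path walked so far (the link names and states mirror the example above; the code itself is illustrative, not Burp’s model):

```python
# Hypothetical sketch: link D resolves to a different location depending
# on application state, and the B -> C transition changes that state.
def follow(path):
    state = "empty-cart"
    for link in path:
        if link == "C":
            state = "full-cart"  # walking through C adds an item to the cart
    # The same link D leads to different locations in different states:
    destinations = {
        "empty-cart": "empty cart page",
        "full-cart": "full cart page",
    }
    return destinations[state]

assert follow(["A", "D"]) == "empty cart page"
assert follow(["A", "B", "C", "D"]) == "full cart page"
```

A crawler that records which prior transitions a link’s destination depends on can deliberately replay that path (here, BC) to put the application back into the required state.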

Detection of application state changes during exploration

Logging in to the application

Burp’s crawler begins with an unauthenticated phase in which no credentials are submitted. Once this is complete, Burp will have discovered any login and self-registration functions within the application.

If the application supports self-registration, Burp will attempt to register a user. You can also configure the crawler to use one or more pre-existing login accounts.

The crawler then moves on to an authenticated phase. It will visit the login function multiple times and submit:

  • The credentials for the self-registered account (if applicable).

  • The credentials for each pre-existing account that has been configured.

  • Bogus credentials (these can reach interesting functions, such as account recovery).

For each set of credentials that is submitted to the login, Burp will then crawl the content that is discovered behind the login. This allows the crawler to capture the different functionality that is available to different types of user:
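The ordering of the authenticated phase described above can be sketched as a simple loop over credential sets. The function and account tuples here are hypothetical, for illustration only:

```python
# Hypothetical sketch of the authenticated phase: the crawler submits
# each set of credentials in turn (self-registered, pre-configured, and
# deliberately bogus), then crawls the content reachable from each session.
def login_phase(self_registered, configured_accounts):
    attempts = []
    if self_registered:
        attempts.append(("self-registered", self_registered))
    for account in configured_accounts:
        attempts.append(("configured", account))
    # Bogus credentials can surface functions such as account recovery.
    attempts.append(("fake", ("nobody", "wrong-password")))
    return attempts

attempts = login_phase(("newuser", "s3cret"),
                       [("alice", "pw1"), ("bob", "pw2")])
assert [kind for kind, _ in attempts] == [
    "self-registered", "configured", "configured", "fake"]
```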

Explore with different login credentials to access the different functions available to different users

Crawling volatile content

Modern web applications frequently contain volatile content, where the “same” location or function will return responses that differ substantially on different occasions, not necessarily as a result of user action. This behavior can result from factors such as feeds from social media channels or user comments, inline advertising, or genuinely randomized content (post of the day, A/B testing, etc.).

Burp’s crawler is able to identify many instances of volatile content and correctly re-identify the same underlying location on different visits, despite the differing responses. This allows the crawler to focus its attention on the “core” elements within a set of application responses, which are probably the most important in terms of discovering key navigation paths to interesting content and functionality:
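One simple way to isolate the “core” of a volatile location is to keep only the elements that appear in every observed response. The sketch below does this for links, using an intersection over visits; it illustrates the general idea rather than Burp’s actual algorithm:

```python
import re

# Hypothetical sketch: re-identify a volatile location by keeping only
# the "core" links that appear in every observed response, discarding
# content (such as rotating adverts) that varies between visits.
def core_links(responses):
    link_sets = [set(re.findall(r'href="([^"]+)"', body))
                 for body in responses]
    return set.intersection(*link_sets)

visit_1 = '<a href="/shop">Shop</a><a href="/ad?id=17">Ad</a>'
visit_2 = '<a href="/shop">Shop</a><a href="/ad?id=52">Ad</a>'

# Only the stable navigation link survives; the rotating advert does not.
assert core_links([visit_1, visit_2]) == {"/shop"}
```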

Identify the basic elements of an HTML page and the variable content that changes on different occasions

In some cases, visiting a given link on different occasions returns responses that differ too much to be treated as “the same”. In this situation, Burp’s crawler captures both versions of the response as two separate locations, and draws a non-deterministic edge in the graph. Provided the extent of non-determinism within the application is not too great, Burp can still crawl the associated content and reliably find its way to the content behind the non-deterministic link:

Crawling when application responses are sometimes non-deterministic

Crawling with the built-in browser (browser-powered scanning)

By default, provided your machine is able to support it, Burp uses its built-in Chromium browser for all navigation of your target websites and applications. This approach offers several major benefits, enabling Burp Scanner to handle most of the client-side technologies that modern browsers can.

One of the key benefits is the ability to crawl JavaScript-heavy content effectively. Some websites use JavaScript to dynamically generate their navigational UI. Although this content is not present in the raw HTML, Burp Scanner is able to use the built-in browser to load the page, execute any scripts required to build the UI, and then continue crawling as normal.

The built-in browser also enables Burp Scanner to handle cases where websites modify requests on the fly using JavaScript event handlers. The crawler can trigger these events and execute the appropriate script, modifying requests as needed. For example, a website might use JavaScript to generate a new CSRF token following an onclick event and add it to the next request. Burp Scanner can interact with elements that are made clickable by JavaScript event handlers.

If you prefer, you can manually enable or disable browser-powered scanning in your scan configuration. You can find this option under “Crawl Options” > “Miscellaneous” > “Embedded Browser Options”.

