Crawling – PortSwigger
The crawl phase of a scan involves navigating around the application, following links, submitting forms, and logging in where necessary, to catalog the content of the application and the navigation paths within it. This seemingly simple task presents a variety of challenges that Burp’s crawler is able to meet in order to create an accurate map of the application.
By default, Burp’s crawler navigates around a target application using a built-in browser, clicking links and submitting input where possible. It builds a map of the application’s content and functionality in the form of a directed graph, representing the different locations in the application and the links between those locations.
The crawler makes no assumptions about the URL structure used by the application. Locations are identified (and re-identified later) based on their content, not the URL that was used to reach them. This allows the crawler to reliably handle modern applications that place ephemeral data, such as CSRF tokens or cache busters, in URLs. Even if the full URL of every link changes on each visit, the crawler still builds an accurate map.
This approach also allows the crawler to handle applications that use the same URL to reach different locations, depending on the state of the application or the user’s interaction with it.
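As a rough illustration of content-based identification, the sketch below fingerprints a page from its tag structure and visible text while ignoring attribute values such as `href`. This is not Burp’s actual algorithm, which the text does not describe; the function name `location_fingerprint` and the choice of features are assumptions made for this example.

```python
import hashlib
import re
from html.parser import HTMLParser


class _SkeletonParser(HTMLParser):
    """Collects tag names, attribute names and visible text, but no attribute values."""

    def __init__(self):
        super().__init__()
        self.features = []

    def handle_starttag(self, tag, attrs):
        # Attribute values (including href) may carry ephemeral tokens,
        # so only the attribute names contribute to the fingerprint.
        names = sorted(name for name, _ in attrs)
        self.features.append(f"<{tag} {' '.join(names)}>")

    def handle_data(self, data):
        text = re.sub(r"\s+", " ", data).strip()
        if text:
            self.features.append(text)


def location_fingerprint(html: str) -> str:
    """Hypothetical helper: hash the page skeleton rather than its URL."""
    parser = _SkeletonParser()
    parser.feed(html)
    parser.close()
    joined = "\n".join(parser.features)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# The same page fetched twice: only the tokens inside the URLs differ.
page_visit_1 = '<a href="/account?csrf=a81f">My account</a><a href="/cart?cb=1001">Cart</a>'
page_visit_2 = '<a href="/account?csrf=77c2">My account</a><a href="/cart?cb=2002">Cart</a>'
other_page = '<form action="/login"><input name="user"></form>'

same = location_fingerprint(page_visit_1) == location_fingerprint(page_visit_2)
different = location_fingerprint(page_visit_1) != location_fingerprint(other_page)
```

An exact hash is the simplest possible choice here; it treats any change in visible text as a new location, so a real crawler would need something more tolerant.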
As the crawler navigates around and builds up coverage of the target application, it tracks the edges of the graph that have not yet been completed. These represent links (or other navigation transitions) that have been observed in the application but not yet visited. However, the crawler never “jumps” to a pending link and visits it out of context. Instead, it either navigates directly from its current location or returns to the start location and navigates from there. This reproduces as closely as possible the actions of a normal user browsing the site.
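The idea of reaching a pending link by replaying navigation from the start location can be sketched with a small graph search. The graph layout (a dictionary mapping each location to its links, with `None` marking a link that has been observed but not yet followed) and the names `path_from_start` and `plan_visit` are assumptions for this example, not Burp’s internal representation.

```python
from collections import deque


def path_from_start(graph, start, target):
    """Return the link labels leading from start to target, or None if unreachable."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        location, path = queue.popleft()
        if location == target:
            return path
        for link, destination in graph.get(location, {}).items():
            # Only completed edges (known destination) can be replayed.
            if destination is not None and destination not in seen:
                seen.add(destination)
                queue.append((destination, path + [link]))
    return None


def plan_visit(graph, start, location, pending_link):
    """Build the full click sequence: walk to the location, then follow the pending link."""
    prefix = path_from_start(graph, start, location)
    if prefix is None:
        return None
    return prefix + [pending_link]


graph = {
    "home": {"products": "product list", "about": "about page"},
    "product list": {"item 1": "item page"},
    "item page": {"add to cart": None},  # observed, not yet followed
    "about page": {},
}

plan = plan_visit(graph, "home", "item page", "add to cart")
```

Here `plan` is the sequence of links a user would click from the start page, so the pending link is always followed in its original context.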
Crawling in a way that makes no assumptions about URL structure is highly effective for modern web applications, but it can lead to the crawler seeing “too much” content. Modern websites often contain a mass of superfluous navigation paths (via page footers, burger menus, and so on), which means that everything links directly to everything else. Burp’s crawler uses a variety of techniques to address this problem: it builds fingerprints of links to locations it has already visited, to avoid visiting them redundantly; it crawls in a breadth-first order that prioritizes the discovery of new content; and it has configurable thresholds that limit the extent of the crawl. These measures also allow “endless” applications, such as calendars, to be handled correctly.
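A minimal sketch of a breadth-first crawl with limits shows why such thresholds let an “endless” application terminate. The calendar model and the parameter names `max_depth` and `max_locations` are hypothetical and do not correspond to Burp’s actual configuration settings.

```python
from collections import deque


def calendar_links(day: int):
    """Hypothetical endless application: every day links to the next and previous day."""
    return [day + 1, day - 1]


def bounded_crawl(start, get_links, max_depth, max_locations):
    """Visit locations breadth-first, stopping at a depth limit or a location limit."""
    visited = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        location, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in get_links(location):
            if target in visited:
                continue  # already seen: avoid a redundant visit
            if len(visited) >= max_locations:
                return order
            visited.add(target)
            order.append(target)
            queue.append((target, depth + 1))
    return order


order = bounded_crawl(0, calendar_links, max_depth=3, max_locations=100)
capped = bounded_crawl(0, calendar_links, max_depth=50, max_locations=4)
```

Without either limit, the loop over `calendar_links` would never run out of new days to visit.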
Because Burp’s crawler navigates around a target application using a built-in browser, it can automatically handle virtually any session-handling mechanism that modern browsers can use. There is no need to record macros or configure session-handling rules that tell Burp how to obtain a session or verify that the current session is valid.
The crawler employs multiple crawler “agents” to parallelize its work. Each agent represents a distinct user of the application, navigating with their own browser. Each agent has its own cookie jar, which is updated when the application issues it a cookie. When an agent returns to the start location to begin crawling from there, its cookie jar is cleared, to simulate a completely fresh browser session.
The requests that the crawler makes while navigating are built dynamically from the preceding response, so CSRF tokens in URLs or form fields are handled automatically. This allows the crawler to navigate correctly around functions that use complex session handling, without any configuration by the user.
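The per-agent cookie jar can be pictured with a small class. The `CrawlAgent` class and its method names are invented for this sketch; only `http.cookies.SimpleCookie` comes from the Python standard library, and a real browser applies far more rules (domain, path, expiry) than shown here.

```python
from http.cookies import SimpleCookie


class CrawlAgent:
    """Hypothetical crawler agent that keeps its own, isolated cookie jar."""

    def __init__(self, name):
        self.name = name
        self.cookies = {}

    def receive(self, set_cookie_header: str):
        # Parse a Set-Cookie header value and store the cookie for this agent only.
        parsed = SimpleCookie()
        parsed.load(set_cookie_header)
        for key, morsel in parsed.items():
            self.cookies[key] = morsel.value

    def cookie_header(self) -> str:
        # Value to send back in the Cookie request header.
        return "; ".join(f"{k}={v}" for k, v in self.cookies.items())

    def return_to_start(self):
        # Clearing the jar simulates a completely fresh browser session.
        self.cookies.clear()


agent_1 = CrawlAgent("agent-1")
agent_2 = CrawlAgent("agent-2")
agent_1.receive("session=abc123; Path=/")
agent_2.receive("session=xyz789; Path=/")

header_1 = agent_1.cookie_header()
header_2 = agent_2.cookie_header()
agent_1.return_to_start()
header_1_after_reset = agent_1.cookie_header()
```

Because each agent owns its jar, resetting one agent leaves the other agent’s session untouched.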
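Building a request from the preceding response can be sketched as follows: the form in the latest response is parsed, and its current field values (including any hidden CSRF token) are carried into the submission. The form markup, field names, and the `build_submission` helper are assumptions for illustration; a browser-driven crawler gets this behavior from the browser itself rather than from code like this.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Collects the action and the named input fields of a form in a response."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
        elif tag == "input" and "name" in attrs:
            # Hidden fields (such as CSRF tokens) keep the value the server issued.
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_submission(response_html, user_values):
    """Hypothetical helper: merge server-issued field values with the crawler's input."""
    parser = FormFieldParser()
    parser.feed(response_html)
    parser.close()
    fields = dict(parser.fields)
    fields.update(user_values)
    return parser.action, urlencode(fields)


response = (
    '<form action="/email/change" method="post">'
    '<input type="hidden" name="csrf" value="t0k3n-1">'
    '<input name="email">'
    "</form>"
)
action, body = build_submission(response, {"email": "a@example.com"})
```

Since the token is read from the response each time, a new token on the next visit is picked up without any configuration.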
Detection of application state changes
Modern web applications are very dynamic and it is common for the same application function to return different content on different occasions, as a result of actions taken by the user in the meantime. Burp’s crawler is able to detect changes in the state of the application resulting from the actions it has taken while crawling.
Consider an example in which walking the path BC moves the application from state 1 to state 2. Link D leads to a logically different location in state 1 than in state 2. Thus, the path AD leads to the empty cart, while the path ABCD leads to the populated cart. Rather than simply concluding that link D is non-deterministic, the crawler is able to identify the state-changing path that link D depends on. This allows the crawler to reliably reach the populated cart location in future, and to access the other functions available from there.
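The cart example can be simulated to show how a state-changing step might be isolated. The `ShopSimulation` class is a made-up stand-in for the application, and the approach of dropping one link at a time is an assumption chosen for simplicity; the text does not say how Burp performs this analysis. In this simplified model only link C changes state, whereas the text attributes the change to the path BC.

```python
class ShopSimulation:
    """Hypothetical application: link C adds an item, link D shows the cart."""

    def __init__(self):
        self.items = 0

    def follow(self, link):
        if link == "C":
            self.items += 1
        if link == "D":
            return "populated cart" if self.items else "empty cart"
        return f"page {link}"


def walk(path):
    """Follow a path in a fresh session and return the final location reached."""
    app = ShopSimulation()
    result = None
    for link in path:
        result = app.follow(link)
    return result


def state_changing_links(base_path, final_link):
    """Return links in base_path whose removal changes where final_link leads."""
    reference = walk(base_path + [final_link])
    culprits = []
    for i, link in enumerate(base_path):
        shortened = base_path[:i] + base_path[i + 1:]
        if walk(shortened + [final_link]) != reference:
            culprits.append(link)
    return culprits


via_ad = walk(["A", "D"])
via_abcd = walk(["A", "B", "C", "D"])
culprits = state_changing_links(["A", "B", "C"], "D")
```

Once the dependency is known, the populated cart can be reached on demand by replaying the path that includes the state-changing step.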
Application login
Burp’s crawler begins with an unauthenticated phase in which no credentials are submitted. Once this phase is complete, Burp will have discovered any login and self-registration functions in the application.
If the application supports self-registration, Burp will attempt to register a user. You can also configure the crawler to use one or more pre-existing logins.
The crawler then proceeds to an authenticated phase. It visits the login function multiple times and submits:
The credentials of the self-registered account (if applicable).
The credentials for each pre-existing account that is configured.
Bogus credentials (these may reach interesting functions, such as account recovery).
For each set of credentials submitted at the login, Burp then crawls the content discovered behind the login. This allows the crawler to capture the different functions available to different types of user.
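The order of login submissions described above can be written down as a short planning function. The function name, the tuple layout, and the placeholder bogus credentials are all assumptions for this sketch, not values that Burp uses.

```python
def plan_login_submissions(self_registered, configured_accounts):
    """Return (label, (username, password)) pairs in the order they would be tried."""
    submissions = []
    if self_registered is not None:
        submissions.append(("self-registered", self_registered))
    for account in configured_accounts:
        submissions.append(("configured", account))
    # Placeholder values: deliberately invalid credentials are always tried last.
    submissions.append(("bogus", ("no-such-user", "wrong-password")))
    return submissions


plan = plan_login_submissions(
    ("newuser", "pw1"),
    [("admin", "pw2"), ("carlos", "pw3")],
)
labels = [label for label, _ in plan]
```

Each entry in `plan` would start its own crawl of the content behind the login, so that functions visible only to certain users are recorded separately.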
Crawling volatile content
Modern web applications frequently contain volatile content, where the “same” location or function returns responses that differ substantially on different occasions, not necessarily as a result of any user action. This behavior can result from factors such as feeds from social media channels or user comments, inline advertising, or genuinely randomized content (post of the day, A/B testing, and so on).
Burp’s crawler is able to identify many instances of volatile content and correctly re-identify the same location on different visits, despite the differing responses. This allows the crawler to focus its attention on the “core” elements within a set of application responses, which are likely to be the most important for discovering the key navigation paths to the application’s content and functionality.
In some cases, visiting a given link on different occasions returns responses that are too different to be treated as “the same”. In this situation, Burp’s crawler captures both versions of the response as two different locations and draws a non-deterministic edge in the graph. As long as the extent of non-determinism in the application is not too great, Burp can still crawl the associated content and reliably find its way to the content behind the non-deterministic link.
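One way to picture re-identification despite volatile content is to compare only a set of “core” features and tolerate partial overlap. The sketch below uses link texts as the core features and Jaccard similarity with an arbitrary threshold of 0.5; both choices are assumptions for illustration and are not taken from Burp.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects the visible text of links, ignoring all other page content."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self.link_texts = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_link and data.strip():
            self.link_texts.add(data.strip())


def core_features(html):
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.link_texts


def similarity(html_a, html_b):
    """Jaccard similarity of the two pages' link texts."""
    a, b = core_features(html_a), core_features(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def same_location(html_a, html_b, threshold=0.5):
    return similarity(html_a, html_b) >= threshold


nav = '<a href="/">Home</a><a href="/shop">Shop</a><a href="/cart">Cart</a>'
visit_1 = nav + '<p>Post of the day: hello</p><a href="/ad/1">Buy shoes</a>'
visit_2 = nav + '<p>Post of the day: goodbye</p><a href="/ad/2">Buy hats</a>'
login_page = '<a href="/">Home</a><a href="/reset">Forgot password</a>'
```

The two visits share three of five distinct link texts, so they are treated as one location, while the login page shares only one and is treated as a different location.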
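A non-deterministic edge can be modeled as a link that has been observed to lead to more than one location. The `CrawlGraph` class below is a hypothetical data structure for this example, not a description of Burp’s internals.

```python
from collections import defaultdict


class CrawlGraph:
    """Hypothetical crawl graph that records every destination seen for each link."""

    def __init__(self):
        self._destinations = defaultdict(set)

    def record_visit(self, source, link, destination):
        self._destinations[(source, link)].add(destination)

    def destinations(self, source, link):
        # Return a copy so callers cannot modify the graph by accident.
        return set(self._destinations.get((source, link), set()))

    def is_non_deterministic(self, source, link):
        return len(self.destinations(source, link)) > 1


graph = CrawlGraph()
graph.record_visit("home", "offers", "summer offers")
graph.record_visit("home", "offers", "winter offers")
graph.record_visit("home", "about", "about page")
graph.record_visit("home", "about", "about page")
```

Keeping both destinations on the edge means the content behind either version remains reachable in later navigation.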
Crawling with the built-in browser (browser-powered scanning)
By default, if your machine appears to support it, Burp uses its built-in Chromium browser for all navigation of your target websites and applications. This approach offers several major advantages, allowing Burp Scanner to handle most of the client-side technologies that modern browsers can handle.
If you prefer, you can also manually enable or disable browser-powered scanning in your scan configuration. You can find this option under “Crawl options” > “Miscellaneous” > “Integrated browser options”.