Relwarc: the third part of the story about the client-side JS analyzer for mining HTTP requests
In this post (continuing these: 1, 2) I’ll talk about library support, converting calls to HTTP requests, what we ended up with, the conclusions we drew, and a bit more.
But first things first.
How does it work — more on the algorithm
The algorithm searches for calls to functions that send requests to the server. But how do we tell, by looking at a call, whether it sends a request? And how do we tell exactly what request is being sent? In theory, if the called function is declared somewhere in the source code, we can find its code, analyze it, and understand everything. But in reality, it’s not that simple: many modern JS libraries have rather complex code, and JavaScript is hard to analyze statically. As we mentioned in the first post, even handling jQuery, the most popular client-side JS library on the Internet, poses problems for existing static analyzers. That’s why we use library call patterns and library models built into the analyzer.
Library call patterns
The analyzer has a built-in set of patterns that help it understand that a particular call in the code is a call to a function that sends a request to the server. The simplest patterns are the ones we added initially and that are still used: the ones that just examine function names. That is, the analyzer has built-in knowledge that if the method $.ajax is called, with object name $ and method name ajax, it sends a request to the server. Likewise, the analyzer knows that fetch and the methods of XMLHttpRequest are designed for sending HTTP requests. We’ve added such patterns for a number of popular libraries: jQuery, Angular, Axios. Thanks to them, when the analyzer sees $http.post(...), axios(...) or jQuery.post(...), it understands that these are points where requests are sent. So, it has a built-in list of names.
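For instance, every call in this made-up snippet matches one of the name-based patterns, so the analyzer treats each of them as a point where an HTTP request is sent:

    // Made-up application code: each call below matches a built-in name-based pattern
    $.ajax({ url: '/api/items', method: 'POST', data: { id: 42 } });  // jQuery
    jQuery.post('/api/comments', { text: 'hello' });                  // jQuery
    $http.post('/api/login', { user: 'bob' });                        // Angular ($http service)
    axios({ url: '/api/profile', method: 'GET' });                    // Axios
    fetch('/api/search?q=test');                                      // Fetch API

    const xhr = new XMLHttpRequest();                                 // XMLHttpRequest
    xhr.open('PUT', '/api/settings');
    xhr.send(JSON.stringify({ theme: 'dark' }));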
Can these patterns fail? Sure. Just rename a library object or function, and the call won’t be found. And yet, they are often enough. Developers are not very likely to rename library objects: for code readability, it’s better for the call to have a familiar look. However, library functions and objects may be renamed when packaged by bundlers, especially when combined with minification. And some libraries are not designed to be accessed via a global object or function with a standard name at all (they have to be imported one way or another). That’s why the analyzer also has more complex kinds of patterns: for example, ones that detect a function by the way its code looks (using an AST pattern that matches the body of the function). If a function is detected by such a pattern, then at call sites it is matched by value, not by name.
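As a rough, invented illustration of what that means in practice: after bundling, the familiar names may be gone, but a function can still be recognized by its body and then tracked to the call site:

    // Invented example: a bundler has renamed everything, so there is no '$' or
    // 'axios' left in the output.
    function t(e, n) {
        // ...a body whose shape matches one of the built-in AST patterns...
    }
    var r = t;                               // the recognized value flows through the assignment
    r('/api/orders', { method: 'POST' });    // detected at the call site by value, not by name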
Library models
Fine, we have found the calls and did our best to compute their arguments. But function names and arguments, while useful, are not enough for the scanner to search for vulnerabilities. It needs HTTP requests. Right now we solve this problem in a pretty straightforward way: for every function for which we have a pattern, we have manually written code that models its behavior. This code converts the sequence of arguments recovered by the analyzer into the HTTP request that would be sent by such a call with such arguments. A thing to note here is that the arguments are not always fully concrete: some values may remain unknown. With such incomplete data, we would rather not fail but output some result, keeping the information we managed to discover (while a real library would be perfectly allowed to fail). In the worst case the URL (or the entire request body) may remain completely unknown; in that case the analyzer won’t output such a request. I would rather not go into more detail on this part, it’s not very exciting: each library model tries to take into account different nuances of the library’s behavior and different cases where the data is not completely correct.
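To make the idea more concrete, here is a rough sketch of what such a model might do for a jQuery.post(url, data) call. The names and the exact shape are made up for illustration and are not taken from the actual code:

    // Illustrative sketch only (made-up names, not the actual relwarc code): turn
    // the arguments recovered for a jQuery.post(url, data) call into an HTTP
    // request description, tolerating values the analyzer could not compute.
    const UNKNOWN = '<UNKNOWN>';

    function modelJQueryPost(args) {
        const [url, data] = args;
        if (url === undefined || url === UNKNOWN) {
            return null; // the URL is completely unknown: don't report such a request
        }
        let body = '';
        if (data !== undefined && data !== UNKNOWN && typeof data === 'object' && data !== null) {
            // Serialize known key/value pairs, keeping placeholders for unknown values
            body = Object.entries(data)
                .map(([k, v]) => k + '=' + (v === UNKNOWN ? UNKNOWN : encodeURIComponent(v)))
                .join('&');
        } else if (data !== undefined) {
            body = String(data);
        }
        return {
            method: 'POST',
            url: url,
            headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
            body: body,
        };
    }

You can check how the real models handle all of this in the source code yourself if you want.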
Because we released it.
Relwarc
We have made the code of our analyzer open source. We’ve also created a publicly available application, relwarc.solidpoint.net, that lets you run our analyzer on some JS. We decided to give our analyzer the name Relwarc, that is, “crawler” reversed. Because it solves the same problem as a crawler, but by going in the opposite direction. It tries to determine which requests are sent by client-side code to the server, but it does not emulate user actions on the page. Instead, it looks at the code itself. It works statically, not dynamically.
The version of the code that more or less corresponds to what was described in this and the previous posts can be found on GitHub at this link:
https://github.com/seclab-msu/relwarc/tree/vanilla-algorithm
The application relwarc.solidpoint.net can analyze both separate JS files and web pages as a whole. The app provides an API as well as a web UI that lets you run the analyzer on something manually and play with it. Check it out! It might find new server endpoints for you in code you don’t have the time or energy to dig through manually. If you find a bug or have thoughts on what to improve, feel free to create an issue in the GitHub repository (or a pull request right away)!
The result
So, this and the two previous posts describe our algorithm for analyzing JS. Is this the complete algorithm we have developed? Well, actually, not quite. You may notice that the GitHub link above leads to a tag that doesn’t point to the newest commit in the repository. Since developing this basic version, we have improved a ton of things: we’ve added computation of function return values, support for classes and bundlers, and a number of other enhancements. Even three posts are not enough to describe them all; besides, the algorithm as described in these posts already worked quite well, and we were using it in practice.
Quality metrics
Alright, we’ve made this algorithm, implemented the call chain analysis, added various improvements here and there. But what did we get as a result? What’s the outcome? Does this analyzer work? Does it work well? Does it find anything useful? To answer this, it would be great to have some benchmark that lets us measure the quality of the algorithm. Say, a set of applications, or at least web pages, real or similar to real ones, which have some client-side JS sending requests to the server, and for which we know the ground truth: what requests are actually sent. Then we could compare those correct answers with the output of our analyzer and calculate quality metrics. We could not find such a benchmark (WIVET is somewhat similar, but not quite: it solves a different problem). So we decided to create our own! We have made a dataset of web pages taken from real applications (to be precise, there is one page from a test application: the main page of Juice Shop). For each page we have manually found which server endpoints are accessed from its client-side JS and added a list of these requests to the dataset (the ground truth). Almost all pages in the dataset are taken from random Internet sites: we took them from the Alexa Top 1 Million list and from the scopes of public Bug Bounty programs. We have published this dataset, here’s the link:
https://github.com/seclab-msu/ajax-page-dataset
In addition to the pages and the ground truth, this repo contains scripts to run the analyzer and calculate the metrics. The main metric we use right now is average coverage: we compute the coverage for each page (the percentage of ground-truth requests that were found) and take the average of these percentages over all pages. Right now the average coverage is around 40%.
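For reference, the metric itself fits in a few lines; the numbers below are made up:

    // Per-page coverage is the share of ground-truth requests the analyzer found;
    // the reported figure is the mean of these shares over all pages (made-up data).
    const pages = [
        { expected: 10, found: 6 },
        { expected: 4,  found: 1 },
        { expected: 5,  found: 2 },
    ];
    const averageCoverage =
        pages.map(p => p.found / p.expected)
             .reduce((sum, c) => sum + c, 0) / pages.length;
    console.log((averageCoverage * 100).toFixed(1) + '%');  // prints "41.7%"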
In conclusion
Static analysis is undecidable in general, and analyzing JavaScript is particularly hard. So making an analyzer that works perfectly, in 100% of cases, is impossible. But we can try to make an algorithm that works in most real cases. Try hard to increase the percentage of real applications covered by the algorithm to 80% or 90%. It will always be possible to craft a tricky code sample that breaks the algorithm, but who cares, if such code never occurs in reality? To figure out how to build an analyzer that works in the real world, we had to do more than just write code: we had to study real code. We’ve spent a lot of time exploring the ways request-sending code is built and studying the data flows that influence the requests being sent.
Right now the average coverage on our dataset is 40%. Is this high or low? Of course it would be awesome to have those 80% or 90%. But still, statically analyzing JS is not easy, and we tried to make our dataset somewhat representative, so that by looking at its score one could get an idea of the quality on a randomly selected site from the Internet. And none of the existing JS analyzers known in the academic literature handles real modern web pages. Following the principle mentioned in the first post, we decided to start with something simple, so that we did not end up creating a huge, complicated mechanism that would never actually work on something real (or that we would never even finish building). After creating a simple tool, our plan was to improve it iteratively, increasing complexity and supporting new code features. As a result, an analyzer that accounts for the specifics of real-world code (and is tailored to the specific task at hand) is able to handle real pages, and often does so rather fast, within a minute or a few minutes.
Does this mean that this analyzer should now only be developed iteratively, gradually adding features or making small changes? Actually, no: sometimes you still have to rewrite the software and rework it significantly, maybe building the new algorithm on different principles. But what will remain valuable are the quality metrics on the dataset, the information about which requests the previous algorithm found and how much time it spent doing so, what tests it passed: the gathered knowledge. If during development the new algorithm turns out to be inferior to the old one in some way, it will always be possible to look at why the old one managed to succeed, and which principles of its operation allowed it to handle a particular code construct.
And finally, this static analyzer does not replace dynamic crawlers; it complements them. When we began developing it, we started from the idea that some requests are hard to trigger through interactions with the user interface. But there are also requests that are easier to discover dynamically. Demanding 99% coverage from a static analyzer would mean requiring it to always handle everything on its own. Maybe it is possible to build such an algorithm. But right now it is evident that in some places a static approach makes more sense, and in others dynamic analysis is easier. Neither is a silver bullet, and using both gives maximum coverage.