Mining HTTP requests from client-side JS with static analysis — part 2

May 8, 2024 • Daniil Sigalov (@asterite3) Читать на русском языке

In the previous post we looked at the basic idea behind the algorithm and even built a small analyzer based on it. In this post I’ll talk a bit more about the principles our algorithm works on, including how it determines function argument values and handles objects like XMLHttpRequest, plus a few notes on the supported operations.

But still — how?

As I’ve already said, in terms of the core ideas, the algorithm is the same as what was described in the previous post. However, there are some distinctions.

Supported types and operations

The analyzer supports more types beyond just strings and UNKNOWN. Among them are objects, arrays, numbers, boolean and undefined too (yes, as you probably know, there is a distinct type for undefined in JS). For all these types, the analyzer can convert their literals (including things like [1, 2, 3] and {foo: "bar", abc: [true]}) into values and can compute the results of some operations involving these types. The set of supported operations includes reading object/array fields (a[x], ob.prop) and writing them (a[x] = 123, ob.prop = v).

Functional values are also supported — when the analyzer processes function declarations or function expressions (function literals, if you will) it also produces values that can be stored in its memory.

So, for this code

function f() {
    return 123;
}
var g = function(x) {
    return x * 25;
}
var k = f;
var l = g;

The analyzer will figure out that variables f and k point to the first function (the one that returns 123) and g and l — to the second one (that returns its argument multiplied by 25).

Each functional value contains a reference to the code of the corresponding function (to its AST, to be precise). This will be useful for interprocedural analysis which I’ll talk about later.

By the way, we can notice that values known to the analyzer can reside somewhere else except being kept directly as values of the memory map — because the analyzer supports objects and arrays, values can “sit” in them, being their members. Perhaps on some depth (with multiple layers of nested objects/arrays).

The analyzer can also compute the results of calls to some JS built-in functions, including the methods that work with strings (.concat(...), .substr(...), etc.). Template strings are also supported.

And, as before, if the analyzer is unable to evaluate the result of some operation, it will consider the result to be UNKNOWN. This can happen if the operation is not supported, or the value, that the operation works on, is not known. For example, if the analyzer tries to process x = a[3] and a has the value UNKNOWN, then x will also become UNKNOWN.

Iterative analysis

The analysis algorithm described in the previous post does a single AST traversal — this more-or-less corresponds to going through the code “from beginning to end”. But in reality, program instructions are not always executed in the order they appear in the source code. Functions, for instance, are not necessarily called in the order they are written. What if we have the code like this:

function f() {
    fetch(baseURL + '/users/list');
}

baseURL = '/blog/api';
f();

A global variable baseURL is used here without being declared.

Or, like this:

function createRequestService() {
    var apiVer, section;
    function getSettings() {
        fetch('/api/' + apiVer + '/' + section + '/settings');
    }
    function notifyOnline() {
        fetch(
            '/api/' + apiVer + '/' + section + '/status?online=true',
            { method: 'POST' }
        );
    }
    function initialize() {
        apiVer = '2.1';
        section = 'shop';
    }
    return { getSettings, notifyOnline, initialize };
}

In both cases, variables have their values assigned later in the source code than they are used. We could say that data does not always “flow” from the beginning of the source code to its end, sometimes data “flows backward”. To handle this, analyzers make several analysis passes over the code. For the algorithm to be lightweight, we decided to start with two passes — one preliminary pass, where the analyzer will attempt to gather the assigned values everywhere, and the second, main pass. AJAX calls are processed only during the main pass, and the preliminary one is only used for gathering values. This will be sufficient for the two snippets above. For some more sophisticated cases this may become insufficient.

Classical static analyzers, built on textbook principles, often work as follows: they iterate over the code, gathering more and more information about it on each iteration, until the process converges — until the moment when an iteration adds no new information. For that to always work, an algorithm has to be built in a special way — so that the process is guaranteed to converge. Even if it does always converge, there are cases when it takes too long to wait until this happens. By the way, let me remind that you can use the references at the end of the previous post to learn more about the classical analysis algorithms.

By looking at the design of our algorithm one can tell that it is specialized for finding values that are actually “constant” — that are set once, and set with some concrete value (value fixed in the code, not coming from the outside). Our hypothesis is that “fixed” parts of the request, the ones that control which server endpoint will be invoked, will be set this way in the source code. We won’t find variable request parts, request parameter values (usernames, search queries, amounts of products, etc.) in the code, but maybe we don’t need them, after all.

Interprocedural analysis

The process of sending a request often involves several functions. We have to understand how data flows between them. When examining code samples from real sites, this most often translated for us to question “what data comes from the function arguments?”. We decided not to do the full-fledged interprocedural analysis for all functions of the program because we assumed that such analysis would be too heavyweight. Instead, we chose to do a targeted analysis only for points in the program that we’re interested in (later we reconsidered this decision a bit, but that’s a story for another day). We also decided to give up on determining the return values of regular, user-defined functions — they all are considered to be UNKNOWN. Let’s focus on the arguments.

To make the analysis targeted, to make it search for arguments only in places we need, we have to somehow understand: where do we need that. For that, a new special value was introduced: FROM_ARG. This value is similar to UNKNOWN — this is a unique value that signifies something unknown to us. But it additionally carries the information that this value originated from formal function arguments. When analyzing function bodies we will assume that values of all formal arguments are FROM_ARG.

Formal function arguments are initialized to FROM_ARG

If this value will then reach the AJAX call (for example, $.post or fetch), that is, one of the arguments to that call will contain the value FROM_ARG, that would mean that the request being sent depends on the formal arguments of the function that made this AJAX call. Which, in turn, means, that we should look for places where this function is called and try to determine which arguments are given to it there.

The value FROM_ARG has reached an AJAX-call

How do we search for those calls? We have our memory map, where all variables are stored. And the analyzer supports function values. So let’s take a function, which calls we want to find (function f in this case) and take from memory all variables pointing to the functional value of f. In the simplest case this will be the sole variable f, but in more complex cases other variables might also be picked if there were aliases: for example, if there was an assignment like other_func = f. For every such variable (pointing to f) we will again traverse the code in search for call sites, where this variable stands in place of the function being called — we will consider them to be call sites of f. For them we would like to determine the arguments passed to the call. Let’s assume that the found call site resides inside some other function, g.

Call site of `f` is found inside function `g`

We will now analyze the body of this function g — traverse the AST of its body in the same way as previous analysis traversals, but, when we meet the f call site we’re looking for, we will evaluate its arguments and will move to the beginning of the f’s body. We will then traverse the body of f, but instead of FROM_ARG we will take the computed arguments as values of f’s formal arguments. And then we will analyze f, traversing the AST of its body. This way we will “emulate” the call to f with computed arguments. Now, when we will reach the AJAX call inside f, there is a chance that we will discover the more specific arguments to that AJAX call.

Call site analysis allowed to descover the more specific arguments

We will perform such analysis for all discovered call sites.

The analysis will also be performed for other call site of f

The call site might also be found outside any function, on the top level. In this case we will repeat the analysis of the global context, i.e. the whole code. In other words, this case is handled as if the whole program is contained in one large function. This is far from being optimal, but we’ll leave it like this now for the sake of simplicity.

💡 Side note:

a bit more on what does "FROM_ARG will reach the AJAX call" mean

Of course it is possible for the call argument to be simply equal to FROM_ARG. But other possibilities often arise:

The algorithm supports arrays and objects, so, an array or an object containing FROM_ARG as one of its members may turn out to be an AJAX call argument. This exact case was shown on screenshots above, but the way. To detect such cases, data structures that flow into call arguments are traversed recursively in search for FROM_ARG.
Request data is often transformed using string operations. If the value FROM_ARG is involved in such an operation, it is cast to the string "FROM_ARG". When searching for FROM_ARG in call arguments, in addition to checking whether the value is identical to FROM_ARG, all string values are also checked for whether they contain the substring "FROM_ARG".

Call chains

We might need to analyze call chains containing more than two functions while searching for arguments. Suppose we had a function f that made an AJAX call, we found its call site inside some other function (g). It may happen so that g passes some of its arguments to the f call (or data depending on those arguments). We can detect that by seeing that, when analyzing the chain f -> g, we still notice the value FROM_ARG reaching the AJAX call. This would mean that we might have to analyze the longer chain — look for calls to g in the code and repeat the process starting from each of them.

Chain with three functions

We perform this call chain analysis after the regular, base analysis passes are done (the ones that traverse the whole code). This way, the analysis becomes more targeted — it examines the entire code for some time, but then pays more attention to parts that we’re interested in. Here we use the fact that we detect AJAX calls syntactically, using their names — so, “interesting” points, the places where call chains begin, are “cheap” for us to find, finding them does not require code analysis by itself. But, on the other side, we won’t find all AJAX calls this way. Besides, obviously, this method will only be able to build call chains for calls to free-standing functions, or, to be more specific, calls, where the function being called is given as a variable, for which the analyzer could find a value. Calls to object methods (x.f()) won’t be found by this algorithm as well as calls to functions that were, say, passed as arguments (callbacks).

💡 Side note:

okay, but this was a conceptual description, how was that actually implemented?

The analyzer maintains a call chain queue containing the chains it has to process. Each call chain is a sequence of pairs (fn_body,call_site), where fn_body is a body of the function being the part of the chain and call_site is a call site inside that function that we're interested in. Both fn_body and call_site are AST nodes. The call_site is not set for the last function in the chain — the analyzer will search for AJAX calls inside that function.

Before the call chain analysis, the analyzer performs the regular analysis passes over the entire code. If, during those passes, it detects places where FROM_ARG is passed into AJAX calls, first call chains are built and put into the queue (at this point all created chains will consist of 2 functions — the caller and the callee).

On the call chain analysis stage, the analyzer takes the chains from the queue one by one and "traverses" each chain. To traverse the chain it takes its first element and analyzes the fn_body. If this function is not the last in the chain, the analyzer traverses its code until it finds the call_site. At this point the analyzer computes the actual arguments to the call and "dives" into the call — for that, it takes the next chain element and starts to analyze its fn_body (using the computed call arguments). If the call chain element is the last one, the analyzer looks for AJAX calls inside it, gathering the calls together with arguments passed to them, like during the "regular" analysis stage. If it sees that the value FROM_ARG has reached an AJAX call this means that a longer chain should be analyzed. For that the analyzer will take the current chain, pick its first element and look for its call sites. For each found call site a new, longer chain will be built by taking the current chain and attaching to it a pair of the found call site and the function containing that call site. New chains created this way are added to the call chain queue. To prevent this process from lasting forever, there is a limit on the maximum call chain length. In the current version chains with lengths up to 5 functions are allowed.

Special objects

Previously the term AJAX call was used as if it is a single function call, that accepts all the data needed to construct an HTTP request, and that sends that request. But this is not always the case, sometimes sending a request from JS is done with multiple calls. In fact, the first browser interface for sending requests from JS, XMLHttpRequest, works just like that, it requires several calls to send a request. At the very least, one has to construct an XMLHttpRequest instance using new, then call its method .open(...) and then call .send(...). How do we find this with our analyzer?

In short, the idea is to create a special object, which will model an XMLHttpRequest instance. Such special objects will be among the values supported by the analyzer and will get passed during assignments, field accesses, etc. We will create such an object every time we see a beginning of XMLHttpRequest usage and these objects will memorize the request data put into them. Upon seeing the call someObj.open(...) we will check whether the value of someObj is one of our special XMLHttpRequest objects. If so, computed arguments to open will be stored inside that object. The same is done with .setRequestHeader(...) calls. When seeing the .send(...) call for our object, we will also compute its arguments — and at this point we will have all the data that the request is made from. The same technique is used, by the way, for modeling FormData instances. FormData objects are used to create the sets of request parameters and, just like XMLHttpRequest, they have their own state. Calls to .append(...) or .set(...) add parameters to FormData object, and then the whole set of parameters will be sent when this object is passed to .send(...) call (or to fetch(...)). Accordingly, for this we have a special FormDataModel class. Its instance is created when processing the expression new FormData() and, when the analyzer processes calls to append and set, data is written into this instance. That data is used when AJAX calls are processed.

We could do the same with XMLHttpRequest — create a special object when evaluating new XMLHttpRequest. But in reality, we don’t do it quite like that.

A hack for `XMLHttpRequest`

When analyzing the real code we noticed that, when XMLHttpRequest is used, calls to .open(...) and .send(...) are always located in the same function. In theory it is, of course, possible to create an XMLHttpRequest instance, call its .open(...) method, then pass this instance to some other function, where the method .send() will be called — but no one does this. In all the cases that we’ve seen, everything happens in the same function, and the object looks the same in both method calls. Like this:

// ...
xhr.open("POST", url, true);
// ...
xhr.send(data);
// ...

Meanwhile, new XMLHttpRequest() is often moved to another function — an auxiliary function for creating this object and returning it, so that some other code can use it (such an auxiliary function is used to support old browsers, where the class ActiveXObject or something like that is used instead of XMLHttpRequest). It will be sad if our “data-flow” analysis fails to track the way of an XHR object from new to the place where it’s used and the request being sent won’t be found because of that, while open and send always come together. Especially given that we’re not determining return values at the moment. So we decided to do the following:

Create an object modeling XMLHttpRequest when we see .open(...).
Identify this object not by accessing memory, but solely by the “appearance” of an AST node that denotes an object, which method is called. That is, to the left of .send(...) there should be the same thing (syntactically) that to the left of .open(...).

This heuristic has never given us false positives — yeah, maybe sometimes we create “extra” objects for some open calls, but there has never been such a coincidence that after that open the call to send followed for the same object, and that was not an XMLHttpRequest usage.

Well, I guess that’s enough detail for this post. Thanks for reading it! For more, head over to the third post about the JS analysis algorithm — it’s right here.