How to Save 70% Gas Fee on Aptos PancakeSwap with Only ~40 Lines of Code?

10 min readDec 23, 2022

Overview

TL;DR:

Since the deployment of PancakeSwap on Aptos, there has been an influx of activity and users on the Aptos Ecosystem. Hence, we noticed the opportunity to assist the PancakeSwap team to reduce gas consumption. The gas usage of PancakeSwap is relatively high with ~0.0105 APT per call to swap (see the original gas cost on Aptos Explorer, using Testnet Mode).

After collaborating with MoveBit, the new version which will be released soon will be around ~0.0034 APT per call (see the optimized gas cost on Aptos Explorer, using Testnet Mode).

Furthermore, after U128 overflow detection is added, the fee can be further reduced to 0.001478 which will account for a total reduction of 58% in the normal situation.

What’s more important is that all the change is within~40 lines of code.

Can you believe that?

Problem Analysis

Community shared their feedback that stablecoin swap on the Pancakeswap Move Ecosystem takes relatively higher gas fee.

We took a thorough look of the whole codebase and main swap logic is straightforward.

Then our focus switched to its dependencies. It basically depends on two things:

Aptos standard library & Aptos framework
pancake’s implementation of Math lib and U256 lib

We did a bit of review of libraries written by Aptos, and it is not very likely to cause high gas usage. Then we found that the U256 lib is the suspicious part:

swap() makes use of 8 u256::mul() and 2 u256::sub(). Those are, however, expensive. For example, sub() requires at most 16 calls to overflowing_sub(). And mul() takes 16 iterations, each of them calls 3 overflowing_add() and 5 other functions, which makes a total of more than 128 function invocations. Seems overly luxurious! Besides, although U256 has fixed fields v3..v0, the U256 module uses get() and put() as getter/setter to manipulate the fields. They use a lot of redundant if-else, which also results in more gas usage.

Expensive Code Identification

We first need to ensure that our previous inference is correct. That is, we need to confirm it is unnecessary usage of U256 causing the problem.

Let’s verify our guess.

Initially, we deployed the original PancakeSwap on the Aptos Testnet. As we’ve mentioned earlier, it is relatively costly: It takes ~0.0105 APT per swap action (see the original gas cost on Aptos Explorer in the test mode).

Then, we remove the usage of U256completely and replace all of them with regular built-in primitive types. The result is that this version only takes ~0.0033 APT (see no-U256 gas cost in the test mode). Of course, even though we can run it on Testnet with a few mocked transactions without any problem, this implementation is inherently wrong. Because it ignored possible overflow and might abort when a large number gets involved.

Optimization Step1

Before diving into any further refactoring. We should take a look at the computation on L552 of swap.move.

To simplify our discussion, we’ll use shorter symbols for variables in the program. Let a denote amount, r denote reserve, and a variable with a prime symbol denotes an adjusted value. For example：

Without consideration of any limit of bounded integer size, mathematically, the computation is equivalent to

Check whether

holds

Then we can analyze the range of them.

For b’_x, we see that,

b_x and P are both of type u64, thus their product can always be represented by a u128
a_ {x,in} is a u64. Multiplying it with 25 results in a product within the limit of u128
Ignoring the underflow, the difference between two u128 can be represented by a u128.

Similarly, we can also use a u128 to store b'_y.

Now p, which is the product of b’_x and b’_y might overflow if we were to define it as a u128. We have to use a 256-bit unsigned integer to store it.

Lastly, we can see k as the product of r_x P and r_y P. Similar to our previous analysis, r_x P and r_y P can both be u128 because all factors of them are u64. Therefore, k should also be a u256.

Now, we can eliminate all unnecessary u256variables, so the code becomes:

let prec = (PRECISION as u128);
let balance_x_adjusted = (balance_x as u128) * prec - (amount_x_in as u128) * 25u128;
let balance_y_adjusted = (balance_y as u128) * prec - (amount_y_in as u128) * 25u128;
let reserve_x_adjusted = (reserves.reserve_x as u128) * prec;
let reserve_y_adjusted = (reserves.reserve_y as u128) * prec;
let p = u256::mul(u256::from_u128(balance_x_adjusted), u256::from_u128(balance_y_adjusted));
let k = u256::mul(u256::from_u128(reserve_x_adjusted), u256::from_u128(reserve_y_adjusted));
let compare = u256::compare(&u256::mul(balance_x_adjusted, balance_y_adjusted), &k);
assert!(compare == u256::get_greater_than() || compare == u256::get_equal(), ERROR_K);

With only 2 u256 multiplications, the gas fee becomes ~0.0052 APT, which is 50% of the original gas cost (see it here).

Optimization Step2

Do you remember how we learned multiplication when we were kids? At first, we are taught to memorize the multiplication table of digits no greater than 9. Then for multiplication involving numbers larger than 10, we do it digit by digit.

For example, to calculate 472 × 32, we will have a table like this:

471  <-- 471 interpreted as 4*100 + 7*10 + 1
*    32  <--  32 interpreted as 3*10 + 2
-------
      2  <-- 1*2, 0 left shift
    14   <-- 7*2, 1 left shift
    8    <-- 4*2, 2 left shift
     3   <-- 1*3, 1 left shift
   21    <-- 7*3, 2 left shift
+ 12     <-- 4*3, 3 left shift
-------
  15072

In the decimal world, each digit ranges from 0 to 9.

Now, let’s imagine each u64 is a single digit, ranging from 0 to 2⁶⁴ — 1 (namely, MAX_U64). Then, each u128 is essentially two u64digits, which can U256 be four u64 digits. Digits with smaller indices mean insignificant parts, while those with larger indices mean significant parts.

To create a lossless multiplication of a and b of type u128, we can arrange the computation in a similar manner :

        a1 a0  <--- a = (a1<<64) + a0
*       b1 b0  <--- b = (b1<<64) + b0
-------------
        w1 w0  <--- [Step 1] a0*b0, 0 left shift
     x1 x0     <--- [Step 2] a1*b0, 1 left shift 
     y1 y0     <--- [Step 3] a0*b1, 1 left shift 
+ z1 z0        <--- [Step 4] a1*b1, 2 left shift 
-------------
  v3 v2 v1 v0  <--- interpreted as (v3<<192) + (v2<<128) + (v1<<64) + v0

Step 0. Digits extraction

So we first create those “digits” of u64:

let a1 = (a >> 64);
let a0 = (a & U64_MAX);

let b1 = (b >> 64);
let b0 = (b & U64_MAX);

Note that they are actually of type u128, as we want to do lossless multiplication of them later. Defining them as u128 directly saves us from doing repetitive type narrowing and widening, which charges more gas.

Step 1: a0 and b0

Then we calculate a0 * b0 and save them to v1 and v0:

let p = a0 * b0;
let v1 = (p >> 64);
let v0 = (p & U64_MAX);

Also, note that we don’t actually create w1 and w0. We will add values in later steps to v3..v0 on-the-fly to use fewer local variables and instructions.

Step 2: a1 and b0

The calculation of a0 * b1 takes more steps than the last step. The code looks like this:

let p = a1 * b0;
v1 = v1 + (p & U64_MAX) ;
let v2 = (p >> 64);
if (v1 > U64_MAX) {
    v2 = v2 + (v1 >> 64);
    v1 = (v1 & U64_MAX);
};

Since we have written value to v1 and v0, now we have to handle the addition carefully.

After adding x0 (defined as (a1*b0) & U64_MAX, namely p & U64_MAX) to v1 on line 2, we now have to save the possible carry number to v2 (line 3).

Furthermore, v1 might overflow after its increment. We have to give it a check on line 4. If this is the case, move the overflowed part to v2 (from line 5)

Step 3: a0 and b1

Multiplying a0 and b1 is similar to the last step, but we have more edge cases to consider. Be careful!

let p = a0 * b1;
v1 = v1 + (p & U64_MAX);
if (v1 > U64_MAX) {
    v2 = v2 + (v1 >> 64);
    v1 = (v1 & U64_MAX);
};
v2 = v2 + (p >> 64);
let v3 = (v2 >> 64);
if (v2 > U64_MAX) {
    v2 = (v2 & U64_MAX);
};

Like step 2, we add y0 to v1 on line 2. Since v2 might contain a number, we cannot assign value to it directly like before. So we should handle the possible overflow v1 from line 3 to line 6. Then we add the significant part of a0*b1 to v2 on line 7. Now, v2 might be a value that doesn't fit in u64, we define v3 to be the possible overflowed part and let v2 be the insignificant part.

Step 4: a1 and b1

Yay, just one more step to get it done! And it has nothing new:

let p = a1 * b1;
v2 = v2 + (p & U64_MAX);
if (v2 > U64_MAX) {
    v3 = v3 + (v2 >> 64);
    v2 = (v2 & U64_MAX);
};
v3 = v3 + (p >> 64);

Everything is straightforward: get z1 z0 (line 1); add z0 to v2 (line 2); handle the possible overflow of v2 (line 3-line 6); add a significant part z1 to v3.

But, did we just forget to check whether v3 would overflow if it were a u64?The answer is no. It is possible for v3 as u64 to overflow because it is ensured by math that the product of two u128 will always fit in a U256.

Final Touch

As the last step, we simply construct a U256 from v3..v0:

U256 { 
     v0: (v0 as u64), 
     v1: (v1 as u64), 
     v2: (v2 as u64), 
     v3: (v3 as u64), 
 }

Remember that v3..v0 are actually u128s, so don't forget the type narrowing to u64.

We also want to make sure that this implementation is correct. We can write some unit testing:

#[test]
fun test_mul_u128() {
    let a = U128_MAX - 3;
    let b = U128_MAX - 9;

    let p1 = mul_u128(a, b);
    let p2 = mul(from_u128(a), from_u128(b));    assert!(p1 == p2, 0);
}

Then we can run aptos move test. You will see that all the tests are passed.

Lastly, we may try the Move Prover. Create a file called u256.spec.move, and add specs:

spec pancake::u256 {
    spec module { pragma verify = false; }
    spec fun value_of_U256(a: U256): num {
        a.v0 + 
            a.v1 * P64 + 
            a.v2 * P64 * P64 +
            a.v3 * P64 * P64 * P64
    } 
    spec mul_u128 {
        pragma verify = true;
        pragma timeout = 3600;
        ensures value_of_U256(result) == a * b; 
    }
}

When you run aptos move prove, you will see error information complaining that Move Prover doesn't support bitwise-and operations. One workaround is to replace all p & U64_MAX with p % (1 + U64_MAX). However, unfortunately, this would cause SMT solver timeout (both Z3 and CVC5). Hopefully, it would get verified with a newer version of SMT solvers.

Complete Code

In file pancake-swap/sources/utils/u256.move, add

public fun mul_u128(a: u128, b: u128): U256 {
    let a1 = (a >> 64);
    let a0 = (a & U64_MAX);
    let b1 = (b >> 64);
    let b0 = (b & U64_MAX);
    // Step 1, v0 v1 created
    let p = a0 * b0;
    let v1 = (p >> 64);
    let v0 = (p & U64_MAX);
    // Step 2: v2 created, v1 modified
    let p = a1 * b0;
    v1 = v1 + (p & U64_MAX) ;
    let v2 = (p >> 64);
    if (v1 > U64_MAX) {
        v2 = v2 + (v1 >> 64);
        v1 = (v1 & U64_MAX);
    };
    // Step 3: v3 created, v2 v1 modified
    let p = a0 * b1;
    v1 = v1 + (p & U64_MAX);
    if (v1 > U64_MAX) {
        v2 = v2 + (v1 >> 64);
        v1 = (v1 & U64_MAX);
    };
    v2 = v2 + (p >> 64);
    let v3 = (v2 >> 64);
    if (v2 > U64_MAX) {
        v2 = (v2 & U64_MAX);
    };
    // Step 4: v3 v2 modified
    let p = a1 * b1;
    v2 = v2 + (p & U64_MAX);
    if (v2 > U64_MAX) {
        v3 = v3 + (v2 >> 64);
        v2 = (v2 & U64_MAX);
    };
    v3 = v3 + (p >> 64);
    // Result
    U256 { v0: (v0 as u64), v1: (v1 as u64), v2: (v2 as u64), v3: (v3 as u64) }
}

/// Greater or equal to
public fun ge(a: &U256, b: &U256): bool {
    if (a.v3 != b.v3) { return a.v3 >= b.v3 };
    if (a.v2 != b.v2) { return a.v2 >= b.v2 };
    if (a.v1 != b.v1) { return a.v1 >= b.v1 };
    (a.v0 >= b.v0)
}

Replace Line 522 with Line 560 in swap.move with:

let prec = (PRECISION as u128);
let balance_x_adjusted = (balance_x as u128) * prec - (amount_x_in as u128) * 25u128;
let balance_y_adjusted = (balance_y as u128) * prec - (amount_y_in as u128) * 25u128;
let reserve_x_adjusted = (reserves.reserve_x as u128) * prec;
let reserve_y_adjusted = (reserves.reserve_y as u128) * prec;
let p = u256::mul_u128(balance_x_adjusted, balance_y_adjusted);
let k = u256::mul_u128(reserve_x_adjusted, reserve_y_adjusted);
assert!(u256::ge(&p, &k), ERROR_K);

Conclusion

When writing code, always keep gas costs in mind! Yes, we ain’t no longer 1980s programmers who have to save every bit of resource. However, we still must not waste too much of them.

Currently, the Move compiler doesn’t do many optimizations. So you should have a basic impression of how your code will be compiled into bytecodes.

Then form some idioms. For example, loop {} is slightly better than while (true) {}.

Happy Moving!

About MoveBit

MoveBit is a security company for the Move ecosystem with a vision to make the Move ecosystem the most secure Web3 destination. The MoveBit team is composed of security leaders from academia and enterprise with 10 years of security experience. The team was one of the earliest contributors to the Move ecosystem, working with Move developers to set the standard for secure Move applications.

MoveBit Social Media Platforms:

Official Website | Twitter | Medium | GitHub | Discord