Earlier this year, in October, a new widespread cryptography vulnerability was announced. The initial announcement didn’t contain details about the vulnerability or much details on how to attack it (updated by now). It did state the affected systems though: RSA keys generated using smartcards and similar devices that use Infineon’s RSALib. The announcement came with obfuscated code that would check whether a public key is affected. Also, the name chosen by the researchers was a small hint on how to attack it: “Return of Coppersmith’s Attack”.

I decided to try and figure out the details before the conference paper describing them would be released. By the time the paper was released, I had reverse engineered the vulnerability and implemented my own attack, which did not use Coppersmith’s method at all. This post explains how I figured out what’s wrong with the affected RSA-keys and how I used that information to factor affected 512-bit RSA-keys.

## Reversing the Vulnerability

I started looking at the vulnerability, when a friend pointed me to a deobfuscated version of the detection code:

So this is the core of the Infineon RSA fail key detector: https://t.co/7hyEl5fVVs - this is very interesting indeed (and a huge fail).

— Hector Martin (@marcan42) October 20, 2017

At that point the ROCA paper wasn’t published yet. Figuring out how these keys are generated and how to attack them seemed like a nice challenge.

The detection code gives a first hint on this. It takes the public modulus $N$ and reduces it modulo a set of small primes $\{p_0, p_1, p_2, \ldots, p_{m}\}$ . For each prime $p_i$ it tests whether the remainder belongs to a set of allowed remainders $R_i$. If all remainders are in the corresponding set of allowed remainders, the key is flagged as vulnerable.

The first few tests are: \begin{align} N \bmod 11 &\in \{1, 10\} \\ N \bmod 13 &\in \{1, 3, 4, 9, 10, 12\} \\ N \bmod 17 &\in \{1, 2, 4, 8, 9, 13, 15, 16\} \\ N \bmod 19 &\in \{1, 4, 5, 6, 7, 9, 11, 16, 17\} \\ N \bmod 37 &\in \{1, 10, 26\} \\ \vdots \end{align}

This doesn’t look good.

If you take a random RSA key or large prime and reduce it modulo small primes, you’re expected to see all non-zero remainders evenly distributed. You can’t get a zero, as that would mean there is a small prime factor. For the full list of 17 small primes, only one of $\prod_i \frac{p_i - 1}{|R_i|} \approx 2^{27.8}$ possible keys has this property.

While an unintended loss 27.8 bits of entropy sounds bad as is, I assumed that this is only a symptom of whatever went wrong when generating those keys. While it would be possible to generate an RSA key from a uniform distribution of keys like this, it would be slower and more complicated than the straight forward correct way. You’d also have to deliberately restrict the allowed remainders, which seemed unlikely.

To figure out the flaw in Infineon’s RSALib, let’s first look at properly generating RSA keys. [Disclaimer: Don’t use this blog post as reference for implementing this.] The public modulus $N$ is the product of two large primes $P, Q$ of roughly equal bit size. You can constrain the bit size of $P, Q$ and $N$ by uniformly selecting primes $P$ and $Q$ from a suitable interval $I$.

The easiest way to uniformly select an element with a given property $T$ in an
interval $I$ is rejection sampling:^{1}
Uniformly select *any* element $x \in I$ (easy), check whether the property
$T(x)$ holds (hopefully easy), restart if it doesn’t.
The average number of iterations rejection sampling needs is inversely
proportional to the probability of a random $x \in I$ having the property.
The prime number theorem tells us that the probability of a random number
smaller than $N$ being prime is $\frac{1}{\log N}$.
For generating primes using rejection sampling this gets us a number of
iterations that grows linearly with the bit size.

This is already quite efficient, but can be optimized. A simple way to halve the expected number of required iterations is to sample a uniform odd number instead of any number within $I$. A further improvement would be to uniformly sample an odd number that is not divisible by three, but this already isn’t as straight forward anymore. It’s possible to continue like this by constructing and uniformly sampling more and more intricate subsets of $I$ that still contain all primes of $I$. But it is also getting harder to correctly do this, while the possible speedup is getting smaller and smaller.

This looks like a good place to screw up key generation, so let’s keep that in mind and look at the RSALib generated keys again. Assuming $P$ and $Q$ are generated independently and $N \bmod p_i \in R_i$, there must be a $R’_i$ so that $P \bmod p_i \in R’_i$ and $Q \bmod p_i \in R’_i$, i.e. $N$ can only be restricted modulo a small prime if $P$ and $Q$ also are. As any combination should be possible, we expect $R_i = \{ab \bmod p_i \mid a, b \in R'_i \times R'_i \}$ .

Playing around with the numbers in the detection code quickly shows that multiplying any two numbers in $R_i$ always results in another number in $R_i \pmod{p_i}$. Together with $1 \in R_i$ for all $R_i$, this lead me to assume $R’_i = R_i$. I didn’t rule out other possibilities in general, but for $R_0 = \{1, 10\}$ nothing else would work.

The next step was to identify what lead to the specific sets $R_i$.
We start with some observations:
As $R_i$ doesn’t contain zero, it is a *subset* of $\zzm{p_i}$, the
multiplicative group of integers modulo $p_i$.
We also discovered that $R_i$ is closed under multiplication modulo $p_i$.
This makes $R_i$ also a *subgroup* of $\zzm{p_i}$.
As $p_i$ is a prime, $\zzm{p_i}$ is a cyclic group, and thus $R_i$ is also a
cyclic group.
In particular this means there is a generator $a_i$, so that
$R_i = \{a_i^k \mid k \in \mathbb Z \}$
.

This is not exciting yet, but so far we only looked at what happens modulo individual small primes. I considered it much more likely that the RSALib code worked modulo the product $M$ of several small primes.

At that point I was thinking this: If, modulo a small prime $p_i$, all possible values are generated by an element $a_i$ that is not a generator for the whole group $\zzm{p_i}$, could it be that, modulo $M = \prod_i p_i$, all possible values are also generated by a single element $a$, so that $a_i = a \bmod p_i$?

This would be a bad idea, but we already know that someone went ahead with a bad idea, so concerning our hypothesis it’s a point in favor. So why is it a bad idea? Generating a prime candidate $P$ so that $P \bmod M$ is in $\zzm{M}$ sounds like a good idea. It would exclude all values that have a common factor with $M$, and thus cannot be prime, making our candidate more likely to be prime. So far that’s not a problem. The problem is sampling from $\zzm{M}$ by raising a single value $a$ to a random power. $\zzm{p_i}$ are cyclic groups, as $p_i$ is prime, and thus they do have a generating element $b_i$. It’s just not the $a_i$ used. In general for composite $k$ the group $\zzm{k}$ is not cyclic, i.e. it is not generated by a single element. So whatever $a$ they used, it only generates a subgroup $R$ of $\zzm{M}$. Even worse, that subgroup $R$ would be a lot smaller than $R_0 \times R_1 \times … \times R_m$, as that group again isn’t cyclic. The order of $R$ is given by $|R| = \lcm_i |R_i|$. This can be seen by considering that $a_i^k \equiv a_i^{k \bmod |R_i|} \pmod{p_i}$ and counting the possible combinations of $k \bmod |R_i|$ and $k \bmod |R_j|$.

At this point only one in $\frac{|\zzm{M}|}{|R|} \approx 2^{69.95}$ possible primes could be generated, but we haven’t validated our assumption yet.

Equipped with the test vectors that came with the original detection code, I
searched for a matching generator $a$ modulo a subset of the small primes.
I did this by combining all possible combinations of $a_i$ using the Chinese
remainder theorem (CRT).
I started with a small subset of the small primes, as this was much faster and
could falsify the hypothesis if no match was found.
As soon as $65537$ appeared as a candidate I knew that my guesses were right.
$65537 = 2^{16} + 1$ is a prime larger than our small primes, thus coprime to M
and would be a generator of $\zza{M}$ the *additive* group of integers modulo
M, which *is* a cyclic group.
Also multiplication with $2^{16} + 1$ can be very fast, especially on 8 and
16-bit microcontrollers.

Confusing the properties of $\zza{M}$ and $\zzm{M}$ could be an explanation of why someone inexperienced with cryptography wouldn’t see a problem with this approach. It does not explain why someone was allowed to go ahead with their own way of generating primes or why no one able to spot this mistake reviewed or audited this algorithm, especially given the intended applications.

We’re not quite done identifying the vulnerability yet. When looking at the set of small primes used, you can see that some primes are skipped. But what if they were only skipped in the deobfuscated and optimized detection code, because $a$ happened to be a generator for $\zzm{p_i}$? In fact when marcan published the deobfuscated code he mentioned that he removed no-op tests. Even if $a$ is a generator for $\zzm{p_i}$ we shouldn’t discard it, as the cyclic subgroup $R$ generated by $a$ modulo $M$ is smaller than the product of the individual subgroups $R_i$.

Using the test vectors I verified that for 512-bit keys the set of small primes consists of all primes up to $167$. Recomputing the size of the cyclic subgroup $R$ of $\zzm{M}$ shows that only one in $\frac{|\zzm{M}|}{|R|} \approx 2^{154.89}$ possible primes can be generated. This loses more than half of the expected entropy.

## The Attack

Having so much information about the private key can be enough to very quickly factor it. Ignoring the kind of information we have, just counting the bits of entropy, it could be possible to efficiently factor the key using variants of Coppersmith’s method. The CA in ROCA also stands for Coppersmith’s attack, but a straightforward application isn’t possible. While the entropy of the information we gained from this vulnerability is enough, it doesn’t have the right form.

Coppersmith’s method is applicable if we know that a factor has the form $c + kx$ for fixed $c$ and $k$, and $|x| < N^{\frac{1}{4}}$. This is the case when we know consecutive bits of the binary representation or know the factor modulo any number of a suitable size. In our case we only know that the factors have the form $(a^i \bmod M) + Mx$ for small $i$ and $x$. If we could afford to just bruteforce all possible values for $i$, we could apply Coppersmith’s method, assuming $|x| < N^{\frac{1}{4}}$ holds.

There are $|R| \approx 2^{61.09}$ possible values for $i$. So at first, this looks too expensive. On the other hand, in this case $|x| < \frac{2^{257}}{M} \approx 2^{37.81} \ll N^{\frac{1}{4}}$, so maybe it is possible to find some trade-off.

We are looking for a way to make $|R|$ and $M$ smaller, without making $M$ too small. Luckily this is easy: we can just ignore some of the small primes $p_i$. This results in a smaller $M’$, just the product of the new primes $\prod_i p’_i$, and a smaller $R’ \subset \zzm{M’}$. As $|R’|$ depends on the common factors of $|R’_i|$, it can be a bit difficult to find an optimal trade-off.

I implemented this attack using the Coppersmith implementation of PARI/GP, but no matter what trade-off I chose, my estimated runtime was much higher than the published one. As this is the attack described in the ROCA paper, in retrospect, I think the Coppersmith implementation I chose was just not optimized enough for this use case. In addition to that I might have missed the optimal choice for $M’$, but even a single invocation of Coppersmith’s method was much slower for me.

This prompted me to try different approaches. I hit many dead ends, until I came across an older but interesting attack for factoring RSA keys with partial knowledge of a factor. The attack was published in 1986 by Rivest and Shamir, the R and S in RSA. The paper is called “Efficient Factoring Based on Partial Information”.

Compared to Coppersmith’s method it has the downside of needing not only half the bits of a factor, i.e. $|x| < N^{\frac{1}{4}}$, but two thirds, i.e. $|x| < N^{\frac{1}{3}}$. This might seem bad at first, but as $M’$ grows faster than $|R’|$, the bruteforcing work doesn’t increase as much as going from $N^{\frac{1}{4}}$ to $N^{\frac{1}{3}}$ might suggest. Also, I guessed that it would be so much faster than Coppersmith’s method, that, for 512-bit keys, it would more than make up for that.

To understand why, we need to look at how the attack works. The paper describes a slightly different scenario. It assumes we know the factors have the form $x + 2^m c$. This corresponds to knowing the most significant bits of the binary representation of a factor. The described attack works without modification for factors of the form $x + Mc$ as it doesn’t make use of the fact that $2^m$ is a power of two.

If we assume $P = x + M c$ and $Q = y + M d$ with we get \begin{align} N &= xy + dxM + cyM + cdM^2. \end{align} We also assume that $0 \le x \le M$ and $0 \le y \le M$.

Let $t = N - cdM^2$, a constant we know, and we get \begin{align} t &= xy + Mdx + Mcy. \end{align} The paper then presents a heuristic argument, which is roughly this: Because $xy$ is much smaller than $t$, $Mdx$ and $Mdy$ it is likely that replacing $xy$ with $s$ and searching for the solution minimizing $s$ in \begin{align} t &= s + Mdx + Mcy \end{align} results in a solution to the original equation and thereby in a factorization of $N$.

This is a two-dimensional integer programming instance, i.e. a set of integer linear constraints (the bounds for $x$ and $y$), an integer linear objective (minimize $s = t - Mdx + Mcy$) and two integer unknowns ($x$ and $y$). It is then noted that integer programming in a fixed number of dimensions can be solved in polynomial time.

The paper also mentions that a similar approach would work for knowing the
*least* significant bits of a factor.
This corresponds to $P = c + Mx$ and $Q = d + My$ with $0 \le x \le \sqrt{M}$
and $0 \le y \le \sqrt{M}$, which is exactly what we need.

In this case we get \begin{align} N &= cd + dxM + cyM + xyM^2 \\ t &= \frac{N - cd}{M} \\ t &= dx + cy + xyM. \end{align}

Again, we’d like to get rid of the $xy$ term, to make it a linear problem. I did this by working modulo $M$: \begin{align} t &\equiv dx + cy \pmod{M} \end{align}

Usually for RSA-keys we know an upper bound for $P$ and $Q$, which together with $N$ also translates to a lower bound. From this we can compute bounds for $x$ and $y$.

Here I noticed, that it is possible to find a solution using an approach more direct than integer programming. The solutions to $t \equiv dx + cy \pmod{M}$ form a two-dimensional affine lattice. To understand how we need to define lattices first.

Given a $n \times d$ matrix $B$ consisting of $d$ linear independent column vectors $B = (\mathbf b_0, \mathbf b_1, \ldots, \mathbf b_{d-1})$ the corresponding lattice $L$ is the set of integer linear combinations of these vectors: \begin{align} L = \{B \mathbf g \mid g \in \mathbb Z^d\} \end{align} As this set is closed under negation and addition, it forms a subgroup of $\mathbb R^n$. Luckily we are working in two dimensions, which makes it easy to visualize lattices:

The green dots are the lattice points and the red vectors are the basis vectors.

The basis vectors for a lattice are not unique, adding an integer multiple of one basis vector to another generates the same lattice. This is easy to see, as you can get the original lattice vector by subtracting the same multiple again, so every integer linear combination of either basis is also an integer linear combination of the other. Negating a basis vector or exchanging the position of two vectors also doesn’t change the generated lattice. Performing an arbitrary number of those operations is equivalent to multiplying the basis $B$ with an unimodular matrix $U$, i.e. an integer matrix $U$ with $|\det U| = 1$. This makes sense as those matrices are exactly the integer matrices which have an integer inverse.

$\mathbf b_0, \mathbf b_1$ and $\mathbf b’_0, \mathbf b’_1$ define the same lattice: $\mathbf b’_0 = \mathbf b_0 + \mathbf b_1, \mathbf b’_1 = \mathbf b_1 - 2\mathbf b’_0$.

An affine lattice is a lattice with an offset. Given a basis $B$ and an offset vector $\mathbf o$ it consists of the lattice points $A = \{B \mathbf g + \mathbf o \mid g \in \mathbb Z^d \}$. This is not a group anymore, but adding an element of $L$ to $A$ gives another point in $A$ and the difference of two points in $A$ is in $L$.

I claimed that the solutions to $t \equiv dx + cy \pmod{M}$ form an affine lattice. Assume we have a single known solution $(x_0, y_0)$. It’s not hard to see that adding multiples of $M$ to $x_0$ or $y_0$ is still a valid solution. These solutions would form an affine lattice, using the basis vectors $(M, 0)$ and $(0, M)$, but that lattice would not contain all solutions. We know that $c$ and $d$ are coprime to $M$, otherwise $P$ or $Q$ would have a small factor. This means that we should have a solution $(x, \frac{t - dx}{c})$ for any value of $x$. Taking the difference of two solutions with consecutive $x$ gives us a basis vector $\mathbf b_0 = (1, \frac{-d}{c})$. Together with $\mathbf b_1 = (0, M)$ and $\mathbf o = (0, \frac{t}{c})$ this defines an affine lattice containing all solutions.

Given this affine lattice, we’re interested in the lattice points within the region defined by our bounds for $x$ and $y$. If we can find a lattice point closest to a given arbitrary point, we could compute the closest point to the center of that region. In general for arbitrary lattice dimensions that problem is NP-complete. Luckily for two dimensional lattices this is very efficient.

The difficulty in finding a closest lattice points stems from the fact that basis vectors can point in roughly the same or opposite direction. In fact for our affine lattice of solutions to $t \equiv dx + cy \pmod{M}$, the basis vectors we derived point in almost the same direction. Let’s, for a moment, assume the opposite would be the case and that the basis vectors are orthogonal. We could then just represent a point as a non-integer multiple of the basis vectors and individually round the multiples to the nearest integer. As moving along one basis vector direction doesn’t affect the closest multiple in the other directions, we would get the nearest point.

When the basis vectors aren’t exactly orthogonal but close, it is possible to bound the distance when approximating the nearest point by independent rounding in each basis vector direction. Consider the two-dimensional case: rounding in direction of $\mathbf b_0$ moves the point by $\mu_{0,1} = \frac{\sp{\mathbf b_0}{\mathbf b_1}}{\norm{b_1}^2}$ times the length of $\mathbf b_1$ in the direction of $\mathbf b_1$. The value of $\mu_{0,1}$ is the (signed) length of $\mathbf b_0$ projected onto $\mathbf b_1$, divided by the length of $\mathbf b_1$.

The orange vector is the projection of the blue vector onto the red vector. It is equal to $\mu$ times the red vector.

This is great because we can find an equivalent basis so that $|\mu_{0,1}|$ and $|\mu_{1,0}| \le \frac{1}{2}$. This is done using the Lagrange-Gauss algorithm, which finds the shortest basis for a two-dimensional lattice. It works similar to the Euclidean algorithm for computing the greatest common divisor of two numbers by repeatedly reducing one value by the other. Let $\lfloor x \rceil$ be the closest integer to $x$. If $|\mu_{0,1}| > \frac{1}{2}$ the vector $\mathbf b_1 - \lfloor \mu_{0,1} \rceil \mathbf b_0$ is shorter than $\mathbf b_1$ and can replace it. The same is true with exchanged basis vectors and $\mu_{1,0}$. Replacing one lattice vector with a shorter one like this can be iterated until neither $|\mu_{0,1}|$ nor $|\mu_{1,0}|$ are greater than $\frac{1}{2}$. For basis vectors with integer components the number of iterations needed grows logarithmically with the length of the basis vectors, i.e. linear with their bit size.

For such a two-dimensional basis, finding a close point by rounding the multiples of the basis vectors results in a very small distance bound. I haven’t computed the exact bound, but rounding introduces an offset of at most half a basis vector in each basis vector direction. This would introduce an error of at most a quarter basis vector each, which is enough to cross the midpoint between two multiples, but not enough to go further. In practice, for the lattices we’re interested in, the point found by rounding happens to also be the closest point.

With this we can find a solution to $t \equiv dx + cy \pmod{M}$ which is closest to the midpoint of the rectangle defined by the bounds for $x$ and $y$. Most of the time this solution is the unique solution within bounds that leads to a factoring of $N$ if such a solution exists. Sometimes, though, when one basis vector is particularly short, there are multiple solutions within bounds. Luckily it seems that this only happens when the other basis vector is long. This means that all solutions within bounds lie on a single line. In that case a solution can be efficiently found or shown to not exist by recursively bisecting the line and checking whether a point that factors $N$ can exist between the endpoints.

This gives us a complete algorithm to check a single candidate. Together with an optimized value for $M’$ and an outer loop that bruteforces through the $|R’|$ possible guesses, this allows me to break a ROCA-vulnerable 512-bit RSA key in less than $900$ seconds using a single thread on my laptop. As the outer loop can be trivially parallelized, breaking those keys on a more powerful server with many threads takes less than $30$ seconds. I’ve also looked at using this approach for 1024-bit keys, but a rough estimate put the runtime far above the runtime of the ROCA attack. For larger keys it is even worse, so I didn’t pursue that path.

## Source Code

I’ve decided to release the source code of my attack implementation. It’s implemented in C++ and uses GMP for most bignum arithmetic, except inside the lattice reduction where custom routines are used. It includes some low-level optimizations that I’ve glossed over, for example using floats to approximate bignums while keeping track of the accumulated error.

Feel free to contact me if you want to know more about specific parts or about the implementation. I’ll also be at the 34c3 and am happy to have a chat with anyone interested in this or related things.

- Rejection sampling also allows for non-uniform source and target distributions, but simplifies to the described algorithm for a uniform source distribution and a target distribution that is uniform among all values of a given property and zero otherwise.
^{[return]}