Not Even Coppersmith’s Attack

Posted on December 23rd 2017, tagged algorithms, cryptography

Earlier this year, in October, a new widespread cryptography vulnerability was announced. The initial announcement didn’t contain details about the vulnerability or much detail on how to attack it (updated by now). It did state the affected systems though: RSA keys generated using smartcards and similar devices that use Infineon’s RSALib. The announcement came with obfuscated code that would check whether a public key is affected. Also, the name chosen by the researchers was a small hint on how to attack it: “Return of Coppersmith’s Attack”.

I decided to try and figure out the details before the conference paper describing them would be released. By the time the paper was released, I had reverse engineered the vulnerability and implemented my own attack, which did not use Coppersmith’s method at all. This post explains how I figured out what’s wrong with the affected RSA-keys and how I used that information to factor affected 512-bit RSA-keys.

Reversing the Vulnerability

I started looking at the vulnerability when a friend pointed me to a deobfuscated version of the detection code:

“So this is the core of the Infineon RSA fail key detector: https://marcan.st/paste/MOEoh2EH.txt - this is very interesting indeed (and a huge fail).”

@marcan42 on twitter

At that point the ROCA paper wasn’t published yet. Figuring out how these keys are generated and how to attack them seemed like a nice challenge.

The detection code gives a first hint on this. It takes the public modulus $N$ and reduces it modulo a set of small primes $\{p_0, p_1, p_2, \ldots, p_m\}$. For each prime $p_i$ it tests whether the remainder belongs to a set of allowed remainders $R_i$. If all remainders are in the corresponding set of allowed remainders, the key is flagged as vulnerable.

The first few tests are: \begin{align*} N \bmod 11 &\in \{1, 10\} \\ N \bmod 13 &\in \{1, 3, 4, 9, 10, 12\} \\ N \bmod 17 &\in \{1, 2, 4, 8, 9, 13, 15, 16\} \\ N \bmod 19 &\in \{1, 4, 5, 6, 7, 9, 11, 16, 17\} \\ N \bmod 37 &\in \{1, 10, 26\} \\ &\;\,\vdots \end{align*}
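The whole fingerprint test boils down to modular reductions and set lookups. A minimal Python sketch of the check, using only the five residue sets listed above (the function name is mine, not from the original detection code):

```python
# Sketch of the fingerprint test: a key is flagged as possibly vulnerable
# only if N falls into the allowed residue class modulo *every* small prime.
# Only the first five primes from the detection code are listed here.
ALLOWED = {
    11: {1, 10},
    13: {1, 3, 4, 9, 10, 12},
    17: {1, 2, 4, 8, 9, 13, 15, 16},
    19: {1, 4, 5, 6, 7, 9, 11, 16, 17},
    37: {1, 10, 26},
}

def looks_vulnerable(n):
    return all(n % p in residues for p, residues in ALLOWED.items())
```

A random modulus fails one of the tests with overwhelming probability, so a handful of reductions already gives a reliable fingerprint.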

This doesn’t look good.

If you take a random RSA key or large prime and reduce it modulo small primes, you expect to see all non-zero remainders evenly distributed. You can’t get a zero, as that would mean there is a small prime factor. For the full list of 17 small primes, only one of $\prod_i \frac{p_i - 1}{|R_i|} \approx 2^{27.8}$ possible keys has this property.

While an unintended loss of 27.8 bits of entropy sounds bad as is, I assumed that this is only a symptom of whatever went wrong when generating those keys. While it would be possible to generate an RSA key from a uniform distribution of keys like this, it would be slower and more complicated than the straightforward correct way. You’d also have to deliberately restrict the allowed remainders, which seemed unlikely.

To figure out the flaw in Infineon’s RSALib, let’s first look at properly generating RSA keys. [Disclaimer: Don’t use this blog post as a reference for implementing this.] The public modulus $N$ is the product of two large primes $P, Q$ of roughly equal bit size. You can constrain the bit size of $P, Q$ and $N$ by uniformly selecting primes $P$ and $Q$ from a suitable interval $I$.

The easiest way to uniformly select an element with a given property $T$ in an interval $I$ is rejection sampling:1 Uniformly select any element $x \in I$ (easy), check whether the property $T(x)$ holds (hopefully easy), restart if it doesn’t. The average number of iterations rejection sampling needs is inversely proportional to the probability of a random $x \in I$ having the property. The prime number theorem tells us that the probability of a random number smaller than $N$ being prime is $\frac{1}{\log N}$. For generating primes using rejection sampling this gets us a number of iterations that grows linearly with the bit size.
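As a concrete illustration, here is a minimal Python sketch of prime generation by rejection sampling, using Miller-Rabin as the property check (per the disclaimer above, a sketch and not production code):

```python
import random

def is_probable_prime(n, rounds=40):
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a witnesses that n is composite
    return True

def random_prime(lo, hi):
    """Uniformly sample a prime from [lo, hi) by rejection sampling:
    draw uniformly, keep the draw only if it is prime, retry otherwise."""
    while True:
        candidate = random.randrange(lo, hi)
        if is_probable_prime(candidate):
            return candidate
```

Because the candidate is drawn uniformly and the accept/reject decision depends only on the candidate itself, the accepted primes are uniform over all primes in the interval.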

This is already quite efficient, but can be optimized. A simple way to halve the expected number of required iterations is to sample a uniform odd number instead of any number within $I$. A further improvement would be to uniformly sample an odd number that is not divisible by three, but this already isn’t as straightforward anymore. It’s possible to continue like this by constructing and uniformly sampling more and more intricate subsets of $I$ that still contain all primes of $I$. But it is also getting harder to do this correctly, while the possible speedup is getting smaller and smaller.

This looks like a good place to screw up key generation, so let’s keep that in mind and look at the RSALib generated keys again. Assuming $P$ and $Q$ are generated independently and $N \bmod p_i \in R_i$, there must be an $R'_i$ so that $P \bmod p_i \in R'_i$ and $Q \bmod p_i \in R'_i$, i.e. $N$ can only be restricted modulo a small prime if $P$ and $Q$ also are. As any combination should be possible, we expect $R_i = \{ab \bmod p_i \mid a, b \in R'_i\}$.

Playing around with the numbers in the detection code quickly shows that multiplying any two numbers in $R_i$ always results in another number in $R_i \pmod{p_i}$. Together with $1 \in R_i$ for all $R_i$, this led me to assume $R'_i = R_i$. I didn’t rule out other possibilities in general, but for $R_0 = \{1, 10\}$ nothing else would work.

The next step was to identify what led to the specific sets $R_i$. We start with some observations: As $R_i$ doesn’t contain zero, it is a subset of $\zzm{p_i}$, the multiplicative group of integers modulo $p_i$. We also discovered that $R_i$ is closed under multiplication modulo $p_i$. This makes $R_i$ a subgroup of $\zzm{p_i}$. As $p_i$ is a prime, $\zzm{p_i}$ is a cyclic group, and thus $R_i$ is also a cyclic group. In particular this means there is a generator $a_i$, so that $R_i = \{a_i^k \mid k \in \mathbb Z \}$.
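This is easy to check computationally. A small Python sketch (my own helper code, not from RSALib) that recovers a generator $a_i$ for a given residue set:

```python
def generated_subgroup(a, p):
    """The cyclic subgroup of (Z/pZ)* generated by a, i.e. all powers
    of a modulo p. Assumes gcd(a, p) == 1."""
    elems, x = {1}, a % p
    while x != 1:
        elems.add(x)
        x = x * a % p
    return elems

def find_generator(residues, p):
    """Search for an element whose powers mod p are exactly `residues`,
    or return None if the set is not a cyclic subgroup of (Z/pZ)*."""
    for a in sorted(residues):
        if generated_subgroup(a, p) == residues:
            return a
    return None
```

Running this on the residue sets from the detection code confirms that each one is exactly the set of powers of some element.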

This is not exciting yet, but so far we only looked at what happens modulo individual small primes. I considered it much more likely that the RSALib code worked modulo the product $M$ of several small primes.

At that point I was thinking this: If, modulo a small prime $p_i$, all possible values are generated by an element $a_i$ that is not a generator for the whole group $\zzm{p_i}$, could it be that, modulo $M = \prod_i p_i$, all possible values are also generated by a single element $a$, so that $a_i = a \bmod p_i$?

This would be a bad idea, but we already know that someone went ahead with a bad idea, so concerning our hypothesis it’s a point in favor. So why is it a bad idea? Generating a prime candidate $P$ so that $P \bmod M$ is in $\zzm{M}$ sounds like a good idea. It would exclude all values that have a common factor with $M$, and thus cannot be prime, making our candidate more likely to be prime. So far that’s not a problem. The problem is sampling from $\zzm{M}$ by raising a single value $a$ to a random power. The $\zzm{p_i}$ are cyclic groups, as $p_i$ is prime, and thus they do have a generating element $b_i$. It’s just not the $a_i$ used. In general for composite $k$ the group $\zzm{k}$ is not cyclic, i.e. it is not generated by a single element. So whatever $a$ they used, it only generates a subgroup $R$ of $\zzm{M}$. Even worse, that subgroup $R$ would be a lot smaller than $R_0 \times R_1 \times \cdots \times R_m$, as that group again isn’t cyclic. The order of $R$ is given by $|R| = \lcm_i |R_i|$. This can be seen by considering that $a_i^k \equiv a_i^{k \bmod |R_i|} \pmod{p_i}$ and counting the possible combinations of $k \bmod |R_i|$ and $k \bmod |R_j|$.
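For the five primes listed earlier this is easy to see numerically. A quick sketch (residue sets copied from the detection code above):

```python
from functools import reduce
from math import gcd

# Residue sets R_i from the detection code, keyed by the small prime p_i.
R = {
    11: {1, 10},
    13: {1, 3, 4, 9, 10, 12},
    17: {1, 2, 4, 8, 9, 13, 15, 16},
    19: {1, 4, 5, 6, 7, 9, 11, 16, 17},
    37: {1, 10, 26},
}

orders = [len(r) for r in R.values()]
# Size of the direct product R_0 x R_1 x ... : what sampling independently
# modulo each small prime would give.
product = reduce(lambda a, b: a * b, orders)
# Size of the cyclic subgroup generated by a single element modulo the
# product of the primes: the lcm of the individual subgroup orders.
size_R = reduce(lambda a, b: a * b // gcd(a, b), orders)
```

Already for these five primes the cyclic subgroup (72 elements) is a factor of 36 smaller than the direct product (2592 elements), and the gap widens rapidly as more primes are added.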

At this point only one in $\frac{|\zzm{M}|}{|R|} \approx 2^{69.95}$ possible primes could be generated, but we haven’t validated our assumption yet.

Equipped with the test vectors that came with the original detection code, I searched for a matching generator $a$ modulo a subset of the small primes. I did this by combining all possible combinations of the $a_i$ using the Chinese remainder theorem (CRT). I started with a small subset of the small primes, as this was much faster and could falsify the hypothesis if no match was found. As soon as $65537$ appeared as a candidate I knew that my guesses were right. $65537 = 2^{16} + 1$ is a prime larger than our small primes, thus coprime to $M$, and would be a generator of $\zza{M}$, the additive group of integers modulo $M$, which is a cyclic group. Also, multiplication with $2^{16} + 1$ can be very fast, especially on 8 and 16-bit microcontrollers.
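The guess is easy to verify against the detection sets: modulo each of the small primes, the allowed residues should be exactly the powers of $65537$. A sketch (residue sets again copied from the detection code):

```python
def powers_mod(a, p):
    """The set of powers of a modulo p. Assumes gcd(a, p) == 1."""
    elems, x = {1}, a % p
    while x != 1:
        elems.add(x)
        x = x * a % p
    return elems

# Allowed residues from the detection code, keyed by the small prime.
DETECTED = {
    11: {1, 10},
    13: {1, 3, 4, 9, 10, 12},
    17: {1, 2, 4, 8, 9, 13, 15, 16},
    19: {1, 4, 5, 6, 7, 9, 11, 16, 17},
    37: {1, 10, 26},
}

# Modulo every listed prime, the allowed set is exactly <65537>.
assert all(powers_mod(65537, p) == r for p, r in DETECTED.items())
```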

Confusing the properties of $\zza{M}$ and $\zzm{M}$ could be an explanation of why someone inexperienced with cryptography wouldn’t see a problem with this approach. It does not explain why someone was allowed to go ahead with their own way of generating primes or why no one able to spot this mistake reviewed or audited this algorithm, especially given the intended applications.

We’re not quite done identifying the vulnerability yet. When looking at the set of small primes used, you can see that some primes are skipped. But what if they were only skipped in the deobfuscated and optimized detection code, because $a$ happened to be a generator for $\zzm{p_i}$? In fact, when marcan published the deobfuscated code he mentioned that he removed no-op tests. Even if $a$ is a generator for $\zzm{p_i}$ we shouldn’t discard $p_i$, as the cyclic subgroup $R$ generated by $a$ modulo $M$ is smaller than the product of the individual subgroups $R_i$.

Using the test vectors I verified that for 512-bit keys the set of small primes consists of all primes up to $167$. Recomputing the size of the cyclic subgroup $R$ of $\zzm{M}$ shows that only one in $\frac{|\zzm{M}|}{|R|} \approx 2^{154.89}$ possible primes can be generated. This loses more than half of the expected entropy.
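These numbers can be reproduced with a short computation. A sketch, assuming the generator is $65537$ and the small primes are all primes up to $167$:

```python
from math import gcd, log2

def order_mod(a, p):
    """Multiplicative order of a modulo p. Assumes gcd(a, p) == 1."""
    k, x = 1, a % p
    while x != 1:
        x, k = x * a % p, k + 1
    return k

# All primes up to 167 (trial division is fine at this size).
primes = [p for p in range(2, 168)
          if all(p % q for q in range(2, p))]

phi = 1        # |(Z/MZ)*| = product of (p - 1) over the small primes
for p in primes:
    phi *= p - 1

size_R = 1     # |R| = order of 65537 mod M = lcm of the orders mod each p
for p in primes:
    o = order_mod(65537, p)
    size_R = size_R * o // gcd(size_R, o)

lost_bits = log2(phi / size_R)   # log of the fraction of reachable residues
```

With the stated assumptions this reproduces the roughly $2^{61}$ reachable exponents and the $\approx 155$-bit gap quoted above.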

The Attack

Having so much information about the private key can be enough to very quickly factor it. Ignoring the kind of information we have and just counting the bits of entropy, it could be possible to efficiently factor the key using variants of Coppersmith’s method. The CA in ROCA stands for Coppersmith’s attack, but a straightforward application isn’t possible. While the entropy of the information we gained from this vulnerability is enough, it doesn’t have the right form.

Coppersmith’s method is applicable if we know that a factor has the form $c + kx$ for fixed $c$ and $k$, and $|x| < N^{\frac{1}{4}}$. This is the case when we know consecutive bits of the binary representation or know the factor modulo any number of a suitable size. In our case we only know that the factors have the form $(a^i \bmod M) + Mx$ for small $i$ and $x$. If we could afford to just bruteforce all possible values for $i$, we could apply Coppersmith’s method, assuming $|x| < N^{\frac{1}{4}}$ holds.

There are $|R| \approx 2^{61.09}$ possible values for $i$. So at first, this looks too expensive. On the other hand, in this case $|x| < \frac{2^{257}}{M} \approx 2^{37.81} \ll N^{\frac{1}{4}}$, so maybe it is possible to find some trade-off.

We are looking for a way to make $|R|$ and $M$ smaller, without making $M$ too small. Luckily this is easy: we can just ignore some of the small primes $p_i$. This results in a smaller $M'$, just the product of the remaining primes $\prod_i p'_i$, and a smaller $R' \subset \zzm{M'}$. As $|R'|$ depends on the common factors of the $|R'_i|$, it can be a bit difficult to find an optimal trade-off.
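To see the effect, here is a small sketch of how dropping a single prime changes the bruteforce cost $|R'|$ (using a toy subset of the primes, not the optimized $M'$ from my attack):

```python
from functools import reduce
from math import gcd

def order_mod(a, p):
    """Multiplicative order of a modulo p. Assumes gcd(a, p) == 1."""
    k, x = 1, a % p
    while x != 1:
        x, k = x * a % p, k + 1
    return k

def bruteforce_cost(primes, a=65537):
    """|R'| = order of a modulo the product of `primes`,
    i.e. the lcm of the orders modulo each prime."""
    orders = (order_mod(a, p) for p in primes)
    return reduce(lambda l, o: l * o // gcd(l, o), orders, 1)

# Dropping 19 shrinks M' by a factor of 19, but shrinks the number of
# exponents to guess by a factor of 3 (from 72 down to 24):
cost_all = bruteforce_cost([11, 13, 17, 19, 37])
cost_without_19 = bruteforce_cost([11, 13, 17, 37])
```

Because $|R'|$ is an lcm rather than a product, removing the right prime can cut the guessing work disproportionately while barely shrinking $M'$, which is exactly the trade-off space to search.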

I implemented this attack using the Coppersmith implementation of PARI/GP, but no matter what trade-off I chose, my estimated runtime was much higher than the published one. As this is the attack described in the ROCA paper, in retrospect, I think the Coppersmith implementation I chose was just not optimized enough for this use case. In addition to that I might have missed the optimal choice for $M'$, but even a single invocation of Coppersmith’s method was much slower for me.

This prompted me to try different approaches. I hit many dead ends until I came across an older but interesting attack for factoring RSA keys with partial knowledge of a factor. The attack was published in 1986 by Rivest and Shamir, the R and S in RSA. The paper is called “Efficient Factoring Based on Partial Information”.

Compared to Coppersmith’s method it has the downside of needing not just half the bits of a factor, i.e. $|x| < N^{\frac{1}{4}}$, but two thirds, i.e. $|x| < N^{\frac{1}{6}}$. This might seem bad at first, but as $M'$ grows faster than $|R'|$, the bruteforcing work doesn’t increase as much as going from $N^{\frac{1}{4}}$ to $N^{\frac{1}{6}}$ might suggest. Also, I guessed that it would be so much faster than Coppersmith’s method that, for 512-bit keys, it would more than make up for that.

To understand why, we need to look at how the attack works. The paper describes a slightly different scenario. It assumes we know the factors have the form $x + 2^m c$. This corresponds to knowing the most significant bits of the binary representation of a factor. The described attack works without modification for factors of the form $x + Mc$ as it doesn’t make use of the fact that $2^m$ is a power of two.

If we assume $P = x + Mc$ and $Q = y + Md$ with $0 \le x \le M$ and $0 \le y \le M$, we get \begin{align*} N &= xy + dxM + cyM + cdM^2. \end{align*}

Let $t = N - cdM^2$, a constant we know, and we get \begin{align*} t &= xy + Mdx + Mcy. \end{align*} The paper then presents a heuristic argument, which is roughly this: Because $xy$ is much smaller than $t$, $Mdx$ and $Mcy$, it is likely that replacing $xy$ with $s$ and searching for the solution minimizing $s$ in \begin{align*} t &= s + Mdx + Mcy \end{align*} results in a solution to the original equation and thereby in a factorization of $N$.

This is a two-dimensional integer programming instance, i.e. a set of integer linear constraints (the bounds for $x$ and $y$), an integer linear objective (minimize $s = t - Mdx - Mcy$) and two integer unknowns ($x$ and $y$). It is then noted that integer programming in a fixed number of dimensions can be solved in polynomial time.

The paper also mentions that a similar approach would work for knowing the least significant bits of a factor. This corresponds to $P = c + Mx$ and $Q = d + My$ with $0 \le x \le \sqrt{M}$ and $0 \le y \le \sqrt{M}$, which is exactly what we need.

In this case we get \begin{align*} N &= cd + dxM + cyM + xyM^2 \\ t &= \frac{N - cd}{M} \\ t &= dx + cy + xyM. \end{align*}

Again, we’d like to get rid of the $xy$ term, to make it a linear problem. I did this by working modulo $M$: \begin{align*} t &\equiv dx + cy \pmod{M} \end{align*}
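A tiny numeric sanity check of these equations (toy values; the small $M$ here is just a stand-in for the real product of small primes):

```python
M = 2 * 3 * 5 * 7        # toy modulus standing in for the product of primes
P, Q = 223, 347          # toy primes, coprime to M
c, x = P % M, P // M     # P = c + M*x
d, y = Q % M, Q // M     # Q = d + M*y
N = P * Q

t = (N - c * d) // M     # N - c*d is divisible by M by construction
assert t == d * x + c * y + x * y * M     # the exact equation
assert t % M == (d * x + c * y) % M       # the linearized congruence
```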

Usually for RSA keys we know an upper bound for $P$ and $Q$, which together with $N$ also translates to a lower bound. From this we can compute bounds for $x$ and $y$.

Here I noticed that it is possible to find a solution using an approach more direct than integer programming. The solutions to $t \equiv dx + cy \pmod{M}$ form a two-dimensional affine lattice. To understand how, we need to define lattices first.

Given an $n \times d$ matrix $B$ consisting of $d$ linearly independent column vectors $B = (\mathbf b_0, \mathbf b_1, \ldots, \mathbf b_{d-1})$, the corresponding lattice $L$ is the set of integer linear combinations of these vectors: \begin{align*} L = \{B \mathbf g \mid \mathbf g \in \mathbb Z^d\} \end{align*} As this set is closed under negation and addition, it forms a subgroup of $\mathbb R^n$. Luckily we are working in two dimensions, which makes it easy to visualize lattices:

Two-dimensional lattice example: the green dots are the lattice points and the red vectors are the basis vectors.

The basis vectors for a lattice are not unique; adding an integer multiple of one basis vector to another generates the same lattice. This is easy to see, as you can get the original basis vector back by subtracting the same multiple again, so every integer linear combination of either basis is also an integer linear combination of the other. Negating a basis vector or exchanging the position of two vectors also doesn’t change the generated lattice. Performing an arbitrary number of those operations is equivalent to multiplying the basis $B$ with a unimodular matrix $U$, i.e. an integer matrix $U$ with $\lvert\det U\rvert = 1$. This makes sense, as those matrices are exactly the integer matrices which have an integer inverse.

Two equivalent bases

$\mathbf b_0, \mathbf b_1$ and $\mathbf b'_0, \mathbf b'_1$ define the same lattice: $\mathbf b'_0 = \mathbf b_0 + \mathbf b_1$, $\mathbf b'_1 = \mathbf b_1 - 2\mathbf b'_0$.

An affine lattice is a lattice with an offset. Given a basis $B$ and an offset vector $\mathbf o$, it consists of the points $A = \{B \mathbf g + \mathbf o \mid \mathbf g \in \mathbb Z^d \}$. This is not a group anymore, but adding an element of $L$ to $A$ gives another point in $A$, and the difference of two points in $A$ is in $L$.

I claimed that the solutions to $t \equiv dx + cy \pmod{M}$ form an affine lattice. Assume we have a single known solution $(x_0, y_0)$. It’s not hard to see that adding multiples of $M$ to $x_0$ or $y_0$ still gives a valid solution. These solutions would form an affine lattice, using the basis vectors $(M, 0)$ and $(0, M)$, but that lattice would not contain all solutions. We know that $c$ and $d$ are coprime to $M$, otherwise $P$ or $Q$ would have a small factor. This means that we have a solution $(x, \frac{t - dx}{c})$ for any value of $x$, where the division is performed modulo $M$. Taking the difference of two solutions with consecutive $x$ gives us a basis vector $\mathbf b_0 = (1, \frac{-d}{c})$. Together with $\mathbf b_1 = (0, M)$ and $\mathbf o = (0, \frac{t}{c})$ this defines an affine lattice containing all solutions.
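Interpreting the fractions as multiplications by the inverse of $c$ modulo $M$, this basis can be written down directly. A Python sketch (the function name is mine):

```python
def solution_lattice(t, c, d, M):
    """Basis and offset of the affine lattice of all (x, y) with
    t = d*x + c*y (mod M). Assumes gcd(c, M) == 1."""
    c_inv = pow(c, -1, M)        # modular inverse (Python 3.8+)
    b0 = (1, -d * c_inv % M)     # difference of consecutive-x solutions
    b1 = (0, M)                  # y is only determined modulo M
    offset = (0, t * c_inv % M)  # the solution with x = 0
    return b0, b1, offset
```

Every integer combination $g_0 \mathbf b_0 + g_1 \mathbf b_1 + \mathbf o$ then satisfies the congruence.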

Given this affine lattice, we’re interested in the lattice points within the region defined by our bounds for $x$ and $y$. If we can find the lattice point closest to an arbitrary given point, we can compute the closest point to the center of that region. In general, for arbitrary lattice dimensions, that problem is NP-hard. Luckily, for two-dimensional lattices it can be solved very efficiently.

The difficulty in finding a closest lattice point stems from the fact that basis vectors can point in roughly the same or opposite directions. In fact, for our affine lattice of solutions to $t \equiv dx + cy \pmod{M}$, the basis vectors we derived point in almost the same direction. Let’s assume for a moment that the opposite is the case and the basis vectors are orthogonal. We could then just represent a point as a non-integer multiple of the basis vectors and individually round the multiples to the nearest integer. As moving along one basis vector direction doesn’t affect the closest multiple in the other directions, we would get the nearest point.

When the basis vectors aren’t exactly orthogonal but close to it, it is possible to bound the distance when approximating the nearest point by independent rounding in each basis vector direction. Consider the two-dimensional case: rounding in the direction of $\mathbf b_0$ moves the point by $\mu_{0,1} = \frac{\sp{\mathbf b_0}{\mathbf b_1}}{\norm{b_1}^2}$ times the length of $\mathbf b_1$ in the direction of $\mathbf b_1$. The value of $\mu_{0,1}$ is the (signed) length of $\mathbf b_0$ projected onto $\mathbf b_1$, divided by the length of $\mathbf b_1$.

Projection example: the orange vector is the projection of the blue vector onto the red vector. It is equal to $\mu$ times the red vector.

This is great because we can find an equivalent basis so that $|\mu_{0,1}| \le \frac{1}{2}$ and $|\mu_{1,0}| \le \frac{1}{2}$. This is done using the Lagrange-Gauss algorithm, which finds the shortest basis for a two-dimensional lattice. It works similarly to the Euclidean algorithm for computing the greatest common divisor of two numbers, by repeatedly reducing one value by the other. Let $\lfloor x \rceil$ be the closest integer to $x$. If $|\mu_{0,1}| > \frac{1}{2}$, the vector $\mathbf b_1 - \lfloor \mu_{0,1} \rceil \mathbf b_0$ is shorter than $\mathbf b_1$ and can replace it. The same is true with exchanged basis vectors and $\mu_{1,0}$. Replacing one basis vector with a shorter one like this can be iterated until neither $|\mu_{0,1}|$ nor $|\mu_{1,0}|$ is greater than $\frac{1}{2}$. For basis vectors with integer components the number of iterations needed grows logarithmically with the length of the basis vectors, i.e. linearly with their bit size.
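A compact Python sketch of Lagrange-Gauss reduction for integer basis vectors (my own formulation; it uses exact integer arithmetic for the rounding, so it also works for bignum-sized bases):

```python
def lagrange_gauss(b0, b1):
    """Reduce a 2D integer lattice basis until both projection
    coefficients are at most 1/2 in absolute value."""
    def dot(u, v):
        return u[0] * v[0] + u[1] * v[1]

    if dot(b0, b0) > dot(b1, b1):       # keep b0 as the shorter vector
        b0, b1 = b1, b0
    while True:
        # round(dot(b0, b1) / dot(b0, b0)) in exact integer arithmetic
        n, d = dot(b0, b1), dot(b0, b0)
        mu = (2 * n + d) // (2 * d)
        b1 = (b1[0] - mu * b0[0], b1[1] - mu * b0[1])
        if dot(b0, b0) <= dot(b1, b1):  # b1 couldn't get shorter: reduced
            return b0, b1
        b0, b1 = b1, b0
```

Each iteration strictly shrinks the shorter vector, mirroring the Euclidean algorithm, so termination is guaranteed.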

For such a two-dimensional basis, finding a close point by rounding the multiples of the basis vectors results in a very small distance bound. I haven’t computed the exact bound, but rounding introduces an offset of at most half a basis vector in each basis vector direction. This would introduce an error of at most a quarter basis vector each, which is enough to cross the midpoint between two multiples, but not enough to go further. In practice, for the lattices we’re interested in, the point found by rounding happens to also be the closest point.

With this we can find the solution to $t \equiv dx + cy \pmod{M}$ which is closest to the midpoint of the rectangle defined by the bounds for $x$ and $y$. Most of the time this solution is the unique solution within bounds that leads to a factorization of $N$, if such a solution exists. Sometimes, though, when one basis vector is particularly short, there are multiple solutions within bounds. Luckily it seems that this only happens when the other basis vector is long. This means that all solutions within bounds lie on a single line. In that case a solution can be efficiently found, or shown not to exist, by recursively bisecting the line and checking whether a point that factors $N$ can exist between the endpoints.
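Finding a nearby lattice point by rounding in the reduced basis can be sketched like this (plain floating-point division is fine for this toy scale; as mentioned below, my actual implementation has to track the error of float approximations of bignums):

```python
def round_to_lattice(b0, b1, offset, target):
    """Approximate the affine lattice point closest to `target` by writing
    target - offset in the basis (b0, b1) and rounding the coordinates."""
    tx, ty = target[0] - offset[0], target[1] - offset[1]
    det = b0[0] * b1[1] - b0[1] * b1[0]
    # Solve (tx, ty) = g0*b0 + g1*b1 by Cramer's rule, then round each
    # coefficient to the nearest integer.
    g0 = round((tx * b1[1] - ty * b1[0]) / det)
    g1 = round((b0[0] * ty - b0[1] * tx) / det)
    return (g0 * b0[0] + g1 * b1[0] + offset[0],
            g0 * b0[1] + g1 * b1[1] + offset[1])
```

For the lattices appearing in the attack the basis is Lagrange-Gauss reduced first, so the rounded point is, in practice, also the closest one.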

This gives us a complete algorithm to check a single candidate. Together with an optimized value for $M'$ and an outer loop that bruteforces through the $|R'|$ possible guesses, this allows me to break a ROCA-vulnerable 512-bit RSA key in less than $900$ seconds using a single thread on my laptop. As the outer loop can be trivially parallelized, breaking those keys on a more powerful server with many threads takes less than $30$ seconds. I’ve also looked at using this approach for 1024-bit keys, but a rough estimate put the runtime far above the runtime of the ROCA attack. For larger keys it is even worse, so I didn’t pursue that path.

Source Code

I’ve decided to release the source code of my attack implementation. It’s implemented in C++ and uses GMP for most bignum arithmetic, except inside the lattice reduction where custom routines are used. It includes some low-level optimizations that I’ve glossed over, for example using floats to approximate bignums while keeping track of the accumulated error.

Feel free to contact me if you want to know more about specific parts or about the implementation. I’ll also be at the 34c3 and am happy to have a chat with anyone interested in this or related things.

  1. Rejection sampling also allows for non-uniform source and target distributions, but simplifies to the described algorithm for a uniform source distribution and a target distribution that is uniform among all values with a given property and zero otherwise.↩︎