PCI and SHA-1
In jPTS’ user interface, we need a way to click on a given card and show all transactions for that card over a specified period of time. Because all sensitive data is stored using AES-256
in the secureData
column of the tl_capture
table using the CryptoService—a column that can be further encrypted at the database level—we need a lightweight, database index-friendly, one-way consistent hash of the primary account number.
Because computing a hash these days is an extremely fast operation, knowing the last four digits of a 16-digit PAN and the hash is enough information to brute-force the full PAN in a matter of milliseconds. Therefore, jPTS uses a dynamic per-BIN seed that incorporates several layers of defense.
Here is a screenshot of the UI page that requires this feature. When you click on a specific card, the transactions for that card alone are displayed.
The 128-bit SHA-1 algorithm, the same as other hash algorithms can be brute forced to produce a collision under certain circumstances. With today’s CPUs, finding a collision would take approximately 2600 years, a time that can be reduced to about 100 years using a cluster of GPUs. In 2017, the first collision for SHA-1 1 was found. Further research has provided additional cryptanalysis strategies that could theoretically reduce this time to approximately 24 days. (See also 2345).
Understanding Collisions
Imagine you have a PDF file that has the SHA-1 hash:
845cfdca0ba5951580f994250cfda4fee34a1b58
Using the techniques mentioned above, it is theoretically feasible to find another file different from the original one that has the same hash, though this requires significant computing effort. However, this scenario applies to large files, not to small 16 to 19 digit Primary Account Numbers (PANs).
When dealing with a limited number of digits, especially where the last digit must satisfy a modulus 10 condition (LUHN)6, finding a collision is not possible as there are no collissions. Studies have shown that the minimum size for a collision is approximately 422 bytes7. This refers to full 8-bit bytes, not just a restricted set of bytes within the ASCII range of 0x30 to 0x39 (representing digits 0 through 9).
Going to an extreme, imagine you have just one digit, from 0 to 1:
# | SHA-1 hash |
---|---|
0 | b6589fc6ab0dc82cf12099d1c2d40ab994e8410c |
1 | 356a192b7913b04c54574d18c28d46e6395428ab |
2 | da4b9237bacccdf19c0760cab7aec4a8359010b0 |
3 | 77de68daecd823babbb58edb1c8e14d7106e83bb |
4 | 1b6453892473a467d07372d45eb05abc2031647a |
5 | ac3478d69a3c81fa62e60f5c3696165a4e5e6ac4 |
6 | c1dfd96eea8cc2b62785275bca38ac261256e278 |
7 | 902ba3cda1883801594b6e1b452790cc53948fda |
8 | fe5dbbcea5ce7e2988b8c69bcfdfde8904aabc1f |
9 | 0ade7c2cf97f75d009975f4d720d1fa6c19f4897 |
You can see there are no collisions. If we add two digits, there are no collisions either. With three digits, still none. Even with 16 digits, none. At 19 digits, none. And with 100 digits, still none. The shortest collision theoretically possible is around 422 bytes, but we're talking about full 8-bit bytes here, not just limited to the scope of 0x30 to 0x39 values, which also need to honor the LUHN algorithm.
SHA-1 in the wild
Git, the most widely used source control system in the world, uses SHA-1. In 2017, Linus Torvalds commented on this:
I doubt the sky is falling for Git as a source control management tool.8
And indeed, the sky was not falling. Fast forward to 2024, and Git still uses SHA-1. While it makes sense for Git to consider moving to another algorithm because finding a collision is theoretically feasible (though quite more difficult than in a PDF due to Git internal indexing constraints), this is just not possible in the jPTS case where we deal with limited PAN numbers.
Understanding Hash Functions
In Java applications, all objects have an internal hashCode
that is used in collections, maps, sets, caches, and more. The hashCode
is typically quite simple, often auto-generated by IDEs, and can result in many collisions. Collisions are not a problem if they are properly handled. Despite the potential for collisions, this simple hash is very useful for placing objects into different buckets for indexing purposes. The popular and reliable Amazon DynamoDB, for example, uses MD5 (a hashing algorithm more prone to collisions than SHA-1) to derive partition keys. This does not mean that Amazon DynamoDB is insecure.
SHA-1 is designed to be collision-resistant, meaning it should be computationally infeasible to find two different inputs that produce the same hash output. Although SHA-1 has been shown to have vulnerabilities, leading to the discovery of practical collision attacks, these attacks require a large amount of computational power and typically involve much larger and more complex inputs than a simple 16-19 digit number.
The space of possible PANs, even considering 16-19 digits and a final LUHN check digit, is relatively small compared to the space of possible SHA-1 hash outputs (2^160 possible outputs). This means that the likelihood of any two distinct PANs producing the same SHA-1, as mentioned above, is just not possible.
Worst Case Scenario
Imagine that all of this is wrong and SHA-1 is totally broken, allowing a malicious user to easily find a PAN collision. In such a scenario, an operator might see a screen of transactions with more than one card. While this situation is highly improbable, as described above, if all our theories were wrong, this would be the outcome. It would not be a major issue or a security problem. If such a situation were to occur, implementing compensating controls to detect duplicates would be quite straightforward, and the case could be presented at security conferences.
HMAC SHA-256
For the reasons explained above, we believe SHA-1 is a good trade-off between security, performance, and database/storage use. That said, there are situations where and audit can go South due to a cognitive mismatch with the QSA/CISO. In those cases, using HMAC SHA-256 can be a solution.
SHA-3 (FIPS 202)
We could use SHA-3 for this use case and avoid the need for this explanation, but the problem is that SHA-3 is overkill for such a simple use case. The previous sample hash would be quite longer:
5b6791cfd813470437bbea989ecfbd006571d89eeebe809fdb8e742b67e62a6e1a256eb57df90353065aa31e999724e0
and would put unnecessary pressure on the database and the indexes would be larger without any real benefit. Some QSAs have suggested using a truncated version of SHA-3 to meet the requirement of using SHA-3 while still maintaining a lightweight index.
However, we reject this approach because it would involve inventing our own cryptographic method. We do not have sufficient expertise in cryptography and mathematics (and obviously neither does the QSA) to determine whether a truncated SHA-3 would be less prone to collisions than SHA-1. What we do know is that creating cryptographic algorithms without a deep understanding is a recipe for problems.
Prove me wrong—how hard could it be?
Show me two PANs with the same BIN (since we seed by BIN), a proper LUHN and the same SHA-1 hash and I'll hire you for our next audit.
Update
I had a great conversation with my friend and QSA @dbergert about this post and he provided excellent pointers to the latest requirements as well as his expert opinion on this issue.
NIST will retire SHA-1 by 20309 and since 2017 it is not permitted for authentication, which is not the case.
He agrees that collisions are not a concern here. However, the valid issue he raises is that simply hashing the PAN makes it vulnerable to brute force attacks. This is not the case here, and in fact, the same concern applies to SHA-3. Due to hardware acceleration, SHA-3 can be even faster than SHA-1. Therefore, requesting a switch from SHA-1 to SHA-3 is somewhat naïve, as this is not what the PCI standard requires. He pointed me to this PCI FAQ: What is the Council's guidance on the use of SHA-1?. Here are some highlights:
The continued use of SHA-1 is permitted in only in the following scenarios:
- Within top-level certificates (the Root Certificate Authority), which is the trust anchor for the hierarchy of certificates used.
- Authentication of initial code on ROM that initiates upon device startup (note: all subsequent code must be authenticated using SHA-2).
- In conjunction with the generation of HMAC values and surrogate PANs (with salt), for deriving keys using key derivation functions (i.e., KDFs) and random number generation.
We "almost" land in case number 3 as the secure seed is equivalent to HMAC in this case, BUT, Dave has a very valid point:
I'm guessing that the PCI council is movig to HMAC's because those are well defined vs the myriad of ways a salted hash could be done, and security around the "salt".
That's a very reasonable concern, and I understand it. Our approach is just one of the many ways this can be implemented. Migrating to HMAC instead makes complete sense, regardless of the use of SHA-1 or SHA-3.
Footnotes
-
Stevens, Marc, et al. "The first collision for SHA-1." IACR Cryptology ePrint Archive, vol. 2017, p. 190, 2017. ↩
-
Stevens, Marc, et al. "Chosen-prefix collision attacks on SHA-1 and applications." Journal of Cryptology, vol. 33, no. 4, pp. 2291-2326, 2020, Springer. ↩
-
Wang, Xiaoyun, Yiqun Lisa Yin, and Hongbo Yu. "Breaking SHA-1 in the real world." Annual International Cryptology Conference, pp. 89-110, 2005, Springer. ↩
-
Wang, Xiaoyun, and Hongbo Yu. "Finding Collisions in the Full SHA-1." Annual International Cryptology Conference, pp. 17-36, 2005, Springer. ↩
-
Wang, Xiaoyun, et al. "Collisions for Hash Functions MD5, SHA-0 and SHA-1." IACR Cryptology ePrint Archive, vol. 2004, p. 199, 2004. ↩
-
LUHN algorithm named after IBM scientist Hans Peter Luhn that created it in 1954. ↩
-
https://www.nist.gov/news-events/news/2022/12/nist-retires-sha-1-cryptographic-algorithm ↩