GH-49614: [C++] Fix silent truncation in base64_decode on invalid input#49660
GH-49614: [C++] Fix silent truncation in base64_decode on invalid input#49660Reranko05 wants to merge 3 commits intoapache:mainfrom
Conversation
|
|
cpp/src/arrow/vendored/base64.cpp
Outdated
|
|
||
| for (char c : encoded_string) { | ||
| if (!(is_base64(c) || c == '=')) { | ||
| return ""; |
There was a problem hiding this comment.
I’m not comfortable with "" as the error path here. It’s indistinguishable from a valid decode of empty input, so malformed input still fails silently. I’d prefer this API to fail explicitly (Result<std::string> / checked variant) and have Gandiva propagate that as an error.
Returning null would be slightly better than returning "", because at least it doesn’t collide with a valid decoded empty string. But I still don’t think it’s the right default behavior here as null still turns malformed input into a regular value rather than an explicit failure.
cpp/src/arrow/vendored/base64.cpp
Outdated
| std::string ret; | ||
|
|
||
| for (char c : encoded_string) { | ||
| if (!(is_base64(c) || c == '=')) { |
There was a problem hiding this comment.
This is absolutely insufficient and will not trip on input like abcd=AAA. Please do some research on best practices for sufficient and efficient base64 input validation.
cpp/src/arrow/util/string_test.cc
Outdated
| std::string input = "hello world!"; // invalid base64 | ||
| std::string output = arrow::util::base64_decode(input); | ||
|
|
||
| EXPECT_EQ(output, ""); |
There was a problem hiding this comment.
More tests! In our day and age with tools that we have this is not even bare minimum. Null input? Valid input? Non-ascii input?
Did you locate other tests? I'm not seeing any other tests for base64_decode in this file so where are they?
4670ec5 to
5c7db64
Compare
|
Thanks for the feedback. I’ve updated the implementation and tests.
All tests pass locally. Please let me know if any further adjustments are needed. |
5c7db64 to
8f053b7
Compare
kou
left a comment
There was a problem hiding this comment.
Could you use arrow::Result<std::string> return type instead of using ARROW_LOG()?
|
FYI: You can run CI on your fork by enabling GitHub Actions on your fork. |
Thanks for the suggestion @kou ! Just to clarify, would you prefer changing the existing base64_decode API to return arrow::Resultstd::string, or introducing a separate checked variant while keeping the current API unchanged? I want to make sure the approach aligns with existing usage and expectations. |
|
"changing the existing base64_decode API to return arrow::Resultstd::string". |
|
@kou I checked the current usages of Updating to I can proceed with the API change and update the affected call sites accordingly. |
|
Hi @kou, just following up on this. I can proceed with updating Happy to proceed based on your guidance. |
|
Oh, sorry. I forgot to reply this... Yes. Let's proceed with |
8f053b7 to
34a388c
Compare
|
Hi @kou, thanks for confirming! I’ve updated All tests are passing locally. Please let me know if you’d like any changes or adjustments. |
cpp/src/arrow/vendored/base64.cpp
Outdated
| "0123456789+/"; | ||
|
|
||
| auto is_base64 = [](unsigned char c) -> bool { | ||
| return (std::isalnum(c) || (c == '+') || (c == '/')); |
There was a problem hiding this comment.
Seeing this logic right below base64_chars definition is a bit strange
cpp/src/arrow/vendored/base64.cpp
Outdated
| return arrow::Status::Invalid("Invalid base64 input: length is not a multiple of 4"); | ||
| } | ||
|
|
||
| size_t padding_start = encoded_string.find('='); |
There was a problem hiding this comment.
I am not a fan of separate validation loop as it is a performance overhead.
Please validate and decode in a single pass.
There was a problem hiding this comment.
Thanks for pointing it out. I’ll integrate the validation into the decoding loop to avoid the extra pass.
There was a problem hiding this comment.
It seems that the current implementation still uses 2 passes.
|
|
||
| ARROW_EXPORT | ||
| std::string base64_decode(std::string_view s); | ||
| arrow::Result<std::string> base64_decode(std::string_view s); |
There was a problem hiding this comment.
No need to change the callers?
Perhaps a single integration test that would call a caller function with an invalid input to validate.
There was a problem hiding this comment.
Based on @kou's earlier suggestion, I proceeded with updating base64_decode() to return arrow::Result<std::string> and adjusted the existing usages accordingly.
cpp/src/arrow/util/base64.h
Outdated
|
|
||
| #include "arrow/util/visibility.h" | ||
| #include "arrow/result.h" | ||
| #include "arrow/status.h" |
There was a problem hiding this comment.
Can we remove this? I think that arrow/result.h includes arrow/status.h.
cpp/src/arrow/util/string_test.cc
Outdated
| ASSERT_TRUE(result.starts_with("0")) << result; | ||
| ASSERT_TRUE(result.rfind("0", 0) == 0) << result; | ||
| result = ToChars(0.25); | ||
| ASSERT_TRUE(result.starts_with("0.25")) << result; | ||
| ASSERT_TRUE(result.rfind("0.25", 0) == 0) << result; |
There was a problem hiding this comment.
You're right, this change is not directly related to the base64 fix. I updated it earlier due to compatibility issues with starts_with, but I’ll revert it to keep the scope of this PR focused.
cpp/src/arrow/util/string_test.cc
Outdated
| auto r1 = arrow::util::base64_decode("Zg=="); | ||
| ASSERT_TRUE(r1.ok()); |
There was a problem hiding this comment.
Could you use ASSERT_OK_AND_ASSIGN()?
| auto r1 = arrow::util::base64_decode("Zg=="); | |
| ASSERT_TRUE(r1.ok()); | |
| ASSERT_OK_AND_ASSIGN(auto string, arrow::util::base64_decode("Zg==")); |
There was a problem hiding this comment.
Good suggestion, thanks! I’ll update the tests to use ASSERT_OK_AND_ASSIGN.
cpp/src/arrow/util/string_test.cc
Outdated
| auto r1 = arrow::util::base64_decode("Zg=="); | ||
| ASSERT_TRUE(r1.ok()); | ||
| EXPECT_EQ(r1.ValueOrDie(), "f"); | ||
| auto r2 = arrow::util::base64_decode("Zm8="); |
There was a problem hiding this comment.
Can we reuse one variable instead of declaring multiple variables (r1, r2, ...)?
Or could you use more meaningful variable name for each case such as two_paddings, one_padding and no_padding?
There was a problem hiding this comment.
I’ll update the tests to use more meaningful variable names while also aligning with ASSERT_OK_AND_ASSIGN.
cpp/src/arrow/util/string_test.cc
Outdated
| auto r4 = arrow::util::base64_decode("aGVsbG8gd29ybGQ="); | ||
| ASSERT_TRUE(r4.ok()); | ||
| EXPECT_EQ(r4.ValueOrDie(), "hello world"); |
There was a problem hiding this comment.
Why do we need this case? What is the difference with other cases?
There was a problem hiding this comment.
Good point, this case doesn’t add new coverage beyond the existing ones. I’ll remove it to keep the tests focused.
cpp/src/arrow/util/string_test.cc
Outdated
| auto r1 = arrow::util::base64_decode("===="); | ||
| ASSERT_FALSE(r1.ok()); |
There was a problem hiding this comment.
It seems that this is an invalid padding case.
There was a problem hiding this comment.
You are right.. this case should fall under invalid padding I’ll move it accordingly.
cpp/src/arrow/vendored/base64.cpp
Outdated
| } | ||
|
|
||
| std::string base64_decode(std::string_view encoded_string) { | ||
| arrow::Result<std::string> base64_decode(std::string_view encoded_string) { |
There was a problem hiding this comment.
We can use Result here because this is in arrow::util namespace:
| arrow::Result<std::string> base64_decode(std::string_view encoded_string) { | |
| Result<std::string> base64_decode(std::string_view encoded_string) { |
There was a problem hiding this comment.
Good point, I’ll update this to use Result directly within the namespace.
cpp/src/arrow/vendored/base64.cpp
Outdated
| if (padding_start != std::string::npos) { | ||
| for (size_t k = padding_start; k < encoded_string.size(); ++k) { | ||
| if (encoded_string[k] != '=') { | ||
| return arrow::Status::Invalid("Invalid base64 input: padding character '=' found at invalid position"); |
There was a problem hiding this comment.
| return arrow::Status::Invalid("Invalid base64 input: padding character '=' found at invalid position"); | |
| return Status::Invalid("Invalid base64 input: padding character '=' found at invalid position"); |
cpp/src/arrow/vendored/base64.cpp
Outdated
| auto is_base64 = [](unsigned char c) -> bool { | ||
| return (std::isalnum(c) || (c == '+') || (c == '/')); | ||
| }; |
There was a problem hiding this comment.
The redefinition isn’t necessary. I’ll remove this and reuse the existing base64_chars for validation instead.
cpp/src/arrow/vendored/base64.cpp
Outdated
| size_t padding_start = encoded_string.find('='); | ||
| if (padding_start != std::string::npos) { | ||
| for (size_t k = padding_start; k < encoded_string.size(); ++k) { | ||
| if (encoded_string[k] != '=') { | ||
| return arrow::Status::Invalid("Invalid base64 input: padding character '=' found at invalid position"); | ||
| } | ||
| } | ||
|
|
||
| size_t padding_count = encoded_string.size() - padding_start; | ||
| if (padding_count > 2) { | ||
| return arrow::Status::Invalid("Invalid base64 input: too many padding characters"); | ||
| } | ||
| } | ||
|
|
||
| for (char c : encoded_string) { | ||
| if (c != '=' && !is_base64(c)) { | ||
| return arrow::Status::Invalid("Invalid base64 input: contains non-base64 character '" + std::string(1, c) + "'"); | ||
| } | ||
| } |
There was a problem hiding this comment.
These validations traverses the encoded_string multiple times, right? Does this have performance penalty for large input?
Can we optimize them?
There was a problem hiding this comment.
Pull request overview
This PR addresses a correctness issue in Arrow’s C++ base64 decoder by ensuring malformed base64 input is detected instead of producing silently truncated/partial output.
Changes:
- Adds pre-validation for base64 input (length, padding placement, invalid characters) in
base64_decode. - Changes
base64_decodeAPI to returnarrow::Result<std::string>withStatus::Invalidon malformed input. - Adds unit tests covering valid/invalid decoding cases (and adjusts a couple of
ToCharsassertions).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| cpp/src/arrow/vendored/base64.cpp | Adds base64 input validation and switches decode to return errors instead of partial output. |
| cpp/src/arrow/util/base64.h | Updates public API signature of base64_decode to return Result<std::string>. |
| cpp/src/arrow/util/string_test.cc | Adds tests for base64 decode validity/error cases and tweaks ToChars expectations. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| ARROW_EXPORT | ||
| std::string base64_decode(std::string_view s); | ||
| arrow::Result<std::string> base64_decode(std::string_view s); |
There was a problem hiding this comment.
base64_decode was changed from returning std::string to arrow::Result<std::string>, but there are existing call sites in the repo that still treat it as a std::string (e.g., constructing streams, assigning to std::string). As-is, this is an API/ABI breaking change and will not compile unless all callers are updated or a backwards-compatible overload/wrapper is kept.
| arrow::Result<std::string> base64_decode(std::string_view s); | |
| std::string base64_decode(std::string_view s); |
| std::string base64_encode(std::string_view s); | ||
|
|
||
| ARROW_EXPORT | ||
| std::string base64_decode(std::string_view s); | ||
| arrow::Result<std::string> base64_decode(std::string_view s); | ||
|
|
There was a problem hiding this comment.
This change conflicts with the PR description/user-facing behavior: the description says invalid input should return an empty string, but the new signature returns Result<std::string> and the implementation returns Status::Invalid(...) on malformed inputs. Please align the implementation/API and the stated behavior (either update the description and downstream expectations, or preserve the old std::string API that returns "" on invalid input).
| #include "arrow/util/base64.h" | ||
| #include "arrow/util/logging.h" | ||
| #include "arrow/result.h" | ||
| #include "arrow/status.h" | ||
| #include <iostream> | ||
| #include <cctype> |
There was a problem hiding this comment.
arrow/util/logging.h is newly included here but isn't used anywhere in this file. Please remove the unused include to avoid unnecessary dependencies and potential lint/Werror failures.
cpp/src/arrow/vendored/base64.cpp
Outdated
| static const std::string base64_chars = | ||
| "ABCDEFGHIJKLMNOPQRSTUVWXYZ" | ||
| "abcdefghijklmnopqrstuvwxyz" | ||
| "0123456789+/"; | ||
|
|
||
| auto is_base64 = [](unsigned char c) -> bool { | ||
| return (std::isalnum(c) || (c == '+') || (c == '/')); | ||
| }; | ||
|
|
There was a problem hiding this comment.
Inside base64_decode, base64_chars and is_base64 are redefined even though the same base64_chars and is_base64 already exist at file scope above. This introduces name shadowing (potentially tripping -Wshadow) and also leaves the file-scope is_base64 unused. Prefer reusing the existing helpers (or refactor to a single definition) to avoid warnings and duplication.
| static const std::string base64_chars = | |
| "ABCDEFGHIJKLMNOPQRSTUVWXYZ" | |
| "abcdefghijklmnopqrstuvwxyz" | |
| "0123456789+/"; | |
| auto is_base64 = [](unsigned char c) -> bool { | |
| return (std::isalnum(c) || (c == '+') || (c == '/')); | |
| }; |
cpp/src/arrow/vendored/base64.cpp
Outdated
| } | ||
|
|
||
| size_t padding_start = encoded_string.find('='); | ||
| if (padding_start != std::string::npos) { |
There was a problem hiding this comment.
padding_start is computed via encoded_string.find('=') on a std::string_view, but compared against std::string::npos. This works but is inconsistent and easy to misread; prefer comparing against std::string_view::npos when operating on a string_view.
| if (padding_start != std::string::npos) { | |
| if (padding_start != std::string_view::npos) { |
cpp/src/arrow/vendored/base64.cpp
Outdated
| for (char c : encoded_string) { | ||
| if (c != '=' && !is_base64(c)) { | ||
| return arrow::Status::Invalid("Invalid base64 input: contains non-base64 character '" + std::string(1, c) + "'"); | ||
| } | ||
| } |
There was a problem hiding this comment.
The invalid-character error message embeds the raw byte into the Status string (std::string(1, c)). For non-printable / non-UTF8 bytes (your tests include 0xFF), this can produce unreadable or problematic error messages/logging. Consider reporting the offending byte as an escaped/hex value (and/or the index) instead of inserting it verbatim.
| for (char c : encoded_string) { | |
| if (c != '=' && !is_base64(c)) { | |
| return arrow::Status::Invalid("Invalid base64 input: contains non-base64 character '" + std::string(1, c) + "'"); | |
| } | |
| } | |
| auto format_hex_byte = [](unsigned char c) -> std::string { | |
| static const char hex_digits[] = "0123456789ABCDEF"; | |
| std::string hex = "0x00"; | |
| hex[2] = hex_digits[(c >> 4) & 0x0F]; | |
| hex[3] = hex_digits[c & 0x0F]; | |
| return hex; | |
| }; | |
| for (size_t k = 0; k < encoded_string.size(); ++k) { | |
| unsigned char c = static_cast<unsigned char>(encoded_string[k]); | |
| if (c != '=' && !is_base64(c)) { | |
| return arrow::Status::Invalid( | |
| "Invalid base64 input: contains non-base64 byte " + format_hex_byte(c) + | |
| " at position " + std::to_string(k)); | |
| } | |
| } |
cpp/src/arrow/util/string_test.cc
Outdated
| } | ||
|
|
||
| TEST(Base64DecodeTest, NonAsciiInput) { | ||
| std::string input = std::string("abcd") + char(0xFF) + "=="; |
There was a problem hiding this comment.
NonAsciiInput is currently guaranteed to fail the new size % 4 == 0 validation (the constructed string length is 7), so it doesn't actually exercise the non-ASCII character validation path. Adjust the test input to a length that is a multiple of 4 so it fails specifically due to the non-base64 byte (e.g., keep padding rules valid and include the 0xFF byte).
| std::string input = std::string("abcd") + char(0xFF) + "=="; | |
| std::string input = std::string("abc") + static_cast<char>(0xFF); |
34a388c to
ed84348
Compare
|
Hi @kou, I have addressed all review comments:
All tests pass locally. |
|
Could you enable GitHub Actions on your fork to run CI on your fork too? |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
cpp/src/arrow/vendored/base64.cpp
Outdated
| while (in_len-- && encoded_string[in_] != '=') { | ||
| unsigned char c = encoded_string[in_]; | ||
|
|
||
| if (base64_chars.find(c) == std::string::npos) { | ||
| return Status::Invalid("Invalid base64 input: contains non-base64 byte at position " + std::to_string(in_)); | ||
| } |
There was a problem hiding this comment.
The validation loop can exit with i == 2 for inputs with == padding (e.g. "Zg=="). In the trailing partial-quantum handling later in this function, char_array_3[1] is computed using char_array_4[2] even when i == 2, which reads an uninitialized stack value (undefined behavior). Consider zero-initializing the remaining char_array_4 slots before computing char_array_3, or only computing the bytes that will actually be appended based on i.
| if (encoded_string.size() % 4 != 0) { | ||
| return Status::Invalid("Invalid base64 input: length is not a multiple of 4"); | ||
| } |
There was a problem hiding this comment.
PR description says invalid input should "return an empty string", but the implementation now returns Status::Invalid(...) (and the header signature is Result<std::string>). Either update the PR description/user-facing notes to reflect the new error-reporting API, or adjust the implementation to match the documented behavior (e.g. preserve the std::string API and return "" on invalid input).
cpp/src/arrow/util/string_test.cc
Outdated
| TEST(Base64DecodeTest, InvalidLength) { | ||
| ASSERT_RAISES(Invalid, arrow::util::base64_decode("abc")); | ||
| ASSERT_RAISES(Invalid, arrow::util::base64_decode("abcde")); | ||
| } | ||
|
|
||
| TEST(Base64DecodeTest, InvalidCharacters) { | ||
| ASSERT_RAISES(Invalid, arrow::util::base64_decode("ab$=")); | ||
| } |
There was a problem hiding this comment.
These tests assert that invalid input raises Invalid, but the PR description/user-facing notes say invalid base64 should return an empty string. Please align the tests with the intended public behavior (either update the implementation/headers to return "" on invalid input, or update the PR description to reflect the new error-returning API).
cpp/src/arrow/util/string_test.cc
Outdated
| ASSERT_OK_AND_ASSIGN(auto two_paddings, arrow::util::base64_decode("Zg==")); | ||
| EXPECT_EQ(two_paddings, "f"); | ||
|
|
||
| ASSERT_OK_AND_ASSIGN(auto one_padding, arrow::util::base64_decode("Zm8=")); | ||
| EXPECT_EQ(one_padding, "fo"); | ||
|
|
||
| ASSERT_OK_AND_ASSIGN(auto no_padding, arrow::util::base64_decode("Zm9v")); | ||
| EXPECT_EQ(no_padding, "foo"); | ||
|
|
||
| ASSERT_OK_AND_ASSIGN(auto single_char, arrow::util::base64_decode("TQ==")); | ||
| EXPECT_EQ(single_char, "M"); |
There was a problem hiding this comment.
It seems that f and M cases check the same pattern.
cpp/src/arrow/util/string_test.cc
Outdated
| ASSERT_RAISES(Invalid, arrow::util::base64_decode("abc")); | ||
| ASSERT_RAISES(Invalid, arrow::util::base64_decode("abcde")); |
There was a problem hiding this comment.
Ah, sorry. Could you use ASSERT_RAISES_WITH_MESSAGE() to check message too?
There was a problem hiding this comment.
Could you create base64_test.cc instead of reusing existing string_test.cc?
In general, we create XXX_test.cc for XXX.{cc,h}.
We can build base64_test.cc with the following CMakeLists.txt change:
diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt
index 4352716ebd..deb3e9e3fb 100644
--- a/cpp/src/arrow/util/CMakeLists.txt
+++ b/cpp/src/arrow/util/CMakeLists.txt
@@ -49,6 +49,7 @@ add_arrow_test(utility-test
SOURCES
align_util_test.cc
atfork_test.cc
+ base64_test.cc
byte_size_test.cc
byte_stream_split_test.cc
cache_test.cc
cpp/src/arrow/vendored/base64.cpp
Outdated
| return arrow::Status::Invalid("Invalid base64 input: length is not a multiple of 4"); | ||
| } | ||
|
|
||
| size_t padding_start = encoded_string.find('='); |
There was a problem hiding this comment.
It seems that the current implementation still uses 2 passes.
Sure, I'll do that. |
|
Hi @kou, I re-checked the usages of base64_decode() across the codebase. Besides tests, it is used in multiple modules: Flight: Gandiva: Parquet: These call sites currently expect a std::string (e.g., direct assignment or usage in std::stringstream). After updating the API to return arrow::Resultstd::string, CI is failing due to these usages. So the impact appears broader than I initially estimated and spans multiple components. Would you prefer that I:
|
Rationale for this change
arrow::util::base64_decodesilently truncates output when encountering invalid base64 characters, returning partial results without signaling an error. This can lead to unintended data corruption.What changes are included in this PR?
Are these changes tested?
Yes. A unit test has been added to verify that invalid input returns an empty string.
Are there any user-facing changes?
Yes. Previously, invalid base64 input could result in partial decoded output. Now, such inputs return an empty string.
This PR contains a "Critical Fix".
This change fixes a correctness issue where invalid base64 input could result in silently truncated output, leading to incorrect data being produced. The fix ensures such inputs are detected and handled safely.