Skip to content

GH-45664: [C++] Allow LargeString,LargeBinary,FixedSizeBinary,StringView and BinaryView for RecordBatch::MakeStatisticsArray() #46031

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

andishgar
Copy link
Contributor

@andishgar andishgar commented Apr 4, 2025

Rationale for this change

MakeStatisticsArray in RecordBatch does not support StringView,BinaryView,LargeString,LargeBinary,FixedSizeBinary type

What changes are included in this PR?

the correction of MakeStatisticsArray in RecordBatch to support all arrow string types and relevant tests. Additionally, some changes applied to MakeStatisticsArray in record_batch_test.cc

Are these changes tested?

Yes, I run all relevant unit tests

Are there any user-facing changes?

No.( Add support for large_utf8, large_binary,fixed_size_binary, StringView, and BinaryView to RecordBatch::MakeStatisticsArray()).

@andishgar andishgar force-pushed the c++-add-StringView-BinaryView-to-statistics-shcema branch from 5c5fd1b to c86cbd6 Compare April 4, 2025 11:24
@andishgar andishgar marked this pull request as draft April 4, 2025 11:25
@andishgar andishgar marked this pull request as ready for review April 4, 2025 12:28
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Apr 7, 2025
@kou kou changed the title GH-45664: [C++] Add StringView and BinaryView to ArrayStatistics GH-45664: [C++] Allow StringView and BinaryView for ArrayStatistics value types Apr 7, 2025
@kou
Copy link
Member

kou commented Apr 7, 2025

The PR title and associated issue focus on ArrayStatistics and StringView/BinaryView.
But the PR description and changes focus on RecordBatch::MakeStatisticsArray().
How about updating the PR title and associated issue to focus on RecordBatch::MakeStatisticsArray()?

The PR description says that this is a "Critical Fix". Why do you think that this is a "Critical Fix"?

@andishgar
Copy link
Contributor Author

The PR description says that this is a "Critical Fix". Why do you think that this is a "Critical Fix"?

"The issue with the initial implementation (the one in the main branch) is that RecordBatch::MakeStatisticsArray() cannot handle large_utf8 and large_binary. I’m referring to the following code in the main branch

TEST_F(TestRecordBatch, MakeStatisticsArrayString) {
  auto schema =
      ::arrow::schema({field("no-statistics", boolean()), field("string", large_utf8())});
  auto no_statistics_array = ArrayFromJSON(boolean(), "[true, false, true]");
  auto string_array_data = ArrayFromJSON(large_utf8(), "[\"a\", null, \"c\"]")->data()->Copy();
  string_array_data->statistics = std::make_shared<ArrayStatistics>();
  string_array_data->statistics->is_max_exact = true;
  string_array_data->statistics->max = "c";
  auto string_array = MakeArray(std::move(string_array_data));
  auto batch = RecordBatch::Make(schema, string_array->length(),
                                 {no_statistics_array, string_array});

  ASSERT_OK_AND_ASSIGN(auto statistics_array, batch->MakeStatisticsArray());
  ARROW_LOGGER_INFO("",statistics_array->ToString());
}

print a massage like

/home/arashandishgar/Desktop/arrow/cpp/src/arrow/record_batch_test.cc:1440: -- is_valid: all not null
Running main() from /home/arashandishgar/Desktop/arrow/cpp/cmake-build-ninja-debug-1/_deps/googletest-src/googletest/src/gtest_main.cc
-- child 0 type: int32
  [
    null,
    1
  ]
-- child 1 type: map<dictionary<values=string, indices=int32, ordered=0>, dense_union<int64: int64=0, large_utf8: large_string=1>>
<Invalid array: List child array invalid: Invalid: Struct child array #1 invalid: Invalid: Union child array #1 invalid: Invalid: Offsets buffer size (bytes): 12 isn't large enough for length: 1 and offset: 0
/home/arashandishgar/Desktop/arrow/cpp/src/arrow/array/validate.cc:597  ValidateOffsetsAndSizes(type, values.size())

/home/arashandishgar/Desktop/arrow/cpp/src/arrow/array/validate.cc:176  ValidateBinaryLike(type)
/home/arashandishgar/Desktop/arrow/cpp/src/arrow/array/validate.cc:277  ValidateListLike(type)>

@kou
Copy link
Member

kou commented Apr 7, 2025

OK. I think that it's just "not supported yet" not a bug.

@andishgar
Copy link
Contributor Author

OK. I think that it's just "not supported yet" not a bug.

I apologize for using 'critical fix' in my description—I’ve updated it. (I used it because I thought it tied to this: (b) a bug causing incorrect or invalid data,)

@andishgar andishgar changed the title GH-45664: [C++] Allow StringView and BinaryView for ArrayStatistics value types GH-45664: [C++] Allow LargeString,LargeBinary,FixedSizeBinary,StringView and BinaryView for RecordBatch::MakeStatisticsArray() Apr 7, 2025
@andishgar andishgar force-pushed the c++-add-StringView-BinaryView-to-statistics-shcema branch from c86cbd6 to 8e9e748 Compare April 8, 2025 11:34
@andishgar andishgar marked this pull request as draft April 8, 2025 11:34
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 8, 2025
…rray

RecordBatch::MakeStatisticsArray  produces only utf8() and binary string types, MakeStatisticsArray in record_batch_test.cc  produces only utf8() types.  I change it in this  manner to support all string types(Binary,String,LargeBinary,LargeString,BinaryView,StringView)

include  "arrow/array/statistics.h"

include type_traits
the following thing are done in commits
add support for fixed_size_binary with relevant tests
resolve issues regarding, template , null data type and naming
@andishgar andishgar force-pushed the c++-add-StringView-BinaryView-to-statistics-shcema branch from 8e9e748 to c71c88f Compare April 8, 2025 12:32
@andishgar andishgar requested a review from kou April 8, 2025 13:26
@andishgar andishgar marked this pull request as ready for review April 8, 2025 13:26
@andishgar andishgar force-pushed the c++-add-StringView-BinaryView-to-statistics-shcema branch from 7c1645a to 5084eb6 Compare April 8, 2025 18:08
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants