GH-48467: [C++][Parquet] Add estimated_buffered_stats() API for RowGroupWriter#49527
GH-48467: [C++][Parquet] Add estimated_buffered_stats() API for RowGroupWriter#49527wecharyu wants to merge 3 commits intoapache:mainfrom
Conversation
cpp/src/parquet/file_writer.cc
Outdated
| if (column_writers_[i]) { | ||
| estimated_buffered_value_bytes += | ||
| column_writers_[i]->estimated_buffered_value_bytes() + | ||
| column_writers_[i]->EstimatedBufferedLevelsBytes(); |
There was a problem hiding this comment.
I still have the question: do we need to consider buffered dictionary values?
There was a problem hiding this comment.
Instead of return a sum, should we return a struct like below:
struct BufferedStats {
int64_t def_level_bytes;
int64_t rep_level_bytes;
int64_t value_bytes;
int64_t dict_bytes;
};
cpp/src/parquet/file_writer.h
Outdated
| @@ -34,6 +34,13 @@ class ColumnWriter; | |||
| static constexpr uint8_t kParquetMagic[4] = {'P', 'A', 'R', '1'}; | |||
| static constexpr uint8_t kParquetEMagic[4] = {'P', 'A', 'R', 'E'}; | |||
|
|
|||
| struct PARQUET_EXPORT BufferedStats { | |||
There was a problem hiding this comment.
BufferedStats is a bit weird naming for me. AFAIK, if we append column batch by batch, before flushing the row-group, the pages are still "buffered". Can we choose a better name or add the comment for it?
There was a problem hiding this comment.
Agreed. We can add a comment to explain that this is for uncompressed values that do not belong to any page and buffered by page writers.
We can also make it an inner class of the RowGroupWriter class.
|
How about fixing the PR title and rebasing on the latest main to pass all CIs? |
6313725 to
63c2be2
Compare
Rationale for this change
Expose an API for buffered bytes of values and levels in RowGroupWriter, it's useful in deciding whether a new row group is needed.
Discussion in #48468 (comment)
What changes are included in this PR?
Add new API
total_buffered_bytes()inRowGroupWriter.Are these changes tested?
Test locally.
Are there any user-facing changes?
Yes, user could use the new API to implement their customized row group strategy.