Add nvtext substring deduplication API #18104

Draft: wants to merge 24 commits into base: branch-25.04
24 commits
51e83d6
Add nvtext substring deduplication API
davidwendt Feb 26, 2025
e275a9b
Merge branch 'branch-25.04' into dedup-substring
davidwendt Feb 26, 2025
e48bda9
Merge branch 'branch-25.04' into dedup-substring
davidwendt Feb 27, 2025
69f38a7
Merge branch 'branch-25.04' into dedup-substring
davidwendt Feb 27, 2025
76500ed
Merge branch 'branch-25.04' into dedup-substring
davidwendt Mar 4, 2025
5deeee5
Merge branch 'branch-25.04' into dedup-substring
davidwendt Mar 4, 2025
b2025fd
add sliced support; fix pytest
davidwendt Mar 4, 2025
915a290
fix pytest
davidwendt Mar 5, 2025
e3fdaab
fix merge conflicts
davidwendt Mar 5, 2025
e65ca23
Merge branch 'dedup-substring' of github.com:davidwendt/cudf into ded…
davidwendt Mar 5, 2025
2c9f675
add max-run-length edge case handling
davidwendt Mar 5, 2025
aeb4d6a
Merge branch 'branch-25.04' into dedup-substring
davidwendt Mar 5, 2025
4053735
Merge branch 'branch-25.04' into dedup-substring
davidwendt Mar 5, 2025
72b0784
Merge branch 'branch-25.04' into dedup-substring
davidwendt Mar 6, 2025
697082e
Merge branch 'dedup-substring' of github.com:davidwendt/cudf into ded…
davidwendt Mar 7, 2025
a02cf0d
Merge branch 'branch-25.04' into dedup-substring
davidwendt Mar 7, 2025
2d26ed9
Merge branch 'branch-25.04' into dedup-substring
davidwendt Mar 7, 2025
b1e44f2
use cub sort instead of thrust sort
davidwendt Mar 7, 2025
a5f721b
Merge branch 'branch-25.04' into dedup-substring
davidwendt Mar 10, 2025
031bd6d
add remove-if-safe
davidwendt Mar 11, 2025
c2c9d03
Merge branch 'branch-25.04' into dedup-substring
davidwendt Mar 12, 2025
c5ff64b
build_suffix_array
davidwendt Mar 12, 2025
b8ba990
Merge branch 'branch-25.04' into dedup-substring
davidwendt Mar 12, 2025
9b89188
fix cython interface
davidwendt Mar 12, 2025
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -743,6 +743,7 @@ add_library(
src/table/table.cpp
src/table/table_device_view.cu
src/table/table_view.cpp
src/text/dedup.cu
src/text/detokenize.cu
src/text/edit_distance.cu
src/text/generate_ngrams.cu
69 changes: 69 additions & 0 deletions cpp/include/nvtext/dedup.hpp
@@ -0,0 +1,69 @@
/*
* Copyright (c) 2025, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include <cudf/column/column.hpp>
#include <cudf/strings/strings_column_view.hpp>
#include <cudf/utilities/export.hpp>
#include <cudf/utilities/memory_resource.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>

//! NVText APIs
namespace CUDF_EXPORT nvtext {
/**
* @addtogroup nvtext_replace
* @{
* @file
*/

/**
* @brief Returns duplicate strings found in the given input
*
* The internal implementation creates a suffix array of the input which
* requires ~10x the input size for temporary memory.
*
* The output includes any strings of at least `min_width` bytes that
* appear more than once in the entire input.
*
* @param input Strings column to dedup
* @param min_width Minimum number of bytes that must match to identify a duplicate
* @param stream CUDA stream used for device memory operations and kernel launches
* @param mr Device memory resource used to allocate the returned column's device memory
* @return New strings column containing the duplicate strings found
*/
std::unique_ptr<cudf::column> substring_deduplicate(
cudf::strings_column_view const& input,
cudf::size_type min_width,
rmm::cuda_stream_view stream = cudf::get_default_stream(),
rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());

/**
* @brief Builds a suffix array for the input strings column
*
* @param input Strings column to build suffix array for
* @param stream CUDA stream used for device memory operations and kernel launches
* @param mr Device memory resource used to allocate the returned vector's device memory
* @return Sorted suffix array
*/
std::unique_ptr<rmm::device_uvector<int64_t>> build_suffix_array(
cudf::strings_column_view const& input,
rmm::cuda_stream_view stream = cudf::get_default_stream(),
rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());

/** @} */ // end of group
} // namespace CUDF_EXPORT nvtext
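For readers unfamiliar with the suffix-array approach mentioned in the doc comment, here is a minimal host-side sketch of the idea. It is not the implementation in this PR (that lives in src/text/dedup.cu and runs on the GPU); it assumes a single byte string rather than a strings column, and the names `build_suffix_array_host`, `common_prefix`, and `find_duplicates_host` are hypothetical. Suffix start offsets are sorted lexicographically, and any two adjacent suffixes that share a prefix of at least `min_width` bytes mark a substring that appears more than once.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string>
#include <string_view>
#include <vector>

// Illustration only (hypothetical helper): sort all suffix start offsets
// lexicographically. This naive comparison sort is far slower than a
// GPU implementation and is only meant to show the data structure.
std::vector<int64_t> build_suffix_array_host(std::string_view text)
{
  std::vector<int64_t> sa(text.size());
  std::iota(sa.begin(), sa.end(), int64_t{0});
  std::sort(sa.begin(), sa.end(), [text](int64_t a, int64_t b) {
    return text.substr(static_cast<std::size_t>(a)) < text.substr(static_cast<std::size_t>(b));
  });
  return sa;
}

// Number of leading bytes shared by the suffixes starting at offsets a and b.
std::size_t common_prefix(std::string_view text, int64_t a, int64_t b)
{
  auto const lhs = text.substr(static_cast<std::size_t>(a));
  auto const rhs = text.substr(static_cast<std::size_t>(b));
  std::size_t n  = 0;
  while (n < lhs.size() && n < rhs.size() && lhs[n] == rhs[n]) { ++n; }
  return n;
}

// Collect the distinct prefixes shared by adjacent sorted suffixes that are
// at least min_width bytes long; each one occurs more than once in text.
std::vector<std::string> find_duplicates_host(std::string_view text, std::size_t min_width)
{
  auto const sa = build_suffix_array_host(text);
  std::vector<std::string> result;
  for (std::size_t i = 1; i < sa.size(); ++i) {
    auto const len = common_prefix(text, sa[i - 1], sa[i]);
    if (len >= min_width) {
      result.emplace_back(text.substr(static_cast<std::size_t>(sa[i - 1]), len));
    }
  }
  std::sort(result.begin(), result.end());
  result.erase(std::unique(result.begin(), result.end()), result.end());
  return result;
}
```

With the input "banana" and a `min_width` of 3, the only adjacent suffixes sharing at least 3 bytes are "ana" and "anana", so the sketch reports "ana". How the actual API handles overlapping matches and row boundaries is not shown here.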