Commit

sample size k build changed, tests added
BERENZ committed Nov 30, 2023
1 parent 70d8d14 commit eacfbba
Showing 9 changed files with 22 additions and 21 deletions.
2 changes: 1 addition & 1 deletion .Rproj.user/E3DB6272/pcs/files-pane.pper
@@ -9,5 +9,5 @@
"ascending": false
}
],
-"path": "~/git/nauka/ncn-foreigners/software/blocking/R"
+"path": "~/git/nauka/ncn-foreigners/software/blocking"
}
2 changes: 1 addition & 1 deletion .Rproj.user/E3DB6272/pcs/source-pane.pper
@@ -1,4 +1,4 @@
{
-"activeTab": 0,
+"activeTab": 2,
"activeTabSourceWindow0": 0
}
1 change: 1 addition & 0 deletions .Rproj.user/shared/notebooks/paths
@@ -9,6 +9,7 @@
/Users/berenz/git/nauka/ncn-foreigners/software/blocking/R/method_annoy.R="B0938ADD"
/Users/berenz/git/nauka/ncn-foreigners/software/blocking/R/method_hnsw.R="C19508EB"
/Users/berenz/git/nauka/ncn-foreigners/software/blocking/R/method_mlpack.R="9402F0E3"
+/Users/berenz/git/nauka/ncn-foreigners/software/blocking/R/method_nnd.R="E5F797EE"
/Users/berenz/git/nauka/ncn-foreigners/software/blocking/R/methods.R="081419BC"
/Users/berenz/git/nauka/ncn-foreigners/software/blocking/R/reclin2_pair_ann.R="D089A6FC"
/Users/berenz/git/nauka/ncn-foreigners/software/blocking/README.Rmd="2B1049F0"
1 change: 1 addition & 0 deletions NEWS.md
@@ -10,3 +10,4 @@
8. first vignette added.
9. evaluation with standard metrics (recall, fpr etc) added, works with vector for deduplication.
10. added saving index for hnsw and annoy
+11. `rnndescend` support added.
2 changes: 1 addition & 1 deletion R/method_nnd.R
@@ -27,7 +27,7 @@ method_nnd <- function(x,
control) {

l_ind <- rnndescent::rnnd_build(data = x,
-k = control$nnd$k_build,
+k = if (nrow(x) < control$nnd$k_build) nrow(x) else control$nnd$k_build,
metric = distance,
verbose = verbose,
n_threads = n_threads,
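Note on the `k` change above: the new expression caps the neighbour count passed to `rnndescent::rnnd_build` at the number of rows in `x`, presumably so that small inputs do not request more neighbours than there are records. A minimal sketch of the same logic in base R, using hypothetical helper names (`k_if_else`, `k_min`) that are not part of the package, suggests the `if`/`else` form is equivalent to `min()` for scalar inputs:

```r
# Hypothetical helpers for illustration only; neither exists in the package.

# Mirrors the expression introduced in this commit.
k_if_else <- function(n_rows, k_build) {
  if (n_rows < k_build) n_rows else k_build
}

# Shorter equivalent for scalar inputs.
k_min <- function(n_rows, k_build) {
  min(n_rows, k_build)
}

# Check a few cases: fewer rows than k_build, equal, and more rows.
cases <- list(c(8, 30), c(30, 30), c(100, 30))
same <- vapply(
  cases,
  function(p) k_if_else(p[1], p[2]) == k_min(p[1], p[2]),
  logical(1)
)
all(same)
```

With 8 records and `k_build = 30`, both helpers return 8, so the index build would ask for at most as many neighbours as there are rows.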
1 change: 1 addition & 0 deletions README.Rmd
@@ -24,6 +24,7 @@ A small package used to block records for data deduplication and record linkage

Currently supports the following R packages that binds to specific ANN algorithms

++ [rnndescent](https://cran.r-project.org/package=rnndescent) (default),
+ [RcppHNSW](https://cran.r-project.org/package=RcppHNSW),
+ [RcppAnnoy](https://cran.r-project.org/package=RcppAnnoy),
+ [mlpack](https://cran.r-project.org/package=RcppAnnoy) (see `mlpack::lsh` and `mlpack::knn`).
26 changes: 12 additions & 14 deletions README.md
@@ -19,6 +19,7 @@ and graphs (via `igraph`).
Currently supports the following R packages that binds to specific ANN
algorithms

+- [rnndescent](https://cran.r-project.org/package=rnndescent) (default),
- [RcppHNSW](https://cran.r-project.org/package=RcppHNSW),
- [RcppAnnoy](https://cran.r-project.org/package=RcppAnnoy),
- [mlpack](https://cran.r-project.org/package=RcppAnnoy) (see
@@ -93,13 +94,10 @@ Deduplication using blocking

``` r
blocking_result <- blocking(x = df_example$txt)
-#> 'as(<dgTMatrix>, "dgCMatrix")' is deprecated.
-#> Use 'as(., "CsparseMatrix")' instead.
-#> See help("Deprecated") and help("Matrix-deprecated").
## data frame with indices and block
blocking_result
#> ========================================================
-#> Blocking based on the hnsw method.
+#> Blocking based on the nnd method.
#> Number of blocks: 2.
#> Number of columns used for blocking: 28.
#> Reduction ratio: 0.5714.
@@ -113,16 +111,16 @@ Table with blocking

``` r
blocking_result$result
-#> x y block
-#> <int> <int> <num>
-#> 1: 1 2 1
-#> 2: 2 1 1
-#> 3: 2 3 1
-#> 4: 2 4 1
-#> 5: 5 6 2
-#> 6: 5 7 2
-#> 7: 5 8 2
-#> 8: 6 5 2
+#> x y block dist
+#> <int> <int> <num> <num>
+#> 1: 1 2 1 0.10000005
+#> 2: 1 3 1 0.14188367
+#> 3: 1 4 1 0.28286284
+#> 4: 2 1 1 0.10000005
+#> 5: 5 6 2 0.08333336
+#> 6: 5 7 2 0.13397458
+#> 7: 5 8 2 0.27831215
+#> 8: 6 5 2 0.08333336
```
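
The reduction ratio of 0.5714 reported in the README's printed summary is consistent with the usual definition: one minus the share of all record pairs that remain as candidates after blocking. The sketch below assumes that definition and assumes the two blocks hold four records each (records 1 to 4 and 5 to 8, as the table suggests); it is an illustration, not the package's own code.

```r
# Assumed block sizes: two blocks of four records each (inferred from the
# table, not read from the package).
block_sizes <- c(4, 4)
n_records <- sum(block_sizes)

# Pairs that would still be compared: all pairs within each block.
pairs_within_blocks <- sum(choose(block_sizes, 2))

# Pairs compared without any blocking: all pairs among the records.
pairs_total <- choose(n_records, 2)

# Reduction ratio under the assumed definition.
reduction_ratio <- 1 - pairs_within_blocks / pairs_total
round(reduction_ratio, 4)
```

Under these assumptions, 12 of the 28 possible pairs remain, which gives 1 - 12/28 and rounds to 0.5714.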

Deduplication followed by the `reclin2` package
2 changes: 1 addition & 1 deletion inst/tinytest/test_reclin2.R
@@ -1,7 +1,7 @@
source("test_data.R")

expect_silent(
-pair_ann(x = df_example, on = "txt")
+pair_ann(x = df_example, on = "txt", ann = "hnsw")
)

expect_equal(
6 changes: 3 additions & 3 deletions man/controls_ann.Rd

Some generated files are not rendered by default.