29 Sep 08:45

qinwf

517bc4e

CRAN Version 0.9.1

Changes in Version 0.9.1 (2016-9-28)

Major Change: distance and vector_distance now return the distance as an integer value
Major Change: building this package now requires C++11 with GCC 4.9+
Fix: tobin now returns the correct value
Fix: get_idf row names now use a 1-based index
Add: new_user_word now has a default tag
Add: apply_list to handle nested list input data
Add: simhash_dist to compute the distance between simhash values
Add: simhash_dist_mat to compute the distance matrix of simhash values
Add: vector_tag to tag a character vector
Add: more docs
Deprecated: quick mode will be removed in v0.11.0
Deprecated: filecoding, renamed to file_coding
Warning: the next version will update the internal CppJieba version to 5.0.0; query_threshold and words_locate will be removed due to upstream API changes.

31 Jan 00:15

qinwf

v0.8

d54f312

CRAN Version 0.8

Changes in Version 0.8

o Remove: ShowDictPath() EditDict() tag()
o Remove: some C APIs due to the CppJieba V4.4.1 update.

o C APIs that no longer work: jiebaR_mp_ptr jiebaR_mp_cut jiebaR_query_ptr jiebaR_query_cut jiebaR_hmm_ptr jiebaR_hmm_cut.

o C APIs that still work but give a warning: jiebaR_mix_ptr jiebaR_mix_cut jiebaR_tag_ptr jiebaR_tag_tag jiebaR_tag_file.

o C API change: jiebaR_key_ptr and jiebaR_sim_ptr add a user path variable.

o Add: some C APIs due to the CppJieba V4.4.1 update:

jiebaR_jiebaclass_ptr, jiebaR_jiebaclass_mix_cut, jiebaR_jiebaclass_mp_cut, jiebaR_jiebaclass_hmm_cut, jiebaR_jiebaclass_query_cut, jiebaR_jiebaclass_full_cut, jiebaR_jiebaclass_level_cut, jiebaR_jiebaclass_level_cut_pair, jiebaR_jiebaclass_tag_tag, jiebaR_jiebaclass_tag_file, jiebaR_set_query_threshold, jiebaR_add_user_word, jiebaR_u64tobin, jiebaR_get_loc

o Add: more segmentation types: full cut and level cut.
o Add: default attribute for the type of segmentation.
o Add: new user words can be added after the worker engine is created.
o Add: query_threshold to update query threshold
o Add: words_locate to locate the positions of words
o Fix: build on GCC 5.3.2 with gnu++14
o Fix: build on Clang 3.8 RC
o Fix: add roxygen2 as a dependency for the update of devtools

06 Dec 14:12

qinwf

0.7

d664000

CRAN Version 0.7

Changes in Version 0.7

o Add: tobin() to transform simhash to binary format.
o Add: vector_simhash() vector_distance() to extract simhash or compute Hamming distance from the result of segmentation.
o Add: get_tuple() to get tuple from segmentation result.
o Add: get_idf() to generate IDF dict.
o Fix: C API now works with Clang on Mac 10.11.
o Enhancement: Update tests for C API.
o Warning: The next version will update the internal CppJieba version, and tag(), EditDict(), ShowDictPath() will be removed.

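1. Add: get_tuple() returns the frequencies of combinations of n consecutive strings in a segmentation result, which can serve as a reference when building a custom dictionary.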

get_tuple(c("sd","sd","sd","rd"),size=3)
#     name count
# 4   sdsd     2
# 1   sdrd     1
# 2 sdsdrd     1
# 3 sdsdsd     1
get_tuple(list(
        c("sd","sd","sd","rd"),
        c("新浪","微博","sd","rd")
    ), size = 2)
#       name count
# 2     sdrd     2
# 3     sdsd     2
# 1   微博sd     1
# 4 新浪微博     1
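
The counts in the first example can be reproduced in base R. This is a minimal sketch, not the package's implementation: it assumes get_tuple() joins every run of 2 to size consecutive words and counts the results, which agrees with the output shown for size=3. The helper name count_tuples is hypothetical.

```r
# Hypothetical helper: count joined runs of 2..size consecutive words.
count_tuples <- function(words, size = 2) {
  stopifnot(is.character(words), size >= 2)
  tuples <- character(0)
  for (n in 2:size) {
    if (length(words) < n) break
    starts <- seq_len(length(words) - n + 1)
    # Join each window of n consecutive words into one string.
    joined <- vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = ""),
      character(1)
    )
    tuples <- c(tuples, joined)
  }
  counts <- table(tuples)
  data.frame(name = names(counts), count = as.integer(counts),
             stringsAsFactors = FALSE)
}

res <- count_tuples(c("sd", "sd", "sd", "rd"), size = 3)
res
```

The row order depends on how table() sorts the names, so it may differ from the order printed by get_tuple().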

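2. Add: get_idf() computes IDF values from the segmented terms of multiple documents. The input is a list of character vectors, each vector representing one document. A custom stop word list can be supplied.

get_idf(a_big_list,stop="stop word list",path="IDF output directory")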
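
To make the idea concrete, here is a base R sketch of an IDF calculation over a list of documents. It assumes the common definition log(N / df), where N is the number of documents and df is the number of documents containing the word; the exact formula used by get_idf() may differ. The name compute_idf and its arguments are hypothetical.

```r
# Hypothetical helper: IDF as log(N / document frequency).
compute_idf <- function(docs, stop_words = character(0)) {
  n_docs <- length(docs)
  # Count each word at most once per document, skipping stop words.
  per_doc <- lapply(docs, function(d) setdiff(unique(d), stop_words))
  doc_freq <- table(unlist(per_doc))
  idf <- log(n_docs / as.numeric(doc_freq))
  names(idf) <- names(doc_freq)
  idf
}

docs <- list(c("a", "b", "a"), c("a", "c"), c("the", "c"))
idf <- compute_idf(docs, stop_words = "the")
idf
```

A word that appears in every document gets an IDF of 0 under this definition, while rarer words get larger values.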

3. Add: vector_simhash and vector_distance compute the simhash and the Hamming distance directly from character vectors.

sim = worker("simhash")
cutter = worker()
vector_simhash(cutter["这是一个比较长的测试文本。"],sim)

$simhash
[1] "9679845206667243434"

$keyword
8.94485 7.14724 4.77176 4.29163 2.81755 
 "文本"  "测试"  "比较"  "这是"  "一个"

vector_simhash(c("今天","天气","真的","十分","不错","的","感觉"),sim)

$simhash
[1] "13133893567857586837"

$keyword
6.45994 6.18823 5.64148 5.63374 4.99212 
 "天气"  "不错"  "感觉"  "真的"  "今天"

vector_distance(c("今天","天气","真的","十分","不错","的","感觉"),c("今天","天气","真的","十分","不错","的","感觉"),sim)

$distance
[1] "0"

$lhs
6.45994 6.18823 5.64148 5.63374 4.99212 
 "天气"  "不错"  "感觉"  "真的"  "今天" 

$rhs
6.45994 6.18823 5.64148 5.63374 4.99212 
 "天气"  "不错"  "感觉"  "真的"  "今天"

4. Add: tobin converts a simhash value to its binary representation.

res = vector_simhash(c("今天","天气","真的","十分","不错","的","感觉"),sim) 
tobin(res$simhash)

[1] "0000000000000000000000000000000000010101111100000111001010010101"
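
The Hamming distance between two simhash values is the number of bit positions in which their binary forms differ. The base R sketch below computes it from two bit strings of the kind tobin() returns; hamming_bits is a hypothetical helper and not part of the package.

```r
# Hypothetical helper: Hamming distance between two equal-length bit strings.
hamming_bits <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  bits_a <- strsplit(a, "", fixed = TRUE)[[1]]
  bits_b <- strsplit(b, "", fixed = TRUE)[[1]]
  # Count the positions where the two strings disagree.
  sum(bits_a != bits_b)
}

d <- hamming_bits("0101", "0110")
d
# [1] 2
```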

02 Oct 08:34

qinwf

v0.6

9104f1a

CRAN version 0.6

Changes in Version 0.6 (2015-10-1)

Add: C API
Add: freq() to count word frequency
Fix: filter_segment() may occasionally remove words
Enhancement: filter_segment() can now handle a list of vectors of words, and adds a unit option.
Enhancement: the segmentation worker can now remove stop words. STOPPATH is not used by default for the segmentation worker.
Enhancement: when symbol = F, text such as 2010-10-13 or 10.2 is recognized as a single token.

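1. Enhancement: stop words can now be filtered during segmentation and part-of-speech tagging. The default STOPPATH is not used, so no stop word dictionary is applied by default. To filter stop words during segmentation, set a custom path. The stop word file must be UTF-8 encoded; otherwise the data read in may be garbled.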

cutter = worker()
cutter
# Worker Type:  Mix Segment

# Fixed Model Components:  
# ...

# $stop_word
# NULL

# $timestamp
# [1] 1442716020

# $detect $encoding $symbol $output $write $lines $bylines can be reset

cutter = worker(stop_word="../stop.txt")
cutter
# Worker Type:  Mix Segment

# Fixed Model Components:  
# ...

# $stop_word
# [1] "../stop.txt"

# $timestamp
# [1] 1442716020

# $detect $encoding $symbol $output $write $lines $bylines can be reset.

2. Enhancement: with symbol = FALSE, the symbols inside text such as 2010-10-12 or 20.2 are kept during segmentation. Standalone symbols are still filtered out.

cutter = worker()
cutter$symbol = F
cutter["2010-10-10"]

3. Add: freq() counts word frequency. The input is a character vector and the output is a data frame of word frequencies.

freq(c("测试", "测试", "文本"))
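
For comparison, a similar frequency table can be built in base R with table(). This is only a sketch of the idea; the column names word and freq and the helper name word_freq are assumptions, and the data frame returned by freq() may be laid out differently.

```r
# Hypothetical helper: word frequency as a data frame, using base R only.
word_freq <- function(words) {
  counts <- table(words)
  data.frame(word = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

wf <- word_freq(c("test", "test", "text"))
wf
```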

4. Enhancement: filter_segment() now accepts a list whose elements are character vectors.

cutter = worker()
result_segment = list(  cutter["我是测试文本，用于测试过滤分词效果。"], 
                      cutter["我是测试文本，用于测试过滤分词效果。"])
result_segment
filter_words  = c("我","你","它","大家")
filter_segment(result_segment,filter_words)

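5. Fix: filter_segment() could remove words that are not stop words.

6. Add: filter_segment() gains a unit option.

When there are many stop words, the generated regular expression can exceed 256 bytes and R may raise an error. The unit option processes the stop words in several passes, limiting how many stop words are matched per pass and therefore the length of the generated regular expression. The default value of unit is 50 and usually does not need to be changed.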

help(regex)

Long regular expressions may or may not be accepted: the POSIX standard only requires up to 256 bytes.

filter_segment(result_segment,filter_words) # use the default value; it usually does not need to be changed.

filter_segment(result_segment,filter_words, unit=10) # if you have many stop word entries whose text is long
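
The chunking idea behind unit can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: it assumes the stop words are plain text without regular expression metacharacters, and the helper name filter_in_chunks is hypothetical.

```r
# Hypothetical helper: remove stop words using one short regex per chunk.
filter_in_chunks <- function(words, stop_words, unit = 50) {
  # Split the stop words into groups of at most `unit` entries.
  chunks <- split(stop_words, ceiling(seq_along(stop_words) / unit))
  for (chunk in chunks) {
    # Anchored alternation, so only whole words are matched.
    pattern <- paste0("^(", paste(chunk, collapse = "|"), ")$")
    words <- words[!grepl(pattern, words)]
  }
  words
}

kept <- filter_in_chunks(c("a", "b", "c", "d"), c("a", "c", "x"), unit = 2)
kept
# [1] "b" "d"
```

A smaller unit means more passes over the words but a shorter pattern in each pass.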

7. Add: a C API, so other R packages can call this package's C interface.

// inst/include/jiebaRAPI.h
SEXP jiebaR_filecoding(SEXP fileSEXP);

SEXP jiebaR_mp_ptr(SEXP dictSEXP, SEXP userSEXP); 

....

29 Apr 11:28

qinwf

0.5

c62e0fc

CRAN version 0.5

Changes in Version 0.5 (2015-04-29)

Fix: edit_dict() on Mac
New function: filter_segment() to filter segmentation result
New function: vector_keywords() to extract keywords from a string
Enhancement: Segmentation support: Vector input => List output
Enhancement: Segmentation support: Input by lines => Output by lines
Enhancement: Add option write = "NOFILE"
Enhancement: New rules for "English word + Numbers"
Update documentation

1. Add filter_segment() to filter segmentation results, similar to the stop word feature used in keyword extraction.

cutter = worker()
result_segment = cutter["我是测试文本，用于测试过滤分词效果。"]
result_segment

[1] "我"   "是"   "测试" "文本" "用于" "测试" "过滤" "分词" "效果"

filter_words = c("我","你","它","大家")
filter_segment(result_segment,filter_words)

[1] "是"   "测试" "文本" "用于" "测试" "过滤" "分词" "效果"

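2. Segmentation supports "character vector input => list output" and "file input by lines => list output"

The bylines option controls whether output is produced line by line. The default is bylines = FALSE.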

cutter = worker(bylines = TRUE)
cutter

Worker Type:  Mix Segment

Detect Encoding :  TRUE
Default Encoding:  UTF-8
Keep Symbols    :  FALSE
Output Path     :  
Write File      :  TRUE
By Lines        :  TRUE
Max Read Lines  :  1e+05
....

cutter[c("这是非常的好","大家好才是真的好")]

[[1]]
[1] "这是" "非常" "的"   "好"  

[[2]]
[1] "大家" "好"   "才"   "是"   "真的" "好"

cutter$write = FALSE

# The input file contains:
# 这是一个分行测试文本
# 用于测试分行的输出结果

cutter["files.path"]

[[1]]
[1] "这是" "一个" "分行" "测试" "文本" 

[[2]]
[1] "用于" "测试" "分行"   "的" "输出" "结果"

# Write to the file line by line
cutter$write = TRUE
cutter$bylines = TRUE

3. vector_keywords extracts keywords from a character vector.

keyworker = worker("keywords")
cutter = worker()
vector_keywords(cutter["这是一个比较长的测试文本。"],keyworker)

8.94485 7.14724 4.77176 4.29163 2.81755 
 "文本"  "测试"  "比较"  "这是"  "一个"

vector_keywords(c("今天","天气","真的","十分","不错","的","感觉"),keyworker)

6.45994 6.18823 5.64148 5.63374 4.99212 
 "天气"  "不错"  "感觉"  "真的"  "今天"

4. Add the write = "NOFILE" option, which skips the file path check.

cutter = worker(write = "NOFILE",symbol = TRUE)
cutter["./test.txt"] # a test.txt file exists in the directory

[1] "."    "/"    "test" "."    "txt"

04 Jan 06:28

qinwf

0.4

f641b26

CRAN version 0.4

Remove Rcpp Modules
Better symbol filter in segmentation
Separate data files to jiebaRD package

02 Dec 14:02

qinwf

0.3

a8c321b

CRAN version 0.3

Pass UBSAN test
2X segmentation speed
Quick Mode
A new [ symbol to do segmentation
Portable internal string utility function

24 Nov 05:21

qinwf

0.2

0f85c02

CRAN version 0.2

This is the first release to CRAN.

Releases: qinwf/jiebaR
