YjdnJlpとcaretを使ってテキスト分類 - その1

先日作った、YjdnJlpパッケージとcaretパッケージを使って、簡単なテキスト分類作れないかなーと思い、実験。

読み込み

> library(YjdnJlp)
Loading required package: XML
Loading required package: RCurl
Loading required package: bitops
> library(caret)
Loading required package: lattice
Loading required package: reshape
Loading required package: plyr

Attaching package: 'reshape'

The following object(s) are masked from 'package:plyr':

    round_any

Warning message:
package 'caret' was built under R version 2.12.2 

テキストを読み込む

今回は、とりあえず、読売新聞オンライン : ニュース&お得サイトから適当にコピペして作成。

> doc.list <- c( paste(readLines("data/eco-1.txt"), collapse=" "),
 paste(readLines("data/eco-2.txt"), collapse=" "),
 paste(readLines("data/eco-3.txt"), collapse=" "),
 paste(readLines("data/eco-4.txt"), collapse=" "),
 paste(readLines("data/eco-5.txt"), collapse=" "),
 paste(readLines("data/soc-1.txt"), collapse=" "),
 paste(readLines("data/soc-2.txt"), collapse=" "),
 paste(readLines("data/soc-3.txt"), collapse=" "),
 paste(readLines("data/soc-4.txt"), collapse=" "),
 paste(readLines("data/soc-5.txt"), collapse=" "),
 paste(readLines("data/spt-1.txt"), collapse=" "),
 paste(readLines("data/spt-2.txt"), collapse=" "),
 paste(readLines("data/spt-3.txt"), collapse=" "),
 paste(readLines("data/spt-4.txt"), collapse=" "),
 paste(readLines("data/spt-5.txt"), collapse=" ") )

キーフレーズ抽出

YjdnJlpの中に、特徴語(キーフレーズ)とスコアを出してくれる関数があるので、それを使ってみる。

> con <- initYjdnJlp("*****")
key.list <- lapply(doc.list, function(x){ Keyphrase(con, x) })

> term.all <- unique(unlist(lapply(key.list, function(x){ toDataFrame(x)$keyphrase })))

> length(term.all)
[1] 398

> head(term.all)
[1] "放射性ヨウ素"   "真岡市"         "放射性セシウム" "春菊"           "検体"          
[6] "カキナ" 

表を作成する

> key.matrix <- matrix(sapply(key.list,
              function(x){
                key.row <- rep(0, length(term.all))
                names(key.row) <- term.all
                
                for( v in x@keyphrase.list ){
                     key.row[v@keyphrase] <- v@score
                   }
                key.row
              }), ncol=length(term.all), dimnames=list(1:length(key.list), term.all))

> head(key.matrix)
  放射性ヨウ素 真岡市 放射性セシウム 春菊 検体 カキナ 規制値 同県さくら市 ホウレンソウ
1          100     28              0    0    0      0      0            0            0
2           80     27              0    0    0      0      0            0            0
3           66     25              0    0    0      0      0            0            0
4           66     17              0    0    0      0      0            0            0
5           63      9              0    0    0      0      0            0            0
6           62      0              0    0    0      0      0            0            0
  食品衛生法 出荷自粛 暫定規制値 栃木県 自主回収 農業団体 規制値以上 出荷品 県内
1          0        0          0      0        0        0          0      0    0
2          0        0          0      0        0        0          0      0    0
3          0        0          0      0        0        0          0      0    0
4          0        0          0      0        0        0          0      0    0
5          0        0          0      0        0        0          0      0    0
6          0        0          0      0        0        0          0      0    01042080ベクレル 世帯 各都道府県 要介護者 連帯保証 社会福祉協議会 緊急貸し付け
1        94           62    0          0        0        0              0            0
2        87           62    0          0        0        0              0            0
3        81           61    0          0        0        0              0            0
4        77           58    0          0        0        0              0            0
5        76            0    0          0        0        0              0            0
6        76            0    0          0        0        0              0            0
...

caretを使って学習

準備
> library(caret)
> library(e1071)
> key.df.orig <- as.data.frame(key.matrix)
> genre <- as.factor(c(rep("eco", 5), rep("soc", 5), rep("spt",5)))

> nzv<- nearZeroVar(key.df.orig)
> key.df <- key.df.orig[, -nzv]
SVM
> key.svm <- train(key.df, genre,
                 method="svmRadial",
                 preProcess=c("center", "scale"),
                 trControl=fitControl)
Fitting: sigma=0.02809416, C=0.25 
Fitting: sigma=0.02809416, C=0.5 
Fitting: sigma=0.02809416, C=1 
Aggregating results
Selecting tuning parameters
Fitting model on full training set
 50 件以上の警告がありました (警告を見るには warnings() を使って下さい) 
> key.svm
15 samples
44 predictors

Pre-processing: centered, scaled 
Resampling: Repeated Train/Test Splits (30 reps, 0.75%) 

Summary of sample sizes: 12, 12, 12, 12, 12, 12, ... 

Resampling results across tuning parameters:

  C     Accuracy  Kappa  Accuracy SD  Kappa SD
  0.25  0.7       0.55   0.237        0.356   
  0.5   0.7       0.55   0.237        0.356   
  1     0.633     0.45   0.221        0.331   

Tuning parameter 'sigma' was held constant at a value of 0.0228
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were C = 0.25 and sigma = 0.0228. 
Random Forest
> library(randomForest)
> key.rf <- train(key.df, genre,
                   method="rf",
                   preProcess=c("center", "scale"),
                   trControl=fitControl)
Fitting: mtry=2 
Fitting: mtry=23 
Fitting: mtry=44 
Aggregating results
Selecting tuning parameters
Fitting model on full training set
 50 件以上の警告がありました (警告を見るには warnings() を使って下さい) 
> key.rf
15 samples
44 predictors

Pre-processing: centered, scaled 
Resampling: Repeated Train/Test Splits (30 reps, 0.75%) 

Summary of sample sizes: 12, 12, 12, 12, 12, 12, ... 

Resampling results across tuning parameters:

  mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
  2     0.933     0.9    0.136        0.203   
  23    0.944     0.917  0.126        0.19    
  44    0.989     0.983  0.0609       0.0913  

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 44.