Add files via upload (7f6d9209) · Commits · github_fork / R4BD

docs/快速建模：高性能机器学习工具.html

+40 −40

		@@ -292,7 +292,7 @@ pre > code.sourceCode > span > a:first-child::before { text-decoration: underlin
		<div id="fig-ml_process" class="quarto-figure quarto-figure-center quarto-float anchored">
		<figure class="quarto-float quarto-float-fig figure">
		<div aria-describedby="fig-ml_process-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
		<img src="fig/ml_process.jpg" class="img-fluid figure-img">
		<img src="fig/ml_process.jpg" class="img-fluid figure-img" width="1467">
		</div>
		<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-ml_process-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
		Figure 7.1: The basic machine learning workflow
		@@ -394,28 +394,28 @@ Figure 7.1: The basic machine learning workflow
		<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>split</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
		<div class="cell-output cell-output-stdout">
		<pre><code>$train
		[1] 2 3 4 5 6 8 9 10 12 13 14 20 21 23 24 25 26 27
		[19] 29 32 34 35 36 37 38 40 42 45 46 47 48 49 50 51 53 59
		[37] 60 62 63 64 65 67 70 71 74 77 78 79 81 82 83 84 86 89
		[55] 91 92 94 95 96 97 98 100 101 102 103 104 105 106 107 109 110 111
		[73] 112 113 114 118 119 120 121 122 123 127 128 129 130 132 133 134 135 136
		[91] 138 140 141 143 144 145 146 147 149 150 151 152 154 156 158 159 160 162
		[109] 163 165 167 169 170 171 172 174 175 178 179 180 181 182 184 186 187 189
		[127] 190 192 193 196 197 198 200 201 202 204 205 206 207 208 210 211 214 215
		[145] 216 217 222 223 225 227 229 230 231 232 234 236 237 240 241 244 245 246
		[163] 247 248 250 251 252 253 254 256 258 259 260 261 262 264 266 267 268 269
		[181] 271 272 273 275 276 277 278 279 280 281 282 285 286 287 288 289 290 291
		[199] 294 299 300 301 302 303 305 307 308 309 310 311 312 313 314 316 317 318
		[217] 319 320 321 322 323 325 327 329 330 337 338 339 341 342 343
		[1] 1 2 3 4 5 6 9 11 12 13 15 16 17 18 21 22 23 25
		[19] 26 27 28 30 33 34 35 36 37 38 39 40 41 46 47 48 49 50
		[37] 52 54 58 59 60 62 64 65 67 69 71 73 74 76 79 80 81 83
		[55] 85 86 87 89 90 91 92 93 94 96 97 98 99 100 101 102 103 104
		[73] 106 107 108 109 110 111 113 118 121 123 124 125 126 127 128 129 131 132
		[91] 134 135 139 140 141 142 143 144 148 149 150 151 154 155 156 158 159 160
		[109] 161 163 164 165 166 167 171 172 173 174 175 176 177 178 180 181 183 184
		[127] 186 188 189 194 196 197 198 199 200 201 202 203 204 206 207 208 209 210
		[145] 211 212 213 214 216 217 218 222 223 225 227 228 229 230 231 232 233 234
		[163] 235 237 238 239 241 243 244 245 251 253 254 256 259 261 262 263 266 267
		[181] 268 271 274 275 276 279 280 281 283 284 286 287 288 289 290 291 292 293
		[199] 294 296 297 298 299 301 302 303 304 305 308 309 312 313 314 315 316 317
		[217] 318 320 321 322 323 327 328 329 331 334 335 337 338 340 342

		$test
		[1] 1 7 11 15 16 17 18 19 22 28 30 31 33 39 41 43 44 52
		[19] 54 55 56 57 58 61 66 68 69 72 73 75 76 80 85 87 88 90
		[37] 93 99 108 115 116 117 124 125 126 131 137 139 142 148 153 155 157 161
		[55] 164 166 168 173 176 177 183 185 188 191 194 195 199 203 209 212 213 218
		[73] 219 220 221 224 226 228 233 235 238 239 242 243 249 255 257 263 265 270
		[91] 274 283 284 292 293 295 296 297 298 304 306 315 324 326 328 331 332 333
		[109] 334 335 336 340 344</code></pre>
		[1] 7 8 10 14 19 20 24 29 31 32 42 43 44 45 51 53 55 56
		[19] 57 61 63 66 68 70 72 75 77 78 82 84 88 95 105 112 114 115
		[37] 116 117 119 120 122 130 133 136 137 138 145 146 147 152 153 157 162 168
		[55] 169 170 179 182 185 187 190 191 192 193 195 205 215 219 220 221 224 226
		[73] 236 240 242 246 247 248 249 250 252 255 257 258 260 264 265 269 270 272
		[91] 273 277 278 282 285 295 300 306 307 310 311 319 324 325 326 330 332 333
		[109] 336 339 341 343 344</code></pre>
		</div>
		</div>
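		<p>The step above splits the data into two parts: a training set and a test set. The split variable is a list that holds the row numbers of the training set and the test set. Next, we choose a machine learning model by specifying a learner. We will train a decision tree and use it for classification, as follows:</p>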
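		<p>A minimal sketch of the split-and-train step described here. It assumes the task is the built-in penguins task (suggested by the tree output in this chapter, but not shown in this excerpt), an arbitrary fixed seed, and the <code>classif.rpart</code> decision tree learner with default settings; the code in the chapter itself may differ.</p>
		<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">library(mlr3)
		
		set.seed(42)                     # assumption: any fixed seed, for reproducibility
		task = tsk("penguins")           # assumption: needs the palmerpenguins package
		split = partition(task)          # list of row ids; about 2/3 go to $train by default
		
		learner = lrn("classif.rpart")   # decision tree classifier
		learner$train(task, row_ids = split$train)
		print(learner$model)             # the fitted rpart tree</code></pre></div>
		<p>Because <code>partition()</code> samples rows at random, a different seed gives a different split and a slightly different tree, which is the kind of change visible in the output of this diff.</p>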
		@@ -466,10 +466,10 @@ node), split, n, loss, yval, (yprob)
		* denotes terminal node

		1) root 231 129 Adelie (0.441558442 0.199134199 0.359307359)
		2) flipper_length< 207 144 44 Adelie (0.694444444 0.298611111 0.006944444)
		4) bill_length< 43.35 100 3 Adelie (0.970000000 0.030000000 0.000000000) *
		5) bill_length>=43.35 44 4 Chinstrap (0.068181818 0.909090909 0.022727273) *
		3) flipper_length>=207 87 5 Gentoo (0.022988506 0.034482759 0.942528736) *</code></pre>
		2) flipper_length< 207.5 146 45 Adelie (0.691780822 0.301369863 0.006849315)
		4) bill_length< 44.3 104 4 Adelie (0.961538462 0.038461538 0.000000000) *
		5) bill_length>=44.3 42 2 Chinstrap (0.023809524 0.952380952 0.023809524) *
		3) flipper_length>=207.5 85 3 Gentoo (0.011764706 0.023529412 0.964705882) *</code></pre>
		</div>
		</div>
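		<p>With the model produced by this learner, we can make predictions on the test set. We call the predict method of the learner object and predict the rows of the task whose row numbers are stored in the test part of the split:</p>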
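		<p>The prediction step can be sketched as follows. This is an illustration under assumptions, not the chapter's exact code: the penguins task, the seed, and the default decision tree are the same hypothetical choices as in any other sketch of this workflow.</p>
		<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">library(mlr3)
		
		set.seed(42)                     # assumption: any fixed seed
		task = tsk("penguins")           # assumption: the penguins task
		split = partition(task)
		learner = lrn("classif.rpart")
		learner$train(task, row_ids = split$train)
		
		# Predict only the held-out rows, then compare truth and response
		prediction = learner$predict(task, row_ids = split$test)
		print(prediction$confusion)                  # confusion matrix: response vs. truth
		acc = prediction$score(msr("classif.acc"))   # share of correct predictions
		print(acc)</code></pre></div>
		<p>Passing <code>row_ids</code> keeps the training rows out of the evaluation, so the accuracy reflects performance on data the tree has not seen.</p>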
		@@ -479,12 +479,12 @@ node), split, n, loss, yval, (yprob)
		<div class="cell-output cell-output-stdout">
		<pre><code><PredictionClassif> for 113 observations:
		row_ids truth response
		1 Adelie Adelie
		7 Adelie Adelie
		11 Adelie Adelie
		---
		336 Chinstrap Chinstrap
		340 Chinstrap Gentoo
		8 Adelie Adelie
		10 Adelie Adelie
		--- --- ---
		341 Chinstrap Adelie
		343 Chinstrap Gentoo
		344 Chinstrap Chinstrap</code></pre>
		</div>
		</div>
		@@ -493,7 +493,7 @@ node), split, n, loss, yval, (yprob)
		<div class="sourceCode cell-code" id="cb15"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a>prediction<span class="sc">$</span><span class="fu">score</span>(<span class="fu">msr</span>(<span class="st">"classif.acc"</span>))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
		<div class="cell-output cell-output-stdout">
		<pre><code>classif.acc
		0.9557522 </code></pre>
		0.9292035 </code></pre>
		</div>
		</div>
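		<p>The steps above make up the complete process of solving a basic machine learning task with the mlr3 framework. With this basic picture in mind, let us look at the convenient tools mlr3 provides to make the whole workflow more efficient.</p>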
		@@ -633,15 +633,15 @@ node), split, n, loss, yval, (yprob)
		<div class="cell-output cell-output-stdout">
		<pre><code> task_id learner_id classif.ce
		<char> <char> <num>
		1: german_credit classif.rpart 0.2880000
		2: german_credit classif.ranger 0.2370000
		1: german_credit classif.rpart 0.2690000
		2: german_credit classif.ranger 0.2320000
		3: german_credit classif.featureless 0.3000000
		4: sonar classif.rpart 0.3022067
		5: sonar classif.ranger 0.1773519
		6: sonar classif.featureless 0.4660859</code></pre>
		4: sonar classif.rpart 0.2694541
		5: sonar classif.ranger 0.1730546
		6: sonar classif.featureless 0.4663182</code></pre>
		</div>
		</div>
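		<p>We do not display the results here; readers can run the code themselves to inspect them. Details worth noting: (1) among the algorithms used, “classif.rpart” is a decision tree, “classif.ranger” is a random forest, and “classif.featureless” is a baseline model that, for classification, always guesses the majority class; (2) by default, model performance is measured with the classification error rate, and the lowr the error rate, the better the model performs; see the official documentation (<a href="https://mlr3.mlr-org.com/reference/mlr_measures_classif.ce.html" class="uri">https://mlr3.mlr-org.com/reference/mlr_measures_classif.ce.html</a>); (3) when inspecting the bmr object we call the <code>aggregate</code> method, which summarizes performance across the resampling iterations; (4) mlr3 is built on <strong>data.table</strong>, so we select columns directly with data.table syntax when inspecting model performance. For more on model comparison, see the official documentation (<a href="https://mlr3book.mlr-org.com/chapters/chapter3/evaluation_and_benchmarking.html" class="uri">https://mlr3book.mlr-org.com/chapters/chapter3/evaluation_and_benchmarking.html</a>).</p>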
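		<p>We do not display the results here; readers can run the code themselves to inspect them. Details worth noting: (1) among the algorithms used, “classif.rpart” is a decision tree, “classif.ranger” is a random forest, and “classif.featureless” is a baseline model that, for classification, always guesses the majority class; (2) by default, model performance is measured with the classification error rate, and the lower the error rate, the better the model performs; see the official documentation (<a href="https://mlr3.mlr-org.com/reference/mlr_measures_classif.ce.html" class="uri">https://mlr3.mlr-org.com/reference/mlr_measures_classif.ce.html</a>); (3) when inspecting the bmr object we call the <code>aggregate</code> method, which summarizes performance across the resampling iterations; (4) mlr3 is built on <strong>data.table</strong>, so we select columns directly with data.table syntax when inspecting model performance. For more on model comparison, see the official documentation (<a href="https://mlr3book.mlr-org.com/chapters/chapter3/evaluation_and_benchmarking.html" class="uri">https://mlr3book.mlr-org.com/chapters/chapter3/evaluation_and_benchmarking.html</a>).</p>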
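		<p>A hedged sketch of how such a comparison table can be produced. The two tasks and three learners match the output shown in this section; the 3-fold cross-validation and the seed are assumptions, since the resampling setup is not visible in this excerpt. The tasks need the rchallenge and mlbench packages, and <code>classif.ranger</code> needs mlr3learners and ranger.</p>
		<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">library(mlr3)
		library(mlr3learners)            # provides classif.ranger
		library(data.table)              # for the .() column selection below
		
		set.seed(42)                     # assumption: any fixed seed
		design = benchmark_grid(
		  tasks = tsks(c("german_credit", "sonar")),
		  learners = lrns(c("classif.rpart", "classif.ranger", "classif.featureless")),
		  resamplings = rsmp("cv", folds = 3)   # assumption: 3-fold cross-validation
		)
		bmr = benchmark(design)          # runs every task/learner/fold combination
		
		# Average the classification error over the folds, one row per combination
		res = bmr$aggregate(msr("classif.ce"))
		print(res[, .(task_id, learner_id, classif.ce)])</code></pre></div>
		<p>The featureless learner gives a floor for comparison: a model is only useful if its error rate is clearly below that baseline.</p>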
		</section>
		<section id="参数调节" class="level3" data-number="7.3.4">
		<h3 data-number="7.3.4" class="anchored" data-anchor-id="参数调节"><span class="header-section-number">7.3.4</span> Hyperparameter Tuning</h3>
		@@ -734,14 +734,14 @@ range [100, 400]</code></pre>
		<div class="cell-output cell-output-stdout">
		<pre><code> num.trees learner_param_vals x_domain classif.ce
		<char> <list> <list> <num>
		1: 400 <list[2]> <list[1]> 0.1971705</code></pre>
		1: 200 <list[2]> <list[1]> 0.1972395</code></pre>
		</div>
		<div class="sourceCode cell-code" id="cb36"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb36-1"><a href="#cb36-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Show the tuning result</span></span>
		<span id="cb36-2"><a href="#cb36-2" aria-hidden="true" tabindex="-1"></a>instance<span class="sc">$</span>result</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
		<div class="cell-output cell-output-stdout">
		<pre><code> num.trees learner_param_vals x_domain classif.ce
		<char> <list> <list> <num>
		1: 400 <list[2]> <list[1]> 0.1971705</code></pre>
		1: 200 <list[2]> <list[1]> 0.1972395</code></pre>
		</div>
		</div>
		<p>In addition, mlr3 offers another, more explicit way to implement the same procedure:</p>
		@@ -769,7 +769,7 @@ range [100, 400]</code></pre>
		<div class="cell-output cell-output-stdout">
		<pre><code> num.trees learner_param_vals x_domain classif.ce
		<char> <list> <list> <num>
		1: 400 <list[2]> <list[1]> 0.1731539</code></pre>
		1: 200 <list[2]> <list[1]> 0.1728088</code></pre>
		</div>
		</div>
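		<p>The result returned at the end is the best parameter setting found in this run (not shown here; readers are encouraged to run the code themselves). For details on flexibly tuning combinations of parameters in mlr3, see the official documentation (<a href="https://mlr3book.mlr-org.com/chapters/chapter4/hyperparameter_optimization.html" class="uri">https://mlr3book.mlr-org.com/chapters/chapter4/hyperparameter_optimization.html</a>).</p>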
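		<p>A minimal tuning sketch in the same spirit, assuming a recent mlr3tuning release. The <code>num.trees</code> range [100, 400] comes from the output in this section; the sonar task, the 3-fold cross-validation, the grid resolution, and the seed are assumptions made for illustration and may not match the chapter's setup.</p>
		<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">library(mlr3)
		library(mlr3learners)            # provides classif.ranger
		library(mlr3tuning)
		
		set.seed(42)                     # assumption: any fixed seed
		# Mark num.trees as tunable over the integer range [100, 400]
		learner = lrn("classif.ranger", num.trees = to_tune(100, 400))
		
		instance = ti(
		  task = tsk("sonar"),                  # assumption: task chosen for illustration
		  learner = learner,
		  resampling = rsmp("cv", folds = 3),   # assumption: 3-fold cross-validation
		  measures = msr("classif.ce"),
		  terminator = trm("none")              # stop when the grid is exhausted
		)
		
		# Grid search with 4 points evaluates num.trees = 100, 200, 300, 400
		tuner = tnr("grid_search", resolution = 4)
		tuner$optimize(instance)
		print(instance$result)           # best configuration and its error rate</code></pre></div>
		<p>The error rates of neighbouring grid points are often close, so the selected <code>num.trees</code> can change between runs with different seeds, as it does between the two sides of this diff.</p>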
