Unverified Commit 68b5bb2f authored by Hope, committed by GitHub

Add files via upload

parent f916878b
+2 −2
Original line number Diff line number Diff line
@@ -7,7 +7,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">

<meta name="author" content="黄天元">
<meta name="dcterms.date" content="2024-07-31">
<meta name="dcterms.date" content="2024-10-10">

<title>实战大数据:基于R语言</title>
<style>
@@ -267,7 +267,7 @@ pre > code.sourceCode > span > a:first-child::before { text-decoration: underlin
    <div>
    <div class="quarto-title-meta-heading">Published</div>
    <div class="quarto-title-meta-contents">
      <p class="date">July 31, 2024</p>
      <p class="date">October 10, 2024</p>
    </div>
  </div>
  
+26 −6

File changed.

Preview size limit exceeded, changes collapsed.

+4 −7
@@ -311,7 +311,7 @@ Figure&nbsp;12.1: DuckDB database logo
</section>
<section id="数据库的连接" class="level3" data-number="12.2.2">
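<h3 data-number="12.2.2" class="anchored" data-anchor-id="数据库的连接"><span class="header-section-number">12.2.2</span> Connecting to the database</h3>
<p>The first step is always the hardest, and the first step in working with a database is connecting the R environment to it. Connecting to a database from R generally takes two packages. The first is <strong>DBI</strong>, which provides generic functions for connecting to databases, transferring data, and running queries. The second is a backend package specific to the database system in use, which translates <strong>DBI</strong> commands into commands that particular system understands: SQLite requires the <strong>RSQLite</strong> package, for example, and PostgreSQL requires the <strong>RPostgreSQL</strong> package. For our experiment, the <strong>duckdb</strong> package does this job, as follows:</p>
<p>The first step is always the hardest, and the first step in working with a database is connecting the R environment to it. Connecting to a database from R generally takes two packages. The first is <strong>DBI</strong>, which provides generic functions for connecting to databases, transferring data, and running queries. The second is a backend package specific to the database system in use, which translates <strong>DBI</strong> commands into commands that particular system understands: SQLite requires the <strong>RSQLite</strong> package, for example, and PostgreSQL requires the <strong>RPostgreSQL</strong> package. For our experiment, the <strong>duckdb</strong> package does this job, as follows:</p>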
<div class="cell">
<div class="sourceCode cell-code" id="cb2"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>con <span class="ot">=</span> <span class="fu">dbConnect</span>(<span class="fu">duckdb</span>())</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
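<p>As a minimal sketch of the full round trip (connect, write a table, query, disconnect), assuming the <strong>DBI</strong> and <strong>duckdb</strong> packages are installed; the table name <code>mtcars</code> and the query are illustrative choices, not taken from the book:</p>

```r
library(DBI)

# In-memory DuckDB database (assumption: nothing needs to persist on disk)
con <- dbConnect(duckdb::duckdb())

# Copy a built-in data frame into the database as a table
dbWriteTable(con, "mtcars", mtcars)

# Run a SQL query and get the result back as an R data frame
res <- dbGetQuery(
  con,
  "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl ORDER BY cyl"
)
print(res)

# Close the connection and shut down the database instance
dbDisconnect(con, shutdown = TRUE)
```

<p>Calling the driver as <code>duckdb::duckdb()</code> avoids attaching the whole package; only the generic <strong>DBI</strong> functions are used directly, so the same calls would work with another backend.</p>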
@@ -670,7 +670,7 @@ See $.data for the source Arrow object</code></pre>
<span id="cb38-2"><a href="#cb38-2" aria-hidden="true" tabindex="-1"></a><span class="fu">p_load</span>(polars,tidypolars,tidyverse,tidyfst)</span>
<span id="cb38-3"><a href="#cb38-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb38-4"><a href="#cb38-4" aria-hidden="true" tabindex="-1"></a><span class="co"># Scan the data</span></span>
<span id="cb38-5"><a href="#cb38-5" aria-hidden="true" tabindex="-1"></a>pl<span class="sc">$</span><span class="fu">scan_parquet</span>(<span class="st">"df.parquet"</span>) <span class="ot">-&gt;</span> dat_pl</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<span id="cb38-5"><a href="#cb38-5" aria-hidden="true" tabindex="-1"></a><span class="fu">scan_parquet_polars</span>(<span class="st">"temp/df.parquet"</span>) <span class="ot">-&gt;</span> dat_pl</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
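<p>Note that the operation above does not load the data into the environment. We use the word "scan", which in effect establishes a connection to the data, much like the <code>open_dataset</code> operation described in the previous section. On that basis, we can apply all kinds of operations to data that has not been loaded into the environment, then collect the results into the environment for display, as follows:</p>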
<div class="cell">
@@ -694,11 +694,8 @@ See $.data for the source Arrow object</code></pre>
<span id="cb39-18"><a href="#cb39-18" aria-hidden="true" tabindex="-1"></a><span class="co"># Inspect the result</span></span>
<span id="cb39-19"><a href="#cb39-19" aria-hidden="true" tabindex="-1"></a>res</span>
<span id="cb39-20"><a href="#cb39-20" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb39-21"><a href="#cb39-21" aria-hidden="true" tabindex="-1"></a><span class="co"># Convert the result to an R data frame</span></span>
<span id="cb39-22"><a href="#cb39-22" aria-hidden="true" tabindex="-1"></a>res<span class="sc">$</span><span class="fu">to_data_frame</span>()</span>
<span id="cb39-23"><a href="#cb39-23" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb39-24"><a href="#cb39-24" aria-hidden="true" tabindex="-1"></a><span class="co"># Convert the result to a data frame and display it as a tibble</span></span>
<span id="cb39-25"><a href="#cb39-25" aria-hidden="true" tabindex="-1"></a>res <span class="sc">%&gt;%</span> <span class="fu">as_tibble</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<span id="cb39-21"><a href="#cb39-21" aria-hidden="true" tabindex="-1"></a><span class="co"># Convert the result to a data frame and display it as a tibble</span></span>
<span id="cb39-22"><a href="#cb39-22" aria-hidden="true" tabindex="-1"></a>res <span class="sc">%&gt;%</span> <span class="fu">as_tibble</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
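<p>The experiment above shows that once the data is saved in Parquet format and connected with the <code>scan_parquet</code> method, the familiar <strong>dplyr</strong> and <strong>tidyr</strong> functions can be used to carry out all kinds of operations on data stored on disk. This is a great convenience for big data analysis and one of the best solutions for out-of-memory computation.</p>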
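<p>The scan-then-collect pattern can also be sketched with the <strong>arrow</strong> package's <code>open_dataset</code>, which the text compares it to. This swaps in arrow instead of polars because the polars function names differ between versions; the temporary file and the <code>mtcars</code> query are assumptions made for illustration:</p>

```r
library(arrow)
library(dplyr)

# Write a small demo file to a temporary path (hypothetical, not the book's df.parquet)
path <- tempfile(fileext = ".parquet")
write_parquet(mtcars, path)

# "Scan": create a reference to the file without loading its rows into memory
ds <- open_dataset(path)

# Build the query lazily with dplyr verbs; collect() runs it and returns a tibble
res <- ds %>%
  filter(cyl == 4) %>%
  group_by(gear) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  collect()

print(res)
```

<p>Nothing is read from disk until <code>collect()</code> is called, so only the filtered and aggregated result has to fit in memory.</p>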
</section>
+1 −0
@@ -235,6 +235,7 @@ ul.task-list li input[type="checkbox"] {
<li><a href="https://spark.posit.co/">R interface to Apache Spark</a></li>
<li><a href="https://pola-rs.github.io/r-polars/vignettes/polars.html">An Introduction to Polars from R</a></li>
<li><a href="https://tidypolars.etiennebacher.com/">tidypolars</a></li>
<li><a href="https://fastverse.github.io/fastverse/">fastverse</a></li>
</ol>


+16 −0
@@ -245,6 +245,8 @@ pre > code.sourceCode > span > a:first-child::before { text-decoration: underlin
  <li><a href="#极限压缩" id="toc-极限压缩" class="nav-link" data-scroll-target="#极限压缩"><span class="header-section-number">4.3</span> Maximum compression</a></li>
  <li><a href="#通用交流" id="toc-通用交流" class="nav-link" data-scroll-target="#通用交流"><span class="header-section-number">4.4</span> Universal exchange</a></li>
  <li><a href="#小结" id="toc-小结" class="nav-link" data-scroll-target="#小结"><span class="header-section-number">4.5</span> Summary</a></li>
  <li><a href="#练习" id="toc-练习" class="nav-link" data-scroll-target="#练习"><span class="header-section-number">4.6</span> Exercises</a></li>
  <li><a href="#参考资料" id="toc-参考资料" class="nav-link" data-scroll-target="#参考资料"><span class="header-section-number">4.7</span> References</a></li>
  </ul>
</nav>
    </div>
@@ -482,6 +484,20 @@ Figure&nbsp;4.1: Hex logo of the rio package
<section id="小结" class="level2" data-number="4.5">
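<h2 data-number="4.5" class="anchored" data-anchor-id="小结"><span class="header-section-number">4.5</span> Summary</h2>
<p>This chapter focuses on read/write performance for big data and introduces three factors to consider: (1) read/write speed; (2) memory usage; (3) generality of the file format. Tests on the R platform show that fst is the fastest format to read and write, while Parquet is the most storage-efficient. When sharing data, also consider which file formats the team members can read.</p>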
</section>
<section id="练习" class="level2" data-number="4.6">
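<h2 data-number="4.6" class="anchored" data-anchor-id="练习"><span class="header-section-number">4.6</span> Exercises</h2>
<p>Design an experiment that, for datasets of different sizes (no smaller than 100 MB), measures the time and space needed to read and write files in different formats (including but not limited to csv, parquet, qs, and fst). Present the results with charts and state a clear conclusion. Further consideration: do the conclusions change when the data are of different types?</p>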
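<p>One possible starting scaffold for such an experiment, using only base R and the csv and rds formats. The data set, its size, and the helper name <code>timed</code> are assumptions for illustration; a real run would scale <code>n</code> up, add the other formats, and repeat each measurement several times:</p>

```r
# Simulated data; kept small here, scale n up for a real test
set.seed(1)
n <- 1e5
df <- data.frame(
  id = seq_len(n),
  x  = rnorm(n),
  g  = sample(letters, n, replace = TRUE)
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# Hypothetical helper: elapsed seconds for one evaluation of an expression
timed <- function(expr) unname(system.time(expr)["elapsed"])

# Writes are timed first, then reads, then file sizes on disk
res <- data.frame(
  format    = c("csv", "rds"),
  write_sec = c(timed(write.csv(df, csv_path, row.names = FALSE)),
                timed(saveRDS(df, rds_path))),
  read_sec  = c(timed(read.csv(csv_path)),
                timed(readRDS(rds_path))),
  size_mb   = c(file.size(csv_path), file.size(rds_path)) / 1024^2
)
print(res)
```

<p>Collecting the measurements in one data frame makes it straightforward to plot time and size per format afterwards.</p>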
</section>
<section id="参考资料" class="level2" data-number="4.7">
<h2 data-number="4.7" class="anchored" data-anchor-id="参考资料"><span class="header-section-number">4.7</span> References</h2>
<ul>
<li><a href="https://rsangole.netlify.app/posts/2022-09-14_data-read-write-performance/data-read-write-perf" class="uri">https://rsangole.netlify.app/posts/2022-09-14_data-read-write-performance/data-read-write-perf</a></li>
<li><a href="https://tomaztsql.wordpress.com/2022/05/08/comparing-performances-of-csv-to-rds-parquet-and-feather-data-types/" class="uri">https://tomaztsql.wordpress.com/2022/05/08/comparing-performances-of-csv-to-rds-parquet-and-feather-data-types/</a></li>
<li><a href="https://prof-thiagooliveira.netlify.app/post/data-read-write-performance/" class="uri">https://prof-thiagooliveira.netlify.app/post/data-read-write-performance/</a></li>
<li><a href="https://stackoverflow.com/questions/58699848/best-file-type-for-loading-data-in-to-r-speed-wise" class="uri">https://stackoverflow.com/questions/58699848/best-file-type-for-loading-data-in-to-r-speed-wise</a></li>
<li><a href="https://h2oai.github.io/db-benchmark/" class="uri">https://h2oai.github.io/db-benchmark/</a></li>
</ul>


</section>