Google Cloud Data Engineer Exam - Dataproc Review Notes

Dataproc Summary

Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning.

How to load data?

Dataproc connects to BigQuery

Option 1:

BigQuery does not natively know how to work with a Hadoop file system.

Cloud Storage can act as an intermediary between BigQuery and Dataproc.

You export the data from BigQuery into Cloud Storage as sharded files.

The worker nodes in Dataproc then read the sharded files.

Symmetrically, if the Dataproc job produces output, it can be written to Cloud Storage in a format that BigQuery can load.

This approach is appropriate for periodic or infrequent transfers.
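
The round trip through Cloud Storage can be sketched with the `bq` command-line tool. The snippet below is a minimal sketch that only builds the command lines and does not run them. The bucket, prefix, and table names are hypothetical, and Avro is an assumed format choice (it is self-describing, so the load step needs no separate schema). A wildcard in the destination URI is what makes BigQuery write the export as multiple shards.

```python
import shlex
from typing import List


def sharded_uri(bucket: str, prefix: str, extension: str = "avro") -> str:
    """Return a Cloud Storage URI whose wildcard asks for a sharded export."""
    return f"gs://{bucket}/{prefix.strip('/')}/part-*.{extension}"


def bq_extract_command(table: str, destination_uri: str, fmt: str = "AVRO") -> List[str]:
    """Build (but do not run) a `bq extract` command: BigQuery -> Cloud Storage."""
    return ["bq", "extract", f"--destination_format={fmt}", table, destination_uri]


def bq_load_command(table: str, source_uri: str, fmt: str = "AVRO") -> List[str]:
    """Build (but do not run) a `bq load` command: Cloud Storage -> BigQuery."""
    return ["bq", "load", f"--source_format={fmt}", table, source_uri]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    uri = sharded_uri("my-bucket", "exports/events")
    print(shlex.join(bq_extract_command("my_dataset.events", uri)))
    # After the Dataproc job has written its own Avro output:
    out = sharded_uri("my-bucket", "results/events")
    print(shlex.join(bq_load_command("my_dataset.events_processed", out)))
```

The Dataproc job itself would read the shards with an ordinary `gs://` path, since Cloud Storage is accessible from the cluster as a Hadoop-compatible file system.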

Option 2:

Another option is to set up a BigQuery connector on the Dataproc cluster. The connector is a Java library that enables read/write access from Spark and Hadoop directly into BigQuery.

The connector reads from tables, so a BigQuery query result must be saved as a table first.

![Screen Shot 2018-07-15 at 12.48.01 am.png](https://upload-images.jianshu.io/upload_images/9976001-6fcaa78c38c1d404.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240) ![Screen Shot 2018-07-15 at 12.50.02 am.png](https://upload-images.jianshu.io/upload_images/9976001-9a1b2c9c68b70469.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
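
As a rough sketch of what reading through the connector looks like from PySpark: the input is identified by project, dataset, and table IDs, which is why a query result has to be saved as a table before it can be read. The property names and class names below follow the Hadoop BigQuery connector as I recall them from Google's documentation, so check them against the connector version on your cluster. The project, bucket, dataset, and table values are hypothetical.

```python
from typing import Dict, Optional

# Input format class provided by the Hadoop BigQuery connector.
INPUT_FORMAT = "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat"


def bigquery_input_conf(
    project: str,
    bucket: str,
    temp_path: str,
    dataset: str,
    table: str,
    input_project: Optional[str] = None,
) -> Dict[str, str]:
    """Build the Hadoop configuration that points the connector at one table."""
    return {
        "mapred.bq.project.id": project,         # project that runs the job
        "mapred.bq.gcs.bucket": bucket,          # bucket used for staging
        "mapred.bq.temp.gcs.path": temp_path,    # temporary export location
        "mapred.bq.input.project.id": input_project or project,
        "mapred.bq.input.dataset.id": dataset,
        "mapred.bq.input.table.id": table,       # a table, not a query
    }


def read_table_as_rdd(sc, conf: Dict[str, str]):
    """Read the table as an RDD of (row id, JSON string) pairs.

    `sc` is assumed to be a pyspark SparkContext on a cluster where the
    connector JAR is available.
    """
    return sc.newAPIHadoopRDD(
        INPUT_FORMAT,
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf,
    )
```

On a cluster, `read_table_as_rdd(sc, bigquery_input_conf(...))` would be called with the IDs of the table that holds the saved query result.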

Option 3:

Use this option when you want to process data in memory for speed: load the data into a Pandas DataFrame.

In-memory processing is fast, but the data size is limited by the available memory.
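
A minimal sketch of this option, assuming the `google-cloud-bigquery` client library (with pandas support) is installed on the machine running the code. The size check is a hypothetical helper, and its safety factor of 3 is only an assumed rule of thumb for how much larger a DataFrame can be than the raw data. It is there to make the size limit explicit before anything is loaded.

```python
def fits_in_memory(estimated_bytes: int, available_bytes: int,
                   safety_factor: float = 3.0) -> bool:
    """Return True if the data, padded by a safety factor, fits in memory.

    The safety factor is an assumption, not a measured value.
    """
    return estimated_bytes * safety_factor <= available_bytes


def query_to_dataframe(sql: str):
    """Run a query and return the whole result as a Pandas DataFrame."""
    # Imported lazily so the helper above works without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    # to_dataframe() pulls the full result set into local memory.
    return client.query(sql).to_dataframe()
```

For example, `fits_in_memory(1_000_000_000, 8_000_000_000)` passes, while a 4 GB result on the same machine does not, which is the point at which Option 1 or Option 2 becomes the better fit.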

Creating a Dataproc cluster

Ways:
A Deployment Manager template (Deployment Manager is an infrastructure automation service in Google Cloud).
CLI commands (gcloud).
The Google Cloud console.

Keys:

0 Create a cluster specifically for one job.

1 Match your data location to the compute location.
-> Better performance.
-> You can also shut down the cluster when it is not processing jobs.

2 Use Cloud Storage instead of HDFS, and shut down the cluster when it is not processing data.
-> This reduces the complexity of disk provisioning, and because the data outlives the cluster, you can shut the cluster down when it is not processing a job.

3 Use custom machine types to closely match the resources that the job requires.

4 For non-critical jobs requiring huge clusters, use preemptible VMs to hasten results and cut costs at the same time.
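
The keys above can be sketched as a create, submit, delete sequence for a single-job cluster. The snippet only builds `gcloud` command lines and does not run them. The cluster name, region, bucket, and sizes are hypothetical. The `custom-VCPUS-MEMORY_MB` machine type naming and the use of secondary workers (preemptible by default, as far as I know) are assumptions to verify against the current gcloud reference.

```python
from typing import List


def custom_machine_type(vcpus: int, memory_mb: int) -> str:
    """Name a custom machine type; memory must be a multiple of 256 MB."""
    if memory_mb % 256 != 0:
        raise ValueError("memory_mb must be a multiple of 256")
    return f"custom-{vcpus}-{memory_mb}"


def create_cluster_command(name: str, region: str, workers: int,
                           secondary_workers: int, machine_type: str) -> List[str]:
    """Cluster in the same region as the data, sized for one job."""
    return [
        "gcloud", "dataproc", "clusters", "create", name,
        f"--region={region}",
        f"--num-workers={workers}",
        f"--num-secondary-workers={secondary_workers}",  # preemptible capacity
        f"--worker-machine-type={machine_type}",
    ]


def submit_pyspark_command(name: str, region: str, job_uri: str) -> List[str]:
    """The job file lives in Cloud Storage, not on the cluster's HDFS."""
    return ["gcloud", "dataproc", "jobs", "submit", "pyspark", job_uri,
            f"--cluster={name}", f"--region={region}"]


def delete_cluster_command(name: str, region: str) -> List[str]:
    """Delete the cluster as soon as the job is done."""
    return ["gcloud", "dataproc", "clusters", "delete", name,
            f"--region={region}", "--quiet"]


def ephemeral_job_plan(name: str, region: str, job_uri: str) -> List[List[str]]:
    """Create, run one job, delete: the cluster exists only for this job."""
    machine = custom_machine_type(6, 23040)
    return [
        create_cluster_command(name, region, 2, 8, machine),
        submit_pyspark_command(name, region, job_uri),
        delete_cluster_command(name, region),
    ]
```

Because the input and output stay in Cloud Storage, deleting the cluster in the last step loses nothing that the next job needs.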

Author: 塞小娜
Link: https://www.jianshu.com/p/b1e2abe367df
