#clouddataproc
Read PII Data with Google Distributed Cloud Dataproc

PII data
Due to operational or regulatory constraints, Google Cloud clients who are interested in developing or updating their data lake architecture frequently have to keep a portion of their workloads and data on-premises.
Dataproc on Google Distributed Cloud, unveiled in preview at Google Cloud Next ’24, lets you modernise your data lake with cloud-based technologies while building a hybrid data processing footprint, so you can store and process on-premises the data that you are unable to move to the cloud.
Using Google-provided hardware in your data centre, Dataproc on Google Distributed Cloud enables you to run Apache Spark processing workloads on-premises while preserving compatibility between your local and cloud-based technology.
For instance, in order to comply with regulatory obligations, a sizable European telecoms business is updating its data lake on Google Cloud while maintaining Personally Identifiable Information (PII) data on-premises on Google Distributed Cloud.
This post shows how to use Dataproc on Google Distributed Cloud to read PII data that is stored on-premises, compute aggregate metrics, and transfer the final dataset to the data lake in the cloud using Google Cloud Storage.
The dataset in this example contains PII. To comply with regulations, the PII must stay on-site in the customer's own data centre, so the customer stores it on-premises in S3-compatible object storage. The customer now also wants to use its larger data lake in Google Cloud to analyse signal strength by geography and identify the best places to invest in new infrastructure.
Dataproc on Google Distributed Cloud supports fully local execution of Spark jobs that aggregate signal quality, so the aggregated results can be integrated with Google Cloud data analytics while the raw data stays within compliance requirements.
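The aggregation idea can be illustrated with a minimal plain-Python sketch. It is a stand-in for the Spark job, not the job itself, and the record layout (subscriber_id, region, signal_dbm) is an assumption made up for this example. The point it shows is that only per-region aggregates, with no PII columns, leave the function.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical on-premises records; the field names are assumptions.
records = [
    {"subscriber_id": "A-001", "region": "north", "signal_dbm": -71.0},
    {"subscriber_id": "A-002", "region": "north", "signal_dbm": -89.0},
    {"subscriber_id": "B-001", "region": "south", "signal_dbm": -95.0},
]


def aggregate_signal_by_region(rows):
    """Return mean signal strength and sample count per region, dropping PII."""
    by_region = defaultdict(list)
    for row in rows:
        # Only the non-PII fields are read; subscriber_id is never copied.
        by_region[row["region"]].append(row["signal_dbm"])
    return {
        region: {"mean_signal_dbm": mean(values), "samples": len(values)}
        for region, values in by_region.items()
    }


summary = aggregate_signal_by_region(records)
print(summary)
```

In a real deployment the same grouping would run as a Spark aggregation on the on-premises cluster, and only the summary would be written to Cloud Storage.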
Reading PII data with Google Distributed Cloud Dataproc takes several steps, each of which helps keep data processing compliant with privacy requirements.
Start by setting up your Google Cloud environment.
Create a Google Cloud Project: If you don’t have one, create one in GCP.
Project billing: Enable billing for the project.
In your Google Cloud project, enable the Dataproc API, Cloud Storage API, and any other relevant APIs.
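As a rough sketch of the API-enablement step, the helper below only builds the argument list for `gcloud services enable`; it does not run anything. The project ID is a placeholder, and the two service names are the ones commonly used for the Dataproc and Cloud Storage APIs on standard Google Cloud projects. A Google Distributed Cloud deployment may need a different procedure.

```python
import shlex


def enable_apis_command(
    project_id,
    services=("dataproc.googleapis.com", "storage.googleapis.com"),
):
    """Build (but do not run) a gcloud command that enables the given APIs."""
    return ["gcloud", "services", "enable", *services, f"--project={project_id}"]


# "my-example-project" is a placeholder project ID.
print(shlex.join(enable_apis_command("my-example-project")))
```

The list form can be passed directly to `subprocess.run(..., check=True)` once you are ready to execute it.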
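Prepare PII Data
Store PII securely: in on-premises S3-compatible object storage when regulations require the data to stay on-site, otherwise in Google Cloud Storage. Encrypt the data and restrict access to both the bucket and its contents.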
Classifying Data: Label data by sensitivity and compliance.
Create and configure Dataproc Cluster
Create a Dataproc cluster using the Google Cloud Console or the gcloud command-line tool. Set the node count and machine type, and configure the cluster with the software and libraries your job needs.
Security Configuration: Set IAM roles and permissions to restrict data access and processing to authorised users.
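The cluster-creation step might look like the following sketch, which assembles a `gcloud dataproc clusters create` command for a standard Dataproc cluster on Google Cloud. The cluster name, region, project ID, and machine type are placeholder assumptions, and clusters on Google Distributed Cloud may be provisioned differently.

```python
import shlex


def create_cluster_command(name, region, project_id,
                           num_workers=2, worker_machine_type="n1-standard-4"):
    """Build (but do not run) a gcloud command that creates a Dataproc cluster."""
    return [
        "gcloud", "dataproc", "clusters", "create", name,
        f"--region={region}",
        f"--project={project_id}",
        f"--num-workers={num_workers}",
        f"--worker-machine-type={worker_machine_type}",
    ]


# All values below are placeholders chosen for illustration.
print(shlex.join(
    create_cluster_command("pii-agg-cluster", "europe-west1", "my-example-project")
))
```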
Develop Your Data Processing Job
Choose a Processing Framework: Consider Apache Spark or Hadoop.
Write the Data Processing Job: Create a script or app to process PII. This may involve reading GCS data, transforming it, and writing the output to GCS or another storage solution.
Job Submission to Dataproc Cluster
Submit your job to the cluster via the Google Cloud Console, gcloud command-line tool, or Dataproc API.
Check the job status and logs to confirm that it completed successfully.
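Submitting a job and checking on it could be sketched as below. Both helpers only build gcloud argument lists. The script URI, cluster name, job ID, and the `--output` job argument are hypothetical values, and the commands target standard Dataproc rather than any Distributed Cloud specific workflow.

```python
import shlex


def submit_pyspark_command(main_uri, cluster, region, job_args=()):
    """Build a gcloud command that submits a PySpark script to a cluster."""
    cmd = [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", main_uri,
        f"--cluster={cluster}",
        f"--region={region}",
    ]
    if job_args:
        # Everything after "--" is passed to the script itself.
        cmd += ["--", *job_args]
    return cmd


def describe_job_command(job_id, region):
    """Build a gcloud command that shows the status of a submitted job."""
    return ["gcloud", "dataproc", "jobs", "describe", job_id, f"--region={region}"]


# Placeholder URIs and names; "--output" is an argument of the hypothetical script.
print(shlex.join(submit_pyspark_command(
    "gs://example-bucket/jobs/aggregate_signal.py",
    "pii-agg-cluster",
    "europe-west1",
    ["--output=gs://example-datalake/aggregates/"],
)))
print(shlex.join(describe_job_command("example-job-id", "europe-west1")))
```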
Compliance and Data Security
Encrypt data at rest and in transit.
Use IAM policies to restrict data and resource access.
Compliance: Follow data protection laws including GDPR and CCPA.
Delete the Dataproc Cluster
To save money, delete the Dataproc cluster once data processing is finished.
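One way to make sure the cluster is always removed, even when a job fails, is to wrap its lifetime in a context manager. This is a sketch under assumptions: the `run` callback is a hypothetical hook that receives each gcloud argument list, and the cluster name and region are placeholders. Here the callback just records the commands instead of executing them.

```python
from contextlib import contextmanager


@contextmanager
def ephemeral_cluster(name, region, run):
    """Create a cluster, yield its name, and always delete it afterwards."""
    run(["gcloud", "dataproc", "clusters", "create", name, f"--region={region}"])
    try:
        yield name
    finally:
        # --quiet skips the interactive confirmation prompt.
        run(["gcloud", "dataproc", "clusters", "delete", name,
             f"--region={region}", "--quiet"])


# Record the commands rather than running them.
issued = []
with ephemeral_cluster("pii-agg-cluster", "europe-west1", issued.append) as cluster:
    pass  # submit jobs against `cluster` here

for argv in issued:
    print(" ".join(argv))
```

To execute the commands for real, pass something like `lambda argv: subprocess.run(argv, check=True)` as `run`.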
Best Practices
Always mask or anonymize PII data when processing.
Track PII data access and changes with thorough logging and monitoring.
Regularly audit data access and processing for compliance.
Data minimization: Process just the PII data you need.
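Masking and data minimization can be combined in a small sketch like the one below. It uses keyed hashing (HMAC-SHA256) from the Python standard library to replace identifiers with tokens and keeps only the fields a job needs. The field names and the key are assumptions for illustration. Note that keyed hashing is pseudonymization rather than full anonymization, and key management is outside the scope of the sketch.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes) -> str:
    """Replace a PII value with a keyed HMAC-SHA256 token (hex string)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize(row: dict, keep: tuple, pii: tuple, key: bytes) -> dict:
    """Keep only the needed non-PII fields and tokenise the listed PII fields."""
    out = {field: row[field] for field in keep}
    for field in pii:
        out[field + "_token"] = pseudonymize(row[field], key)
    return out


# Hypothetical record and key; in practice the key would come from a secret store.
row = {"subscriber_id": "A-001", "name": "Jane Doe", "region": "north", "signal_dbm": -71.0}
safe = minimize(row, keep=("region", "signal_dbm"), pii=("subscriber_id",), key=b"example-key")
print(safe)
```

The same token is produced for the same input and key, so records can still be joined or counted per subscriber without exposing the raw identifier.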
Conclusion
Processing PII with Google Distributed Cloud Dataproc requires careful design and execution to maintain data protection and compliance. Follow the steps and best practices above to use Dataproc for data processing while protecting sensitive data.
Dataproc
The managed, scalable Dataproc service supports Apache Hadoop, Spark, Flink, Presto, and over thirty open source tools and frameworks. For safe data science, ETL, and data lake modernization at scale that is integrated with Google Cloud at a significantly lower cost, use Dataproc.
ADVANTAGES
Bring your open source data processing up to date.
Serverless deployment, logging, and monitoring let you focus on your data and analytics rather than on your infrastructure. Cut the TCO of Apache Spark management by as much as 54%. Build and train models five times faster.
Intelligent and seamless OSS for data science
Native integrations with BigQuery, Dataplex, Vertex AI, and OSS notebooks such as JupyterLab let data scientists and analysts carry out data science tasks with ease.
Google Cloud integration with enterprise security
Features for security include OS Login, customer-managed encryption keys (CMEK), VPC Service Controls, and default at-rest encryption. Add a security setting to enable Hadoop Secure Mode using Kerberos.
Key features
Fully managed and automated open source big data software
Serverless deployment, logging, and monitoring let you focus on your data and analytics rather than on your infrastructure. Cut the TCO of Apache Spark management by as much as 54%. Integration with Vertex AI Workbench lets data scientists and engineers build and train models 5X faster than with standard notebooks. The Dataproc Jobs API makes it simple to integrate big data processing into custom applications, while Dataproc Metastore removes the need to run your own Hive metastore or catalogue service.
Containerise Apache Spark jobs with Kubernetes
Build your Apache Spark jobs with Dataproc on Kubernetes so that you can use Dataproc with Google Kubernetes Engine (GKE) to provide job portability and isolation.
Google Cloud integration with enterprise security
When you create a Dataproc cluster, you can enable Hadoop Secure Mode via Kerberos by adding a Security Configuration. In addition, some of the Google Cloud-specific security features most commonly used with Dataproc are default at-rest encryption, OS Login, VPC Service Controls, and customer-managed encryption keys (CMEK).
The best of Google Cloud combined with the finest of open source
More than 30 open source frameworks, including Apache Hadoop, Spark, Flink, and Presto, are supported by the managed, scalable Dataproc service. Simultaneously, Dataproc offers native integration with the whole Google Cloud database, analytics, and artificial intelligence ecosystem. Building data applications and linking Dataproc to BigQuery, Vertex AI, Spanner, Pub/Sub, or Data Fusion is a breeze for data scientists and developers.
Read more on govindhtech.com

P'Tiw treated me to shrimp fried rice (#ข้าวผัดกุ้ง), so I'm having a late-night snack while trying out the @AcloudGuru #Playground to test #CloudComposer & #CloudDataproc (at Uthai Thani) https://www.instagram.com/p/CkBWlm1puMf/?igshid=NGJjMDIxMWI=
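AWS/Azure/GCP Service Comparison 2019.05
from https://qiita.com/hayao_k/items/906ac1fba9e239e08ae8?utm_campaign=popular_items&utm_medium=feed&utm_source=popular_items
Introduction
Based on the list of AWS services (linked in the original article), this post lists the corresponding service on each cloud.
Services that Azure/GCP offer but AWS does not may be missing.
Some entries reflect my own judgment, and the services may not match exactly; please bear with me.
Offerings provided by Microsoft/Google outside the cloud platform itself, such as Office 365 and G Suite, are shown in parentheses ( ).
Physical devices and tool sets such as SDKs are not listed.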
Analytics
Category | AWS | Azure | GCP
Querying a data lake | Amazon Athena | Azure Data Lake Analytics | Google BigQuery
Search | Amazon CloudSearch | Azure Search | -
Deploying Hadoop clusters | Amazon EMR | HDInsight/Azure Databricks | Cloud Dataproc
Deploying Elasticsearch clusters | Amazon Elasticsearch Service | - | -
Stream processing | Amazon Kinesis | Azure Event Hubs | Cloud Dataflow
Deploying Kafka clusters | Amazon Managed Streaming for Kafka | - | -
Data warehouse (DWH) | Amazon Redshift | Azure SQL Data Warehouse | Google BigQuery
BI service | Amazon QuickSight | (Power BI) | (Google Data Studio)
Workflow orchestration | AWS Data Pipeline | Azure Data Factory | Cloud Composer
ETL | AWS Glue | Azure Data Factory | Cloud Data Fusion
Building a data lake | AWS Lake Formation | - | -
Data catalog | AWS Glue | Azure Data Catalog | Cloud Data Catalog
Application Integration
Category | AWS | Azure | GCP
Building distributed applications | AWS Step Functions | Azure Logic Apps | -
Message queue | Amazon Simple Queue Service | Azure Queue Storage | -
Pub/Sub | Amazon Simple Notification Service | Azure Service Bus | Cloud Pub/Sub
Deploying ActiveMQ | Amazon MQ | - | -
GraphQL | AWS AppSync | - | -
Event delivery | Amazon CloudWatch Events | Event Grid | -
Blockchain
Category | AWS | Azure | GCP
Creating and managing networks | Amazon Managed Blockchain | Azure Blockchain Service | -
Ledger database | Amazon Quantum Ledger Database | - | -
Building applications | - | Azure Blockchain Workbench | -
Business Applications
Category | AWS | Azure | GCP
Alexa | Alexa for Business | - | -
Online meetings | Amazon Chime | (Office 365) | (G Suite)
Email | Amazon WorkMail | (Office 365) | (G Suite)
Compute
Category | AWS | Azure | GCP
Virtual machines | Amazon EC2 | Azure Virtual Machines | Compute Engine
Autoscaling | Amazon EC2 Auto Scaling | Virtual Machine Scale Sets | Autoscaling
Container orchestrator | Amazon Elastic Container Service | Service Fabric | -
Kubernetes | Amazon Elastic Container Service for Kubernetes | Azure Kubernetes Service | Google Kubernetes Engine
Container registry | Amazon Elastic Container Registry | Azure Container Registry | Container Registry
VPS | Amazon Lightsail | - | -
Batch computing | AWS Batch | Azure Batch | -
Web application runtime environment | AWS Elastic Beanstalk | Azure App Service | App Engine
Function as a Service | AWS Lambda | Azure Functions | Cloud Functions
Serverless application repository | AWS Serverless Application Repository | - | -
Deploying VMware environments | VMware Cloud on AWS | Azure VMware Solutions | -
On-premises deployment | AWS Outposts | Azure Stack | Cloud Platform Service
Building a hybrid cloud | - | - | Anthos
Running stateless HTTP containers | - | - | Cloud Run
Cost Management
Category | AWS | Azure | GCP
Visualizing usage | AWS Cost Explorer | Azure Cost Management | -
Budget management | AWS Budgets | Azure Cost Management | -
Managing reserved instances | Reserved Instance Reporting | Azure Cost Management | -
Usage reports | AWS Cost & Usage Report | Azure Cost Management | -
Customer Engagement
Category | AWS | Azure | GCP
Contact center | Amazon Connect | - | Contact Center AI
Personalizing engagement | Amazon Pinpoint | Notification Hubs | -
Sending and receiving email | Amazon Simple Email Service | - | -
Database
Category | AWS | Azure | GCP
MySQL | Amazon RDS for MySQL/Amazon Aurora | Azure Database for MySQL | Cloud SQL for MySQL
PostgreSQL | Amazon RDS for PostgreSQL/Amazon Aurora | Azure Database for PostgreSQL | Cloud SQL for PostgreSQL
Oracle | Amazon RDS for Oracle | - | -
SQL Server | Amazon RDS for SQL Server | SQL Database | Cloud SQL for SQL Server
MariaDB | Amazon RDS for MariaDB | Azure Database for MariaDB | -
NoSQL | Amazon DynamoDB | Azure Cosmos DB | Cloud Datastore/Cloud Bigtable
In-memory cache | Amazon ElastiCache | Azure Cache for Redis | Cloud Memorystore
Graph DB | Amazon Neptune | Azure Cosmos DB(API for Gremlin) | -
Time-series DB | Amazon Timestream | - | -
MongoDB | Amazon DocumentDB (with MongoDB compatibility) | Azure Cosmos DB(API for MongoDB) | -
Globally distributed RDB | - | - | Cloud Spanner
Real-time DB | - | - | Cloud Firestore
Edge-deployable DB | - | Azure SQL Database Edge | -
Developer Tools
Category | AWS | Azure | GCP
Managing development projects | AWS CodeStar | Azure DevOps | -
Git repository | AWS CodeCommit | Azure Repos | Cloud Source Repositories
Continuous build and test | AWS CodeBuild | Azure Pipelines | Cloud Build
Continuous deployment | AWS CodeDeploy | Azure Pipelines | Cloud Build
Pipelines | AWS CodePipeline | Azure Pipelines | Cloud Build
Work management | - | Azure Boards | -
Package registry | - | Azure Artifacts | -
Test plan management | - | Azure Test Plans | -
IDE | AWS Cloud9 | (Visual Studio Online) | -
Distributed tracing | AWS X-Ray | Azure Application Insights | Stackdriver Trace
End User Computing
Category | AWS | Azure | GCP
Desktop | Amazon WorkSpaces | Windows Virtual Desktop | -
Application streaming | Amazon AppStream 2.0 | - | -
Storage | Amazon WorkDocs | (Office 365) | (G Suite)
Access to internal applications | Amazon WorkLink | Azure AD Application Proxy | -
Internet of Things
Category | AWS | Azure | GCP
Connecting devices to the cloud | AWS IoT Core | Azure IoT Hub | Cloud IoT Core
Deploying to the edge | AWS Greengrass | Azure IoT Edge | Cloud IoT Edge
Running arbitrary functions from devices | AWS IoT 1-Click | - | -
Device analytics | AWS IoT Analytics | Azure Stream Analytics/Azure Time Series Insights | -
Device security management | AWS IoT Device Defender | - | -
Device management | AWS IoT Device Management | Azure IoT Hub | Cloud IoT Core
Detecting events that occur on devices | AWS IoT Events | - | -
Collecting data from industrial equipment | AWS IoT SiteWise | - | -
Building IoT applications | AWS IoT Things Graph | Azure IoT Central | -
Location information | - | Azure Maps | Google Maps Platform
Modeling the real world | - | Azure Digital Twins | -
Machine Learning
Category | AWS | Azure | GCP
Building machine learning models | Amazon SageMaker | Azure Machine Learning Service | Cloud ML Engine
Natural language processing | Amazon Comprehend | Language Understanding | Cloud Natural Language
Building chatbots | Amazon Lex | Azure Bot Service | (Dialogflow)
Text-to-Speech | Amazon Polly | Speech Services | Cloud Text-to-Speech
Image recognition | Amazon Rekognition | Computer Vision | Cloud Vision
Translation | Amazon Translate | Translator Text | Cloud Translation
Speech-to-Text | Amazon Transcribe | Speech Services | Cloud Speech-to-Text
Recommendations | Amazon Personalize | - | Recommendations AI
Time-series forecasting | Amazon Forecast | - | -
Document detection | Amazon Textract | - | -
Inference acceleration | Amazon Elastic Inference | - | -
Building datasets | Amazon SageMaker Ground Truth | - | -
Customizing vision models | - | Custom Vision | Cloud AutoML Vision
Customizing speech models | - | Custom Speech | -
Customizing language processing models | Amazon Comprehend | - | Cloud AutoML Natural Language
Customizing translation models | - | - | Cloud AutoML Translation
Management & Governance
Category | AWS | Azure | GCP
Monitoring | Amazon CloudWatch | Azure Monitor | Google Stackdriver
Creating and managing resources | AWS CloudFormation | Azure Resource Manager | Cloud Deployment Manager
Activity tracking | AWS CloudTrail | Azure Activity Log | -
Recording and auditing resource configuration changes | AWS Config | - | -
Deploying configuration management services | AWS OpsWorks(Chef/Puppet) | - | -
Managing IT service catalogs | AWS Service Catalog | - | Private Catalog
Infrastructure visibility and control | AWS Systems Manager | - | -
Performance and security optimization | AWS Trusted Advisor | Azure Advisor | -
Status of the services in use | AWS Personal Health Dashboard | Azure Resource Health | -
Setting up standards-compliant accounts | AWS Control Tower | Azure Policy | -
License management | AWS License Manager | - | -
Reviewing and improving workloads | AWS Well-Architected Tool | - | -
Managing multiple accounts | AWS Organizations | Subscription+RBAC | -
Disaster recovery | - | Azure Site Recovery | -
Browser-based shell | AWS Systems Manager Session Manager | Cloud Shell | Cloud Shell
Media Services
Category | AWS | Azure | GCP
Media conversion | Amazon Elastic Transcoder/AWS Elemental MediaConvert | Azure Media Services - Encoding | (Anvato)
Live video processing | AWS Elemental MediaLive | Azure Media Services - Live and On-demand Streaming | (Anvato)
Video delivery and packaging | AWS Elemental MediaPackage | Azure Media Services | (Anvato)
Storage for video files | AWS Elemental MediaStore | - | -
Targeted ad insertion | AWS Elemental MediaTailor | - | -
Migration & Transfer
Category | AWS | Azure | GCP
Migration management | AWS Migration Hub | - | -
Migration assessment | AWS Application Discovery Service | Azure Migrate | -
Database migration | AWS Database Migration Service | Azure Database Migration Service | -
Data transfer from on-premises | AWS DataSync | - | -
Server migration | AWS Server Migration Service | Azure Site Recovery | -
Large-volume data migration | Snow family | Azure Data Box | Transfer Appliance
SFTP | AWS Transfer for SFTP | - | -
Data transfer between clouds | - | - | Cloud Storage Transfer Service
Mobile
Category | AWS | Azure | GCP
Building and deploying mobile/web applications | AWS Amplify | Mobile Apps | (Firebase)
Application testing | AWS Device Farm | (Xamarin Test Cloud) | (Firebase Test Lab)
Networking & Content Delivery
Category | AWS | Azure | GCP
Virtual network | Amazon Virtual Private Cloud | Azure Virtual Network | Virtual Private Cloud
API management | Amazon API Gateway | API Management | Cloud Endpoints/Apigee
CDN | Amazon CloudFront | Azure CDN | Cloud CDN
DNS | Amazon Route 53 | Azure DNS | Cloud DNS
Private connectivity | Amazon VPC PrivateLink | Virtual Network Service Endpoints | Private Access Options for Services
Service mesh | AWS App Mesh | Azure Service Fabric Mesh | Traffic Director
Service discovery | AWS Cloud Map | - | -
Dedicated line connection | AWS Direct Connect | ExpressRoute | Cloud Interconnect
Global load balancer | AWS Global Accelerator | Azure Traffic Manager | Cloud Load Balancing
Hub-and-spoke network connectivity | AWS Transit Gateway | - | -
Network performance monitoring | - | Network Watcher | -
Security, Identity & Compliance
Category | AWS | Azure | GCP
Identity management | AWS Identity and Access Management | Azure Active Directory | Cloud IAM
Hierarchical data store | Amazon Cloud Directory | - | -
Application identity management | Amazon Cognito | Azure Mobile Apps | -
Threat detection | Amazon GuardDuty | Azure Security Center | Cloud Security Command Center
Server security assessment | Amazon Inspector | Azure Security Center | Cloud Security Command Center
Discovering and protecting sensitive data | Amazon Macie | Azure Information Protection | -
Access to compliance reports | AWS Artifact | (Service Trust Portal) | -
SSL/TLS certificate management | AWS Certificate Manager | App Service Certificates | Google-managed SSL certificates
Hardware security module | AWS CloudHSM | Azure Dedicated HSM | Cloud HSM
Active Directory | AWS Directory Service | Azure Active Directory | Managed Service for Microsoft Active Directory
Centralized management of firewall rules | AWS Firewall Manager | - | -
Key creation and management | AWS Key Management Service | Azure Key Vault | Cloud Key Management Service
Secrets management | AWS Secrets Manager | Azure Key Vault | -
Centralized management of security information | AWS Security Hub | Azure Sentinel | -
DDoS protection | AWS Shield | Azure DDoS Protection | Cloud Armor
Single sign-on | AWS Single Sign-On | Azure Active Directory B2C | Cloud Identity
WAF | AWS WAF | Azure Application Gateway | Cloud Armor
Storage
Category | AWS | Azure | GCP
Object storage | Amazon S3 | Azure Blob | Cloud Storage
Block storage | Amazon EBS | Disk Storage | Persistent Disk
File storage (NFS) | Amazon Elastic File System | Azure NetApp Files | Cloud Filestore
File storage (SMB) | Amazon FSx for Windows File Server | Azure Files | -
File system for HPC | Amazon FSx for Lustre | Azure FXT Edge Filer | -
Archive storage | Amazon S3 Glacier | Storage archive access tier | Cloud Storage Coldline
Centralized backup management | AWS Backup | Azure Backup | -
Hybrid storage | AWS Storage Gateway | Azure StorSimple | -
Other
Category | AWS | Azure | GCP
Creating AR/VR content | Amazon Sumerian | - | -
Game server hosting | Amazon GameLift | - | -
Game engine | Amazon Lumberyard | - | -
Robotics | RoboMaker | - | -
Satellites | Ground Station | - | -