#clouddataproc | Explore Tumblr posts and blogs

govindhtech · 1 year ago

Read PII Data with Google Distributed Cloud Dataproc

PII data

Due to operational or regulatory constraints, Google Cloud clients who are interested in developing or updating their data lake architecture frequently have to keep a portion of their workloads and data on-premises.

Dataproc on Google Distributed Cloud, unveiled in preview at Google Cloud Next ’24, now lets you modernise your data lake with cloud-based technologies while building a hybrid data processing footprint, so you can store and process on-premises the data that you cannot move to the cloud.

Using Google-provided hardware in your data centre, Dataproc on Google Distributed Cloud enables you to run Apache Spark processing workloads on-premises while preserving compatibility between your local and cloud-based technology.

For instance, in order to comply with regulatory obligations, a sizable European telecoms business is updating its data lake on Google Cloud while maintaining Personally Identifiable Information (PII) data on-premises on Google Distributed Cloud.

Google Cloud will demonstrate in this blog how to utilise Dataproc on Google Distributed Cloud to read PII data that is stored on-premises, compute aggregate metrics, and transfer the final dataset to the cloud’s data lake using Google Cloud Storage.

The dataset in this example contains PII, which must stay on-site in the customer's own data centre to comply with regulations. To meet this requirement, the customer stores the data on-premises in S3-compatible object storage. The customer now wants to use its larger data lake in Google Cloud to analyse signal strength by geography and decide where to invest in new infrastructure.

Dataproc on Google Distributed Cloud supports fully local execution of Spark jobs that aggregate signal quality, which allows integration with Google Cloud data analytics while meeting compliance requirements.
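As a rough illustration of that aggregation step, the sketch below groups signal measurements by region and discards the subscriber identifier before anything leaves the on-premises side. It is plain Python rather than Spark, and the record layout (`subscriber_id`, `region`, `signal_dbm`) is an assumption made for this example, not the schema of the dataset described above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical call records; the field names are assumptions for illustration.
RECORDS = [
    {"subscriber_id": "A-001", "region": "north", "signal_dbm": -70},
    {"subscriber_id": "A-002", "region": "north", "signal_dbm": -90},
    {"subscriber_id": "B-003", "region": "south", "signal_dbm": -60},
]


def aggregate_signal_by_region(records):
    """Return per-region sample counts and average signal, with no PII fields."""
    by_region = defaultdict(list)
    for record in records:
        # Only the region and the measurement are carried forward;
        # the subscriber identifier never enters the output.
        by_region[record["region"]].append(record["signal_dbm"])
    return {
        region: {"samples": len(values), "avg_signal_dbm": mean(values)}
        for region, values in by_region.items()
    }


if __name__ == "__main__":
    for region, stats in aggregate_signal_by_region(RECORDS).items():
        print(region, stats)
```

In a Spark job the same idea would typically be a group-by on the region column followed by an average, with only the aggregated rows written to Cloud Storage.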

Reading PII data with Google Distributed Cloud Dataproc involves several steps that keep data processing compliant and private.

First, set up your Google Cloud environment.

Create a Google Cloud Project: If you don’t have one, create one in GCP.

Project billing: Enable billing.

In your Google Cloud project, enable the Dataproc API, Cloud Storage API, and any other relevant APIs.
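The API-enablement step can be scripted. The sketch below only builds the argument list for `gcloud services enable` and prints it; nothing is executed. The project ID `example-project` is a placeholder, and the two service names are the ones normally used for the Dataproc and Cloud Storage APIs. A Dataproc on Google Distributed Cloud deployment may need additional setup that is not covered here.

```python
import shlex


def enable_services_command(project_id, services):
    """Build the argv for `gcloud services enable` (nothing is executed here)."""
    if not services:
        raise ValueError("at least one service is required")
    return ["gcloud", "services", "enable", *services, f"--project={project_id}"]


# "example-project" is a placeholder project ID, not a real project.
REQUIRED_SERVICES = ["dataproc.googleapis.com", "storage.googleapis.com"]

if __name__ == "__main__":
    command = enable_services_command("example-project", REQUIRED_SERVICES)
    # Print a copy-pasteable shell command for review before running it.
    print(shlex.join(command))
```

Building the command as a list keeps quoting problems out of the picture if the list is later handed to `subprocess.run`.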

Prepare PII

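Store PII securely. In the hybrid scenario described above, the PII stays in the on-premises S3-compatible object store, and only aggregated output is written to Google Cloud Storage. Encrypt the data and restrict access to buckets and objects.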

Classifying Data: Label data by sensitivity and compliance.

Create and configure Dataproc Cluster

Create a Dataproc cluster using the Google Cloud Console or the gcloud command-line tool. Set the node count and machine type, and configure the cluster with the software and libraries your jobs need.

Security Configuration: Set IAM roles and permissions to restrict data access and processing to authorised users.
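A minimal sketch of the cluster-creation step with the gcloud tool follows. It assembles a `gcloud dataproc clusters create` command from a small spec object and prints it without running it. The cluster name, region, worker count, and machine type are illustrative assumptions, and the command targets a standard Dataproc cluster in Google Cloud; creating Dataproc resources on Google Distributed Cloud may use a different procedure.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Illustrative cluster settings; every default here is an assumption."""
    name: str
    region: str
    num_workers: int = 2
    worker_machine_type: str = "n2-standard-4"


def create_cluster_command(spec):
    """Build the argv for `gcloud dataproc clusters create` (not executed)."""
    return [
        "gcloud", "dataproc", "clusters", "create", spec.name,
        f"--region={spec.region}",
        f"--num-workers={spec.num_workers}",
        f"--worker-machine-type={spec.worker_machine_type}",
    ]


if __name__ == "__main__":
    spec = ClusterSpec(name="pii-aggregation", region="europe-west1")
    print(shlex.join(create_cluster_command(spec)))
```

Keeping the settings in one immutable object makes it easy to review the node count and machine type before any cluster is created.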

Develop Your Data Processing Job

Choose a Processing Framework: Consider Apache Spark or Hadoop.

Write the Data Processing Job: Create a script or app to process PII. This may involve reading GCS data, transforming it, and writing the output to GCS or another storage solution.

Job Submission to Dataproc Cluster

Submit your job to the cluster via the Google Cloud Console, gcloud command-line tool, or Dataproc API.

Check the job status and logs to confirm that the job completed.
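Checking for completion usually means polling the job until it reaches a terminal state. The sketch below shows one way to structure that loop. The `get_state` callable is a hypothetical hook: in practice it would wrap whatever you use to look up the job (for example the Dataproc API or `gcloud dataproc jobs describe`), and here it is replaced by a canned sequence of states. The terminal state names follow the Dataproc job states `DONE`, `ERROR`, and `CANCELLED`.

```python
import time

# Dataproc job states after which no further progress is expected.
TERMINAL_STATES = {"DONE", "ERROR", "CANCELLED"}


def wait_for_job(get_state, poll_seconds=10.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state, then return that state.

    get_state is a caller-supplied function (hypothetical here) that returns
    the current job state as a string. Raises TimeoutError if the deadline passes.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {state} after {timeout_seconds} seconds")
        sleep(poll_seconds)


if __name__ == "__main__":
    # Canned states stand in for real status lookups.
    fake_states = iter(["PENDING", "RUNNING", "RUNNING", "DONE"])
    final = wait_for_job(lambda: next(fake_states), poll_seconds=0)
    print("final state:", final)
```

A final state of `DONE` only says the job finished; the driver output and logs still need a look to confirm the results are what you expected.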

Compliance and Data Security

Encrypt data at rest and in transit.

Use IAM policies to restrict data and resource access.

Compliance: Follow data protection laws including GDPR and CCPA.

Destruction of Dataproc Cluster

To save money, destroy the Dataproc cluster after data processing.

Best Practices

Always mask or anonymize PII data when processing.

Track PII data access and changes with thorough logging and monitoring.

Regularly audit data access and processing for compliance.

Data minimization: Process just the PII data you need.
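The masking and minimization practices above can be sketched in a few lines. The example keeps only the fields an analysis needs and replaces the subscriber identifier with a keyed HMAC-SHA256 token. The field names and the key are assumptions for illustration, and a real key would come from a secret manager rather than source code. Note that keyed hashing is pseudonymization, not anonymization: anyone holding the key can link tokens back to identifiers, so the output may still count as personal data under regulations such as GDPR.

```python
import hashlib
import hmac

# Fields the analysis actually needs; everything else is dropped.
ALLOWED_FIELDS = ("region", "signal_dbm")


def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed HMAC-SHA256 digest (hex string)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize(record, secret_key):
    """Keep only the allowed fields plus a pseudonymous subscriber token."""
    slim = {name: record[name] for name in ALLOWED_FIELDS}
    slim["subscriber_token"] = pseudonymize(record["subscriber_id"], secret_key)
    return slim


if __name__ == "__main__":
    # Hypothetical record and key, for demonstration only.
    record = {
        "subscriber_id": "A-001",
        "name": "Jane Doe",
        "region": "north",
        "signal_dbm": -70,
    }
    print(minimize(record, secret_key=b"demo-key-not-for-production"))
```

Using an allow-list rather than a block-list means a new sensitive column added upstream is excluded by default.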

Conclusion

PII processing with Google Distributed Cloud Dataproc requires careful design and execution to maintain data protection and compliance. Follow the methods and recommended practices above to use Dataproc for data processing while protecting sensitive data.

Dataproc

Dataproc is a managed, scalable service that supports Apache Hadoop, Spark, Flink, Presto, and more than thirty other open source tools and frameworks. Use Dataproc for secure data science, ETL, and data lake modernization at scale, integrated with Google Cloud and at a significantly lower cost.

ADVANTAGES

Bring your open source data processing up to date.

OSS for data science that is seamless and intelligent

Provide native connections with BigQuery, Dataplex, Vertex AI, and OSS notebooks like JupyterLab to let data scientists and analysts do data science tasks with ease.

Google Cloud integration with enterprise security

Security features include OS Login, customer-managed encryption keys (CMEK), VPC Service Controls, and default at-rest encryption. Add a Security Configuration to enable Hadoop Secure Mode with Kerberos.

Important characteristics

Completely automated and managed open-source big data applications

Serverless deployment, logging, and monitoring let you focus on your data and analytics rather than your infrastructure. Cut the TCO of Apache Spark management by as much as 54%. Integrate with Vertex AI Workbench so that data scientists and engineers can build and train models 5X faster than with standard notebooks. The Dataproc Jobs API makes it simple to integrate big data processing into custom applications, while Dataproc Metastore removes the need to manage your own Hive metastore or catalogue service.

Use Kubernetes to containerise Apache Spark jobs

Build your Apache Spark jobs with Dataproc on Kubernetes so that you can use Dataproc with Google Kubernetes Engine (GKE) to provide job portability and isolation.

Google Cloud integration with enterprise security

By adding a Security Configuration, you can use Kerberos to enable Hadoop Secure Mode when you construct a Dataproc cluster. Additionally, customer-managed encryption keys (CMEK), OS Login, VPC Service Controls, and default at-rest encryption are some of the most often utilised Google Cloud-specific security features employed with Dataproc.

The best of Google Cloud combined with the finest of open source

More than 30 open source frameworks, including Apache Hadoop, Spark, Flink, and Presto, are supported by the managed, scalable Dataproc service. Simultaneously, Dataproc offers native integration with the whole Google Cloud database, analytics, and artificial intelligence ecosystem. Building data applications and linking Dataproc to BigQuery, Vertex AI, Spanner, Pub/Sub, or Data Fusion is a breeze for data scientists and developers.

#GoogleCloud #GoogleCloudNext #VertexAI #BigQuery #Dataplex #PIIdata #clouddataproc #cloudata #cloudstorage #API #news #technews #technology #technologynews #technologytrends #govindhtech

masaa-ma · 6 years ago

AWS/Azure/GCP Service Comparison 2019.05

from https://qiita.com/hayao_k/items/906ac1fba9e239e08ae8?utm_campaign=popular_items&utm_medium=feed&utm_source=popular_items

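Introduction

The corresponding services on each cloud are listed here, based on this list of AWS services.

Services that AWS does not offer but Azure/GCP do may be missing.

Some entries reflect my own judgment, and the services do not always match exactly. Please bear with me.

Products offered by Microsoft/Google outside the cloud platform itself, such as Office 365 and G Suite, are shown in parentheses ( ).

Physical devices and tools such as SDKs are not listed.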

Analytics

Function | AWS | Azure | GCP
Querying a data lake | Amazon Athena | Azure Data Lake Analytics | Google BigQuery
Search | Amazon CloudSearch | Azure Search | -
Hadoop cluster deployment | Amazon EMR | HDInsight/Azure Databricks | Cloud Dataproc
Elasticsearch cluster deployment | Amazon Elasticsearch Service | - | -
Stream processing | Amazon Kinesis | Azure Event Hubs | Cloud Dataflow
Kafka cluster deployment | Amazon Managed Streaming for Kafka | - | -
Data warehouse (DWH) | Amazon Redshift | Azure SQL Data Warehouse | Google BigQuery
BI service | QuickSight | (Power BI) | (Google Data Studio)
Workflow orchestration | AWS Data Pipeline | Azure Data Factory | Cloud Composer
ETL | AWS Glue | Azure Data Factory | Cloud Data Fusion
Building a data lake | AWS Lake Formation | - | -
Data catalog | AWS Glue | Azure Data Catalog | Cloud Data Catalog

Application Integration

Function | AWS | Azure | GCP
Building distributed applications | AWS Step Functions | Azure Logic Apps | -
Message queue | Amazon Simple Queue Service | Azure Queue Storage | -
Pub/Sub | Amazon Simple Notification Service | Azure Service Bus | Cloud Pub/Sub
ActiveMQ deployment | Amazon MQ | |
GraphQL | AWS AppSync | - | -
Event delivery | Amazon CloudWatch Events | Event Grid | -

Blockchain

Function | AWS | Azure | GCP
Creating and managing networks | Amazon Managed Blockchain | Azure Blockchain Service | -
Ledger database | Amazon Quantum Ledger Database | - | -
Building applications | - | Azure Blockchain Workbench | -

Business Applications

Function | AWS | Azure | GCP
Alexa | Alexa for Business | - | -
Online meetings | Amazon Chime | (Office 365) | (G Suite)
Email | Amazon WorkMail | (Office 365) | (G Suite)

Compute

Function | AWS | Azure | GCP
Virtual machines | Amazon EC2 | Azure Virtual Machines | Compute Engine
Autoscaling | Amazon EC2 Auto Scaling | Virtual Machine Scale Sets | Autoscaling
Container orchestrator | Amazon Elastic Container Service | Service Fabric | -
Kubernetes | Amazon Elastic Container Service for Kubernetes | Azure Kubernetes Service | Google Kubernetes Engine
Container registry | Amazon Elastic Container Registry | Azure Container Registry | Container Registry
VPS | Amazon Lightsail | - | -
Batch computing | AWS Batch | Azure Batch | -
Web application runtime | AWS Elastic Beanstalk | Azure App Service | App Engine
Function as a Service | AWS Lambda | Azure Functions | Cloud Functions
Serverless application repository | AWS Serverless Application Repository | - | -
VMware environment deployment | VMware Cloud on AWS | Azure VMware Solutions | -
On-premises deployment | AWS Outposts | Azure Stack | Cloud Platform Service
Building a hybrid cloud | - | - | Anthos
Running stateless HTTP containers | - | - | Cloud Run

Cost Management

Function | AWS | Azure | GCP
Usage visualization | AWS Cost Explorer | Azure Cost Management | -
Budget management | AWS Budgets | Azure Cost Management | -
Reserved instance management | Reserved Instance Reporting | Azure Cost Management | -
Usage reports | AWS Cost & Usage Report | Azure Cost Management | -

Customer Engagement

Function | AWS | Azure | GCP
Contact center | Amazon Connect | - | Contact Center AI
Engagement personalization | Amazon Pinpoint | Notification Hubs | -
Sending and receiving email | Amazon Simple Email Service | - | -

Database

Function | AWS | Azure | GCP
MySQL | Amazon RDS for MySQL/Amazon Aurora | Azure Database for MySQL | Cloud SQL for MySQL
PostgreSQL | Amazon RDS for PostgreSQL/Amazon Aurora | Azure Database for PostgreSQL | Cloud SQL for PostgreSQL
Oracle | Amazon RDS for Oracle | - | -
SQL Server | Amazon RDS for SQL Server | SQL Database | Cloud SQL for SQL Server
MariaDB | Amazon RDS for MariaDB | Azure Database for MariaDB | -
NoSQL | Amazon DynamoDB | Azure Cosmos DB | Cloud Datastore/Cloud Bigtable
In-memory cache | Amazon ElastiCache | Azure Cache for Redis | Cloud Memorystore
Graph DB | Amazon Neptune | Azure Cosmos DB(API for Gremlin) | -
Time-series DB | Amazon Timestream | - | -
MongoDB | Amazon DocumentDB (with MongoDB compatibility) | Azure Cosmos DB(API for MongoDB) | -
Globally distributed RDB | - | - | Cloud Spanner
Real-time DB | - | - | Cloud Firestore
DB deployable at the edge | - | Azure SQL Database Edge | -

Developer Tools

Function | AWS | Azure | GCP
Development project management | AWS CodeStar | Azure DevOps | -
Git repository | AWS CodeCommit | Azure Repos | Cloud Source Repositories
Continuous build and test | AWS CodeBuild | Azure Pipelines | Cloud Build
Continuous deployment | AWS CodeDeploy | Azure Pipelines | Cloud Build
Pipeline | AWS CodePipeline | Azure Pipelines | Cloud Build
Work item management | - | Azure Boards | -
Package registry | - | Azure Artifacts | -
Test plan management | - | Azure Test Plans | -
IDE | AWS Cloud9 | (Visual Studio Online) | -
Distributed tracing | AWS X-Ray | Azure Application Insights | Stackdriver Trace

End User Computing

Function | AWS | Azure | GCP
Desktop | Amazon WorkSpaces | Windows Virtual Desktop | -
Application streaming | Amazon AppStream 2.0 | - | -
Storage | Amazon WorkDocs | (Office 365) | (G Suite)
Access to internal applications | Amazon WorkLink | Azure AD Application Proxy | -

Internet of Things

Function | AWS | Azure | GCP
Connecting devices to the cloud | AWS IoT Core | Azure IoT Hub | Cloud IoT Core
Deployment to the edge | AWS Greengrass | Azure IoT Edge | Cloud IoT Edge
Running arbitrary functions from devices | AWS IoT 1-Click | - | -
Device analytics | AWS IoT Analytics | Azure Stream Analytics/Azure Time Series Insights | -
Device security management | AWS IoT Device Defender | - | -
Device management | AWS IoT Device Management | Azure IoT Hub | Cloud IoT Core
Detecting events that occur on devices | AWS IoT Events | - | -
Collecting data from industrial equipment | AWS IoT SiteWise | - | -
Building IoT applications | AWS IoT Things Graph | Azure IoT Central | -
Location information | - | Azure Maps | Google Maps Platform
Modeling the real world | - | Azure Digital Twins |

Machine Learning

Function | AWS | Azure | GCP
Building machine learning models | Amazon SageMaker | Azure Machine Learning Service | Cloud ML Engine
Natural language processing | Amazon Comprehend | Language Understanding | Cloud Natural Language
Building chatbots | Amazon Lex | Azure Bot Service | (Dialogflow)
Text-to-Speech | Amazon Polly | Speech Services | Cloud Text-to-Speech
Image recognition | Amazon Rekognition | Computer Vision | Cloud Vision
Translation | Amazon Translate | Translator Text | Cloud Translation
Speech-to-Text | Amazon Transcribe | Speech Services | Cloud Speech-to-Text
Recommendations | Amazon Personalize | - | Recommendations AI
Time-series forecasting | Amazon Forecast | - | -
Document detection | Amazon Textract | - | -
Inference acceleration | Amazon Elastic Inference | - | -
Building datasets | Amazon SageMaker Ground Truth | - | -
Customizing vision models | - | Custom Vision | Cloud AutoML Vision
Customizing speech models | - | Custom Speech | -
Customizing language processing models | Amazon Comprehend | - | Cloud AutoML Natural Language
Customizing translation models | - | - | Cloud AutoML Translation

Management & Governance

Function | AWS | Azure | GCP
Monitoring | Amazon CloudWatch | Azure Monitor | Google Stackdriver
Creating and managing resources | AWS CloudFormation | Azure Resource Manager | Cloud Deployment Manager
Activity tracking | AWS CloudTrail | Azure Activity Log |
Recording and auditing resource configuration changes | AWS Config | - | -
Deploying configuration management services | AWS OpsWorks(Chef/Puppet) | - | -
IT service catalog management | AWS Service Catalog | - | Private Catalog
Infrastructure visibility and control | AWS Systems Manager | - | -
Performance and security optimization | AWS Trusted Advisor | Azure Advisor | -
Status of the services in use | AWS Personal Health Dashboard | Azure Resource Health | -
Setting up standards-compliant accounts | AWS Control Tower | Azure Policy | -
License management | AWS License Manager | - | -
Workload review and improvement | AWS Well-Architected Tool | - | -
Managing multiple accounts | AWS Organizations | Subscription+RBAC | -
Disaster recovery | - | Azure Site Recovery | -
Browser-based shell | AWS Systems Manager Session Manager | Cloud Shell | Cloud Shell

Media Services

Function | AWS | Azure | GCP
Media conversion | Amazon Elastic Transcoder/AWS Elemental MediaConvert | Azure Media Services - Encoding | (Anvato)
Live video processing | AWS Elemental MediaLive | Azure Media Services - Live and On-demand Streaming | (Anvato)
Video delivery and packaging | AWS Elemental MediaPackage | Azure Media Services | (Anvato)
Storage for video files | AWS Elemental MediaStore | - | -
Targeted ad insertion | AWS Elemental MediaTailor | - | -

Migration & Transfer

Function | AWS | Azure | GCP
Migration management | AWS Migration Hub | - | -
Migration assessment | AWS Application Discovery Service | Azure Migrate | -
Database migration | AWS Database Migration Service | Azure Database Migration Service | -
Data transfer from on-premises | AWS DataSync | - | -
Server migration | AWS Server Migration Service | Azure Site Recovery | -
Large-volume data migration | Snow family | Azure Data Box | Transfer Appliance
SFTP | AWS Transfer for SFTP | - | -
Data transfer between clouds | - | - | Cloud Storage Transfer Service

Mobile

Function | AWS | Azure | GCP
Building and deploying mobile/web applications | AWS Amplify | Mobile Apps | (Firebase)
Application testing | AWS Device Farm | (Xamarin Test Cloud) | (Firebase Test Lab)

Networking & Content Delivery

Function | AWS | Azure | GCP
Virtual network | Amazon Virtual Private Cloud | Azure Virtual Network | Virtual Private Cloud
API management | Amazon API Gateway | API Management | Cloud Endpoints/Apigee
CDN | Amazon CloudFront | Azure CDN | Cloud CDN
DNS | Amazon Route 53 | Azure DNS | Cloud DNS
Private connectivity | Amazon VPC PrivateLink | Virtual Network Service Endpoints | Private Access Options for Services
Service mesh | AWS App Mesh | Azure Service Fabric Mesh | Traffic Director
Service discovery | AWS Cloud Map | - | -
Dedicated line connection | AWS Direct Connect | ExpressRoute | Cloud Interconnect
Global load balancer | AWS Global Accelerator | Azure Traffic Manager | Cloud Load Balancing
Hub-and-spoke network connectivity | AWS Transit Gateway | - | -
Network performance monitoring | - | Network Watcher | -

Security, Identity & Compliance

Function | AWS | Azure | GCP
Identity management | AWS Identity and Access Management | Azure Active Directory | Cloud IAM
Hierarchical data store | Amazon Cloud Directory | - | -
Application identity management | Amazon Cognito | Azure Mobile Apps | -
Threat detection | Amazon GuardDuty | Azure Security Center | Cloud Security Command Center
Server security assessment | Amazon Inspector | Azure Security Center | Cloud Security Command Center
Sensitive data discovery and protection | Amazon Macie | Azure Information Protection | -
Access to compliance reports | AWS Artifact | (Service Trust Portal) | -
SSL/TLS certificate management | AWS Certificate Manager | App Service Certificates | Google-managed SSL certificates
Hardware security module | AWS Cloud HSM | Azure Dedicated HSM | Cloud HSM
Active Directory | AWS Directory Service | Azure Active Directory | Managed Service for Microsoft Active Directory
Centralized firewall rule management | AWS Firewall Manager | - | -
Key creation and management | AWS Key Management Service | Azure Key Vault | Cloud Key Management Service
Secrets management | AWS Secrets Manager | Azure Key Vault | -
Centralized security information management | AWS Security Hub | Azure Sentinel | -
DDoS protection | AWS Shield | Azure DDoS Protection | Cloud Armor
Single sign-on | AWS Single Sign-On | Azure Active Directory B2C | Cloud Identity
WAF | AWS WAF | Azure Application Gateway | Cloud Armor

Storage

Function | AWS | Azure | GCP
Object storage | Amazon S3 | Azure Blob | Cloud Storage
Block storage | Amazon EBS | Disk Storage | Persistent Disk
File storage (NFS) | Amazon Elastic File System | Azure NetApp Files | Cloud Filestore
File storage (SMB) | Amazon FSx for Windows File Server | Azure Files | -
File system for HPC | Amazon FSx for Lustre | Azure FXT Edge Filer | -
Archive storage | Amazon S3 Glacier | Storage archive access tier | Cloud Storage Coldline
Centralized backup management | AWS Backup | Azure Backup | -
Hybrid storage | AWS Storage Gateway | Azure StorSimple | -

Other

Function | AWS | Azure | GCP
Creating AR/VR content | Amazon Sumerian | - | -
Game server hosting | Amazon GameLift | - | -
Game engine | Amazon Lumberyard | - | -
Robotics | RoboMaker | - | -
Satellites | Ground Station | - | -

References

#Feedly
