#awsdynamodb
fortunatelycoldengineer · 2 years ago
Text
What is AWS DynamoDB? Visit http://bit.ly/3JmtEgj for more information.
0 notes
jaiinfoway · 2 years ago
Text
Visit #jaiinfoway at www.jaiinfoway.com for AWS DynamoDB.
AWS DynamoDB: The Ultimate Guide You Should Know In 2022
Read more: https://jaiinfoway.com/aws-dynamodb-the-ultimate-guide-you-should-know-in-2022/
0 notes
ecorptrainings · 7 years ago
Text
Amazon DynamoDB online training at Ecorptrainings Hyderabad India.
AWS DynamoDB is one of the most powerful and widely used non-relational (NoSQL) databases available today. It is a fault-tolerant, highly scalable database with tunable consistency that meets the demanding requirements of the can't-fail, must-scale systems driving growth for many of today's most successful enterprises.
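To make the "tunable consistency" point concrete: each DynamoDB read can be issued as either eventually consistent (the default) or strongly consistent. The following is only an illustrative boto3 sketch; the table and key names are invented.

import boto3

# Illustrative only: "Orders" and "OrderId" are made-up placeholder names.
table = boto3.resource("dynamodb").Table("Orders")

# Default read: eventually consistent (cheaper; may not reflect a very recent write)
eventual = table.get_item(Key={"OrderId": "1001"})

# Strongly consistent read: reflects all writes acknowledged before the read
strong = table.get_item(Key={"OrderId": "1001"}, ConsistentRead=True)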
ABOUT ECORPTRAININGS:
Ecorp Trainings is one of the best institutes providing quality e-learning training. This is instructor-led online training.
We also provide corporate training if a group of people is interested in the same technology.
Contact us for detailed course content & register for a free demo.
We also provide support with client interviews, resume preparation, and ticket resolution.
Contact us for a custom-designed training course created by experts exclusively for you.
We provide training for almost all IT technologies, e.g., Java, .NET, SAP, Oracle, PeopleSoft, Hyperion, etc. Contact us if you have any particular need.
Contact:
Ecorptrainings
USA: +1-703-445-4802 UK : +44 20 3287 2021
India: +91-720704GH3304 / 7207043306,+91-8143-111-555
Gtalk ID : ecorptrainings
Skype ID : ecorptrainings
For content: click here
0 notes
globalmediacampaign · 4 years ago
Text
Export and analyze Amazon DynamoDB data in an Amazon S3 data lake in Apache Parquet format
Amazon DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. It’s a fully managed, multi-region, multi-active, durable database with built-in security, backup and restore, and in-memory caching for internet-scale applications. DynamoDB can handle more than 10 trillion requests per day and can support peaks of more than 20 million requests per second. It’s relied on across industries and verticals to back mission-critical applications. As such, DynamoDB is designed for efficient online transaction processing (OLTP) workloads; however, what if you want to also perform ad hoc online analytical processing (OLAP) queries? The data in OLTP systems is the proverbial gold mine for analytics.

Recently, DynamoDB announced a feature you can use to export point-in-time recovery (PITR) backups to Amazon Simple Storage Service (Amazon S3), making your DynamoDB table data easily accessible in your Amazon S3 bucket at the click of a button. For more information, see Exporting DynamoDB table data to Amazon S3. After you export your data to Amazon S3, you can use Amazon Athena, Amazon Redshift, Amazon SageMaker, or any other big data tools to extract rich analytical insights. Although you can query the data directly in the DynamoDB JSON or Amazon Ion format, we find that for larger datasets, converting the exported output into Apache Parquet (a popular, high-performance columnar data format) translates into faster queries and cost savings. Like DynamoDB itself, this feature functions at any scale with no impact on the performance or availability of production applications. You can export data from your DynamoDB PITR backup at any point in time in the last 35 days at per-second granularity, and the exported dataset can be delivered to an Amazon S3 bucket in any AWS Region or account.

Previously, integrating and analyzing table data in DynamoDB required custom configurations by using tools such as AWS Data Pipeline or Amazon EMR. These tools perform a table scan and export the data to Amazon S3 or a data warehouse for analytics, thereby consuming table read capacity. In addition, these scan-based solutions require expertise in big-data tools, infrastructure deployment, capacity management, and maintenance. In this post, we show how to use the DynamoDB-to-Amazon S3 data export feature, convert the exported data into Apache Parquet with AWS Glue, and query it via Amazon Athena with standard SQL.

Solution overview

The walkthrough in this post shows you how to:

1. Enable point-in-time recovery (PITR) on a DynamoDB table.
2. Initiate a data export.
3. View the dataset in Amazon S3.
4. Transform the exported data into Apache Parquet by using AWS Glue.
5. Build and craft SQL queries with Athena.

This post assumes that you’re working with an AWS Identity and Access Management (IAM) role that can access DynamoDB, Amazon S3, AWS Glue, and Athena. If you don’t have an IAM role to access these resources, it’s recommended that you work with your AWS account administrator. The AWS usage in this post consumes resources beyond the Free Tier, so you will incur associated costs by implementing this walkthrough. It’s recommended that you remove resources after you complete the walkthrough.

The following diagram illustrates this post’s solution architecture. We start by exporting Amazon DynamoDB data to Amazon S3 in DynamoDB JSON format [1].
Once the export is complete, we configure an AWS Glue crawler to detect the schema from the exported dataset [2] and populate the AWS Glue Data Catalog [3]. Next, we run an AWS Glue ETL job to convert the data into Apache Parquet [4], and store the data in S3 [5]. Amazon Athena uses the AWS Glue Catalog to determine which files it must read from Amazon S3 and then executes the query [6].

About the process of exporting data from DynamoDB to Amazon S3

Let’s first walk through the process of exporting a DynamoDB table to Amazon S3. For this post, we have a DynamoDB table populated with data from the Amazon Customer Reviews Dataset. This data is a collection of reviews written by users over a 10-year period on Amazon.com. DynamoDB is a good service for serving a review catalog like this because it can scale to virtually unlimited throughput and storage based on user traffic. This is an OLTP workload with well-defined access patterns (create and retrieve product reviews).

For this post, our data is structured on the table by using ProductID as the partition key and ReviewID as the sort key (for more information about key selection, see Choosing the Right DynamoDB Partition Key). With this key design, the application can create and retrieve reviews related to products quickly and efficiently. The following screenshot shows the data model of this table, which we created by using NoSQL Workbench.

The data model for this review catalog works well for the OLTP requests from the application, but a common request is to support analytical queries. For this table, imagine that a marketing team wants to find the product that received the most reviews, or perhaps they want to identify which customers posted the most reviews. These are basic analytic queries, but the table isn’t organized to handle these queries, so it requires a full table scan and an application-side comparison to retrieve the information. Although you can model the data to create real-time aggregate counts indexed with a sharded global secondary index, this would require planning and complexity. In this example, the marketing team likely has tens or hundreds of analytical queries, and some of them are built on the results of previous queries. Designing a non-relational data model to fit many analytical queries is neither reasonable nor cost-effective.

Analytical queries don’t require high throughput, many concurrent users, or consistent low latency, so these queries also don’t benefit from a service like DynamoDB. However, if the DynamoDB dataset is easily accessible in Amazon S3, you can analyze data directly with services such as Athena or Amazon Redshift by using standard SQL. Previously, exporting a table to Amazon S3 required infrastructure management, custom scripts and solutions, and capacity planning to ensure sufficient read capacity units to perform a full table scan. Now, you can export your DynamoDB PITR backups dataset to your Amazon S3 bucket at the click of a button. The export feature uses the DynamoDB native PITR feature as the export data source, and it doesn’t consume read capacity and has zero impact on table performance and availability. Calculating cost is simplified because the export cost is per GB of exported data. The beauty of the export feature is that it makes it simple for you to export data to Amazon S3, where analytical queries are straightforward by using tools such as Athena.

Enable PITR on a DynamoDB table

The export feature relies on the ability of DynamoDB to continuously back up your data using PITR.
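As an aside, enabling PITR programmatically is a single call; the following is a minimal boto3 sketch, and the table name is a placeholder rather than a value from this post:

import boto3

# Minimal sketch: turn on point-in-time recovery for a table.
# "ProductReviews" is a placeholder table name.
dynamodb = boto3.client("dynamodb")
dynamodb.update_continuous_backups(
    TableName="ProductReviews",
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)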
PITR enables you to restore from your continuous backup to a new table and to export your backup data to Amazon S3 at any point in time in the last 35 days. Get started by enabling PITR on a DynamoDB table, which you can do via the API, AWS Command Line Interface (AWS CLI), or DynamoDB console. In our case, we have used the DynamoDB console to enable PITR, as shown in the following screenshot.

Initiate a data export from DynamoDB to Amazon S3

DynamoDB data is exported to Amazon S3 and saved as either compressed DynamoDB JSON or Amazon Ion. Once your data is available in your Amazon S3 bucket, you can start analyzing it directly with Athena. However, to get better performance, you can partition the data, compress data, or convert it to columnar formats such as Apache Parquet using AWS Glue. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and integrate the data into a data lake where it is natively accessed by analytics and machine learning tools such as Athena, SageMaker, and Redshift Spectrum.

After you have your PITR-enabled table and selected a destination Amazon S3 bucket for the data export, you can initiate your first DynamoDB export. For this post, we’ve done that in the new DynamoDB console (available in preview) by navigating to the Exports and streams tab of the table. Enter your bucket name in the Destination S3 bucket box (in our case, it’s s3://dynamodb-to-s3-export-results, as shown in the following screenshot). You also can specify a bucket that is in another account or AWS Region. Choosing Additional settings allows you to configure a specific point in time of the restore, the export output format, and the encryption key. For the sake of simplicity, we have not changed the additional settings. Start the export job by choosing Export to S3. After you initiate an export job, you can view the status in the console.

View the dataset in Amazon S3

The time to export a table using the new feature is fast, even for very large tables, when compared to the previous approach of scanning the table to perform an export. You will spend no time deploying infrastructure, and the export time is not necessarily a function of table size because it’s performed in parallel and depends on how uniformly the table data is distributed. In our test, we exported 160 GB of data in 11 minutes. When an export is initiated from DynamoDB, the IAM role that initiates the export job is the same role that writes the data to your Amazon S3 bucket.

When the export process is complete, a new AWSDynamoDB folder is shown in your Amazon S3 bucket with a subfolder corresponding to the export ID. In our case, we have four manifest objects and a data folder. The manifest objects include the following details from the export:

- manifest-files.json – Lists the names and item counts for each exported data object.
- manifest-summary.json – Provides general information about the exported dataset.
- manifest-files.md5 and manifest-summary.md5 – .md5 checksums of the corresponding JSON objects.

The data folder contains the entire dataset saved as .gz files. Now that the data is in Amazon S3, you can use AWS Glue to add the data as a table to the AWS Glue Data Catalog.

Associate exported data with an AWS Glue Data Catalog

AWS Glue is the AWS entry point for ETL and analytics workloads. In the AWS Glue console, you can define a crawler by using the Amazon S3 export location.
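If you prefer to script this step rather than use the console, creating and starting such a crawler with boto3 looks roughly like the sketch below; the crawler name, IAM role, and S3 path are placeholders, not values taken from this post:

import boto3

glue = boto3.client("glue")

# Hypothetical sketch: point a crawler at the exported data and run it.
# The crawler name, role, database, and S3 path are placeholders.
glue.create_crawler(
    Name="ddb-export-crawler",
    Role="AWSGlueServiceRole-ddb-export",  # an IAM role with Glue and S3 access
    DatabaseName="default",
    Targets={"S3Targets": [{"Path": "s3://dynamodb-to-s3-export-results/AWSDynamoDB/<export-id>/data/"}]},
)
glue.start_crawler(Name="ddb-export-crawler")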
You must configure the Glue crawler to crawl all objects in s3:///AWSDynamoDB//data. In this walkthrough, we don’t go deep into how to use AWS Glue crawlers and the Data Catalog. At a high level, crawlers scan the S3 path containing the exported data objects to create table definitions in the Data Catalog that you can use for executing Athena queries or Glue ETL jobs. After the crawler has been created, the exported data is now associated with a new Data Catalog table. We can query our table directly by using Athena. For example, our query to count which customer posted the most reviews looks like the following.

SELECT item.customer_id.s AS customer,
       COUNT(item.customer_id.s) AS Count
FROM "default"."data"
GROUP BY item.customer_id.s
ORDER BY Count DESC
limit 10;

The following screenshot shows the results of this query, which include customers (identified by item.customer_id.s) and how many reviews (identified by Count) they have published. A single customer in this sample data wrote 1,753 reviews. Creating more complex queries is straightforward; grouping by year and product is as simple as expanding the query.

Transform the exported data into Parquet format and partition the data for optimized analytics

For data analytics at scale, you also can use AWS Glue to transform the exported data into higher performing data formats. AWS Glue natively supports many format options for ETL inputs and outputs, including Parquet and Avro, which are commonly used for analytics. Parquet is a column-storage format that provides data compression and encoding that can improve performance of analytics on large datasets. Partitioning data is an important step for organizing data for more efficient analytics queries, reducing cost and query time in the process. The default DynamoDB JSON is not partitioned or formatted for analytics. Let’s extend our example by using AWS Glue to transform the exported data into Parquet format and partition the data for optimized analytics.

The first step is understanding what the data should look like when the ETL job is complete. By default, the data is nested within JSON item structures. We want to flatten the data so that the individual, nested attributes become top-level columns in the analytics table. This enables more efficient partitioning and simpler queries. We can configure AWS Glue jobs by using a visual builder, but to flatten the nested structure, we need to leverage a transformation called Relationalize. (For a step-by-step walkthrough of this transformation, see Simplify Querying Nested JSON with the AWS Glue Relationalize Transform.) For our purposes in this post, we jump right into the AWS Glue job that transforms and partitions the data as well as the query impact, as shown in the following code example.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Begin variables to customize with your information
glue_source_database = ""
glue_source_table = ""
glue_temp_storage = "s3:///temp/"
glue_relationalize_output_s3_path = "s3:////"
dfc_root_table_name = "root"  # default value is "roottable"

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = glue_source_table, transformation_ctx = "datasource0")
dfc = Relationalize.apply(frame = datasource0, staging_path = glue_temp_storage, name = dfc_root_table_name, transformation_ctx = "dfc")
flatData = dfc.select(dfc_root_table_name)
flatDataOutput = glueContext.write_dynamic_frame.from_options(frame = flatData, connection_type = "s3", connection_options = {"path": glue_relationalize_output_s3_path, "partitionKeys": ["item.year.n"]}, format = "parquet", transformation_ctx = "flatDataOutput")
job.commit()

Although the structure of this job is similar to the aforementioned method to implement the Relationalize transformation, note the following:

- The code connection_options = {... "partitionKeys": ["item.year.n"]} specifies how data is partitioned in Amazon S3. The partition key choice is specific to the dataset and the nature of the analytics. For example, if you’re performing queries using time periods, item.year.n is a good choice for partitioning the data.
- The code format = "parquet" sets the AWS Glue job to write the data to Amazon S3 in Parquet format.

The result of the preceding AWS Glue job is a new set of Parquet files organized by year in Amazon S3 folders. As a new time-specific query, let’s say we want instead to use Athena to see the first 100 reviews of an item. The following Athena code example shows this query.

SELECT *
FROM "default"."ddbs3_parquetpartitioned__output_flat_partitioned"
WHERE "item.product_title.s" = 'Dune'
ORDER BY "item.year.n" asc
limit 100;

The dataset from this example is relatively small, but we can see a significant improvement in the amount of data read while querying. For the sake of comparison, when we performed the earlier AWS Glue job (partitioned by year), we performed a job that exported to Parquet without a partition key. Now, when we use Athena to query the dataset that was not partitioned, we scan 322 MB (the query against the partitioned data scanned only 42 MB). The partitioned data is nearly eight times more efficient, and, because Athena bills by GB scanned, eight times more cost-efficient.

Clean up your resources

After you create an export, view the dataset in Amazon S3, transform the exported data into Parquet format, and use Athena to query and review the results, you should remove any resources that you created in this process. Resources that remain active can incur associated costs.
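For instance, if you created the crawler with the placeholder name used in the sketch earlier in this post, a boto3-based cleanup could look roughly like the following; all names here are placeholders (including the job name, which is hypothetical), so adjust them to whatever you actually created:

import boto3

# Rough cleanup sketch; resource names are placeholders, not values from this post.
glue = boto3.client("glue")
glue.delete_crawler(Name="ddb-export-crawler")      # crawler over the exported data
glue.delete_job(JobName="ddb-export-to-parquet")    # hypothetical name for the Parquet ETL job

# Remove the exported objects and Parquet output if you no longer need them.
s3 = boto3.resource("s3")
s3.Bucket("dynamodb-to-s3-export-results").objects.filter(Prefix="AWSDynamoDB/").delete()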
Conclusion

In this post, we demonstrated how to use the DynamoDB data export to Amazon S3 feature with AWS Glue and Athena to perform analytics at scale by using Apache Parquet. This feature reduces the complexity, infrastructure management, and production impact of making the DynamoDB data easily accessible in Amazon S3, and it also removes the need for additional tools that scan and export the table.

Happy exporting and querying!

About the authors

Mazen Ali is a Senior Technical Product Manager on the Amazon DynamoDB team.
Shiladitya Mandal is a Software Engineer on the Amazon DynamoDB team.

https://aws.amazon.com/blogs/database/export-and-analyze-amazon-dynamodb-data-in-an-amazon-s3-data-lake-in-apache-parquet-format/
0 notes
phungthaihy · 5 years ago
Photo
AWS Cloud Storage | Cloud Storage Services | AWS Certification Training | Edureka. AWS Certification Training: http... #awscertification #awscertifiedcloudpractitioner #awscertifieddeveloper #awscertifiedsolutionsarchitect #awscertifiedsysopsadministrator #awscloudsecuritytraining #awscloudstorage #awscloudstorageservices #awscloudstoragetutorial #awsdynamodb #awsedureka #awsefs #awsglacier #awss3 #awss3buckets #ciscoccna #cloudsecuritymodel #cloudstorage #cloudstoragebestpractices #cloudstorageexplained #cloudstoragegateway #cloudstoragemyths #cloudstorageonaws #comptiaa #comptianetwork #comptiasecurity #cybersecurity #edureka #ethicalhacking #it #kubernetes #linux #microsoftaz-900 #microsoftazure #networksecurity #s3objects #software #whatiscloudstorage #windowsserver #ytccon
0 notes
vikas-brilworks · 1 year ago
Text
0 notes
globalmediacampaign · 4 years ago
Text
Cross-account replication with Amazon DynamoDB
Hundreds of thousands of customers use Amazon DynamoDB for mission-critical workloads. In some situations, you may want to migrate your DynamoDB tables into a different AWS account, for example, in the eventuality of a company being acquired by another company. Another use case is adopting a multi-account strategy, in which you have a dependent account and want to replicate production data in DynamoDB to this account for development purposes. Finally, for disaster recovery, you can use DynamoDB global tables to replicate your DynamoDB tables automatically across different AWS Regions, thereby achieving sub-minute Recovery Time and Point Objectives (RTO and RPO). However, you might want to replicate not only to a different Region, but also to another AWS account. In this post, we cover a cost-effective method to migrate and sync DynamoDB tables across accounts while having no impact on the source table performance and availability.

Overview of solution

We split this article into two main sections: initial migration and ongoing replication. We complete the initial migration by using a new feature that allows us to export DynamoDB tables to any Amazon Simple Storage Service (Amazon S3) bucket and use an AWS Glue job to perform the import. For ongoing replication, we use Amazon DynamoDB Streams and AWS Lambda to replicate any subsequent INSERTS, UPDATES, and DELETES. The following diagram illustrates this architecture.

Initial migration

The new native export feature leverages the point in time recovery (PITR) capability in DynamoDB and allows us to export a 1.3 TB table in a matter of minutes without consuming any read capacity units (RCUs), which is considerably faster and more cost-effective than what was possible before its release. Alternatively, for smaller tables that take less than 1 hour to migrate (from our tests, tables smaller than 140 GB), we can use an AWS Glue job to copy the data between tables without writing into an intermediate S3 bucket. Step-by-step instructions to deploy this solution are available in our GitHub repository.

Exporting the table with the native export feature

To export the DynamoDB table to a different account using the native export feature, we first need to grant the proper permissions by attaching two AWS Identity and Access Management (IAM) policies: one S3 bucket policy and one identity-based policy on the IAM user who performs the export, both allowing write and list permissions.

The following code is the S3 bucket policy (target account):

{
    "Version": "2012-10-17",
    "Id": "Policy1605099029795",
    "Statement": [
        {
            "Sid": "Stmt1605098975368",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam:::user/"
            },
            "Action": [
                "s3:ListBucket",
                "s3:PutObjectAcl",
                "s3:AbortMultipartUpload",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3::: ",
                "arn:aws:s3::: /*"
            ]
        }
    ]
}

The following code is the IAM user policy (source account):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1605019439671",
            "Action": [
                "s3:ListBucket",
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:s3:::"
        }
    ]
}

Make sure DynamoDB Streams is enabled in the source table at least 2 minutes before starting the export. This is needed for the ongoing replication step. For instructions on performing the export, see New – Export Amazon DynamoDB Table Data to Your Data Lake in Amazon S3, No Code Writing Required. When doing the export, you can choose the output format in either DynamoDB JSON or Amazon Ion. In this post, we choose DynamoDB JSON.
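As a side note, the Streams prerequisite mentioned above can also be scripted. A minimal boto3 sketch follows; the table name is a placeholder, and the stream view type is an assumption, since the post doesn't state which one it uses:

import boto3

# Sketch: enable DynamoDB Streams on the source table before the export.
# "SourceTable" is a placeholder; the view type is an assumption.
dynamodb = boto3.client("dynamodb")
dynamodb.update_table(
    TableName="SourceTable",
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
)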
The files are exported in the following S3 location: s3:///AWSDynamoDB//data/

After the export has finished, the objects are still owned by the user in the source account, so no one in the target account has permissions to access them. To fix this, we can change the owner by using the bucket-owner-full-control ACL. We use the AWS Command Line Interface (AWS CLI) in the source account and the following command to list all the objects in the target S3 bucket and output the object keys to a file:

aws s3 ls s3:// --recursive | awk '{print $4}' > file.txt

Then, we created a bash script to go over every line and update the owner of each object using the put-object-acl command. Edit the script by changing the path of the file, and run the script.

Importing the table

Now that we have our data exported, we use an AWS Glue job to read the compressed files from the S3 location and write them to the target DynamoDB table. The job requires a schema containing metadata in order to know how to interpret the data. The AWS Glue Data Catalog is a managed service that lets you store, annotate, and share metadata in the AWS Cloud. After the data is cataloged, it’s immediately available for querying and transformation using Amazon Athena, Amazon EMR, Amazon Redshift Spectrum, and AWS Glue. To populate the Data Catalog, we use an AWS Glue crawler to infer the schema and create a logical table on top of our recently exported files. For more information on how to configure the crawler, see Defining Crawlers.

Most of the code that the job runs can be generated by AWS Glue Studio, so we don’t have to type all the existing fields manually. For instructions, see Tutorial: Getting started with AWS Glue Studio. In this post, we focus on just two sections of the code: the data transformation and the sink operation. Our GitHub repo has the full version of the code. The following is the data transformation snippet of the generated code:

Transform0 = ApplyMapping.apply(frame = DataSource0, mappings = [
    ("item.ID.S", "string", "item.ID.S", "string"),
    ("item.date.M", "string", "item.date.M", "string"),
    ("item.location.M.lat.S", "string", "item.location.M.lat.S", "string"),
    ("item.location.M.lng.S", "string", "item.location.M.lng.S", "string")],
    transformation_ctx = "Transform0")

Now we have to make sure all the key names, data types, and nested objects have the same values and properties as in the source. For example, we need to change the key name item.ID.S to ID, item.date.M to date, the date type from string to map, and so on. The location object contains nested JSON and again, we have to make sure the structure is respected in the target as well. Our snippet looks like the following after all the required code changes are implemented:

Mapped = ApplyMapping.apply(frame = Source, mappings = [
    ("item.ID.S", "string", "ID", "string"),
    ("item.date.M", "string", "date", "map"),
    ("item.location.M.lng.S", "string", "location.lng", "string"),
    ("item.location.M.lat.S", "string", "location.lat", "string")],
    transformation_ctx = "Mapped")

Another essential part of our code is the one that allows us to write directly to DynamoDB. Here we need to specify several parameters to configure the sink operation. One of these parameters is dynamodb.throughput.write.percent, which allows us to specify what percentage of write capacity the job should use. For this post, we choose 1.0 for 100% of the available WCUs. Our target table is configured using provisioned capacity and the only activity on the table is this initial import.
Therefore, we configure the AWS Glue job to consume all write capacity allocated to the table. For on-demand tables, AWS Glue handles the write capacity of the table as 40,000. This is the code snippet responsible for the sink operation:

glueContext.write_dynamic_frame_from_options(
    frame = Mapped,
    connection_type = "dynamodb",
    connection_options = {
        "dynamodb.region": "",
        "dynamodb.output.tableName": "",
        "dynamodb.throughput.write.percent": "1.0"
    }
)

Finally, we start our import operation with an AWS Glue job backed by 17 standard workers. Because the price difference was insignificant even when we used half this capacity, we chose the number of workers that resulted in the shortest import time. The maximum number of workers correlates with the table’s capacity throughput limit, so there is a theoretical ceiling based on the write capacity of the target table.

The following graph shows that we’re using DynamoDB provisioned capacity (which can be seen as the red line) because we know our capacity requirements and can therefore better control cost. We requested a write capacity limit increase using AWS Service Quotas to double the table default limit of 40,000 WCUs so the import finishes faster. DynamoDB account limits are soft limits that can be raised by request if you need to increase the speed at which data is exported and imported. There is virtually no limit on how much capacity you request, but each request is subject to review by the DynamoDB service.

Our AWS Glue job took roughly 9 hours and 20 minutes, leaving us with a total migration time of 9 hours and 35 minutes. That is considerably faster than the total of 14 hours for a migration using Data Pipeline, and at a lower cost.

After the import finishes, change the target DynamoDB table write capacity to either one of the following options based on the target table’s use case:

- On-demand – Choose this option if you don’t start the ongoing replication immediately after the initial migration or if the target table is a development table.
- The same WCU you have in the source table – If you’re planning to start the ongoing replication immediately after the initial migration finishes, this is the most cost-effective option. Also, if this is a DR use case, use this option to match the target throughput capacity with the source.

Ongoing replication

To ensure data integrity across both tables, the initial (full load) migration should be completed before enabling ongoing replication. In the ongoing replication process, any item-level modifications that happened in the source DynamoDB table during and after the initial migration are captured by DynamoDB Streams. DynamoDB Streams stores these time-ordered records for 24 hours. Then, a Lambda function reads records from the stream and replicates those changes to the target DynamoDB table. The following diagram (option 1) depicts the ongoing replication architecture.

However, if the initial migration takes more than 24 hours, we have to use Amazon Kinesis Data Streams instead. In our case, we migrated 1.3 TB in just under 10 hours. Therefore, if the table you’re migrating is bigger than 3 TB (with the DynamoDB table write limit increased to 80,000 WCUs), the initial migration part could take more than 24 hours. In this case, use Kinesis Data Streams as a buffer to capture changes to the source table, thereby extending the retention from 1 day to 365 days. The following diagram (option 2) depicts the ongoing replication architecture if we use Kinesis Data Streams as a buffer.
All updates happening on the source table can be automatically copied to a Kinesis data stream using the new Kinesis Data Streams for DynamoDB feature. A Lambda function reads records from the stream and replicates those changes to the target DynamoDB table. In this post, we use DynamoDB Streams (option 1) to capture the changes on the source table. The ongoing replication solution is available in the GitHub repo.

We use an AWS Serverless Application Model (AWS SAM) template to create and deploy the Lambda function that processes the records in the stream. This function assumes an IAM role in the target account to write modified and new items to the target DynamoDB table. Therefore, before deploying the AWS SAM template, complete the following steps in the target account to create the IAM role:

1. On the IAM console, choose Roles in the navigation pane.
2. Choose Create role.
3. Select Another AWS account.
4. For Account ID, enter the source account number.
5. Choose Next.
6. Create a new policy and copy the following permissions to the policy. Replace with the target DynamoDB table ARN.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "dynamodb:BatchGetItem",
                "dynamodb:BatchWriteItem",
                "dynamodb:PutItem",
                "dynamodb:DescribeTable",
                "dynamodb:DeleteItem",
                "dynamodb:GetItem",
                "dynamodb:Scan",
                "dynamodb:Query",
                "dynamodb:UpdateItem"
            ],
            "Resource": ""
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "dynamodb:ListTables",
            "Resource": "*"
        }
    ]
}

7. Return to the role creation wizard and refresh the list of policies.
8. Choose the newly created policy.
9. Choose Next.
10. Enter the target role name.
11. Choose Create role.
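Before moving on to deployment, it may help to picture what such a replication function does conceptually. The following is only a simplified sketch, not the code shipped in the GitHub repo: it assumes the stream is configured with new images, and the role ARN and table name are placeholders.

import boto3
from boto3.dynamodb.types import TypeDeserializer

# Simplified sketch of a stream-to-table replicator; NOT the repo's actual code.
# Assumes the stream carries new images; the role ARN and table name are placeholders.
TARGET_ROLE_ARN = "arn:aws:iam::<target-account>:role/<target-role>"
TARGET_TABLE = "TargetTable"

deserializer = TypeDeserializer()

def _target_table():
    # Assume the cross-account role created above and open the target table with its credentials
    creds = boto3.client("sts").assume_role(
        RoleArn=TARGET_ROLE_ARN, RoleSessionName="ddb-replication"
    )["Credentials"]
    dynamodb = boto3.resource(
        "dynamodb",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    return dynamodb.Table(TARGET_TABLE)

def _plain(image):
    # Convert DynamoDB-typed attributes ({"S": "..."}-style) into plain Python values
    return {k: deserializer.deserialize(v) for k, v in image.items()}

def handler(event, context):
    table = _target_table()
    for record in event["Records"]:
        if record["eventName"] == "REMOVE":
            table.delete_item(Key=_plain(record["dynamodb"]["Keys"]))
        else:  # INSERT or MODIFY
            table.put_item(Item=_plain(record["dynamodb"]["NewImage"]))

A production version would also need batching, retries, and error handling that this sketch omits.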
Deploying and running the ongoing replication solution

Follow the instructions in the GitHub repo to deploy the template after the initial migration is finished. When deploying the template, you’re prompted to enter several parameters, including but not limited to TargetRoleName (the IAM role created in the last step) and SourceTableStreamARN. For more information, see Parameter Details in the GitHub repo.

One of the most important parameters is MaximumRecordAgeInSeconds, which defines the oldest record in the stream that the Lambda function starts processing. If DynamoDB Streams was enabled only a few minutes before starting the initial export, set this parameter to -1 to process all records in the stream. If you didn’t have to turn on the stream because it was already enabled, set the MaximumRecordAgeInSeconds parameter to a few minutes (2–5 minutes) before the initial export starts. Otherwise, Lambda processes items that were already copied during the migration step, thereby consuming unnecessary Lambda resources and DynamoDB write capacity. For example, let’s assume you started the initial export at 2:00 PM and it took 1 hour, finishing at 3:00 PM. After that, the import started and took 7 hours to complete. If you deploy the template at 10:00 PM, set the age to 28,920 seconds (8 hours, 2 minutes).

The deployment creates a Lambda function that reads from the source DynamoDB stream and writes to the table in the target account. It also creates a disabled DynamoDB event source mapping. The reason it is disabled is that the moment we enable it, the function starts processing records in the stream automatically. Because we should start the ongoing replication only after the initial migration finishes, we need to control when to enable the trigger.

To start the ongoing replication, enable the event source mapping on the Lambda console, as shown in the following screenshot. Alternatively, you can also use the update-event-source-mapping command to enable the trigger or change any of the settings, such as MaximumRecordAgeInSeconds.

Verifying the number of records in the source and target tables

To verify the number of records in both the source and target tables, check the Item summary section on the DynamoDB console, as shown in the following screenshot. You can also use the following command to determine the number of items as well as the size of the source and target tables:

aws dynamodb describe-table --table-name

DynamoDB updates the size and item count values approximately every 6 hours. Alternatively, you can run an item count on the DynamoDB console. This operation does a full scan on the table to retrieve the current size and item count, and therefore it’s not recommended to run this action on large tables.

Cleaning up

Delete the resources you created if you no longer need them:

1. Delete the IAM roles we created.
2. Disable DynamoDB Streams.
3. Disable PITR in the source table.
4. Delete the AWS SAM template to delete the Lambda functions: aws cloudformation delete-stack --stack-name
5. Delete the AWS Glue job, tables, and database.

Conclusion

In this post, we showcased the fastest and most cost-effective way to migrate DynamoDB tables between AWS accounts, using the new DynamoDB export feature along with AWS Glue for the initial migration, and Lambda in conjunction with DynamoDB Streams for the ongoing replication. Should you have any questions or suggestions, feel free to reach out to us on GitHub and we can take the conversation further. Until next time, enjoy your cloud journey!

About the Authors

Ahmed Zamzam is a Solutions Architect with Amazon Web Services. He supports SMB customers in the UK in their digital transformation and their cloud journey to AWS, and specializes in Data Analytics. Outside of work, he loves traveling, hiking, and cycling.
Dragos Pisaroc is a Solutions Architect supporting SMB customers in the UK in their cloud journey, and has a special interest in big data and analytics. Outside of work, he loves playing the keyboard and drums, as well as studying psychology and philosophy.

https://aws.amazon.com/blogs/database/cross-account-replication-with-amazon-dynamodb/
0 notes