What is Amazon EMR? How to create Amazon EMR clusters
What is Amazon EMR?
Amazon EMR, previously called Amazon Elastic MapReduce, makes it easy to run Apache Hadoop and Apache Spark on AWS to process and analyse enormous amounts of data. These open-source frameworks and applications process data for business intelligence and analytics workloads. Amazon EMR can also transform and move massive volumes of data between data stores such as Amazon DynamoDB and Amazon S3.
Amazon EMR cluster setup and operation
This is a detailed overview of Amazon EMR clusters, including how to submit work to a cluster, how data is processed, and the processing states a cluster moves through.
Understanding clusters and nodes
The central component of Amazon EMR is the cluster, a group of Amazon EC2 instances. Each instance in a cluster is a node, and each node has a role within the cluster, known as the node type. Amazon EMR installs different software components on each node type, giving each node a role in a distributed application such as Apache Hadoop.
Amazon EMR has the following node types (a configuration sketch follows this list):
Primary node: runs software components that coordinate the distribution of data and tasks among the other nodes for processing, and administers the cluster. The primary node tracks the status of tasks and monitors cluster health. Every cluster has a primary node, and it is possible to create a single-node cluster with only a primary node.
Core node: runs software components that execute tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.
Task node: runs software components that execute tasks but do not store data in HDFS. Task nodes are optional.
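As a concrete illustration of how the three node types appear in a cluster definition, here is a minimal sketch using the boto3 EMR client; the cluster name, release label, instance types, and counts are illustrative assumptions, not values from this post.

```python
import boto3

emr = boto3.client("emr", region_name="us-west-2")  # assumed region

# One primary (MASTER) node, two core nodes, and two optional task nodes.
response = emr.run_job_flow(
    Name="example-cluster",                        # hypothetical cluster name
    ReleaseLabel="emr-6.2.0",                      # assumed release
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            {"Name": "Task", "InstanceRole": "TASK",   # task nodes are optional
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,       # keep running after steps finish
    },
    JobFlowRole="EMR_EC2_DefaultRole",             # default roles, named later in this page
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])                       # cluster ID, e.g. "j-XXXXXXXXXXXXX"
```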
Submitting work to a cluster
When you run an Amazon EMR cluster, you have several options for specifying the work to be done:
Provide the complete definition of the work as steps when you create the cluster. This is typically done for clusters that process a set amount of data and then terminate.
Create a long-running cluster, then submit steps, which may contain one or more jobs, using the Amazon EMR console, the Amazon EMR API, or the AWS CLI. See Submit work to an Amazon EMR cluster; a sketch of the API route follows this list.
Create a cluster, connect to the primary node and other nodes via SSH as needed, and use the interfaces of the installed applications to perform tasks and submit queries, either scripted or interactive. Learn more in the Amazon EMR Release Guide.
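For the second option, submitting a step through the API, a minimal boto3 sketch might look like the following; the cluster ID and the S3 script path are placeholders, not values from the post.

```python
import boto3

emr = boto3.client("emr", region_name="us-west-2")  # assumed region

# Submit one step that runs a Spark application already uploaded to S3.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",                   # your cluster ID
    Steps=[{
        "Name": "Example Spark step",
        "ActionOnFailure": "CONTINUE",             # keep the cluster alive if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",           # EMR utility that runs commands on the cluster
            "Args": ["spark-submit", "s3://amzn-s3-demo-bucket/my_script.py"],  # placeholder
        },
    }],
)
print(response["StepIds"])                         # e.g. ["s-XXXXXXXXXXXXX"]
```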
Data processing
When you launch your cluster, you choose the frameworks and applications to install for your data processing needs. You can process data in your Amazon EMR cluster either by running steps on the cluster or by submitting jobs or queries directly to the installed applications.
Submitting jobs directly to applications
You can submit jobs to, and interact directly with, the software installed on your Amazon EMR cluster. This is usually done over a secure connection to the primary node, using the tools and interfaces available for your cluster's software.
Running data processing steps
You can submit one or more ordered steps to an Amazon EMR cluster. Each step contains instructions for manipulating data, to be processed by software installed on the cluster.
The following example procedure has four steps:
Submit an input dataset for processing.
Process the output of the first step with a Pig program.
Process a second input dataset with a Hive program.
Write an output dataset.
Amazon EMR generally processes data from your chosen file system, such as HDFS or Amazon S3, and the data passes from one processing step to the next. In the last step, the output data is written to a specified location, such as an Amazon S3 bucket. A sketch of defining these steps through the API follows.
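Expressed as ordered steps for an API call, the four-step procedure above could be sketched roughly as follows. The script locations are placeholders, and the exact Pig and Hive invocations are assumptions for illustration, assuming both applications are installed on the cluster.

```python
# A rough sketch of the four ordered steps; command-runner.jar runs each
# command on the cluster. Paths and invocation details are hypothetical.
steps = [
    {"Name": "1. Submit input dataset",
     "ActionOnFailure": "CANCEL_AND_WAIT",
     "HadoopJarStep": {"Jar": "command-runner.jar",
                       "Args": ["spark-submit", "s3://amzn-s3-demo-bucket/ingest.py"]}},
    {"Name": "2. Process first output with Pig",
     "ActionOnFailure": "CANCEL_AND_WAIT",
     "HadoopJarStep": {"Jar": "command-runner.jar",
                       "Args": ["pig", "-f", "s3://amzn-s3-demo-bucket/stage2.pig"]}},
    {"Name": "3. Process second input dataset with Hive",
     "ActionOnFailure": "CANCEL_AND_WAIT",
     "HadoopJarStep": {"Jar": "command-runner.jar",
                       "Args": ["hive", "-f", "s3://amzn-s3-demo-bucket/stage3.hql"]}},
    {"Name": "4. Write output dataset",
     "ActionOnFailure": "CANCEL_AND_WAIT",
     "HadoopJarStep": {"Jar": "command-runner.jar",
                       "Args": ["spark-submit", "s3://amzn-s3-demo-bucket/write_output.py"]}},
]
# Submitted with emr.add_job_flow_steps(JobFlowId=..., Steps=steps), as above.
```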
Steps are run in the following sequence:
A request is submitted to begin processing the steps.
The status of all steps is set to PENDING.
When the first step starts, its status changes to RUNNING. The remaining steps stay PENDING.
After the first step completes, its status changes to COMPLETED.
The next step in the sequence starts, and its status changes to RUNNING. When it finishes, its status changes to COMPLETED.
This pattern repeats for each step until all steps are completed and processing ends.
The following diagram shows processing steps and state changes.
If a step fails during processing, its status changes to FAILED. You can choose how to handle failure for each step. By default, if a step fails, any remaining steps in the sequence are set to CANCELLED and do not run. Other options are to terminate the cluster immediately, or to ignore the failure and continue running the remaining steps.
The following diagram shows the default step sequence and state changes when a step fails during processing.
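One way to watch these state changes is to poll the step list through the API. A minimal sketch, assuming the cluster ID is known:

```python
import time

import boto3

emr = boto3.client("emr", region_name="us-west-2")  # assumed region

# Poll until every step reaches a terminal state.
while True:
    steps = emr.list_steps(ClusterId="j-XXXXXXXXXXXXX")["Steps"]  # placeholder ID
    for step in steps:
        # State is PENDING, RUNNING, COMPLETED, CANCELLED, FAILED, or INTERRUPTED.
        print(step["Name"], step["Status"]["State"])
    if all(s["Status"]["State"] in ("COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED")
           for s in steps):
        break
    time.sleep(30)
```

The ActionOnFailure setting on each step (CANCEL_AND_WAIT, TERMINATE_CLUSTER, or CONTINUE) selects among the failure behaviours described above.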
Understanding the cluster lifecycle
A successful Amazon EMR cluster follows this process:
Amazon EMR first provisions EC2 instances in the cluster according to your requirements. See Amazon EMR cluster hardware and networking configuration for more detail. Amazon EMR uses the default AMI or a custom Amazon Linux AMI you specify; for more, see Using a custom AMI to increase Amazon EMR cluster configuration flexibility. During this phase, the cluster state is STARTING.
You can configure bootstrap actions to run on each instance; bootstrap actions can install and configure custom applications. See Create bootstrap actions for Amazon EMR cluster software installation. During this phase, the cluster state is BOOTSTRAPPING.
Amazon EMR then installs the native applications you chose when you created the cluster, such as Hive, Hadoop, and Spark. After startup finishes and the native applications are installed, the cluster state is RUNNING. Once you connect to the cluster instances, the cluster runs, in order, any steps you specified when you created it, and you can submit further steps after previous steps complete. See Submit work to an Amazon EMR cluster.
After the steps run successfully, the cluster goes into a WAITING state.
After its last step, a cluster configured to auto-terminate enters the TERMINATING state and then terminates. A cluster configured to wait must be shut down manually; after a manual shutdown, the cluster enters TERMINATING and then TERMINATED.
If a failure occurs during the cluster lifecycle and termination protection is not enabled, Amazon EMR terminates the cluster and all of its instances. Data on a failed cluster is deleted, and the cluster state is set to TERMINATED_WITH_ERRORS. If termination protection is enabled, you can retrieve data from the cluster, then disable termination protection and terminate the cluster. Find out how termination protection can prevent unintended shutdown of Amazon EMR clusters.
The following diagram shows the cluster lifecycle and how each stage corresponds to a cluster state.
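To observe these lifecycle states programmatically, you can poll the cluster description; removing termination protection before terminating, as described above, is a one-call operation. A minimal sketch with a placeholder cluster ID:

```python
import boto3

emr = boto3.client("emr", region_name="us-west-2")  # assumed region
cluster_id = "j-XXXXXXXXXXXXX"                      # placeholder

# State moves through STARTING, BOOTSTRAPPING, RUNNING, WAITING,
# TERMINATING, TERMINATED (or TERMINATED_WITH_ERRORS on failure).
state = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
print("Cluster state:", state)

# After recovering data from a protected, failed cluster, remove
# termination protection and shut the cluster down.
emr.set_termination_protection(JobFlowIds=[cluster_id], TerminationProtected=False)
emr.terminate_job_flows(JobFlowIds=[cluster_id])
```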
Amazon EMR Studio Workspace Creation and Launching in AWS
You can create and customise Workspaces in an EMR Studio to organise and run notebooks. This section covers creating and using Workspaces.
Helpful EMR Studio Workspace topics:
Create an EMR Studio Workspace
Launch an EMR Studio Workspace
Understand the Workspace UI in EMR Studio
See EMR Studio notebook examples
Save EMR Studio Workspace content
Delete an EMR Studio Workspace and notebooks
Understand Workspace statuses
Troubleshoot Workspace connectivity
Create an EMR Studio Workspace
You create Workspaces in EMR Studio to run notebook code.
To create an EMR Studio Workspace:
Log in to your EMR Studio.
Select Create a Workspace.
Enter a Workspace name and description. Naming a Workspace helps you identify it on the Workspaces page.
To let other Studio users work in this Workspace with you in real time, enable Workspace collaboration. You can configure collaborators after you launch the Workspace.
To attach a cluster to the Workspace, expand Advanced configuration. You can also attach a cluster later. Refer to Attach compute to an EMR Studio Workspace for information.
Provisioning a new cluster requires administrator permissions.
After choosing a cluster for the Workspace, attach it.
Select Create a Workspace at the bottom of the page.
After you create a Workspace, EMR Studio opens the Workspaces page, where the newly created Workspace appears with a green success banner at the top.
By default, any Studio user can see the Workspaces in a Studio, but only one user can work in a Workspace at a time. To work with other users in EMR Studio, use Workspace collaboration.
Launch an EMR Studio Workspace
A Workspace's notebook editor lets you work with notebook files. The Workspaces page of a Studio lists all the Workspaces you can access, along with the Name, Status, Creation time, and Last modified of each.
Note
EMR notebooks that you created in the old Amazon EMR console may appear in the new console as EMR Studio Workspaces. Accessing or creating Workspaces requires additional IAM role permissions, and you may need to refresh the Workspaces list to see a notebook you created in the old console.
To launch a Workspace and open the notebook editor:
Find your Workspace on the Workspaces page of your Studio. You can filter the list by keyword or column value.
Select the Workspace name to open it in a new browser tab. If the Workspace is idle, it may take several minutes to open. You can also select the Workspace row and then choose Launch Workspace.
The following launch options are available:
Quick launch: launch your Workspace with default settings; it opens in JupyterLab.
Launch with options: launch your Workspace with custom settings. You can choose Jupyter or JupyterLab, attach the Workspace to an EMR cluster, and select security groups.
Note
Only one user can work in a Workspace at a time. If you try to open a Workspace that is already in use, EMR Studio alerts you. The User column on the Workspaces page shows who is using the Workspace.
Amazon EMR Notebooks For Enhanced Big Data Exploration
EMR Notebooks: AWS Simplifies Spark Cluster Data Analysis
Amazon Web Services (AWS) makes big data work more flexible and integrated for data scientists and analysts. Amazon EMR Notebooks offer a familiar, interactive interface connected to Apache Spark-powered Amazon EMR clusters, streamlining data queries, model creation, and result visualisation.
Amazon EMR users can access EMR Notebooks as EMR Studio Workspaces in the console, where the Create Workspace button simplifies notebook creation. Users need additional IAM role permissions to create or access these Workspaces.
EMR Notebooks are "serverless" notebooks. The equations, queries, models, code, and narrative text you write live client-side in the notebook interface, while a kernel on the Amazon EMR cluster executes your commands. This configuration applies the scalable computing capacity of your EMR cluster directly to interactive analysis sessions.
The design also protects your valuable work from the transience of compute clusters: EMR notebook contents are automatically saved to Amazon S3. Because your notes, code, and analysis are kept separate from the cluster's data, notebooks are durable and easy to reuse; your work survives even if the cluster is shut down.
Flexible notebook-to-cluster attachment is a major benefit. You can create an EMR cluster, attach a notebook for analysis, and then terminate the cluster when you are done, for cost-effective, on-demand computing. You can also close a notebook attached to one cluster and connect it to another, letting you switch environments or work with data on another cluster quickly.
Multiple users can attach notebooks to the same EMR cluster at once, and because notebook files are stored on Amazon S3, sharing is easy. These features are said to reduce the time spent reconfiguring notebooks for different datasets and clusters.
You can use EMR Notebooks from the interactive console or programmatically. Headless execution lets you run an EMR notebook through the Amazon EMR API without using the UI. To enable it, you tag a cell in the EMR notebook with "parameters"; when an external script launches the notebook programmatically, this cell acts as the gateway for feeding the notebook new input values.
This is useful for building parameterised notebooks that can be reused with different input values without maintaining extra copies. Each time a parameterised notebook is executed via the API, Amazon EMR generates an output notebook and stores it on S3. Example API commands are available for building on this functionality.
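As a sketch of what headless execution might look like through the API, here is a minimal boto3 call; the editor ID, notebook path, cluster ID, role name, and parameter values are all placeholders, not values from the post.

```python
import boto3

emr = boto3.client("emr", region_name="us-west-2")      # assumed region

# Run a parameterised notebook without opening the UI; the values in the
# cell tagged "parameters" are overridden by NotebookParams.
response = emr.start_notebook_execution(
    EditorId="e-XXXXXXXXXXXXXXXXXXXXXXXXX",             # placeholder notebook/Workspace ID
    RelativePath="health_check.ipynb",                  # hypothetical notebook file
    ExecutionEngine={"Id": "j-XXXXXXXXXXXXX"},          # attached EMR cluster
    ServiceRole="EMR_Notebooks_DefaultRole",            # assumed role name
    NotebookParams='{"input_date": "2020-01-01"}',      # JSON fed to the parameters cell
)
print(response["NotebookExecutionId"])
```

Each such execution writes an output notebook back to S3, as described above.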
EMR Notebooks support clusters running Amazon EMR release 5.18.0 and later. For best performance, AWS recommends using EMR Notebooks with clusters running Amazon EMR 5.30.0, 5.32.0, or 6.2.0 and later. In these later releases, the Jupyter kernels that run your code run directly on the cluster, which is said to improve performance and allow kernel and library customisation.
Customers considering Amazon EMR Notebooks should account for the costs. Amazon S3 storage for notebook files is charged as usual, and standard charges apply to the attached Amazon EMR clusters that run notebook commands.
In short, Amazon EMR Notebooks provide a familiar, adaptable, and interactive environment in which data professionals can explore and analyse data directly against their Amazon EMR Spark clusters. S3-backed storage, flexible cluster attachment, multi-user access, and headless execution make them a compelling option for big data work on AWS.
Creating a Scalable Amazon EMR Cluster on AWS in Minutes
With Spark, you can easily set up an Amazon EMR cluster to process and analyse data. This page covers three phases: Plan and Configure, Manage, and Clean Up. The detailed guide to cluster setup follows.
Amazon EMR Cluster Configuration
In this tutorial, you launch a sample cluster with Spark and run a simple PySpark script. Complete the tasks in "Before you set up Amazon EMR" before you begin.
While it is running, the sample cluster accrues small per-second charges under Amazon EMR pricing, which varies by Region. To avoid further charges, complete the clean-up steps at the end of the tutorial.
The setup procedure has several steps:
Amazon EMR Cluster and Data Resources Configuration
This initial stage prepares your application and input data, creates your data storage location, and starts the cluster.
Setting Up Amazon EMR Storage:
Amazon EMR supports several file systems; this tutorial uses EMRFS to store data in an S3 bucket. EMRFS is an implementation of the Hadoop file system that reads from and writes to Amazon S3.
This tutorial requires a dedicated S3 bucket. Follow the Amazon Simple Storage Service Console User Guide to create one.
Create the bucket in the same AWS Region where you will launch your Amazon EMR cluster, for example US West (Oregon) us-west-2.
Buckets and folders used with Amazon EMR have naming limits: names can contain lowercase letters, numbers, periods (.), and hyphens (-), but cannot end in numbers, and bucket names must be unique across all AWS accounts.
The output folder in the bucket must be empty.
Small files in Amazon S3 may incur modest charges, but usage within the AWS Free Tier limits may be free.
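Creating the bucket can also be done programmatically. A minimal sketch, assuming the us-west-2 Region mentioned above and a placeholder bucket name:

```python
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# Bucket names must be globally unique: lowercase letters, numbers,
# periods, and hyphens only, and the name must not end in numbers.
s3.create_bucket(
    Bucket="amzn-s3-demo-bucket",  # placeholder; choose your own unique name
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)
```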
Prepare an application with input data for Amazon EMR:
The standard preparation is to upload the application and its input data to Amazon S3, then submit the work using the S3 locations.
The PySpark script examines 2006–2020 food establishment inspection data from King County, Washington, to identify the ten establishments with the most "Red" violations. Sample rows of the dataset are presented.
To prepare the PySpark script, create a new file called health_violations.py, copy the source code into it, and upload the file to your new S3 bucket. Uploading instructions are in the Amazon Simple Storage Service Getting Started Guide; a sketch of the script follows.
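The post does not reproduce the script body, so here is a rough sketch of what health_violations.py could look like given the description above; the column names name and violation_type are assumptions about the dataset.

```python
import argparse

from pyspark.sql import SparkSession


def calculate_red_violations(data_source, output_uri):
    """Find the ten establishments with the most Red violations."""
    spark = SparkSession.builder.appName("Calculate Red Health Violations").getOrCreate()

    # Load the inspection CSV from S3; the first row is assumed to be a header.
    restaurants_df = spark.read.option("header", "true").csv(data_source)

    # Register an in-memory view so the query below can use SQL.
    restaurants_df.createOrReplaceTempView("restaurant_violations")

    # Count Red violations per establishment and keep the top ten.
    top_red_violations = spark.sql(
        """SELECT name, count(*) AS total_red_violations
           FROM restaurant_violations
           WHERE violation_type = 'RED'
           GROUP BY name
           ORDER BY total_red_violations DESC
           LIMIT 10"""
    )

    # Write the result as CSV to the output folder in the S3 bucket.
    top_red_violations.write.option("header", "true").mode("overwrite").csv(output_uri)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_source", help="URI of the input CSV, e.g. an s3:// path")
    parser.add_argument("--output_uri", help="URI where output is written, e.g. an s3:// path")
    args = parser.parse_args()
    calculate_red_violations(args.data_source, args.output_uri)
```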
To create the example input data, download and unzip the food_establishment_data.zip file, save the CSV to your computer as food_establishment_data.csv, and upload it to the same S3 bucket. Again, see the Amazon Simple Storage Service Getting Started Guide for uploading instructions; a sketch of the uploads follows.
"Prepare input data for processing with Amazon EMR" explains EMR data configuration in more detail.
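The uploads themselves can be scripted as well; a minimal sketch with a placeholder bucket name:

```python
import boto3

s3 = boto3.client("s3", region_name="us-west-2")  # assumed region
bucket = "amzn-s3-demo-bucket"                    # placeholder; use your bucket

# Upload the PySpark script and the example input data to the bucket.
s3.upload_file("health_violations.py", bucket, "health_violations.py")
s3.upload_file("food_establishment_data.csv", bucket, "food_establishment_data.csv")
```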
Launch an Amazon EMR cluster:
With storage and your application set up, you can launch the example cluster with Apache Spark and the latest Amazon EMR release, using either the AWS Management Console or the AWS CLI.
Console Launch:
Sign in to the AWS Management Console and open the Amazon EMR console.
Choose "EMR on EC2" > "Clusters" > "Create cluster". Note the default values for "Release," "Instance type," "Number of instances," and "Permissions".
Enter a unique "Cluster name" that does not contain the characters <, >, $, |, or `. Under "Applications," choose "Spark" to install it. Note: applications must be chosen before launching the cluster. Under "Cluster logs," select the option to publish cluster-specific logs to Amazon S3. The default destination is shown as s3://amzn-s3-demo-bucket/logs; replace amzn-s3-demo-bucket with your S3 bucket. A new "logs" subfolder is created for the log files.
Under "Security configuration and permissions," choose your EC2 key pair. Set the Service role to "EMR_DefaultRole" and the instance profile to "EMR_EC2_DefaultRole".
Choose "Create cluster".
The cluster detail page appears. As Amazon EMR provisions the cluster, its "Status" changes from "Starting" to "Running" to "Waiting"; you may need to refresh the console view to see the change. The status switches to "Waiting" when the cluster is ready to accept work.
CLI launch:
First, generate the default IAM roles with the aws emr create-default-roles command.
Then create the Spark cluster with aws emr create-cluster. Specify a cluster --name, the name of your EC2 key pair, an --instance-type, an --instance-count, and --use-default-roles. The Linux line-continuation characters (\) in the sample command may need adjusting on Windows.
The output includes a ClusterId and a ClusterArn. Note your ClusterId for later.
Check your cluster status with aws emr describe-cluster --cluster-id <myClusterId>.
The result shows a Status object with a State. As EMR provisions the cluster, the State changes from STARTING to RUNNING to WAITING. The cluster reaches WAITING when it is up, running, and ready to accept work; a Python sketch of the same check follows.
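The same status check can be done from Python; boto3 also provides a waiter that blocks until the cluster is up. A minimal sketch with a placeholder cluster ID:

```python
import boto3

emr = boto3.client("emr", region_name="us-west-2")  # assumed region
cluster_id = "j-XXXXXXXXXXXXX"                      # from the create-cluster output

# Equivalent of "aws emr describe-cluster --cluster-id ...".
state = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
print("State:", state)                              # STARTING -> RUNNING -> WAITING

# Block until the cluster is running and ready to accept work.
emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)
```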
Allow SSH connections
Before connecting to your running cluster via SSH, update the cluster security groups to allow incoming connections. Amazon EC2 security groups act as virtual firewalls. At cluster launch, EMR created default security groups: ElasticMapReduce-master for the primary node and ElasticMapReduce-slave for the core and task nodes.
Console-based SSH authorisation:
You need permission to manage the security groups for the cluster's VPC.
Sign in to the AWS Management Console and open the Amazon EMR console.
Under "Clusters," select the cluster you want to update, then select the "Properties" tab.
On the "Properties" tab, choose "Networking" and expand "EC2 security groups (firewall)". Under "Primary node," select the security group link.
The EC2 console opens. Choose "Inbound rules," then "Edit inbound rules".
Find and delete any inbound rule that allows public access (Type: SSH, Port: 22, Source: Custom 0.0.0.0/0). Warning: if the ElasticMapReduce-master group has a pre-configured rule that allows public access, remove it and limit traffic to trusted sources.
Scroll to the bottom of the rules and choose "Add Rule".
Choose "SSH" for "Type"; this sets the Port Range to 22 and the Protocol to TCP.
For "Source," choose "My IP," or enter a range of trusted client IP addresses with "Custom". Remember that dynamic IP addresses may need updating later. Choose "Save."
Back in the EMR console, select "Core and task nodes" and repeat these steps to allow SSH access to those nodes; a programmatic sketch of the same rule follows.
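The same inbound rule can be added with the EC2 API; the security group ID and client IP below are placeholders, and restricting the source to your own address mirrors the console guidance above.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # assumed region

# Allow SSH (TCP port 22) from a single trusted client IP only;
# never use 0.0.0.0/0, which would open the node to the public internet.
ec2.authorize_security_group_ingress(
    GroupId="sg-XXXXXXXXXXXXXXXXX",                 # placeholder: the primary node's group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": "203.0.113.7/32"}], # placeholder trusted IP
    }],
)
```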
Connecting with AWS CLI:
SSH connections may be made using the AWS CLI on any operating system.
Use the command aws emr ssh --cluster-id <myClusterId> --key-pair-file <~/mykeypair.key>, replacing <myClusterId> with your ClusterId and the key-pair value with the full path to your key pair file.
After connecting, browse /mnt/var/log/spark to examine the Spark logs on the primary node.
With cluster setup and access configured, the next step is submitting work to the cluster.