Amazon EMR Interview Questions

1 What does AMI include?

An AMI includes the following things:

  • A template for the root volume for the instance.
  • Launch permissions that control which AWS accounts can use the AMI to launch instances.
  • A block device mapping that determines the volumes to attach to the instance when it is launched.

2 What is a Stateful and a Stateless Firewall?

    A Stateful Firewall keeps track of the state of the connections it allows. You only need to define inbound rules: once an inbound connection is allowed, the return traffic for that connection is automatically allowed to flow out.

    On the other hand, a Stateless Firewall keeps no connection state, so you must explicitly define rules for both inbound and outbound traffic.

    For example, if you allow inbound traffic on port 80, a Stateful Firewall automatically allows the responses to those requests to flow back out, whereas a Stateless Firewall blocks them unless you add a matching outbound rule.
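    The difference can be illustrated with a small simulation. This is only a sketch: the class and method names are hypothetical, and a real firewall tracks far more than a client address and a port.

```python
from dataclasses import dataclass, field


@dataclass
class StatelessFirewall:
    """Checks every packet against explicit inbound and outbound port rules."""
    inbound_ports: set = field(default_factory=set)
    outbound_ports: set = field(default_factory=set)

    def allow_inbound(self, client, port):
        return port in self.inbound_ports

    def allow_reply(self, client, port):
        # No memory of earlier packets: the reply needs its own outbound rule.
        return port in self.outbound_ports


@dataclass
class StatefulFirewall:
    """Remembers accepted connections and lets their replies through."""
    inbound_ports: set = field(default_factory=set)
    connections: set = field(default_factory=set)

    def allow_inbound(self, client, port):
        if port in self.inbound_ports:
            self.connections.add((client, port))  # track the connection
            return True
        return False

    def allow_reply(self, client, port):
        # Replies are allowed only for connections accepted on the way in.
        return (client, port) in self.connections


# Both firewalls have a single inbound rule for port 80.
stateful = StatefulFirewall(inbound_ports={80})
stateless = StatelessFirewall(inbound_ports={80})
```

    With only the inbound rule in place, the stateful firewall lets the reply out because it remembers the connection, while the stateless one rejects it until port 80 is also added to its outbound rules.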

    3 What are Spot Instances and On-Demand Instances?

    At any given time, some of the EC2 computing capacity in AWS is left unused. AWS offers this spare capacity as Spot Instances. Spot Instances run whenever capacity is available, so they are a good option if you are flexible about when your applications run and if your applications can tolerate interruptions.

    On the other hand, On-Demand Instances can be created as and when needed. The prices of such instances are static. Such instances will always be available unless you explicitly terminate them.

    What is Connection Draining?

    Connection Draining is an AWS load balancer feature that lets servers that are about to be updated or removed finish serving their current requests.

    If Connection Draining is enabled, the Load Balancer allows an outgoing instance to complete its in-flight requests for a specified period but does not send it any new requests. Without Connection Draining, an outgoing instance is removed immediately and the requests pending on it fail.
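    The behaviour described above can be sketched as a toy load balancer. The class below is a hypothetical in-memory model, not an AWS API, and it leaves out the draining timeout for brevity.

```python
class DrainingLoadBalancer:
    """Toy load balancer that stops routing to instances being drained."""

    def __init__(self):
        self.in_flight = {}    # instance id -> number of requests in progress
        self.draining = set()  # instances that must not receive new requests

    def register(self, instance_id):
        self.in_flight[instance_id] = 0

    def route(self):
        """Send a new request to the least-loaded instance that is not draining."""
        candidates = [i for i in self.in_flight if i not in self.draining]
        if not candidates:
            raise RuntimeError("no instance available for new requests")
        target = min(candidates, key=lambda i: self.in_flight[i])
        self.in_flight[target] += 1
        return target

    def complete(self, instance_id):
        """Mark one in-flight request on the instance as finished."""
        self.in_flight[instance_id] -= 1

    def start_draining(self, instance_id):
        self.draining.add(instance_id)

    def can_remove(self, instance_id):
        # Safe to remove only once draining has started and nothing is in flight.
        return instance_id in self.draining and self.in_flight[instance_id] == 0
```

    An instance that is draining keeps its existing requests but is skipped by `route()`, and `can_remove()` turns true only after those requests have completed.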

    What is Amazon EMR?

    Amazon EMR (Elastic MapReduce) is used for data analysis, web indexing, data warehousing, financial analysis, and scientific simulation. It provides a managed platform for running data processing frameworks such as Apache Hadoop, Presto, and Apache Spark securely and cost-effectively. Its main benefits are:

  • Reliability – Amazon EMR monitors the cluster, retries failed tasks, and replaces poorly performing instances.
  • Elasticity – Amazon EMR can provision as many compute instances as needed to process data at any scale.
  • Flexibility – Amazon EMR gives you complete control over your clusters, including root access to every instance.
  • Security – Amazon EMR configures EC2 firewall settings to control network access to the instances and can launch clusters in an Amazon VPC, among other features.

    Security and Data Access Control

    Q: How do I prevent other people from viewing my data during cluster execution?

    Amazon EMR starts your instances in two Amazon EC2 security groups, one for the master and another for the other cluster nodes. The master security group has a port open for communication with the service. It also has the SSH port open to allow you to SSH into the instances, using the key specified at startup. The other nodes start in a separate security group, which only allows interaction with the master instance. By default, both security groups are set up to not allow access from external sources, including Amazon EC2 instances belonging to other customers. Since these are security groups within your account, you can reconfigure them using the standard EC2 tools or dashboard. Additionally, you can configure Amazon EMR block public access in each region that you use to prevent cluster creation if a rule allows public access on any port that you don't add to a list of exceptions.
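    The block public access check described above can be approximated in a few lines. This is a sketch of the idea only: the rule dictionaries use a hypothetical, simplified shape (`from_port`, `to_port`, `cidr`), and the exception list containing port 22 is an assumption made for the example.

```python
# CIDR ranges that mean "anyone on the internet" for IPv4 and IPv6.
PUBLIC_CIDRS = {"0.0.0.0/0", "::/0"}


def public_access_violations(rules, allowed_ports=frozenset({22})):
    """Return the rules that expose a non-exempt port to a public CIDR.

    `rules` is an iterable of dicts with `from_port`, `to_port`, and `cidr`
    keys (a simplified, hypothetical shape for this example).
    """
    offending = []
    for rule in rules:
        if rule["cidr"] not in PUBLIC_CIDRS:
            continue  # private sources are not a public-access concern
        ports = range(rule["from_port"], rule["to_port"] + 1)
        if any(port not in allowed_ports for port in ports):
            offending.append(rule)
    return offending


# Illustrative inbound rules, not taken from a real security group.
SAMPLE_RULES = [
    {"from_port": 22, "to_port": 22, "cidr": "0.0.0.0/0"},      # exempt port
    {"from_port": 8088, "to_port": 8088, "cidr": "0.0.0.0/0"},  # public, not exempt
    {"from_port": 8088, "to_port": 8088, "cidr": "10.0.0.0/16"},  # private source
]
```

    In this sample, only the rule that opens port 8088 to `0.0.0.0/0` is reported; a check like this would refuse to create the cluster until that rule is removed or the port is added to the exceptions.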

    Q: How secure is my data?

    Amazon S3 provides authentication mechanisms to ensure that stored data is secured against unauthorized access. Unless the customer who is uploading the data specifies otherwise, only that customer can access the data. Amazon EMR customers can also choose to send data to Amazon S3 using the HTTPS protocol for secure transmission. In addition, Amazon EMR always uses HTTPS to send data between Amazon S3 and Amazon EC2. For added security, customers may encrypt the input data before they upload it to Amazon S3 (using any common data encryption tool); they then need to add a decryption step to the beginning of their cluster when Amazon EMR fetches the data from Amazon S3.

    Q: Can I get a history of all EMR API calls made on my account for security or compliance auditing?

    Yes. AWS CloudTrail is a web service that records AWS API calls for your account and delivers log files to you. The AWS API call history produced by CloudTrail enables security analysis, resource change tracking, and compliance auditing. You can learn more on the AWS CloudTrail detail page and turn it on from the CloudTrail section of the AWS Management Console.
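    Once CloudTrail delivers its log files, the EMR calls can be picked out by their event source. The sketch below assumes a log file that has already been downloaded and decompressed; the sample records are made up for illustration and contain only a few of the fields found in a real log entry.

```python
import json

# Event source that CloudTrail records for Amazon EMR API calls.
EMR_EVENT_SOURCE = "elasticmapreduce.amazonaws.com"


def emr_api_calls(log_text):
    """Return (eventTime, eventName) pairs for EMR calls in one CloudTrail log file."""
    records = json.loads(log_text).get("Records", [])
    return [
        (record["eventTime"], record["eventName"])
        for record in records
        if record.get("eventSource") == EMR_EVENT_SOURCE
    ]


# Made-up sample log content, trimmed to the fields used above.
SAMPLE_LOG = json.dumps({
    "Records": [
        {"eventTime": "2024-01-01T10:00:00Z",
         "eventSource": "elasticmapreduce.amazonaws.com",
         "eventName": "RunJobFlow"},
        {"eventTime": "2024-01-01T10:05:00Z",
         "eventSource": "s3.amazonaws.com",
         "eventName": "PutObject"},
        {"eventTime": "2024-01-01T11:00:00Z",
         "eventSource": "elasticmapreduce.amazonaws.com",
         "eventName": "TerminateJobFlows"},
    ]
})
```

    Calling `emr_api_calls(SAMPLE_LOG)` keeps the two EMR entries and drops the S3 one, which is the kind of filtered history an audit would start from.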

    Q: How do I control what EMR users can access in Amazon S3?

    By default, Amazon EMR application processes use EC2 instance profiles when they call other AWS services. For multi-tenant clusters, Amazon EMR offers three options to manage user access to Amazon S3 data.

    Q: How does Amazon EMR make use of Availability Zones?

    Amazon EMR launches all nodes for a given cluster in the same Amazon EC2 Availability Zone. Running a cluster in a single zone improves the performance of its job flows. By default, Amazon EMR chooses the Availability Zone with the most available resources in which to run your cluster. However, you can specify another Availability Zone if required. You also have the option to optimize your allocation for the lowest-priced On-Demand Instances or for optimal Spot capacity, or to use On-Demand Capacity Reservations.

    Q: In what Regions is Amazon EMR available?

    For a list of the supported Amazon EMR AWS regions, please visit the AWS Region Table for all AWS global infrastructure.

    Q: Is Amazon EMR supported in AWS Local Zones?

    EMR supports launching clusters in the Los Angeles AWS Local Zone. You can use EMR in the US West (Oregon) region to launch clusters into subnets associated with the Los Angeles AWS Local Zone.

    Q: Which Region should I select to run my clusters?

    When creating a cluster, typically you should select the Region where your data is located.

    Q: Can I use EU data in a cluster running in the US region and vice versa?

    Yes, you can. If you transfer data from one region to another, you will incur bandwidth charges. For bandwidth pricing information, please visit the pricing section on the EC2 detail page.

    Q: What is different about the AWS GovCloud (US) region?

    The AWS GovCloud (US) region is designed for US government agencies and customers. It adheres to US ITAR requirements. In GovCloud, EMR does not support Spot Instances or the enable-debugging feature. The EMR Management Console is not yet available in GovCloud.

    How can we deploy Amazon EMR?

    We can deploy Amazon EMR workloads on Amazon EC2, on Amazon EKS (Elastic Kubernetes Service), or on premises with AWS Outposts. We can run and manage these workloads through the EMR console, API, CLI, or SDK, and orchestrate them with Amazon Managed Workflows for Apache Airflow or AWS Step Functions. Frameworks that run on EMR include:

  • Apache Hadoop, used for processing large datasets.
  • Apache Spark, used for big data workloads, with execution optimized for general batch processing.
  • Apache HBase, a big data store that is part of the Hadoop ecosystem.
  • Presto, used for processing data from various data stores, including HDFS (Hadoop Distributed File System).

    FAQ

    Does EMR use EC2?

    Amazon EMR can quickly process large amounts of data using Amazon EC2. Users can configure Amazon EMR to take advantage of On-Demand, Reserved and Spot Instances.
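    As an illustration, the mix of purchasing options can be written down as an instance group layout. The sketch below only builds plain dictionaries shaped like the `InstanceGroups` entries accepted by the EMR `RunJobFlow` API; it makes no AWS call, and the helper name, group names, instance types, and counts are all hypothetical.

```python
VALID_ROLES = {"MASTER", "CORE", "TASK"}
VALID_MARKETS = {"ON_DEMAND", "SPOT"}


def instance_group(name, role, instance_type, count, market="ON_DEMAND"):
    """Build one instance group entry, validating the role and market values."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown instance role: {role}")
    if market not in VALID_MARKETS:
        raise ValueError(f"unknown market: {market}")
    return {
        "Name": name,
        "InstanceRole": role,
        "InstanceType": instance_type,
        "InstanceCount": count,
        "Market": market,
    }


# Keep the master and core nodes on On-Demand capacity and put the
# interruptible task nodes on Spot capacity.
INSTANCE_GROUPS = [
    instance_group("master", "MASTER", "m5.xlarge", 1),
    instance_group("core", "CORE", "m5.xlarge", 2),
    instance_group("task-spot", "TASK", "m5.xlarge", 4, market="SPOT"),
]
```

    Reserved Instances do not appear as a market value here because they are a billing discount applied to matching On-Demand usage rather than a separate way of launching instances.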

    What is the main use of EMR in AWS?

    Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.

    What is EMR role?

    The EMR role defines the allowable actions for Amazon EMR when provisioning resources and performing service-level tasks that are not performed in the context of an EC2 instance running within a cluster. For example, the service role is used to provision EC2 instances when a cluster launches.
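    For the service to act on your behalf, the role has to trust the EMR service principal. A minimal sketch of such a trust policy, built as a Python dictionary, is shown below; the helper name is hypothetical, and the permissions attached to the role (what EMR may actually do) are a separate policy that is not shown.

```python
import json


def service_trust_policy(service_principal="elasticmapreduce.amazonaws.com"):
    """Return an IAM trust policy that lets the given service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# JSON form of the policy, as it would be pasted into a role definition.
EMR_TRUST_POLICY_JSON = json.dumps(service_trust_policy(), indent=2)
```

    The same helper with `"ec2.amazonaws.com"` as the principal gives the trust policy shape used by an EC2 instance profile role, which is the role that cluster instances use when they call other AWS services.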
