#delimiter in hive table
thebadaljain · 5 years ago
Hive Managed Table Creation
CREATE DATABASE db1;
USE db1;
CREATE TABLE tb1 (col1 STRING, col2 STRING, year STRING, price FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
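To confirm the comma delimiter is applied, you can load a comma-separated file and query the table back. A minimal sketch, assuming a hypothetical local file /tmp/prices.csv with four comma-separated fields:

LOAD DATA LOCAL INPATH '/tmp/prices.csv' INTO TABLE tb1;
SELECT col1, col2, year, price FROM tb1 LIMIT 5;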
siva3155 · 6 years ago
300+ TOP BIG DATA Interview Questions and Answers
BIG Data Interview Questions for Freshers and Experienced:
1. What is Big Data? Big Data is a relative term. When data is generated at very high speed and can no longer be handled by conventional systems such as an RDBMS, it is known as Big Data.
2. Why Big Data? Since data is growing rapidly and an RDBMS cannot keep up with it, Big Data technologies came into the picture.
3. What are the 3 core dimensions of Big Data? Big Data has three core dimensions: Volume, Variety and Velocity.
4. Role of Volume in Big Data: Volume is simply the amount of data. As data is generated at high speed, a huge volume of data is produced every second.
5. Role of Variety in Big Data: many kinds of applications run nowadays (mobile apps, mobile sensors, etc.), and each generates data in a different variety of formats.
6. Role of Velocity in Big Data: Velocity is the speed at which data is generated. For example, Instagram receives about 46,740 new photos every minute, and the rate of data generation keeps increasing day by day.
7. Remaining 2 lesser-known dimensions of Big Data: there are two more V's of Big Data: Veracity and Value.
8. Role of Veracity in Big Data: Veracity is the accuracy of data. Big Data should contain accurate data for processing to be worthwhile.
9. Role of Value in Big Data: Big Data should carry some value for us. Junk values/data are not considered real Big Data.
10. What is Hadoop? Hadoop is an Apache project. It is an open-source framework used for storing Big Data and then processing it.
BIG DATA Interview Questions
11. Why Hadoop? In order to process Big Data, we need some framework. Hadoop is an open-source framework owned by the Apache organization, and it is the basic requirement when we think about processing Big Data.
12. Connection between Hadoop and Big Data: Big Data is processed using a framework, and that framework is Hadoop.
13. Hadoop and the Hadoop Ecosystem: the Hadoop Ecosystem is a combination of various components. The components under the Hadoop Ecosystem's umbrella include HDFS, YARN, MapReduce, Pig, Hive, Sqoop, etc.
14. What is HDFS? HDFS stands for Hadoop Distributed File System. Just as every system has a file system for viewing and managing stored files, Hadoop has HDFS, which works in a distributed manner.
15. Why HDFS? HDFS is the core component of the Hadoop Ecosystem. Since Hadoop is a distributed framework and HDFS is a distributed file system, the two are well matched.
16. What is YARN? YARN stands for Yet Another Resource Negotiator. It is a component of Apache Hadoop.
17. Use of YARN: YARN is used for managing resources; jobs are scheduled through YARN in Apache Hadoop.
18. What is MapReduce? MapReduce is a programming approach consisting of two steps: Map and Reduce. MapReduce is the core of Apache Hadoop.
19. Use of MapReduce: MapReduce is the programming approach used to process our data; it is what is used to process Big Data.
20. What is Pig? Pig is an Apache project. It is a platform for analyzing huge datasets, and it runs on top of MapReduce.
21. Use of Pig: Pig is used to analyze huge datasets. Data flows are created with Pig in order to analyze data, using the Pig Latin language.
22. What is Pig Latin? Pig Latin is the scripting language used in Apache Pig to create data flows for analyzing data.
23. What is Hive? Hive is an Apache Hadoop project. It is data warehouse software that runs on top of Hadoop.
24. Use of Hive: Hive acts as a storage layer used to store structured data. It is a very useful and convenient tool for SQL users because Hive uses HQL.
25. What is HQL? HQL is an abbreviation of Hive Query Language. It is designed for users who are comfortable with SQL, and it is used to query structured data in Hive.
26. What is Sqoop? Sqoop is short for "SQL to Hadoop". It is essentially a command-line tool to transfer data between Hadoop and SQL databases, in both directions.
Q27) Use of Sqoop? Sqoop is a CLI tool used to migrate data between an RDBMS and Hadoop, and vice versa (a sample import command is sketched just after this list).
Q28) What are other components of the Hadoop Ecosystem? Other components include: a) HBase b) Oozie c) Zookeeper d) Flume, etc.
Q29) Difference between Hadoop and HDFS: Hadoop is a framework, while HDFS is the file system component that Hadoop builds on.
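As an illustration of the Sqoop usage described in Q26 and Q27, a minimal import command might look like the following; the MySQL host, database, table and credentials are hypothetical:

sqoop import \
  --connect jdbc:mysql://localhost/testdb \
  --username hadoop --password hadoop \
  --table emp \
  --target-dir /user/hadoop/emp \
  --fields-terminated-by ',' \
  -m 1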
Q30) How to access HDFS? Command: hadoop fs or hdfs dfs
Q31) How to create a directory in HDFS? Command: hdfs dfs -mkdir <path>
Q32) How to put files into HDFS? Command: hdfs dfs -put <local file> <hdfs path> or hdfs dfs -copyFromLocal <local file> <hdfs path>
Q33) How to copy a file from HDFS to local? Command: hdfs dfs -copyToLocal <hdfs path> <local path>
Q34) How to delete a directory from HDFS? Command: hdfs dfs -rm -r <path> (or hdfs dfs -rmdir for an empty directory)
Q35) How to delete a file from HDFS? Command: hdfs dfs -rm <file>
Q36) How to delete directories and files recursively from HDFS? Command: hdfs dfs -rm -r <path>
Q37) How to read a file in HDFS? Command: hdfs dfs -cat <file>
Managed/internal table: once the table is dropped, both the metadata and the actual data are deleted.
External table: once the table is dropped, only the metadata is deleted, not the actual data.
Q63) How to create a managed table in Hive?
hive> create table student(sname string, sid int) row format delimited fields terminated by ',';
hive> describe student;
Q64) How to load data into the table created in Hive?
hive> load data local inpath '/home/training/simple.txt' into table student;
hive> select * from student;
Q65) How to create and load data into external tables?
Without a location:
hive> create external table student(sname string, sid int) row format delimited fields terminated by ',';
hive> load data local inpath '/home/training/simple.txt' into table student;
With a location:
hive> create external table student(sname string, sid int) row format delimited fields terminated by ',' location '/Besant_HDFS';
Here there is no need for a LOAD command.
Q66) Write a command to create a statically partitioned table.
hive> create table student(sname string, sid int) partitioned by (year int) row format delimited fields terminated by ',';
Q67) How to load a file into a static partition?
hive> load data local inpath '/home/training/simple2018.txt' into table student partition(year=2018);
Q68) Write the commands to build a dynamically partitioned table.
Answer:
Create a normal table (including the partition column as an ordinary column):
hive> create table student(sname string, sid int, year int) row format delimited fields terminated by ',';
Load data:
hive> load data local inpath '/home/training/studentall.txt' into table student;
Create a partitioned table:
hive> create table student_partition(sname string, sid int) partitioned by (year int) row format delimited fields terminated by ',';
Enable dynamic partitioning:
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
Insert data:
hive> insert into table student_partition partition(year) select sname, sid, year from student;
Drop the normal table:
hive> drop table student;
Q69) What is Pig? Answer: Pig is an abstraction over MapReduce. It is a tool used to work with huge amounts of structured and semi-structured data.
Q70) What is an atom in Pig? A small piece of data or a single field, e.g. 'shilpa'.
Q71) What is a tuple? An ordered set of fields, e.g. (shilpa, 100).
Q72) What is a bag in Pig? An unordered set of tuples, e.g. {(sh,1),(ww,ww)}.
Q73) What is a relation? A bag of tuples.
Q74) What is HBase? It is a distributed, column-oriented database built on top of the Hadoop file system, and it is horizontally scalable.
Q75) Difference between HBase and an RDBMS: an RDBMS is schema-based, HBase is not; an RDBMS holds only structured data, HBase holds structured and semi-structured data; an RDBMS supports transactions, HBase does not.
Q76) What is a table in HBase? A collection of rows.
Q77) What is a row in HBase? A collection of column families.
Q78) What is a column family in HBase? Answer: a collection of columns.
Q79) What is a column?
Answer:collection of key value pair Q80) How to start hbase services? Answer: >hbase shell hbase>start -hbase.sh Q81) DDL commands used in hbase? Answer: create alter drop drop_all exists list enable is_enabled? disable is_disbled? Q82) DML commands? Answer: put get scan delete delete_all Q83) What services run after running hbase job? Answer: Name node data node secondary NN JT TT Hmaster HRegionServer HQuorumPeer Q84) How to create table in hbase? Answer:>create ’emp’, ‘cf1′,’cf2’ Q85) How to list elements Answer:>scan ’emp’ Q86) Scope operators used in hbase? Answer: MAX_FILESIZE READONLY MEMSTORE_FLUSHSIZE DEFERRED_LOG_FLUSH Q87) What is sqoop? sqoop is an interface/tool between RDBMS and HDFS to importa nd export data Q88) How many default mappers in sqoop? 4 Q89) What is map reduce? map reduce is a data processing technique for distributed computng base on java map stage reduce stage Q90) list few componets that are using big data Answer: facebook adobe yahoo twitter ebay Q91) Write a quert to import a file in sqoop $>sqoop-import –connect jdbc:mysql://localhost/Besant username hadoop password hadoop table emp target_dir sqp_dir fields_terminated_by ‘,’ m 1 Q92) What is context in map reduce? it is an object having the information about hadoop configuration Q93) How job is started in map reduce? To start a job we need to create a configuration object. configuration c = new configuration(); Job j = new Job(c,”wordcount calculation); Q94) How to load data in pig? A= load ‘/home/training/simple.txt’ using PigStorage ‘|’ as (sname : chararray, sid: int, address:chararray); Q95) What are the 2 modes used to run pig scripts? local mode pig -x local pig -x mapreduce Q96) How to show up details in pig ? dump command is used. grunt>dump A; Q97) How to fetch perticular columns in pig? B = foreach A generate sname, sid; Q100) How to restrict the number of lines to be printed in pig ? c=limit B 2; Get Big Data Hadoop Online Training Q101) Define Big Data Big Data is defined as a collection of large and complex of unstructured data sets from where insights are derived from the Data Analysis using open-source tools like Hadoop. Q102) Explain The Five Vs of Big Data The five Vs of Big Data are – Volume – Amount of data in the Petabytes and Exabytes Variety – Includes formats like an videos, audio sources, textual data, etc. Velocity – Everyday data growth which are includes conversations in forums,blogs,social media posts,etc. Veracity – Degree of accuracy of data are available Value – Deriving insights from collected data to the achieve business milestones and new heights Q103) How is Hadoop related to the Big Data ? Describe its components? Apache Hadoop is an open-source framework used for the storing, processing, and analyzing complex unstructured data sets for the deriving insights and actionable intelligence for businesses. The three main components of Hadoop are- MapReduce – A programming model which processes large datasets in the parallel HDFS – A Java-based distributed file system used for the data storage without prior organization YARN – A framework that manages resources and handles requests from the distributed applications Q104) Define HDFS and talk about their respective components? The Hadoop Distributed File System (HDFS) is the storage unit that’s responsible for the storing different types of the data blocks in the distributed environment. 
The two main components of HDFS are- NameNode – A master node that processes of metadata information for the data blocks contained in the HDFS DataNode – Nodes which act as slave nodes and a simply store the data, for use and then processing by the NameNode. Q105) Define YARN, and talk about their respective components? The Yet Another Resource Negotiator (YARN) is the processing component of the Apache Hadoop and is responsible for managing resources and providing an execution environment for said of processes. The two main components of YARN are- ResourceManager– Receives processing requests and allocates its parts to the respective Node Managers based on processing needs. Node Manager– Executes tasks on the every single Data Node Q106) Explain the term ‘Commodity Hardware? Commodity Hardware refers to hardware and components, collectively needed, to run the Apache Hadoop framework and related to the data management tools. Apache Hadoop requires 64-512 GB of the RAM to execute tasks, and any hardware that supports its minimum for the requirements is known as ‘Commodity Hardware. Q107) Define the Port Numbers for NameNode, Task Tracker and Job Tracker? Name Node – Port 50070 Task Tracker – Port 50060 Job Tracker – Port 50030 Q108) How does HDFS Index Data blocks? Explain. HDFS indexes data blocks based on the their respective sizes. The end of data block points to address of where the next chunk of data blocks get a stored. The DataNodes store the blocks of datawhile the NameNode manages these data blocks by using an in-memory image of all the files of said of data blocks. Clients receive for the information related to data blocked from the NameNode. 109. What are Edge Nodes in Hadoop? Edge nodes are gateway nodes in the Hadoop which act as the interface between the Hadoop cluster and external network.They run client applications and cluster administration tools in the Hadoop and are used as staging areas for the data transfers to the Hadoop cluster. Enterprise-class storage capabilities (like 900GB SAS Drives with Raid HDD Controllers) is required for the Edge Nodes,and asingle edge node for usually suffices for multiple of Hadoop clusters. Q110) What are some of the data management tools used with the Edge Nodes in Hadoop? Oozie,Ambari,Hue,Pig and Flume are the most common of data management tools that work with edge nodes in the Hadoop. Other similar tools include to HCatalog,BigTop and Avro. Q111) Explain the core methods of a Reducer? There are three core methods of a reducer. They are- setup() – Configures different to parameters like distributed cache, heap size, and input data. reduce() – A parameter that is called once per key with the concerned on reduce task cleanup() – Clears all temporary for files and called only at the end of on reducer task. Q112) Talk about the different tombstone markers used for deletion purposes in HBase.? There are three main tombstone markers used for the deletion in HBase. They are- Family Delete Marker – Marks all the columns of an column family Version Delete Marker – Marks a single version of an single column Column Delete Marker– Marks all the versions of an single column Q113) How would you transform unstructured data into structured data? How to Approach: Unstructured data is the very common in big data. The unstructured data should be transformed into the structured data to ensure proper data are analysis. Q114) Which hardware configuration is most beneficial for Hadoop jobs? 
Dual processors or core machines with an configuration of 4 / 8 GB RAM and ECC memory is ideal for running Hadoop operations. However, the hardware is configuration varies based on the project-specific workflow and process of the flow and need to the customization an accordingly. Q115) What is the use of the Record Reader in Hadoop? Since Hadoop splits data into the various blocks, RecordReader is used to read the slit data into the single record. For instance, if our input data is the split like: Row1: Welcome to Row2: Besant It will be read as the “Welcome to Besant” using RecordReader. Q116) What is Sequencefilein put format? Hadoop uses the specific file format which is known as the Sequence file. The sequence file stores data in the serialized key-value pair. Sequencefileinputformat is an input format to the read sequence files. Q117) What happens when two users try to access to the same file in HDFS? HDFS NameNode supports exclusive on write only. Hence, only the first user will receive to the grant for the file access & second that user will be rejected. Q118) How to recover an NameNode when it’s are down? The following steps need to execute to the make the Hadoop cluster up and running: Use the FsImage which is file system for metadata replicate to start an new NameNode. Configure for the DataNodes and also the clients to make them acknowledge to the newly started NameNode. Once the new NameNode completes loading to the last for checkpoint FsImage which is the received to enough block reports are the DataNodes, it will start to serve the client. In case of large of Hadoop clusters, the NameNode recovery process to consumes a lot of time which turns out to be an more significant challenge in case of the routine maintenance. Q119) What do you understand by the Rack Awareness in Hadoop? It is an algorithm applied to the NameNode to decide then how blocks and its replicas are placed. Depending on the rack definitions network traffic is minimized between DataNodes within the same of rack. For example, if we consider to replication factor as 3, two copies will be placed on the one rack whereas the third copy in a separate rack. Q120) What are the difference between of the “HDFS Block” and “Input Split”? The HDFS divides the input data physically into the blocks for processing which is known as the HDFS Block. Input Split is a logical division of data by the mapper for mapping operation. Q121) DFS can handle a large volume of data then why do we need Hadoop framework? Hadoop is not only for the storing large data but also to process those big data. Though DFS (Distributed File System) tool can be store the data, but it lacks below features- It is not fault for tolerant Data movement over the network depends on bandwidth. Q122) What are the common input formats are Hadoop? Text Input Format – The default input format defined in the Hadoop is the Text Input Format. Sequence File Input Format – To read files in the sequence, Sequence File Input Format is used. Key Value Input Format – The input format used for the plain text files (files broken into lines) is the Key Value for Input Format. Q123) Explain some important features of Hadoop? Hadoop supports are the storage and processing of big data. It is the best solution for the handling big data challenges. Some of important features of Hadoop are 1. Open Source – Hadoop is an open source framework which means it is available free of cost Also,the users are allowed to the change the source code as per their requirements. 2. 
Distributed Processing – Hadoop supports distributed processing of the data i.e. faster processing. The data in Hadoop HDFS is stored in the distributed manner and MapReduce is responsible for the parallel processing of data. 3. Fault Tolerance – Hadoop is the highly fault-tolerant. It creates three replicas for each block at different nodes, by the default. This number can be changed in according to the requirement. So, we can recover the data from the another node if one node fails. The detection of node of failure and recovery of data is done automatically. 4. Reliability – Hadoop stores data on the cluster in an reliable manner that is independent of the machine. So, the data stored in Hadoop environment is not affected by the failure of machine. 5. Scalability – Another important feature of Hadoop is the scalability. It is compatible with the other hardware and we can easily as the new hardware to the nodes. 6. High Availability – The data stored in Hadoop is available to the access even after the hardware failure. In case of hardware failure, the data can be accessed from the another path. Q124) Explain the different modes are which Hadoop run? Apache Hadoop runs are the following three modes – Standalone (Local) Mode – By default, Hadoop runs in the local mode i.e. on a non-distributed,single node. This mode use for the local file system to the perform input and output operation. This mode does not support the use of the HDFS, so it is used for debugging. No custom to configuration is needed for the configuration files in this mode. In the pseudo-distributed mode, Hadoop runs on a single of node just like the Standalone mode. In this mode, each daemon runs in the separate Java process. As all the daemons run on the single node, there is the same node for the both Master and Slave nodes. Fully – Distributed Mode – In the fully-distributed mode, all the daemons run on the separate individual nodes and thus the forms a multi-node cluster. There are different nodes for the Master and Slave nodes. Q125) What is the use of jps command in Hadoop? The jps command is used to the check if the Hadoop daemons are running properly or not. This command shows all the daemons running on the machine i.e. Datanode, Namenode, NodeManager, ResourceManager etc. Q126) What are the configuration parameters in the “MapReduce” program? The main configuration parameters in “MapReduce” framework are: Input locations of Jobs in the distributed for file system Output location of Jobs in the distributed for file system The input format of data The output format of data The class which contains for the map function The class which contains for the reduce function JAR file which contains for the mapper, reducer and the driver classes Q127) What is a block in HDFS? what is the default size in Hadoop 1 and Hadoop 2? Can we change the block size? Blocks are smallest continuous of data storage in a hard drive. For HDFS, blocks are stored across Hadoop cluster. The default block size in the Hadoop 1 is: 64 MB The default block size in the Hadoop 2 is: 128 MB Yes,we can change block size by using the parameters – dfs.block.size located in the hdfs-site.xml file. Q128) What is Distributed Cache in the MapReduce Framework? Distributed Cache is an feature of the Hadoop MapReduce framework to cache files for the applications. Hadoop framework makes cached files for available for every map/reduce tasks running on the data nodes. Hence, the data files can be access the cache file as the local file in the designated job. 
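For the block-size question above (Q127), the relevant property lives in hdfs-site.xml; dfs.blocksize is the current property name (dfs.block.size is the older, deprecated spelling). A minimal illustrative snippet, with example values rather than recommendations:

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 MB -->
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>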
Q129) What are the three running modes of the Hadoop? The three running modes of the Hadoop are as follows: Standalone or local: This is the default mode and doesn’t need any configuration. In this mode, all the following components for Hadoop uses local file system and runs on single JVM – NameNode DataNode ResourceManager NodeManager Pseudo-distributed: In this mode, all the master and slave Hadoop services is deployed and executed on a single node. Fully distributed: In this mode, Hadoop master and slave services is deployed and executed on the separate nodes. Q130) Explain JobTracker in Hadoop? JobTracker is a JVM process in the Hadoop to submit and track MapReduce jobs. JobTracker performs for the following activities in Hadoop in a sequence – JobTracker receives jobs that an client application submits to the job tracker JobTracker notifies NameNode to determine data node JobTracker allocates TaskTracker nodes based on the available slots. It submits the work on the allocated TaskTracker Nodes, JobTracker monitors on the TaskTracker nodes. Q131) What are the difference configuration files in Hadoop? The different configuration files in Hadoop are – core-site.xml – This configuration file of contains Hadoop core configuration settings, for example, I/O settings, very common for the MapReduce and HDFS. It uses hostname an port. mapred-site.xml – This configuration file specifies a framework name for MapReduce by the setting mapreduce.framework.name hdfs-site.xml – This configuration file contains of HDFS daemons configuration for settings. It also specifies default block for permission and replication checking on HDFS. yarn-site.xml – This configuration of file specifies configuration settings for the ResourceManager and NodeManager. Q132) What are the difference between Hadoop 2 and Hadoop 3? Following are the difference between Hadoop 2 and Hadoop 3 – Kerberos are used to the achieve security in Hadoop. There are 3 steps to access an service while using Kerberos, at a high level. Each step for involves a message exchange with an server. Authentication – The first step involves authentication of the client to authentication server, and then provides an time-stamped TGT (Ticket-Granting Ticket) to the client. Authorization – In this step, the client uses to received TGT to request a service ticket from the TGS (Ticket Granting Server) Service Request – It is the final step to the achieve security in Hadoop. Then the client uses to service ticket to authenticate an himself to the server. Q133) What is commodity hardware? Commodity hardware is an low-cost system identified by the less-availability and low-quality. The commodity hardware for comprises of RAM as it performs an number of services that require to RAM for the execution. One doesn’t require high-end hardware of configuration or super computers to run of Hadoop, it can be run on any of commodity hardware. Q134) How is NFS different from HDFS? There are a number of the distributed file systems that work in their own way. NFS (Network File System) is one of the oldest and popular distributed file an storage systems whereas HDFS (Hadoop Distributed File System) is the recently used and popular one to the handle big data. Q135) How do Hadoop MapReduce works? There are two phases of the MapReduce operation. Map phase – In this phase, the input data is split by the map tasks. The map tasks run in the parallel. These split data is used for analysis for purpose. 
Reduce phase – In this phase, the similar split data is the aggregated from the entire to collection and shows the result. Q136) What is MapReduce? What are the syntax you use to run a MapReduce program? MapReduce is a programming model in the Hadoop for processing large data sets over an cluster of the computers, commonly known as the HDFS. It is a parallel to programming model. The syntax to run a MapReduce program is the hadoop_jar_file.jar /input_path /output_path. Q137) What are the different file permissions in the HDFS for files or directory levels? Hadoop distributed file system (HDFS) uses an specific permissions model for files and directories. 1. Following user levels are used in HDFS – Owner Group Others. 2. For each of the user on mentioned above following permissions are applicable – read (r) write (w) execute(x). 3. Above mentioned permissions work on differently for files and directories. For files The r permission is for reading an file The w permission is for writing an file. For directories The r permission lists the contents of the specific directory. The w permission creates or deletes the directory. The X permission is for accessing the child directory. Q138) What are the basic parameters of a Mapper? The basic parameters of a Mapper is the LongWritable and Text and Int Writable Q139) How to restart all the daemons in Hadoop? To restart all the daemons, it is required to the stop all the daemons first. The Hadoop directory contains sbin as directory that stores to the script files to stop and start daemons in the Hadoop. Use stop daemons command /sbin/stop-all.sh to the stop all the daemons and then use /sin/start-all.sh command to start all the daemons again. Q140) Explain the process that overwrites the replication factors in HDFS? There are two methods to the overwrite the replication factors in HDFS – Method 1: On File Basis In this method, the replication factor is the changed on the basis of file using to Hadoop FS shell. The command used for this is: $hadoop fs – setrep –w2/my/test_file Here, test_file is the filename that’s replication to factor will be set to 2. Method 2: On Directory Basis In this method, the replication factor is changed on the directory basis i.e. the replication factor for all the files under the given directory is modified. $hadoop fs –setrep –w5/my/test_dir Here, test_dir is the name of the directory, then replication factor for the directory and all the files in it will be set to 5. Q141) What will happen with a NameNode that doesn’t have any data? A NameNode without any for data doesn’t exist in Hadoop. If there is an NameNode, it will contain the some data in it or it won’t exist. Q142) Explain NameNode recovery process? The NameNode recovery process involves to the below-mentioned steps to make for Hadoop cluster running: In the first step in the recovery process, file system metadata to replica (FsImage) starts a new NameNode. The next step is to configure DataNodes and Clients. These DataNodes and Clients will then acknowledge of new NameNode. During the final step, the new NameNode starts serving to the client on the completion of last checkpoint FsImage for loading and receiving block reports from the DataNodes. Note: Don’t forget to mention, this NameNode recovery to process consumes an lot of time on large Hadoop clusters. Thus, it makes routine maintenance to difficult. For this reason, HDFS high availability architecture is recommended to use. Q143) How Is Hadoop CLASSPATH essential to start or stop Hadoop daemons? 
CLASSPATH includes necessary directories that contain the jar files to start or stop Hadoop daemons. Hence, setting the CLASSPATH is essential to start or stop on Hadoop daemons. However, setting up CLASSPATH every time its not the standard that we follow. Usually CLASSPATH is the written inside /etc/hadoop/hadoop-env.sh file. Hence, once we run to Hadoop, it will load the CLASSPATH is automatically. Q144) Why is HDFS only suitable for large data sets and not the correct tool to use for many small files? This is due to the performance issue of the NameNode.Usually, NameNode is allocated with the huge space to store metadata for the large-scale files. The metadata is supposed to be an from a single file for the optimum space utilization and cost benefit. In case of the small size files, NameNode does not utilize to the entire space which is a performance optimization for the issue. Q145) Why do we need Data Locality in Hadoop? Datasets in HDFS store as the blocks in DataNodes the Hadoop cluster. During the execution of the MapReducejob the individual Mapper processes to the blocks (Input Splits). If the data does not reside in the same node where the Mapper is the executing the job, the data needs to be copied from DataNode over the network to mapper DataNode. Now if an MapReduce job has more than 100 Mapper and each Mapper tries to copy the data from the other DataNode in cluster simultaneously, it would cause to serious network congestion which is an big performance issue of the overall for system. Hence, data proximity are the computation is an effective and cost-effective solution which is the technically termed as Data locality in the Hadoop. It helps to increase the overall throughput for the system. Enroll Now! Q146) What’s Big Big Data or Hooda? Only a concept that facilitates handling large data databases. Hadoop has a single framework for dozens of tools. Hadoop is primarily used for block processing. The difference between Hadoop, the largest data and open source software, is a unique and basic one. Q147) Big data is a good life? Analysts are increasing demand for industry and large data buildings. Today, many people are looking to pursue their large data industry by having great data jobs like freshers. However, the larger data itself is just a huge field, so it’s just Hadoop jobs for freshers Q148) What is the great life analysis of large data analysis? The large data analytics has the highest value for any company, allowing it to make known decisions and give the edge among the competitors. A larger data career increases the opportunity to make a crucial decision for a career move. Q149) Hope is a NoSQL? Hadoop is not a type of database, but software software that allows software for computer software. It is an application of some types, which distributes noSQL databases (such as HBase), allowing thousands of servers to provide data in lower performance to the rankings Q150) Need Hodop to Science? Data scientists have many technical skills such as Hadoto, NoSQL, Python, Spark, R, Java and more. … For some people, data scientist must have the ability to manage using Hoodab alongside a good skill to run statistics against data set. Q151)What is the difference between large data and large data analysis? On the other hand, data analytics analyzes structured or structured data. Although they have a similar sound, there are no goals. 
… Great data is a term of very large or complex data sets that are not enough for traditional data processing applications Q152) Why should you be a data inspector? A data inspector’s task role involves analyzing data collection and using various statistical techniques. … When a data inspector interviewed for the job role, the candidates must do everything they can to see their communication skills, analytical skills and problem solving skills Q153) Great Data Future? Big data refers to the very large and complex data sets for traditional data entry and data management applications. … Data sets continue to grow and applications are becoming more and more time-consuming, with large data and large dataprocessing cloud moving more Q154) What is a data scientist on Facebook? This assessment is provided by 85 Facebook data scientist salary report (s) employees or based on statistical methods. When a factor in bonus and extra compensation, a data scientist on Facebook expected an average of $ 143,000 in salary Q155) Can Hedop Transfer? HODOOP is not just enough to replace RDGMS, but it is not really what you want to do. … Although it has many advantages to the source data fields, Hadoopcannot (and usually does) replace a data warehouse. When associated with related databases. However, this creates a powerful and versatile solution. Get Big Data Hadoop Course Now! Q156) What’s happening in Hadoop? MapReduce is widely used in I / O forms, a sequence file is a flat file containing binary key / value pairs. Graphical publications are stored locally in sequencer. It provides Reader, Writer and Seater classes. The three series file formats are: Non-stick key / value logs. Record key / value records are compressed – only ‘values’ are compressed here. Pressing keys / value records – ‘Volumes’ are collected separately and shortened by keys and values. The ‘volume’ size can be configured. Q157) What is the Work Tracker role in Huda? The task tracker’s primary function, resource management (managing work supervisors), resource availability and monitoring of the work cycle (monitoring of docs improvement and wrong tolerance). This is a process that runs on a separate terminal, not often in a data connection. The tracker communicates with the label to identify the location of the data. The best mission to run tasks at the given nodes is to find the tracker nodes. Track personal work trackers and submit the overall job back to the customer. MapReduce works loads from the slush terminal. Q158) What is the RecordReader application in Hutch? Since the Hadoop data separates various blocks, recordReader is used to read split data in a single version. For example, if our input data is broken: Row1: Welcome Row2: Intellipaat It uses “Welcome to Intellipaat” using RecordReader. Q159)What is Special Execution in Hooda? A range of Hadoop, some sloping nodes, are available to the program by distributing tasks at many ends. Tehre is a variety of causes because the tasks are slow, which are sometimes easier to detect. Instead of identifying and repairing slow-paced tasks, Hopep is trying to find out more slowly than he expected, then backs up the other equivalent task. Hadoop is the insulation of this backup machine spectrum. This creates a simulated task on another disk. You can activate the same input multiple times in parallel. After most work in a job, the rest of the functions that are free for the time available are the remaining jobs (slowly) copy copy of the splash execution system. 
When these tasks end, it is reported to JobTracker. If other copies are encouraging, Hudhoft dismays the tasktakers and dismiss the output. Hoodab is a normal natural process. To disable, set mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution Invalid job options Q160) What happens if you run a hood job? It will throw an exception that the output file directory already exists. To run MapReduce task, you need to make sure you do not have a pre-release directory in HDFS. To delete the directory before you can work, you can use the shell: Hadoop fs -rmr / path / to / your / output / or via Java API: FileSystem.getlocal (conf) .delete (outputDir, true); Get Hadoop Course Now! Q161) How can you adjust the Hadoopo code? Answer: First, check the list of currently running MapReduce jobs. Next, we need to see the orphanage running; If yes, then you have to determine the location of the RM records. Run: “ps -ef | grep -I ResourceManager” And search result log in result displayed. Check the job-id from the displayed list and check if there is any error message associated with the job. Based on RM records, identify the employee tip involved in executing the task. Now, log on to that end and run – “ps -ef | grep -iNodeManager” Check the tip manager registration. Major errors reduce work from user level posts for each diagram. Q162) How should the reflection factor in FFAS be constructed? Answer: Hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property on hdfs-site.xml will change the default response to all files in HDFS. You can change the reflection factor based on a file you are using Hadoop FS shell: $ hadoopfs -setrep -w 3 / n / fileConversely, You can also change the reflection factors of all the files under a single file. $ hadoopfs-setrep -w 3 -R / my / dir Now go through the Hadoop administrative practice to learn about the reflection factor in HDFS! Q163) How to control the release of the mapper, but does the release issue not? Answer: To achieve this summary, you must set: conf.set (“mapreduce.map.output.compress”, true) conf.set (“mapreduce.output.fileoutputformat.compress”, incorrect) Q164) Which companies use a hoop? Learn how Big Data and HADOOP have changed the rules of the game in this blog post. Yahoo (the largest contribution to the creation of the hawkop) – Yahoo search engine created for Hadoop, Facebook – Analytics, Amazon, Netflix, Adobe, Ebay, Spadys, Twitter, Adobe. Q165) Do I have to know Java to learn the habit? The ability of MapReduce in Java is an additional plus but not needed. … learn the Hadoop and create an excellent business with Hadoo, knowing basic basic knowledge of Linux and Java Basic Programming Policies Q166) What should you consider when using the second name line? Secondary mode should always be used on a separate separate computer. This prevents intermittent interaction with the mainstream. Q167) Name the Hadoop code as executable modes. There are various methods to run the Hadoop code – Fully distributed method Pseudosiphrit method Complete mode Q168)Name the operating system supported by the hadoop operation. Linux is the main operating system. However, it is also used as an electric power Windows operating system with some additional software. Q169) HDFS is used for applications with large data sets, not why Many small files? HDFS is more efficient for a large number of data sets, maintained in a file Compared to smaller particles of data stored in multiple files. 
Saving NameNode The file system metadata in RAM, the amount of memory that defines the number of files in the HDFS file System. In simpler terms, more files will generate more metadata, which means more Memory (RAM). It is recommended that you take 150 bytes of a block, file or directory metadata. Q170) What are the main features of hdfssite.xml? There are three important properties of hdfssite.xml: data.dr – Identify the location of the data storage. name.dr – Specify the location of the metadata storage and specify the DFS is located On disk or remote location. checkpoint.dir – for the second name name. Q171) What are the essential hooping tools that improve performance? Big data? Some of the essential hoopoe tools that enhance large data performance – Hive, HDFS, HBase, Avro, SQL, NoSQL, Oozie, Clouds, Flume, SolrSee / Lucene, and ZooKeeper Q172) What do you know about Fillil soon? The sequence is defined as a flat file containing the binary key or value pairs. This is important Used in MapReduce’s input / output format. Graphical publications are stored locally SequenceFile. Several forms of sequence – Summary of record key / value records – In this format, the values are compressed. Block compressed key / value records – In this format, the values and keys are individually The blocks are stored and then shortened. Sticky Key / Value Entries – In this format, there are no values or keys. Get 100% Placement Oriented Training in Hadoop! Q173) Explain the work tracker’s functions. In Hadoop, the work tracker’s performers perform various functions, such as – It manages resources, manages resources and manages life cycle Tasks. It is responsible for finding the location of the data by contacting the name Node. It performs tasks at the given nodes by finding the best worker tracker. Work Tracker Manages to monitor all task audits individually and then submit The overall job for the customer. It is responsible for supervising local servicemen from Macpute’s workplace Node. Q174) The FASAL is different from NAS? The following points distinguish HDFS from NAS – Hadoop shared file system (HDFS) is a distributed file system that uses data Network Attached Storage (NAS) is a file-wide server Data storage is connected to the computer network. HDFS distributes all databases in a distributed manner As a cluster, NAS saves data on dedicated hardware. HDFS makes it invaluable when using NAS using materials hardware Data stored on highhend devices that include high spending The HDFS work with MapReduce diagram does not work with MapReduce Data and calculation are stored separately. Q175)Does the HDFS go wrong? If so, how? Yes, HDFS is very mistaken. Whenever some data is stored in HDFS, name it Copying data (copies) to multiple databases. Normal reflection factor is 3. It needs to be changed according to your needs. If DataNode goes down, NameNode will take Copies the data from copies and copies it to another node, thus making the data available automatically. TheThe way, as the HDFS is the wrong tolerance feature and the fault tolerance Q176) Distinguish HDFS Block and Input Unit. The main difference between HDFS Block and Input Split is HDFS Black. While the precise section refers to the input sector, the business section of the data is knownData. For processing, HDFS first divides the data into blocks, and then stores all the packages Together, when MapReduce divides the data into the first input section then allocate this input and divide it Mapper function. 
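One hedged way to see the replication factor discussed above in practice is to ask HDFS to report the blocks and replicas behind a path (the path here is hypothetical):

hdfs fsck /user/hadoop/data -files -blocks -locations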
Q177) What happens when two clients try to access the same file on HDFS? Remember that HDFS supports specific characters Only at a time). NName client nameNode is the nameNode that gives the name Node Lease the client to create this file. When the second client sends the request to open the same file To write, the lease for those files is already supplied to another customer, and the name of the name Reject second customer request. Q178) What is the module in HDFS? The location for a hard drive or a hard drive to store data As the volume. Store data blocks in HDFS, and then distributed via the hoodo cluster. The entire file is divided into the first blocks and stored as separate units. Q179) What is Apache? YARN still has another resource negotiation. This is a hoodup cluster Management system. It is also the next generation introduced by MapReduce and Hoodab 2 Account Management and Housing Management Resource Management. It helps to further support the hoodoop Different processing approaches and wide-ranging applications. Q180) What is the terminal manager? Node Manager is TARStracker’s YARN equivalent. It takes steps from it Manages resourceManager and single-source resources. This is the responsibility Containers and ResourceManager monitor and report their resource usage. Each Single container processes operated at slavery pad are initially provided, monitored and tracked By the tip manager associated with the slave terminal. Q181) What is the recording of the Hope? In Hadoop, RecordReader is used to read a single log split data. This is important Combining data, Hatopo divides data into various editions. For example, if input data is separated Row1: Welcome Line 2: The Hoodah’s World Using RecordReader, it should be read as “Welcome to the Hope World”. Q182) Shorten up the mappers do not affect the Output release? Answer: In order to minimize the output of the maple, the output will not be affected and set as follows: Conf.set (“mapreduce.map.output.compress”, true) Conf.set (“mapreduce.output.fileoutputformat.compress”, incorrect) Q183) A Reducer explain different methods. Answer: Various methods of a Reducer include: System () – It is used to configure various parameters such as input data size. Syntax: general vacuum system (environment) Cleaning () – It is used to clean all temporary files at the end of the task. Syntax: General Vacuum Cleanup (Eco) Reduce () – This method is known in the heart of Rezar. This is used once A key to the underlying work involved. Syntax: reduce general void (key, value, environment) Q184) How can you configure the response factor in the HDFL? For the configuration of HDFS, the hdfssite.xml file is used. Change the default value The reflection factor for all the files stored in HDFS is transferred to the following asset hdfssite.xml dfs.replication Q185) What is the use of the “jps” command? The “Jps” command is used to verify that the Hadoop daemons state is running. TheList all hadoop domains running in the command line. Namenode, nodemanager, resource manager, data node etc Q186) What is the next step after Mapper or Mumpask?: The output of the map is sorted and the partitions for the output will be created. The number of partitions depends on the number of disadvantages. Q187) How do we go to the main control for a certain reduction? Any Reducer can control the keys (through which posts) by activating the custom partition. Q188) What is the use of the coordinator? 
It can be specified by Job.setCombinerClass (ClassName) to make local integration with a custom component or class, and intermediate outputs, which helps reduce the size of the transfers from the Mapper to Reducer. Q189) How many maps are there in specific jobs? The number of maps is usually driven by total inputs, that is, the total volume of input files. Usually it has a node for 10-100 maps. The work system takes some time, so it is best to take at least a minute to run maps. If you expect 10TB input data and have a 128MB volume, you will end up with 82,000 maps, which you can control the volume of the mapreduce.job.maps parameter (this only provides a note structure). In the end, the number of tasks are limited by the number of divisions returned by the InputFormat.getSplits () over time (you can overwrite). Q190) What is the use of defect? Reducer reduces the set of intermediate values, which shares one key (usually smaller) values. The number of job cuts is set by Job.setNumReduceTasks (int). Q191) Explain Core modalities of deficiency? The Reducer API is similar to a Mapper, a run () method, which modes the structure of the work and the reconfiguration of the reconfiguration framework from reuse. Run () method once (), minimize each key associated with the task to reduce (once), and finally clean up the system. Each of these methods can be accessed using the context structure of the task using Context.getConfiguration (). As for the mapper type, these methods may be violated with any or all custom processes. If none of these methods are violated, the default reduction action is a symbolic function; Values go further without processing. Reducer heart is its reduction (method). This is called a one-time one; The second argument is Iterable, which provides all the key related values. Q192) What are the early stages of deficiency? Shake, sort and lower. 193) Shuffle’s explanation? Reducer is a sorted output of input mappers. At this point, the configuration receives a partition associated with the output of all the mappers via HTTP. 194) Explain the Reducer’s Line Stage? Structured groups at this point are Reducer entries with the keys (because different movers may have the same key output). Mixed and sequence phases occur simultaneously; They are combined when drawing graphic outputs (which are similar to the one-sequence). 195) Explain Criticism? At this point the reduction (MapOutKeyType, Iterable, environment) method is grouped into groups for each group. Reduction work output is typically written to FileSystem via Context.write (ReduceOutKeyType, ReduceOutValType). Applications can use application progress status, set up application level status messages, counters can update, or mark their existence. Reducer output is not sorted. Big Data Questions and Answers Pdf Download Read the full article
udemy-gift-coupon-blog · 6 years ago
Hadoop Spark Hive Big Data Admin Class Bootcamp Course NYC
Introduction: Hadoop Big Data course introduction; top Ubuntu commands; understanding NameNode, DataNode, YARN and the Hadoop infrastructure.
Hadoop Install: Hadoop installation and HDFS commands; Java-based MapReduce on Hadoop 2.7 / 2.8.4; learning HDFS commands; setting up Java for MapReduce; intro to Cloudera Hadoop and studying for Cloudera certification.
SQL and NoSQL: SQL, Hive and Pig installation (the RDBMS world and the NoSQL world); more Hive and Sqoop (Sqoop and Hive on Cloudera, JDBC drivers); Pig; intro to NoSQL, MongoDB and HBase installation; understanding different databases.
Hive: Hive partitions and bucketing; Hive external and internal tables.
Spark, Scala, Python: Spark installation and commands; Spark with Scala; Scala sheets; Hadoop Streaming Python map reduce; PySpark (Python basics) and RDDs; running spark-shell and importing data from CSV files; running RDDs with PySpark.
Mid-term projects: pull data from a CSV online and move it to Hive using a Hive import; pull data from spark-shell and run MapReduce on the Fox News front page; create data in MySQL and move it to HDFS using Sqoop; using Jupyter (Anaconda) and a SparkContext, run a count on a file containing the Fox News front page; save raw data delimited by comma, space, tab and pipe and move it into a SparkContext and spark-shell.
Streaming: broadcasting data (streams of data); Kafka message broadcasting.
Who this course is for: career changers who would like to move to Big Data Hadoop; learners who want to learn Hadoop installation.
https://www.couponudemy.com/blog/hadoop-spark-hive-big-data-admin-class-bootcamp-course-nyc/
milindjagre · 8 years ago
Post 38 | HDPCD | Load data into Hive table from a Local Directory
Hello, everyone. Thanks for returning for the next tutorial in the HDPCD certification series. In the last tutorial, we saw how to specify the delimiter of a Hive table. In this tutorial, we are going to see how to load data from a local directory into a Hive table.
Let us begin then.
Apache Hive: Loading data from a local file
The above infographic shows the step-by-step process to…
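The post is truncated here, but the workflow it illustrates typically comes down to two statements. A minimal sketch, assuming a hypothetical delimited file /home/hdpcd/input.txt and an existing table temp_table whose delimiter matches the file:

hive> LOAD DATA LOCAL INPATH '/home/hdpcd/input.txt' INTO TABLE temp_table;
hive> SELECT * FROM temp_table LIMIT 10;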
knolspeak · 8 years ago
Understanding External Tables in Hive
Usually, when you create a table in Hive from raw data in HDFS, Hive moves the data to a different location – "/user/hive/warehouse". If you create a simple (managed) table, its data will live inside this warehouse directory. The following Hive command creates a table with its data location at "/user/hive/warehouse/empl":
hive> CREATE TABLE EMPL(ID int, NAME string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
OK
Time…
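By contrast, an external table keeps the data wherever you point it, so dropping the table removes only the metadata. A hedged sketch, assuming the raw files already sit in a hypothetical HDFS directory /data/empl:

hive> CREATE EXTERNAL TABLE EMPL_EXT(ID int, NAME string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/data/empl';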
theresawelchy · 6 years ago
Convert CSVs to ORC Faster
Every analytical database I've used converts imported data into a form that is quicker to read. Often this means storing data in column form instead of row form. The taxi trip dataset I benchmark with is around 100 GB when gzip-compressed compressed in row form but the five columns that are queried can be stored in around 3.5 GB of space in columnar form when compressed using a mixture of dictionary encoding, run-length encoding and Snappy compression.
The process of converting rows into columns is time consuming and compute-intensive. Most systems can take the better part of an hour to finish this conversion, even when using a cluster of machines. I once believed that compression was causing most of the overhead but in researching this post I found Spark 2.4.0 had a ~7% difference in conversion time between using Snappy, zlib, lzo and not using any compression at all.
I've historically used Hive for this conversion process but there are ways to mix and match different Hadoop tools, including Spark and Presto, to get the same outcome and often with very different processing times.
Spark, Hive and Presto are all very different code bases. Spark is made up of 500K lines of Scala, 110K lines of Java and 40K lines of Python. Presto is made up of 600K lines of Java. Hive is made up of over one million lines of Java and 100K lines of C++ code. Any libraries they share are outweighed by the unique approaches they've taken in the architecture surrounding their SQL parsers, query planners, optimizers, code generators and execution engines when it comes to tabular form conversion.
I recently benchmarked Spark 2.4.0 and Presto 0.214 and found that Spark out-performed Presto when it comes to ORC-based queries. In this post I'm going to examine the ORC writing performance of these two engines plus Hive and see which can convert CSV files into ORC files the fastest.
AWS EMR Up & Running
The following will launch an EMR cluster with a single master node and 20 core nodes. The cluster runs version 2.8.5 of Amazon's Hadoop distribution, Hive 2.3.4, Presto 0.214 and Spark 2.4.0. All nodes are spot instances to keep the cost down.
$ aws emr create-cluster --applications \ Name=Hadoop \ Name=Hive \ Name=Presto \ Name=Spark \ --auto-scaling-role EMR_AutoScaling_DefaultRole \ --ebs-root-volume-size 10 \ --ec2-attributes '{ "KeyName": "emr", "InstanceProfile": "EMR_EC2_DefaultRole", "EmrManagedSlaveSecurityGroup": "sg-...", "EmrManagedMasterSecurityGroup": "sg-..." }' \ --enable-debugging \ --instance-groups '[ { "Name": "Core - 2", "InstanceCount": 20, "BidPrice": "OnDemandPrice", "InstanceType": "m3.xlarge", "InstanceGroupType": "CORE" }, { "InstanceCount": 1, "Name": "Master - 1", "InstanceGroupType": "MASTER", "EbsConfiguration": { "EbsOptimized": false, "EbsBlockDeviceConfigs": [ { "VolumeSpecification": { "VolumeType": "standard", "SizeInGB": 400 }, "VolumesPerInstance": 1 } ] }, "BidPrice": "OnDemandPrice", "InstanceType": "m3.xlarge" } ]' \ --log-uri 's3n://aws-logs-...-eu-west-1/elasticmapreduce/' \ --name 'My cluster' \ --region eu-west-1 \ --release-label emr-5.20.0 \ --scale-down-behavior TERMINATE_AT_TASK_COMPLETION \ --service-role EMR_DefaultRole \ --termination-protected
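Before SSHing in, it can help to confirm the cluster has finished provisioning; a hedged check using the same AWS CLI setup (not part of the original write-up):

$ aws emr list-clusters --active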
With the cluster provisioned and bootstrapped I was able to SSH in.
$ ssh -i ~/.ssh/emr.pem \ [email protected]
__| __|_ ) _| ( / Amazon Linux AMI ___|\___|___| https://aws.amazon.com/amazon-linux-ami/2018.03-release-notes/ 3 package(s) needed for security, out of 6 available Run "sudo yum update" to apply all updates. EEEEEEEEEEEEEEEEEEEE MMMMMMMM MMMMMMMM RRRRRRRRRRRRRRR E::::::::::::::::::E M:::::::M M:::::::M R::::::::::::::R EE:::::EEEEEEEEE:::E M::::::::M M::::::::M R:::::RRRRRR:::::R E::::E EEEEE M:::::::::M M:::::::::M RR::::R R::::R E::::E M::::::M:::M M:::M::::::M R:::R R::::R E:::::EEEEEEEEEE M:::::M M:::M M:::M M:::::M R:::RRRRRR:::::R E::::::::::::::E M:::::M M:::M:::M M:::::M R:::::::::::RR E:::::EEEEEEEEEE M:::::M M:::::M M:::::M R:::RRRRRR::::R E::::E M:::::M M:::M M:::::M R:::R R::::R E::::E EEEEE M:::::M MMM M:::::M R:::R R::::R EE:::::EEEEEEEE::::E M:::::M M:::::M R:::R R::::R E::::::::::::::::::E M:::::M M:::::M RR::::R R::::R EEEEEEEEEEEEEEEEEEEE MMMMMMM MMMMMMM RRRRRRR RRRRRR
The CSV dataset I'll be using in this benchmark is a data dump I've produced of 1.1 billion taxi trips conducted in New York City over a six year period. The Billion Taxi Rides in Redshift blog post goes into detail on how I put this dataset together. They're stored on AWS S3 so I'll configure the aws CLI with my access and secret keys and retrieve them.
I'll then set the client's concurrent requests limit to 100 so the files download quicker than they would with stock settings.
$ aws configure set \ default.s3.max_concurrent_requests \ 100
I've requested an EBS storage volume on the master node so that I can download the dataset before loading it onto HDFS.
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        7.4G   80K  7.4G   1% /dev
tmpfs           7.4G     0  7.4G   0% /dev/shm
/dev/xvda1      9.8G  4.6G  5.1G  48% /
/dev/xvdb1      5.0G   33M  5.0G   1% /emr
/dev/xvdb2       33G  289M   33G   1% /mnt
/dev/xvdc        38G   33M   38G   1% /mnt1
/dev/xvdd       400G   33M  400G   1% /mnt2
I ran the following to pull the dataset off of S3.
$ mkdir -p /mnt2/csv/
$ aws s3 sync s3://<bucket>/csv/ /mnt2/csv/
I then ran the following to push the data onto HDFS.
$ hdfs dfs -mkdir /trips_csv/
$ hdfs dfs -copyFromLocal /mnt2/csv/*.csv.gz /trips_csv/
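As a quick sanity check (this step isn't shown in the original run), the size of what landed on HDFS can be confirmed with the same du command used later in this post:

$ hdfs dfs -du -s -h /trips_csv/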
Converting CSVs to ORC using Hive
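The DDL and conversion statements in this section are issued from an interactive Hive session; the launch command isn't shown here, but on EMR it is typically just the hive CLI:

$ hive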
I'll use Hive to create a schema catalogue for the various datasets that will be produced in this benchmark. The following will create the table for the CSV-formatted dataset.
CREATE TABLE trips_csv ( trip_id INT, vendor_id VARCHAR(3), pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag VARCHAR(1), rate_code_id SMALLINT, pickup_longitude DECIMAL(18,14), pickup_latitude DECIMAL(18,14), dropoff_longitude DECIMAL(18,14), dropoff_latitude DECIMAL(18,14), passenger_count SMALLINT, trip_distance DECIMAL(6,3), fare_amount DECIMAL(6,2), extra DECIMAL(6,2), mta_tax DECIMAL(6,2), tip_amount DECIMAL(6,2), tolls_amount DECIMAL(6,2), ehail_fee DECIMAL(6,2), improvement_surcharge DECIMAL(6,2), total_amount DECIMAL(6,2), payment_type VARCHAR(3), trip_type SMALLINT, pickup VARCHAR(50), dropoff VARCHAR(50), cab_type VARCHAR(6), precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel VARCHAR(10), pickup_borocode SMALLINT, pickup_boroname VARCHAR(13), pickup_ct2010 VARCHAR(6), pickup_boroct2010 VARCHAR(7), pickup_cdeligibil VARCHAR(1), pickup_ntacode VARCHAR(4), pickup_ntaname VARCHAR(56), pickup_puma VARCHAR(4), dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel VARCHAR(10), dropoff_borocode SMALLINT, dropoff_boroname VARCHAR(13), dropoff_ct2010 VARCHAR(6), dropoff_boroct2010 VARCHAR(7), dropoff_cdeligibil VARCHAR(1), dropoff_ntacode VARCHAR(4), dropoff_ntaname VARCHAR(56), dropoff_puma VARCHAR(4) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/trips_csv/';
I'll create a table to store the ORC-formatted, Snappy-compressed dataset that Hive will produce.
CREATE TABLE trips_orc_snappy_hive ( trip_id INT, vendor_id STRING, pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag STRING, rate_code_id SMALLINT, pickup_longitude DOUBLE, pickup_latitude DOUBLE, dropoff_longitude DOUBLE, dropoff_latitude DOUBLE, passenger_count SMALLINT, trip_distance DOUBLE, fare_amount DOUBLE, extra DOUBLE, mta_tax DOUBLE, tip_amount DOUBLE, tolls_amount DOUBLE, ehail_fee DOUBLE, improvement_surcharge DOUBLE, total_amount DOUBLE, payment_type STRING, trip_type SMALLINT, pickup STRING, dropoff STRING, cab_type STRING, precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel STRING, pickup_borocode SMALLINT, pickup_boroname STRING, pickup_ct2010 STRING, pickup_boroct2010 STRING, pickup_cdeligibil STRING, pickup_ntacode STRING, pickup_ntaname STRING, pickup_puma STRING, dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel STRING, dropoff_borocode SMALLINT, dropoff_boroname STRING, dropoff_ct2010 STRING, dropoff_boroct2010 STRING, dropoff_cdeligibil STRING, dropoff_ntacode STRING, dropoff_ntaname STRING, dropoff_puma STRING ) STORED AS orc LOCATION '/trips_orc_snappy_hive/' TBLPROPERTIES ("orc.compress"="snappy");
Below I'll convert the CSV dataset into ORC using Hive alone. The following took 55 mins and 5 seconds.
INSERT INTO trips_orc_snappy_hive SELECT * FROM trips_csv;
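The output size for the Hive run isn't reported above; it could be checked the same way as the Presto outputs later in the post:

$ hdfs dfs -du -s -h /trips_orc_snappy_hive/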
Converting CSVs to ORC using Spark
I'll create a table for Spark to store its ORC-formatted, Snappy-compressed data.
CREATE TABLE trips_orc_snappy_spark ( trip_id INT, vendor_id STRING, pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag STRING, rate_code_id SMALLINT, pickup_longitude DOUBLE, pickup_latitude DOUBLE, dropoff_longitude DOUBLE, dropoff_latitude DOUBLE, passenger_count SMALLINT, trip_distance DOUBLE, fare_amount DOUBLE, extra DOUBLE, mta_tax DOUBLE, tip_amount DOUBLE, tolls_amount DOUBLE, ehail_fee DOUBLE, improvement_surcharge DOUBLE, total_amount DOUBLE, payment_type STRING, trip_type SMALLINT, pickup STRING, dropoff STRING, cab_type STRING, precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel STRING, pickup_borocode SMALLINT, pickup_boroname STRING, pickup_ct2010 STRING, pickup_boroct2010 STRING, pickup_cdeligibil STRING, pickup_ntacode STRING, pickup_ntaname STRING, pickup_puma STRING, dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel STRING, dropoff_borocode SMALLINT, dropoff_boroname STRING, dropoff_ct2010 STRING, dropoff_boroct2010 STRING, dropoff_cdeligibil STRING, dropoff_ntacode STRING, dropoff_ntaname STRING, dropoff_puma STRING ) STORED AS orc LOCATION '/trips_orc_snappy_spark/' TBLPROPERTIES ("orc.compress"="snappy");
I'll then launch Spark and convert the CSV data into ORC using its engine.
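The exact launch command isn't preserved here; on EMR a SQL session that shares the Hive metastore can typically be started with the spark-sql shell, a hedged sketch being:

$ spark-sql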
The following took 1 hour, 43 mins and 7 seconds.
INSERT INTO TABLE trips_orc_snappy_spark SELECT * FROM trips_csv;
To show that Parquet isn't the more optimised format for this conversion, I'll also create a table storing Snappy-compressed data in Parquet format and run the same CSV conversion against it.
CREATE TABLE trips_parquet_snappy_spark ( trip_id INT, vendor_id STRING, pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag STRING, rate_code_id SMALLINT, pickup_longitude DOUBLE, pickup_latitude DOUBLE, dropoff_longitude DOUBLE, dropoff_latitude DOUBLE, passenger_count SMALLINT, trip_distance DOUBLE, fare_amount DOUBLE, extra DOUBLE, mta_tax DOUBLE, tip_amount DOUBLE, tolls_amount DOUBLE, ehail_fee DOUBLE, improvement_surcharge DOUBLE, total_amount DOUBLE, payment_type STRING, trip_type SMALLINT, pickup STRING, dropoff STRING, cab_type STRING, precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel STRING, pickup_borocode SMALLINT, pickup_boroname STRING, pickup_ct2010 STRING, pickup_boroct2010 STRING, pickup_cdeligibil STRING, pickup_ntacode STRING, pickup_ntaname STRING, pickup_puma STRING, dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel STRING, dropoff_borocode SMALLINT, dropoff_boroname STRING, dropoff_ct2010 STRING, dropoff_boroct2010 STRING, dropoff_cdeligibil STRING, dropoff_ntacode STRING, dropoff_ntaname STRING, dropoff_puma STRING ) STORED AS parquet LOCATION '/trips_parquet_snappy_spark/' TBLPROPERTIES ("parquet.compress"="snappy");
The following took 1 hour, 56 minutes and 29 seconds.
INSERT INTO TABLE trips_parquet_snappy_spark SELECT * FROM trips_csv;
The HDFS cluster only has 1.35 TB of capacity and 3x replication is being used so I'll clear out these datasets before continuing.
$ hdfs dfs -rm -r -skipTrash /trips_orc_snappy_hive/
$ hdfs dfs -rm -r -skipTrash /trips_orc_snappy_spark/
$ hdfs dfs -rm -r -skipTrash /trips_parquet_snappy_spark/
Converting CSVs to ORC using Presto
Below I'll create a table for Presto to store a Snappy-compressed ORC dataset.
CREATE TABLE trips_orc_snappy_presto ( trip_id INT, vendor_id STRING, pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag STRING, rate_code_id SMALLINT, pickup_longitude DOUBLE, pickup_latitude DOUBLE, dropoff_longitude DOUBLE, dropoff_latitude DOUBLE, passenger_count SMALLINT, trip_distance DOUBLE, fare_amount DOUBLE, extra DOUBLE, mta_tax DOUBLE, tip_amount DOUBLE, tolls_amount DOUBLE, ehail_fee DOUBLE, improvement_surcharge DOUBLE, total_amount DOUBLE, payment_type STRING, trip_type SMALLINT, pickup STRING, dropoff STRING, cab_type STRING, precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel STRING, pickup_borocode SMALLINT, pickup_boroname STRING, pickup_ct2010 STRING, pickup_boroct2010 STRING, pickup_cdeligibil STRING, pickup_ntacode STRING, pickup_ntaname STRING, pickup_puma STRING, dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel STRING, dropoff_borocode SMALLINT, dropoff_boroname STRING, dropoff_ct2010 STRING, dropoff_boroct2010 STRING, dropoff_cdeligibil STRING, dropoff_ntacode STRING, dropoff_ntaname STRING, dropoff_puma STRING ) STORED AS orc LOCATION '/trips_orc_snappy_presto/' TBLPROPERTIES ("orc.compress"="snappy");
I'll run the CSV to ORC conversion in Presto's CLI.
$ presto-cli \ --schema default \ --catalog hive
The following took 37 mins and 35 seconds.
INSERT INTO trips_orc_snappy_presto SELECT * FROM trips_csv;
The above generated a dataset 118.6 GB in size (excluding HDFS replication).
$ hdfs dfs -du -s -h /trips_orc_snappy_presto/
Facebook have been working on implementing ZStandard, a lossless data compression algorithm, and integrating it into various tools in the Hadoop ecosystem. Spark 2.4.0 on EMR with stock settings isn't able to read this compression scheme but Presto can both read and write it.
I'll create a ZStandard-compressed table for Presto below using Hive.
CREATE TABLE trips_orc_zstd_presto ( trip_id INT, vendor_id STRING, pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag STRING, rate_code_id SMALLINT, pickup_longitude DOUBLE, pickup_latitude DOUBLE, dropoff_longitude DOUBLE, dropoff_latitude DOUBLE, passenger_count SMALLINT, trip_distance DOUBLE, fare_amount DOUBLE, extra DOUBLE, mta_tax DOUBLE, tip_amount DOUBLE, tolls_amount DOUBLE, ehail_fee DOUBLE, improvement_surcharge DOUBLE, total_amount DOUBLE, payment_type STRING, trip_type SMALLINT, pickup STRING, dropoff STRING, cab_type STRING, precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel STRING, pickup_borocode SMALLINT, pickup_boroname STRING, pickup_ct2010 STRING, pickup_boroct2010 STRING, pickup_cdeligibil STRING, pickup_ntacode STRING, pickup_ntaname STRING, pickup_puma STRING, dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel STRING, dropoff_borocode SMALLINT, dropoff_boroname STRING, dropoff_ct2010 STRING, dropoff_boroct2010 STRING, dropoff_cdeligibil STRING, dropoff_ntacode STRING, dropoff_ntaname STRING, dropoff_puma STRING ) STORED AS orc LOCATION '/trips_orc_zstd_presto/' TBLPROPERTIES ("orc.compress"="zstd");
I'll then use Presto to convert the CSV data into ZStandard-compressed ORC files.
$ presto-cli \ --schema default \ --catalog hive
The following took 37 mins and 44 seconds.
INSERT INTO trips_orc_zstd_presto SELECT * FROM trips_csv;
The above generated 79 GB of data (excluding HDFS replication).
$ hdfs dfs -du -s -h /trips_orc_zstd_presto/
Presto ORC Benchmark: Snappy versus ZStandard
ZStandard did a good job of saving space on HDFS while still converting the data in almost exactly the same amount of time as Snappy. Below I'll look at the impact of the two compression schemes on query performance.
The following were the fastest times I saw after running each query multiple times.
$ presto-cli \ --schema default \ --catalog hive
These four queries were run on the Snappy-compressed, ORC-formatted dataset.
The following completed in 5.48 seconds.
SELECT cab_type, count(*) FROM trips_orc_snappy_presto GROUP BY cab_type;
The following completed in 7.85 seconds.
SELECT passenger_count, avg(total_amount) FROM trips_orc_snappy_presto GROUP BY passenger_count;
The following completed in 8.55 seconds.
SELECT passenger_count, year(pickup_datetime), count(*) FROM trips_orc_snappy_presto GROUP BY passenger_count, year(pickup_datetime);
The following completed in 11.92 seconds.
SELECT passenger_count, year(pickup_datetime) trip_year, round(trip_distance), count(*) trips FROM trips_orc_snappy_presto GROUP BY passenger_count, year(pickup_datetime), round(trip_distance) ORDER BY trip_year, trips desc;
These four queries were run on the ZStandard-compressed, ORC-formatted dataset.
The following completed in 4.21 seconds.
SELECT cab_type, count(*) FROM trips_orc_zstd_presto GROUP BY cab_type;
The following completed in 5.97 seconds.
SELECT passenger_count, avg(total_amount) FROM trips_orc_zstd_presto GROUP BY passenger_count;
The following completed in 7.3 seconds.
SELECT passenger_count, year(pickup_datetime), count(*) FROM trips_orc_zstd_presto GROUP BY passenger_count, year(pickup_datetime);
The following completed in 11.68 seconds.
SELECT passenger_count, year(pickup_datetime) trip_year, round(trip_distance), count(*) trips FROM trips_orc_zstd_presto GROUP BY passenger_count, year(pickup_datetime), round(trip_distance) ORDER BY trip_year, trips desc;
Thoughts on the Results
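Before getting into the analysis, here is a recap of the conversion timings and output sizes reported above:

Engine   Target format      Conversion time        Output size (pre-replication)
Hive     ORC, Snappy        55 mins 5 secs         not reported above
Spark    ORC, Snappy        1 hr 43 mins 7 secs    not reported above
Spark    Parquet, Snappy    1 hr 56 mins 29 secs   not reported above
Presto   ORC, Snappy        37 mins 35 secs        118.6 GB
Presto   ORC, ZStandard     37 mins 44 secs        79 GB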
Hive being almost twice as fast as Spark at converting CSVs to ORC files took me by surprise as Spark has a younger code base. That being said, Presto being ~1.5x faster than Hive was another shocker. I hope that by publishing this post the community becomes more aware of these performance differences and that improvements find their way into future releases of all three packages.
Hive being a central store of catalogue data for so many tools in the Hadoop ecosystem cannot be ignored and I expect I'll continue to use it for a long time to come. Where I'm put in an awkward position is in recommending that converting CSVs into ORC files be done in Presto for significant time savings, while querying that data afterwards is quicker in Spark. There really is no one tool that rules them all.
ZStandard offers a lot in terms of disk space savings while not impacting query performance significantly. To take the raw data and make it 1.5x smaller while not impacting conversion time is fantastic. When Hadoop 3's Erasure Coding is mixed into the equation it'll bring space requirements down by 2/3rds on HDFS. Being able to buy a petabyte of storage capacity and storing three times as much data on it is huge.
I feel that my research into converting CSVs into ORC files is just getting started. Beyond fundamental architectural differences between these three pieces of software I suspect stock settings on EMR could be improved to provide faster conversion times. This isn't something I can prove at this point in time and this will be a subject of further research.
Spark did receive support for ZStandard in 2.3.0 so I suspect it's nothing more than a configuration change to add support into Spark on EMR. I'm going to keep an eye on future releases of AWS EMR to see if this is added to the stock settings.
Thank you for taking the time to read this post. I offer consulting, architecture and hands-on development services to clients in North America & Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.
0 notes
easyinplay · 7 years ago
Text
A Look at Zipper Tables in the Data Warehouse (Principles, Design, and a Hive Implementation)
0x00 Preface
This post discusses zipper tables in the data warehouse: their principles, their design, and how to implement them in a big data setting.
The post is made up of the following parts:
First, what a zipper table is and what it is used for.
Then a few small usage scenarios to explain zipper tables further, and how they differ from the commonly used daily-snapshot (slice) tables.
Next, a concrete application scenario in which a zipper table is designed and implemented, followed by examples showing how to use the table (given how widely Hive is now used, the design is illustrated in a Hive setting).
Finally, an analysis of the pros and cons of zipper tables, plus some follow-up notes on earlier points, such as the difference between zipper tables and transaction-log tables.
0x01 What Is a Zipper Table
A zipper table is defined by how a table stores data in data warehouse design. As the name suggests, the "zipper" records history: it captures every change an entity goes through, from its beginning all the way to its current state.
Let's start with an example. The table below is a zipper table; it stores basic user information along with the lifecycle of each record. From this table we can get both the latest data for the current day and all of the earlier history.
[image: sample zipper table]
We won't go through this table in detail just yet; later sections explain how to design, build, and use it.
When to Use a Zipper Table
When designing data models for a data warehouse, you often run into tables with the following characteristics:
The table is large. A user table with roughly 1 billion rows and 50 columns, for example, will exceed 100 GB even with ORC compression, and takes up even more on HDFS once 2x or 3x replication is applied.
Some of its columns get updated, such as a user's contact details, a product's description, or an order's status.
You need to look at historical snapshots for a point in time or a time range, for example the state of a given order at some point in the past.
The proportion and frequency of changes is small. With 1 billion users in total, only around 2 million rows are added or changed each day, a tiny fraction of the table.
So how should such a table be designed? There are a few options:
Option 1: keep only the latest copy each day, for example by using Sqoop to pull a fresh full extract into Hive every day.
Option 2: keep a full snapshot (slice) of the data for every day.
Option 3: use a zipper table.
Why Use a Zipper Table
Let's analyse the three options above one by one.
Option 1
This one needs little explanation. It is simple to implement: each day you drop the previous day's data and re-extract the latest copy.
The advantages are obvious: it saves space, and everyday use is convenient since you don't have to add a time-partition filter when querying the table.
The drawback is equally obvious: there is no history, so digging into the past has to be done some other way, for example by pulling it out of a transaction-log table.
Option 2
A full daily snapshot is a fairly safe option, and the history is preserved.
The drawback is that the storage footprint is enormous. If you keep a full copy of this table every day, each copy stores a huge amount of unchanged data, which is an extreme waste of storage; I have felt this pain first-hand...
Of course trade-offs can be made, say keeping only the last month of data, but requirements are relentless and a dataset's lifecycle is not entirely ours to control.
The Zipper Table
A zipper table covers these requirements reasonably well in practice.
First, it makes a trade-off on space: while it isn't as compact as Option 1, its daily increment may be only one thousandth or even one ten-thousandth the size of Option 2's daily snapshot.
At the same time it satisfies everything Option 2 can: you can fetch the latest data, and by adding filter conditions you can also fetch the historical data.
So it is well worth using a zipper table.
0x02 Designing and Implementing a Zipper Table
How to Design a Zipper Table
Let's walk through a detailed example of a zipper table.
First, look at how a user table changes in a MySQL relational database.
On 2017-01-01 the table contains:
[image: user table as of 2017-01-01]
On 2017-01-02 the table contains the following; users 002 and 004 had their details modified and 005 is a new user:
[image: user table as of 2017-01-02]
On 2017-01-03 the table contains the following; users 004 and 005 had their details modified and 006 is a new user:
[image: user table as of 2017-01-03]
If this table is kept in the data warehouse as a historical zipper table, it looks like the table below; this shows the state as of the latest day (2017-01-03):
[image: zipper table as of 2017-01-03]
Notes
t_start_date is the start of a record's lifecycle and t_end_date is the end of its lifecycle.
t_end_date = '9999-12-31' means the record is currently active.
To query all currently active records: select * from user where t_end_date = '9999-12-31'.
To query the historical snapshot as of 2017-01-02: select * from user where t_start_date <= '2017-01-02' and t_end_date >= '2017-01-02'. (Take the time to understand this; it is one of the most important aspects of the zipper table.) Both access patterns are written out as full statements below.
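Written out as statements, the two access patterns above look like this:

-- All currently active records
SELECT * FROM user WHERE t_end_date = '9999-12-31';

-- Historical snapshot as of 2017-01-02
SELECT * FROM user
WHERE t_start_date <= '2017-01-02'
  AND t_end_date   >= '2017-01-02';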
Implementing a Zipper Table in Hive
In today's big data landscape, most companies build their data warehouse architecture around HDFS and Hive. In current versions of HDFS, files cannot be modified in place, which means Hive tables can only be appended to or dropped, not updated. With that constraint in mind, let's implement the zipper table.
Sticking with the user table above, we want to build a user zipper table. Before implementing it, we need to work out which data sources are available:
A full user table in the ODS layer; at minimum we need it to initialise the zipper table.
A daily user-update table.
We also need to settle on the zipper table's time granularity. For example, the zipper table may keep only one state per day: if there are three state changes within a day, only the last one is kept. A day-grained table like this already solves most problems.
As for how to obtain the daily user-update table: in my experience there are three ways to get, or indirectly derive, the daily user delta. Since this is important, here they are in detail:
Listen to changes in the MySQL data, for example with Canal, then merge each day's changes to arrive at the final state.
If we receive a full snapshot every day, take the difference between two consecutive days' snapshots as the daily update table; in that case we can concat all the fields and take an md5 of the result to compare rows (a sketch of this is shown right after this list).
A transaction-log table: use the daily change log directly.
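Here is a minimal sketch of the second approach, assuming the ods.user snapshot table defined further down and that the md5() and concat_ws() functions are available (recent Hive releases have both); the dt values are illustrative:

-- Rows in the 2017-01-02 snapshot whose field hash doesn't appear in the
-- 2017-01-01 snapshot, i.e. rows that were added or changed that day.
SELECT t.user_num, t.mobile, t.reg_date
FROM (
  SELECT user_num, mobile, reg_date,
         md5(concat_ws('|', user_num, mobile, reg_date)) AS row_hash
  FROM ods.user WHERE dt = '2017-01-02'
) t
LEFT JOIN (
  SELECT md5(concat_ws('|', user_num, mobile, reg_date)) AS row_hash
  FROM ods.user WHERE dt = '2017-01-01'
) y ON t.row_hash = y.row_hash
WHERE y.row_hash IS NULL;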
The ODS-layer user table
Here is the structure of the ODS-layer user-profile snapshot table:
CREATE EXTERNAL TABLE ods.user (
  user_num STRING COMMENT 'user id',
  mobile   STRING COMMENT 'mobile number',
  reg_date STRING COMMENT 'registration date'
)
COMMENT 'user profile table'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION '/ods/user';
The ODS-layer user_update table
We also need the daily user-update table. How to obtain it was discussed above; for now assume it already exists.
CREATE EXTERNAL TABLE ods.user_update (
  user_num STRING COMMENT 'user id',
  mobile   STRING COMMENT 'mobile number',
  reg_date STRING COMMENT 'registration date'
)
COMMENT 'daily user update table'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION '/ods/user_update';
The zipper table
Now create the zipper table itself:
CREATE EXTERNAL TABLE dws.user_his (
  user_num     STRING COMMENT 'user id',
  mobile       STRING COMMENT 'mobile number',
  reg_date     STRING COMMENT 'registration date',
  t_start_date STRING COMMENT 'lifecycle start date',
  t_end_date   STRING COMMENT 'lifecycle end date'
)
COMMENT 'user profile zipper table'
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION '/dws/user_his';
The SQL to implement it
I'll skip the initialisation SQL; it amounts to loading a single day of the ODS-layer user table. Let's write the daily update statement instead.
Assume the table has already been initialised with 2017-01-01 and we now need to merge in the data for 2017-01-02; the SQL below does that.
In practice the two dates would simply be turned into variables (a hedged example of doing so follows the statement).
INSERT OVERWRITE TABLE dws.user_his
SELECT *
FROM (
    SELECT A.user_num,
           A.mobile,
           A.reg_date,
           A.t_start_date,
           CASE
                WHEN A.t_end_date = '9999-12-31' AND B.user_num IS NOT NULL THEN '2017-01-01'
                ELSE A.t_end_date
           END AS t_end_date
    FROM dws.user_his AS A
    LEFT JOIN ods.user_update AS B
      ON A.user_num = B.user_num
    UNION
    SELECT C.user_num,
           C.mobile,
           C.reg_date,
           '2017-01-02' AS t_start_date,
           '9999-12-31' AS t_end_date
    FROM ods.user_update AS C
) AS T;
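One hedged way to turn those two dates into variables is Hive's --hivevar substitution; the variable names and script file name below are illustrative:

$ hive --hivevar pre_date=2017-01-01 \
       --hivevar cur_date=2017-01-02 \
       -f user_his_merge.sql

Inside the script, the literals '2017-01-01' and '2017-01-02' are then replaced with ${hivevar:pre_date} and ${hivevar:cur_date}.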
0x03 Additional Notes
We have now gone over the principles and design of zipper tables and implemented one in a Hive environment. Below are a few small additions.
Zipper tables versus transaction-log tables
A transaction-log table stores a user's change records: within one day's data it holds every single modification that user made, whereas the zipper table keeps only one record for that day.
This is a granularity question to keep in mind when designing a zipper table. The granularity can of course be made finer, but daily is usually enough.
Query performance
Zipper tables do run into query-performance problems. If we keep five years of zipper data, the table will inevitably be large and queries against it will slow down. I see two ways to tackle this:
In some query engines, index start_date and end_date; this improves performance considerably.
Expose only part of the history: store the full zipper data in one table, then expose a second table (or view) that only serves the most recent three months of data (a sketch of this idea follows).
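A minimal sketch of that second idea, using the dws.user_his table from above; the view name and the 90-day window are illustrative assumptions:

-- Expose only currently active records plus recent history through a view.
CREATE VIEW dws.user_his_recent AS
SELECT *
FROM dws.user_his
WHERE t_end_date = '9999-12-31'
   OR t_end_date >= CAST(date_sub(current_date, 90) AS STRING);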
0 notes
asimjalis-blog1 · 8 years ago
Text
Hive and Data Conversion
How can I convert HDFS files from text to SequenceFile and other formats?
While Hive is primarily a tool for running SQL queries against data in HDFS, it can also be used as a tool to convert data from one format to another. Hive is able to do this because it has a rich collection of SerDes (serializer-deserializers) for different formats.
Here is how to convert a text file to SequenceFile format using Hive.
Create a file called sales_data with the following contents:
010,CA,Alice,100
011,NY,Bob,200
013,NY,Bob,300
013,WA,Bob,200
Start the Hive shell and type the following commands into it.
-- Create table with source format.
CREATE TABLE sales_as_text (
  id STRING, state STRING, name STRING, amount DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

-- Create table with target format.
CREATE TABLE sales_as_seq (
  id STRING, state STRING, name STRING, amount DOUBLE)
STORED AS SEQUENCEFILE;

-- Load sales data.
LOAD DATA LOCAL INPATH 'sales_data' INTO TABLE sales_as_text;

-- Optionally enable compression.
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;

-- Use INSERT SELECT to convert formats.
INSERT OVERWRITE TABLE sales_as_seq
SELECT * FROM sales_as_text;
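A couple of optional checks, not part of the original walkthrough, to confirm the conversion worked:

-- Row count should match the source table (4 rows).
SELECT COUNT(*) FROM sales_as_seq;

-- The InputFormat reported here should be SequenceFileInputFormat.
DESCRIBE FORMATTED sales_as_seq;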
0 notes
siva3155 · 6 years ago
Text
300+ TOP BIG DATA Interview Questions and Answers
BIG Data Interview Questions for freshers experienced :-
Q30) How to access HDFS below is command: hdfs fs or hdfs dfs Q31) How to create directory in HDFS below is command: hdfs fs -mkdir Q32) How to keep files in HDFS below is command: hdfs fs -put or hdfs fs -copyfromLocal Q33) How to copy file from HDFS to local below is command: hdfs fs -copyToLocal Q34) How to Delete directory from HDFS below is command: hdfs fs -rm Q35) How to Delete file from HDFS below is command: hdfs fs -rm Become an Big Data Hadoop Certified Expert in 25Hours Q36) How to Delete directory and files recursively from HDFS below is command: hdfs fs -rm -r Q37) How to read file in HDFS below is command: hdfs fs -cat Managed/internal table Here once the table gets deleted both meta data and actual data is deleted –>external table Here once the table gets deleted only the mata data gets deleted but not the actual data. Q63) How to managed create a table in hive? hive>create table student(sname string, sid int) row format delimited fileds terminated by ‘,’; //hands on hive>describe student; Q64) How to load data into table created in hive? hive>load data local inpath /home/training/simple.txt into table student; //hands on hive> select * from student; Q65) How to create/load data into exteranal tables? *without location hive>create external table student(sname string, sid int) row format delimited fileds terminated by ‘,’; hive>load data local inpath /home/training/simple.txt into table student; *With Location hive>create external table student(sname string, sid int) row format delimited fileds terminated by ‘,’ location /Besant_HDFS; Here no need of load command Became an Big Data Hadoop Expert with Certification in 25hours Q66) Write a command to write static partitioned table. hive>create table student(sname string, sid int) partitioned by(int year) row format delimited fileds terminated by ‘,’; Q67) How to load a file in static partition? hive>load data local inpath /home/training/simple2018.txt into table student partition(year=2018); Q68) Write a commands to write dynamic partitioned table. Answer: –> create a normal table hive>create table student(sname string, sid int) row format delimited fileds terminated by ‘,’; –>load data hive>load data local inpath /home/training/studnetall.txt into table student ; –>create a partitioned table hive>create table student_partition(sname string, sid int) partitioned by(int year) row format delimited fileds terminated by ‘,’; –>set partitions hive>set hive.exec.dynamic.partition.mode = nonstrict; –>insert data hive>insert into table student_partition select * from student; –>drop normal table hive>drop table student; Q69) What is pig? Answer:Pig is an abstraction over map reduce. It is a tool used to deal with huge amount of structured and semi structed data. Q70) What is atom in pig? its a small piece of data or a filed eg: ‘shilpa’ Q71) What is tuple? ordered set of filed (shilpa, 100) Q72) Bag in pig? un-ordered set of tuples eg.{(sh,1),(ww,ww)} Q73) What is relation? bag of tuples Q74) What is hbase? its a distributed column oriented database built on top of hadoop file system it is horizontally scalable Q75) Difference between hbase and rdbms RDMBS is schema based hbase is not RDMBS only structured data hbase structured and semi structured data. RDMBS involves transactions Hbase no transactions Q76) What is table in hbase? collection of rows Q77) What is row in hbase? collection of column families Q78) Column family in hbase? Answer:collection of columns Q79) What is column? 
Answer:collection of key value pair Q80) How to start hbase services? Answer: >hbase shell hbase>start -hbase.sh Q81) DDL commands used in hbase? Answer: create alter drop drop_all exists list enable is_enabled? disable is_disbled? Q82) DML commands? Answer: put get scan delete delete_all Q83) What services run after running hbase job? Answer: Name node data node secondary NN JT TT Hmaster HRegionServer HQuorumPeer Q84) How to create table in hbase? Answer:>create ’emp’, ‘cf1′,’cf2’ Q85) How to list elements Answer:>scan ’emp’ Q86) Scope operators used in hbase? Answer: MAX_FILESIZE READONLY MEMSTORE_FLUSHSIZE DEFERRED_LOG_FLUSH Q87) What is sqoop? sqoop is an interface/tool between RDBMS and HDFS to importa nd export data Q88) How many default mappers in sqoop? 4 Q89) What is map reduce? map reduce is a data processing technique for distributed computng base on java map stage reduce stage Q90) list few componets that are using big data Answer: facebook adobe yahoo twitter ebay Q91) Write a quert to import a file in sqoop $>sqoop-import –connect jdbc:mysql://localhost/Besant username hadoop password hadoop table emp target_dir sqp_dir fields_terminated_by ‘,’ m 1 Q92) What is context in map reduce? it is an object having the information about hadoop configuration Q93) How job is started in map reduce? To start a job we need to create a configuration object. configuration c = new configuration(); Job j = new Job(c,”wordcount calculation); Q94) How to load data in pig? A= load ‘/home/training/simple.txt’ using PigStorage ‘|’ as (sname : chararray, sid: int, address:chararray); Q95) What are the 2 modes used to run pig scripts? local mode pig -x local pig -x mapreduce Q96) How to show up details in pig ? dump command is used. grunt>dump A; Q97) How to fetch perticular columns in pig? B = foreach A generate sname, sid; Q100) How to restrict the number of lines to be printed in pig ? c=limit B 2; Get Big Data Hadoop Online Training Q101) Define Big Data Big Data is defined as a collection of large and complex of unstructured data sets from where insights are derived from the Data Analysis using open-source tools like Hadoop. Q102) Explain The Five Vs of Big Data The five Vs of Big Data are – Volume – Amount of data in the Petabytes and Exabytes Variety – Includes formats like an videos, audio sources, textual data, etc. Velocity – Everyday data growth which are includes conversations in forums,blogs,social media posts,etc. Veracity – Degree of accuracy of data are available Value – Deriving insights from collected data to the achieve business milestones and new heights Q103) How is Hadoop related to the Big Data ? Describe its components? Apache Hadoop is an open-source framework used for the storing, processing, and analyzing complex unstructured data sets for the deriving insights and actionable intelligence for businesses. The three main components of Hadoop are- MapReduce – A programming model which processes large datasets in the parallel HDFS – A Java-based distributed file system used for the data storage without prior organization YARN – A framework that manages resources and handles requests from the distributed applications Q104) Define HDFS and talk about their respective components? The Hadoop Distributed File System (HDFS) is the storage unit that’s responsible for the storing different types of the data blocks in the distributed environment. 
The two main components of HDFS are- NameNode – A master node that processes of metadata information for the data blocks contained in the HDFS DataNode – Nodes which act as slave nodes and a simply store the data, for use and then processing by the NameNode. Q105) Define YARN, and talk about their respective components? The Yet Another Resource Negotiator (YARN) is the processing component of the Apache Hadoop and is responsible for managing resources and providing an execution environment for said of processes. The two main components of YARN are- ResourceManager– Receives processing requests and allocates its parts to the respective Node Managers based on processing needs. Node Manager– Executes tasks on the every single Data Node Q106) Explain the term ‘Commodity Hardware? Commodity Hardware refers to hardware and components, collectively needed, to run the Apache Hadoop framework and related to the data management tools. Apache Hadoop requires 64-512 GB of the RAM to execute tasks, and any hardware that supports its minimum for the requirements is known as ‘Commodity Hardware. Q107) Define the Port Numbers for NameNode, Task Tracker and Job Tracker? Name Node – Port 50070 Task Tracker – Port 50060 Job Tracker – Port 50030 Q108) How does HDFS Index Data blocks? Explain. HDFS indexes data blocks based on the their respective sizes. The end of data block points to address of where the next chunk of data blocks get a stored. The DataNodes store the blocks of datawhile the NameNode manages these data blocks by using an in-memory image of all the files of said of data blocks. Clients receive for the information related to data blocked from the NameNode. 109. What are Edge Nodes in Hadoop? Edge nodes are gateway nodes in the Hadoop which act as the interface between the Hadoop cluster and external network.They run client applications and cluster administration tools in the Hadoop and are used as staging areas for the data transfers to the Hadoop cluster. Enterprise-class storage capabilities (like 900GB SAS Drives with Raid HDD Controllers) is required for the Edge Nodes,and asingle edge node for usually suffices for multiple of Hadoop clusters. Q110) What are some of the data management tools used with the Edge Nodes in Hadoop? Oozie,Ambari,Hue,Pig and Flume are the most common of data management tools that work with edge nodes in the Hadoop. Other similar tools include to HCatalog,BigTop and Avro. Q111) Explain the core methods of a Reducer? There are three core methods of a reducer. They are- setup() – Configures different to parameters like distributed cache, heap size, and input data. reduce() – A parameter that is called once per key with the concerned on reduce task cleanup() – Clears all temporary for files and called only at the end of on reducer task. Q112) Talk about the different tombstone markers used for deletion purposes in HBase.? There are three main tombstone markers used for the deletion in HBase. They are- Family Delete Marker – Marks all the columns of an column family Version Delete Marker – Marks a single version of an single column Column Delete Marker– Marks all the versions of an single column Q113) How would you transform unstructured data into structured data? How to Approach: Unstructured data is the very common in big data. The unstructured data should be transformed into the structured data to ensure proper data are analysis. Q114) Which hardware configuration is most beneficial for Hadoop jobs? 
Dual processors or core machines with an configuration of 4 / 8 GB RAM and ECC memory is ideal for running Hadoop operations. However, the hardware is configuration varies based on the project-specific workflow and process of the flow and need to the customization an accordingly. Q115) What is the use of the Record Reader in Hadoop? Since Hadoop splits data into the various blocks, RecordReader is used to read the slit data into the single record. For instance, if our input data is the split like: Row1: Welcome to Row2: Besant It will be read as the “Welcome to Besant” using RecordReader. Q116) What is Sequencefilein put format? Hadoop uses the specific file format which is known as the Sequence file. The sequence file stores data in the serialized key-value pair. Sequencefileinputformat is an input format to the read sequence files. Q117) What happens when two users try to access to the same file in HDFS? HDFS NameNode supports exclusive on write only. Hence, only the first user will receive to the grant for the file access & second that user will be rejected. Q118) How to recover an NameNode when it’s are down? The following steps need to execute to the make the Hadoop cluster up and running: Use the FsImage which is file system for metadata replicate to start an new NameNode. Configure for the DataNodes and also the clients to make them acknowledge to the newly started NameNode. Once the new NameNode completes loading to the last for checkpoint FsImage which is the received to enough block reports are the DataNodes, it will start to serve the client. In case of large of Hadoop clusters, the NameNode recovery process to consumes a lot of time which turns out to be an more significant challenge in case of the routine maintenance. Q119) What do you understand by the Rack Awareness in Hadoop? It is an algorithm applied to the NameNode to decide then how blocks and its replicas are placed. Depending on the rack definitions network traffic is minimized between DataNodes within the same of rack. For example, if we consider to replication factor as 3, two copies will be placed on the one rack whereas the third copy in a separate rack. Q120) What are the difference between of the “HDFS Block” and “Input Split”? The HDFS divides the input data physically into the blocks for processing which is known as the HDFS Block. Input Split is a logical division of data by the mapper for mapping operation. Q121) DFS can handle a large volume of data then why do we need Hadoop framework? Hadoop is not only for the storing large data but also to process those big data. Though DFS (Distributed File System) tool can be store the data, but it lacks below features- It is not fault for tolerant Data movement over the network depends on bandwidth. Q122) What are the common input formats are Hadoop? Text Input Format – The default input format defined in the Hadoop is the Text Input Format. Sequence File Input Format – To read files in the sequence, Sequence File Input Format is used. Key Value Input Format – The input format used for the plain text files (files broken into lines) is the Key Value for Input Format. Q123) Explain some important features of Hadoop? Hadoop supports are the storage and processing of big data. It is the best solution for the handling big data challenges. Some of important features of Hadoop are 1. Open Source – Hadoop is an open source framework which means it is available free of cost Also,the users are allowed to the change the source code as per their requirements. 2. 
Distributed Processing – Hadoop supports distributed processing of the data i.e. faster processing. The data in Hadoop HDFS is stored in the distributed manner and MapReduce is responsible for the parallel processing of data. 3. Fault Tolerance – Hadoop is the highly fault-tolerant. It creates three replicas for each block at different nodes, by the default. This number can be changed in according to the requirement. So, we can recover the data from the another node if one node fails. The detection of node of failure and recovery of data is done automatically. 4. Reliability – Hadoop stores data on the cluster in an reliable manner that is independent of the machine. So, the data stored in Hadoop environment is not affected by the failure of machine. 5. Scalability – Another important feature of Hadoop is the scalability. It is compatible with the other hardware and we can easily as the new hardware to the nodes. 6. High Availability – The data stored in Hadoop is available to the access even after the hardware failure. In case of hardware failure, the data can be accessed from the another path. Q124) Explain the different modes are which Hadoop run? Apache Hadoop runs are the following three modes – Standalone (Local) Mode – By default, Hadoop runs in the local mode i.e. on a non-distributed,single node. This mode use for the local file system to the perform input and output operation. This mode does not support the use of the HDFS, so it is used for debugging. No custom to configuration is needed for the configuration files in this mode. In the pseudo-distributed mode, Hadoop runs on a single of node just like the Standalone mode. In this mode, each daemon runs in the separate Java process. As all the daemons run on the single node, there is the same node for the both Master and Slave nodes. Fully – Distributed Mode – In the fully-distributed mode, all the daemons run on the separate individual nodes and thus the forms a multi-node cluster. There are different nodes for the Master and Slave nodes. Q125) What is the use of jps command in Hadoop? The jps command is used to the check if the Hadoop daemons are running properly or not. This command shows all the daemons running on the machine i.e. Datanode, Namenode, NodeManager, ResourceManager etc. Q126) What are the configuration parameters in the “MapReduce” program? The main configuration parameters in “MapReduce” framework are: Input locations of Jobs in the distributed for file system Output location of Jobs in the distributed for file system The input format of data The output format of data The class which contains for the map function The class which contains for the reduce function JAR file which contains for the mapper, reducer and the driver classes Q127) What is a block in HDFS? what is the default size in Hadoop 1 and Hadoop 2? Can we change the block size? Blocks are smallest continuous of data storage in a hard drive. For HDFS, blocks are stored across Hadoop cluster. The default block size in the Hadoop 1 is: 64 MB The default block size in the Hadoop 2 is: 128 MB Yes,we can change block size by using the parameters – dfs.block.size located in the hdfs-site.xml file. Q128) What is Distributed Cache in the MapReduce Framework? Distributed Cache is an feature of the Hadoop MapReduce framework to cache files for the applications. Hadoop framework makes cached files for available for every map/reduce tasks running on the data nodes. Hence, the data files can be access the cache file as the local file in the designated job. 
Q129) What are the three running modes of the Hadoop? The three running modes of the Hadoop are as follows: Standalone or local: This is the default mode and doesn’t need any configuration. In this mode, all the following components for Hadoop uses local file system and runs on single JVM – NameNode DataNode ResourceManager NodeManager Pseudo-distributed: In this mode, all the master and slave Hadoop services is deployed and executed on a single node. Fully distributed: In this mode, Hadoop master and slave services is deployed and executed on the separate nodes. Q130) Explain JobTracker in Hadoop? JobTracker is a JVM process in the Hadoop to submit and track MapReduce jobs. JobTracker performs for the following activities in Hadoop in a sequence – JobTracker receives jobs that an client application submits to the job tracker JobTracker notifies NameNode to determine data node JobTracker allocates TaskTracker nodes based on the available slots. It submits the work on the allocated TaskTracker Nodes, JobTracker monitors on the TaskTracker nodes. Q131) What are the difference configuration files in Hadoop? The different configuration files in Hadoop are – core-site.xml – This configuration file of contains Hadoop core configuration settings, for example, I/O settings, very common for the MapReduce and HDFS. It uses hostname an port. mapred-site.xml – This configuration file specifies a framework name for MapReduce by the setting mapreduce.framework.name hdfs-site.xml – This configuration file contains of HDFS daemons configuration for settings. It also specifies default block for permission and replication checking on HDFS. yarn-site.xml – This configuration of file specifies configuration settings for the ResourceManager and NodeManager. Q132) What are the difference between Hadoop 2 and Hadoop 3? Following are the difference between Hadoop 2 and Hadoop 3 – Kerberos are used to the achieve security in Hadoop. There are 3 steps to access an service while using Kerberos, at a high level. Each step for involves a message exchange with an server. Authentication – The first step involves authentication of the client to authentication server, and then provides an time-stamped TGT (Ticket-Granting Ticket) to the client. Authorization – In this step, the client uses to received TGT to request a service ticket from the TGS (Ticket Granting Server) Service Request – It is the final step to the achieve security in Hadoop. Then the client uses to service ticket to authenticate an himself to the server. Q133) What is commodity hardware? Commodity hardware is an low-cost system identified by the less-availability and low-quality. The commodity hardware for comprises of RAM as it performs an number of services that require to RAM for the execution. One doesn’t require high-end hardware of configuration or super computers to run of Hadoop, it can be run on any of commodity hardware. Q134) How is NFS different from HDFS? There are a number of the distributed file systems that work in their own way. NFS (Network File System) is one of the oldest and popular distributed file an storage systems whereas HDFS (Hadoop Distributed File System) is the recently used and popular one to the handle big data. Q135) How do Hadoop MapReduce works? There are two phases of the MapReduce operation. Map phase – In this phase, the input data is split by the map tasks. The map tasks run in the parallel. These split data is used for analysis for purpose. 
Reduce phase – In this phase, the similar split data is the aggregated from the entire to collection and shows the result. Q136) What is MapReduce? What are the syntax you use to run a MapReduce program? MapReduce is a programming model in the Hadoop for processing large data sets over an cluster of the computers, commonly known as the HDFS. It is a parallel to programming model. The syntax to run a MapReduce program is the hadoop_jar_file.jar /input_path /output_path. Q137) What are the different file permissions in the HDFS for files or directory levels? Hadoop distributed file system (HDFS) uses an specific permissions model for files and directories. 1. Following user levels are used in HDFS – Owner Group Others. 2. For each of the user on mentioned above following permissions are applicable – read (r) write (w) execute(x). 3. Above mentioned permissions work on differently for files and directories. For files The r permission is for reading an file The w permission is for writing an file. For directories The r permission lists the contents of the specific directory. The w permission creates or deletes the directory. The X permission is for accessing the child directory. Q138) What are the basic parameters of a Mapper? The basic parameters of a Mapper is the LongWritable and Text and Int Writable Q139) How to restart all the daemons in Hadoop? To restart all the daemons, it is required to the stop all the daemons first. The Hadoop directory contains sbin as directory that stores to the script files to stop and start daemons in the Hadoop. Use stop daemons command /sbin/stop-all.sh to the stop all the daemons and then use /sin/start-all.sh command to start all the daemons again. Q140) Explain the process that overwrites the replication factors in HDFS? There are two methods to the overwrite the replication factors in HDFS – Method 1: On File Basis In this method, the replication factor is the changed on the basis of file using to Hadoop FS shell. The command used for this is: $hadoop fs – setrep –w2/my/test_file Here, test_file is the filename that’s replication to factor will be set to 2. Method 2: On Directory Basis In this method, the replication factor is changed on the directory basis i.e. the replication factor for all the files under the given directory is modified. $hadoop fs –setrep –w5/my/test_dir Here, test_dir is the name of the directory, then replication factor for the directory and all the files in it will be set to 5. Q141) What will happen with a NameNode that doesn’t have any data? A NameNode without any for data doesn’t exist in Hadoop. If there is an NameNode, it will contain the some data in it or it won’t exist. Q142) Explain NameNode recovery process? The NameNode recovery process involves to the below-mentioned steps to make for Hadoop cluster running: In the first step in the recovery process, file system metadata to replica (FsImage) starts a new NameNode. The next step is to configure DataNodes and Clients. These DataNodes and Clients will then acknowledge of new NameNode. During the final step, the new NameNode starts serving to the client on the completion of last checkpoint FsImage for loading and receiving block reports from the DataNodes. Note: Don’t forget to mention, this NameNode recovery to process consumes an lot of time on large Hadoop clusters. Thus, it makes routine maintenance to difficult. For this reason, HDFS high availability architecture is recommended to use. Q143) How Is Hadoop CLASSPATH essential to start or stop Hadoop daemons? 
CLASSPATH includes necessary directories that contain the jar files to start or stop Hadoop daemons. Hence, setting the CLASSPATH is essential to start or stop on Hadoop daemons. However, setting up CLASSPATH every time its not the standard that we follow. Usually CLASSPATH is the written inside /etc/hadoop/hadoop-env.sh file. Hence, once we run to Hadoop, it will load the CLASSPATH is automatically. Q144) Why is HDFS only suitable for large data sets and not the correct tool to use for many small files? This is due to the performance issue of the NameNode.Usually, NameNode is allocated with the huge space to store metadata for the large-scale files. The metadata is supposed to be an from a single file for the optimum space utilization and cost benefit. In case of the small size files, NameNode does not utilize to the entire space which is a performance optimization for the issue. Q145) Why do we need Data Locality in Hadoop? Datasets in HDFS store as the blocks in DataNodes the Hadoop cluster. During the execution of the MapReducejob the individual Mapper processes to the blocks (Input Splits). If the data does not reside in the same node where the Mapper is the executing the job, the data needs to be copied from DataNode over the network to mapper DataNode. Now if an MapReduce job has more than 100 Mapper and each Mapper tries to copy the data from the other DataNode in cluster simultaneously, it would cause to serious network congestion which is an big performance issue of the overall for system. Hence, data proximity are the computation is an effective and cost-effective solution which is the technically termed as Data locality in the Hadoop. It helps to increase the overall throughput for the system. Enroll Now! Q146) What’s Big Big Data or Hooda? Only a concept that facilitates handling large data databases. Hadoop has a single framework for dozens of tools. Hadoop is primarily used for block processing. The difference between Hadoop, the largest data and open source software, is a unique and basic one. Q147) Big data is a good life? Analysts are increasing demand for industry and large data buildings. Today, many people are looking to pursue their large data industry by having great data jobs like freshers. However, the larger data itself is just a huge field, so it’s just Hadoop jobs for freshers Q148) What is the great life analysis of large data analysis? The large data analytics has the highest value for any company, allowing it to make known decisions and give the edge among the competitors. A larger data career increases the opportunity to make a crucial decision for a career move. Q149) Hope is a NoSQL? Hadoop is not a type of database, but software software that allows software for computer software. It is an application of some types, which distributes noSQL databases (such as HBase), allowing thousands of servers to provide data in lower performance to the rankings Q150) Need Hodop to Science? Data scientists have many technical skills such as Hadoto, NoSQL, Python, Spark, R, Java and more. … For some people, data scientist must have the ability to manage using Hoodab alongside a good skill to run statistics against data set. Q151)What is the difference between large data and large data analysis? On the other hand, data analytics analyzes structured or structured data. Although they have a similar sound, there are no goals. 
… Great data is a term of very large or complex data sets that are not enough for traditional data processing applications Q152) Why should you be a data inspector? A data inspector’s task role involves analyzing data collection and using various statistical techniques. … When a data inspector interviewed for the job role, the candidates must do everything they can to see their communication skills, analytical skills and problem solving skills Q153) Great Data Future? Big data refers to the very large and complex data sets for traditional data entry and data management applications. … Data sets continue to grow and applications are becoming more and more time-consuming, with large data and large dataprocessing cloud moving more Q154) What is a data scientist on Facebook? This assessment is provided by 85 Facebook data scientist salary report (s) employees or based on statistical methods. When a factor in bonus and extra compensation, a data scientist on Facebook expected an average of $ 143,000 in salary Q155) Can Hedop Transfer? HODOOP is not just enough to replace RDGMS, but it is not really what you want to do. … Although it has many advantages to the source data fields, Hadoopcannot (and usually does) replace a data warehouse. When associated with related databases. However, this creates a powerful and versatile solution. Get Big Data Hadoop Course Now! Q156) What’s happening in Hadoop? MapReduce is widely used in I / O forms, a sequence file is a flat file containing binary key / value pairs. Graphical publications are stored locally in sequencer. It provides Reader, Writer and Seater classes. The three series file formats are: Non-stick key / value logs. Record key / value records are compressed – only ‘values’ are compressed here. Pressing keys / value records – ‘Volumes’ are collected separately and shortened by keys and values. The ‘volume’ size can be configured. Q157) What is the Work Tracker role in Huda? The task tracker’s primary function, resource management (managing work supervisors), resource availability and monitoring of the work cycle (monitoring of docs improvement and wrong tolerance). This is a process that runs on a separate terminal, not often in a data connection. The tracker communicates with the label to identify the location of the data. The best mission to run tasks at the given nodes is to find the tracker nodes. Track personal work trackers and submit the overall job back to the customer. MapReduce works loads from the slush terminal. Q158) What is the RecordReader application in Hutch? Since the Hadoop data separates various blocks, recordReader is used to read split data in a single version. For example, if our input data is broken: Row1: Welcome Row2: Intellipaat It uses “Welcome to Intellipaat” using RecordReader. Q159)What is Special Execution in Hooda? A range of Hadoop, some sloping nodes, are available to the program by distributing tasks at many ends. Tehre is a variety of causes because the tasks are slow, which are sometimes easier to detect. Instead of identifying and repairing slow-paced tasks, Hopep is trying to find out more slowly than he expected, then backs up the other equivalent task. Hadoop is the insulation of this backup machine spectrum. This creates a simulated task on another disk. You can activate the same input multiple times in parallel. After most work in a job, the rest of the functions that are free for the time available are the remaining jobs (slowly) copy copy of the splash execution system. 
Q160) What happens if you run a Hadoop job whose output directory already exists?
It will throw an exception saying the output directory already exists. To run a MapReduce job, you need to make sure there is no pre-existing output directory in HDFS. You can delete the directory before running the job, either from the shell:
hadoop fs -rmr /path/to/your/output/
or via the Java API:
FileSystem.get(conf).delete(outputDir, true);
Q161) How do you debug Hadoop code?
First, check the list of currently running MapReduce jobs. Next, check whether any orphaned jobs are running; if yes, you have to determine the location of the ResourceManager logs. Run:
ps -ef | grep -i ResourceManager
and look for the log directory in the displayed result. Check the job-id from the displayed list and see whether there is any error message associated with that job. Based on the RM logs, identify the worker node that was involved in executing the task. Now log on to that node and run:
ps -ef | grep -i NodeManager
and examine the NodeManager log. The majority of errors come from the user-level logs of each MapReduce job.
Q162) How should the replication factor in HDFS be configured?
hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml changes the default replication for all files placed in HDFS. You can also change the replication factor of a single file using the Hadoop FS shell:
$ hadoop fs -setrep -w 3 /my/file
Conversely, you can change the replication factor of all files under a directory:
$ hadoop fs -setrep -w 3 -R /my/dir
(A programmatic variant is sketched after this list of questions.)
Q163) How do you compress the mapper output without compressing the final output?
To achieve this, set:
conf.set("mapreduce.map.output.compress", "true")
conf.set("mapreduce.output.fileoutputformat.compress", "false")
Q164) Which companies use Hadoop?
Yahoo (the largest contributor to the creation of Hadoop; the Yahoo search engine was built on it), Facebook (analytics), Amazon, Netflix, Adobe, eBay, Spotify and Twitter, among others.
Q165) Do I have to know Java to learn Hadoop?
Java skills for MapReduce are an additional plus but are not required. To learn Hadoop and build an excellent career around it, basic knowledge of Linux and of basic programming principles is enough.
Q166) What should you consider when deploying the Secondary NameNode?
The Secondary NameNode should always run on a separate, dedicated machine. This prevents it from interfering with the operation of the primary NameNode.
Q167) Name the modes in which Hadoop code can be run.
There are various modes to run Hadoop code –
Fully distributed mode
Pseudo-distributed mode
Standalone (local) mode
Q168) Name the operating systems supported for Hadoop deployment.
Linux is the main supported operating system. Hadoop can, however, also be run on the Windows operating system with some additional software.
Q169) Why is HDFS used for applications with large data sets rather than many small files?
HDFS is more efficient when a large data set is maintained in a single file than when the same amount of data is stored as small pieces spread over multiple files. The NameNode keeps the file-system metadata in RAM, so the amount of memory limits the number of files an HDFS file system can hold. In simpler terms, more files generate more metadata, which means more memory (RAM); as a rule of thumb, the metadata of a block, file or directory takes about 150 bytes.
Q170) What are the main properties of hdfs-site.xml?
There are three important properties of hdfs-site.xml:
dfs.data.dir – identifies the location of the data storage.
dfs.name.dir – specifies the location of the metadata storage, i.e. whether the DFS image is kept on local disk or at a remote location.
fs.checkpoint.dir – the directory used by the Secondary NameNode.
Q171) What are the essential Hadoop tools that improve the performance of Big Data work?
Some of the essential Hadoop-ecosystem tools that enhance Big Data work are Hive, HDFS, HBase, Avro, SQL and NoSQL stores, Oozie, Flume, Solr/Lucene, and ZooKeeper.
Q172) What do you know about the SequenceFile format?
A SequenceFile is defined as a flat file containing binary key/value pairs. It is heavily used in MapReduce input/output formats, and map outputs are stored internally as SequenceFiles. The forms of a SequenceFile are:
Record-compressed key/value records – in this format, only the values are compressed.
Block-compressed key/value records – in this format, keys and values are collected in blocks separately, stored, and then compressed.
Uncompressed key/value records – in this format, neither keys nor values are compressed.
Q173) Explain the JobTracker's functions.
In Hadoop, the JobTracker performs various functions, such as –
It manages resources and manages the task life cycle.
It is responsible for locating the data by contacting the NameNode.
It executes tasks on the given nodes by finding the best TaskTracker nodes.
It monitors all TaskTrackers individually and then submits the overall job status back to the client.
It is responsible for supervising the MapReduce workloads running on the slave nodes.
Q174) How is HDFS different from NAS?
The following points distinguish HDFS from NAS –
The Hadoop Distributed File System (HDFS) is a distributed file system that stores data on commodity hardware, whereas Network Attached Storage (NAS) is a file-level server where data storage is attached to a computer network.
HDFS distributes all data blocks across the machines of a cluster, while NAS stores data on dedicated hardware.
HDFS is cost-effective because it uses commodity hardware, whereas NAS stores data on high-end devices that involve high expense.
HDFS works with MapReduce; NAS does not work with MapReduce, because there data and computation are kept separately.
Q175) Is HDFS fault tolerant? If so, how?
Yes, HDFS is very fault tolerant. Whenever some data is stored in HDFS, the NameNode replicates that data (as copies) to multiple DataNodes. The default replication factor is 3, and it can be changed according to your needs. If a DataNode goes down, the NameNode takes the data from the replicas and copies it to another node, so the data is made available automatically. In this way HDFS provides the fault-tolerance feature.
Q176) Distinguish between an HDFS block and an input split.
The main difference is that an HDFS block is the physical division of the data, while an input split is the logical division. For processing, HDFS first divides the data into blocks and stores all the blocks together, whereas MapReduce divides the data into input splits and assigns each split to a mapper function.
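Q162 above shows the shell commands for changing the replication factor; the same change can be requested through the HDFS Java API. This is a minimal sketch only, assuming a default Configuration that points at the cluster; the path /my/file is just an example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Programmatic counterpart of "hadoop fs -setrep 3 /my/file" for a single file.
        boolean requested = fs.setReplication(new Path("/my/file"), (short) 3);
        System.out.println("Replication change requested: " + requested);

        fs.close();
    }
}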
Q177) What happens when two clients try to access the same file on HDFS?
Remember that HDFS supports exclusive writes (only one client may write to a file at a time). When the first client contacts the NameNode to open the file for writing, the NameNode grants that client a lease to create the file. When a second client sends a request to open the same file for writing, the NameNode sees that the lease for that file has already been granted to another client and rejects the second client's request.
Q178) What is a block in HDFS?
A block is the smallest unit of a hard drive in which data is stored. HDFS stores data in blocks, which are then distributed across the Hadoop cluster: the whole file is first divided into blocks and then stored as separate units.
Q179) What is Apache YARN?
YARN stands for Yet Another Resource Negotiator. It is the Hadoop cluster management system, introduced with Hadoop 2 as the next generation of MapReduce for resource management and job scheduling. It helps Hadoop support additional processing approaches and a wider range of applications.
Q180) What is the NodeManager?
The NodeManager is YARN's equivalent of the TaskTracker. It takes instructions from the ResourceManager and manages the resources of a single node. It is responsible for containers, monitors them, and reports their resource usage to the ResourceManager. Every container process running on a slave node is initially provisioned, monitored and tracked by the NodeManager associated with that slave node.
Q181) What is the RecordReader in Hadoop?
In Hadoop, the RecordReader is used to read the data of a single split back as complete records; this matters because Hadoop splits data into blocks. For example, if the input data is split as Row1: "Welcome to" and Row2: "the Hadoop World", the RecordReader reads it as "Welcome to the Hadoop World".
Q182) How do you compress the mapper output without affecting the final output?
To compress the mapper output without affecting the job output, set the following:
conf.set("mapreduce.map.output.compress", "true")
conf.set("mapreduce.output.fileoutputformat.compress", "false")
Q183) Explain the different methods of a Reducer.
The various methods of a Reducer include:
setup() – used to configure various parameters such as input data size. Syntax: public void setup(Context context)
cleanup() – used to clean up all temporary files at the end of the task. Syntax: public void cleanup(Context context)
reduce() – this method is known as the heart of the Reducer; it is called once per key of the reduce task. Syntax: public void reduce(Key key, Values values, Context context)
Q184) How can you configure the replication factor in HDFS?
The hdfs-site.xml file is used for the configuration of HDFS. To change the default replication value for all files stored in HDFS, change the dfs.replication property in hdfs-site.xml.
Q185) What is the use of the "jps" command?
The jps command is used to verify whether the Hadoop daemons are running. It lists all the Hadoop daemons running on the machine: NameNode, NodeManager, ResourceManager, DataNode and so on.
Q186) What is the next step after the Mapper or MapTask?
The output of the Mapper is sorted, and partitions are created for that output. The number of partitions depends on the number of reducers.
Q187) How do we control which keys go to a particular Reducer?
Users can control which keys (and hence which records) go to which Reducer by implementing a custom Partitioner.
Q188) What is the use of the combiner?
A combiner can be specified via Job.setCombinerClass(ClassName) to perform local aggregation of the intermediate map outputs with a custom class, which helps to reduce the amount of data transferred from the Mapper to the Reducer.
Q189) How many maps are there in a particular job?
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files. The right level of parallelism is around 10–100 maps per node. Task setup takes a while, so it is best if the maps take at least a minute to execute. If you expect 10TB of input data and have a block size of 128MB, you will end up with 82,000 maps; the mapreduce.job.maps parameter can influence this (it only provides a hint to the framework). In the end, the number of map tasks is determined by the number of splits returned by InputFormat.getSplits() (which you can override).
Q190) What is the use of the Reducer?
The Reducer reduces the set of intermediate values that share one key to a (usually smaller) set of values. The number of reduces for a job is set with Job.setNumReduceTasks(int).
Q191) Explain the core methods of the Reducer.
The Reducer API is similar to the Mapper's: there is a run() method that receives a Context containing the job's configuration and the framework interfaces. run() calls setup() once, then reduce() once for each key associated with the reduce task, and finally cleanup(). Each of these methods can access the job's configuration via Context.getConfiguration(). As with the Mapper, any or all of these methods can be overridden with custom processing; if none of them are overridden, the default reducer action is the identity function and values are passed through without further processing. The heart of the Reducer is its reduce() method, which is called once per key; its second argument is an Iterable that provides all the values associated with that key. (A minimal skeleton tying these methods together is sketched after this list.)
Q192) What are the primary phases of the Reducer?
Shuffle, sort and reduce.
Q193) Explain the shuffle phase.
The input to the Reducer is the sorted output of the mappers. In this phase, the framework fetches the relevant partition of the output of all the mappers via HTTP.
Q194) Explain the Reducer's sort phase.
In this phase the framework groups the Reducer's input by key (because different mappers may have emitted the same key). The shuffle and sort phases occur simultaneously; map outputs are merged while they are being fetched.
Q195) Explain the reduce phase.
In this phase the reduce(MapOutKeyType, Iterable, Context) method is called for each <key, (list of values)> pair of the grouped inputs. The output of the reduce task is typically written to the FileSystem via Context.write(ReduceOutKeyType, ReduceOutValType). Applications can use the Context to report progress, set application-level status messages, update Counters, or just indicate that they are alive. The output of the Reducer is not sorted.
Big Data Questions and Answers Pdf Download Read the full article
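Before moving on, here is a minimal, hypothetical Reducer skeleton that ties together Q183, Q188, Q190 and Q191 above. It is an illustration only, not code from the quoted questions; the class name and types are made up.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

// Skeleton of the Reducer life cycle described in Q183 and Q191.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // One-time initialisation; parameters can be read via context.getConfiguration().
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();   // aggregate every value that shares this key
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Release any temporary resources created in setup().
    }

    // Driver wiring for Q188 and Q190: the same class doubles as a combiner for
    // local aggregation, and the number of reduce tasks is set explicitly.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setCombinerClass(SumReducer.class);
        job.setNumReduceTasks(2);
        // ... mapper, input/output formats and paths would be configured here ...
    }
}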
0 notes
jobssss · 8 years ago
Link
Assuming you are using Spark 2.1.0 or later and my_DF is your dataframe:
//get the schema as a string of comma-separated field-datatype pairs
StructType my_schema = my_DF.schema();
StructField[] fields = my_schema.fields();
String fieldStr = "";
for (StructField f : fields) {
    fieldStr += f.name() + " " + f.dataType().typeName() + ",";
}
//drop the table if already created
spark.sql("drop table if exists my_table");
//create the table using the dataframe schema (trim the trailing comma)
spark.sql("create table my_table(" + fieldStr.substring(0, fieldStr.length() - 1)
    + ") row format delimited fields terminated by '|' location '/my/hdfs/location'");
//write the dataframe data to the hdfs location for the created Hive table
my_DF.write()
    .format("com.databricks.spark.csv")
    .option("delimiter", "|")
    .mode("overwrite")
    .save("/my/hdfs/location");
The other method uses a temporary table:
my_DF.createOrReplaceTempView("my_temp_table");
spark.sql("drop table if exists my_table");
spark.sql("create table my_table as select * from my_temp_table");
0 notes
milindjagre · 8 years ago
Text
Post 37 | HDPCD | Specifying delimiter of a Hive table
Specifying DELIMITER of a Hive table
Hello, everyone. Thanks for coming back for one more tutorial in this HDPCD certification series.
In the last tutorial, we saw how to specify the storage format of a Hive table. In this tutorial, we are going to see how to specify the delimiter of a Hive table.
We are going to follow the process shown in the following infographic.
Tumblr media
Apache Hive: Specifying delimiter
This process is similar to…
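The post is cut off before its example, so here is a minimal sketch of the kind of statement it is describing, issued through Spark's Hive support in the same style as the Spark post above. The table name, column names and delimiter are assumptions, not the tutorial's actual example.
import org.apache.spark.sql.SparkSession;

public class HiveDelimiterSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hive-delimiter-sketch")
                .enableHiveSupport()
                .getOrCreate();

        // ROW FORMAT DELIMITED ... FIELDS TERMINATED BY is the clause that
        // fixes the column delimiter of the Hive table.
        spark.sql("CREATE TABLE IF NOT EXISTS delim_demo ("
                + " id STRING, name STRING, price FLOAT)"
                + " ROW FORMAT DELIMITED"
                + " FIELDS TERMINATED BY ','"
                + " LINES TERMINATED BY '\\n'"
                + " STORED AS TEXTFILE");

        spark.stop();
    }
}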
View On WordPress
0 notes
tusharsarde-blog · 8 years ago
Text
Create table in Hive using octal code
Create table in Hive using octal code
Create table in Hive using octal code
Reference: this example is taken from the Programming Hive book.
Let’s create a table in hive
CREATE TABLE employee ( name STRING, salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS…
View On WordPress
0 notes
siva3155 · 6 years ago
Text
300+ TOP Apache SQOOP Interview Questions and Answers
SQOOP Interview Questions for freshers experienced :-
1. What is the process to perform an incremental data load in Sqoop? The process to perform incremental data load in Sqoop is to synchronize the modified or updated data (often referred as delta data) from RDBMS to Hadoop. The delta data can be facilitated through the incremental load command in Sqoop. Incremental load can be performed by using Sqoop import command or by loading the data into hive without overwriting it. The different attributes that need to be specified during incremental load in Sqoop are- Mode (incremental) –The mode defines how Sqoop will determine what the new rows are. The mode can have value as Append or Last Modified. Col (Check-column) –This attribute specifies the column that should be examined to find out the rows to be imported. Value (last-value) –This denotes the maximum value of the check column from the previous import operation. 2. How Sqoop can be used in a Java program? The Sqoop jar in classpath should be included in the java code. After this the method Sqoop.runTool () method must be invoked. The necessary parameters should be created to Sqoop programmatically just like for command line. 3. What is the significance of using –compress-codec parameter? To get the out file of a sqoop import in formats other than .gz like .bz2 we use the –compress -code parameter. 4. How are large objects handled in Sqoop? Sqoop provides the capability to store large sized data into a single field based on the type of data. Sqoop supports the ability to store- CLOB ‘s – Character Large Objects BLOB’s –Binary Large Objects Large objects in Sqoop are handled by importing the large objects into a file referred as “LobFile” i.e. Large Object File. The LobFile has the ability to store records of huge size, thus each record in the LobFile is a large object. 5. What is a disadvantage of using –direct parameter for faster data load by sqoop? The native utilities used by databases to support faster load do not work for binary data formats like SequenceFile 6. Is it possible to do an incremental import using Sqoop? Yes, Sqoop supports two types of incremental imports- Append Last Modified To insert only rows Append should be used in import command and for inserting the rows and also updating Last-Modified should be used in the import command. 7. How can you check all the tables present in a single database using Sqoop? The command to check the list of all tables present in a single database using Sqoop is as follows- Sqoop list-tables –connect jdbc: mysql: //localhost/user; 8. How can you control the number of mappers used by the sqoop command? The Parameter –num-mappers is used to control the number of mappers executed by a sqoop command. We should start with choosing a small number of map tasks and then gradually scale up as choosing high number of mappers initially may slow down the performance on the database side. 9. What is the standard location or path for Hadoop Sqoop scripts? /usr/bin/Hadoop Sqoop 10. How can we import a subset of rows from a table without using the where clause? We can run a filtering query on the database and save the result to a temporary table in database. Then use the sqoop import command without using the –where clause
Tumblr media
Apache SQOOP Interview Questions 11. When the source data keeps getting updated frequently, what is the approach to keep it in sync with the data in HDFS imported by sqoop? qoop can have 2 approaches. To use the –incremental parameter with append option where value of some columns are checked and only in case of modified values the row is imported as a new row. To use the –incremental parameter with lastmodified option where a date column in the source is checked for records which have been updated after the last import. 12. What is a sqoop metastore? It is a tool using which Sqoop hosts a shared metadata repository. Multiple users and/or remote users can define and execute saved jobs (created with sqoop job) defined in this metastore. Clients must be configured to connect to the metastore in sqoop-site.xml or with the –meta-connect argument. 13. Can free form SQL queries be used with Sqoop import command? If yes, then how can they be used? Sqoop allows us to use free form SQL queries with the import command. The import command should be used with the –e and – query options to execute free form SQL queries. When using the –e and –query options with the import command the –target dir value must be specified. 14. Tell few import control commands: Append Columns Where These command are most frequently used to import RDBMS Data. 15. Can free form SQL queries be used with Sqoop import command? If yes, then how can they be used? Sqoop allows us to use free form SQL queries with the import command. The import command should be used with the –e and – query options to execute free form SQL queries. When using the –e and –query options with the import command the –target dir value must be specified. 16. How can you see the list of stored jobs in sqoop metastore? sqoop job –list 17. What type of databases Sqoop can support? MySQL, Oracle, PostgreSQL, IBM, Netezza and Teradata. Every database connects through jdbc driver. 18. What is the purpose of sqoop-merge? The merge tool combines two datasets where entries in one dataset should overwrite entries of an older dataset preserving only the newest version of the records between both the data sets. 19. How sqoop can handle large objects? Blog and Clob columns are common large objects. If the object is less than 16MB, it stored inline with the rest of the data. If large objects, temporary stored in_lob subdirectory. Those lobs processes in a streaming fashion. Those data materialized in memory for processing. IT you set LOB to 0, those lobs objects placed in external storage. 20. What is the importance of eval tool? It allows user to run sample SQL queries against Database and preview the results on the console. It can help to know what data can import? The desired data imported or not? 21. What is the default extension of the files produced from a sqoop import using the –compress parameter? .gz 22. Can we import the data with “Where” condition? Yes, Sqoop has a special option to export/import a particular data. 23. What are the limitations of importing RDBMS tables into Hcatalog directly? There is an option to import RDBMS tables into Hcatalog directly by making use of –hcatalog –database option with the –hcatalog –table but the limitation to it is that there are several arguments like –as-avro file , -direct, -as-sequencefile, -target-dir , -export-dir are not supported. 24. what are the majorly used commands in sqoop? In Sqoop Majorly Import and export command are used. But below commands are also useful sometimes. 
codegen, eval, import-all-tables, job, list-database, list-tables, merge, metastore. 25. What is the usefulness of the options file in sqoop. The options file is used in sqoop to specify the command line values in a file and use it in the sqoop commands. For example the –connect parameter’s value and –user name value scan be stored in a file and used again and again with different sqoop commands. 26. what are the common delimiters and escape character in sqoop? The default delimiters are a comma(,) for fields, a newline(\n) for records Escape characters are \b,\n,\r,\t,\”, \\’,\o etc 27. What are the two file formats supported by sqoop for import? Delimited text and Sequence Files. 28. while loading table from MySQL into HDFS, if we need to copy tables with maximum possible speed, what can you do? We need to use -direct argument in import command to use direct import fast path and this -direct can be used only with MySQL and PostGreSQL as of now. 29. How can you sync a exported table with HDFS data in which some rows are deleted? Truncate the target table and load it again. 30. Differentiate between Sqoop and distCP. DistCP utility can be used to transfer data between clusters whereas Sqoop can be used to transfer data only between Hadoop and RDBMS. 31. How can you import only a subset of rows form a table? By using the WHERE clause in the sqoop import statement we can import only a subset of rows. 32. How do you clear the data in a staging table before loading it by Sqoop? By specifying the –clear-staging-table option we can clear the staging table before it is loaded. This can be done again and again till we get proper data in staging. 33. What is Sqoop? Sqoop is an open source project that enables data transfer from non-hadoop source to hadoop source. It can be remembered as SQL to Hadoop -> SQOOP. It allows user to specify the source and target location inside the Hadoop. 35. How can you export only a subset of columns to a relational table using sqoop? By using the –column parameter in which we mention the required column names as a comma separated list of values. 36. Which database the sqoop metastore runs on? Running sqoop-metastore launches a shared HSQLDB database instance on the current machine. 37. How will you update the rows that are already exported? The parameter –update-key can be used to update existing rows. In it a comma-separated list of columns is used which uniquely identifies a row. All of these columns is used in the WHERE clause of the generated UPDATE query. All other table columns will be used in the SET part of the query. 38. You have a data in HDFS system, if you want to put some more data to into the same table, will it append the data or overwrite? No it can’t overwrite, one way to do is copy the new file in HDFS. 39. Where can the metastore database be hosted? The metastore database can be hosted anywhere within or outside of the Hadoop cluster. 40. Which is used to import data in Sqoop ? In SQOOP import command is used to import RDBMS data into HDFS. Using import command we can import a particular table into HDFS. – 41. What is the role of JDBC driver in a Sqoop set up? To connect to different relational databases sqoop needs a connector. Almost every DB vendor makes this connecter available as a JDBC driver which is specific to that DB. So Sqoop needs the JDBC driver of each of the database it needs to interact with. 42. 
How to import only the updated rows form a table into HDFS using sqoop assuming the source has last update timestamp details for each row? By using the lastmodified mode. Rows where the check column holds a timestamp more recent than the timestamp specified with –last-value are imported. 43. What is InputSplit in Hadoop? When a hadoop job is run, it splits input files into chunks and assign each split to a mapper to process. This is called Input Split 44. Hadoop sqoop word came from ? Sql + Hadoop = sqoop 45. What is the work of Export In Hadoop sqoop ? Export the data from HDFS to RDBMS 46. Use of Codegen command in Hadoop sqoop ? Generate code to interact with database records 47. Use of Help command in Hadoop sqoop ? List available commands 48. How can you schedule a sqoop job using Oozie? Oozie has in-built sqoop actions inside which we can mention the sqoop commands to be executed. 49. What are the two file formats supported by sqoop for import? Delimited text and Sequence Files. 50. What is a sqoop metastore? It is a tool using which Sqoop hosts a shared metadata repository. Multiple users and/or remote users can define and execute saved jobs (created with sqoop job) defined in this metastore. Clients must be configured to connect to the metastore in sqoop-site.xml or with the –meta-connect argument. SQOOP Questions and Answers pdf Download Read the full article
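Question 2 above notes that Sqoop can be driven from a Java program through Sqoop.runTool(). Below is a minimal sketch of an incremental append import invoked that way; it assumes Sqoop 1.4.x on the classpath, and the connection string, credentials, table and target directory are placeholders, not values from the post.
import org.apache.sqoop.Sqoop;

public class IncrementalImportFromJava {
    public static void main(String[] args) {
        // Build the same arguments you would pass on the command line.
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:mysql://localhost/user",
            "--username", "dbuser",
            "--password", "dbpass",
            "--table", "orders",
            "--target-dir", "/data/orders",
            "--incremental", "append",
            "--check-column", "id",
            "--last-value", "0",
            "--num-mappers", "1"
        };
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.out.println("Sqoop exited with code " + exitCode);
    }
}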
0 notes
asimjalis-blog1 · 8 years ago
Text
Hive and Data Conversion
How can I convert HDFS files from text to SequenceFile and other formats?
While Hive is primarily a tool for running SQL queries against data in HDFS, it can also be used as a tool to convert data from one format to another. Hive is able to do this because it has a rich collection of SerDes (serializer-deserializers) for different formats.
Here is how to convert a text file to SequenceFile format using Hive.
Create a file called sales_data with the following contents:
010,CA,Alice,100
011,NY,Bob,200
013,NY,Bob,300
013,WA,Bob,200
Start the Hive shell and type the following commands into it.
-- Create table with source format.
CREATE TABLE sales_as_text (
  id STRING, state STRING, name STRING, amount DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

-- Create table with target format.
CREATE TABLE sales_as_seq (
  id STRING, state STRING, name STRING, amount DOUBLE)
STORED AS SEQUENCEFILE;

-- Load sales data.
LOAD DATA LOCAL INPATH 'sales_data' INTO TABLE sales_as_text;

-- Optionally enable compression.
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;

-- Use INSERT SELECT to convert formats.
INSERT OVERWRITE TABLE sales_as_seq
SELECT * FROM sales_as_text;
0 notes