#stringtype | Explore Tumblr posts and blogs

nitorinfotech-blog · 4 years ago

Text

Data Processing with Apache Spark

Spark has emerged as a favorite for analytics, especially those that can handle massive volumes of data as well as provide high performance compared to any other conventional database engines. Spark SQL allows users to formulate their complex business requirements to Spark by using the familiar language of SQL.

So, in this blog, we will see how you can process data with Apache Spark and what better way to establish the capabilities of Spark than to put it through its paces and use the Hadoop-DS benchmark to compare performance, throughput, and SQL compatibility against SQL Server.

Before we begin, ensure that the following test environment is available:

SQL Server: 32 GB RAM with Windows server 2012 R2

Hadoop Cluster: 2 machines with 8GB RAM Ubuntu flavor

Sample Data:

For the purpose of this demo, we will use AdventureWorks2016DW data.

Following table is used in query with no of records:

We will compare performance of three data processing engines, which are SQL Server, Spark with CSV files as datafiles and Spark with Parquet files as datafiles.

Query:

We will use the following query to process data:

select pc.EnglishProductCategoryName, ps.EnglishProductSubcategoryName, sum(SalesAmount)

from FactInternetSales f

inner join dimProduct p on f.productkey = p.productkey

inner join DimProductSubcategory ps on p.ProductSubcategoryKey = ps.ProductSubcategoryKey

inner join DimProductCategory pc on pc.ProductCategoryKey = ps.ProductCategoryKey

inner join dimcustomer c on c.customerkey = f.customerkey

group by pc.EnglishProductCategoryName, ps.EnglishProductSubcategoryName

Let’s measure the performance of each processing engine:

1) SQL Server:

While running query in SQL Server with the 32GB RAM Microsoft 2012 Server, it takes around 2.33 mins to execute and return the data.

Following is the screenshot for the same:

2) Spark with CSV data files:

Now let’s export the same dataset to CSV and move it to HDFS.

Following is the screenshot of HDFS with the CSV file as an input source.

Now that we have the files for the specific input tables moved to HDFS as CSV files, we can start with Spark Shell and create DataFrames for each source file.

Run Following commands for creating SQL Context:

import org.apache.spark.sql.types._

import org.apache.spark.sql.{Row, SQLContext}

val sqlContext = new SQLContext(sc)

Run following command to create Fact Schema :

val factSchema = StructType(Array(

StructField("ProductKey", IntegerType, true),

StructField("OrderDateKey", IntegerType, true),

StructField("DueDateKey", IntegerType, true),

StructField("ShipDateKey", IntegerType, true),

StructField("CustomerKey", IntegerType, true),

StructField("PromotionKey", IntegerType, true),

StructField("CurrencyKey", IntegerType, true),

StructField("SalesTerritoryKey", IntegerType, true),

StructField("SalesOrderNumber", StringType, true),

StructField("SalesOrderLineNumber", IntegerType, true),

StructField("RevisionNumber", IntegerType, true),

StructField("OrderQuantity", IntegerType, true),

StructField("UnitPrice", DoubleType, true),

StructField("ExtendedAmount", DoubleType, true),

StructField("UnitPriceDiscountPct", DoubleType, true),

StructField("DiscountAmount", DoubleType, true),

StructField("ProductStandardCost", DoubleType, true),

StructField("TotalProductCost", DoubleType, true),

StructField("SalesAmount", DoubleType, true),

StructField("TaxAmt", DoubleType, true),

StructField("Freight", DoubleType, true),

StructField("CarrierTrackingNumber", StringType, true),

StructField("CustomerPONumber", StringType, true),

StructField("OrderDate", TimestampType, true),

StructField("DueDate", TimestampType, true),

StructField("ShipDate", TimestampType, true)

));

Run following command to create a DataFrame for Sales with Fact Schema:

val salesCSV = sqlContext.read.format("csv")

.option("header", "false")

.schema(factSchema)

.load("/data/FactSalesNew/part-m-00000")

Run following command to create Customer schema:

val customerSchema = StructType(Array(

StructField("CustomerKey", IntegerType, true),

StructField("GeographyKey", IntegerType, true),

StructField("CustomerAlternateKey", StringType, true),

StructField("Title", StringType, true),

StructField("FirstName", StringType, true),

StructField("MiddleName", StringType, true),

StructField("LastName", StringType, true),

StructField("NameStyle", BooleanType, true),

StructField("BirthDate", TimestampType, true),

StructField("MaritalStatus", StringType, true),

StructField("Suffix", StringType, true),

StructField("Gender", StringType, true),

StructField("EmailAddress", StringType, true),

StructField("YearlyIncome", DoubleType, true),

StructField("TotalChildren", IntegerType, true),

StructField("NumberChildrenAtHome", IntegerType, true),

StructField("EnglishEducation", StringType, true),

StructField("SpanishEducation", StringType, true),

StructField("FrenchEducation", StringType, true),

StructField("EnglishOccupation", StringType, true),

StructField("SpanishOccupation", StringType, true),

StructField("FrenchOccupation", StringType, true),

StructField("HouseOwnerFlag", StringType, true),

StructField("NumberCarsOwned", IntegerType, true),

StructField("AddressLine1", StringType, true),

StructField("AddressLine2", StringType, true),

StructField("Phone", StringType, true),

StructField("DateFirstPurchase", TimestampType, true),

StructField("CommuteDistance", StringType, true)

));

Run following command to create Customer a dataframe with Customer Schema.

val customer = sqlContext.read.format("csv")

.option("header", "false")

.schema(customerSchema)

.load("/data/dimCustomer/part-m-00000")

Now create product schema with the following command:

val productSchema = StructType(Array(

StructField("ProductKey", IntegerType, true),

StructField("ProductAlternateKey", StringType, true),

StructField("ProductSubcategoryKey", IntegerType, true),

StructField("WeightUnitMeasureCode", StringType, true),

StructField("SizeUnitMeasureCode", StringType, true),

StructField("EnglishProductName", StringType, true),

StructField("SpanishProductName", StringType, true),

StructField("FrenchProductName", StringType, true),

StructField("StandardCost", DoubleType, true),

StructField("FinishedGoodsFlag", BooleanType, true),

StructField("Color", StringType, true),

StructField("SafetyStockLevel", IntegerType, true),

StructField("ReorderPoint", IntegerType, true),

StructField("ListPrice", DoubleType, true),

StructField("Size", StringType, true),

StructField("SizeRange", StringType, true),

StructField("Weight", DoubleType, true),

StructField("DaysToManufacture", IntegerType, true),

StructField("ProductLine", StringType, true),

StructField("DealerPrice", DoubleType, true),

StructField("Class", StringType, true),

StructField("Style", StringType, true),

StructField("ModelName", StringType, true),

StructField("LargePhoto", StringType, true),

StructField("EnglishDescription", StringType, true),

StructField("FrenchDescription", StringType, true),

StructField("ChineseDescription", StringType, true),

StructField("ArabicDescription", StringType, true),

StructField("HebrewDescription", StringType, true),

StructField("ThaiDescription", StringType, true),

StructField("GermanDescription", StringType, true),

StructField("JapaneseDescription", StringType, true),

StructField("TurkishDescription", StringType, true),

StructField("StartDate", TimestampType, true),

StructField("EndDate", TimestampType, true),

StructField("Status", StringType, true)

))

Create product data frame with Product schema.

val product = sqlContext.read.format("csv")

.option("header", "false")

.schema(productSchema)

.load("/data/dimProduct/part-m-00000")

Now create Product Category schema using following command:

val productCategotySchema = StructType(Array(

StructField("ProductCategoryKey", IntegerType, true),

StructField("ProductCategoryAlternateKey", IntegerType, true),

StructField("EnglishProductCategoryName", StringType, true),

StructField("SpanishProductCategoryName", StringType, true),

StructField("FrenchProductCategoryName", StringType, true)

))

Now create Product Category Data frame with ProductCategory Schema:

val productCategory = sqlContext.read.format("csv")

.option("header", "false")

.schema(productCategotySchema)

.load("/data/dimProductCategory/part-m-00000")

Now create Product Sub Category schema using following command:

val productSubCategotySchema = StructType(Array(

StructField("ProductSubcategoryKey", IntegerType, true),

StructField("ProductSubcategoryAlternateKey", IntegerType, true),

StructField("EnglishProductSubcategoryName", StringType, true),

StructField("SpanishProductSubcategoryName", StringType, true),

StructField("FrenchProductSubcategoryName", StringType, true),

StructField("ProductCategoryKey", IntegerType, true)

))

And create productsubcategory data frame using below command:

val productSubCategory = sqlContext.read.format("csv")

.option("header", "false")

.schema(productSubCategotySchema)

.load("/data/dimProductSubCategory/part-m-00000")

Now create temporary views of each data frame that we have created so far:

sales.createOrReplaceTempView("salesV")

customer.createOrReplaceTempView("customerV")

product.createOrReplaceTempView("productV")

productCategory.createOrReplaceTempView("productCategoryV")

productSubCategory.createOrReplaceTempView("productSubCategoryV")

And Run the same query which we ran in SQL Server:

Val df_1=spark.sql("""select pc.EnglishProductCategoryName, ps.EnglishProductSubcategoryName, sum(SalesAmount)

from salesV f

inner join productV p on f.productkey = p.productkey

inner join productSubCategoryV ps on p.ProductSubcategoryKey = ps.ProductSubcategoryKey

inner join productCategoryV pc on pc.ProductCategoryKey = ps.ProductCategoryKey

inner join customerV c on c.customerkey = f.customerkey

group by pc.EnglishProductCategoryName, ps.EnglishProductSubcategoryName """)

df_1.show()

It took around 3 mins to execute the result set.

3)Spark with Parquet file for Fact Table:

Now, let’s convert FactInternetSaleNew file to parquet file and save to hdfs using the following command:

salesCSV.write.format("parquet").save("sales_parquet")

Create dataframe on top of Parquet file using below command:

val sales = sqlContext.read.parquet("/user/nituser/sales.parquet")

And create temp view using sales data frame:

sales.createOrReplaceTempView("salesV")

Now, we will run the same query which we used in step 2:

val df_1=spark.sql("""select pc.EnglishProductCategoryName, ps.EnglishProductSubcategoryName, sum(SalesAmount)

from salesV f

inner join productV p on f.productkey = p.productkey

inner join productSubCategoryV ps on p.ProductSubcategoryKey = ps.ProductSubcategoryKey

inner join productCategoryV pc on pc.ProductCategoryKey = ps.ProductCategoryKey

inner join customerV c on c.customerkey = f.customerkey

group by pc.EnglishProductCategoryName, ps.EnglishProductSubcategoryName """)

It will return the same result set in less than 20 secs.

We can conclude by stating that Spark with commodity hardware performs very similar to the high-end server of SQL Server. However, Spark outshines other engines when it deals with column-oriented efficient and compressed storage format.

So, we need to decide the specifications for the processing engine and storage based on business requirements, while also understanding how we can boost the power of such a highly efficient processing engine and get the required performance.

Reach out to us at Nitor Infotech know more about Apache Spark and how you can utilize it to accelerate your business and make advanced analytics more innovative.

#sqlserver #apache spark #data processing services #data engineering services

0 notes

tak4hir0 · 5 years ago

Link

React TypeScript: Basics and Best PracticesAn updated handbook/cheat sheet for working with React.js with TypeScript.There is no single “right” way of writing React code using TypeScript. As with other technologies, if your code compiles and works, you probably did something right. That being said, there are “best practices” that you’d want to consider following, especially when writing code others will have to either read or re-use for their own purposes. So, here I’m going to list some useful code-snippets that follow said “best practices”. There are a lot of them, some that you might’ve used already in the past and some that might be new. Just go through the list and make mental notes. Bookmarking this article for future reference might be a good idea as well. Making your components ready for sharing, with TypeScriptExample: browsing through shared React components in bit.devBit.dev has become a very popular alternative to traditional component libraries as it offers a way to “harvest” and share individual components from any codebase (to a single component hub). By building projects using React with TS, you make sure your components are easily comprehensible to other developers (as well as to your future self). That is absolutely crucial for making them ready for sharing. It’s a great way to write maintainable code and optimize your team collaboration. Learn more about sharing and reusing React TS components across repos here: Getting startedcreate-react-app with TypeScript$ npx create-react-app your-app-name --template typescriptIf you’re more of a fan of Yarn, you can use the following command: $ yarn create react-app your-app-name --template typescriptIn either case, notice how we’re not directly using the app, rather, we’re using other tools that will download the latest version of the app whenever it’s required. This helps ensure you’re not using an outdated version. BasicsSome of the very interesting tidbits added by TS to the language are: InterfacesOne of the many benefits TypeScript brings to the table, is access to constructs such as this, which allows you to define the interface of your components or even any other complex objects you might want to use with them, such as the shape your Props object will have (i.e how many properties and their types). The above code ensures that whoever uses your components needs to add exactly 3 properties: text: which needs to be a Stringtype: which needs to be a ButtonType option (I’ll cover Enums in a second)action: which is a simple functionNote that we “extended” the FC (Functional Component) type with our own custom interface. That gives our function all the generic functional component definitions such as the ‘children’ prop and a return type that must be assignable to JSX.Element. If you ignore one of them or send something that’s not compatible, both the TypeScript compiler and your IDE (assuming you’re using a JavaScript specific IDE, such as Code) will notify you and won’t allow you to continue until you fix it. A better way to define our ExtendedButton element would be to extend a native HTML button element type like so: But more on that topic later in this post… Also, note that when working with Bit.dev or react-docgen, the following syntax is required to auto-generate docs: (The props are defined directly and explicitly using :IButtonProps in addition to defining the component with ) EnumsJust like with Interfaces, Enums allow you to define a set of related constants as part of a single entity. Importing and using Enums: Please note that unlike Interfaces or Types, Enums will get translated into plain JavaScript. So, for example, this: enum SelectableButtonTypes {Important = "important",Optional = "optional",Irrelevant = "irrelevant"}will transform into this: "use strict";var SelectableButtonTypes;(function (SelectableButtonTypes) {SelectableButtonTypes["Important"] = "important";SelectableButtonTypes["Optional"] = "optional";SelectableButtonTypes["Irrelevant"] = "irrelevant";})(SelectableButtonTypes || (SelectableButtonTypes = {}));Interfaces vs Types aliasA common question that newcomers to TypeScript have is whether they should be using Interfaces or Type Aliases for different parts of their code — after all, the official documentation is a bit unclear regarding that topic. Truth is, although these entities are conceptually different, in practice, they are quite similar: They can both be extended.2. They can both be used to define the shape of objects. 3. They both can be implemented in the same way. The only extra feature Interfaces bring to the table (that Type aliases don’t), is “declaration merging” which means you can define the same interface several times and with each definition, the properties get merged: Optional types for your propsPart of the benefits of using Interfaces is that you’re able to enforce the properties of your props for your components. However, thanks to the optional syntax available through TypeScript, you can also define optional props, like this: HooksHooks are the new mechanics React provides to interact with several of its features (such as the state) without the need to define a class. Adding type check to hooksHooks such as useState receive a parameter and correctly return the state (again, that’s for this case) and a function to set it. Thanks to TypeScript’s type validation, you can enforce the type (or interface) of the initial value of the state, like this: Nullable values to hooksHowever, if the initial value for your hook can potentially be a null, then the above example will fail. For these cases, TypeScript allows you to set an optional type as well, making sure you’re covered from all sides. That way you’re ensuring you keep type checks, but allow for those scenarios where the initial value can come as null. Generic ComponentsMuch like the way you define generic functions and interfaces in TypeScript, you can define generic components, allowing you to re-use them for different data types. You can do this for props and states as well. You can then use the component either by taking advantage of type inference or directly specifying the data types, likes so: Type inference exampleDirectly declared typesFor the latter, note that if your list contains strings instead of numbers, TypeScript will throw an error during the transpilation process. Extending HTML ElementsSometimes, your components function and behave like native HTML elements (on steroids). For example, a “borederd box” (which is simply a component that always renders a div with a default border) or a “big submit” (which again, is nothing but your good old submit button with a default size and maybe some custom behavior). For these scenarios, it’s best to define your component type as a native HTML element or an extension of it. As you can see, I’ve extended HTML’s default props and added a new one: “title” for my specific needs. Event TypesAs you probably know, React provides its own set of events, which is why you can’t directly use the good old HTML Events. That being said, you do have access to all the useful UI events you need, so much so in fact, that they have the same names as well, so make sure you reference them directly like React.MouseEvent or just remember to import them from React like so: import React, { Component, MouseEvent } from 'react';The benefits of using TypeScript here, is that we can also use Generics (like in the previous example) to restrict the elements a particular event handler can be used on. For example, the following code will not work: And you’ll see an error message similar to the following: You can, however, use unions to allow a single handler to be re-used by multiple components: Integrated type definitionFinally, for the last tip, I wanted to mention the index.d.ts and the global.d.ts files. They’re both installed when you add React to your project (if you used npm, you’ll find them inside the npm_modules/@types folder. These files contain type and interface definitions used by React, so if you need to understand the props of one particular type, you can simply open these files and review their content. For example: There you can see a small section of the index.d.ts file, showing the different signatures for the createElement function. ConclusionThere is a lot more you can achieve by using TypeScript as part of your React toolchain, so if you saw at least one snippet that you liked here, consider reading up on how to gradually migrate your React projects to TypeScript or even learn how to design your own React TypeScript libraries here. Either way, I hope you got something out of this article, and feel free to leave any other tips or tricks you’ve picked up over the years of using TypeScript for your React projects! See you on the next one! Learn More

#Medium

0 notes