xenonstack-blog - Tumblr blog

xenonstack-blog · 7 years ago

Text

Data Ingestion and Processing of Data For Big Data and IoT Solutions

Overview

In the era of the Internet of Things and Mobility, with a huge volume of data becoming available at a fast velocity, there must be the need for an efficient Analytics System.

Also, the variety of data is coming from various sources in different formats, such as sensors, logs, structured data from an RDBMS, etc. In the past few years, the generation of new data has drastically increased. More applications are being built, and they are generating more data at a faster rate.

Earlier, Data Storage was costly, and there was an absence of technology which could process the data in an efficient manner. Now the storage costs have become cheaper, and the availability of technology to transform Big Data is a reality.

What is Big Data Concept?

According to the Author Dr. Kirk Borne, Principal Data Scientist, Big Data Definition is Everything, Quantified, and Tracked. Let’s pick that apart -

Everything – Means every aspect of life, work, consumerism, entertainment, and play is now recognized as a source of digital information about you, your world, and anything else we may encounter.

Quantified – Means we are storing those "everything” somewhere, mostly in digital form, often as numbers, but not always in such formats. The quantification of features, characteristics, patterns, and trends in all things is enabling Data Mining, Machine Learning, statistics, and discovery at an unprecedented scale on an unprecedented number of things. The Internet of Things is just one example, but the Internet of Everything is even more impressive.

Tracked – Means we don’t directly quantify and measure everything just once, but we do so continuously. It includes - tracking your sentiment, your web clicks, your purchase logs, your geolocation, your social media history, etc. or tracking every car on the road, or every motor in a manufacturing plant or every moving part on an airplane, etc. Consequently, we see the emergence of smart cities, smart highways, personalized medicine, personalized education, precision farming, and so much more.

Advantages of Streaming Data

Smarter Decisions

Better Products

Deeper Insights

Greater Knowledge

Optimal Solutions

Customer-Centric Products

Increased Customer Loyalty

More Automated Processes, more accurate Predictive and Prescriptive Analytics

Better models of future behaviors and outcomes in Business, Government, Security, Science, Healthcare, Education, and more.

D2D Communication Meets Big Data

Data-to-Decisions

Data-to-Discovery

Data-to-Dollars

10 Vs of Big Data

Big Data Architecture & Patterns

The Best Way to a solution is to "Split The Problem." Big Data Solution can be well understood using Layered Architecture. The Layered Architecture is divided into different Layers where each layer performs a particular function.

This Architecture helps in designing the Data Pipeline with the various requirements of either Batch Processing System or Stream Processing System. This architecture consists of 6 layers which ensure a secure flow of data.

Data Ingestion Layer

This layer is the first step for the data coming from variable sources to start its journey. Data here is prioritized and categorized which makes data flow smoothly in further layers.

Data Collector Layer

In this Layer, more focus is on the transportation of data from ingestion layer to rest of data pipeline. It is the Layer, where components are decoupled so that analytic capabilities may begin.

Data Processing Layer

In this primary layer, the focus is to specialize the data pipeline processing system or we can say the data we have collected in the previous layer is to be processed in this layer. Here we do some magic with the data to route them to a different destination, classify the data flow and it’s the first point where the analytic may take place.

Data Storage Layer

Storage becomes a challenge when the size of the data you are dealing with, becomes large. Several possible solutions can rescue from such problems. Finding a storage solution is very much important when the size of your data becomes large. This layer focuses on "where to store such a large data efficiently."

Data Query Layer

This is the layer where active analytic processing takes place. Here, the primary focus is to gather the data value so that they are made to be more helpful for the next layer.

Data Visualization Layer

The visualization, or presentation tier, probably the most prestigious tier, where the data pipeline users may feel the VALUE of DATA. We need something that will grab people’s attention, pull them into, make your findings well-understood.

You May also Love to Read Arising Need of Modern Big Data Integration Platform

1. Data Ingestion Architecture

Data ingestion is the first step for building Data Pipeline and also the toughest task in the System of Big Data. In this layer we plan the way to ingest data flows from hundreds or thousands of sources into Data Center. As the Data is coming from Multiple sources at variable speed, in different formats.

That's why we should properly ingest the data for the successful business decisions making. It's rightly said that "If starting goes well, then, half of the work is already done."

What is Ingestion in Big Data?

Big Data Ingestion involves connecting to various data sources, extracting the data, and detecting the changed data. It's about moving data - and especially the unstructured data - from where it is originated, into a system where it can be stored and analyzed.

We can also say that Data Ingestion means taking data coming from multiple sources and putting it somewhere it can be accessed. It is the beginning of Data Pipeline where it obtains or import data for immediate use.

Data can be streamed in real time or ingested in batches, When data is ingested in real time then, as soon as data arrives it is ingested immediately. When data is ingested in batches, data items are ingested in some chunks at a periodic interval of time. Ingestion is the process of bringing data into Data Processing system.

Effective Data Ingestion process begins by prioritizing data sources, validating individual files and routing data items to the correct destination.

Challenges in Data Ingestion

As the number of IoT devices increases, both the volume and variance of Data Sources are expanding rapidly. So, extracting the data such that it can be used by the destination system is a significant challenge regarding time and resources. Some of the other problems faced by Data Ingestion are -

When numerous Big Data sources exist in the different format, it's the biggest challenge for the business to ingest data at the reasonable speed and further process it efficiently so that data can be prioritized and improves business decisions.

Modern Data Sources and consuming application evolve rapidly.

Data produced changes without notice independent of consuming application.

Data Semantic Change over time as same Data Powers new cases.

Detection and capture of changed data - This task is difficult, not only because of the semi-structured or unstructured nature of data but also due to the low latency needed by individual business scenarios that require this determination.

That's why it should be well designed assuring following things -

Able to handle and upgrade the new data sources, technology and applications

Assure that consuming application is working with correct, consistent and trustworthy data.

Allows rapid consumption of data

Capacity and reliability - The system needs to scale according to input coming and also it should be fault tolerant.

Data volume - Though storing all incoming data is preferable; there are some cases in which aggregate data is stored.

Data Ingestion Parameters

Data Velocity - Data Velocity deals with the speed at which data flows in from different sources like machines, networks, human interaction, media sites, social media. The movement of data can be massive or continuous.

Data Size - Data size implies enormous volume of data. Data is generated by different sources that may increase timely.

Data Frequency (Batch, Real-Time) - Data can be processed in real time or batch, in real time processing as data received on same time, it further proceeds but in batch time data is stored in batches, fixed at some time interval and then further moved.

Data Format (Structured, Semi-Structured, Unstructured) - Data can be in different formats, mostly it can be the structured format, i.e., tabular one or unstructured format, i.e., images, audios, videos or semi-structured, i.e., JSON files, CSS files, etc.

Big Data Ingestion Key Principles

To complete the process of Data Ingestion, we should use right tools for that and most important that tools should be capable of supporting some of the fundamental principles written below -

Network Bandwidth - Data Pipeline must be able to compete with business traffic. Sometimes traffic increases or sometimes decreases, so Network bandwidth scalability is biggest Data Pipeline challenge. Tools are required for bandwidth throttling and compression capabilities.

Unreliable Network - Data Ingestion Pipeline takes data with multiple structures, i.e., images, audios, videos, text files, tabular files data, XML files, log files, etc. and due to the variable speed of data coming, it might travel through the unreliable network. Data Pipeline should be capable of supporting this also.

Heterogeneous Technologies and System - Tools for Data Ingestion Pipeline must be able to use different data sources technologies and different operating system.

Choose Right Data Format - Tools must provide data serialization format, that means as data comes in the variable format so converting them into single format will provide an easier view to understand or relate the data.

Streaming Data - It depends upon business necessity whether to process the data in batch or streams or real time. Sometimes we may require both processing. So, tools must be capable of supporting both.

You May also Love to Read Data Ingestion Using Apache Nifi For Building Data Lake Using Twitter Data

Data Serialization in Big Data

Different types of users have various types of data consumer needs. Here we want to share variable data, so we must plan how the user can access data in a meaningful way. That's why a single image of variable data optimize the data for human readability.

Approaches used for this are -

Apache Thrift

It's an RPC Framework containing Data Serialization Libraries.

Google Protocol Buffers

It can use the specially generated source code to easily write and read structured data to and from a variety of data streams and using a variety of languages.

Apache Avro

The more recent Data Serialization format that combines some of the best features which previously listed. Avro Data is self-describing and uses a JSON-schema description. This schema is included with the data itself and natively support compression. Probably it may become a de facto standard for Data Serialization.

Big Data Ingestion Tools

Apache Flume

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

It has a straightforward and flexible architecture based on streaming data flows. It is robust and faults tolerant with tunable reliability mechanisms and many failovers and recovery mechanisms.

It uses a simple, extensible data model that allows for an online analytic application.

Functions of Apache Flume

Stream Data - Ingest streaming data from multiple sources into Hadoop for storage and analysis.

Insulate System - Buffer storage platform from transient spikes, when the rate of incoming data exceeds the rate at which data can be written to the destination

Scale Horizontally - To ingest new data streams and additional volume as needed.

Apache Nifi

Apache Nifi provides an easy to use, the powerful, and reliable system to process and distribute data. Apache NiFi supports robust and scalable directed graphs of data routing, transformation, and system mediation logic. Its functions are -

Track data flow from beginning to end

Seamless experience in design, control, feedback, and monitoring

Secure because of SSL, SSH, HTTPS, encrypted content.

Elastic Logstash

Elastic Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously transforms it, and then sends it to your “stash, " i.e., Elasticsearch.

It easily ingests from your logs, metrics, web applications, data stores, and various AWS services and done in continuous, streaming fashion. It can Ingest Data of all Shapes, Sizes, and Sources.

2. Data Collector Architecture

In this Layer, more focus is on transportation data from ingestion layer to rest of Data Pipeline. Here we use a messaging system that will act as a mediator between all the programs that can send and receive messages.

Here the tool used is Apache Kafka. It's a new approach in message-oriented middleware.

Apache Kafka Overview

It is used for building real-time data pipelines and streaming apps. It can process streams of data in real-time and store streams of data safely in a distributed replicated cluster.

Kafka works in combination with Apache Storm, Apache HBase and Apache Spark for real-time analysis and rendering of streaming data.

Data Pipeline Architecture

Data Pipeline the main component of Data Integration. All transformation of data happens in Data Pipeline.

It is a Python-based tool that streams and transforms real-time data to service that need it.

Data Pipeline Automate the movement and transformation of data. Data Pipeline is a Data Processing engine that runs inside your application.

It is used to transform all the incoming data in a standard format so that we can prepare it for analysis and visualization. Data Pipeline is built on Java Virtual Machine (JVM).

So, a Data Pipeline is a series of steps that your data moves through. The output of one step in the process becomes the input of the next. Data, typically raw data, goes on one side, passes through a series of steps.

The steps of a Data Pipeline can include cleaning, transforming, merging, modeling and more, in any combination.

Data Pipeline Functions

Data Ingestion

Data Pipeline Helps in bringing data into your system. It means taking unstructured data from where it is originated into a system where it can be stored and analyzed for making business decisions

Data Integration

Data Pipeline also helps in bringing different types of data together.

Data Organization

Organizing data means an arrangement of data; this arrangement is also made in Data Pipeline.

Data Refining

It's also one of the processes where we can enhance, clean, improve the raw data.

Data Analytics

After improving the useful data, Data Pipeline provides us the processed data on which we can apply the operations on raw data and can make business decisions accurately.

Need Of Data Pipeline

A Data Pipeline is software that takes data from multiple sources and makes it available to be used strategically for making business decisions.

Primarily reasons for the need of data pipeline is because it's tough to monitor Data Migration and manage data errors. Other reasons for this are below -

Business Decisions - Critical Analysis is only possible when combining data from multiple sources. For making business decisions, we should have a single image of all the data coming.

Connections - All the time data keeps on increasing, new data came and old data modified, so, each new integration can take anywhere from a few days to a few months to complete.

Accuracy - The only way to build trust with data consumers is to make sure that your data is auditable. One best practice that’s easy to implement is never to discard inputs or intermediate forms when altering data.

Latency - The fresher your data, the agiler your company’s decision-making can be. Extracting data from APIs and databases in real-time can be difficult, and many target data sources, including large object stores like Amazon S3 and analytics databases like Amazon Redshift, are optimized for receiving data in chunks rather than a stream.

Scalability - Data can be increased or decreased with time we can't say for on Monday data will come less and rest of days comes a lot for processing. So, usage of data is not uniform. What we can do is making our pipeline so scalable that able to handle any amount of data coming at variable speed.

Data Pipeline Use Cases

Data Pipeline is useful to some roles, including CTOs, CIOs, Data Scientists, Data Engineers, BI Analysts, SQL Analysts, and anyone else who derives value from a unified real-time stream of user, web, and mobile engagement data. So, use cases for data pipeline are given below -

For Business Intelligence Teams

For SQL Experts

For Data Scientists

For Data Engineers

For Product Teams

What is Apache Kafka for?

Building Real-Time streaming Data Pipelines that reliably get data between systems or applications

Building Real-Time streaming applications that transform or react to the streams of data.

Apache Kafka Use Cases

Stream Processing

Website Activity Tracking

Metrics Collection and Monitoring

Log Aggregation

Apache Kafka Features

One of the features of Apache Kafka is durable Messaging.

Apache Kafka relies heavily on the file system for storing and caching messages: rather than maintain as much as possible in memory and flush it all out to the filesystem, all data is immediately written to a persistent log on the filesystem without necessarily flushing to disk.

Apache Kafka solves the situation where the producer is generating messages faster than the consumer can consume them in a reliable way.

Apache Kafka Architecture

Apache Kafka System design act as Distributed commit log, where incoming data is written sequentially on disk. There are four main components involved in moving data in and out of Apache Kafka -

Topics - Topic is a user-defined category to which messages are published.

Producers - Producers post messages to one or more topics

Consumers - Consumers subscribe to topics and process the posted messages.

Brokers - Brokers that manage the persistence and replication of message data.

3. Data Processing Layer

In the previous layer, we gathered the data from different sources and made it available to go through rest of pipeline.

In this layer, our task is to do magic with data, as now data is ready we only have to route the data to different destinations.

In this main layer, focus is to specialize Data Pipeline processing system or we can say the data we have collected by the last layer in this next layer we have to do processing on that data.

Batch Processing System

A simple batch processing system for offline analytics. For doing this tool used is Apache Sqoop.

What is Apache Sqoop?

It efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Apache Sqoop can also be used to extract data from Hadoop and export it into external structured data stores.

Apache Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.

Functions of Apache Sqoop

Import sequential data sets from mainframe

Data imports

Parallel data Transfer

Fast data copies

Efficient data analysis

Load balancing

Near Real Time Processing System

A pure online processing system for online analytics. For this type of processing Apache Storm is used.The Apache Storm cluster makes decisions about the criticality of the event and sends the alerts to the warning system (dashboard, e-mail, other monitoring systems).

What is Apache Storm?

It is a system for processing streaming data in real time. It adds reliable real-time data processing capabilities to Enterprise Hadoop. Storm on YARN is powerful for scenarios requiring real-time analytics, machine learning and continuous monitoring of operations.

6 Key Features of Apache Storm

Fast – It can process one million 100 byte messages per second per node.

Scalable – It can do parallel calculations that run across a cluster of machines.

Fault-tolerant – When workers die, Storm will automatically restart them. If a node dies, the worker will be restarted on another node.

Reliable – Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures.

Easy to operate – It consists of Standard configurations that are suitable for production on day one. Once deployed, Storm is easy to work.

Hybrid Processing system - This consist of Batch and Real-time processing System capabilities. For this type of processing tool used is Apache Spark and Apache Flink.

What is Apache Spark?

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to data sets.

With Spark running on Apache Hadoop YARN, developers everywhere can now create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared data set in Hadoop.

What is Apache Flink?

Apache Flink is an open-source framework for distributed stream processing that Provides results that are accurate, even in the case of out-of-order or late-arriving data. Some of its features are -

It is stateful and fault-tolerant and can seamlessly recover from failures while maintaining exactly-once application state.

Performs at large scale, running on thousands of nodes with excellent throughput and latency characteristics.

It's streaming data flow execution engine, APIs and domain-specific libraries for Batch, Streaming, Machine Learning, and Graph Processing.

Apache Flink Use Cases

Optimization of e-commerce search results in real-time

Stream processing-as-a-service for data science teams

Network/Sensor monitoring and error detection

ETL for Business Intelligence Infrastructure

4. Data Storage Layer

Next, the major issue is to keep data in the right place based on usage. We have relational Databases that were a successful place to store our data over the years.

But with the new big data strategic enterprise applications, you should no longer be assuming that your persistence should be relational.

We need different databases to handle the different variety of data, but using different databases creates overhead. That's why there is an introduction to the new concept in the database world, i.e., the Polyglot Persistence.

What is Polyglot Persistence?

Polyglot persistence is the idea of using multiple databases to power a single application. Polyglot persistence is the way to share or divide your data into multiple databases and leverage their power together.

It takes advantage of the strength of different database. Here various types of data are arranged in a variety of ways. In short, it means picking the right tool for the right use case.

It’s the same idea behind Polyglot Programming, which is the idea that applications should be written in a mix of languages to take advantage of the fact that different languages are suitable for tackling different problems.

Continue Reading: XenonStack/Blog

#data ingestion #big data #iot #data #xenonstack

0 notes

xenonstack-blog · 7 years ago

Text

What is BlockChain Technology

bitcoin

Overview

There are different kinds of people in the world, who have different languages, different cultures, different eating styles but one of the things that brings them together is money, which is everybody's necessity.

In today’s fast moving and developing era, we need a safe and secure platform to be able to make online transactions, for whatever purpose they might be. In whatever transactions we do, banks are the third party involved in it, and they have all records of our transaction and of one’s account holdings.

Blockchain Technology aims at removing this third-party involvement and keeping the transactions only limited to the sender and the receiver. A broad idea behind this technology was to build a secure platform where safe transactions could happen in a very transparent manner.

With the help of Blockchain technology, we can save ourselves from any kind of overheads and third party involvements and send or receiving money could be as hassle-free as sending emails. BlockChain Technology is a peer to peer software technology that protects a digital piece of information.

What is BlockChain Technology?

The origin of Blockchain technology is a little uncertain and still for experts to find out for sure. It is said to be invented by a person or a group of people known by the name of Satoshi Nakamoto in the year 2009.

Initially, it was developed to enable digital transactions between two parties in an anonymous fashion, without having the need of a third party to perform verification of the transaction.

The main inspiration behind the development of such a great technology at the time was to facilitate the transfer of Bitcoins, but it later caught on and is today being used for many other important things.

The blockchain technology is an open system for transaction processing that follows distributed ledger approach whose goal is to automate the processes and reduce data storage costs and provide data security and eliminate duplicates.

We can also say that BlockChain is a method of recording data, i.e., transactions, contracts, agreements that need to be recorded independently and verified as they are happening.

A very good way to understand the concept of Blockchain technology is the google docs analogy. Just the way Google Docs shared between two people enables both of them to make changes to the document at the same time and visible to the other party, similarly the transparency of blockchain works.

BlockChain sounds like a revolution and is the underlying Technology behind Bitcoin. Truly, BlockChain is a mechanism that brings everyone to the highest degree of accountability, i.e., no more missed transaction, no more third party involvement in the transaction.

Blockchain guarantees the validity of a transaction by recording it not only on the main register but also by connecting distributed system of registers, and all of which are connected through a secure validation mechanism. It may have been invented to create the alternative currency Bitcoin but can be used for other purposes like online signature services, voting systems, and many other applications.

It's a good way instead of sending our payment information through servers, in BlockChain Technology all transactions are Copied and cross-checked between every computer in a system, which becomes very safe at scale.

BlockChain Technology is a type of distributed ledger that means a database that is consensually shared and synchronized across network spread across multiple sites, institutions. Blockchain provides an unalterable, public record of digital transactions in packages called Blocks. These digitally recorded “Blocks” of data are stored in a linear path, and each block contains cryptographically hashed data. Main Features -

Capable of transforming

Making transaction faster

Reducing cost

More security

Transparency

Seamless and simultaneous integration of transaction

Settlements and ledger updates directly between multiple parties

Creates a secure way to share information

Conduct transactions without the need for a single, central party to approve them.

Only authorized network members can see details of their transactions, providing confidentiality and privacy.

All updates to the shared ledger are validated and recorded on all participants shared ledgers, which drives security and accuracy.

All updates to the ledger are unchangeable and auditable. Network members can accurately trace their past activity.

How Does BlockChain Works?

The Blockchain technology enables direct transactions between two parties without any intermediary such as a bank or a governing body. It is essentially a database of all transactions happening in the network.

The database is public and therefore, not owned by any one party, it is distributed that is, it is not stored on a single computer. Instead, it is stored on many computers across the world. The database is constantly synchronized to keep the transactions up to date and is secured by the art of cryptography making it hacker proof.

The basic framework on which the whole blockchain technology works is actually two-fold first is gathering data (transaction records) and the second is putting these blocks together securely in a chain with the help of cryptography.

Say a transaction happens, this transaction information is shared with everybody on the blockchain network. These transactions are individually timestamped and once these transactions are put together in a block, they have timestamped again as a whole block.

Now, this complete block is appended to the chain in the blockchain network. Other participants might also be adding to the network at the same time, but the timestamps added to each block takes care of the order in which the blocks are appended to the network.

The timestamps also take care of any duplicity issues hence, everybody on the network has the recent version of the chain available to them. The main cryptographic element that makes this whole system tamper-free is the hash function.

Each block's information is taken and a hash function is applied to it. The value computed from this is then stored in the next block and so on. So in this way, each block’s hash function value is being carried by the next block in the chain which makes tampering with the contents of the block very difficult.

Even if some changes are made to the block, one could easily find out because that block’s hash value would not be the same as the already calculated value of the hash function that was stored in the next block of the chain.

Blockchain works as a network of computers. Bitcoin photography is used to keep transactions secure and also shared among those in the network after the transaction is validated, the details of the transfer are recorded on a public ledger that anyone on the network and sees in the existing financial system essentially ledger maintained by the institution access the custodian of the information.

But on a blockchain the information is transparently held in a shared database and no one party access the movement, thus increasing the trust among parties.

How Bitcoin Transaction Works?

For everyone, it's easy to download a simple piece of software and install it on the computer. But to use Bitcoin which is a decentralized be out of your system, we do not need to register an account with any particular company or handle or any of your personal details, once you have a wallet you can create addresses which effectively become your identity within the network.

Suppose party A wants to send money to party B in the form of Bitcoins. For both party A and Party B, the transaction is collected in a block. A block record some or all of the most recent Bitcoin transactions that have not yet entered any cry of blocks, the new block is then broadcasted to all the parties or so-called nodes in the network the parties in the network approve that the transaction is valid through a process called mining.

Building Blocks With BlockChain

A very simple definition of blockchain is that “the blockchain is distributed, digital ledger.” One of the key features of the blockchain is that it is a ‘Distributed Database’ that is to say, the database exists in multiple copies across multiple computers.

Concept of BlockChain Technology

Shared View - One of the most powerful features of BlockChain Technology is it’s shared a view of data for all participants in a peer-to-peer network. Transaction records can be shared but cannot be altered. Shared views have two approaches i.e. Traditional Approach where each party maintains their own independent ledger and Blockchain Approach where all parties share and maintain the same ledger.

Cryptography - Cryptography is used to establish identity and protect the integrity of the underlying data. One of its concepts is hashing and this concept is used in the blockchain technology. Hashing is an effective means of determining if any piece of data has been changed or not. It generates a fingerprint for a piece of data by applying a cryptographic function to it. Changing one character in the original string results in a completely different hash value. Also, the original string cannot be reverse engineered from the hash function.

CryptoCurrency - Everything You Need to Know

Cryptocurrency is a virtual currency which uses cryptography for security. The main defining features of this currency which make what it is are its decentralized nature, transparent ledgers and it's security feature which makes it resistant to any kind of malicious manipulation. Bitcoin is the is the first and the most famous cryptocurrency. It was invented in 2009 and has seen an enormous rise in value ever since.

The cryptographic technique that it follows is SHA-256, it is a hash function that is used to encrypt every transaction before being added to the blockchain. The underlying and main aspect on which cryptocurrency works is that all the transactions happen in a transparent and decentralized ledger for everyone to see and verify if need be. One can go back to the very first transaction of a user to verify their credibility by tracking back in the ledger.

Benefits of CryptoCurrencies

The advantages of cryptocurrency are much fold. To name a few would be:

No Fraudulent activities - Transactions happening in this process cannot be reversed or in any other way be tampered with, without at least letting anybody know. The whole architecture is such that any discrepancies cannot go unnoticed and can be detected.

Faster transactions - The traditional way of settling payments involves a trusted third party, namely banks mainly, this makes the process take more time. In cryptocurrency, since the payments are peer-to-peer without any involvement of a third party, the payment process happens immediately without any delay. No Overheads- With the elimination of third party involvements, the overheads in the form of service fees that these institutions charge are also eliminated. This makes cryptocurrencies more cost-effective.

The integrity of identity - Unlike credit cards, which always have a risk of getting misused since all the information regarding it is given to the vendor, this mode of payment does not divulge any more information than required and only pushes the amount of money that is required to be paid.

How is Blockchain Changing Money and Business?

Blockchain technology is likely to have a great impact on next few decades. Currently, Blockchain is not the most thundering concept in the world, but it is believed that it will be the next generation of the internet. For the past few decades, we've had the internet for information.

The crucial difference between Internet and BlockChain is that the Internet enables the exchange of data, blockchain could enable the exchange of value, i.e., it could allow users to carry out trade and commerce across the globe without the need for payment processors, curator, and settlement and adjusting entities. Trust is a very crucial thing, and blockchain is one of the biggest technologies that peer.

Trust enables people everywhere to trust each other and transact peer established, not by some big institution, but by collaboration, by cryptography and by some smart code. And because trust is native to the technology, so we can say that it's “The Trusted Protocol.”

9 Key Features Of BlockChain:

Continue Reading: XenonStack/Blog

#xenonstack #blockchain technology #blockchain #cryptocurrency #bitcoin

0 notes

xenonstack-blog · 7 years ago

Text

How To Adopt DevOps in your Organization

While Scaling up the Business and working with remote teams with a different skill set and culture, I realize the need for processes and automation to improve the productivity and collaboration.

At growth Stage, with 3+ Years experience of delivering more than 55 projects in various domains for Startups and Enterprises including -

Mobility

Big Data

Internet of Things

Private Cloud and Hybrid Cloud

Problems Faced By Developers & Operations Team

Ownership Issues during Deployment

Fewer and Slow Releases

Flat Access Control and Security

Revision Control

Scaling Up Resources For Application Stack

Manual Processes Involved in Delivery Pipeline

Isolated Declaration of Dependencies

Single Configuration for Multiple Deployments

Manual Testing Results Into Slower Release

Shared Backing Services

Lean Start To DevOps Adoption

We Started transformation towards DevOps Strategy by adopting processes like Integration of DevOps Tools, Processes and Data into our work Culture. Parallelly, We Started adopting different Infrastructure architectures, Building Private Cloud, Docker, Apache Mesos, and Kubernetes.

Initial Steps To Implement DevOps

Enforcing Rules with the help of right tools - Agile board integration with SCM, Build Tool and Deployment Tool

Collaboration Tools - Rocket Chat Integration with Taiga, GitLab, Jenkins

Continuous Integration and Delivery

Explicit Dependency Management

Automated Testing

Hands-On Training

We started by creating two separate teams from existing pool of developers to adopt DevOps culture for new Projects in Big Data and Mobile Applications. After Initial hurdles in adaptation to Collaboration Tools and new delivery pipeline, results came out were extraordinary.

Results After Initial Phase

Improved Performance & Productivity

Less Manual Work

Better Collaboration and Communication

Developers Getting more Empowered and Involved in Delivery

Proper Dependency and Configuration Management

Challenges To DevOps Adoption

Cultural Shift in the way Things were being developed

Changing Mindset for Adaptation.

Support for Legacy Environments

Integrating Security and Compliance with new Setup

No support for Overlay Networks

Continue Reading: XenonStack/Blog

#DevOps #organization #xenonstack

0 notes

xenonstack-blog · 8 years ago

Text

Data Preprocessing and Data Wrangling in Machine Learning and Deep Learning

Introduction

Deep learning and Machine learning are becoming more and more important in today's ERP (Enterprise Resource Planning). During the process of building the analytical model using Deep Learning or Machine Learning the data set is collected from various sources such as a file, database, sensors and much more.

But, the collected data cannot be used directly for performing analysis process. Therefore, to solve this problem Data Preparation is done. This includes two techniques that are listed below:

Data Preprocessing

Data Wrangling

Data Preparation is an important part of Data Science. It includes two concepts such as Data Cleaning and Feature Engineering. These two are compulsory for achieving better accuracy and performance in the Machine Learning and Deep Learning projects.

Data Preprocessing

Data Preprocessing is a technique that is used to convert the raw data into a clean data set. In other words, whenever the data is gathered from different sources it is collected in raw format which is not feasible for the analysis.

Therefore, certain steps are executed to convert the data into a tiny clean dataset. This technique is performed before the execution of Iterative Analysis. The set of steps is known as Data Preprocessing. This includes Data Cleaning, Data Integration, Data Transformation and Data Reduction.

Data Wrangling

Data Wrangling is a technique that is executed at the time of making an interactive model. In other words, it is used to convert the raw data into the format that is convenient for the consumption of data.

This technique is also known as Data Munging. This method also follows certain steps such as after extracting the data from different data sources, sorting of data using certain algorithm is performed, decompose the data into a different structured format and finally store the data into another database.

Need of Data Preprocessing

For achieving better results from the applied model in Machine Learning and Deep Learning projects the format of the data has to be in a proper manner. Some specified Machine Learning and Deep Learning model need data in a specified format, for example, Random Forest algorithm does not support null values, therefore to execute random forest algorithm null values has to be managed from the original raw data set.

Another aspect is that dataset should be formatted in such a way that more than one Machine Learning and Deep Learning algorithms are executed in one dataset and best out of them is chosen.

Need of Data Wrangling

Data Wrangling is an important aspect for implementing the model. Therefore, data is converted to the proper feasible format before applying any model intro it. By performing filtering, grouping and selecting appropriate data accuracy and performance of the model could be increased.

Another concept is that when time series data has to be handled every algorithm is executed with different aspects. Therefore Data Wrangling is used to convert the time series data into the required format of the applied model. In simple words, the complex data is converted into a usable format for performing analysis into it.

Why is Data Preprocessing used?

Data Preprocessing is necessary because of the presence of unformatted real world data. Mostly real world data is composed of -

Inaccurate data (missing data) - There are many reasons for missing data such as data is not continuously collected, a mistake in data entry, technical problems with biometrics and much more.

The presence of noisy data (erroneous data and outliers) - The reasons for the presence of noisy data could be a technological problem of gadget that gathers data, a human mistake during data entry and much more.

Inconsistent data - The presence of inconsistencies is due to the reasons such that existence of duplication within data, human data entry, containing mistakes in codes or names i.e. violation of data constraints and much more.

Therefore, to handle raw data, Data Preprocessing is performed.

Why is Data Wrangling used?

Data Wrangling is used to handle the issue of Data Leakage while implementing Machine Learning and Deep Learning. First of all, we have to understand what is Data Leakage?

What is Data Leakage in Machine Learning/Deep Learning?

Data Leakage is responsible for the cause of invalid Machine Learning/Deep Learning model due to the over optimization of the applied model.

Data Leakage is the term used when the data from outside i.e. not part of training dataset is used for the learning process of the model. This additional learning of information by the applied model will disapprove the computed estimated performance of the model.

For example when we want to use the specific feature for performing Predictive Analysis but that specific feature is not present at the time of training of dataset then data leakage will be introduced within the model.

Data Leakage can be demonstrated in many ways that are given below:

Leakage of data from test dataset to training dataset.

Leakage of computed correct prediction to the training dataset.

Leakage of future data into the past data.

Usage of data outside the scope of applied algorithm

In general, the leakage of data is observed from two main sources of Machine Learning/Deep Learning algorithms such as feature attributes (variables) and training dataset.

How to check the presence of Data Leakage within the applied model?

Data Leakage is observed at the time of usage of complex datasets. They are described below:

At the time of dividing time series dataset into training and test, the dataset is a complex problem.

Implementation of sampling in a graphical problem is a complex task.

Storage of analog observations in the form of audios and images in separate files having a defined size and timestamp.

How is Data Preprocessing performed?

Data Preprocessing is performed to remove the cause of unformatted real world data which are discussed above.

First of all, let's discuss how missing data can be handled. There are three different steps that can be executed which are given below:

Ignoring the missing record - It is the simplest and effective method for handling the missing data. But, this method should not be performed at the time when the number of missing values are huge or when the pattern of data is related to the unrecognized basic root of the cause of statement problem.

Filling the missing values manually - This is one of the best-chosen methods. But there is one limitation that when there is large dataset and missing values are large then, this method is not efficient as it becomes a time-consuming task.

Filling using computed values - The missing values can also be filled by computing mean, mode or median of the observed given values. Another method could be the predictive values that are computed by using any Machine Learning or Deep Learning algorithm. But one drawback of this method is that it can generate bias within the data as the computed values are not accurate with respect to the observed values.

Let's move further and discuss how we can deal with the noisy data. The methods that can be followed are given below:

Binning method - In this method sorting of data is performed with respect to the values of the neighborhood. This method is also known as local smoothing.

Clustering method - In the approach, the outliers may be detected by grouping the similar data in the same group i.e. in the same cluster.

Machine Learning - A Machine Learning algorithm can be executed for smoothing of data. For example, regression algorithm can be used for smoothing of data using a specified linear function.

Removing manually - The noisy data can be removed manually by the human being but it is a time-consuming process so mostly this method is not given priority.

To deal with the inconsistent data manually the data is managed using external references and knowledge engineering tools like knowledge engineering process.

How is Data Wrangling performed?

Data Wrangling is performed to minimize the effect of Data Leakage while executing the model. In other words if one consider the complete dataset for normalization and standardization, then the cross-validation is performed for the estimation of the performance of the model leads to the beginning of data leakage.

Another problem is also observed that the test model is also included for feature selection while executing each fold of cross-validation which further generates bias during performance analysis.

The effect of Data Leakage could be minimized by recalculating for the required Data Preparation during the cross-validation process that includes feature selection, outliers detection, and removal, projection methods, scaling of selected features and much more.

Another solution is that dividing the complete dataset into training dataset that is used to train the model and validation dataset which is used to evaluate the performance and accuracy of the applied model.

But, the selection of the model is made by looking at the results of test dataset in the cross validation process. This conclusion will not always be true as the sample of test dataset could vary and the performance of different models are evaluated for the particular type of test dataset. Therefore, while selecting best model test error is overfitting.

To solve this problem, the variance of the test error is determined by using different samples of test dataset. In this way, the best suitable model is chosen.

Difference Between Data Preprocessing and Data Wrangling

Data Preprocessing is performed before Data Wrangling. In this case, Data Preprocessing data is prepared exactly after receiving the data from the data source. In this initial transformations, data cleaning or any aggregation of data is performed. It is executed once.

For example, we have a data where one attribute have three variables and we have to convert them into three attributes and delete the special characters from them. This is the concept that is performed before applying any iterative model and will be executed once in the project.

On the other hand, Data Wrangling is performed during the iterative analysis and model building. This concept at the time of feature engineering. The conceptual view of the dataset changes as different models is applied to achieve good analytic model.

For example, we have data containing 30 attributes where two attributes are used to compute another attribute and that computed feature is used for further analysis. In this way, the data could be changed according to the requirement of the applied model.

Tasks of Data Preprocessing

Different steps are involved for Data Preprocessing. These steps are described below:

Data Cleaning - This is the first step which is implemented in Data Preprocessing. In this step, the main focus is on handling missing data, noisy data, detection, and removal of outliers, minimizing duplication and computed biases within the data.

Data Integration - This process is used when data is gathered from various data sources and data is combined together to form consistent data. This consistent data after performing data cleaning is used for analysis.

Data Transformation - This step is used to convert the raw data into a specified format according to the need of the model. The options used for transformation of data are given below:

Normalization - In this method, numerical data is converted into specified range i.e. between 0 and 1 so that scaling of data can be performed.

Aggregation - The concept can be derived from the word itself, this method is used to combine the features into one. For example combining two categories can be used to form a new category.

Generalization - In this case, lower level attributes are converted into a higher level.

Data Reduction - After the transformation and scaling of data duplication i.e. redundancy within the data is removed and organize the data in an efficient manner.

Tasks of Data Wrangling

The tasks of Data wrangling are described below -

Discovering - Firstly, data should be understood thoroughly and examine which approach will best suit. For example: if have a weather data when we examine the data it is observed that data is from one area and so main focus is on determining patterns.

Structuring - As the data is gathered from different sources, the data will be present in different shapes and sizes. Therefore, there is a need of structuring the data in proper format.

Cleaning - Cleaning or removing of data should be performed that can degrade the performance of analysis.

Enrichment - Extract new features or data from the given dataset in order to optimize the performance of the applied model.

Validating - This approach is used for improving the quality of data and consistency rules so that transformations that are applied to the data could be verified.

Publishing - After completing the steps of Data Wrangling, the steps can be documented so that similar steps can be performed for similar kind of data to save time.

Continue Reading - How Data Wrangling Improves Data Analytics?

#Deep learning #machine learning #data integration

0 notes

xenonstack-blog · 8 years ago

Text

Deploying .NET Application on Docker & Kubernetes

Overview

In this Post , We’ll share the Process how you can Develop and Deploy .NET Application using Docker and Kubernetes and Adopt DevOps in existing .NET Applications

Prerequisites

To follow this guide you need

Kubernetes - Kubernetes is an open source platform that automates container operations and Minikube is best for testing Kubernetes.

Kubectl - Kubectl is command line interface to manage Kubernetes cluster either remotely or locally. To configure kubectl in your machine follow this link.

Shared Persistent Storage - Shared Persistent Storage is permanent storage that we can attach to the Kubernetes container so that we don`t lose our data even container died. We will be using GlusterFS as a persistent data store for Kubernetes container applications.

.NET Application Source Code - Application Source Code is source code that we want to run inside a kubernetes container.

Dockerfile - Dockerfile contains a bunch of commands to build .NET application.

Container-Registry - The Container Registry is an online image store for container images.

Below mentioned options are few most popular registries.

Private Docker Hub

AWS ECR

Docker Store

Google Container Registry

Create a Dockerfile

The below-mentioned code is sample dockerfile for .NET applications. In which we are using Microsoft .NET 1.1 SDK for .NET Application.

FROM microsoft/dotnet:1.1-sdk # Setting Home Directory for application WORKDIR /app # copy csproj and restore as distinct layers COPY dotnetapp.csproj . RUN dotnet restore # copy and build everything else COPY . . RUN dotnet publish -c Release -o out EXPOSE 2223 ENTRYPOINT ["dotnet", "out/main.dll"]

Building .NET Application Image

The below-mentioned command will build your application container image.

$ docker build -t <name of your application>:<version of application>

Publishing Container Image

Now we publish our .NET application container images to any container registry like Docker Hub, AWS ECR, Google Container Registry, Private Docker Registry.

We are using Azure Container Registry for publishing Container Images.

You also need to Sign Up on Azure Cloud Platform and then create Container Registry using this link.

Now Click The Link to Pull and Push to Azure Container Registry.

Similarly, we can Push or Pull any container image to any of the below-mentioned Container Registry like Docker Hub, AWS ECR, Private Docker Registry, Google Container Registry etc.

Creating Deployment Files for Kubernetes

Deploying application on kubernetes with ease using deployment and service files either in JSON or YAML format.

Deployment File

Following Content is for “<name of application>.deployment.yml” file of Python container application.

apiVersion: extensions/v1beta1 kind: Deployment metadata: name: <name of application> namespace: <namespace of Kubernetes> spec: replicas: <number of application pods> template: metadata: labels: k8s-app: <name of application> spec: containers: - name: <name of application> image: <image name >:<version tag> imagePullPolicy: "IfNotPresent" ports: - containerPort: 2223

Service File

Following Content is for “<name of application>.service.yml” file of Python container application.

apiVersion: v1 kind: Service metadata: labels: k8s-app: <name of application> name: <name of application> namespace: <namespace of Kubernetes> spec: type: NodePort ports: - port: 2223 selector: k8s-app: <name of application>

Running .NET Application on Kubernetes

.NET Application Container can be deployed either by kubernetes Dashboard or Kubectl (Command line).

I`m explaining command line that you can use in production Kubernetes cluster.

$ kubectl create -f <name of application>.deployment.yml $ kubectl create -f <name of application>.service.yml

Now we have successfully deployed .NET Application on Kubernetes.

Verification

We can verify application deployment either by using Kubectl or Kubernetes Dashboard.

The below-mentioned command will show you running pods of your application with status running/terminated/stop/created.

$ kubectl get po --namespace=<namespace of kubernetes> | grep <application name>

Result of above command

Continue Reading About Deploying .NET Application On Docker And Kubernetes

#docker #dockerfile #kubernetes #DevOps #container #.net

0 notes

xenonstack-blog · 8 years ago

Text

Overview of Kotlin & Comparison With Java

What is Kotlin Language?

Kotlin is a new programming language from JetBrains. It first appeared in 2011 when JetBrains unveiled their project named “Kotlin”. Kotlin is an Open-Source Language.

Basically like Java, C and C++ - Kotlin is also “statically typed programming language”. Statically typed programming languages are those languages in which variables need not be defined before they are used. This means that static typing has to do with the explicit declaration or initialization of variables before they are employed.

As Earlier said that Java is an example of a statically typed language, similarly C and C++ are also statically typed languages.

Basically, Static typing does not mean that we have to declare all the variables first before we use them. Variables may be initialized anywhere in the program and we (developers) have to do so, to use those variables anywhere in the program when there is a need. Consider the following example -

/* Java Code */ static int num1, num2; //explicit declaration num1 = 20; //use the variables anywhere num2 = 30;

/* Kotlin Code*/ val a: Int val b: Int a=5 b=10

In addition to the classes and methods of object-oriented programming, Kotlin also supports procedural programming with the use of functions.

Like in Java, C and C++, the entry point to a Kotlin program is a function named “main”. Basically, it passed an array containing any command line arguments. Consider the following example -

/* Kotlin Code*/ /* Simple Hello Word Example*/ //optional package header package hello //package level function, which return Unit and takes an array of string as parameter fun main(args: Array < String > ) { val scope = “world” println(“Hello, $scope!”) //semicolons are optional, have you noticed that? :) }

Filename extensions of the Java are .java, .class, .jar but on the other hand filename extensions of the Kotlin are .kt and .kts.

Today (May 17, 2017), at the Google I/O keynote, the Android team announced that Kotlin will be the Official Language of Android. For More Info, Visit the Link.

Benefits of Kotlin Language

Kotlin compiles to JVM bytecode or JavaScript - Like Java, Bytecode is the compiled format for Kotlin programs also. Bytecode means Programming code that, once compiled, is run through a virtual machine instead of the computer’s processor. By using this approach, source code can be run on any platform once it has been compiled and run through the virtual machine. Once a kotlin program has been converted to bytecode, it can be transferred across a network and executed by JVM(Java Virtual Machine).

Kotlin programs can use all existing Java Frameworks and Libraries - Yes, it's true that Kotlin programs can use all existing java frameworks and libraries, even advanced frameworks that rely on annotation processing. The main important thing about kotlin language is that it can easily integrate with Maven, Gradle and other build systems.

Kotlin can be learned easily and it is approachable. It can be learned easily by simply reading the language reference.The syntax is clean and intuitive(easy to use and understand). Kotlin looks a lot like Scala but is simpler.

Kotlin is Open Source and it costs nothing to adopt.

Automatic conversion of Java to Kotlin - JetBrains integrated a new feature into IntelliJ which converts Java to Kotlin and saves a huge amount of time. And it also saves us to retype mundane code.

Kotlin’s null-safety is great - Now get rid of NullPointerExceptions. This type of system helps us to avoid null pointer exceptions. In Kotlin the system simply refuses to compile code that tries to assign or return null. Consider the following example -

val name: String = null // tries to assign null, won’t compile. fun getName() : String = null // tries to return null, won’t compile.

Code reviews are not a problem - Kotlin is much more focuses on readable syntax so code reviews are not a problem, they can still be done by those team members who are not familiar with the language.

Features of Kotlin Language

The billion dollar mistake made right. As already mentioned above that Kotlin avoids the null pointer exception. If we try to assign or return null to a variable or function respectively, then it won’t compile.

But in some special cases if we need nullability in our program then we have to ask Kotlin very nicely. Every Nullable type requires some special care and treatment. We can’t treat them the same way as non-nullable types and this is a very good thing.

We have to add “?” after the variable type. Consider the following example - Kotlin also fails at compile-time whenever a NullPointerException may be thrown at run-time. Consider the following example -

val name: String? = null //assigned null and it will compile also. fun getName() : String? = null //returned null and it will compile too.

/* won’t compile */ val name: String? = null val len = name.length /* correct way */ val name: String? = null val len = name?.length

Versatile

Lean Syntax and Concise - One liner functions take one line, simple structs/JavaBeans can also be declared in one line. Real properties generate getters and setters behind the scenes for Java interop. And Adding the data annotation to a class triggers autogeneration of boilerplate like equals, hashCode, toString and much more.

Consider the following example -

/* Java program */ public class Address { private String street; private int streetNumber; private String postCode; private String city; private Country country; public Address(String street, int streetNumber, String postCode, String city, Country country) { this.street = street; this.streetNumber = streetNumber; this.postCode = postCode; this.city = city; this.country = country; } @Override public boolean equals(Object o) { if (this == o) return true; if (o == null || getClass() != o.getClass()) return false; Address address = (Address) o; if (streetNumber != address.streetNumber) return false; if (!street.equals(address.street)) return false; if (!postCode.equals(address.postCode)) return false; if (!city.equals(address.city)) return false; return country == address.country; } @Override public int hashCode() { int result = street.hashCode(); result = 31 * result + streetNumber; result = 31 * result + postCode.hashCode(); result = 31 * result + city.hashCode(); result = 31 * result + (country != null ? country.hashCode() : 0); return result; } @Override public String toString() { return "Address{" + "street='" + street + '\'' + ", streetNumber=" + streetNumber + ", postCode='" + postCode + '\'' + ", city='" + city + '\'' + ", country=" + country + '}'; } public String getStreet() { return street; } public void setStreet(String street) { this.street = street; } public int getStreetNumber() { return streetNumber; } public void setStreetNumber(int streetNumber) { this.streetNumber = streetNumber; } public String getPostCode() { return postCode; } public void setPostCode(String postCode) { this.postCode = postCode; } public String getCity() { return city; } public void setCity(String city) { this.city = city; } public Country getCountry() { return country; } public void setCountry(Country country) { this.country = country; } }

/* Kotlin Program */ data class Address(var street:String, var streetNumber:Int, var postCode:String, var city:String, var country:Country)

You May Also Love To Read Deploying Kotlin Application on Docker & Kubernetes

Difference Between Kotlin And Java

Null Safety - As already mentioned in above section that Kotlin avoids NullPointerException. Kotlin fails at compile-time whenever a NullPointerException may be thrown.

Data Classes - In Kotlin there are Data Classes which leads to autogeneration of boilerplate like equals, hashCode, toString, getters/setters and much more.

Continue Reading About Kotlin V/s Java At: XenonStack.com/Blog

#kotlin #java #functional programming #enterprise application development

0 notes

xenonstack-blog · 8 years ago

Text

Top 10 Things To Know in DevOps

Introduction To DevOps

DevOps is a Modern software engineering Culture and Practices to develop a software where the development and operation teams work hand in hand as one unit, unlike the traditional ways i.e. Agile Methodology where they worked individually to develop a software or provide required services.

The traditional methods before DevOps were time-consuming and lacked understanding between different departments of software development, which lead to more time for updates and to fix bugs, therefore ultimately leading to customer dissatisfaction. Even to make a small change, the developer has to change the software from the beginning.

That’s why we are adopting such a culture, that allows fast, efficient, reliable software delivery through production.

DevOps Features

Maximize speed of delivery of the product.

Enhanced customer experience.

Increased time to value.

Enables fast flow of planned work into production.

Use Automated tools at each level.

More stable operating environments.

Improved communication and collaboration.

More time to innovate.

DevOps Consists of 5 C’s

DevOps practices lead to high productivity, lesser bugs, improved communication, enhanced quality, faster resolution of problems, more reliability, better and timely delivery of software.

Continuous Integration

Continuous Testing

Continuous Delivery

Continuous Deployment

Continuous Monitoring

1. Continuous Integration

Continuous integration means isolated changes are tested and reported when they are added to a larger code base. The goal of continuous integration is to give rapid feedback so that any defect can be identified and corrected as soon as possible.

Jenkins is used for continuous integration which follows 3 step rule i.e. build, test and deploy. Here developer does frequent changes to the source code in shared repository several times a day.

Along with Jenkins, we have more tools too i.e. BuildBot, Travis etc. Jenkins widely used because it provides plugins for testing, reporting, notification, deployment etc.

2. Continuous Testing

Continuous Testing is done to obtain immediate feedback on the business risk associated with Software Release. It's basically difficult and essential part of the software. Software rating depends upon Testing. Test function helps the developer to balance the quality and speed. Automated tools are used for testing as it is easier to do testing continuously instead of testing a whole software. Tool used for testing the software is Selenium

3. Continuous Delivery

Continuous Delivery is the ability to do changes like including new features, configuration management, fixes bugs and experiments into production. Our motive for doing continuous delivery is the continuous daily improvement. If there is any kind of error in the production code, we can quickly fix it that time. So, here we are developing and deploying our application rapidly, reliably and repeatedly with minimum overhead.

4. Continuous Deployment

The code is automatically deployed to the production environment as it passes through all the test cases. Continuous versioning ensures that multiple versions of the code are available at proper places. Here every changed code is put into production that automatically resulting in many deployments in production environment every day.

5. Continuous Monitoring

Continuous Monitoring is a reporting tool because of which developers and testers understand the performance and availability of their application, even before it is deployed to operations. Feedback provided by continuous monitoring is essential for lowering cost of errors and change. Nagios tool is used for continuous monitoring.

Learn How XenonStack DevOps Solutions can help you Enable Continuous Delivery Pipeline Across Cloud Platforms for Increased Efficiency and Reduced Cost Or Talk With Our Experts

Key Technologies and Terminologies In DevOps

6. Microservices

Microservices is an architectural style of developing a complex application by dividing it into smaller modules/microservices. These microservices are loosely coupled, deployed independently and are focused properly by small teams.

With Microservices developers can decide how to use, design, language to choose, platform to run, deploy, scale etc.

Advantages Of Microservices

Microservices can be developed in variable programming languages.

Errors in any module or microservices can easily be found out, thus saves time.

Smaller modules or microservices are easier to manage.

Whenever any update required, it can be immediately pushed on that particular microservices, otherwise, the whole application needs to be updated.

According to client need, we can scale up and down particular microservice without affecting the other microservices.

It also leads to increase in productivity.

If any one module goes down, the application remains largely unaffected.

Disadvantages Of Microservices

If any application involves the number of microservices, then managing them becomes a little bit difficult.

Microservices leads to more memory consumption.

In some cases, testing microservices becomes difficult.

In production, it also leads to complexity of deploying and managing a system comprised of different types of services.

7. Containers

Containers create a virtualization environment that allows us to run multiple applications or operating system without interrupting each other.

With the container, we can quickly, reliably and consistently deploy our application because containers have their own CPU, memory, network resources and block I/O that shares with the kernel of host operating system.

Containers are lightweight because they don’t need the extra load of a hypervisor, they can be directly run within host machine.

Before we were facing a problem that code can easily run on developer environment but while executing it in the production environment, dependency issue occurs.

Then virtual machines came, but they were heavyweight that leads to wastage of RAM, the processor is also not utilized completely. If we need more than 50 microservices to run then, VM is not the best option.

Docker is light weighted Container that has inbuilt images and occupies very less space comparatively. But for running a docker we need a Linux or Ubuntu as a host machine.

Terms used in docker that are:-

Docker Hub - It's cloud hosted service provided by Docker. Here we can upload our own image or also can pull the images in public repository.

Docker Registry - Storage component for docker images Either we can store in public repository or in private repository. We are using this to integrate image storage with our in-house development workflow and also to control where images are to be stored.

Docker images - Read only template that is used to create the container. Built by docker user and stored on docker hub or local registry.

Docker Containers - It's runtime instance of Docker image. It's built from 1 or more images.

Hence Docker helps in achieving application issues, Application Isolation, and faster development.

Advantages Of Containers

Wastage of resources like RAM, Processor, Disc space are controlled as now there is no need to pre-locate these resources and are met according to application requirements.

It’s easy to share a container.

Docker provides a platform to manage the lifecycle of containers.

Containers provide consistent computation environment.

Containers can run separate applications within a single shared operating system.

8. Container Orchestration

Container Orchestration is Automated, Arrangement, Coordination, and Management of containers and the resources they consume during deployment of a multi-container packed application.

Various features of Container Orchestration includes

Cluster Management - Developer’s task is limited to launch a cluster of container instances and specify the tasks which are needed to run. Management of all containers is done by Orchestration.

Task Definitions - It allows the developer to define task where they have to specify the number of containers required for the task and their dependencies. Many tasks can be launched through single task definition.

Programmatic Control - With simple API calls one can register and deregister tasks, and launch and stop Docker containers.

Scheduling - Container scheduling deals with placing the containers from the cluster according to the resources they need and the availability of requirements.

Load Balancing - Helps in distributing traffic across the containers/deployment.

Monitoring - One can monitor CPU and memory utilization of running tasks and also gets alerted if scaling is needed by containers.

Tools used for Container Orchestration

For Container orchestration different tools are used, few are open source tools like Kubernetes, and Docker Swarn which can be used privately, also some paid tools are there like AWS ECS from Amazon, Google Containers, and Microsoft Containers.

Some of these tools are briefly explained below:

Amazon ECS - Amazon ECS is yet another product from Amazon Web Services that provides the runtime environment for Docker Containers and provide orchestration. It allows running Dockerized applications on top of Amazon’s Infrastructure.

Azure Containers Service - Azure Container Service product is by Microsoft allowing similar functionalities. It has very good support for .NET ecosystem.

Docker Swarm - It’s an open source tool, part of Docker’s landscape. With this tool, we can run multiple docker engines as a single virtual Docker. This is Dockers own containers orchestration Tool. It consists of the manager and worker nodes that run different services for orchestration. Managers that distributes tasks across the cluster and worker node run containers assigned by managers.

Google Container Engine - Google Container Engine allow us to run Docker containers on Google Cloud Platform. It schedules the containers into the cluster and manages them as per the requirements were given. It is built on the top of Kubernetes i.e. an open source Containers Orchestration tool.

Kubernetes - Kubernetes is one of the most mature orchestration systems for Docker containers. It's an open source system used for automating the deployment and management of containerised application also according to user's need it scales the application.

Continue Reading About Latest DevOps Trends At : XenonStack.com/Blog

#Docker #DevOps #kubernetes #container #microservices #orchestration

0 notes

xenonstack-blog · 8 years ago

Text

Understanding Log Analytics, Log Mining & Anomaly Detection

What is Log Analytics

With technologies such as Machine Learning and Deep Neural Networks (DNN), these technologies employ next generation server infrastructure that spans immense Windows and Linux cluster environments.

Additionally, for DNNs, these application stacks don’t only involve traditional system resources (CPUs, Memory), but also graphic processing units (GPUs).

With a non-traditional infrastructure environment, the Microsoft Research Operations team needed a highly flexible, scalable, and Windows and Linux compatible service to troubleshoot and determine root causes across the full stack.

Log Analytics supports log search through billions of records, Real-Time Analytics Stack metric collection, and rich custom visualizations across numerous sources. These out of the box features paired with the flexibility of available data sources made Log Analytics a great option to produce visibility & insights by correlating across DNN clusters & components.

The relevance of log file can differ from one person to another. It may be possible that the specific log data can be beneficial for one user but irrelevant for the another user. Therefore, the useful log data can be lost inside the large cluster. Therefore, the analysis of the log file is an important aspect these days.

With the management of real-time data, the user can use the log file for making decisions.

But, as the volume of data increases let's say to gigabytes then, it becomes impossible for the traditional methods to analyze such a huge log file and determine the valid data. By ignoring the log data a huge gap of relevant information will be created.

So, the solution for this problem is to use Deep Learning Neural Network as a training classifier for the log data. With this, it’s not required to read the whole log file data by the human being. By combining the useful log data with the Deep Learning it becomes possible to gain the relevant optimum performance and comprehensive operational visibility.

Along with the analysis of log data, there is also need to classify the log file into relevant and irrelevant data. With this approach, time and performance effort could be saved and close to accurate results could be obtained.

Understanding Log Data

Before discussing the analysis of log file first we should understand about the log file. The log is a data that produces automatically by the system and stores the information about the events that are taking place inside the operating system. It stores the data at every period of time.

The log data can be presented in the form of pivot table or file. In log file or table, the records are arranged according to the time. Every software applications and systems produce log files. Some of the examples of log files are transaction log file, event log file, audit log file, server logs, etc.

Logs are usually application specific, therefore, log analysis is a much-needed task to extract the valuable information from the log file.

Log Analysis Process

The steps for the processing of Log Analysis are described below:

Collection and Cleaning of data

Structuring of Data

Analysis of Data

Collection and Cleaning of data

Firstly, Log data is collected from various sources. The collected information should be precise and informative as the type of collected data can affect the performance. Therefore, information should be collected from real users. Each type of Log contains distinguish the type of information.

After the collection of data, the data is represented in the form of Relational Database Management System (RDMS). Each record is assigned a unique primary key and Entity-Relationship model is developed to interpret the conceptual schema of the data.

Once the log data is arranged in proper manner then, the process of cleaning of data has to be performed. This is because there can be the possibility of the presence of corrupted log data.

The reasons of corruption of log data are given below:

Crashing of disk where log data is stored

Applications are terminated abnormally

Disturbance in the configuration of input/output

Presence of virus in the system and much more

Structuring of Data

Log data is large as well as complex. Therefore, the presentation of log data directly affects their ability to correlate with the other data.

An important aspect is that the log data has the ability to directly correlate with the other log data so that deep understanding of the log data can be interpreted by the team members.

The steps implemented for the structuring of log data are given below:

Clarity about the usage of collected log data

Same assets involve across the data so that values of log data are consistent. This means that naming conventions can be used

Correlation between the objects is created automatically due to the presence of nested files in the log data. It’s better to avoid nested files from the log data.

Analysis of Data: Now, the next step is to analyze the structured form of log data. This can be performed by various methods such as Pattern Recognition, Normalization, Classification using Machine Learning, Correlation Analysis and much more.

Importance of Log Analysis

Indexing and crawling are two important aspects. If the content does not include indexing and crawling, then update of data will not occur properly within time and the chance of duplicates values will be increased.

But, with the use of log analytics, it will be possible to examine the issues of crawling and indexing of data. This can be performed by examining the time taken by Google to crawl the data and at what location Google is spending large time.

In the case of large websites, it becomes difficult for the team to maintain the record of changes that are made on the website. With the use of log analysis, updated changes can be maintained in the regular period of time thus helps to determine the quality of the website.

In Business point of view, frequent crawling of the website by the Google is an important aspect as it point towards the value of the product or services. Log analytics make it possible to examine how often Google views the page site.

The changes that are made in the page site should be updated quickly at that time in order to maintain the freshness of the content. This can also be determined by the log analysis.

Acquiring the real informative data automatically and measuring the level of security within the system.

Knowledge Discovery and Data Mining

In today's generation, the volume of data is increasing day by day. Because of these circumstances, there is a great need to extract useful information from large data that are further use for making decisions. Knowledge discovery and Data Mining are used to solve this problem.

Knowledge discovery and Data mining ate two distinct terms. Knowledge Discovery is a kind of process used for extracting the useful information from the database and Data Mining is one of the steps involved in this process. Data Mining is the algorithm used for extracting the patterns from the data.

Knowledge Discovery involves various steps such as Data Cleaning, Data Integration, Data Selection, Data Transformation, Data Mining, Pattern Evaluation, Knowledge Presentation. Knowledge Discovery is a process that has total focus on deriving the useful information from the database, interpretation of storage mechanism of data, implementation of optimum algorithms and visualization of results. This process gives more importance on finding the understandable patterns of data that further used for grasping useful information.

Data Mining involves the extraction of patterns and fitting of the model. The concept behind the fitting of the model is to ensure what type of information is inferred from the processing of model. It works on three aspects such as model representation, model estimation, and search. Some of the common Data Mining techniques are Classification, Regression, and Clustering.

Log Mining

After performing analysis of logs, now next step is to perform log mining. Log Mining is a technique that uses Data Mining for the analysis of logs.

With the introduction of Data Mining technique for log analysis the quality of analysis of log data increases. In this way analytics approach moves towards software and automated analytic systems.

But, there are few challenges to perform log analysis using data mining. These are:

Day by day volume of log data is increasing from megabytes to gigabytes or even petabytes. Therefore, there is a need of advanced tools for log analysis.

The essential information is missing from the log data. So, more efforts are needed to extract useful data.

The different number of logs are analyzed from different sources to move deep into the knowledge. So, logs in different formats have to be analyzed.

The presence of different logs creates the problem of redundancy of data without any identification. This leads to the problem of synchronization between the sources of log data.

As shown in fig the process of log mining consist of three phases. Firstly, the log data is collected from various sources like Syslog, Message log, etc. After collecting the log data, it is aggregated together using Log Collector. After aggregation second phase is started.

In this, data cleaning is performed by removing the irrelevant data or corrupted data that can affect the accuracy of the process. After cleaning, log data is represented in the structured form of data (Integrated form) so that queries could be executed on them.

After that, the transformation process is performed to convert into the required format for performing normalization and pattern analysis. Useful patterns are obtained by performing Pattern Analysis. Various data mining techniques are used such as Association rules, Clustering etc to grasp the useful information from the patterns. This information is used for decision-making and for alerting the unusual behavior of the pattern by the organization.

Define Anomaly

An anomaly is defined as the unusual behavior or pattern of the data. This unusual indicates the presence of the error in the system. It describes that the actual result is different from the obtained result, thus the applied model does not fit into the given assumptions.

The anomaly is further divided into three categories described below:

Point Anomalies

A single instance of a point is considered as an anomaly when it is farthest from the rest of the data.

Contextual Anomalies

This type of anomaly related to the abnormal behavior of the particular type of context within data. It is commonly observed in time series problems.

Collective anomalies

When the collected instance of data is help for detecting anomalies is considered as collective anomalies.

The system produces logs which contain the information about the state of the system. By analyzing the log data anomalies can be detected so that security of the system could be protected. This can be performed by using Data Mining Techniques. This is because there is a need of usage of dynamic rules along with the data mining approach.

Network Intrusion Detection using Data Mining

In today's generation, the use of computers has been increased. Due to this, the probability of cyber crime has also increased. Therefore, a system is developed known as Network Intrusion Detection which enables the security in the computer system.

Continue Reading The Full Article At - XenonStack.com/Blog

#machine learning #deep learning #anomaly #data science #fraud

1 note · View note

xenonstack-blog · 8 years ago

Text

Enabling Real-Time Analytics For IoT

What is Fast Data?

A few years ago, we remember the time when it was just impossible to analyze petabytes of data. Then emergence of Hadoop made it possible to run analytical queries on our huge amount of historical data.

As we know Big Data is a buzz from last few years, but Modern Data Pipelines are constantly receiving data at a high ingestion rate. So this constant flow of data at high velocity is termed as Fast Data.

So Fast data is not about just volume of data like Data Warehouses in which data is measured in GigaBytes, TeraBytes or PetaBytes. Instead, we measure volume but with respect to its incoming rate like MB per second, GB per hour, TB per day. So Volume and Velocity both are considered while talking about Fast Data.

What is Streaming and Real-Time Data

Nowadays, there are a lot of Data Processing platforms available to process data from our ingestion platforms. Some support streaming of data and other supports true streaming of data which is also called Real-Time data.

Streaming means when we are able to process the data at the instant as it arrives and then processing and analyzing it at ingestion time. But in streaming, we can consider some amount of delay in streaming data from ingestion layer.

But Real-time data needs to have tight deadlines in the terms of time. So we normally consider that if our platform is able to capture any event within 1 ms, then we call it as real-time data or true streaming.

But When we talk about taking business decisions, detecting frauds and analyzing real-time logs and predicting errors in real-time, all these scenarios comes to streaming. So Data received instantly as it arrives is termed as Real-time data.

Stream & Real Time Processing Frameworks

So in the market, there are a lot of open sources technologies available like Apache Kafka in which we can ingest data at millions of messages per sec. Also Analyzing Constant Streams of data is also made possible by Apache Spark Streaming, Apache Flink, Apache Storm.

Apache Spark Streaming is the tool in which we specify the time-based window to stream data from our message queue. So it does not process every message individually. We can call it as the processing of real streams in micro batches.

Whereas Apache Storm and Flink have the ability to stream data in real-time.

Why Real-Time Streaming

As we know that Hadoop, S3 and other distributed file systems are supporting data processing in huge volumes and also we are able to query them using their different frameworks like Hive which uses MapReduce as their execution engine.

Why we Need Real - Time Streaming?

A lot of organizations are trying to collect as much data as they can regarding their products, services or even their organizational activities like tracking employees activities through various methods used like log tracking, taking screenshots at regular intervals.

So Data Engineering allows us to convert this data into structural formats and Data Analysts then turn this data into useful results which can help the organization to improve their customer experiences and also boost their employee's productivity.

But when we talk about log analytics, fraud detection or real-time analytics, this is not the way we want our data to be processed.The actual value data is in processing or acting upon it at the instant it receives.

Imagine we have a data warehouse like hive having petabytes of data in it. But it allows us to just analyze our historical data and predict future.

So processing of huge volumes of data is not enough. We need to process them in real-time so that any organization can take business decisions immediately whenever any important event occurs. This is required in Intelligence and surveillance systems, fraud detection etc.

Earlier handling of these constant streams of data at high ingestion rate is managed by firstly storing the data and then running analytics on it.

But organizations are looking for the platforms where they can look into business insights in real-time and act upon them in real-time.

Alerting platforms are also built on the top of these real-time streams. But Effectiveness of these platform lies in the fact that how truly we are processing the data in real-time.

Use Of Reactive Programming & Functional Programming

Now when we are thinking of building our alerting platforms, anomaly detection engines etc on the top of our real-time data, it is very important to consider the style of programming you are following.

Nowadays, Reactive Programming and Functional Programming are at their boom.

So, we can consider Reactive Programming as subscriber and publisher pattern. Often, we see the column on almost every website where we can subscribe to their newsletter and whenever the newsletter is posted by the publisher, whosoever have got subscription will get the newsletter via email or some other way.

So the difference between Reactive and Traditional Programming is that the data is available to the subscriber as soon as it receives. And it is made possible by using Reactive Programming model. In Reactive Programming, whenever any events occur, there are certain components (classes) that had registered to that event. So instead of invoking target components by event generator, all targets automatically get triggered whenever any event occurs.

Now when we are processing data at high rate, concurrency is the point of concern. So the performance of our analytics job highly depends upon memory allocation/deallocation. So in Functional Programming, we don’t need to initialize loops/iterators on our own.

We will be using Functional Programming styles to iterate over the data in which CPU itself takes care of allocation and deallocation of data and also makes the best use of memory which results in better concurrency or parallelism.

Streaming Architecture Matters

While Streaming and Analyzing the real-time data, there are chances that some messages can be missed or in short, the problem is how we can handle data errors.

So, there are two types of architectures which are used while building real-time pipelines.

Lambda Architecture:

This architecture was introduced by Nathan Marz in which we have three layers to provide real-time streaming and compensate any data error occurs if any. The three layers are Batch Layer, Speed layer, and Serving Layer.

Continue Reading The Full Article At - XenonStack.com/Blog

#big data #Big data analytics #hadoop #real time analytics #streaming analytics #reactive programming #functional programming

0 notes

xenonstack-blog · 8 years ago

Text

Deploying PostgreSQL on Kubernetes

What is PostgreSQL?

PostgreSQL is a powerful, open source Relational Database Management System. PostgreSQL is not controlled by any organization or any individual. Its source code is available free of charge. It is pronounced as "post-gress-Q-L".

PostgreSQL has earned a strong reputation for its reliability, data integrity, and correctness.

It runs on all major operating systems, including Linux, UNIX (AIX, BSD, HP-UX, SGI IRIX, MacOS, Solaris, Tru64), and Windows.

It is fully ACID compliant, has full support for foreign keys, joins, views, triggers, and stored procedures (in multiple languages).

It includes most SQL:2008 data types, including INTEGER, NUMERIC, BOOLEAN, CHAR, VARCHAR, DATE, INTERVAL, and TIMESTAMP.

It also supports storage of binary large objects, including pictures, sounds, or video.

It has native programming interfaces for C/C++, Java, .Net, Perl, Python, Ruby, Tcl, ODBC, among others, and exceptional documentation.

Prerequisites

To follow this guide you need -

Kubernetes Cluster

GlusterFS Cluster

Step 1 - Create a PostgreSQL Container Image

Create a file name “Dockerfile” for PostgreSQL. This image contains our custom config dockerfile which will look like -

FROM ubuntu:latest MAINTAINER XenonStack RUN apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys B97B0AFCAA1A47F044F244A07FCC7D46ACCC4CF8 RUN echo "deb http://apt.postgresql.org/pub/repos/apt/ xenial-pgdg main" > /etc/apt/sources.list.d/pgdg.list RUN apt-get update && apt-get install -y python-software-properties software-properties-common postgresql-9.6 postgresql-client-9.6 postgresql-contrib-9.6 RUN /etc/init.d/postgresql start &&\ psql --command "CREATE USER root WITH SUPERUSER PASSWORD 'xenonstack';" &&\ createdb -O root xenonstack RUN echo "host all all 0.0.0.0/0 md5" >> /etc/postgresql/9.6/main/pg_hba.conf RUN echo "listen_addresses='*'" >> /etc/postgresql/9.6/main/postgresql.conf # Expose the PostgreSQL port EXPOSE 5432 # Add VOLUMEs to allow backup of databases VOLUME ["/var/lib/postgresql"] # Set the default command to run when starting the container CMD ["/usr/lib/postgresql/9.6/bin/postgres", "-D", "/var/lib/postgresql", "-c", "config_file=/etc/postgresql/9.6/main/postgresql.conf"]

This Postgres image has a base image of ubuntu xenial. After that, we create Super User and default databases. Exposing 5432 port will help external system to connect the PostgreSQL server.

Step 2 - Build PostgreSQL Docker Image

$ docker build -t dr.xenonstack.com:5050/postgres:v9.6

Step 3 - Create a Storage Volume (Using GlusterFS)

Using below-mentioned command create a volume in GlusterFS for PostgreSQL and start it. As we don’t want to lose our PostgreSQL Database data just because a Gluster server dies in the cluster, so we put replica 2 or more for higher availability of data.

$ gluster volume create postgres-disk replica 2 transport tcp k8-master:/mnt/brick1/postgres-disk k8-1:/mnt/brick1/postgres-disk $ gluster volume start postgres-disk $ gluster volume info postgres-disk

Step 4 - Deploy PostgreSQL on Kubernetes

Deploying PostgreSQL on Kubernetes have following prerequisites -

Docker Image: We have created a Docker Image for Postgres in Step 2

Persistent Shared Storage Volume: We have created a Persistent Shared Storage Volume in Step 3

Deployment & Service Files: Next, we will create Deployment & Service Files

Create a file name “deployment.yml” for PostgreSQL. This deployment file will look like -

apiVersion: extensions/v1beta1 kind: Deployment metadata: name: postgres namespace: production spec: replicas: 1 template: metadata: labels: k8s-app: postgres spec: containers: - name: postgres image: dr.xenonstack.com:5050/postgres:v9.6 imagePullPolicy: "IfNotPresent" ports: - containerPort: 5432 env: - name: POSTGRES_USER value: postgres - name: POSTGRES_PASSWORD value: superpostgres - name: PGDATA value: /var/lib/postgresql/data/pgdata volumeMounts: - mountPath: /var/lib/postgresql/data name: postgredb volumes: - name: postgredb glusterfs: endpoints: glusterfs-cluster path: postgres-disk readOnly: false

Continue Reading The Full Article At - XenonStack.com/Blog

#DevOps #Docker #kubernetes #postgresql #dockerfile

1 note · View note

xenonstack-blog · 8 years ago

Text

Arising Need Of Modern Big Data Integration Platform

Data is everywhere and we are generating data from different Sources like Social Media, Sensors, API’s, Databases.

Healthcare, Insurance, Finance, Banking, Energy, Telecom, Manufacturing, Retail, IoT, M2M are the leading domains/areas for Data Generation. The Government is using BigData to improve their efficiency and distribution of the services to the people.

The Biggest Challenge for the Enterprises is to create the Business Value from the data coming from the existing system and from new sources. Enterprises are looking for a Modern Data Integration platform for Aggregation, Migration, Broadcast, Correlation, Data Management, and Security.

Traditional ETL is having a paradigm shift for Business Agility and need of Modern Data Integration Platform is arising. Enterprises need Modern Data Integration for agility and for an end to end operations and decision-making which involves Data Integration from different sources, Processing Batch Streaming Real Time with BigData Management, BigData Governance, and Security.

BigData Type Includes:

What type of data it is

Format of content of data required

Whether data is transactional data, historical data or master data

The Speed or Frequency at which data made to be available

How to process the data i.e. whether in real time or in batch mode

5 V’s to Define BigData

Additional 5V’s to Define BigData

Data Ingestion and Data Transformation

Data Ingestion comprises of integrating Structured/unstructured data from where it is originated into a system, where it can be stored and analyzed for making business decisions. Data Ingestion may be continuous or asynchronous, real-time or batched or both.

Defining the BigData Characteristics: Using Different BigData types, helps us to define the BigData Characteristics i.e how the BigData is Collected, Processed, Analyzed and how we deploy that data On-Premises or Public or Hybrid Cloud.

Data type: Type of data

Transactional

Historical

Master Data and others

Data Content Format: Format of Data

Structured (RDBMS)

Unstructured (audio, video, and images)

Semi-Structured

Data Sizes: Data size like Small, Medium, Large and Extra Large which means we can receive data having sizes in Bytes, KBs, MBs or even in GBs.

Data Throughput and Latency: How much data is expected and at what frequency does it arrive. Data throughput and latency depend on data sources:

On demand, as with Social Media Data

Continuous feed, Real-Time (Weather Data, Transactional Data)

Time series (Time-Based Data)

Processing Methodology: The type of technique to be applied for processing data (e.g. Predictive Analytics, Ad-Hoc Query and Reporting).

Data Sources: Data generated Sources

The Web and Social Media

Machine-Generated

Human-Generated etc

Data Consumers: A list of all possible consumers of the processed data:

Business processes

Business users

Enterprise applications

Individual people in various business roles

Part of the process flows

Other data repositories or enterprise applications

Major Industries Impacted with BigData

What is Data Integration?

Data Integration is the process of Data Ingestion - integrating data from different sources i.e. RDBMS, Social Media, Sensors, M2M etc, then using Data Mapping, Schema Definition, Data transformation to build a Data platform for analytics and further Reporting. You need to deliver the right data in the right format at the right timeframe.

BigData integration provides a unified view of data for Business Agility and Decision Making and it involves:

Discovering the Data

Profiling the Data

Understanding the Data

Improving the Data

Transforming the Data

A Data Integration project usually involves the following steps:

Ingest Data from different sources where data resides in multiple formats.

Transform Data means converting data into a single format so that one can easily be able to manage his problem with that unified data records. Data Pipeline is the main component used for Integration or Transformation.

MetaData Management: Centralized Data Collection.

Store Transform Data so that analyst can exactly get when the business needs it, whether it is in batch or real time.

Why Data Integration is required

Make Data Records Centralized: As data is stored in different formats like in Tabular, Graphical, Hierarchical, Structured, Unstructured form. For making the business decision, a user has to go through all these formats before reaching a conclusion. That’s why a single image is the combination of different format helpful in better decision making.

Format Selecting Freedom: Every user has different way or style to solve a problem. User are flexible to use data in whatever system and in whatever format they feel better.

Reduce Data Complexity: When data resides in different formats, so by increasing data size, complexity also increases that degrade decision making capability and one will consume much more time in understanding how one should proceed with data.

Prioritize the Data: When one have a single image of all the data records, then prioritizing the data what's very much useful and what's not required for business can easily find out.

Better Understanding of Information: A single image of data helps non-technical user also to understand how effectively one can utilize data records. While solving any problem one can win the game only if a non-technical person is able to understand what he is saying.

Keeping Information Up to Date: As data keeps on increasing on daily basis. So many new things come that become necessary to add on with existing data, so Data Integration makes easy to keep the information up to date.

Continue Reading The Full Article At - XenonStack.com/Blog

#big data #big data analytics #data integration #data #iot

0 notes