mahiworld-blog1
SpikeGoals
14 posts
mahiworld-blog1 · 6 years ago
Important libraries for data science and Machine learning.
Python has more than 137,000 libraries that help in various ways. In the data age, where data is often compared to oil or electricity, companies will increasingly need skilled data scientists, machine learning engineers, and deep learning engineers to extract insights from massive data sets.
Python libraries for different data science tasks (a short end-to-end sketch follows the lists below):
Python Libraries for Data Collection
Beautiful Soup
Scrapy
Selenium
Python Libraries for Data Cleaning and Manipulation
Pandas
PyOD
NumPy
Spacy
Python Libraries for Data Visualization
Matplotlib
Seaborn
Bokeh
Python Libraries for Modeling
Scikit-learn
TensorFlow
PyTorch
Python Libraries for Model Interpretability
Lime
H2O
Python Libraries for Audio Processing
Librosa
Madmom
pyAudioAnalysis
Python Libraries for Image Processing
OpenCV-Python
Scikit-image
Pillow
Python Libraries for Database
Psycopg
SQLAlchemy
Python Libraries for Deployment
Flask
Django
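To show how a few of these libraries fit together, here is a minimal end-to-end sketch using Pandas, Matplotlib, and scikit-learn. The file name data.csv and the target column are illustrative assumptions, not part of any particular dataset.

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("data.csv")           # data loading (Pandas); hypothetical file
    df = df.dropna()                        # simple cleaning step
    X = df.drop(columns=["target"])         # "target" column is an assumption
    y = df["target"]

    df.hist(figsize=(8, 6))                 # quick visualization (Matplotlib)
    plt.tight_layout()
    plt.show()

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42)   # modeling (scikit-learn)
    model.fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))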
Best Frameworks for Machine Learning:
1. TensorFlow:
If you are working in or interested in machine learning, you have probably heard of the famous open source library TensorFlow. It was developed at Google by the Brain Team. Almost all of Google’s applications use TensorFlow for machine learning; if you use Google Photos or Google voice search, you are indirectly using models built with TensorFlow.
TensorFlow is essentially a computational framework for expressing algorithms involving a large number of tensor operations. Since neural networks can be expressed as computational graphs, they can be implemented in TensorFlow as a series of operations on tensors. Tensors are N-dimensional matrices that represent our data.
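As a rough illustration of “a series of operations on tensors”, here is a minimal sketch using the TensorFlow 2.x eager API (the exact calls differ between TensorFlow versions):

    import tensorflow as tf

    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # a 2x2 tensor (N-dimensional matrix)
    b = tf.constant([[5.0, 6.0], [7.0, 8.0]])

    c = tf.matmul(a, b)        # matrix multiplication, one operation in the graph
    d = tf.reduce_sum(c)       # reduce the result to a scalar

    print(c.numpy())
    print(d.numpy())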

2. Keras:
Keras is one of the coolest machine learning libraries. If you are a beginner in machine learning, I suggest you use Keras. It provides an easier way to express neural networks, along with utilities for processing datasets, compiling models, evaluating results, visualizing graphs, and more.
Keras internally uses either TensorFlow or Theano as a backend; some other popular neural network frameworks, such as CNTK, can also be used. If you are using TensorFlow as the backend, you can refer to the TensorFlow architecture described in the TensorFlow section of this article. Keras can be slow compared with other libraries because it first constructs a computational graph using the backend infrastructure and then uses it to perform operations. Keras models are portable (HDF5 models), and Keras provides many preprocessed datasets and pretrained models such as Inception, SqueezeNet, MNIST, VGG, and ResNet.
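Here is a minimal sketch of defining and compiling a small network with the tf.keras API; the layer sizes and the assumed 10-feature input are purely illustrative (standalone Keras is nearly identical):

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Dense(32, activation="relu", input_shape=(10,)),  # 10 input features (assumed)
        layers.Dense(1, activation="sigmoid"),                   # binary output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()
    # model.fit(X_train, y_train, epochs=5)   # training would require real data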
3. Theano:
Theano is a computational framework for computing with multidimensional arrays. Theano is similar to TensorFlow, but it is not as well suited to production environments. Like TensorFlow, Theano can be used in parallel or distributed environments.
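For a flavor of Theano’s symbolic style, here is a minimal sketch using the classic API (Theano is no longer actively developed):

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.dmatrix("x")                 # symbolic multidimensional array
    y = T.dmatrix("y")
    z = x + 2 * y                      # symbolic expression
    f = theano.function([x, y], z)     # compile into a callable function

    print(f(np.ones((2, 2)), np.ones((2, 2))))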
4. Apache Spark:
Spark is an open source cluster-computing framework originally developed at UC Berkeley’s AMPLab and initially released on 26 May 2014. It is written mainly in Scala and provides APIs in Java, Python, and R. Although produced at the University of California, Berkeley, it was later donated to the Apache Software Foundation.
Spark Core is the foundation of the project. It is complex too, but instead of worrying about NumPy arrays it lets you work with its own Spark RDD data structures, whose usefulness anyone familiar with big data will understand. As a user, you can also work with Spark SQL DataFrames. With all these features, Spark creates dense and sparse feature/label vectors for you, taking away much of the complexity of feeding data to ML algorithms.
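Here is a minimal PySpark sketch of assembling feature/label vectors from a DataFrame and feeding them to an MLlib algorithm; the column names and toy rows are assumptions for illustration:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("sketch").getOrCreate()
    df = spark.createDataFrame(
        [(1.0, 2.0, 0.0), (2.0, 0.5, 1.0), (0.3, 1.5, 0.0)],
        ["f1", "f2", "label"],
    )
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    train = assembler.transform(df).select("features", "label")

    model = LogisticRegression(maxIter=10).fit(train)   # MLlib algorithm on the assembled vectors
    model.transform(train).show()
    spark.stop()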
5. Caffe:
Caffe is an open source framework under a BSD license. Caffe (Convolutional Architecture for Fast Feature Embedding) is a deep learning tool developed at UC Berkeley and written mainly in C++. It supports many different architectures for deep learning, focusing mainly on image classification and segmentation. It supports most major layer schemes, including fully connected neural network designs, and offers GPU- as well as CPU-based acceleration, much like TensorFlow.
Caffe is mainly used in academic research projects and for designing startup prototypes. Yahoo has even integrated Caffe with Apache Spark to create CaffeOnSpark, another deep learning framework.
6. PyTorch / Torch:
Torch is also an open source machine learning library and a proper scientific computing framework. Its makers describe it as the easiest ML framework, and its relative simplicity comes from its Lua scripting-language interface. Torch works with plain numbers (no int, short, or double types) that are not further categorized as in other languages, which simplifies many operations and functions. Torch is used by the Facebook AI Research group, IBM, Yandex, and the Idiap Research Institute, and its use has recently been extended to Android and iOS.
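A minimal PyTorch sketch of tensors and automatic differentiation (the shapes are chosen arbitrarily for illustration):

    import torch

    x = torch.randn(3, 2, requires_grad=True)   # random 3x2 tensor, tracked for gradients
    w = torch.randn(2, 1, requires_grad=True)

    y = (x @ w).sum()    # matrix multiply and reduce to a scalar
    y.backward()         # automatic differentiation

    print(w.grad)        # gradient of y with respect to w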
7. Scikit-learn:
Scikit-learn is a very powerful, free-to-use Python library for ML that is widely used for building models. It is built on the foundations of several other libraries, namely SciPy, NumPy, and Matplotlib, and it is one of the most efficient tools for statistical modeling techniques such as classification, regression, and clustering.
Scikit-learn comes with supervised and unsupervised learning algorithms and built-in cross-validation. It is largely written in Python, with some core algorithms written in Cython for performance; support vector machines, for example, are implemented via a Cython wrapper around LIBSVM.
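A minimal scikit-learn sketch of classification with cross-validation, using the bundled Iris dataset:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    clf = SVC(kernel="rbf", C=1.0)                 # SVM backed by LIBSVM
    scores = cross_val_score(clf, X, y, cv=5)      # 5-fold cross-validation
    print(scores.mean())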
Below is a list of frameworks for machine learning engineers:
Apache Singa is a general distributed deep learning platform for training big deep learning models over large datasets. It is designed with an intuitive programming model based on the layer abstraction. A variety of popular deep learning models are supported, namely feed-forward models including convolutional neural networks (CNN), energy models like restricted Boltzmann machine (RBM), and recurrent neural networks (RNN). Many built-in layers are provided for users.
Amazon Machine Learning  is a service that makes it easy for developers of all skill levels to use machine learning technology. Amazon Machine Learning provides visualization tools and wizards that guide you through the process of creating machine learning (ML) models without having to learn complex ML algorithms and technology.  It connects to data stored in Amazon S3, Redshift, or RDS, and can run binary classification, multiclass categorization, or regression on said data to create a model.
Azure ML Studio allows Microsoft Azure users to create and train models, then turn them into APIs that can be consumed by other services. Users get up to 10GB of storage per account for model data, although you can also connect your own Azure storage to the service for larger models. A wide range of algorithms are available, courtesy of both Microsoft and third parties. You don’t even need an account to try out the service; you can log in anonymously and use Azure ML Studio for up to eight hours.
Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Yangqing Jia created the project during his PhD at UC Berkeley. Caffe is released under the BSD 2-Clause license.  Models and optimization are defined by configuration without hard-coding & user can switch between CPU and GPU. Speed makes Caffe perfect for research experiments and industry deployment. Caffe can process over 60M images per day with a single NVIDIA K40 GPU.
H2O makes it possible for anyone to easily apply math and predictive analytics to solve today’s most challenging business problems. It intelligently combines unique features not currently found in other machine learning platforms including: Best of Breed Open Source Technology, Easy-to-use WebUI and Familiar Interfaces, Data Agnostic Support for all Common Database and File Types. With H2O, you can work with your existing languages and tools. Further, you can extend the platform seamlessly into your Hadoop environments.
Massive Online Analysis (MOA) is the most popular open source framework for data stream mining, with a very active growing community. It includes a collection of machine learning algorithms (classification, regression, clustering, outlier detection, concept drift detection and recommender systems) and tools for evaluation. Related to the WEKA project, MOA is also written in Java, while scaling to more demanding problems.
MLlib (Spark) is Apache Spark’s machine learning library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
mlpack is a C++-based machine learning library originally rolled out in 2011 and designed for “scalability, speed, and ease-of-use,” according to the library’s creators. mlpack can be used through a cache of command-line executables for quick-and-dirty, “black box” operations, or through a C++ API for more sophisticated work. mlpack provides these algorithms as simple command-line programs and C++ classes which can then be integrated into larger-scale machine learning solutions.
Pattern is a web mining module for the Python programming language. It has tools for data mining (Google, Twitter, and Wikipedia APIs, a web crawler, an HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis, and visualization.
Scikit-Learn leverages Python’s breadth by building on top of several existing Python packages — NumPy, SciPy, and matplotlib — for math and science work. The resulting libraries can be used either for interactive “workbench” applications or be embedded into other software and reused. The kit is available under a BSD license, so it’s fully open and reusable. Scikit-learn includes tools for many of the standard machine-learning tasks (such as clustering, classification, regression, etc.). And since scikit-learn is developed by a large community of developers and machine-learning experts, promising new techniques tend to be included in fairly short order.
Shogun is among the oldest and most venerable machine learning libraries. Created in 1999 and written in C++, it is not limited to working in C++: thanks to the SWIG library, Shogun can be used transparently from languages and environments such as Java, Python, C#, Ruby, R, Lua, Octave, and MATLAB. Shogun is designed for unified large-scale learning across a broad range of feature types and learning settings, such as classification, regression, and exploratory data analysis.
TensorFlow is an open source software library for numerical computation using data flow graphs. TensorFlow implements what are called data flow graphs, where batches of data (“tensors”) can be processed by a series of algorithms described by a graph. The movements of the data through the system are called “flows” — hence, the name. Graphs can be assembled with C++ or Python and can be processed on CPUs or GPUs.
Theano is a Python library that lets you define, optimize, and evaluate mathematical expressions, especially ones with multi-dimensional arrays (numpy.ndarray). Using Theano, it is possible to attain speeds rivaling hand-crafted C implementations for problems involving large amounts of data. It was written at the LISA lab to support rapid development of efficient machine learning algorithms. Theano is named after the Greek mathematician, who may have been Pythagoras’ wife. Theano is released under a BSD license.
Torch is a scientific computing framework with wide support for machine learning algorithms that puts GPUs first. It is easy to use and efficient, thanks to an easy and fast scripting language, LuaJIT, and an underlying C/CUDA implementation. The goal of Torch is to have maximum flexibility and speed in building your scientific algorithms while making the process extremely simple. Torch comes with a large ecosystem of community-driven packages in machine learning, computer vision, signal processing, parallel processing, image, video, audio and networking among others, and builds on top of the Lua community.
Veles is a distributed platform for deep-learning applications, and it’s written in C++, although it uses Python to perform automation and coordination between nodes. Datasets can be analyzed and automatically normalized before being fed to the cluster, and a REST API allows the trained model to be used in production immediately. It focuses on performance and flexibility. It has few hard-coded entities and enables training of all the widely recognized topologies, such as fully connected nets, convolutional nets, and recurrent nets.
mahiworld-blog1 · 8 years ago
Similarity parameters in RapidMiner
#Parameters

measure_types: Selects the type of measure used for calculating similarity. The following options are available: mixed measures, nominal measures, numerical measures, and Bregman divergences.

mixed_measure: Available if the measure type parameter is set to 'mixed measures'. The only available option is 'Mixed Euclidean Distance'.

nominal_measure: Available if the measure type parameter is set to 'nominal measures'. This option cannot be applied if the input ExampleSet has numerical attributes; in that case the 'numerical measure' option should be selected.

numerical_measure: Available if the measure type parameter is set to 'numerical measures'. This option cannot be applied if the input ExampleSet has nominal attributes; in that case the 'nominal measure' option should be selected.

divergence: Available if the measure type parameter is set to 'Bregman divergences'.

kernel_type: Only available if the numerical measure parameter is set to 'Kernel Euclidean Distance'. The type of the kernel function is selected through this parameter. The following kernel types are supported (a small NumPy sketch of a few of these formulas appears after this parameter list):
dot: The dot kernel is defined by k(x, y) = x * y, i.e. the inner product of x and y.
radial: The radial kernel is defined by exp(-g ||x - y||^2), where g is the gamma specified by the kernel gamma parameter. The adjustable parameter gamma plays a major role in the performance of the kernel and should be carefully tuned to the problem at hand.
polynomial: The polynomial kernel is defined by k(x, y) = (x * y + 1)^d, where d is the degree of the polynomial, specified by the kernel degree parameter. Polynomial kernels are well suited for problems where all the training data is normalized.
neural: The neural kernel is defined by a two-layered neural net tanh(a x * y + b), where a is alpha and b is the intercept constant. These parameters can be adjusted using the kernel a and kernel b parameters. A common value for alpha is 1/N, where N is the data dimension. Note that not all choices of a and b lead to a valid kernel function.
sigmoid: The sigmoid kernel. Please note that the sigmoid kernel is not valid under some parameters.
anova: The ANOVA kernel. It has the adjustable parameters gamma and degree.
epachnenikov: The Epanechnikov kernel, (3/4)(1 - u^2) for u between -1 and 1 and zero for u outside that range. It has the two adjustable parameters kernel sigma1 and kernel degree.
gaussian_combination: The Gaussian combination kernel. It has the adjustable parameters kernel sigma1, kernel sigma2, and kernel sigma3.
multiquadric: The multiquadric kernel is defined by the square root of ||x - y||^2 + c^2. It has the adjustable parameters kernel sigma1 and kernel sigma shift.

kernel_gamma: The SVM kernel parameter gamma. Only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to radial or anova.

kernel_sigma1: The SVM kernel parameter sigma1. Only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to epachnenikov, gaussian combination, or multiquadric.

kernel_sigma2: The SVM kernel parameter sigma2. Only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination.

kernel_sigma3: The SVM kernel parameter sigma3. Only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination.

kernel_shift: The SVM kernel parameter shift. Only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to multiquadric.

kernel_degree: The SVM kernel parameter degree. Only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to polynomial, anova, or epachnenikov.

kernel_a: The SVM kernel parameter a. Only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural.

kernel_b: The SVM kernel parameter b. Only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural.
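Since the post lists the kernel formulas, here is a small NumPy sketch of a few of them; the function names and default parameter values are illustrative and are not RapidMiner's internal implementation:

    import numpy as np

    def dot_kernel(x, y):
        return np.dot(x, y)                              # k(x, y) = x * y

    def radial_kernel(x, y, gamma=1.0):
        return np.exp(-gamma * np.sum((x - y) ** 2))     # exp(-g ||x - y||^2)

    def polynomial_kernel(x, y, degree=3):
        return (np.dot(x, y) + 1) ** degree              # (x * y + 1)^d

    def neural_kernel(x, y, a=1.0, b=0.0):
        return np.tanh(a * np.dot(x, y) + b)             # tanh(a x * y + b)

    def multiquadric_kernel(x, y, c=1.0):
        return np.sqrt(np.sum((x - y) ** 2) + c ** 2)    # sqrt(||x - y||^2 + c^2)

    x = np.array([1.0, 2.0])
    y = np.array([0.5, -1.0])
    print(radial_kernel(x, y, gamma=0.5), polynomial_kernel(x, y))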
mahiworld-blog1 · 8 years ago
Data science tools
1. R
2. Python
3. SQL
4. Excel
5. RapidMiner
6. Hadoop
7. Spark
8. Tableau
9. KNIME
10. scikit-learn
Deep Learning Tools
For the second year, the KDnuggets poll included Deep Learning tools. This year, 18% of voters used Deep Learning tools, doubling the 9% in 2015.
Google TensorFlow jumped to first place, displacing last year’s leader, the Theano/Pylearn2 ecosystem.
Top tools:
Tensorflow, 6.8%
Theano ecosystem (including Pylearn2), 5.1%
Caffe, 2.3%
MATLAB Deep Learning Toolbox, 2.0%
Deeplearning4j, 1.7%
Torch, 1.0%
Microsoft CNTK, 0.9%
Cuda-convnet, 0.8%
mxnet, 0.6%
Other Deep Learning Tools, 3.7%
The Deep Learning field is still at the beginning of its journey, as the large number of options shows.
Programming Languages
Python, Java, Unix tools, and Scala grew in popularity, while C/C++, Perl, Julia, F#, Clojure, and Lisp declined.
Here are the programming languages sorted by popularity.
Python, 45.8% share (was 30.3%), 51% increase
Java, 16.8% share (was 14.1%), 19% increase
Unix shell/awk/gawk 10.4% share (was 8.0%), 30% increase
C/C++, 7.3% share (was 9.4%), 23% decrease
Other programming/data languages, 6.8% share (was 5.1%), 34.1% increase
Scala, 6.2% share (was 3.5%), 79% increase
Perl, 2.3% share (was 2.9%), 19% decrease
Julia, 1.1% share (was 1.1%), 1.6% decrease
F#, 0.4% share (was 0.7%), 41.8% decrease
Clojure, 0.4% share (was 0.5%), 19.4% decrease
Lisp, 0.2% share (was 0.4%), 33.3% decrease
Tool and description:
1. Apache Hadoop: Framework for processing big data
2. Apache Mahout: Scalable machine-learning algorithms for Hadoop
3. Spark: Cluster-computing framework for data analytics
4. The R Project for Statistical Computing: Accessible data manipulation and graphing
5. Python, Ruby, Perl: Prototyping and production scripting languages
6. SciPy: Python package for scientific computing
7. scikit-learn: Python package for machine learning
8. Axiis: Interactive data visualization
mahiworld-blog1 · 8 years ago
Data visualization
Data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns. With interactive visualization, you can take the concept a step further by using technology to drill down into charts and graphs for more detail, interactively changing what data you see and how it’s processed. 
Data visualization promotes this kind of creative data exploration.
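As a small illustration of presenting data in graphical form, here is a minimal Matplotlib sketch; the numbers are made up for the example:

    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr", "May"]
    sales = [120, 135, 160, 150, 180]        # illustrative numbers only

    plt.bar(months, sales, color="steelblue")
    plt.title("Monthly sales")
    plt.xlabel("Month")
    plt.ylabel("Units sold")
    plt.show()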
mahiworld-blog1 · 8 years ago
BIG DATA ANALYST #SKILLS:
Skill #1: Programming
Learning how to code is an essential skill in the Big Data analyst’s arsenal. You need to code to conduct numerical and statistical analysis with massive data sets. Some of the languages you should invest time and money in learning are Python, R, Java, and C++ among others. The more you know, the better–just remember that you do not have to learn every single language out there.
As every IT professional can tell you, if you know one language well, you can easily pick up the rest. Hands on experience with these languages and programming will help in your learning effort. Finally, being able to think like a programmer will help you become a good big data analyst.
Skill #2: Quantitative Skills
As a big data analyst, programming helps you do what you need to do. But, what are you supposed to do?
The quantitative skills you need to be a good big data analyst answer this question. For starters, you need to know multivariable calculus and linear and matrix algebra. You will also need to know probability and statistics.
Skill #3: Multiple Technologies
Programming is an essential big data analysis skill. What makes it extra special, though, is the versatility. You can, and must, learn multiple technologies that will help you grow as a Big Data analyst.
But, technologies are not limited to programming alone. The range of technologies that a good big data analyst must be familiar with is huge. It spans myriad tools, platforms, hardware and software. For example, Microsoft Excel, SQL and R are basic tools. At the enterprise level, SPSS, Cognos, SAS, MATLAB are important to learn as are Python, Scala, Linux, Hadoop and HIVE.
Skill #4: Understanding of Business & Outcomes
Analysis of data and insights would be useless if it cannot be applied to a business setting. All big data analysts need to have a strong understanding of the business and domain they operate in.
Domain expertise enables big data analysts to communicate effectively with different stakeholders. Consider recommending that new employees be added to a factory floor. When pitching it to the CFO it could be positioned as a net increase in top line margins. It may need to be repositioned as a reduction in quality test failures to the operations head. Domain expertise makes these conversations easier and more effective.
Skill #5: Interpretation of Data
Of all the skills we have outlined, interpretation of data is the outlier. It is the one skill that combines both art and science. It requires the precision and sterility of hard science and mathematics but also calls for creativity, ingenuity, and curiosity.
In most companies, a large majority of employees don’t understand their own company’s data. In fact, most employees do not even have a clear idea of where all the data is. These employees often rely on preconfigured reports and dashboards to derive their insights. Unfortunately, this approach is dangerous. It does not provide a holistic view of the data procurement and analysis process. This problem is often compounded by the fragmentation of data systems. As companies grow inorganically, different data silos merge, resulting in a confusing mess.
mahiworld-blog1 · 8 years ago
Photo
Tumblr media
https://www.edureka.co/blog/10-reasons-why-big-data-analytics-is-the-best-career-move
mahiworld-blog1 · 9 years ago
Data #preprocessing in the proposed system
There are basically two parts.
A. Data preprocessing, which contains the following techniques:
1. Real-time data collection: Here the data from an #individual user machine, generated by accessing different files and folders, is collected with the help of a #pre-processing algorithm to extract the relevant information.
2. Pre-processing algorithm: From the data collected, the pre-processing algorithm transforms the data into a specific format. This means extracting, #cleaning, and loading the appropriate data into a text file, also known as a #log_file. We used a Java algorithm for pre-processing to extract and transform the data into the given log file format (a small illustrative sketch follows below).
3. Generation of dataset: A #Dataset is a collection of data. Here, relevant data are grouped together to form a dataset. In this experimental setup, the dataset includes the timestamp and the type and name of the file/directory accessed. This dataset, which is basically a text file, is given as input to the data mining algorithm.
B. Data analysis and algorithm.
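The original work used a Java pre-processing algorithm. Purely as an illustration of the idea (not the actual code), here is a small Python sketch that writes file-access records to a log file and reads them back as a dataset; the paths and file name are hypothetical:

    import csv
    from datetime import datetime

    # Hypothetical raw access events: (path, kind). In the real system these
    # would come from monitoring the user's machine.
    raw_events = [
        ("/home/user/docs/report.txt", "file"),
        ("/home/user/photos", "directory"),
    ]

    # Pre-processing step: transform raw events into the log file format
    # (timestamp, type, name), i.e. the dataset fed to the mining algorithm.
    with open("access_log.txt", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "type", "name"])
        for path, kind in raw_events:
            writer.writerow([datetime.now().isoformat(), kind, path])

    # Generation of dataset: read the log file back as a list of records.
    with open("access_log.txt") as f:
        dataset = list(csv.DictReader(f))
    print(dataset)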
mahiworld-blog1 · 9 years ago
DATA #PREPROCESSING
Data #preprocessing is a crucial research topic in data mining (DM), since most real-world databases are highly affected by negative elements such as noise, missing values, and inconsistent or superfluous #data.
Preprocessing adapts the data to the requirements posed by each data mining algorithm, making it possible to process data that would otherwise be unfeasible.
Although it is a powerful tool that lets the user treat and process complex data, it may consume large amounts of processing time. It covers a wide range of disciplines, such as data preparation and data reduction #techniques; a small example follows below.
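As a small, generic illustration of such preparation steps (not tied to any particular study), here is a Pandas sketch that handles duplicates, missing values, and an obviously noisy value; the columns and numbers are made up:

    import pandas as pd
    import numpy as np

    # Illustrative data with missing values, a duplicate row, and an outlier.
    df = pd.DataFrame({
        "age":    [25, np.nan, 32, 32, 250],
        "income": [40000, 52000, np.nan, np.nan, 61000],
    })

    df = df.drop_duplicates()                          # remove superfluous rows
    df["age"] = df["age"].fillna(df["age"].median())   # impute missing values
    df["income"] = df["income"].fillna(df["income"].median())
    df = df[df["age"] < 120]                           # drop an obviously noisy value
    print(df)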
mahiworld-blog1 · 9 years ago
Nothing looks like my enemy.