RAZSOFT CANADA | Big Data Service

Big Data Architecture

Big Data architecture is an evolving practice for developing an automated data pipe line on a reliable and scalable platform at a lower cost, to gather and maintain large collections of data, and to process and apply advance statistical models for extracting useful data insights and information.

What is Big Data

Big Data is a popular term used to describe Tools, Technologies and Practices to process and analyze massive datasets that traditional data warehouse applications were unable to handle. Big Data Landscape including Technologies and Practices are rapidly evolving. Big data is often defined by five Vs, however, some of this data and related sciences exists long before. Hence, big data is a collection of data from traditional and digital sources, either inside or outside the enterprise that represents a source for ongoing discovery and analysis, that can be structured or un-structured.

Five V's

Big Data is often defined by five Vs:

Volume: the amount of data, e.g. large data generated by machine sensors.
Velocity: the speed of data generated and flowing into the enterprise, e.g. stream of data through social media, or telecom CDRs.
Variety: the kind of data available, e.g. structured/unstructured, documents, multimedia etc.
Veracity: the accuracy and quality of data
Value: the economic value of the data

Hadoop Framework

Hadoop is a framework and set of tools for processing very large data sets. It was designed to work on cluster of servers using commodity hardware, providing powerful parallel processing on compute and data nodes at a very low price. Techonology is rapidly advancing both for software and engineered hardware, making this framework even more powerful. Hadoop implements a computational paradigm known as MapReduce, which was inspired by an architecture developed by Google to implement its search technology, and its based on “map” and “reduce” function from LISP Programming.

Today large number of Tools & Technologies are available that staples to Hadoop and commonly termed as Hadoop or Big Data Eco-System.

Classifications of Big Data

Categorizing business problems, and classifying nature and type of Big Data that needs to be sourced will helps understand the type of Big Data solution required to solve the equation. General characteristics of data deals with how to acquire the data, what is the structure and format of the data, how frequently data becomes available, What is the size of this data. What is the nature of processing required to transform this data, what algorithms or statistical model is required to mine the data etc.

Defining Business Problems

Classifying data based on variety of business problems in Industry, e.g.:

Market Sentiment Analysis: Fusing social media feeds, customer issues, customer feedback and profiles to generate customer sentiments about products and services.
Telecom Churn Analysis: Using Customer Profiles, Demographics, Segmentation, Transactions, CDRs, and fusing them with Social behavior, and applying predictive analysis and trending to forecast.
Fraud Detection: Using Customer Profiles, Account Information, Transactions, Credit Data and Scoring, and fusing them real-time with Social behavior, and other data from partners to detect and stop fraud.

Data Acquisition

Today data is pumped through variety of sources including Data Sensors, mobile network, internet, traditional commercial and non-commerical data to discover new economic value in data. An enterprise is discovering how more information can be derived from an existing data, at the same time they are sourcing data from external players. Some of sources are:

Enterprise data available in existing OTLP or OLAP Systems.
Archieved data, not tapped. e.g. Log Data, machine data etc.
Build partnerships with Key Industry Players to fuse data with industry like Telecom, Travel, Financial, and Entertainment etc.
Integrate data sources with aligned and external businesses to derive 360-degree insights.
Acquire Social Media Data or Internet Data

Big Data Management Life Cycle

Data Management in an Enterprise is a critical process for manage its data. Process involves defining, governing, cleansing, data quality, and securing all data in enterprise. it ensure the quality of data and that there is single source of truth. While Big Data is part of overall Data Management process in an Enterprise, its demands its own unique cycle due to its nature.

Big Data Management Life Cycle can be described in six key steps: i) Acquire ii) Classify & Organize iii) Store iv) Analyze v) Share & Act vi) Retire

CRISP-DM Methodology

Stands for Cross Industry Process for Data Mining. CRISP-DM is a comprehensive data mining methodology and process model that provides a complete blueprint for conducting a data mining project. It is a robust and well-proven methodology. It breaks down Data Mining Projects into six steps:

CRISP-DM - Six Steps

Business Understanding: Objectives, requirements, and problem definition
Data Understanding: Initial data collection, familiarization, data quality issues; Initial, obvious results
Data Preparation: Record and attribute selection; Data cleansing
Modeling: Run the data mining tools
Evaluation: Determine if results meet business objectives; Identify business issues that should have been addressed earlier
Deployment: Put the resulting models into practice; Set up for continuous mining of the data

Big Data Infrastructure

While enterprises are rushing to process their unstructured data into actionable business intelligence, they are first required to created an Infrastructure with Compute and Storage architecture that can deal with petabytes of high velocity data.

Virtualization

Virtualization is fundamental to both Cloud Computing and Big Data. It provides high efficiency and scalability for Big Data Platform, and give MapReduce the desired distributed environment with endless scalability. Virtualization is desired across all IT Infra layers including Servers, Storage, Application, Data, Networks, Processors, and Memory etc to provide optimum scalability and performance in distributed environment.

Scale up / Scale out

Scaling is expanding an infrastructure (compute, storage, networking) to meet the growing needs of the applications that runs on it. Scaling up is replacing current technology with something more powerful. e.g. 1GbE switch with 10GbE switch.

Scale out means taking the current infrastructure, and replicates it to work in parallel in distributed environment. In Big Data environment, nodes are typically clustered to provide scalability. When more resources are required, additional nodes can be simply added into the cluster, adding more compute power or storage space. A virtualized or cloud environment can provide such an scalability (termed Elasticity) where additional nodes can be build on-the-fly when needed, and destroyed once the processing has finished.

Security

Security is of paramount concern in big data environment due to nature of data collected, privacy concerns, regulations and compliance. Careful Security policy is required to mitigate risks. Managing good Access-Control both internally and externally, and maintaining audit and logs for data access. In addition to access security, number of Data Safeguarding techniques can be applied to secure the data, e.g.

Tokenization (e.g. replace credit card # with random token)
Sanitization (either Encrypt or Remove data that can uniquely identify individual, e.g. name, national id)
Data isolation (Isolation sensitive data into separate zone, e.g. PCI Zones)

Cloud

Cloud provides three key support for Big data -- Scalability, Elasticity and Flexibility. Resources can be added and remove resources in real-time as needed, in extremely quick time at lower cost, in comparison to traditional procurement of private infrastructure. Cloud Service options are vast and available at each layer, i.e. Infra-IAAS, Software-SAAS, Platform-PAAS, and Data-DAAS, and it can be structured as Private Cloud (private data center) and Public Cloud (external vendor service on pay per use basis). Examples of cloud services:

Amazon Big Data - EC2, Elastic MapReduce, DynamoDB, S3 Storage, RedShift
Google Big Data - Compute, Big Query, Prediction API
Microsoft Azure - Based on Hortonworks HDP

Big Data Databases

Big Data consists of structured, unstructured and semi-structured data, and include relational databases (OLTP or OLAP), as well as other non-relational databases like key-value, document, columnar, graph or GeoSpatial data stores. A typical Big Data implementation will include multiple databases to serve different needs of Big Analytics -- a theme thats knows as Polyglot persistence.

Big Data Store is faced with many challenges. Firstly, it may receive fairly large data volumes at high velocity in real-time or near real-time. Secondly, its data sources may not be trustable, e.g. internet or social media. Third, it may receive dirty data which is inaccurate or inconsistent, hence cleansing and data quality may be astronomical task. Fourth, data may be noisy, and even in cleansed data only small portion of data may be of business value, Lastly, data may include personal or private information, subjected to regulatory and compliance laws, and privacy concerns. Hence, Data Governance, Data Privacy, and Data Life Cycle becomes very critical in Big Data environment.

noSQL

noSQL Stands for NOT ONLY SQL. No formal definition. Its an umbrella term for unstructured data store, though they may support SQL-like query languages. First used in 1998 for open source relational database. Became popular in 2009, Johan Oskarsson organized event discussed distributed databases used the term noSQL. It provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.

Big Data Databases

Big Data Databases can be categorized as

Relational: Traditional RDBMS. Oracle, SQL Server, PostgreSQL etc.
Key-Value: Simple noSQL Data Store, support key-value. Riak, Redis, Amazon (S3, DynamoDB), Memcached, infiniSpan etc.
Document: Semi-Structured, no pre-defined schema. MangoDB, CouchDB, Terrastore, OrientDB, RavenDB, Jackrabbit
Columnar: Unstructed. Store Data in tabular format (in columns). Cassandra, Hbase (like Google BigTable), Hypertable, Amazon (SimpleDB) etc.
Graph: Store entities and their relationship in Graph DS. Neo4j, FlockDB, InfiniteGraph, OrientDB etc.
Spatial: Is a geo-database that is optimized to store and query data that represents objects defined in a geometric space. Can store 2D/3D Objects. Spatial data is Standardized by Open Geospatial Consortium (OGC), which establishes OpenGIS (Geographic Information System) and other Spatial Data Standards. Oracle Spatial, PostGIS etc.

ACID vs BASE

Big difference between traditional RDBMS and noSQL Databases is BASE vs ACID.

Traditional Relational Database supports ACID ( Atomicity, Consistency, Isolation, Durability).
Most NoSQL stores lack true ACID Transactions, rather the support BASE (Basically Available, Soft state, Eventual consistency).

However, it is not always a requirement for Big Data, as most Big Databases are used for Decision Support System (DSS). In DSS, processing is performed in batch before the data is Read. DSS are used for non-transactional applications, and meant for organization's decision-support (e.g. OLAP apps). Although, Big Data is often used for real-time processing, but not always for online transactional processing (OLTP). Some Big Databases support ACID, like OrientDB, Neo4j.

Polyglot Persistence

Polyglot Persistence term refers to use of several core database technologies in a single application process. Though organizations are already using multiple databases for various applications like OLTP, OLAP, CMS, data marts etc. It become much more prominent with Big Data, as nature of Big data is to deal with "variety" of data types (structured or unstructured), as well as varying needs to process/compute it (realtime, near realtime etc.)

For example, eCommerce applications requires shopping cart, product catalog, stock inventory, orders, payments, user profile, user sessions, recommendations etc. Instead of trying to store all different data types into one database, it is more suitable to store them in best suited environment and desired data persistence. E.g.

User-session can be stored in Key-Value Data store.
Shopping Cart and Product Catalog can store Document Data store.
Recommendations can store in Graph Data Store.
Orders can be stored in RDBMS.
Analytics can be stored in Columar data store.

Big Data Analytics

Big Data Analytics is about gathering and maintaining large collections of data, and extracting useful data insights and information from these data collections.

Big Data Analytics not only provides tools and techniques to apply advance statistical models to data collections, but it is a completely different way of processing and extracting information, in comparison to traditional data analytics.

Big Data Analytics uses statistic modeling to provide deep insights into data using Data Mining techniques. Data Mining involves exploring and analyzing large datasets to find patterns in that data. These technique are used in Statistics and Artifical Intelligence (AI), and its all about Maths, like Linear Algebra, Calculus, Probability Theory, Graph Theory etc.

Data Mining is generally divided into four groups. i) Descriptive ii) Diagnostic iii) Predictive and iv) Prescriptive analysis. Descriptive Analytics classifies historical data to analyze performance, e.g. sales. Predictive Analytics attempts to predict future based on descriptive data, business rules, algorithms, and often human bias using Statistical Modeling techniques. Prescriptive analysis takes first two analysis and provide prescriptive advise based on what-if scenarios. It prescribe an action, so the business decision-maker can take this information and act upon it.

Analytics, Reporting & Dashboard

Building basic analytics, operational reporting and dashboards. This may involve developing descriptive statistics, breaking down data with various dimensions, building reports to monitor thresholds, KPIs. And reports to detect anomalies, fraud, and security breaches.

Data Mining & Data Science

Applying Data Science to mine data. It involves sophisticated complex algorithms, statistical models, machine learning, neural networks, text analytics. This includes descriptive, diagnostic, predictive or prescriptive statistics, applying supervised or unsupervised machine learning techniques. For instance, in Telecom company predicting customer behavior to manage Churn. some of the popular algorithms are classification, decision tree, regression, clustering, neural network etc.

Operationalizing Analytics

Key objective of data analytics is to put them into decision-maker's hand, so this information can be used and acted upon. For instance, a fraud analytics can be plugged into real-time Sales or Payment system to flag fraud transaction, or a recommendation engine can pull data from Big Data API to perform up-sell recommendations during sales process.

Monetizing Big Data

While enterprises are developing Data Science to gain a fully function 360 degree customer view to enable them to make informed decisions to improve Customer Experience & Satisfaction, Customer Engagements & Interactions, Customer Loyalty & Recommendations, and Channels/Employee Productivity etc. Or to predict Customer behavior, Customer Buying Patterns, Customer Movements and Travel Plans, Churn, Cross Sell / Up Sell Opportunities. Or to detect Fraud & Spin, Security alerts etc. They can use this data to find new revenue streams, develop new business models, or simply sell it to other companies who needs this analytics to perform their business.

However, enterprises needs to ensure that they exploit data by ensuring customer privacy and the security of data under different legal environments.

Internal

Exploit internally to generate Leads to support Campaigns, support Marketing & Sales Team to perform data inquest and analysis to find new revenue streams, reduce consumer churn, improve customer loyalty, or develop profitable data-driven business models.

External

Develop Data Insights for consumption by Industry Verticals via Data Stream and API.

Big Data Science

Data Science is the ongoing process of discovering information from data. It is a process that never stops, and often one question leads to another new questions.

Big Data Science

Data Science is the ongoing process of discovering information from data. It is a process that never stops, and often one question leads to another new questions. It focuses on real-world problems, and tries to explain it.

Data scientists use mathematical and statistical methods to build decision models to solve complex business and scientific problems. Statistical methods are useful to understand data, for validating hypotheses, for simulating scenarios, and for making predictive forecasts of future events. e.g. linear regression, Monte Carlo simulations, and time series analysis etc. Data scientists are required to have a strong subject-matter and domain-specific expertise in the area in which they’re analyzing, e.g. city planning, crime analysis, energy efficiency, customer churn etc.

Terms

Descriptive & Inferential Statistics

Descriptive Statistics help us describe the data distribution, i.e. show or summarize data in a meaningful way such that patterns might emerge from the data. However, they do not allow us to make conclusions, rather it simply help us visualize data. Typically, there are two general types of statistic that are used to describe data:

Measures of central tendency: Way of describing the central position of a frequency distribution for a group of data. Central Tendency can be described as "What usually happens". Statiscal measures for Central Tendency are Mean, Median, and Mode.
Measures of spread or Dispersion: Ways of summarizing a group of data by describing how spread out the scores are. Statiscal measures for Dispersion are Range, Variance and Standard Deviation.

Descriptive statistics are generally presented through summation of data either in Tabulated description (ie. Tables), or Graphical description (i.e. graphs or charts), or through a statistical commentary about result-sets.

A group of data that we are interested in is called Population. Properties of Population like mean or standard deviation are called parameters as they represents the entire population. However, when its it not feasible to measure the entire data, then small sample of population is taken through a process call Sampling.

Inferential Statistics are techniques that allow us to use these samples to make generalizations about the populations from which the samples were drawn. The methods used for inferential statistics are: i) the estimation of parameters and ii) testing of statistical hypothesis.

Inductive vs Deductive

Inductive models are built from generalization of an example, based on new knowledge. i.e. we look for a pattern that explains the common characteristics of an example, and apply it to other situation. In this case, new knowledge obtained invalidates old knowledge and is inducted.

Deductive models are built by deducting generalizations from domain theory, a previously solved examples and its explanation. i.e. we look for a solved examples, and deduct its characteristics to generalize. In this case new knowledge can not invalidate knowledge already known.

Analogical learnings are solutions to new problems by finding similarities with known problems and adapting their solution.

Reinforcement models are learned by trial and error playing with the environment and receiving feedback of our actions.

Types of Analytics (Prescriptive, Predictive, Diagnostic, Descriptive)

Typically there are four groups of analytics in data mining:

I. Descriptive: Descriptive Analytics are based on current and historical data. They general deals with "What happened". They would generally be presented in Tables, metrics, graphs, trends etc, like Quarterly Sales, or Year by year revenue comparisons.

II. Diagnostic: Diagnostic Analytics try to diagnose a situations or "What went wrong?" or why. They are useful for deducting and inferring the success or failure. For instance in market campaign for a product, descriptive analysis of page views, reviews, social media or blogs, actual sales, sales of previous product launches etc can help determine what went right, or what didn't work this time.

III. Predictive Predictive Analytics are based on current and historical data. However, unlike descriptive, this involves complex model-building and analysis to predict future and trends. Data Scientist use maths and complex statistical models using Supervised Learning or Unsupervised learning methods.

IV. Prescriptive Prescriptive Analytics aims to improve or optimize a process or system, or to avoid a failure through informed action. These actions are prescribed by building rules engines that are based on predictive analytics. The often deals with "What-if" scenarios, and can provide real-time prescriptive actions, e.g denying fraudulent credit card based on fraud control analytics.

Machine Learning

Machine Learning provides ideal technique for exploiting opportunities hidden in big data and finding data insights. It is useful where traditional analytic tools are not adequate to deal with large volume of data.

Machine Learning can help find correlations and relationship between desperate data, and provide techniques to test all hypothesis to investigate hidden value in the data. It Goes well beyond the traditional analytical tools that provides basic analytics to function sum, counts, simple means and medians, and provides comprehensive libraries for advance statistics modeling and algorithms.

Machine learning the process where machine (computer) is trained to run a learned method (or algorithm). An algorithm that’s repeated over and over until a certain set of predetermined conditions are met. The process is run until the point that the final analysis results will not change no matter how many additional times the algorithm is passed over the data.

Types of Machine Learning

There are three types of Machine learning techniques. i) Supervised, ii) Unsupervised and iii) Hybrid. In Supervised learning there is pre-determined classifications, and training instances are labeled with the correct result. Whist in Unsupervised learning, it is much harder to determine the conclusion because there are no pre-determined classifications.

Supervised Learning

The aim of supervised, machine learning is to build a model that makes predictions based on evidence in the presence of uncertainty. As adaptive algorithms identify patterns in data, a computer "learns" from the observations. When exposed to more observations, the computer improves its predictive performance.

A supervised learning algorithm takes a known set of input data and known responses to the data (output), and trains a model to generate reasonable predictions for the response to new data. Predictive algorithms use methods such as regression and classification. These methods use labeled data sets to build predictive models that accurately predict new observations.

Unsupervised Learning

Unsupervised Learning is about analyzing data and looking for patterns. The most common unsupervised learning method is Cluster Analysis, which is used for exploratory data analysis to find hidden patterns or grouping in data. The clusters are modeled using a measure of similarity which is defined upon metrics such as Euclidean or probabilistic distance.

Unliked Supervised learning, Unsupervised refers to the fact that we’re trying to understand the structure of our underlying data, rather than trying to optimize for a specific, pre-labeled criterion. Hence, it provides methods to discover patterns in unlabeled data sets that can be used, for example, to construct clusters of similar data or reduce the dimensionality of a data set.

Hybrid or Deep Learning

Many Data Science application are using a hybrid approach, which utilizes Unsupervised learning to pre-process data that becomes input to Supervised learning. This approach is used in Deep learning.

One example is Unsupervised techniques, such as Principal Component Analysis (PCA). Which can be used for dimensionality reduction, which reduces the number of feature variables while still being able to explain the variance in the data. The reduced data set can then be used with a supervised learning algorithm. In this way, PCA can improve the learning process.

Algorithms

Classification

Purpose of Classification is to predict an outcome based on input data. Classification has numerous applications, like fraud detection, targeted marketing, performance prediction, manufacturing and medical diagnosis, credit application risk profiling. Classification is a two-step Process.

Learning Step - in which Model is built based on seen data.
Classification Step - where Model is used to predict for given data.

Generally, a Sample is taken from real data that reflects real population to train the model. Radomizer is used for data selection. Classification uses two data sets i) Train Data (to build model), ii) Test Data (to measure model's performance and error rates). Decision Tree is generated on train-data and test-data. Confusion Matrix and Error rates are used to evaluate the model and hypothesis. Tuning of model can be done through Boosting and adjusting the Error cost. Randon Forest technique is used to improving predictive accuracy which generate larger number of decision trees.

Clustering vs Classifications

Clustering is used to divide the data into subsets and classification methods is used to build predictive models that can forecast the classification of future data points.

CLUSTERING

Unsupervised Machine Learning
Data is unlabeled
Use inferential method to discover patterns, relations, or correlations

CLASSIFICATION

Supervised Machine Learning
Data is labeled
Labels makes it easier to model or make rules to decide

Probability, Regression & Hypothesis

Decision Tree

Decision trees are one of the most powerful data mining technique, which can be used to solve wide range of problems. They can be used to explore data, making predictions, classifying records, and estimating values.

A decision tree is a hierarchical collection of rules that describes how to divide a large collection of records into successively smaller groups of records. They are a flowchart-like tree structure, it starts with "root" node and splits into branches. Each "internal" node denotes a test on certain attribute, next level "branch" represents an outcome of the test, and each "leaf/terminal" node holds a class label and contains "decisions".

Decision Tree where the target variable can take a finite set of values are called classification trees, where the target variable can take continuous values (typically real numbers) are called regression trees.

Classification for Predicting Customer Churn

Unstructured Data / Text Mining

Text Data Mining is the process of finding and exploiting useful patterns in text or unstructured data, e.g. emails, machine logs, tweets and blogs, text documents, multi-media contents etc. This data can be used to discover patterns, identify trends, find associations, perform sentiment analysis, predict behaviors, spam/phishing detection, web search etc.

A common approach is to Vectorize an unstructured or semi-structured text data through a process called Tokeniztion. Text is split into Bag of words, and each word in text reflects a Dimension. Text Documents are converted into Corpus to analyze them for important features or words phrases.
A Preprocessing step cleansed the data by removing "numbers", "punctuations", or undesired text. A process for Stemming reduces the words to their "stem", e.g. playing to play. Another process for removing Stop words text of little meaning, e.g. articles (a, an, the) or pronouns (you, she) etc. Identifying word pairs and phrases, or a custom defined lexicon is often used to ensure important dimensions are not lost.
Term Matrix Document (TDM) is created for calculate Term Frequency that shows terms and frequency of all words in the Corpus.

Sentiment Analysis

Sentiment Analysis is another application for text mining, and it can determine how people fell about something, like a brand or product, or stock etc. A sentiment score is assigned (e.g. negative, positive, neutral) for scoring.

Clustering

Clustering is the process of partitioning a set of data items (or observations) into subsets, where each subset is a Cluster, such that items in a cluster are similar to one another, and dissimilar to items in other clusters. Unlike Classifications, Clustering techniques have no pre-defined criteria or set targets, hence, it is referred to as unsupervised learning. Since there are no class labels, clustering is a form of learning by observation, rather learning by example. Clustering also detects outliers, where outliers are values that are "far away" from any cluster, like exceptions.

Clustering Process Steps

Clustering involves following steps:

Data Preperation: Gather dataset, data reduction and choosing right number of dimensions, Scaling to right proportions, and identifying right number of clusters
Apply Clustering Method: Partioning Methods (e.g. k-means), Hierarchical methods (e.g. agglomerative or decisive), Density-based methods (e.g. DBSCAN, OPTICS, DENCLUE), or Grid-based methods (e.g. STING, CLIQUE) etc.
Measuring Similarities: Techniques like Distance Measures (e.g. Euclidean, Square Euclidean, Mahattan), or Correlational Measures, Proximity Matrices (e.g. Nearest neighbor Linkages)
Analyzing Clusters: Plotting the cluster, and analyze it including both intuition and quantitative techniques, like Silhoutte

Clustering can be used as input to a "Classification" analysis, e.g. clustering may define customer segments that can be used as target labels in classification problem to predict customer behavior.

Business application of Clustering can help marketing define Segmentation, targeted advertisements, or customer categorization. Government can use it for census data and social analysis etc. Customer Segmentation, for example, can help to identify groups of customers that are similar to each other. This can allow direct marketing campaigns to different group of customers for cross-selling offer, retention efforts, customized messaging. Or It can help categorize customers to focus on group of similar customers.

K-Means Clustering Algorithm

K-means is one of the most commonly used clustering algorithm that uses Partioning Method. The "k" in its name refers to the fact that it looks for fixed numbers of clusters. K-mean may use mean or medoid etc to represent cluster center. Its not suited for binary values or categoric attributes. i.e. it only deals with numbers. Steps are:

Requires analyst to provide "k" number of clusters
Algorithm will calculate recursively clusters until there is no more change in data points
Evaluate the Cluster, typically using visual plots and quantitative techniques (e.g. Silhoutte)

Hierarchical Clustering Algorithm

Hierarchical clustering works by grouping data objects into a hierarchy or "tree" of clusters. It can be either "agglomerative" or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up (merging) or top-down (splitting) fashion.

H-Cluster can use various methods, e.g. single (nearest neighbor), complete (fullest neighbor), ward (min variance) etc
Algorithm will iteratively merge (or split) data points/clusters until there is single cluster
It produce a dendrogram, in which heights and bars shows the closeness of items
Dendrograms can be cut to appropriate level to review specific cluster(s).

Linear Regression

Linear Regression is the most commonly used statistical method for numeric predictions. Though they are available in Excel, they can be quite sophisticated. However, simple regression is used to show relationship between dependent variable (y-axix) to independent variables (x-axis). This relationship is than used to predict target variable's future value.

Some business uses are Trending Sales to forecast; Identifying patterns based on known data to forecast future, e.g. predicting election results, insurance claims; Analyzing effect of product pricing on customer behavior to optimize price.

Correlation

Correlation is used in regression modeling to identify relationships between dependent and independent variables. Correlation represents relationship between two variables, and it ranges between -1 and +1. A range (-ive/+ive) of:

0.1 - 0.3 --> is Weak Correlation,
0.3 - 0.5 --> is Moderate Correlation,
0.5 - 0.1 --> is Strong Correlation.

However, this range needs to be interpreted in-context, e.g. 0.5 is high correlation when human beings are involved, but considered low where machine processes are involved. There are several correlation coefficients to measuring the degree of correlation, e.g. Pearson, Rank, Polychoric etc.

Associations & Correlations

Frequent Patterns are patterns that appear frequently in dataset, e.g. frequent itemset is set of items like bread and milk in grocery transactions. A subsequence e.g. is like buy Computer first, then Printer, then Paper. If this occurs frequently in shopping, then its a (frequent) sequence pattern. Finding frequent patterns is essential in mining Associations & Correlations. A typical example of frequent itemset mining is market basket analysis, which can analyze customer buying habits by finding associations between different items that customer purchased, and provide insights to retailers into which items were frequently purchased together.

Association Rules

Association rules mining consists of finding frequent itemsets, e.g. sets of items like milk and bread. From this association rules are constructed. Thousands of rules can be generated from transactions. They are grouped as:

Trivial: Rules represent common knowledge, e.g. people who buy hammer also buy nails.
Inexplicable:Rules doesn't suggest course of action. e.g. people who bought pickles also bought ice cream.
Actionable:Rules, that represent a explainable cause, and easily understood by analyst with domain-knowledge.

There are many measures for Rule interpretations like: i) Support, ii) Confidence and iii) Lift. They respectively reflect the usefulness and certainty of discovered rules. e.g. A rule:

computer ⇒ printer [support=2%, confidence=60%, lift=1.2]

2% support means that 2% of all transactions shows that computer and printer was purchased together. A confidence of 60% means that 60% of customers who purchased computer, also bought printer. A Lift is a correlation rule, that measures correlation between computer (A) and printer (B). Lift > 1 means rule is good at predicting, Lift <= 1 means rules is as good or worse than guessing. The rules are considered useful if they satisfy minimum confidence threshold, specified by analyst.

Association Algorithms

Many algorithms are available for frequent itemset mining, from which association and correlation rules can be derived. These include:

Apriori-like Algorithms
Frequent pattern growth-based Algorithms, like FP-growth
Vertical data format Algorithms (Eclat)

Apriori is a seminal algorithm for mining frequent itemsets for boolean association rules. It employs an iterative approach called level-wise mining by using apriori property. This technique use trials and errors approach to generate Rules. Frequent pattern growth avoid T&E, rather construct highly compact data structure (an FP-Tree) to focus on frequent patterns. Vertical data format transforms a transaction dataset into Vertical itemsets.

Big Data Technologies

Our consultants can help you determine which Big Data platforms and tools to implement for your organization's Big Data Projects to achieve your goals.

Key Technologies

Big Data turf is rapidly evolving. There has been a massive amount of innovation in Big Data Tools & Technologies over last few years. Technologies that are coming out in Big Data are based on following key trends:

Internet Technologies: Technologies developed by Internet, Web and Social media companies like Google, Facebook, Amazon, like text processing, log processing, social networks etc.
Machine Learning: Apply Statistical modeling, computational scinece and machine learning to invent Artificial Intelligence to solve real work problems, and provide analysis, intelligence, predictions, and deep learnings.
Commodity Hardware: Use Commodity hardware that is much more cheaper than traditional hardware, and support distributed processing at much lower price tag.
Distributed Processing: Leverage Distributed processing to power up and Use Scale out option to build infrastructure on demand, and destroy once the outcome is achived
Cloud: Leverage Cloud based approach to reduce time to market, reduce risk and gain better SLA out of the box.

Open Source

Apache Hadoop

Apache Hadoop is an open-source data-processing platform first used by Internet giants including Yahoo and Facebook, leads the big-data revolution.

Horton Works

Horton is one of the few which offer 100% open source Hadoop technology without any proprietary (non-open) modifications. Thave have intergrated and support Apache HCatalog, which creates “metadata” to simplify data sharing with other layers of service like Apache Pig or Hive.

Commercial - Distribution

Cloudera

Cloudera is first to offer Commericial Enterprise support. They also offer Impala, which offers real-time massively parallel processing of Big Data to Hadoop

Oracle

View your marketing performance, monitor overall cost per lead, number of leads, average deal time, total acquisition cost over time, and drill down to regions.

MapR

Provide native support for Unix File system rather HDFS. MapR is also leading R&D of the Apache Drill project, which provides advanced tools for interactive real-time querying of Big Datasets.

Pentaho

Provides Hadoop distribution with Big Data eco system tools and analytics, both with community open source and Commericial Enterprise editions.

IBM

IBM BigInsights is a proprietary analytics and visualization algorithms to the core Hadoop infrastructure.

Commercial - Cloud

Amazon Web Services (AWS)

AWS's entire business is based on the cloud model. They have broadest selection of products for analytics and options to scale up / scale out, and it's typical choice for those running big workloads and storing lots of data in the cloud.

1010 Data

Cloud-based Big Data Platform, originated in Wall Street and used by New York Stock Exchange NYSE. They have highly scalable database service and supporting information-management, BI, and analytics capabilities that are served up private-cloud style

Microsoft HD Insight

Primarily a cloud-based Platform to run on Azure. Their distribution is based on Hortonworks.

Commercial - Others

Datastax Enterprise Analytics

Offers its own distribution of Apache Cassandra Data store with Hadoop. It includes proprietary modules to handle security, search, dashboard and visualization. Key customer include Netflix, where it powers the recommendation engine by analyzing over 10M datapoints per second.

Pivtol HD

Open source model, provide scale-out databases, including Pivotal Greenplum (Data warehouse) Pivotal HDB (Hadoop), Pivotal GemFire (in-memory data grid). Its a venture between EMC & vmWare. Key customer is China Railways with 3.5 Billion passenger's data.

Large organization creates huge volume of architectural outputs. Managing and organizing them requires effective management, tools, storage, and a formal taxonomy for various different artifacts.

Governance

Falcon

Apache Falcon is a data governance engine that defines, schedules, and monitors data management policies. Falcon allows Hadoop administrators to centrally define their data pipelines, and then Falcon uses those definitions to auto-generate workflows in Apache Oozie. Read More...

Atlas

Apache Atlas is a scalable and extensible set of core foundational governance services that enables enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the complete enterprise data ecosystem..

Integration

Sqoop

Sqoop provide command-line interface to transfer bulk data between (to/from) non-hadoop or relational databases and Hadoop. Sqoop can import a table or entire database into hadoop, and it can also export to relational database, and supports data transformation. Behind the scene Sqoop creates MapReduce jobs to import. It can do incremental loads through sql like queries or saved jobs.

Flume

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It runs agents on the source machines that pass the data updates to collectors, which then aggregate them into large chunks that can be efficiently written as HDFS files.

Chukwa

Apache Chukwa is a data collection system for monitoring large distributed systems. It can collect logs and perform analysis. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a ﬂexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data..

Kafka

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. It sends sending large numbers of events from producers to consumers. Kafka was originally built to connect LinkedIn’s website with its backend systems, it’s somewhere between S4 and Flume in its functionality. Kafka relies on ZooKeeper to keep track of its distributed processing.

Access

Pig & Pig Latin

Pig is an interactive execution environment to access data in Hadoop, like SQL in RDBMS. It support scripting language called Pig Latin. Pig Latin support loading data, processing, transformation and output. Pig can run both in Local mode or Hadoop mode (Parallel processing with MapReduce and HDFS). Pig programs can run through Pig Scripts, Grunt Command line, or Embedded java programming

Hive

Hive provides batch-like programming interface to Hadoop. It provides its mechanism for data store through its own table, partitions, buckets (files) and supporting meta data. It also provide SQL like queries for data access. It support both SQL query and MapReduce capability. Hive queries may take several minutes or even hours, depending on complexity. Hence, it is best used for deep data mining, rather than real-time inquiries.

Impala

Impala is native analytic MPP database for Apache Hadoop and is supported by Cloudera Enterprise. Provide ANSI SQL compatibility.

Solr/Lucene

Lucene is a Java library that handles indexing and searching large collections of documents, and Solr is an application that uses the Lucene library to build a search engine server. Solr is a high performance search server built using Lucene Core, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface.

Processing

MapReduce

MapReduce (first developed by Google) is a software framework, and its the core of Hadoop Eco System, that allows to write applications that can process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

MapReduce is a design pattern that includes 3 steps, and its based on traditional functional programming model with multi-threading:

Map: Read input & transforms data into series of key-value pairs
Shuffle: Each of these key-value elements are then sorted by their key and reach to the same node.
Reduce: Reduce function then merge the values (of the same key) into a single result.

MapReduce Framework has two nodes :

Single Master "JobTracker" - Responsible for Job Scheduling. Receives Job from Client as assign to task tracker.
Multiple Salves "TaskTracker" - Manages individual tasks, and write to, perform Map or Reduce Function, and writes to DataNode. There is one slave TaskTracker per cluster-node.

Mahout

Mahout is a scalable data mining library, that provides scalable machine learning algorithms for Clustering, Classification, Recommendations, filtering etc. Its used by Amazon, AOL, Drupal for their recommendation engine. It provides two methods to program, one through command-line and other through API-based java libraries.

R

R is a programming language and environment for statistical computing and graphics. R is often claimed to be most used tool by Data Scientists. It provides wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering etc.) and graphical techniques, and is highly extensible. R-Studio has a free IDE. R includes standard Statistical Libraries, while other libraries are available for download.

Sparks

Apache Spark allows fast in-memory data processing, which can tune performance up to 100 times faster. It provides APIs in Scala, Java, R, and Python for development of machine learning and data analysis applications. It comes with a built-in set of over 80 high-level operators, and allow to use it interactively to query data within the shell. In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning and graph data processing. Spark Libraries includes Spark Streaming, Spark SQL, Spark MLlib (Machine Learning Library), Spark GraphX.

H2O

Rapid Miner

scikit-learn

Key-Value Data Store

Key-Value are Simplest noSQL Data Stores that can be persistent or run totally in memory. Some of the uses of Key-Value Data Store are: Web Sessions, User Profiles / Preferences, Shopping Cart Data.

Riak

Riak is a distributed NoSQL key-value data store that offers high availability, fault tolerance, operational simplicity, and scalability. Riak was inspired by Amazon’s Dynamo database, and it offers a key/value interface and is designed to run on large distributed clusters. It automatically distributes data across the cluster to ensure fast performance and fault-tolerance.

Redis

Redis is an open source, in-memory data structure store. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs and geospatial indexes with radius queries. Redis has built-in replication and different levels of on-disk persistence, and provides high availability via Redis Sentinel and automatic partitioning with Redis Cluster.

DynamoDB

Amazon DynamoDB is a fully managed proprietary NoSQL database service offered by Amazon AWS. It provides fast and predictable performance with seamless scalability. Amazon DynamoDB automatically spreads the data and traffic for the table over a sufficient number of servers to handle the request capacity specified by the customer and the amount of data stored, while maintaining consistent and fast performance.

Document Data Store

They are Semi-structured Data Stores. There is no pre-defined schema or structure. Some of the uses are Documents stored in flat files like XML can do structured search. Or High volume “matching” of records: Trade clearing, Fraud detection, Transaction reconciliation.

MongoDB

MongoDB is Document Data store. Its records look similar to JSON objects with the ability to store and query on nested attributes. It supports automatic sharding and MapReduce operations. Queries can be written in JavaScript and Interactive Shell.

CouchDB

CouchDB is similar to MangoDB, but differs in how it supports querying, scaling, and versioning. It support JavaScript interface for Queries.

JackRabbit

RavenDB

OrientDB

TerraStore

Columnare Data Store

Its unstructured Data Store. Stores data in tabular format (in columns). Each row is called Column-Family. It supports Keyspace (Collection) >> Columns Familes (Rows) >> Columns >> Key-Values. Some of the uses are very large datasets, rows can have millions of columns, and variable/flexible to store different type of data. E.g. Event Logging, Blogging & User generated contents, fast search on tags..

Cassandra

Cassandra was originally an internal Facebook project, however it is now an open source. It’s very close to the data model used by Google’s BigTable. It can support both key-value and columnar data structures. The data is sharded and balanced automatically using consistent hashing on key ranges, though other schemes can be configured.

Hbase (Like Google BigTable)

Hbase is a columnar database and its modeled after Google's BigTable. It is capable of holding very large table (billions of columns/rows). Each row/column is called "cell", and it keep tracks of version for each cell with timestamp. Any previous version can be retrieved.

HyperTable

Hypertable is another open source clone of BigTable. It’s written in C++, rather than Java like HBase.

Amazon (SimpleDB)

Graph Data Store

Store entities, and their relationship in Graph DS. Entity is called Node, Relationships are called Edges. Graph DS can be queried in many ways, e.g. All Nodes who live in “Toronto”, and like “Soccor”. Some of the uses are Connected Data e.g. People and their associations. Recommendation Engines. E.g. Customer stayed in hotels.

Neo4j

Most popular Graph Datastore. Neo4j is ACID compliant. It's Java based but has bindings for other languages, including Ruby and Python.

FlockDB

FlockDB was created by Twitter for relationship related analytics. There is no stable release of FlockDB as yet. The biggest difference between FlockDB and other graph databases like Neo4j and OrientDB is graph traversal. It doesn't support traversal, as Twitter's model has no need for traversing the social graph.

InfiniteGraph

It is a proprietary graph database from Objectivity. Its goal is to create a graph database with "virtually unlimited scalability". Its reportedly used by CIA and Department of Defense.

OrientDB

OrientDB is ACID compliant. OrientDB supports HTTP and JSON out-of-the-box. OrientDB supports SQL with some extensions to manipulate trees and graphs. OrientDB supports state of the art Multi-Master Replication on distributed systems.

Security

Knox Gateway

Apache Knox Gateway is a REST API Gateway for interacting with Hadoop clusters. It provide support for functinos like Authentication (LDAP and Active Directory Authentication Provider), Federation/SSO (HTTP Header Based Identity Federation), Authorization (Service Level Authorization), Auditing.

Ranger

Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform. It provides central security policy administration across the core enterprise security requirements of authorization, authentication, audit and data protection.

Sentry

Authorization Component. Sentry is add-on for enforcing fine grained role based authorization to data and metadata stored on a Hadoop cluster.

Kerberos

Authentication Component. Kerberos is a network authentication protocol. It is designed to provide strong authentication for client/server applications by using secret-key cryptography. A free implementation of this protocol is available from the Massachusetts Institute of Technology. Kerberos is available in many commercial products as well.

Operations

Ambari

Apache Ambari provides software for Provisioning, Managing, and Monitoring a Hadoop Cluster. It also provide Ambari REST APIs to integrate these operations in external applications. Ambari provides an intuitive, easy-to-use Hadoop management web UI.

ZooKeeper

ZooKeeper is a centralized server in Hadoop Eco System to perform distributed configuration, synchronization service, and naming registry. It is a coordination service that coordinate all elements of distributed applications. It gives resilience and fault tolerance to Hadoop applications. Its capability include Process Synchronization (between nodes), Configuration Management, Self-Election (assign node a leader role), Messaging (between nodes). ZooKeeper best implemented when it manages group of nodes over multiple racks into a single distributed application.

CloudBreak

Cloudbreak is a cloud agnostic tool for provisioning, managing and monitoring of on-demand clusters in the cloud. It automates the launching of elastic Hadoop clusters with policy-based autoscaling on the major cloud infrastructure platforms including Microsoft Azure, Amazon Web Services, Google Cloud Platform, OpenStack, as well as platforms that support Docker containers for greater application mobility.

Scheduling

Oozie

Apache Oozie is a workflow scheduler system to manage Hadoop jobs. Oozie is a Java Web application, its Workflow jobs are Directed Acyclical Graphs (DAGs) of actions. There are two basic types of Oozie jobs: i) Oozie Workflow jobs (workflows), ii) Oozie Coordinator jobs (triggered by time and data availability)

Big Data Service

Build your Big Data Strategy

Big Data Architecture

Big Data Technologies

Big Data Science

Big Data Analytics

Our Methodology

DISCOVER

POC

PROVISION

INGEST

PROCESS

PUBLISH

Big Data Architecture

What is Big Data

Five V's

Hadoop Framework

Classifications of Big Data

Defining Business Problems

Data Acquisition

Big Data Management Life Cycle

CRISP-DM Methodology

Big Data Infrastructure

Virtualization

Scale up / Scale out

Security

Cloud

Big Data Databases

noSQL

Big Data Analytics

Analytics, Reporting & Dashboard

Data Mining & Data Science

Operationalizing Analytics

Monetizing Big Data

Internal

External

Big Data Science

Big Data Science

Terms

Machine Learning

Types of Machine Learning

Algorithms

Classification

CLUSTERING

CLASSIFICATION

Unstructured Data / Text Mining

Clustering

Linear Regression

Associations & Correlations

Big Data Technologies

Key Technologies

Categories

Big Data Industry Trends

Telecomunication

Healthcare

Retail

Supply Chain & Logistics

Big Data Artifacts

Architecture