Cluster Analysis using Tableau and R – Part-1

This article introduces you to similar clustering analysis on your data using Tableau and R. Data files and source used in this post can be downloaded using the link below.

Download source files used in this article

Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. We will perform the analysis in both Tableau and R using the same data.

Clustering Analysis using Tableau

To start with, we connect Tableau to Iris data set.  You can download this from UCI Machine Learning Repository

First connect the Tableau work book to this csv data source and launch a new sheet. Drag the measures petal length, petal width to columns and sepal length, width to rows.

clustering_tableau_R

Next, disable aggregation of measures using Analysis->Aggregate Measures

clustering_tableau_R1

Alternately, to keep it simple, you can choose to analyze only 2 measures as shown below. But in this article, we go with all 4 measures as above

clustering_tableau_R2.PNG

If you observe, these scatter-plots does not identify or differentiate any groups. However in our case, the data set already has a column specifying flower species of these measures. So let us view it by dragging ‘Species’ to color which shows the distinct species groups as below:

clustering_tableau_R3

Well, imagine what if we didn’t had the ‘species’ data handy and we wanted to identify the clusters based on the measures. Lets see how it can be accomplished using Tableau Cluster Analysis.

Start with our initial plot, i.e.

clustering_tableau_R1

Go to Analytics tab, and drag ‘Clusters’ as shown in the screen capture below. Tableau automatically identifies the number of clusters.

clustering_tableau_R4

Leave the defaults

clustering_tableau_R5.PNG

clustering_tableau_R6

Note that we have got exactly same cluster grouping as we got using ‘Species’ dimensi.on data.

Tableau uses k-means algorithm for cluster analysis which partitions the data into k clusters with a center or mean value of all the points in each. Clustering is based on the distance each measure lies from the center.

Cluster Analysis using R

To start with, let us revisit Tableau plot for iris data between petal.length and petal.width with cluster analysis.

clustering_tableau_R7

Let’s go to RStudio, and plot this using ggplot (note that iris is available as part of the datasets installed with R)

library(ggplot2)
ggplot(iris, aes(Petal.Width, Petal.Length, color = Species)) + 
geom_point()

clustering_tableau_R8

Note that we get identical grouping in R plot, but we used species column data to group (color) the data.

Let us look at how to perform the cluster analysis to identify clusters in R.

First take a copy of iris dataset to another variable

#cluster analysis - Biju Paulose
#-------------------------------
#copying iris to myiris variable
myiris<-iris
#printing data 
myiris

For our analysis, we do not want to use species column/data. So lets remove that from the new dataset.

#Remove Species column
myiris$Species <- NULL
#printing data to verify
myiris

Lets use k-means function for generating 3 clusters and plot the data

kmeans.result <- kmeans(myiris,3)
# plot the clusters
plot(myiris[c("Petal.Width", "Petal.Length")],col=kmeans.result$cluster)

The result is given below.  As you can compare with the analysis performed in Tableau above, we could generate the same clustering of data from R. We will examine these more closely in my next article.

clustering_tableau_R9

Download source files used in this article
Advertisements

On-Premises Data Gateway for Power BI

We have seen different Power BI implementation options in my article here. If you do not want their application or database infrastructure to cloud will be looking for options for generating visualizations from Power BI service(cloud) connecting to on-premises data source (database such as SQL Server, Oracle or even structured files such as spreadsheets or csv files.

Note : Complete on-premises solution is available with the release of Power BI Report Server which is part of Power BI Premium. This article assumes the Power BI service implementation as below and gateway component is highlighted.

Power BI Gateway 1

On-premises data gateway allows Power BI service to connect to your database to refresh dataset. Gateway allows both scheduled refresh and live/direct query.  Time for looking at the steps on how to install and configure the gateway.

Download the standalone data gateway and start the service in on-premises. To download, use the option below after logon to your accountPower BI Gateway Installation_1

After installation, you have to sign in using your Azure account and specify a name to your gateway so that Power BI service tie it to your Power BI account.

Power BI Gateway Installation_3

Power BI Gateway Installation_4

Make sure that On-Premises Data Gateway is running as a windows service in your PC. Ideally this will be one of your on-premises server with connectivity to your database.

Go to Power BI web login, select ‘Manage Gateway’ option as shown below

Power BI Gateway Installation_5

Note that your Data Gateway is already listed there. You need to create a data source and provide authentication details for database connectivity (Windows User details in case of a file data source such as excel spreadsheet)

Power BI Gateway Installation_6

Next, go to your workspace, select the data set and go to ‘settings’ as shown below

Power BI Gateway Installation_7

Under ‘Gateway connection’ section, select your data gateway and click ‘apply’

Power BI Gateway Installation_8

Under ‘Scheduled refresh’ section, turn on ‘Keep your data up to date’ and specify refresh frequency as below

Power BI Gateway Installation_9

That completes it. Your dataset is ready for scheduled refresh connecting to your on-premises data source!

For verifying, you can update data in SQL Server database (local server) and wait for the refresh to happen.

Power BI Gateway Installation_11

Once complete, launch your visualization which uses the dataset and see your updated data reflecting there!

Power BI Gateway Installation_10

Data Analytics – The Power BI way

In my previous articles related to BI, we have explored extensively on Tableau and SAP Business Objects. While I continue the series, it would be unfair if I don’t cover another ’emerging’ solution – Microsoft Power BI (in fact it is a market leading BI solution already, refer to my post for 2017 BI trends here Gartner Magic Quadrant for BI and Analytics)

Those who are familiar with Tableau will find themselves home with Power BI in many ways. However Power BI is unique in terms of features, the publish options, pricing and visualization options such as custom visualizations.

Power BI was completely cloud based solution until recently – i.e, until the release of Power BI Report Server in this month (June 2017).  In the cloud based solution, you have to publish your visualizations to Power BI Service on cloud. You have alternative to Power BI desktop for development in the form of Power BI Service Web Interface which provides similar options for creating and customizing visualizations.

I have sketched a high level cloud based (Power BI Service) implementation architecture below  connecting to on-premises database to understand the different components better.

Power BI Architecture

Implementation architecture using Power BI Service on cloud connecting to on-premises database

The implementation architecture below shows the on-premises solution using Power BI Report Server. In this implementation, development and publishing do not require cloud. Note that the development require installation of custom release version of Power BI Desktop (Power BI Desktop optimized for Power BI Report Server)

 

PowerBI Architecture_OnPremise

On-Premises implementation architecture using Power BI Report Server

In terms of connectivity, Power BI can connect to almost every data source. However the type and location of data should decide your design and type of connectivity.

On-Premises Database connectivity from Power BI Service : If you have an on-premises database, Power BI service allows the connectivity using gateways. The options are (a) data imported to Power BI service or (b) use direct query or live connections. We will explore these implementation models in detail in the upcoming articles.

PowerBI_DataSourcePowerBI_DataSource1

Custom Visuals : This is really great feature that Microsoft lets Power BI developers to create and publish their custom visualizations which can be added to your visualization tray and integrate with your reports/dashboards. Thanks for the support to creativity from developer community.

R Integration to Power BI :  Similar to Tableau R integration, we can connect to R for performing complex predictive analysis.

PowerBI_R_Integration

I will cover R integration details and many more in upcoming posts.

Data Analytics using Tableau – Trends & Trend Lines

In my previous blog, I had posted a bar chart showing declining trend in sales of a company over the years. If you notice, it had a trend line to aid visual analytics.

graph-1

Another example given below shows profit trends by increase in sales for each category.

Tableau_Trend_Lines

Trend lines are great visual tools for quick analysis. In the above diagram, its very easy to judge that the increase in sales for Supplies and Tables does not help increasing profit. The same judgement could not have been derived that easy from the clustered circles if the trend lines were not present. The steps to add Trend Lines are explained below.

I have created a scatter plot for Sales vs Profit using the Global Superstore database. Notice the “analytics” tab highlighted below.

Tableau_Trend_Lines1

Drag ‘Trend Line’ from the options to the visualization area and choose the line type. In this example, I have selected Linear model.

Tableau_Trend_Lines2

Tableau_Trend_Lines3

Additional options are available at (Right click -> Trend Lines-> Edit Trend Lines). For example, you can choose to show only one line (uncheck show trend line per color) and view for each category interactively for focused view/analysis as shown below

Tableau_Trend_Lines4

 

R Programming using RTVS – Part 4

This article assumes that you have RTVS extension added to your VS IDE.  You can refer my previous blogs on installation & configuration of RTVS.

  1. R Support in Visual Studio (RTVS)
  2. R Programming using RTVS – Part 2
  3. R Programming using RTVS – Part 3
Prerequisites:
1. Basic knowledge of R programming
2. RTVS Extension

Creating R projects using RTVS

Create a project in visual studio using R template.

RTVS_2

Open the Script.R file from solution explorer and type in your R program. In the example below, we create a simple line graph that compares sales and profit performance of a company over a period of 5 years.

#Define Data
sales <- c(2000, 3500, 3000, 4700, 6000)
profit <- c(1450, 2600, 750, 2000, 3750)
#range for axis
rnge <- range(0, sales, profit)
#Plot sales values
plot(sales, type = "o", col = "green", ylim = rnge, axes = FALSE, ann = FALSE)
#draw profit values as line
lines(profit, type = "o", pch = 22, lty = 2, col = "red")
box()
#draw X axis and y axis
axis(1, at = 1:5, lab = c("2013", "2014", "2015", "2016", "2017"))
axis(2, las = 1, at = 1000 * 0:rnge[2])
#Graph Title
title(main = "Sales by Year", col.main = "blue", font.main = 4)
#add axis labels
title(xlab = "Year", col.lab = rgb(0, 0, 1))
title(ylab = "Amount", col.lab = rgb(0, 0, 1))

Place your cursor in the first line of code and use Ctrl+Enter to step through each line to R Interactive and execute. You can see each line execution results to the R Plot window as shown below

RTVS_5_R_prioject

Here is the final plot view

RTVS_5_R_prioject_plot

Given below is an updated source which generate the output as PDF file

#Define Data
sales <- c(2000, 3500, 3000, 4700, 6000)
profit <- c(1450, 2600, 750, 2000, 3750)

#File name to save output
pdf(file = "sales_and_profit_byYr.pdf", height = 4, width = 5)

#range for axis
rnge <- range(0, sales, profit)

#Plot sales values
plot(sales, type = "o", col = "green", ylim = rnge, axes = FALSE, ann = FALSE)
#draw profit values as line
lines(profit, type = "o", pch = 22, lty = 2, col = "red")
box()

#draw X axis and y axis
axis(1, at = 1:5, lab = c("2013", "2014", "2015", "2016", "2017"))
axis(2, las = 1, at = 1000 * 0:rnge[2])

#Graph Title
title(main = "Sales by Year", col.main = "blue", font.main = 4)

#add axis labels
title(xlab = "Year", col.lab = rgb(0, 0, 1))
title(ylab = "Amount", col.lab = rgb(0, 0, 1))

#Flush the output to PDF file
dev.off()

R Programming using RTVS – Part 3

This article assumes that you have RTVS extension added to your VS IDE.  You can refer my previous blogs on installation & configuration of RTVS.

  1. R Support in Visual Studio (RTVS)
  2. R Programming using RTVS – Part 2
Prerequisites:
1. Basic knowledge of R programming
2. RTVS Extension

Key RTVS features that a developer should be aware of :

The following is just a list for you to explore rather than extensive walk through.

  1. Step through :  use Ctrl+Enter
  2. Execute a block of code in R interactive  : Select a block of code and Ctrl+ Enter (or right click and choose Execute in Interactive)
  3. Debug : Use Debug > Source startup file or F5
  4. Set Debug points : Native VS behavior (refer image below)rtvs_5_r_debug.png
  5. Reset current work space : To clear everything by using R Tools > Session > Reset
  6. Examine variables : Use ‘variable explorer’ (Ctrl + 8)
  7. IntelliSense : RTVS_5_R_IntelliSense
  8. Help : Use ? and ?? (Use ?? to search term in quotes if it include spaces)
  9. Code snippet manager :  View them at Tools > Code Snippets Manager.  RTVS_5_R_CodeSnipptType the abbreviated name and use tab to insert a snippet to your codeRTVS_5_R_CodeSnippt1
  10.  R interactive window or REPL (Read-Evaluate-Print-Loop) window : Use already illustrated in previous steps/posts
  11. Plot window : Ctrl + 6
  12. Plot history : Preserves all your plots in the session.  R Tools > Plots > Plot History Window (refer image below)RTVS_5_R_Plot_history

There are many more which an R programmer or VS developer can explore themselves. We will go through some samples using RTVS in the upcoming posts.

R Programming using RTVS – Part 2

This article assumes that you have RTVS extension added to your VS IDE.  You can refer my previous blog on installation & configuration of RTVS.

Prerequisites:
1. Basic knowledge of R programming
2. RTVS Extension

Once you install ‘Data Science and Analytical Applications’ workload using VS2017 installer (May 2017 update), new R Project template will be available for creating new project.

RTVS_2

Those who are familiar with R Studio will certainly find themselves home with this IDE. Additionally we have all VS native features. Given below is a view that shows Solution Explorer, Script file (Script.R) and Interactive Window (Microsoft R Client which is equivalent to console in R Studio)

RTVS_3

Notice that there is an ‘R Plot’ tab which is where the plots/visualizations will be generated.

To start with lets draw a basic plot

ds <- c(1,500,300,100,350)
plot(ds)

RTVS_4.png

Alternately use the normal VS debug (starting new instance) and watch the interactive window as the plot is generated.

ds <- c(1,500,300,100,350) > plot(ds)
> rtvs::debug_source("~/visual studio 2017/Projects/rproject1/rproject1/script.R")
Sourcing: c:\users\geek2\documents\visual studio 2017\Projects\rproject1\rproject1\script.R
>

That’s basic start. I recommend R developers to explore the VS menu “R Tools” for the complete set of R integration options available.