Cluster Analysis using Tableau and R – Part-2

Related article : Cluster Analysis using Tableau and R – Part-1

We have performed clustering analysis from both Tableau and R in my previous post. Tableau uses k-means algorithm for cluster analysis which partitions the data into k clusters with a center or mean value of all the points in each. Clustering is based on the distance each measure lies from the center.

Let’s look into that in detail.

First generate a cluster scatter plot in Tableau as we did in part -1 using Iris data set.

clustering_tableau_R10

Right click on the cluster that we added and choose Describe Clusters option

clustering_tableau_R11

This provides you the clustering details given below

clustering_tableau_R12

Now lets perform the k-means clustering from R and print the cluster

#copying iris to myiris variable
myiris<-iris
#Remove Species column
myiris$Species <- NULL
#clustering
kmeans.result <- kmeans(myiris,3)

#print the cluster data
kmeans.result

clustering_tableau_R13

Check the cluster means against that of Tableau cluster centers. Aren’t these comparable? However Tableau clustering analysis is limited and the default one is k-means as compared to the number of packages and functions available in R to perform various types of clustering. I will dedicate a future article to cover cluster and fpc packages in R.

In this case, what options that a Tableau user has for extended/advanced clustering? The answer is R integration by calling R packages from Tableau using similar steps that I explained in this article.

Cluster Analysis using Tableau and R – Part-1

This article introduces you to similar clustering analysis on your data using Tableau and R. Data files and source used in this post can be downloaded using the link below.

Download source files used in this article

Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. We will perform the analysis in both Tableau and R using the same data.

Clustering Analysis using Tableau

To start with, we connect Tableau to Iris data set.  You can download this from UCI Machine Learning Repository

First connect the Tableau work book to this csv data source and launch a new sheet. Drag the measures petal length, petal width to columns and sepal length, width to rows.

clustering_tableau_R

Next, disable aggregation of measures using Analysis->Aggregate Measures

clustering_tableau_R1

Alternately, to keep it simple, you can choose to analyze only 2 measures as shown below. But in this article, we go with all 4 measures as above

clustering_tableau_R2.PNG

If you observe, these scatter-plots does not identify or differentiate any groups. However in our case, the data set already has a column specifying flower species of these measures. So let us view it by dragging ‘Species’ to color which shows the distinct species groups as below:

clustering_tableau_R3

Well, imagine what if we didn’t had the ‘species’ data handy and we wanted to identify the clusters based on the measures. Lets see how it can be accomplished using Tableau Cluster Analysis.

Start with our initial plot, i.e.

clustering_tableau_R1

Go to Analytics tab, and drag ‘Clusters’ as shown in the screen capture below. Tableau automatically identifies the number of clusters.

clustering_tableau_R4

Leave the defaults

clustering_tableau_R5.PNG

clustering_tableau_R6

Note that we have got exactly same cluster grouping as we got using ‘Species’ dimensi.on data.

Tableau uses k-means algorithm for cluster analysis which partitions the data into k clusters with a center or mean value of all the points in each. Clustering is based on the distance each measure lies from the center.

Cluster Analysis using R

To start with, let us revisit Tableau plot for iris data between petal.length and petal.width with cluster analysis.

clustering_tableau_R7

Let’s go to RStudio, and plot this using ggplot (note that iris is available as part of the datasets installed with R)

library(ggplot2)
ggplot(iris, aes(Petal.Width, Petal.Length, color = Species)) + 
geom_point()

clustering_tableau_R8

Note that we get identical grouping in R plot, but we used species column data to group (color) the data.

Let us look at how to perform the cluster analysis to identify clusters in R.

First take a copy of iris dataset to another variable

#cluster analysis - Biju Paulose
#-------------------------------
#copying iris to myiris variable
myiris<-iris
#printing data 
myiris

For our analysis, we do not want to use species column/data. So lets remove that from the new dataset.

#Remove Species column
myiris$Species <- NULL
#printing data to verify
myiris

Lets use k-means function for generating 3 clusters and plot the data

kmeans.result <- kmeans(myiris,3)
# plot the clusters
plot(myiris[c("Petal.Width", "Petal.Length")],col=kmeans.result$cluster)

The result is given below.  As you can compare with the analysis performed in Tableau above, we could generate the same clustering of data from R. We will examine these more closely in my next article.

clustering_tableau_R9

Download source files used in this article

Using R forecast from Tableau

Prerequisites:

  • RStudio (with forecast package installed)
  • Tableau Desktop (with connectivity established to RServe service. For details of R integration with Tableau, please refer my previous post here)
  • R Programming knowledge
Download source files used for this article

Forecasting allows business to arrive at more realistic estimates and targets for future. Tableau analytics provide the option of generating forecasts which many of you must be familiar with. In this article, we will look at how to make use of the R forecast package from Tableau. This will be a reference for the capabilities that you can bring in from R.

For a simple demonstration of R forecast, we can use the R air passenger time series data set for 2 year forecasting as below

myts1 <- ets(AirPassengers)
library(datasets)
library(forecast)
plot(forecast.ets(myts1, h=24))

tableau_r_forecast2

Tableau’s native forecasting has similar capability – an example shown below using the superstore dataset

tableau_r_forecast3.PNG

For the R integration, start a new sheet connecting to globalstore dataset and generate timeseries graph for sales (orderdate by months and Sum[Sales])

tableau_r_forecast4

To generate a forecast using R package, create a calculated field with the script as shown below

tableau_r_forecast5

SCRIPT_REAL("library(forecast);
myts <- ts(.arg1,start=c(2011,1), frequency=12);
myforecast <- forecast(myts, h=.arg2[1]);
monthsts<-length(.arg1); 
append(.arg1[(.arg2[1]+1):monthsts],myforecast$mean,after= monthsts
)",
SUM([Sales]),[Forecast Months])

The scripts creates timeseries for sales starting Jan 2011, generates forecast and appends starting x months (specified by parameter ‘Forecast months’) before the last month in the series.

You can view the forecast series by adding calculated field (SalesForecast) to the row. To make it intuitive, create the formula isForecast as below and drag to color.

tableau_r_forecast6

tableau_r_forecast7.PNG

Forecast vs Actual

To view forecast vs actual side by side, you can add sales to row

tableau_r_forecast8.PNG

But this does not give you a clear understanding or limits your ability to compare. The solution is to bring them together (dual-axis) and then synchronize both axis. The result is shown below

tableau_r_forecast9

Download source files used for this article

 

How to integrate R with Tableau

We have seen how R can be integrated to your data science project using Power BI or Visual Studio(RTVS). Now its time to look at R integration with Tableau.

Before we get started with the steps, let us discuss how this is beneficial.

Tableau is a great visualization tool which helps you to understand your data, provide interactivity and assist in making business decisions. R integration is going to bring the capabilities of to your Tableau visualizations – such as statistical functions predictive analysis. The advantage of interactive visualizations in Tableau powered by the complex statistical analysis behind the scenes using R presents a strong case for data scientists to go for this integration.

I have added a high level representation below of this implementation. You can call R functions from Tableau and it passes the result back to Tableau which can be used to generate visualizations. You can utilize all packages (difficult to accomplish using Tableau scripting alone) that are running in R Server and generate visualizations using the resultant data (complex to accomplish using R alone).

 

R_Tableau

As first step, make sure that RServe is running as a service that you can connect to. The screenshots below shows how to install RServe from RStudio.

R_Tableau_Integration

Start the service

R_Tableau_Integration_1

Now RServe is ready for connections. Go to Tableau and choose the option Help–> Settings and Performance–>Manage External Service Connection

R_Tableau_Integration_2

In this case, my RServe is running on the same PC. So I selected localhost as server. Default RServe port is 6311. Leave that as is and test your connection as below

R_Tableau_Integration_3

Above message confirms that you have established the connectivity with R service.

Next, we will look at an example calling R scripts from Tableau.

R rattle launch Error – libatk-1.0-0.dll is missing from your computer

If you encounter the below error while trying to launch rattle package in R Studio, that indicate that GTK+ is missing.

The program can’t start because libatk-1.0-0.dll is missing from your computer. Try reinstalling the program to fix this problem.

library(rattle)
Error in inDL(x, as.logical(local), as.logical(now), …) :
unable to load shared object ‘C:/Users/GEEK2/Documents/R/win-library/3.3/RGtk2/libs/x64/RGtk2.dll’:
LoadLibrary failure: The specified module could not be found.

Solution :  Re-Install package > install.packages(“rattle”)

If the popup asks for installing GTK+, select OK

R_GTKplus_istall

R_GTKplus_istall2.PNG

Get Docker on Windows – Docker Toolbox

How to get Docker installed in your windows? First major consideration is the version of Windows that you are running.

Docker for Windows: Docker for Windows requires 64 bit Windows 10 Pro or Windows Server 2016 and Microsoft Hyper-V. You can find complete requirements here

Docker Toolbox : If your windows does not meet the above requirements, you can install Docker Toolbox, which uses Oracle Virtual Box

Let us look at Docker toolbox installation in detail for those who are on Windows 8 or Windows 7. Make sure that virtualization is supported/enabled in your OS. For Windows 8, verify that from Task Manager -> Performance Tab.

docker_windows_installation2

For windows 7, download and run the Hardware-Assisted Virtualization Detection Tool

docker_windows_installation1

Next install Docker Toolbox (with default options selected) and launch Docker Terminal

docker_windows_installation3

Verify installation by running hello-world command

$ docker run hello-world

docker_windows_installation4

Launch Kitematic to search for various docker images available from Docker Hub

docker_windows_installation5

 

On-Premises Data Gateway for Power BI

We have seen different Power BI implementation options in my article here. If you do not want their application or database infrastructure to cloud will be looking for options for generating visualizations from Power BI service(cloud) connecting to on-premises data source (database such as SQL Server, Oracle or even structured files such as spreadsheets or csv files.

Note : Complete on-premises solution is available with the release of Power BI Report Server which is part of Power BI Premium. This article assumes the Power BI service implementation as below and gateway component is highlighted.

Power BI Gateway 1

On-premises data gateway allows Power BI service to connect to your database to refresh dataset. Gateway allows both scheduled refresh and live/direct query.  Time for looking at the steps on how to install and configure the gateway.

Download the standalone data gateway and start the service in on-premises. To download, use the option below after logon to your accountPower BI Gateway Installation_1

After installation, you have to sign in using your Azure account and specify a name to your gateway so that Power BI service tie it to your Power BI account.

Power BI Gateway Installation_3

Power BI Gateway Installation_4

Make sure that On-Premises Data Gateway is running as a windows service in your PC. Ideally this will be one of your on-premises server with connectivity to your database.

Go to Power BI web login, select ‘Manage Gateway’ option as shown below

Power BI Gateway Installation_5

Note that your Data Gateway is already listed there. You need to create a data source and provide authentication details for database connectivity (Windows User details in case of a file data source such as excel spreadsheet)

Power BI Gateway Installation_6

Next, go to your workspace, select the data set and go to ‘settings’ as shown below

Power BI Gateway Installation_7

Under ‘Gateway connection’ section, select your data gateway and click ‘apply’

Power BI Gateway Installation_8

Under ‘Scheduled refresh’ section, turn on ‘Keep your data up to date’ and specify refresh frequency as below

Power BI Gateway Installation_9

That completes it. Your dataset is ready for scheduled refresh connecting to your on-premises data source!

For verifying, you can update data in SQL Server database (local server) and wait for the refresh to happen.

Power BI Gateway Installation_11

Once complete, launch your visualization which uses the dataset and see your updated data reflecting there!

Power BI Gateway Installation_10