SingleStore DB Loves R

Abstract

The R programming language is very popular with many Data Scientists. At the time of writing, R ranks as the 11th most popular programming language in the TIOBE index.

R provides many compelling data manipulation capabilities that make it easy to slice and dice data. Data are often read into and written out of R programs using files, but R can also work directly with database systems. In this article, we’ll see two quick examples of how Data Scientists can use R from Spark with SingleStore DB.

The notebook files used in this article are available on GitHub in DBC, HTML, and RMarkdown formats.

Introduction

In this article, we’ll look at two methods that we can use to connect R programs from Spark to SingleStore DB:

  1. SparkR
  2. sparklyr

We’ll also plot data using an example from Databricks.

To begin with, we need to create a free Cloud account on the SingleStore website and a free Community Edition (CE) account on the Databricks website. At the time of writing, the SingleStore Cloud account comes with $500 of credits, which is more than adequate for the case study described in this article. For Databricks CE, we need to sign up for the free account rather than the trial version.

Configure Databricks CE

We’ll create a new Spark cluster using the Databricks Runtime 9.1 LTS ML.

Next, we’ll create a new Databricks CE R notebook called Setup. We’ll add a new code cell with the following:

SERVER <- "<TO DO>"
PASSWORD <- "<TO DO>"

PORT <- 3306
DATABASE <- "iris_db"

URL <- paste("jdbc:mysql://", SERVER, ":", PORT, "/", DATABASE, sep = "")

The <TO DO> values for SERVER and PASSWORD should be replaced with the values obtained from SingleStore Cloud when creating a cluster. Once these are set, URL evaluates to a JDBC connection string of the form jdbc:mysql://<host>:3306/iris_db.

Create the Database

In our SingleStore Cloud account, let’s use the SQL Editor to create a new database. We’ll call it iris_db, as follows:

CREATE DATABASE IF NOT EXISTS iris_db;

SparkR

Let’s now create a new Databricks CE R notebook. We’ll call it SparkR Write-Read Test. We’ll attach our new notebook to our Spark cluster.

Let’s first load SparkR and dplyr as follows:

library(SparkR)
library(dplyr)

Next, let’s create a Spark Dataframe from the built-in iris dataset:

spark_write_df <- SparkR::createDataFrame(iris)

We can use head to show some of the data in the Spark Dataframe:
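
head(spark_write_df)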

The output should be similar to the following:

  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Next, let’s run our Setup notebook:
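
In Databricks, one way to do this is with the %run magic command in a cell of its own (this assumes the Setup notebook sits in the same folder as this notebook):

%run ./Setup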

Now we’ll write the Spark Dataframe to SingleStore DB:

SparkR::write.jdbc(
  spark_write_df,
  URL,
  "iris",
  mode = "overwrite",
  user = "admin",
  password = PASSWORD
)

We’ll now read the data back from SingleStore DB into a different Spark Dataframe:

spark_read_df <- SparkR::read.jdbc(
  URL,
  "iris",
  user = "admin",
  password = PASSWORD
)

We can use head to show some of the data in the Spark Dataframe:
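
head(spark_read_df)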

The output should be similar to the following:

  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.6         3.1          1.5         0.2  setosa
3          5.0         3.4          1.5         0.2  setosa
4          4.4         2.9          1.4         0.2  setosa
5          4.9         3.1          1.5         0.1  setosa
6          5.4         3.7          1.5         0.2  setosa

We’ll now convert the two Spark Dataframes into R Dataframes:

r_write_df <- SparkR::as.data.frame(spark_write_df)
r_read_df <- SparkR::as.data.frame(spark_read_df)

and compare the two to see if they are the same:

all_equal(r_write_df, r_read_df)

The result should be TRUE, confirming that the data we wrote to SingleStore DB and the data we read back are the same (all_equal ignores row order by default).

We’ll now set a plot option:

options(repr.plot.height = 600)

and group the data:

iris_summary <- r_read_df %>%
  mutate(Sepal_Width = round(Sepal_Width * 2) / 2) %>%
  group_by(Species, Sepal_Width) %>%
  summarize(
    count = n(),
    Sepal_Length_Mean = mean(Sepal_Length),
    stdev = sd(Sepal_Length)
  ) %>%
  collect

Finally, let’s plot the data:

library(ggplot2)

ggplot(
  iris_summary,
  aes(
    Sepal_Width,
    Sepal_Length_Mean,
    color = Species
  )
) +
geom_line(size = 1.2) +
geom_errorbar(
  aes(
    ymin = Sepal_Length_Mean - stdev,
    ymax = Sepal_Length_Mean + stdev),
  width = 0.05
) +
geom_text(
  aes(label = count),
  vjust = -0.2,
  hjust = 1.2,
  color = "black") +
theme(legend.position = "top")

This should produce the output shown in Figure 1.

Figure 1. Sepal Width vs Mean Sepal Length.

sparklyr

Let’s now create a new Databricks CE R notebook. We’ll call it sparklyr Write-Read Test. We’ll attach our new notebook to our Spark cluster.

Let’s first load sparklyr and dplyr as follows:

library(sparklyr)
library(dplyr)

and connect sparklyr to Databricks:

sc <- spark_connect(method = "databricks")

Next, let’s copy the built-in iris dataset into a Spark Dataframe:

spark_write_df <- sdf_copy_to(sc = sc, x = iris, overwrite = TRUE)

We can use head to show some of the data in the Spark Dataframe:
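
head(spark_write_df)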

The output should be similar to the following:

# Source: spark<?> [?? x 5]
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <chr>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 
4          4.6         3.1          1.5         0.2 setosa 
5          5           3.6          1.4         0.2 setosa 
6          5.4         3.9          1.7         0.4 setosa

Next, let’s run our Setup notebook:
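
As in the SparkR notebook, one way to do this is with the %run magic command in its own cell (again assuming the Setup notebook is in the same folder):

%run ./Setup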

Now we’ll write the Spark Dataframe to SingleStore DB:

spark_write_jdbc(
  spark_write_df,
  name = "iris",
  mode = "overwrite",
  options = list(
    url = URL,
    user = "admin",
    password = PASSWORD,
    dbtable = "iris"
  ),
  partition_by = NULL
)

We’ll now read the data back from SingleStore DB into a different Spark Dataframe:

spark_read_df <- spark_read_jdbc(
  sc,
  name = "iris",
  options = list(
    url = URL,
    user = "admin",
    password = PASSWORD,
    dbtable = "iris"
  )
)

We can use head to show some of the data in the Spark Dataframe:
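
head(spark_read_df)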

The output should be similar to the following:

# Source: spark<?> [?? x 5]
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <chr>  
1          4.9         3            1.4         0.2 setosa 
2          5.4         3.9          1.7         0.4 setosa 
3          4.4         2.9          1.4         0.2 setosa 
4          4.9         3.1          1.5         0.1 setosa 
5          5.4         3.7          1.5         0.2 setosa 
6          4.8         3            1.4         0.1 setosa

We’ll now convert the two Spark Dataframes into R Dataframes:

r_write_df <- collect(spark_write_df)
r_read_df <- collect(spark_read_df)

and compare the two to see if they are the same:

all_equal(r_write_df, r_read_df)

The result should be TRUE, confirming that the data we wrote to SingleStore DB and the data we read back are the same (all_equal ignores row order by default).

We’ll now set a plot option:

options(repr.plot.height = 600)

and group the data:

iris_summary <- r_read_df %>%
  mutate(Sepal_Width = round(Sepal_Width * 2) / 2) %>%
  group_by(Species, Sepal_Width) %>%
  summarize(
    count = n(),
    Sepal_Length_Mean = mean(Sepal_Length),
    stdev = sd(Sepal_Length)
  ) %>%
  collect

Finally, let’s plot the data:

library(ggplot2)

ggplot(
  iris_summary,
  aes(
    Sepal_Width,
    Sepal_Length_Mean,
    color = Species
  )
) +
geom_line(size = 1.2) +
geom_errorbar(
  aes(
    ymin = Sepal_Length_Mean - stdev,
    ymax = Sepal_Length_Mean + stdev),
  width = 0.05
) +
geom_text(
  aes(label = count),
  vjust = -0.2,
  hjust = 1.2,
  color = "black") +
theme(legend.position = "top")

This should produce the output shown in Figure 2.

Figure 2. Sepal Width vs Mean Sepal Length.

Summary

In this article, we have seen two methods that we can use to connect to SingleStore DB using R from Spark. Connecting to SingleStore DB gives Data Scientists the ability to work with long-lived data and save experimental results, all under DBMS control.
