- Apache Spark Machine Learning Blueprints
- Alex Liu
Data cleaning
In this section, we will review some methods for data cleaning on Spark with a focus on data incompleteness. Then, we will discuss some of Spark's special features for data cleaning and also some data cleaning solutions made easy with Spark.
After this section, we will be able to clean data and make datasets ready for machine learning.
Dealing with data incompleteness
For machine learning, the more data the better. However, as is often the case, more data also tends to mean dirtier data, and therefore more work to clean it.
Data quality control involves many issues, which can be as simple as data entry errors or duplicate records; more broadly, data cleaning covers data accuracy, completeness, uniqueness, timeliness, and consistency. In principle, the methods of treating these issues are similar: use data logic to discover them, and subject matter knowledge together with analytical logic to correct them. For this reason, in this section we will focus on missing value treatment to illustrate our usage of Spark for this topic.
Treating missing values and dealing with incompleteness is not an easy task, though it may sound simple. It involves many issues and often requires the following steps:
- Counting the missing percentage.
If the percentage is lower than 5% or 10%, then, depending on the study, we may not need to spend much time on it (see the sketch after this list).
- Studying the missing patterns.
There are two patterns of missing data: missing completely at random or not at random. If the values are missing completely at random, we can largely ignore the mechanism and treat the missing cases directly.
- Deciding the methods to deal with missing patterns.
There are several commonly used methods for dealing with missing cases; filling with the mean, deleting the incomplete cases, and model-based imputation are among the main ones.
- Performing the filling, or other chosen treatment, for the missing values.
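As a quick illustration of the first step, the following is a minimal base-R sketch; the local data.frame named df is our hypothetical example, not data from the book. It computes the missing percentage for each variable so we can compare against the 5% or 10% threshold mentioned above:
# A hedged sketch: per-variable missing percentages for a hypothetical
# local data.frame named df.
missing_pct <- sapply(df, function(x) 100 * sum(is.na(x)) / length(x))
round(missing_pct, 2)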
To work with missing cases and incompleteness, data scientists and machine learning professionals often rely on their familiar SQL tools or R programming. Fortunately, within the Spark environment, Spark SQL and R notebooks let users continue along these familiar paths; we will review both in detail in the following two sections.
There are also other issues with data cleaning, such as treating data entry errors and outliers.
Data cleaning in Spark
In the preceding section, we discussed working with data incompleteness.
With Spark installed, we can easily use Spark SQL and R notebooks in the Databricks Workspace for the data cleaning work described in the previous section.
In particular, the sql function on sqlContext enables applications to run SQL queries programmatically and return the result as a DataFrame. For example, in an R notebook, we can use the following to run a SQL query and get the result back as a DataFrame:
sqlContext <- sparkRSQL.init(sc)
df <- sql(sqlContext, "SELECT * FROM table")
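Still in the R notebook, a similar query can count the missing values of a single variable directly in SQL. The sketch below is an assumption on our part: it supposes the table queried above has an age column, and uses head() only to bring the small aggregate result back locally:
# A hedged sketch: count rows where the hypothetical age column is NULL
# in the table queried above.
missing_age <- sql(sqlContext, "SELECT COUNT(*) AS n_missing FROM table WHERE age IS NULL")
head(missing_age)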
Data cleaning is very tedious and time-consuming work, so in this section we would like to bring your attention to SampleClean, which can make data cleaning, and especially distributed data cleaning, easier for machine learning professionals.
SampleClean is a scalable data cleaning library built on the AMPLab Berkeley Data Analytics Stack (BDAS). The library uses Apache Spark SQL 1.2.0 and above as well as Apache Hive to support distributed data cleaning operations and related query processing on dirty data. SampleClean implements a set of interchangeable and composable physical and logical data cleaning operators, which makes quick construction and adaptation of data cleaning pipelines possible.
To get our work started, let's first import Spark and SampleClean with the following commands:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import sampleclean.api.SampleCleanContext
To begin using SampleClean, we need to create an object called SampleCleanContext, and then use this context to manage all of the information for the working session and provide the API primitives to interact with the data. SampleCleanContext is constructed with a SparkContext object, as follows:
new SampleCleanContext(sparkContext)
Data cleaning made easy
With SampleClean and Spark together, we can make data cleaning easy: we write less code and use less data.
Overall, SampleClean employs a good strategy: it uses asynchrony to hide latency and sampling to hide scale. SampleClean also combines all three AMP resources (Algorithms, Machines, and People) in one system, which makes it more efficient than many alternatives.
Note
For more information on using SampleClean, go to http://sampleclean.org/guide/ and http://sampleclean.org/release.html.
For the purposes of illustration, let's imagine a machine learning project with four data tables:
Users(userId INT, name STRING, email STRING, age INT, latitude DOUBLE, longitude DOUBLE, subscribed BOOLEAN)
Events(userId INT, action INT, Default)
WebLog(userId, webAction)
Demographic(memberId, age, edu, income)
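To keep the illustration self-contained, we can also imagine a small, purely hypothetical numeric extract of the Users table with a few missing values (this toy data is ours, not part of the project), which the steps and code below can operate on:
# A hypothetical numeric-only extract of the Users table with missing
# values, used only to illustrate the two cleaning steps below.
data <- data.frame(
  userId    = 1:5,
  age       = c(34, NA, 28, NA, 45),
  latitude  = c(40.7, NA, 34.1, 51.5, NA),
  longitude = c(-74.0, NA, -118.2, -0.1, NA)
)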
To clean this dataset, we need to:
- Count how many values are missing for each variable, either with SQL or R commands
- Fill in the missing cases with the mean value, if that is the strategy we agree on
Even though the preceding steps are very easy to implement, they could be very time consuming if our data is huge. Therefore, for efficiency, we may need to divide the data into many subsets and complete the previous steps in parallel, for which Spark is the best computing platform to use.
In the Databricks R notebook environment, we can first create a notebook that uses the R command sum(is.na(x)) to count the missing cases.
To replace the missing cases with the mean, we can use the following code:
for (i in 1:ncol(data)) {
  data[is.na(data[, i]), i] <- mean(data[, i], na.rm = TRUE)
}
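Note that the loop above assumes every column is numeric. A slightly more defensive variation, which is our own sketch rather than the book's code, restricts the mean fill to numeric columns only:
# A hedged variation: fill only numeric columns with their means,
# leaving other column types untouched.
for (i in seq_along(data)) {
  if (is.numeric(data[[i]])) {
    data[is.na(data[[i]]), i] <- mean(data[[i]], na.rm = TRUE)
  }
}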
In Spark, we can easily schedule such R notebooks to run across all the data clusters.