Spark: DataFrame To RDD For Data Cleansing

Gilad Moscovitch walks us through a common data cleansing problem with Spark data frames:

A problem can arise when one of the inner fields of the JSON has undesired non-JSON values in some of the records.

For instance, an inner field might contain HTTP errors, which would be interpreted as a string rather than as a struct.

As a result, our schema would look like:

root
 |-- headers: struct (nullable = true)
 |    |-- items: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |-- requestBody: string (nullable = true)

Instead of:

root
 |-- headers: struct (nullable = true)
 |    |-- items: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |-- requestBody: struct (nullable = true)
 |    |-- items: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
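
To make the mismatch concrete, here is a minimal plain-Python sketch of why a single bad record is enough to change the inferred type of the whole field. It is not Spark's actual inference code. It only imitates the behavior described above, where a field that holds a struct in some records and a string in others ends up typed as a string. The sample records and the function names are hypothetical.

```python
import json


def infer_field_type(value):
    """Return a coarse type name for one parsed JSON value."""
    if isinstance(value, dict):
        return "struct"
    if isinstance(value, list):
        return "array"
    if isinstance(value, str):
        return "string"
    return type(value).__name__


def merged_field_type(lines, field):
    """Merge the per-record types of `field` across all records.

    Assumption for this sketch: conflicting types fall back to "string",
    mirroring the schema shown in the excerpt.
    """
    seen = set()
    for line in lines:
        record = json.loads(line)
        if record.get(field) is not None:
            seen.add(infer_field_type(record[field]))
    if not seen:
        return "null"
    return seen.pop() if len(seen) == 1 else "string"


# Hypothetical sample data: the second record carries an HTTP error
# where a JSON object was expected.
sample_lines = [
    '{"headers": {"items": [{"name": "a"}]}, "requestBody": {"items": [{"id": 1}]}}',
    '{"headers": {"items": [{"name": "b"}]}, "requestBody": "HTTP 502 Bad Gateway"}',
]
```

Calling `merged_field_type(sample_lines, "requestBody")` on these two records yields `"string"`, while `headers` stays a `"struct"` because every record agrees on its type.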

When trying to explode a “string” type, we will get a type mismatch error:

org.apache.spark.sql.AnalysisException: Can't extract value from requestBody#10
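
One possible way to avoid this error, sketched here under assumptions and not necessarily the approach taken in the linked post, is to drop down to an RDD of raw JSON lines, filter out the records whose field is not a JSON object, and only then let Spark infer the schema. The predicate below is plain Python; the name `has_struct_field`, the default field name, and the input path in the comments are hypothetical.

```python
import json


def has_struct_field(line, field="requestBody"):
    """Return True when `line` is a JSON object whose `field` is
    absent, null, or itself a JSON object (a struct in Spark terms)."""
    try:
        record = json.loads(line)
    except ValueError:
        # Not valid JSON at all, so treat the record as dirty.
        return False
    if not isinstance(record, dict):
        return False
    value = record.get(field)
    return value is None or isinstance(value, dict)


# Possible use with PySpark (assumes an existing SparkSession named `spark`
# and a hypothetical input path; neither comes from the original post):
#   raw = spark.sparkContext.textFile("/path/to/records.json")
#   clean_df = spark.read.json(raw.filter(has_struct_field))
```

Because the dirty records are removed before schema inference, the field should come out as a struct in the resulting data frame. Whether to discard the bad records or route them elsewhere for inspection is a design choice that depends on the pipeline.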

Click through to see how to handle this scenario cleanly.
