A few days back I was working with multiline JSON files (i.e. regular JSON) on Spark 2.1, and I ran into a very peculiar issue when reading single-line JSON (a.k.a. JSONL or JSON Lines) versus multiline JSON files.

JSON Lines vs JSON

Consider an example where our JSON looks like the one below.
Here we have 3 rows, and all of them are enclosed inside a JSON array.

JSON

Now, if the same data is represented as JSON Lines, it would look something like this:

So, the difference comes down to the absence of the following in a JSONL file:

  • Square brackets representing the array: in a JSONL file every line is one record, and records are delimited by newlines.
  • Commas: JSONL files do not require a comma after every record.
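
To make the difference concrete, here is a small Python sketch that converts a JSON array into JSON Lines using only the standard library. The sample rows and the helper name to_json_lines are made up for illustration; they are not the data from the original screenshots.

```python
import json

# Hypothetical sample data: 3 rows enclosed in a JSON array (regular JSON).
json_text = """
[
  {"id": 1, "name": "Alice"},
  {"id": 2, "name": "Bob"},
  {"id": 3, "name": "Carol"}
]
"""


def to_json_lines(text):
    """Convert a JSON array of records into JSON Lines:
    one compact record per line, no enclosing brackets, no separating commas."""
    records = json.loads(text)
    return "\n".join(json.dumps(record) for record in records)


jsonl_text = to_json_lines(json_text)
print(jsonl_text)
```

Each printed line is a complete JSON document on its own, which is what lets a reader process the file one line at a time.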

Reading JSON & JSONL files

  • By default, Spark assumes that JSON files contain JSON Lines (JSONL format), not multiline JSON.
  • Working with JSON files differs between Spark versions prior to 2.2 and Spark 2.2 onwards.

Working with Single Line Records prior to Apache Spark 2.2

  • When working with JSON files (both JSONL and JSON), if each whole record is present on a single line, we can simply read it using

spark.read.json(<path>)

JSON file with everything on a single line (minified JSON)
  • When we read this data, we get the expected results in the form of a DataFrame.
Reading JSON or JSONL files using Spark < 2.2
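
As a rough sketch of the above, the snippet below minifies a pretty-printed JSON document onto a single line and wraps the default reader in a small function. It assumes an existing SparkSession is passed in as spark; minify_json, read_single_line_json and the sample data are hypothetical names used only for illustration.

```python
import json


def minify_json(text):
    """Re-serialize a JSON document onto one line with no extra whitespace."""
    return json.dumps(json.loads(text), separators=(",", ":"))


def read_single_line_json(spark, path):
    """Read JSONL or minified JSON with Spark's default, line-based JSON reader.
    `spark` is assumed to be an existing SparkSession."""
    return spark.read.json(path)


# Hypothetical pretty-printed input, collapsed to a single line.
pretty = """
[
  {"id": 1, "name": "Alice"},
  {"id": 2, "name": "Bob"}
]
"""
minified = minify_json(pretty)
print(minified)
```

Once the minified text is saved to a file, calling read_single_line_json(spark, <path>).show() should display the rows as a DataFrame.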

Working with Multi Line Records prior to Apache Spark 2.2

  • Suppose we have data spread across multiple lines, as below.
Multiline JSON data
  • If we simply try to read it, we get corrupt records (a _corrupt_record column) instead of the expected rows.
Reading multiline JSON via read.json with Spark < 2.2
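
To see why this happens, here is a plain-Python imitation of a line-based reader: every line is parsed as its own JSON document. This is not Spark's actual implementation, only a sketch of the idea, and parse_line_by_line and the sample data are hypothetical.

```python
import json


def parse_line_by_line(text):
    """Parse each non-empty line as its own JSON document.
    Returns (parsed_rows, corrupt_lines)."""
    parsed, corrupt = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            parsed.append(json.loads(line))
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            corrupt.append(line)
    return parsed, corrupt


# Hypothetical data: the same kind of record as JSONL and as pretty-printed JSON.
jsonl = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}'
pretty = json.dumps([{"id": 1, "name": "a"}], indent=2)

print(parse_line_by_line(jsonl))   # every line parses cleanly
print(parse_line_by_line(pretty))  # no line is valid JSON on its own
```

With the pretty-printed input, no single line is a complete JSON document, so nothing parses; that is roughly the situation the default reader is in when it reports corrupt records.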

Solution

  • To read multiline JSON records prior to Spark 2.2, we have to use sc.wholeTextFiles(), which gives us an RDD that we can then convert to a DataFrame.
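
A minimal sketch of this workaround could look like the following. It assumes an existing SparkSession is passed in as spark and that each file holds either a JSON array of records or a single JSON object; file_to_records and read_multiline_json_pre_2_2 are hypothetical helper names, not part of Spark.

```python
import json


def file_to_records(content):
    """Turn one file's full text into a list of single-line JSON strings."""
    data = json.loads(content)
    rows = data if isinstance(data, list) else [data]
    return [json.dumps(row) for row in rows]


def read_multiline_json_pre_2_2(spark, path):
    """Read whole files, re-emit one record per string, then build a DataFrame.
    `spark` is assumed to be an existing SparkSession."""
    # wholeTextFiles yields an RDD of (file_path, file_content) pairs.
    files = spark.sparkContext.wholeTextFiles(path)
    records = files.flatMap(lambda pair: file_to_records(pair[1]))
    # spark.read.json also accepts an RDD of JSON strings.
    return spark.read.json(records)
```

Note that wholeTextFiles loads each file entirely into memory as one string, so this approach suits many small files better than a few very large ones.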

Working with Multi Line Records post Apache Spark 2.2

  • With Apache Spark 2.2+, working with multiline JSON files becomes very easy; we just need to add the option

multiline='true'

  • Suppose we have the following data.
  • Now we can simply read it using spark.read.json() with the option multiline='true'.
Reading multiline JSON files post Apache Spark 2.2
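
Put together, a sketch of the Spark 2.2+ approach might look like this. It assumes an existing SparkSession is passed in as spark; the Spark documentation spells the option multiLine, which is the spelling used here. The helper name read_multiline_json and the sample file are hypothetical.

```python
import json
import os
import tempfile


def read_multiline_json(spark, path):
    """Read JSON whose records span several lines (Spark 2.2+).
    `spark` is assumed to be an existing SparkSession."""
    return spark.read.option("multiLine", "true").json(path)


# Hypothetical sample file: a pretty-printed JSON array spanning many lines.
rows = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
sample_path = os.path.join(tempfile.mkdtemp(), "people.json")
with open(sample_path, "w") as f:
    json.dump(rows, f, indent=2)
```

With a running session, read_multiline_json(spark, sample_path).show() should list one row per record in the array.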

Thanks for reading. Please comment with any queries or corrections.

#happy_coding #codebrace

Ashish Patel
Codebrace

Big Data Engineer at Skyscanner , loves Competitive programming, Big Data.