Two Scala Libraries Every Data Engineer Should Know

As data engineers, we deal with a lot of JSON. It is ubiquitous because it is easy for developers to add to applications. However, JSON is not an efficient storage format for applications that query or process data frequently and at scale. Unlike Parquet, JSON is often not splittable (a multi-line JSON document, for example, has to be read as a whole), which makes it harder to parallelize in systems like Spark. As a result, JSON is often not performant enough and needs a further ETL step to convert it to a splittable, and therefore parallelizable, format like Parquet....

July 9, 2022