
Learn how to optimize an Apache Spark cluster configuration for your particular workload.

Spark supports many formats, such as csv, json, xml, parquet, orc, and avro. Spark can be extended to support many more formats with external data sources; for more information, see Apache Spark packages. The best format for performance is parquet with snappy compression, which is the default in Spark 2.x. Parquet stores data in columnar format and is highly optimized in Spark. In addition, while snappy compression may result in larger files than, say, gzip compression, those files decompress faster because of their splittable nature.
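
For illustration, here is a minimal Scala sketch that writes a DataFrame as parquet with snappy compression; the sample data and the /tmp/sales_parquet output path are hypothetical, and the explicit compression option is shown only for clarity since snappy is already the default codec.

```scala
import org.apache.spark.sql.SparkSession

object ParquetSnappyExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ParquetSnappyExample")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; replace with your own source.
    val sales = Seq((1, "widget", 2.50), (2, "gadget", 7.99)).toDF("id", "item", "price")

    // Write columnar parquet; snappy is the default codec, set here explicitly.
    sales.write
      .option("compression", "snappy")
      .mode("overwrite")
      .parquet("/tmp/sales_parquet")

    spark.stop()
  }
}
```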

Spark provides its own native caching mechanisms, which can be used through different methods such as .persist(), .cache(), and CACHE TABLE. This native caching is effective with small data sets as well as in ETL pipelines where you need to cache intermediate results. However, Spark native caching currently doesn't work well with partitioning, since a cached table doesn't keep the partitioning data.
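
As a minimal sketch of those caching methods, assuming a hypothetical intermediate result read from the path used above: .cache() marks a DataFrame for in-memory storage and materializes it on the first action, while CACHE TABLE caches a registered view through SQL.

```scala
import org.apache.spark.sql.SparkSession

object CacheExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CacheExample").getOrCreate()

    // Hypothetical intermediate result in an ETL pipeline.
    val sales = spark.read.parquet("/tmp/sales_parquet")
    val enriched = sales.filter("price > 1.0")

    // DataFrame-level caching: stored in memory on the first action.
    enriched.cache()
    println(enriched.count())                            // materializes the cache
    println(enriched.where("item = 'widget'").count())   // reuses cached data

    // SQL-level caching of a temporary view.
    enriched.createOrReplaceTempView("enriched_sales")
    spark.sql("CACHE TABLE enriched_sales")

    spark.stop()
  }
}
```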

Spark operates by placing data in memory, so managing memory resources is a key aspect of optimizing the execution of Spark jobs. There are several techniques you can apply to use your cluster's memory efficiently:

- Prefer smaller data partitions and account for data size, types, and distribution in your partitioning strategy.
- Consider the newer, more efficient Kryo data serialization, rather than the default Java serialization.
- Monitor and tune Spark configuration settings.

For your reference, the Spark memory structure and some key executor memory parameters are shown in the next image.

Spark memory considerations: Apache Spark in Azure Synapse uses YARN (Apache Hadoop YARN); YARN controls the maximum sum of memory used by all containers on each Spark node. The following diagram shows the key objects and their relationships.

To address 'out of memory' messages, try:

- Reduce by map-side reducing, pre-partition (or bucketize) source data, maximize single shuffles, and reduce the amount of data sent.
- Prefer ReduceByKey with its fixed memory limit to GroupByKey, which provides aggregations, windowing, and other functions but has an unbounded memory limit (see the sketch after this list).
- Prefer TreeReduce, which does more work on the executors or partitions, to Reduce, which does all work on the driver.
- Leverage DataFrames rather than the lower-level RDD objects.
- Create ComplexTypes that encapsulate actions, such as "Top N", various aggregations, or windowing operations.
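
The sketch below illustrates the ReduceByKey guidance with a hypothetical word-count pair RDD: reduceByKey combines values within each partition before the shuffle, whereas groupByKey ships every individual value across the network and can exhaust executor memory for heavily skewed keys.

```scala
import org.apache.spark.sql.SparkSession

object ReduceByKeyExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReduceByKeyExample").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical pair RDD of (word, 1) tuples.
    val pairs = sc.parallelize(Seq("spark", "yarn", "spark", "parquet", "spark"))
      .map(word => (word, 1))

    // Preferred: partial sums are computed map-side, so per-key memory stays bounded.
    val counts = pairs.reduceByKey(_ + _)

    // Works, but every value for a key is shuffled and held before aggregation.
    val grouped = pairs.groupByKey().mapValues(_.sum)

    counts.collect().foreach(println)
    grouped.collect().foreach(println)

    spark.stop()
  }
}
```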

Spark jobs are distributed, so appropriate data serialization is important for the best performance. There are two serialization options for Spark:

- Java serialization is the default.
- Kryo serialization is a newer format and can result in faster and more compact serialization than Java. Kryo requires that you register the classes in your program, and it doesn't yet support all Serializable types.
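
As a hedged sketch of registering application classes with Kryo (the SalesRecord and CustomerRecord classes are hypothetical), the serializer is switched through the Spark configuration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical domain classes to register with Kryo.
case class SalesRecord(id: Long, item: String, price: Double)
case class CustomerRecord(id: Long, name: String)

object KryoExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("KryoExample")
      // Replace the default Java serializer with Kryo.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Register the classes that will be serialized; unregistered classes still
      // work, but their full class names are written out, adding overhead.
      .registerKryoClasses(Array(classOf[SalesRecord], classOf[CustomerRecord]))

    val spark = SparkSession.builder().config(conf).getOrCreate()

    // ... run jobs that shuffle or cache SalesRecord / CustomerRecord data ...

    spark.stop()
  }
}
```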

Bucketing is similar to data partitioning, but each bucket can hold a set of column values rather than just one.
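
As a sketch with hypothetical table and column names, bucketing is applied when writing a managed table; rows are hashed into a fixed number of bucket files by the chosen column, which later joins and aggregations on that column can exploit.

```scala
import org.apache.spark.sql.SparkSession

object BucketingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BucketingExample")
      .enableHiveSupport()
      .getOrCreate()

    val sales = spark.read.parquet("/tmp/sales_parquet")

    // Hash rows into 32 buckets by item so rows with the same item land in the
    // same bucket file; sortBy keeps each bucket ordered by that column.
    sales.write
      .bucketBy(32, "item")
      .sortBy("item")
      .mode("overwrite")
      .saveAsTable("sales_bucketed")

    spark.stop()
  }
}
```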
