Spark – Create Multiple Output Files per Task using spark.sql.files.maxRecordsPerFile

It is generally best to distribute the work evenly among multiple tasks, so that every task produces a single output file and the job completes in parallel.

Sometimes, however, it is still useful for a task to generate multiple output files, each holding a limited number of records. You can do this with the spark.sql.files.maxRecordsPerFile option:

pyspark --conf spark.sql.files.maxRecordsPerFile=300000 ...
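The same limit can also be applied from inside an application rather than on the command line. The sketch below shows two ways this is commonly done: changing the session setting at runtime, and passing the maxRecordsPerFile writer option for a single write. The helper names, the Parquet format, the overwrite mode and the default of 300000 are assumptions made for illustration only; spark and df stand for an existing SparkSession and DataFrame.

```python
def set_session_record_limit(spark, max_records=300000):
    """Apply the limit to all subsequent file writes in this session.

    `spark` is assumed to be an existing SparkSession.
    """
    spark.conf.set("spark.sql.files.maxRecordsPerFile", str(max_records))


def write_with_record_limit(df, path, max_records=300000):
    """Apply the limit to one write only, using the DataFrameWriter option.

    `df` is assumed to be an existing DataFrame; Parquet and overwrite mode
    are arbitrary choices for this sketch.
    """
    (df.write
       .option("maxRecordsPerFile", max_records)
       .mode("overwrite")
       .parquet(path))
```

A value of zero or less means no limit, which is the default behavior.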

Now, when a task writes its final output (org.apache.spark.sql.execution.datasources.FileFormatDataWriter), it closes the current file and creates a new one each time the current file reaches spark.sql.files.maxRecordsPerFile records.
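To make the rollover behavior concrete, here is a simplified pure-Python model of a single task. It is not Spark's actual code, only a sketch of the logic described above; the function name split_into_files and the use of lists to stand in for files are assumptions for illustration.

```python
def split_into_files(records, max_records_per_file):
    """Model one task writing records into size-limited 'files' (lists).

    A limit of zero or less means no limit, so everything goes into one file.
    """
    files = [[]]  # the task starts with one open file
    for record in records:
        # Before writing, roll over to a new file if the current one is full.
        if max_records_per_file > 0 and len(files[-1]) >= max_records_per_file:
            files.append([])
        files[-1].append(record)
    return files


# A task with 700,000 output records and a limit of 300,000 per file
files = split_into_files(range(700_000), 300_000)
print([len(f) for f in files])  # [300000, 300000, 100000]
```

In this model a task with N output records produces ceil(N / limit) files, and only the last one can be smaller than the limit.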

Note that a single task generates its files sequentially, one after another, which is a disadvantage compared with distributing the work among multiple tasks that write in parallel.