It is highly recommended to distribute the work evenly among multiple tasks, so that every task produces a single output file and the job completes in parallel.
Sometimes, however, it can still be useful for a task to generate multiple output files with a limited number of records in each. You can enable this by setting:
pyspark --conf spark.sql.files.maxRecordsPerFile=300000 ...
Now, when a task writes its output (see
org.apache.spark.sql.execution.datasources.FileFormatDataWriter), it closes the current file and creates a new one after writing
spark.sql.files.maxRecordsPerFile records.
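The writer's behavior can be sketched in plain Python (a simplified model only, not the actual FileFormatDataWriter implementation; `RollingWriter` and `max_records_per_file` are illustrative names, and a small limit of 3 is used so the example runs quickly):

```python
import os
import tempfile

class RollingWriter:
    """Simplified model of a writer that rotates output files after
    max_records_per_file records, similar in spirit to what Spark's
    file writer does when spark.sql.files.maxRecordsPerFile is set."""

    def __init__(self, out_dir, max_records_per_file):
        self.out_dir = out_dir
        self.max_records = max_records_per_file
        self.records_in_file = 0
        self.file_index = 0
        self.current = None

    def _new_file(self):
        # Close the current file (if any) and open the next one.
        if self.current:
            self.current.close()
        path = os.path.join(self.out_dir, f"part-00000-{self.file_index:05d}.txt")
        self.current = open(path, "w")
        self.file_index += 1
        self.records_in_file = 0

    def write(self, record):
        # Rotate to a new file once the per-file record limit is reached.
        if self.current is None or self.records_in_file >= self.max_records:
            self._new_file()
        self.current.write(record + "\n")
        self.records_in_file += 1

    def close(self):
        if self.current:
            self.current.close()

out_dir = tempfile.mkdtemp()
writer = RollingWriter(out_dir, max_records_per_file=3)
for i in range(8):
    writer.write(f"record {i}")
writer.close()
print(sorted(os.listdir(out_dir)))  # 3 files: 3 + 3 + 2 records
```

Note that the files are produced one after another by the same writer, which is exactly the sequential behavior discussed below.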
Note that a single task generates its files sequentially, one by one, which is a disadvantage compared with distributing the job among multiple parallel tasks.
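The difference can be illustrated with a plain-Python analogy (not Spark code; `write_chunk` and the chunking are illustrative): a single worker writing all files one by one versus several workers each writing one file concurrently, like evenly distributed tasks.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_chunk(out_dir, index, records):
    # Each call writes one complete output file, like one Spark task would.
    path = os.path.join(out_dir, f"part-{index:05d}.txt")
    with open(path, "w") as f:
        for r in records:
            f.write(r + "\n")
    return path

records = [f"record {i}" for i in range(9)]
chunks = [records[i:i + 3] for i in range(0, len(records), 3)]

# Sequential: one worker produces the files one after another,
# as a single task does under maxRecordsPerFile.
seq_dir = tempfile.mkdtemp()
for i, chunk in enumerate(chunks):
    write_chunk(seq_dir, i, chunk)

# Parallel: each chunk is handled by its own worker, so the files
# are produced concurrently, like evenly distributed tasks.
par_dir = tempfile.mkdtemp()
with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
    list(pool.map(lambda args: write_chunk(par_dir, *args), enumerate(chunks)))

print(len(os.listdir(seq_dir)), len(os.listdir(par_dir)))  # 3 3
```

Both approaches produce the same files; the parallel one simply overlaps the work, which is why spreading the output across tasks is usually preferable.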