[SPARK-35045][SQL][FOLLOW-UP] Add a configuration for CSV input buffer size

### What changes were proposed in this pull request?

This PR makes the CSV input buffer size configurable through an internal SQL configuration. It is mainly a workaround for the regression reported in uniVocity/univocity-parsers#449.

This is particularly useful for SQL workloads, which would otherwise require rewriting each `CREATE TABLE` statement to pass the buffer size as an option.
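As a rough illustration of the difference, the sketch below reads the same file twice: once with the per-source `inputBufferSize` option, and once with only the session-wide configuration set. It assumes a Spark build that includes this patch; the object name, the scratch path, and the sample rows are made up for the example.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

object CsvInputBufferSizeExample {
  // Reads a small CSV file twice and returns both row counts:
  // first with the per-source option, then with the session-wide configuration.
  def run(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    // Hypothetical scratch location; any writable path would do.
    val path = Files.createTempDirectory("csv-buffer").resolve("data").toString
    Seq("a,b", "c,d").toDF("value").write.text(path)

    // Per-source option: has to be repeated on every reader or CREATE TABLE.
    val perSource = spark.read.format("csv")
      .option("inputBufferSize", "128")
      .load(path)
      .count()

    // Session-wide internal configuration added by this PR.
    spark.conf.set("spark.sql.csv.parser.inputBufferSize", "128")
    val sessionWide = spark.read.format("csv").load(path).count()

    (perSource, sessionWide)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("csv-input-buffer-size")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

With the session-wide setting in place, existing table definitions and readers pick up the buffer size without being edited.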

### Why are the changes needed?

To work around uniVocity/univocity-parsers#449.

### Does this PR introduce _any_ user-facing change?

No, it's only an internal option.

### How was this patch tested?

Manually tested by modifying the unit test added in #31858 as below:

```diff
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index fd25a79619d..705f38dbfbd 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@@ -2456,6 +2456,7 @@ abstract class CSVSuite
   test("SPARK-34768: counting a long record with ignoreTrailingWhiteSpace set to true") {
     val bufSize = 128
     val line = "X" * (bufSize - 1) + "| |"
+    spark.conf.set("spark.sql.csv.parser.inputBufferSize", 128)
     withTempPath { path =>
       Seq(line).toDF.write.text(path.getAbsolutePath)
       assert(spark.read.format("csv")
```

Closes #32231 from HyukjinKwon/SPARK-35045-followup.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 70b606f)
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon committed Apr 19, 2021
1 parent 34d4da5 commit 8fd6d18
Showing 2 changed files with 11 additions and 0 deletions.
@@ -212,6 +212,7 @@ class CSVOptions(
   val lineSeparatorInWrite: Option[String] = lineSeparator

   val inputBufferSize: Option[Int] = parameters.get("inputBufferSize").map(_.toInt)
+    .orElse(SQLConf.get.getConf(SQLConf.CSV_INPUT_BUFFER_SIZE))

   def asWriterSettings: CsvWriterSettings = {
     val writerSettings = new CsvWriterSettings()
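The precedence introduced by the `.orElse` line above can be sketched in plain Scala without Spark. The `resolve` helper and the plain `Map` standing in for the data source options are assumptions made for illustration; they are not part of the patch.

```scala
object InputBufferSizePrecedence {
  // Hypothetical stand-in for the lookup in CSVOptions: the per-source option
  // wins, and the session-wide configuration is used only as a fallback.
  def resolve(parameters: Map[String, String], sessionConf: Option[Int]): Option[Int] =
    parameters.get("inputBufferSize").map(_.toInt).orElse(sessionConf)

  def main(args: Array[String]): Unit = {
    // Option set on the source: the configuration value is ignored.
    println(resolve(Map("inputBufferSize" -> "256"), Some(128))) // Some(256)
    // Only the configuration is set: it supplies the buffer size.
    println(resolve(Map.empty, Some(128)))                        // Some(128)
    // Neither is set: no explicit buffer size is resolved.
    println(resolve(Map.empty, None))                             // None
  }
}
```

Because the configuration is only a fallback, sources that already pass `inputBufferSize` explicitly keep their existing behavior.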
@@ -2159,6 +2159,16 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)

+  val CSV_INPUT_BUFFER_SIZE = buildConf("spark.sql.csv.parser.inputBufferSize")
+    .internal()
+    .doc("If it is set, it configures the buffer size of CSV input during parsing. " +
+      "It is the same as inputBufferSize option in CSV which has a higher priority. " +
+      "Note that this is a workaround for the parsing library's regression, and this " +
+      "configuration is internal and supposed to be removed in the near future.")
+    .version("3.0.3")
+    .intConf
+    .createOptional
+
   val REPL_EAGER_EVAL_ENABLED = buildConf("spark.sql.repl.eagerEval.enabled")
     .doc("Enables eager evaluation or not. When true, the top K rows of Dataset will be " +
       "displayed if and only if the REPL supports the eager evaluation. Currently, the " +
