[SPARK-34768][SQL] Respect the default input buffer size in Univocity
### What changes were proposed in this pull request?

This PR proposes to use Univocity's default input buffer size instead of hard-coding it to 128 in `CSVOptions`.

### Why are the changes needed?

- It is best to trust Univocity's judgement on the default values. Also, 128 is too low for an input buffer.
- Default values arguably have more test coverage in Univocity.
- It will also fix uniVocity/univocity-parsers#449.
- That issue is a regression compared to Spark 2.4.
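
To illustrate why the hard-coded value of 128 matters here, the following minimal sketch rebuilds the record used by the unit test in this commit and shows where a 128-character boundary would fall in it. It uses only the Scala standard library. The object name `BufferBoundarySketch` and the idea of slicing the record into a "first chunk" are illustrative assumptions, not Univocity's actual buffering logic. That the test was designed to put the trailing-whitespace field right after the boundary is a reading of the test, not something the commit message states.

```scala
// Hypothetical illustration only: this does not call Univocity or Spark.
object BufferBoundarySketch {
  // Same values as in the unit test added by this commit.
  val bufSize: Int = 128
  val line: String = "X" * (bufSize - 1) + "| |"

  // Assumption: model a fixed-size buffer as a plain slice of the record.
  val firstChunk: String = line.take(bufSize)
  val rest: String = line.drop(bufSize)

  // Split on the delimiter, keeping trailing empty fields (limit = -1).
  val fields: Array[String] = line.split("\\|", -1)

  def main(args: Array[String]): Unit = {
    println(s"record length: ${line.length}")
    println(s"first chunk ends with delimiter: ${firstChunk.endsWith("|")}")
    println(s"remainder after the boundary: '$rest'")
    println(s"number of fields: ${fields.length}")
  }
}
```

In this model the first 128 characters end exactly on the first delimiter, so the whitespace-only field begins in the next chunk, which is the situation the test exercises with `ignoreTrailingWhiteSpace` set to true.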

### Does this PR introduce _any_ user-facing change?

No. In addition, it fixes a regression.

### How was this patch tested?

Manually tested, and added a unit test.

Closes #31858 from HyukjinKwon/SPARK-34768.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 385f1e8)
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon committed Mar 17, 2021
1 parent b30e0a1 commit 0922380
Showing 2 changed files with 11 additions and 3 deletions.
@@ -166,8 +166,6 @@ class CSVOptions(

   val quoteAll = getBool("quoteAll", false)

-  val inputBufferSize = 128
-
   /**
    * The max error content length in CSV parser/writer exception message.
    */
@@ -259,7 +257,6 @@ class CSVOptions(
     settings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceInRead)
     settings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceInRead)
     settings.setReadInputOnSeparateThread(false)
-    settings.setInputBufferSize(inputBufferSize)
     settings.setMaxColumns(maxColumns)
     settings.setNullValue(nullValue)
     settings.setEmptyValue(emptyValueInRead)
@@ -2452,6 +2452,17 @@ abstract class CSVSuite
       assert(result.sameElements(exceptResults))
     }
   }
+
+  test("SPARK-34768: counting a long record with ignoreTrailingWhiteSpace set to true") {
+    val bufSize = 128
+    val line = "X" * (bufSize - 1) + "| |"
+    withTempPath { path =>
+      Seq(line).toDF.write.text(path.getAbsolutePath)
+      assert(spark.read.format("csv")
+        .option("delimiter", "|")
+        .option("ignoreTrailingWhiteSpace", "true").load(path.getAbsolutePath).count() == 1)
+    }
+  }
 }

 class CSVv1Suite extends CSVSuite {