Handle also when header and data rows have different number of columns #189

HenrikBengtsson · 2015-06-10T00:21:08Z

Case 1: More column names than data columns

read.table() has fill=TRUE to handle the case where there are more column names than columns in the data rows, e.g.

> read_tsv("a\tb\tc\n1\t2\n")
Error: You have 3 column names, but 2 columns

> read.table(text="a\tb\tc\n1\t2\n", fill=FALSE)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,
  line 2 did not have 3 elements
> read.table(text="a\tb\tc\n1\t2\n", fill=TRUE)
  V1 V2 V3
1  a  b  c
2  1  2

Looking at the help, I don't think there is a way to use read_tsv() to deal with this case.

WISH: Make it possible to "fill" data rows with empty values/NAs when data rows lack trailing cells. This would assume the missing ones are at the end, cf. argument fill of read.table().
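For illustration, here is a minimal base-R sketch of the behavior this wish asks for. It assumes the header should be honored (header=TRUE) and that the separator is given explicitly; the variable names txt and df are only for this example.

```r
# Header has three names, the data row has only two fields.
txt <- "a\tb\tc\n1\t2\n"

# header = TRUE takes the column names from the first line.
# fill = TRUE pads the short data row, so the missing trailing
# cell in column "c" is read as NA.
df <- read.table(text = txt, header = TRUE, sep = "\t", fill = TRUE)
print(df)
```

This only covers missing cells at the end of a row; a cell missing in the middle would silently shift the remaining values to the left.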

Case 2: Fewer column names than data columns

read.table() does not handle this. I don't think read_tsv() does either.

> read.table(text="a\tb\n1\t2\t3\t4\n", header=TRUE, fill=FALSE)
Error in read.table(text = "a\tb\n1\t2\t3\t4\n", header = TRUE, fill = FALSE) :
  more columns than column names
> read.table(text="a\tb\n1\t2\t3\t4\n", header=TRUE, fill=TRUE)
Error in read.table(text = "a\tb\n1\t2\t3\t4\n", header = TRUE, fill = TRUE) :
  more columns than column names
> read_tsv("a\tb\n1\t2\t3\t4\n")
Error: You have 2 column names, but 4 columns

WISH: Make it possible to "fill" column names with empty values/NAs when the header lacks trailing column names. This would assume the missing ones are at the end, cf. argument fill of read.table().
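Until something like this exists, one possible base-R workaround is to count the fields per line first and pad the header before reading. The sketch below is an assumption-laden illustration, not anything readr or read.table() provides: the helper name read_ragged_tsv() is hypothetical, and naming the extra columns X3, X4, ... by position is just one convention.

```r
# Hypothetical helper: read tab-separated text whose header may have
# fewer names than the widest data row.
read_ragged_tsv <- function(txt) {
  # Count fields on every line, including the header line.
  con <- textConnection(txt)
  n_fields <- count.fields(con, sep = "\t", comment.char = "")
  close(con)

  # Column names actually present in the header line.
  first_line <- strsplit(txt, "\n", fixed = TRUE)[[1]][1]
  header <- strsplit(first_line, "\t", fixed = TRUE)[[1]]

  # Pad the header with positional names (assumed convention: X<n>).
  n_cols <- max(n_fields)
  extra <- paste0("X", seq_len(n_cols))[-seq_along(header)]
  col_names <- c(header, extra)

  # Skip the original header and supply the padded names instead;
  # fill = TRUE also tolerates data rows that are too short.
  read.table(text = txt, sep = "\t", skip = 1, header = FALSE,
             col.names = col_names, fill = TRUE)
}

df2 <- read_ragged_tsv("a\tb\n1\t2\t3\t4\n")
print(df2)
```

Like the wish itself, this assumes the missing names belong to the trailing columns.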

Background

For a real-world example, please see https://gist.github.com/HenrikBengtsson/dabc383aaa958c0ed49a. The above examples are never-ending stories in my life.


jennybc · 2015-06-14T05:39:00Z

👍 currently fiddling with an instance of Case 1: More column names than data columns.

hadley · 2015-09-03T18:51:09Z

I think both cases should be problems, not errors.

hadley · 2015-09-03T21:40:33Z

How does this look?

read_csv(col_types = "ii", "a,b\n1")
#>   a  b
#> 1 1 NA
read_csv(col_types = "ii", "a,b\n1,2,3")
#> Warning: 1 problems parsing literal data. See problems(...) for more
#> details.
#>   a b
#> 1 1 2

read_csv("a,b\n1")
#>   a
#> 1 1
read_csv("a,b\n1,2,3")
#>   a b X3
#> 1 1 2  3

I guess they probably should all generate warnings :/

hadley · 2015-09-03T22:11:49Z

A bit more progress

read_csv(col_types = "ii", "a,b\n1")
#>   a  b
#> 1 1 NA
read_csv(col_types = "ii", "a,b\n1,2,3")
#> Warning: 1 parsing failure (literal data)
#> row col  expected actual
#>   1   3 2 columns
#>   a b
#> 1 1 2

read_csv("a,b\n1")
#> Warning: 1 parsing failure (literal data)
#> row col       expected actual
#>  NA  NA 1 column names      2
#>   a
#> 1 1
read_csv("a,b\n1,2,3")
#> Warning: 1 parsing failure (literal data)
#> row col            expected actual
#>  NA   3 Missing column name
#>   a b X3
#> 1 1 2  3

jennybc · 2015-09-03T22:17:47Z

Looks good to me. Yes, it is helpful to be warned when col_types, the header row, and the data rows provide contradictory information about the number of variables.
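To make the idea of reporting such mismatches concrete, here is a rough base-R sketch that lists the data rows whose field count disagrees with the header. It is not how readr does it; the function name field_count_problems() and the row/expected/actual layout are assumptions that loosely mirror the output shown above.

```r
# Hypothetical helper: report data rows whose number of fields
# differs from the number of column names in the header.
field_count_problems <- function(txt, sep = ",") {
  con <- textConnection(txt)
  # Blank lines are skipped by default, so "row" counts non-blank
  # data lines only.
  n_fields <- count.fields(con, sep = sep, comment.char = "")
  close(con)

  expected <- n_fields[1]   # fields in the header line
  actual <- n_fields[-1]    # fields in each data line
  bad <- which(actual != expected)

  data.frame(row = bad,
             expected = rep(expected, length(bad)),
             actual = actual[bad])
}

p <- field_count_problems("a,b\n1\n1,2,3\n4,5\n")
print(p)
```

In this input the first data row is too short and the second is too long, so both are reported, while the third row matches the header and is not.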

hadley · 2015-09-03T22:46:44Z

Final version:

read_csv(col_types = "ii", "a,b\n1")
#> Warning: 1 parsing failure.
#> row col  expected    actual
#>   1  -- 2 columns 1 columns
#>   a  b
#> 1 1 NA
read_csv(col_types = "ii", "a,b\n1,2,3")
#> Warning: 1 parsing failure.
#> row col  expected    actual
#>   1  -- 2 columns 3 columns
#>   a b
#> 1 1 2
read_csv(col_types = "ii", "a,b\n1,2,3,4")
#> Warning: 1 parsing failure.
#> row col  expected    actual
#>   1  -- 2 columns 4 columns
#>   a b
#> 1 1 2

read_csv("a,b\n1")
#> Warning: 1 parsing failure.
#> row col    expected      actual
#>  --  -- 1 col names 2 col names
#>   a
#> 1 1
read_csv("a,b\n1,2,3")
#> Warning: 1 parsing failure.
#> row col    expected      actual
#>  --  -- 3 col names 2 col names
#>   a b X3
#> 1 1 2  3
read_csv("a,b\n1,2,3,4")
#> Warning: 1 parsing failure.
#> row col    expected      actual
#>  --  -- 4 col names 2 col names
#>   a b X3 X4
#> 1 1 2  3  4

This is looking pretty good to me :)

(BTW I've been using reprex to make these code snippets and it's awesome!)

hadley · 2015-09-03T22:54:34Z

Not quite right, but I'll finish it off tomorrow:

read_csv("a,b\n\n2,3")
#>    a  b
#> 1 NA NA
#> 2  2  3
read_csv("a,b\n\n\n2,3")
#> Warning: 1 parsing failure.
#> row col  expected    actual
#>   2  -- 2 columns 1 columns
#>    a  b
#> 1 NA NA
#> 2 NA NA
#> 3  2  3

jennybc · 2015-09-07T21:33:43Z

@HenrikBengtsson is the main sufferer but I agree this looks great. (Thanks for kind words re: reprex ... yeah, it certainly feels useful and ppl have given neat ideas and PRs already.)

hadley · 2015-09-23T18:59:15Z

I'm pretty sure I got everything - please open a new issue if you discover a case I missed.

HenrikBengtsson · 2015-09-23T19:34:12Z

Awesome - thanks for this. I've confirmed that it works with my real-world data that originally triggered this issue. You just made life a bit less hard for quite a few people.

This was referenced Sep 3, 2015

Behavior of read_table with rows of unequal length #169

Closed

Dealing with horrible files #94

Closed

hadley added a commit that referenced this issue Sep 3, 2015

Warnings instead of errors for bad cols. #189

fc901ad

hadley closed this as completed in a165309 Sep 23, 2015

brainfood mentioned this issue Dec 12, 2017

readr does not read columns with missing headers #762

Closed

lock bot locked and limited conversation to collaborators Sep 25, 2018
