Handle also when header and data rows have different number of columns #189

Closed
HenrikBengtsson opened this issue Jun 10, 2015 · 10 comments

@HenrikBengtsson

Case 1: More column names than data columns

read.table() has fill=TRUE to handle the case where there are more column names than columns in the data rows, e.g.

> read_tsv("a\tb\tc\n1\t2\n")
Error: You have 3 column names, but 2 columns

> read.table(text="a\tb\tc\n1\t2\n", fill=FALSE)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,
  line 2 did not have 3 elements
> read.table(text="a\tb\tc\n1\t2\n", fill=TRUE)
  V1 V2 V3
1  a  b  c
2  1  2

Looking at the help, I don't think there is a way to use read_tsv() to deal with this case.

WISH: Make it possible to "fill" data rows with empty values/NAs when data rows lack trailing cells. This would assume the missing ones are at the end, cf. argument fill of read.table().
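For comparison, here is a minimal base R sketch of the requested behaviour. It assumes the first line should be treated as the header, so `header=TRUE` is passed alongside `fill=TRUE`; the variable names `txt` and `df` are only illustrative.

```r
# Case 1: three column names, but the data row has only two cells.
txt <- "a\tb\tc\n1\t2\n"

# header = TRUE takes the first line as column names; fill = TRUE pads
# short data rows with blank fields, which are then read as NA.
df <- read.table(text = txt, sep = "\t", header = TRUE, fill = TRUE)

names(df)     # column names taken from the header line
is.na(df$c)   # the padded trailing cell is missing
```

This is what the wish asks read_tsv() to do as well: keep the header as given and treat the missing trailing cells as NA.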

Case 2: Fewer column names than data columns

read.table() does not handle this. I don't think read_tsv() does either.

> read.table(text="a\tb\n1\t2\t3\t4\n", header=TRUE, fill=FALSE)
Error in read.table(text = "a\tb\n1\t2\t3\t4\n", header = TRUE, fill = FALSE) :
  more columns than column names
> read.table(text="a\tb\n1\t2\t3\t4\n", header=TRUE, fill=TRUE)
Error in read.table(text = "a\tb\n1\t2\t3\t4\n", header = TRUE, fill = TRUE) :
  more columns than column names
> read_tsv("a\tb\n1\t2\t3\t4\n")
Error: You have 2 column names, but 4 columns

WISH: Make it possible to "fill" column names with empty values/NAs when the header lacks trailing column names. This would assume the missing ones are at the end, cf. argument fill of read.table().
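One way to picture this wish is a small pre-processing step that extends the header before parsing. The sketch below uses base R only. The helper `fill_names()` and the `X<k>` placeholder naming are hypothetical, chosen for illustration; this is not readr's implementation.

```r
# Hypothetical helper (not part of readr or base R): extend a header with
# placeholder names X<k> until it has n_cols entries.
fill_names <- function(header, n_cols) {
  n_missing <- n_cols - length(header)
  if (n_missing <= 0) return(header)
  c(header, paste0("X", length(header) + seq_len(n_missing)))
}

# Case 2: two column names, but the data row has four cells.
txt <- "a\tb\n1\t2\t3\t4\n"
lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
header <- strsplit(lines[1], "\t", fixed = TRUE)[[1]]

# The widest data row decides how many column names are needed.
n_cols <- max(lengths(strsplit(lines[-1], "\t", fixed = TRUE)))

# Skip the original header line and supply the extended names instead.
df <- read.table(text = txt, sep = "\t", skip = 1, header = FALSE,
                 col.names = fill_names(header, n_cols))
names(df)
```

The field counting here is deliberately naive: splitting on tabs ignores quoting and drops trailing empty cells, so it is only meant to show the idea of padding the header at the end.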

Background

For a real-world example, please see https://gist.github.com/HenrikBengtsson/dabc383aaa958c0ed49a. The above examples are never-ending stories in my life.

@jennybc
Member

jennybc commented Jun 14, 2015

👍 currently fiddling with an instance of Case 1: More column names than data columns.

@hadley
Member

hadley commented Sep 3, 2015

I think both cases should be problems, not errors.

@hadley
Member

hadley commented Sep 3, 2015

How does this look?

read_csv(col_types = "ii", "a,b\n1")
#>   a  b
#> 1 1 NA
read_csv(col_types = "ii", "a,b\n1,2,3")
#> Warning: 1 problems parsing literal data. See problems(...) for more
#> details.
#>   a b
#> 1 1 2

read_csv("a,b\n1")
#>   a
#> 1 1
read_csv("a,b\n1,2,3")
#>   a b X3
#> 1 1 2  3

I guess they probably should all generate warnings :/

@hadley
Member

hadley commented Sep 3, 2015

A bit more progress

read_csv(col_types = "ii", "a,b\n1")
#>   a  b
#> 1 1 NA
read_csv(col_types = "ii", "a,b\n1,2,3")
#> Warning: 1 parsing failure (literal data)
#> row col  expected actual
#>   1   3 2 columns
#>   a b
#> 1 1 2

read_csv("a,b\n1")
#> Warning: 1 parsing failure (literal data)
#> row col       expected actual
#>  NA  NA 1 column names      2
#>   a
#> 1 1
read_csv("a,b\n1,2,3")
#> Warning: 1 parsing failure (literal data)
#> row col            expected actual
#>  NA   3 Missing column name
#>   a b X3
#> 1 1 2  3

@jennybc
Member

jennybc commented Sep 3, 2015

Looks good to me. Yes, it is helpful to be warned when col_types, the header row, and the data rows provide contradictory information about the number of variables.

@hadley
Member

hadley commented Sep 3, 2015

Final version:

read_csv(col_types = "ii", "a,b\n1")
#> Warning: 1 parsing failure.
#> row col  expected    actual
#>   1  -- 2 columns 1 columns
#>   a  b
#> 1 1 NA
read_csv(col_types = "ii", "a,b\n1,2,3")
#> Warning: 1 parsing failure.
#> row col  expected    actual
#>   1  -- 2 columns 3 columns
#>   a b
#> 1 1 2
read_csv(col_types = "ii", "a,b\n1,2,3,4")
#> Warning: 1 parsing failure.
#> row col  expected    actual
#>   1  -- 2 columns 4 columns
#>   a b
#> 1 1 2

read_csv("a,b\n1")
#> Warning: 1 parsing failure.
#> row col    expected      actual
#>  --  -- 1 col names 2 col names
#>   a
#> 1 1
read_csv("a,b\n1,2,3")
#> Warning: 1 parsing failure.
#> row col    expected      actual
#>  --  -- 3 col names 2 col names
#>   a b X3
#> 1 1 2  3
read_csv("a,b\n1,2,3,4")
#> Warning: 1 parsing failure.
#> row col    expected      actual
#>  --  -- 4 col names 2 col names
#>   a b X3 X4
#> 1 1 2  3  4

This is looking pretty good to me :)

(BTW I've been using reprex to make these code snippets and it's awesome!)

@hadley
Member

hadley commented Sep 3, 2015

Not quite right, but I'll finish it off tomorrow:

read_csv("a,b\n\n2,3")
#>    a  b
#> 1 NA NA
#> 2  2  3
read_csv("a,b\n\n\n2,3")
#> Warning: 1 parsing failure.
#> row col  expected    actual
#>   2  -- 2 columns 1 columns
#>    a  b
#> 1 NA NA
#> 2 NA NA
#> 3  2  3

@jennybc
Member

jennybc commented Sep 7, 2015

@HenrikBengtsson is the main sufferer but I agree this looks great. (Thanks for kind words re: reprex ... yeah, it certainly feels useful and ppl have given neat ideas and PRs already.)

@hadley hadley closed this as completed in a165309 Sep 23, 2015
@hadley
Member

hadley commented Sep 23, 2015

I'm pretty sure I got everything - please open a new issue if you discover a case I missed.

@HenrikBengtsson
Author

Awesome - thanks for this. I've confirmed that it works with my real-world data that originally triggered this issue. You just made life a bit less hard for quite a few people.

@lock lock bot locked and limited conversation to collaborators Sep 25, 2018