Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

layerCor problem with pearson correlation when NA values are present #1034

Closed
frousseu opened this issue Feb 24, 2023 · 5 comments
Closed

layerCor problem with pearson correlation when NA values are present #1034

frousseu opened this issue Feb 24, 2023 · 5 comments

Comments

@frousseu
Copy link
Contributor

Hi!

While searching for a faster way to correlate two rasters, I found a bug in layerCor for pearson correlations when NA values are present. Here is an example:

library(terra)
set.seed(1234)
n <- 10
r1 <- rast(matrix(1:n ^ 2, ncol = n))
r2 <- rast(matrix(1:n ^ 2, ncol = n) + runif(n ^ 2, 10, 100))
samp1 <- sample(1:ncell(r1), 50)
samp2 <- sample(1:ncell(r2), 50)
r1[samp1] <- NA
r2[samp2] <- NA
cor(values(r1)[, 1], values(r2)[, 1], use = "complete.obs", method = "pearson")
## [1] 0.7631978
layerCor(c(r1, r2), "pearson", na.rm = TRUE)$pearson
##          lyr.1    lyr.1
## lyr.1 1.000000 0.862859
## lyr.1 0.862859 1.000000

Or is it because there is an underlying filling/interpolation that is done which changes the values? The same correlation is found when there are no missing values or when the missing values are the same in both rasters. Sometimes, the correlation even goes over 1.

Thanks!
François

@rhijmans
Copy link
Member

Thanks. I removed the NA's for each layer, instead of removing the cells that are NA in any of the two layers in a pair. I believe this is fixed now, and I get:

layerCor(c(r1, r2), "pearson", na.rm = TRUE)$pearson
#          lyr.1     lyr.1
#lyr.1 1.0000000 0.7631978
#lyr.1 0.7631978 1.0000000

This could be make much more efficient by running this in one step in C++

@frousseu
Copy link
Contributor Author

Thanks for the fix. You mean using a cpp function to compute the correlation? I will look into that.

@rhijmans
Copy link
Member

I have now implemented pearson in C++; making it much faster, I think.

@frousseu
Copy link
Contributor Author

Before I think it was a bit slower than directly using cor and now it is a bit faster, but not by much. More importantly, there still seems to be something off with the value obtained. If I use n = 1000 or n = 5000, the cor is the same, but with n = 10000 the values are different. The break seems to appear somewhere over n = 6000 (on my machine at least). This is with version 1.7-20 from R-Universe.

library(terra)
set.seed(1234)
n <- 10000
r1 <- rast(matrix(runif(n ^ 2), ncol = n))
r2 <- r1+rast(matrix(runif(n ^ 2,0,0.5), ncol = n))
system.time({c1<-cor(values(r1)[, 1], values(r2)[, 1], use = "complete.obs", method = "pearson")});c1
## utilisateur     système      écoulé 
##       10.23        3.51       13.94 
## [1] 0.8944202
system.time({c2<-layerCor(c(r1, r2), "pearson", na.rm = TRUE)$pearson});c2
## utilisateur     système      écoulé 
##        6.84        3.70       10.68 
##           [,1]      [,2]
## [1,] 1.0000000 0.2250657
## [2,] 0.2250657 1.0000000

rhijmans added a commit that referenced this issue Mar 13, 2023
@rhijmans
Copy link
Member

Thanks so much for checking that. The mistake was for cases where the rasters are large and get processed in chunks. I blieve this is fixed now

library(terra)
set.seed(1234)
n <- 100
r1 <- rast(matrix(runif(n ^ 2), ncol = n))
r2 <- r1+rast(matrix(runif(n ^ 2,0,0.5), ncol = n))
cor(values(r1)[, 1], values(r2)[, 1], use = "complete.obs", method = "pearson")
#[1] 0.8956437
layerCor(c(r1, r2), "pearson")$pearson[2]
#[1] 0.8956437

## simulating a large dataset
terraOptions(steps=4)
layerCor(c(r1, r2), "pearson")$pearson[2]
#[1] 0.8956437
terraOptions(steps=0)

It also works for your example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants