-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for flexible queries in terra::query
?
#1157
Comments
Apologies to divert a little ... I came here to make a case for also including the SQL dialect, so that (say) SQLITE functions can be used no matter what the format ... but then remembered can do this by VRT construction (within limits of terra passing the VRT directly to GDAL, and in what it will accept as vect output): url <- "/vsicurl/https://minio.carlboettiger.info/biodiversity/World-Protected-Areas-May2023.gdb"
terra::vect(url, proxy = TRUE)
# class : SpatVectorProxy
# geometry : polygons
# dimensions : 273868, 30 (geometries, attributes)
# extent : 175.7782, 175.7809, -37.42824, -37.4256 (xmin, xmax, ymin, ymax)
# source : World-Protected-Areas-May2023.gdb (WDPA_WDOECM_poly_May2023_all)
# coord. ref. : lon/lat WGS 84 (EPSG:4326)
# names : WDPAID WDPA_PID PA_DEF NAME ORIG_NAME DESIG DESIG_ENG DESIG_TYPE IUCN_CAT INT_CRIT (and 20 more)
# type : <num> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# Warning message:
# [vect] Reading layer: WDPA_WDOECM_poly_May2023_all
# Other layers: WDPA_WDOECM_point_May2023_all, WDPA_WDOECM_source_May2023_all
## (SHAPE is what the geometry presents as the geometry column to GDAL from GDB)
vrt <- '<OGRVRTDataSource>
<OGRVRTLayer name="limit10">
<SrcDataSource>/vsicurl/https://minio.carlboettiger.info/biodiversity/World-Protected-Areas-May2023.gdb</SrcDataSource>
<SrcSQL dialect="sqlite">SELECT WDPAID, SHAPE from WDPA_WDOECM_poly_May2023_all LIMIT 10</SrcSQL>
</OGRVRTLayer>
</OGRVRTDataSource>'
terra::vect(vrt)
# class : SpatVector
# geometry : polygons
# dimensions : 10, 1 (geometries, attributes)
# extent : 170.4084, 176.8883, -45.79126, -37.4256 (xmin, xmax, ymin, ymax)
# source : OGRVRTDataSource> (limit10)
# coord. ref. : lon/lat WGS 84 (EPSG:4326)
# names : WDPAID
... I think DISTINCT won't work well because of the format, and 'ConvexHull(SHAPE)' as an example is not accepted by terra - but there's a lot of scope to explore here. Users can expose what can and can't be done with GDAL via terra, which seems a positive seam for contributions here. |
I have added a new argument "sql" to
|
Wow that was fast! Trying it right now |
Indeed, works like a charm. Thank you @rhijmans |
wonderful, thanks @rhijmans . I feel this pretty much addresses the top issue so feel free to close unless you think it's worth further considerations of syntax for these queries. |
It is now also possible, by specifying
Thank you much for the ideas for lazy evaluation, but I will leave pondering about that for another day. |
@rhijmans thanks once again for this amazing package. Very impressed at the power of proxy objects when working with large remote vector layers, e.g. I find syntax like this compelling:
letting us instantly see the file metadata without downloading the rather large object or bringing it all in to RAM. If I want to do some more exploration on the remote object, e.g. look at unique values of the "NAMES" variable, it seems the only option to do so is to start again with the query:
In other contexts, we could instead have piped
v
toterra::query()
, but it seems to have no ability to support arbitrary queries likeSELECT DISTINCT
. It would be great ifquery()
could gain an additional argument to take an arbitrary OGR-SQL query, like you already allow us to do insidevect
.I also think a more complicated but more elegant solution would be to permit the use of common dplyr verbs that can be translated to SQL, the way
dbplyr
andarrow
already do such meta-programming translations. Then the above query could be:etc. Of course there would be edge cases that don't work (i.e. where some dplyr code that would work on a tibble cannot be translated into an OGR-SQL query), but that is already the case for users using dplyr to interact with, say, SQL databases or arrow/Parquet files) .
Relatedly, it looks like
query()
is always non-lazy interra
? I find the pattern used indplyr::collect()
to trigger evaluation (also used by most other paradigms with lazy-evaluation) to be really convenient. It's already nice thatvect()
supports lazy evaluation this way, but it might be nice to flush out the support for lazy operations more thoroughly. (e.g. for a more explicitly spatial data example, I find the lazy-evaluation ingdalcubes
really thoughtful and clever).And of course apologies if there are already ways to do this kind of lazy eval in terra that I'm overlooking!
The text was updated successfully, but these errors were encountered: