-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
99 lines (71 loc) · 3.16 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
---
output: rmarkdown::github_document
---
```{r include=FALSE}
knitr::opts_chunk$set(
fig.width=10, fig.retina=2, message=FALSE, warning=FALSE, collapse=TRUE
)
```
# psl
Extract Internet Domain Components Using the Public Suffix List
## Description
The 'Public Suffix List' (<https://publicsuffix.org/>) is a collection of top-level domains ('TLDs') which include global top-level domainsa ('gTLDs') such as '.com' and '.net'; country top-level domains ('ccTLDs') such as '.de' and '.cn'; and, brand top-level domains such as '.apple' and '.google'. Tools are provided to extract internet domain components using the public suffix list base data.
- `libpsl`: <https://github.com/rockdaboot/libpsl>
- Public Suffix List: <https://publicsuffix.org/>
## What's Inside The Tin
The following functions are implemented:
- `apex_domain`: Return the apex/top-private domain from a vector of domains
- `is_public_suffix`: Test whether a domain is a public suffix
- `public_suffix`: Return the public suffix from a vector of domains
- `suffix_extract`: Separate a domain into component parts
- `suffix_extract2`: Separate a domain into component parts (urltools compatible output)
## PRE-Installation
You need a recent `libpsl`.
- macOS: `brew install libpsl libicu4c && brew link icu4c --force`
- Debian/Ubuntu-ish: Many repos have old versions so it's _highly_ suggested that you build from source and ensure the library & header files are accessible
- Windows: Just use `urltools::suffix_extract()` until winlibs are available for psl
## Installation
```{r eval=FALSE}
remotes::install_github("hrbrmstr/psl")
```
```{r message=FALSE, warning=FALSE, error=FALSE, include=FALSE}
options(width=120)
```
## Usage
```{r message=FALSE, warning=FALSE, error=FALSE}
library(psl)
library(tidyverse)
# current verison
packageVersion("psl")
```
```{r message=FALSE, warning=FALSE, error=FALSE}
doms <- c(
"", "com", "example.com", "www.example.com",
".com", ".example", ".example.com", ".example.example", "example",
"example.example", "b.example.example", "a.b.example.example",
"biz", "domain.biz", "b.domain.biz", "a.b.domain.biz", "com",
"example.com", "b.example.com", "a.b.example.com", "uk.com",
"example.uk.com", "b.example.uk.com", "a.b.example.uk.com", "test.ac",
"cy", "c.cy", "b.c.cy", "a.b.c.cy", "jp", "test.jp", "www.test.jp",
"ac.jp", "test.ac.jp", "www.test.ac.jp", "kyoto.jp", "test.kyoto.jp",
"ide.kyoto.jp", "b.ide.kyoto.jp", "a.b.ide.kyoto.jp", "c.kobe.jp",
"b.c.kobe.jp", "a.b.c.kobe.jp", "city.kobe.jp", "www.city.kobe.jp",
"ck", "test.ck", "b.test.ck", "a.b.test.ck", "www.ck", "www.www.ck",
"us", "test.us", "www.test.us", "ak.us", "test.ak.us", "www.test.ak.us",
"k12.ak.us", "test.k12.ak.us", "www.test.k12.ak.us"
)
apex_domain(doms)
public_suffix(doms)
is_public_suffix(doms)
suffix_extract(doms)
str(suffix_extract2(doms)) # urltools compatible output
```
```{r bench, message=FALSE, warning=FALSE, error=FALSE, fig.width=10, fig.retina=2}
library(microbenchmark)
microbenchmark(
urltools = urltools::suffix_extract(doms),
psl = psl::suffix_extract(doms), # returns more data
psl2 = psl::suffix_extract2(doms) # returns what urltools does
) -> mb
autoplot(mb)
```