Dynamic replacement of urls specified in sitemap #248

Closed
Firq-ow opened this issue Oct 24, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

Firq-ow commented Oct 24, 2024

Clear and concise description of the problem

I am running a setup where my site is deployed to http://localhost:8081/ during my CI runs. Unlighthouse is configured to run against this url.
I also provide a sitemap which is automatically generated during build and points to the actual url https://example.com/.

The issue is that Unlighthouse refuses to parse and use the sitemap because its origin differs from the configured site (as stated in the logs). As a result, all deeper nested routes are missed because Unlighthouse does not know they exist.

Suggested solution

I would love to have the option to reconfigure unlighthouse so that any routes in the sitemap get re-written to match a given override. It may look similar to this:

export default {
  site: "http://localhost:8081",
  sitemap_origin: "https://example.com"
}

The parameter name could be anything in this regard (sitemap_origin, sitemap_override, ...).

Alternatively, unlighthouse could automatically try to use the sitemap, replacing the origin with the given site. However, I would prefer an explicit solution over some implicit replacement.

Alternative

A current workaround is to add all possible routes to the config in advance. However, this is very tedious and doesn't scale well with sites that generate routes on the fly, for example when content collections are being used.

Additional context

Dynamically reading the links from the pages is not an option in this context, as the links are already pointing to https://example.com/ when the site finishes building.

Feel free to reach out for further details; I can provide project information if necessary.

@Firq-ow Firq-ow added the enhancement New feature or request label Oct 24, 2024
harlan-zw (Owner) commented Jan 1, 2025

I've pushed up a fix to support async functions for the config definition so you can just fetch the sitemap, parse the URLs and transform them however you like. I think this is the best solution to reduce the maintenance of the project.

For example:

// unlighthouse.config.ts
export default async () => {
  // Fetch the sitemap as raw XML text.
  const sitemap = await (await fetch('https://harlanzw.com/sitemap.xml')).text()
  // Extract every <loc> value; match() returns null when nothing matches.
  const urls = (sitemap.match(/<loc>(.*?)<\/loc>/g) ?? []).map(loc => loc.replace(/<\/?loc>/g, ''))
  return {
    site: 'harlanzw.com',
    urls,
  }
}
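For the setup described in the original report (sitemap pointing at https://example.com/, site served from http://localhost:8081), the "transform them however you like" step might look roughly like the sketch below. The helper names `extractLocs` and `rewriteOrigin` are hypothetical and not part of Unlighthouse; the sample XML is made up for illustration.

```typescript
// Hypothetical helper: collect every <loc> value from a sitemap XML string.
function extractLocs(xml: string): string[] {
  const pattern = /<loc>\s*([^<]*?)\s*<\/loc>/g;
  const locs: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    locs.push(match[1]);
  }
  return locs;
}

// Hypothetical helper: keep path, query and hash, but swap in the target origin.
function rewriteOrigin(urls: string[], targetOrigin: string): string[] {
  const origin = new URL(targetOrigin).origin;
  return urls.map((raw) => {
    const url = new URL(raw);
    return new URL(url.pathname + url.search + url.hash, origin).toString();
  });
}

// Made-up sitemap content standing in for the generated file.
const sample = `
<urlset>
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/post?x=1</loc></url>
</urlset>`;

const localUrls = rewriteOrigin(extractLocs(sample), 'http://localhost:8081');
console.log(localUrls);
// [ 'http://localhost:8081/', 'http://localhost:8081/blog/post?x=1' ]
```

Inside an async config like the one above, the result could presumably be returned as `urls` with `site` set to the local origin; whether that fits a given setup is an assumption to verify.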

mwskwong commented Feb 8, 2025

@harlan-zw While the above workaround works, IMO, it is still beneficial to expose an option to disable the sitemap origin checking.

One common use case for Unlighthouse is to incorporate it into a CI/CD pipeline and trigger it automatically after a successful deployment. Most deployment platforms generate a unique URL for each deployment, and naturally that is the URL Unlighthouse targets. However, that URL is not necessarily the main entry point of the site, and the sitemap URL specified in robots.txt generally uses the main entry URL.

E.g. the main entry point can be example.com, while the auto generated one is example-.vercel.app.

So a sitemap origin that differs from the site origin can be fairly common.
