How to Find All Existing and Archived URLs on a Website

There are plenty of reasons you might need to find all the URLs on a website, but your exact goal will determine what you're looking for. For example, you might want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Gather current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and hard to extract data from.

In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.
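If an old sitemap does turn up, pulling its URLs into a flat list takes only a few lines. Here's a minimal Python sketch, assuming a standard sitemap saved locally (the old-sitemap.xml filename is a placeholder):

```python
# Extract all <loc> URLs from a saved sitemap file.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

tree = ET.parse("old-sitemap.xml")  # placeholder filename
urls = [loc.text.strip() for loc in tree.getroot().findall(".//sm:loc", NS) if loc.text]

with open("sitemap-urls.txt", "w") as f:
    f.write("\n".join(urls))

print(f"Recovered {len(urls)} URLs")
```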

Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To work around the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
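Alternatively, the Wayback Machine exposes a CDX API you can query directly, which sidesteps the missing export button. A minimal sketch (the domain is a placeholder; large sites may need the API's paging options on top of this):

```python
# Pull archived URLs for a domain from the Wayback Machine CDX API.
import requests

resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/*",  # swap in your domain
        "output": "json",
        "fl": "original",        # return only the original URL
        "collapse": "urlkey",    # deduplicate by normalized URL
    },
    timeout=60,
)
rows = resp.json()
urls = [row[0] for row in rows[1:]]  # first row is the header
print(f"Found {len(urls)} archived URLs")
```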

Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method often works well as a proxy for Googlebot's discoverability.
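Once you have the export, extracting the unique target URLs is trivial with pandas. A sketch, assuming the file is named moz-inbound-links.csv and has a "Target URL" column (check both against your actual export, as the names here are assumptions):

```python
# Extract unique target URLs from a Moz Pro inbound links export.
import pandas as pd

df = pd.read_csv("moz-inbound-links.csv")      # placeholder filename
targets = df["Target URL"].dropna().unique()   # column name may differ in your export
pd.Series(targets).to_csv("moz-target-urls.txt", index=False, header=False)
print(f"{len(targets)} unique target URLs")
```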

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
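If you outgrow the UI export, the API's searchanalytics.query method can page through every URL with impressions. A sketch, assuming a service-account JSON key with read access to the property (the file path and property URL are placeholders):

```python
# Page through Search Console performance data to list every URL with impressions.
from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "service-account.json",  # placeholder path to your key file
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

site = "https://example.com/"  # your verified property
pages, start_row = [], 0
while True:
    resp = service.searchanalytics().query(
        siteUrl=site,
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-03-31",
            "dimensions": ["page"],
            "rowLimit": 25000,  # API maximum per request
            "startRow": start_row,
        },
    ).execute()
    rows = resp.get("rows", [])
    pages.extend(r["keys"][0] for r in rows)
    if len(rows) < 25000:  # last page reached
        break
    start_row += 25000

print(f"{len(pages)} pages with impressions")
```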

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Better still, you'll be able to apply filters to build diverse URL lists, efficiently surpassing the 100k limit. For example, if you need to export only blog URLs, observe these actions:

Phase 1: Incorporate a section on the report

Move two: Click on “Create a new phase.”


Phase three: Outline the phase that has a narrower URL pattern, for instance URLs containing /blog site/


Be aware: URLs present in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they supply beneficial insights.
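If building segments in the UI feels fiddly, the same filtered pull can be scripted with the GA4 Data API. A sketch, assuming the google-analytics-data Python client, default application credentials, and a placeholder property ID:

```python
# Pull blog page paths from GA4 via the Data API, filtered to /blog/.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()  # uses GOOGLE_APPLICATION_CREDENTIALS

request = RunReportRequest(
    property="properties/123456789",  # placeholder property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                match_type=Filter.StringFilter.MatchType.CONTAINS,
                value="/blog/",
            ),
        )
    ),
    limit=100000,  # per the 100k-row cap mentioned above
)
response = client.run_report(request)
paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(paths)} blog paths")
```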

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive record of every URL path requested by users, Googlebot, or other bots during the recorded period.

Challenges:

Data size: Log files can be huge, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but several tools are available to simplify the process; a bare-bones example follows this list.
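As a bare-bones illustration of what those tools do, here's a sketch that extracts unique request paths from a common/combined-format access log (the filename is a placeholder):

```python
# Extract unique request paths from a combined-format access log.
import re

# Matches the request line, e.g. "GET /blog/post HTTP/1.1"
REQUEST_RE = re.compile(r'"[A-Z]+ (\S+) HTTP/[^"]+"')

paths = set()
with open("access.log") as f:  # placeholder filename
    for line in f:
        m = REQUEST_RE.search(line)
        if m:
            # Drop query strings so /page?a=1 and /page?a=2 collapse together.
            paths.add(m.group(1).split("?")[0])

print(f"{len(paths)} unique paths")
```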
Merge, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
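If you go the Jupyter route, here's a minimal sketch of that merge-normalize-deduplicate step with pandas; the input filenames are placeholders for however you saved each list:

```python
# Combine URL lists from every source, normalize, and deduplicate.
import pandas as pd
from urllib.parse import urlsplit, urlunsplit

SOURCES = ["sitemap-urls.txt", "moz-target-urls.txt", "gsc-pages.txt", "log-paths.txt"]

def normalize(url: str) -> str:
    """Lowercase scheme/host, drop fragments, trim trailing slashes."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

urls = pd.concat(
    [pd.read_csv(src, header=None, names=["url"]) for src in SOURCES],
    ignore_index=True,
)
urls["url"] = urls["url"].astype(str).map(normalize)
deduped = urls.drop_duplicates().sort_values("url")
deduped.to_csv("all-urls.csv", index=False)
print(f"{len(deduped)} unique URLs")
```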

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
