There are several reasons you might need to find all the URLs on a website, but your exact purpose will determine what you're looking for. For example, you might want to:
Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.
In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.
Old sitemaps and crawl exports
If you're looking for URLs that disappeared from the live site recently, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.
Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.
However, there are a few limitations:
URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
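If you'd rather skip the browser plugin, the Wayback Machine's CDX API exposes the same data programmatically. Below is a minimal sketch, assuming example.com stands in for your domain and that filtering to 200-status snapshots is what you want; adjust the parameters to suit your site.

```python
import requests

# Query the Wayback Machine CDX API for URLs archived under a domain.
# "example.com" is a placeholder; tune limit/filters for your site.
params = {
    "url": "example.com",
    "matchType": "domain",      # include subdomains; use "prefix" for a single host/path
    "output": "json",
    "fl": "original",           # return only the original URL field
    "collapse": "urlkey",       # one row per URL rather than one per snapshot
    "filter": "statuscode:200", # skip redirects and errors
    "limit": 10000,
}
resp = requests.get("https://web.archive.org/cdx/search/cdx", params=params, timeout=60)
rows = resp.json()

# The first row is a header; the remaining rows each hold one URL.
urls = [row[0] for row in rows[1:]]
print(f"Retrieved {len(urls)} archived URLs")
```

The collapse=urlkey parameter is what keeps the output to a deduplicated URL list instead of every snapshot Archive.org holds.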
Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.
How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.
It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this approach generally works well as a proxy for Googlebot's discoverability.
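Once you have the export, a few lines of pandas can boil it down to a deduplicated list of target URLs. This is only a sketch: the file name and the "Target URL" column header are assumptions, so match them to whatever your actual Moz export contains.

```python
import pandas as pd

# Load the inbound links export from Moz Pro.
# "moz_inbound_links.csv" and the "Target URL" column name are assumptions;
# check them against your real export before running.
links = pd.read_csv("moz_inbound_links.csv")

# Keep only the pages on your own site that received links, deduplicated.
target_urls = links["Target URL"].dropna().drop_duplicates()
target_urls.to_csv("moz_target_urls.csv", index=False, header=False)
print(f"{len(target_urls)} unique target URLs")
```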
Google Search Console
Google Search Console offers several useful exports for building your list of URLs.
Links reports:
Similar to Moz Pro, the Links section provides exportable lists of target URLs. However, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.
Performance → Search Results:
This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
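For larger sites, a short script against the Search Analytics endpoint can page through far more rows than the UI export allows. The sketch below is an illustration, assuming a service account key that has been granted read access to the property; the key path, property URL, and date range are placeholders.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Placeholders: point these at your own key file and verified property.
KEY_FILE = "service-account.json"
SITE_URL = "https://www.example.com/"

creds = service_account.Credentials.from_service_account_file(
    KEY_FILE, scopes=["https://www.googleapis.com/auth/webmasters.readonly"]
)
gsc = build("searchconsole", "v1", credentials=creds)

pages, start_row = set(), 0
while True:
    body = {
        "startDate": "2024-01-01",
        "endDate": "2024-03-31",
        "dimensions": ["page"],
        "rowLimit": 25000,      # API maximum per request
        "startRow": start_row,  # paginate until no rows come back
    }
    rows = gsc.searchanalytics().query(siteUrl=SITE_URL, body=body).get("rows", [])
    if not rows:
        break
    pages.update(row["keys"][0] for row in rows)
    start_row += len(rows)

print(f"{len(pages)} pages with search impressions")
```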
Indexing → Pages report:
This section provides exports filtered by issue type, though these are also limited in scope.
Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for gathering URLs, with a generous limit of 100,000 URLs.
Even better, you can apply filters to create specific URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:
Step 1: Add a segment to the report
Step 2: Click "Create a new segment."
Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/
Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they still provide valuable insights.
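If you prefer to script this, the GA4 Data API can run the same filtered report. The following is a rough sketch assuming the google-analytics-data Python client and a placeholder property ID; the /blog/ filter mirrors the segment described above.

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest
)

# Placeholder property ID; credentials are read from the
# GOOGLE_APPLICATION_CREDENTIALS environment variable.
PROPERTY_ID = "123456789"

client = BetaAnalyticsDataClient()
request = RunReportRequest(
    property=f"properties/{PROPERTY_ID}",
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-03-31")],
    # Restrict the report to blog URLs, mirroring the /blog/ segment above.
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                value="/blog/",
                match_type=Filter.StringFilter.MatchType.CONTAINS,
            ),
        )
    ),
    limit=100000,
)
response = client.run_report(request)
blog_paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(blog_paths)} blog paths")
```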
Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.
Considerations:
Data size: Log files can be huge, so many sites only keep the last two months of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process.
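If you just need a quick list of requested paths without dedicated log-analysis tooling, a few lines of Python will do. This sketch assumes an Apache-style common/combined log format and a file named access.log; CDN logs often use different formats and will need their own parser.

```python
import re
from collections import Counter

# Matches the request line in common/combined log format, e.g.
# 127.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /blog/post HTTP/1.1" 200 ...
REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST) ([^ "]+) HTTP/[^"]+"')

paths = Counter()
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        match = REQUEST_RE.search(line)
        if match:
            paths[match.group(1)] += 1

print(f"{len(paths)} unique paths requested")
for path, hits in paths.most_common(10):
    print(hits, path)
```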
Combine, and good luck
Once you've gathered URLs from these sources, it's time to combine them. If your site is small enough, use Excel or, for larger datasets, tools like Google Sheets or Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
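If you take the Jupyter Notebook route, a sketch along these lines can merge the exports, normalize formatting, and deduplicate. The file names are placeholders (each assumed to hold one URL per line with no header), and the normalization choices here (lowercase scheme and host, strip query strings and trailing slashes) are assumptions; align them with how your site actually treats URLs.

```python
from urllib.parse import urlsplit, urlunsplit
import pandas as pd

# Placeholder export files from the tools above; one URL per line, no header.
SOURCES = ["archive_org.csv", "moz_target_urls.csv", "gsc_pages.csv", "ga4_paths.csv"]

def normalize(url: str) -> str:
    """Lowercase the scheme and host, drop query string and fragment, strip trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))

frames = [pd.read_csv(f, header=None, names=["url"]) for f in SOURCES]
urls = pd.concat(frames, ignore_index=True)["url"].dropna().astype(str)

deduped = urls.map(normalize).drop_duplicates().sort_values()
deduped.to_csv("all_urls.csv", index=False, header=False)
print(f"{len(deduped)} unique URLs")
```

Note that relative paths (such as GA4's pagePath values) won't deduplicate against absolute URLs unless you prefix them with your domain first.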
And voilà, you now have a comprehensive list of current, old, and archived URLs. Good luck!