I recently cleaned up a huge amount of spam comments and posts on someone's WordPress website. Not just a few, but over 500K pages' worth.
Now there is a huge spike in GSC/GA4 for the site's 404 page because these pages are gone. How do I properly validate that this was fixed, so that these pages get removed from the index and the site works as normal? I can't redirect 500,000 pages.
There is no inherent superiority between 410s and 404s, in my opinion. We've done tests on this before, and from what I've read, others have also done the same, and nobody has been able to find any difference.
That being said, first make sure there is no longer a crawl path to the pages. If there isn't, and the pages follow a specific pattern, you could try the URL removal tool.
Sitemaps have been suggested, but submitting one still asks Google to crawl these URLs repeatedly, which will eat into your crawl budget.
I don't think 410s are fundamentally better than 404s. We tested this at one point, I have read about others doing the same, and no one seems to be able to detect a difference. I would definitely check that there is no remaining crawl path to the pages. Assuming there isn't, and they follow a particular pattern, maybe you could try the URL removal tool?
I have heard of people using sitemaps (suggested already), but then you are still asking Google to crawl these URLs (repeatedly), so this will ruin your crawl budget.
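To check whether the deleted URLs really do follow a pattern, something like the following rough sketch could help. It assumes you have a list of the deleted URLs (from a database export or server logs, for example); the `example.com` URLs and the function name `count_by_first_segment` are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_by_first_segment(urls):
    """Count URLs by their first path segment, e.g. '/forum-spam/'."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        prefix = "/" + segments[0] + "/" if segments else "/"
        counts[prefix] += 1
    return counts


# Hypothetical sample of deleted URLs (placeholders, not real data).
deleted = [
    "https://example.com/forum-spam/cheap-pills-1/",
    "https://example.com/forum-spam/cheap-pills-2/",
    "https://example.com/members/fake-user-9/",
]

for prefix, n in count_by_first_segment(deleted).most_common():
    print(prefix, n)
```

If a handful of prefixes cover nearly all of the spam, a prefix-based removal request becomes more practical than handling URLs one by one.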
If you have deleted a large number of pages from the site, you have two options. The first is to wait for Google to sort it out on its own. Personally, I don’t recommend this approach because it comes with several issues:
1. A 404 code informs the search engine that the requested page is not found on the server. However, search engines will continue to periodically check this page in case it becomes available again. Only after several attempts will they start to consider it deleted. This brings us to the second issue, which is the consumption of crawl budget.
2. All repeated crawls of these pages will consume your crawl budget. With such a large number of deleted pages, there’s a chance that you won’t have enough crawl budget left to crawl new pages.
The best solution in your case would be to proceed as follows:
1. All deleted pages should return a 410 response code instead of a 404. The fundamental difference is that a 410 code indicates that the page has been permanently deleted: the server knows it is no longer available and will not be available in the future. Therefore, search engines will not attempt to crawl it again.
2. Create a separate sitemap with all these URLs and submit it to Google Search Console (GSC).
This way, Google will find all these pages as quickly as possible, receive the 410 code, and forget about them faster.
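For step 2, here is a minimal sketch of how the sitemap files could be generated, assuming you already have the deleted URLs in a list. The sitemaps.org protocol caps each file at 50,000 URLs, so 500,000 URLs would need 10 files. The function name `build_sitemaps` and the `example.com` URLs are placeholders of mine, not anything WordPress or GSC provides.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # per-file limit from the sitemaps.org protocol


def build_sitemaps(urls, chunk_size=MAX_URLS_PER_SITEMAP):
    """Return a list of sitemap XML strings, each with at most chunk_size URLs."""
    documents = []
    for start in range(0, len(urls), chunk_size):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url in urls[start:start + chunk_size]:
            # ElementTree escapes characters such as '&' in the URL text.
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
        documents.append(ET.tostring(urlset, encoding="unicode"))
    return documents


# Hypothetical usage with placeholder URLs and a tiny chunk size.
sample = [f"https://example.com/spam-{i}/" for i in range(5)]
print(len(build_sitemaps(sample, chunk_size=2)), "sitemap files")
```

Each returned string would be saved as its own file (with an XML declaration added) and submitted in GSC. Once the URLs have dropped out of the index, it makes sense to remove this sitemap again.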
404 is the right thing to do. It indicates the page does not exist and should not be in the search index.
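Whichever code you go with, it is worth spot-checking that the deleted URLs really answer 404 or 410 and not 200. A rough sketch using only the Python standard library is below; the function names are mine, and `get_status` is injectable so the logic can be tried without hitting a live site.

```python
import urllib.error
import urllib.request

GONE_STATUSES = {404, 410}


def fetch_status(url, timeout=10):
    """Return the HTTP status code for a HEAD request to url."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx and 5xx responses arrive as exceptions


def still_reachable(urls, get_status=fetch_status):
    """Return (url, status) pairs for URLs that do not answer 404 or 410."""
    results = []
    for url in urls:
        status = get_status(url)
        if status not in GONE_STATUSES:
            results.append((url, status))
    return results
```

Calling `still_reachable(sample_of_deleted_urls)` on a random sample should come back empty. Note that `urlopen` follows redirects, so a deleted URL that redirects to the homepage will show up here with a 200.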