Since this is a beginner's guide, let's start with the basics.
Technical SEO is the work of optimizing your website so that search engines like Google can find, crawl, understand, and index your pages. The goal is to get your pages found and to improve their rankings.
Is technical SEO difficult? It depends. The basics are not hard to grasp, but technical SEO can get complex and hard to follow. This guide will keep things as simple as possible.
In this chapter, we will introduce how to ensure that search engines can effectively crawl your content.
Crawlers fetch content from pages and follow the links on those pages to find more pages, which is how they discover ever more content across the web. A few parts of this process are worth understanding.
Crawlers have to start somewhere. Generally, they build a list of all the URLs they find through links on pages. A secondary mechanism is sitemaps, which are created by users or various systems and contain lists of pages, giving crawlers another way to find URLs.
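For illustration, here is a minimal XML sitemap following the sitemaps.org protocol; the example.com URLs and dates are placeholders:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page-1/</loc>
    <lastmod>2023-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/page-2/</loc>
  </url>
</urlset>

You would typically reference the sitemap from robots.txt or submit it in Google Search Console so crawlers can find it.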
All URLs that need to be crawled or recrawled will be prioritized and added to the crawl queue. This is essentially an ordered list of URLs that Google wants to crawl.
The crawler itself is the mechanism that fetches the content of pages.
Processing systems handle tasks such as sending pages to the renderer, which loads a page much like a browser does, and extracting more URLs to crawl from the pages, which we will discuss later.
Rendering works like a browser loading a page, including its JavaScript and CSS files. This is done so that Google can see the content that most users will see.
The index is where the pages Google shows to users are stored.
There are several ways to control the content that can be crawled on your website.
A robots.txt file tells search engines which parts of your site they can and cannot access.
Note that if other pages link to a page blocked in robots.txt, Google may still index it even though it cannot crawl it. This can be confusing, but if you want to keep a page out of the index, refer to this guide and flowchart.
You can use a crawl-delay directive in robots.txt, which many crawlers support, to limit how often they crawl your pages. Unfortunately, Google ignores it; to adjust Google's crawl rate, you need to change the setting in Google Search Console as described here.
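As a sketch, a robots.txt file that blocks one directory for all crawlers and sets a crawl delay for those that honor it could look like this (the /admin/ path is just an example):

User-agent: *
Disallow: /admin/
Crawl-delay: 10

Keep in mind that Google ignores Crawl-delay, and a Disallow rule blocks crawling, not indexing.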
If you want certain users to be able to access a page, but not search engines, then you probably want one of these three options:
- Some kind of login system;
- HTTP authentication, which requires a password for access (see the sketch below);
- An IP whitelist, which only allows specific IP addresses to access the pages.
This type of setup is best suited for internal networks, member-only content, or staging, test, and development sites. It lets a specific group of users access the pages, while search engines cannot reach them and will not index them.
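As an example of the HTTP authentication option mentioned above, here is a minimal sketch for an Apache server using an .htaccess file; the password file path and realm name are placeholders for illustration:

AuthType Basic
AuthName "Restricted area"
AuthUserFile /path/to/.htpasswd
Require valid-user

Visitors, including crawlers, then need a valid username and password from the .htpasswd file before the server will show them the page.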
For Google specifically, the easiest way to see what it is crawling is to use the Google Search Console Crawl Stats report, which gives you more information about how your site is being crawled.
If you want to see all crawl activity on your website, you will need access to your server logs, and possibly a tool to make the data easier to analyze. If your host has a control panel like cPanel, you should be able to access the raw logs along with tools such as Awstats and Webalizer.
Every website has a different crawl budget, which is a combination of how often Google wants to crawl the site and how much crawling the site can handle. More popular pages and pages that change frequently are crawled more often, while pages that appear less popular or are poorly linked are crawled less frequently.
If crawlers notice signs of strain while crawling a website, they typically slow down or even stop crawling until conditions improve.
After pages are crawled, they are rendered and sent to the index. The index is the master list of pages that can show up in search results.
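If you prefer to dig into the raw logs yourself, here is a minimal Python sketch that counts which URLs Googlebot requested most often. It assumes a standard combined log format and a file named access.log, so adjust both for your server:

import re
from collections import Counter

# Path to the raw access log; adjust for your server.
LOG_FILE = "access.log"

# In the combined log format, the request line is quoted, e.g. "GET /page/ HTTP/1.1".
request_re = re.compile(r'"(?:GET|POST|HEAD) (\S+)')

counts = Counter()
with open(LOG_FILE, encoding="utf-8", errors="ignore") as f:
    for line in f:
        # Only look at hits whose user agent mentions Googlebot.
        if "Googlebot" not in line:
            continue
        match = request_re.search(line)
        if match:
            counts[match.group(1)] += 1

# Print the 20 most-crawled URLs.
for url, hits in counts.most_common(20):
    print(hits, url)

Note that anyone can fake the Googlebot user agent, so for anything important you would also want to verify that the requests really come from Google.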
Let's talk about the index.
In this chapter, we will discuss how to ensure your pages are indexed and check how they are indexed.
A robots meta tag is an HTML snippet that tells search engines how to crawl or index a page. It is placed in the <head> section of a webpage and looks like this:
<meta name="robots" content="noindex" />
When there are multiple versions of the same page, Google chooses one to store in its index. This process is called canonicalization, and the URL selected as the canonical is the one Google shows in search results. Google uses many different signals to choose the canonical URL.
The easiest way to see how Google has indexed a page is to use the URL Inspection tool in Google Search Console. It will show you the canonical URL Google has chosen.
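One of the best-known signals is the canonical link tag, which you place in the <head> of the duplicate versions to point at the URL you prefer (the example.com URL here is a placeholder):

<link rel="canonical" href="https://example.com/preferred-page/" />

Google treats this as a hint rather than a directive, so the other signals still matter.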
One of the most difficult things for SEO is determining priorities. There are many best practices, but some changes will have a greater impact on your rankings and traffic than others. Here are some factors I recommend prioritizing.
Make sure the pages you want people to find are indexed in Google. The first two chapters covered crawling and indexing, and that was exactly the point.
You can use the Indexability report in Site Audit to find pages that can't be indexed and the reasons why. The report is free in Ahrefs Webmaster Tools.
Over the life of a website, its URLs often change, and in many cases those old URLs have links pointing at them from other websites. If they aren't redirected to the current pages, those links are lost and no longer count toward your pages. Redirecting old URLs reclaims the lost links, making this one of the quickest link wins available. Here's how to find these opportunities:
Site Explorer -> yourdomain.com -> Pages -> Best by Links -> add a “404 not found” HTTP response filter. I usually sort this by “Referring Domains”.
Here's the result for 1800flowers.com:
Looking at the first URL in archive.org, I can see it was previously a Mother's Day page. By redirecting that one page to the current version, you would recover 225 links from 59 different websites, and there are many similar opportunities across the other pages.
You'll want to use 301 redirects to point the old URLs at the current pages and reclaim that lost link value.
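How you implement the redirect depends on your server. As one common sketch, on Apache you could add a rule like this to your .htaccess file; both paths here are hypothetical:

Redirect 301 /old-mothers-day-page/ https://example.com/mothers-day/

Most CMSs and hosting control panels also offer redirect tools if you can't edit the server configuration directly.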
Internal links are links from one page on your website to another. They help search engines find your pages and help those pages rank better. Site Audit has a report called Link Opportunities that helps you quickly find these internal linking opportunities.
Schema markup is code that helps search engines understand your content better, and it powers many features that can help your website stand out in search results. Google's search gallery shows the various search features and the schema a site needs in order to be eligible for them.
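As a sketch, schema markup is usually added as a JSON-LD script in the page's <head>. This hypothetical example marks a page up as an Article; all of the values are placeholders:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "The Beginner's Guide to Technical SEO",
  "author": {
    "@type": "Person",
    "name": "Jane Doe"
  },
  "datePublished": "2023-01-15"
}
</script>

You can check whether Google can read your markup with the Rich Results Test.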
The items we'll cover in this chapter are all worth paying attention to, but compared with the quick wins in the previous chapter, they may take more work and deliver smaller gains. That doesn't mean you shouldn't do them; it's simply to help you understand how to prioritize your work.
These are minor ranking factors, but they're still worth looking at for the sake of your users. They cover aspects of the website that affect user experience (UX).
Core Web Vitals are speed metrics and are part of the page experience signals that Google uses to measure user experience. These metrics measure: Largest Contentful Paint (LCP), Cumulative Layout Shift (CLS), and First Input Delay (FID).
HTTPS protects the communication between your browser and the server from being intercepted and tampered with by attackers. This provides confidentiality, integrity, and authentication for the vast majority of today's web traffic. You'll want your pages to load over HTTPS rather than HTTP.
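One common way to make sure pages load over HTTPS is to redirect all HTTP requests. Here is a sketch for Apache's .htaccess, assuming mod_rewrite is enabled:

RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

Nginx and most hosting panels have equivalent settings, and you'll also need a valid TLS certificate installed first.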
Any website that displays a lock icon in the address bar is using HTTPS.