2. Second-pass noise removal

Reduce the use of Flash.

As we know, the first step a search engine takes in removing noise is stripping the HTML formatting, so the first step in improving a page's signal-to-noise ratio is to optimize the HTML code. This is the principle behind the common advice that code must conform to W3C standards, be concise, and use DIV+CSS. Many people write code this way only because an online article told them to, without knowing why. That is why I suggest learning the principles of SEO first (I know practice matters more than theory, but without theory, practice has no starting point). Removing noise from the code includes the following aspects:

1. First-pass noise removal

Reduce the use of JS, and encapsulate any JS code that must be used.

Signal-to-noise ratio is also basic SEO knowledge that we must understand. In principle, a search engine's crawling system first downloads the whole web page, then extracts the text by parsing and removing the HTML formatting, then removes noise, segments the text into words, and finally adds the result to the index library. Because denoising is part of this process, it is clear that the higher the signal-to-noise ratio, the more efficiently the search engine spider can crawl. The spider handles a very large number of documents every day, so extracting the subject information quickly is an important task.
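The processing steps described above can be sketched roughly as follows. This is only an illustration, not how any real search engine works: the download step is skipped (the HTML is passed in as a string), the `NOISE_TAGS` set, the `BodyText` class, and the `index_page` function are hypothetical names, and word segmentation is reduced to a simple regular expression.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: regions inside these tags are treated as noise.
NOISE_TAGS = {"script", "style", "nav", "footer"}


class BodyText(HTMLParser):
    """Strips HTML formatting and skips text inside noise regions."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._open_noise = []  # stack of currently open noise tags

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._open_noise.append(tag)

    def handle_endtag(self, tag):
        if self._open_noise and self._open_noise[-1] == tag:
            self._open_noise.pop()

    def handle_data(self, data):
        if not self._open_noise:
            self.parts.append(data)


def index_page(url, html, index):
    """Extract text, drop noise, segment into words, add to the index."""
    parser = BodyText()
    parser.feed(html)  # remove HTML formatting and noise regions
    parser.close()
    text = " ".join(parser.parts)
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # naive segmentation
    for token in set(tokens):
        index[token].add(url)  # inverted index: word -> set of URLs
    return tokens


index = defaultdict(set)
page = "<nav>Home Cart</nav><p>Signal to noise ratio</p><script>var x=1;</script>"
tokens = index_page("u1", page, index)
```

The less noise a page carries, the less work the extraction and denoising steps have to do before the subject text reaches the index.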


The signal-to-noise ratio refers not only to the ratio of text to code, but also to the ratio of useful information to useless information in the text content of the current page. What is useful information? Suppose the subject of this article is "signal-to-noise ratio" and the whole article is 1,000 words, while the text on the current page totals 2,000 words. The remaining 1,000 words have nothing to do with signal-to-noise ratio, and this irrelevant information is noise. So improving the signal-to-noise ratio has two aspects: code optimization and content optimization.
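The word-count example above works out as a simple fraction. A minimal sketch, where the function name `content_snr` is made up for illustration and "on-topic" is assumed to be something you have already judged by hand:

```python
def content_snr(on_topic_words: int, total_words: int) -> float:
    """Share of the page's words that belong to the page's subject."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    if not 0 <= on_topic_words <= total_words:
        raise ValueError("on_topic_words must be between 0 and total_words")
    return on_topic_words / total_words


# The example from the text: a 1,000-word article on a 2,000-word page.
print(content_snr(1000, 2000))  # 0.5
```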

A very common case is the product detail page on e-commerce websites. Some people who do SEO for e-commerce sites may not have noticed that below the product content there is often text such as delivery methods or ...

When a search engine extracts the text of a web page, it also analyzes that text in a second denoising pass, whose purpose is to determine the theme of the current page. So how do we help the search engine judge the theme of our page more accurately (this is also a question of relevance, that is, how to improve the relevance of the page)? The answer is to reduce the noise in the page content.



Reduce the number of nested DIV layers (many people do not understand the principle and blindly pursue DIV+CSS, which itself produces a large amount of redundant code).

Reduce the use of images.

Encapsulate the CSS code.
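For the point about nested DIV layers, one rough way to check a page is to measure how deeply its DIVs nest. The sketch below uses Python's standard `html.parser`. The names `DivDepth` and `max_div_depth` are hypothetical, and what counts as "too deep" is left to your own judgment.

```python
from html.parser import HTMLParser


class DivDepth(HTMLParser):
    """Tracks the deepest level of <div> nesting seen while parsing."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        # Ignore stray closing tags so the depth never goes negative.
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def max_div_depth(html: str) -> int:
    parser = DivDepth()
    parser.feed(html)
    parser.close()
    return parser.max_depth


nested = "<div><div><div><p>x</p></div></div></div>"
flat = "<div><p>x</p></div>"
print(max_div_depth(nested), max_div_depth(flat))  # 3 1
```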

A page's signal-to-noise ratio is the ratio of the page's text content to all of its HTML code.
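This definition can be estimated with a few lines of code. The sketch below is only one possible measurement, not a formula any search engine is known to use: it counts characters, treats the contents of `<script>` and `<style>` as code rather than text, and the names `TextExtractor` and `text_to_code_ratio` are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def text_to_code_ratio(html: str) -> float:
    """Return len(visible text) / len(all HTML), measured in characters."""
    if not html:
        return 0.0
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())  # collapse whitespace
    return len(text) / len(html)


lean = "<p>Hello</p>"
bloated = "<div><div><div><p>Hello</p></div></div></div><script>var a = 1;</script>"
print(text_to_code_ratio(lean) > text_to_code_ratio(bloated))  # True
```

Both pages carry the same text, but the extra DIV wrappers and inline script lower the ratio of the second one.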