# Website Categorization with Python and the Google NLP API

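In today's post I am going to show you how to categorize a set of websites using Python, the Google Cloud NLP API and the Google Translate API. In my view, this process is useful for SEOs who want to strengthen their off-page strategy, because it lets them explore link building opportunities and analyze backlink profiles based on topicality. It is not only helpful for SEO, though. It also supports other digital marketing activities: you can use it to find affiliation opportunities, or to analyze the websites in a display advertising campaign and identify the best performing types of sites where you would like to raise your bids.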

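When I first thought about building a Python script to automate website categorization, my idea was to use some sort of WhoIs API. I came across a WhoIs API that seemed to work well, but unfortunately its results for non-English websites were very unreliable, so I had to get more creative. Note, though, that if you only plan to categorize English websites, this API should work fine and would make the process much easier.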

For non-English websites, the logic we follow to categorize them is:

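1. We scrape some of their content tags with a Python module called Cloud Scraper. We use Cloud Scraper to try to avoid being blocked by Cloudflare.
2. Because the Google NLP API can only classify English content, we translate the scraped content with the Google Trans Python module (`googletrans`). Note: if you need to translate many websites and want a stable workflow, it is better to use the official Google Cloud Translation module.
3. We trim the content of each website to a maximum of 1,000 characters, which is enough for the Google NLP API to determine the website category.
4. We use the content classification method of the Google NLP API to get the assigned category and its confidence score. The closer the confidence score is to 100%, the more reliable the classification.
5. Finally, we read the output returned by the Google NLP API.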
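
The workflow above can be sketched as a single function that chains the steps. This is only an illustrative outline: `categorize_site` and its `scrape`, `translate` and `classify` parameters are hypothetical names of mine, standing in for the Cloud Scraper, Google Translate and Google NLP calls covered in the rest of the post.

```python
from typing import Callable, Optional, Tuple

MAX_CHARS = 1000  # character cap mentioned in the workflow


def categorize_site(
    url: str,
    scrape: Callable[[str], str],
    translate: Callable[[str], str],
    classify: Callable[[str], Optional[Tuple[str, float]]],
) -> Optional[Tuple[str, float]]:
    """Chain the workflow steps for one website.

    The three callables are placeholders (assumptions) for the real
    scraping, translation and classification code.
    """
    content = scrape(url)[:MAX_CHARS]         # scrape, then cap the length
    english = translate(content)[:MAX_CHARS]  # translate to English, cap again
    return classify(english)                  # category name and confidence


if __name__ == "__main__":
    # Stand-in functions so the outline runs without network access.
    result = categorize_site(
        "https://example.com",
        scrape=lambda url: "Noticias de deportes " * 100,
        translate=lambda text: text.replace("Noticias de deportes", "Sports news"),
        classify=lambda text: ("/News/Sports News", 0.79),
    )
    print(result)
```

Passing the steps in as functions keeps the outline independent of any particular scraping or translation library, so each piece can be swapped or tested on its own.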

Now that we know how we are going to categorize the websites, let's put it into practice!

## 1.- Scraping the content

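As explained in the workflow, the first thing we do is scrape the content of the websites with the Cloud Scraper module. Then we use Beautiful Soup to find the tags we are interested in: title, meta description, paragraphs and headers. Finally, we put all this content together, prioritizing the tags in the following order according to their relative importance within the text: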

1. Meta title
2. Meta description
3. Headers
4. Paragraphs

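We also limit the text to 1,000 characters, because this is enough to determine the type of website and it avoids spending more than one token (billing unit) on each call to the Google Cloud NLP API.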

    import cloudscraper
    from bs4 import BeautifulSoup

    url = "<your website>"  # replace with the URL you want to categorize

    scraper = cloudscraper.create_scraper()
    headers = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}

    def join_text(tags):
        # Join the text of every tag found, separated by ". "
        return ". ".join(tag.text for tag in tags)

    allthecontent = ""  # stays empty if the request or the parsing fails

    try:
        r = scraper.get(url, headers=headers)
        soup = BeautifulSoup(r.text, 'html.parser')

        # Meta title and meta description (either one may be missing)
        title_tag = soup.find('title')
        title = title_tag.text if title_tag else ""
        description_tag = soup.find('meta', attrs={'name': 'description'})
        description = description_tag.get("content", "") if description_tag else ""

        # Headers and paragraphs
        h1_all = join_text(soup.find_all('h1'))
        h2_all = join_text(soup.find_all('h2'))
        h3_all = join_text(soup.find_all('h3'))
        paragraphs_all = join_text(soup.find_all('p'))

        # Concatenate in priority order and keep the first 1,000 characters
        allthecontent = " ".join([title, description, h1_all, h2_all, h3_all, paragraphs_all])
        allthecontent = allthecontent[:1000]

    except Exception as e:
        print(e)
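
The priority ordering and the 1,000-character cap can also be isolated in a small helper, which makes them easy to check without scraping anything. This is a minimal sketch: `build_content` is a hypothetical name, and unlike the script above it skips empty fragments so they do not leave stray spaces.

```python
def build_content(title, description, headers, paragraphs, limit=1000):
    """Join page fragments in priority order and cap the result.

    `headers` and `paragraphs` are lists of strings; empty or missing
    fragments are skipped so they do not add stray separators.
    """
    parts = [title, description, ". ".join(headers), ". ".join(paragraphs)]
    text = " ".join(part.strip() for part in parts if part and part.strip())
    return text[:limit]


# Illustrative input only: a long paragraph forces the cap to apply.
sample = build_content(
    title="Example title",
    description="",
    headers=["First header", "Second header"],
    paragraphs=["Some long paragraph text. " * 200],
)
print(len(sample))
```

Because the highest-priority fragments come first, truncating at the end drops paragraph text before it touches the title or the meta description.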

## 2.- Translating the content

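In this second stage we translate the content. For that we use the Google Trans module although, as mentioned earlier, I would recommend the official Cloud Translation module because it is a much more stable solution. The Google Trans module can get blocked if too many calls are made in a very short time, so if you run the script iteratively to categorize websites in bulk, I recommend using the time module to make the script sleep between calls (around 10 seconds between one call and the next can be enough for a fair number of websites).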

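    import time
    from googletrans import Translator

    translator = Translator()

    try:
        # googletrans detects the source language and translates to English by default
        translation = translator.translate(allthecontent).text
        translation = str(translation)[:1000]
        time.sleep(10)  # pause between calls to reduce the risk of being blocked

    except Exception as e:
        print(e)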
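
For bulk runs, the pause between calls can be wrapped in a small loop. The sketch below is an assumption about how one might structure it, not part of the original script: `translate_in_bulk` and `cloud_translate` are hypothetical names, and the second function assumes the `google-cloud-translate` package is installed with valid credentials configured.

```python
import time


def translate_in_bulk(texts, translate, pause_seconds=10, sleep=time.sleep):
    """Translate each text, pausing between calls.

    `texts` maps a website name to its scraped content. `translate` is any
    callable that takes a string and returns its English translation.
    `sleep` is injectable so the pause can be skipped in tests.
    Failed translations are stored as None instead of stopping the batch.
    """
    results = {}
    items = list(texts.items())
    for position, (site, text) in enumerate(items):
        try:
            results[site] = translate(text)[:1000]
        except Exception as error:  # keep going if one site fails
            print(f"{site}: {error}")
            results[site] = None
        if position < len(items) - 1:
            sleep(pause_seconds)  # no pause needed after the last call
    return results


def cloud_translate(text):
    """Translate with the official Cloud Translation client (Basic edition).

    Assumes GOOGLE_APPLICATION_CREDENTIALS points to valid credentials.
    """
    from google.cloud import translate_v2  # imported lazily: optional dependency

    client = translate_v2.Client()
    result = client.translate(text, target_language="en", format_="text")
    return result["translatedText"]
```

With the Google Trans module, the call would look like `translate_in_bulk(scraped, lambda text: translator.translate(text).text)`, where `scraped` is your own dictionary of scraped content. Swapping in `cloud_translate` changes the backend without touching the loop.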


## 3.- Categorizing the websites with the Google NLP API

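Finally, we use the Google NLP API to categorize the translated text from the website. I learned about this API thanks to [this article by Greg Bernhardt](https://importsem.com/getting-started-with-google-nlp-api-using-python/), so I recommend reading it if you want to know more about everything the API can offer. Before you start categorizing websites, make sure you create your API credentials on the [Google Cloud](https://cloud.google.com/) platform; the code expects the path to the credentials file.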

To categorize the website we use the following code:

```python
import os
from google.cloud import language_v1

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "<path to your credentials file>"

try:
    text_content = str(translation)[:1000]

    client = language_v1.LanguageServiceClient()

    # Written for google-cloud-language 2.x and later; version 1.x exposed
    # these values through the `enums` module, which has since been removed.
    type_ = language_v1.Document.Type.PLAIN_TEXT
    language = "en"
    document = {"content": text_content, "type_": type_, "language": language}

    response = client.classify_text(request={"document": document})

    if response.categories:
        # Print the first category returned and its confidence as a percentage
        category = response.categories[0]
        print(category.name)
        print(str(round(category.confidence * 100)) + "%")
    else:
        print("No category returned")

except Exception as e:
    print(e)
```
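
Reading the output mostly comes down to picking a category and presenting its confidence. The sketch below shows one way to do that on plain Python data. `Category`, `best_category` and `as_percentage` are hypothetical helpers, and the sample values are illustrative rather than real API output. The same functions should work on any objects that expose `name` and `confidence` attributes.

```python
from typing import Iterable, NamedTuple, Optional


class Category(NamedTuple):
    """Stand-in for one entry of `response.categories` (name plus confidence)."""
    name: str
    confidence: float


def best_category(categories: Iterable[Category]) -> Optional[Category]:
    """Return the category with the highest confidence, or None if there is none."""
    return max(categories, key=lambda category: category.confidence, default=None)


def as_percentage(confidence: float) -> str:
    """Format a 0-1 confidence as a whole percentage, rounding instead of truncating."""
    return f"{round(confidence * 100)}%"


# Illustrative values only, not taken from a real response.
categories = [
    Category("/News", 0.57),
    Category("/News/Sports News", 0.79),
]
top = best_category(categories)
if top is not None:
    print(top.name, as_percentage(top.confidence))
```

Rounding matters here because multiplying a float by 100 can land just under a whole number, and truncating with `int()` would then report one percentage point less.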

That is all! If we put it into practice and run the script on some sample websites, we get:

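- Marca.com: Spanish sports newspaper. Classified by the NLP API as News/Sports News with 79% confidence.
- Happypancake.se: Swedish dating website. Classified by the NLP API as Online Communities/Dating & Personals with 99% confidence.
- Mavcsoport.hu: Hungarian website dedicated to selling train tickets. Classified by the NLP API as Travel/Bus & Rail with 60% confidence.
- Ilcasalingodivoghera.it: Italian food website. Classified by the NLP API as Food & Drink with 76% confidence.
- Autoscout24.de: German vehicle website. Classified by the NLP API as Autos & Vehicles/Vehicle Shopping/Used Vehicles with 96% confidence.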
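
For the off-page and display use cases mentioned at the start, results like these are easier to act on once grouped by topic. Here is a minimal sketch that groups the five sample results by top-level category and drops low-confidence ones. `group_by_top_level` is a hypothetical helper and the 0.75 threshold is an arbitrary assumption, not a recommended value.

```python
from collections import defaultdict

# The sample results listed above: (website, category path, confidence as a fraction)
results = [
    ("Marca.com", "News/Sports News", 0.79),
    ("Happypancake.se", "Online Communities/Dating & Personals", 0.99),
    ("Mavcsoport.hu", "Travel/Bus & Rail", 0.60),
    ("Ilcasalingodivoghera.it", "Food & Drink", 0.76),
    ("Autoscout24.de", "Autos & Vehicles/Vehicle Shopping/Used Vehicles", 0.96),
]


def group_by_top_level(results, min_confidence=0.0):
    """Group websites by the first segment of their category path."""
    groups = defaultdict(list)
    for site, category, confidence in results:
        if confidence < min_confidence:
            continue  # skip classifications below the chosen threshold
        # strip("/") also handles paths written with a leading slash
        top_level = category.strip("/").split("/")[0]
        groups[top_level].append(site)
    return dict(groups)


print(group_by_top_level(results, min_confidence=0.75))
```

With the threshold at 0.75, Mavcsoport.hu (60%) is left out, which is the kind of case you may want to review by hand.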

Author: Daniel Heredia
Source: https://www.danielherediamejias.com/website-categorization-python/