Instagram Scraper

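Instagram is currently one of the most important social networks in the world, especially in Western countries. With more than 1 billion monthly active users and about 500 million daily active users, it is a big opportunity for brands to connect with potential customers, raise brand awareness and visibility, and build customer loyalty. On the other hand, Instagram can also be a great place to find interesting allies, in the form of brand ambassadors, collaborations with influencers and business partners, or sales opportunities.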

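In this post I am going to show you how to get the most out of Instagram to boost your business performance by automating most of the time-consuming and tedious tasks, such as analyzing your competitors' most successful posts, tracking your competitors' stories, or scraping Instagram profiles to extract data such as the number of followers, the number of posts or an email address (if one appears in the biography), so that you can find the right people to make your brand visible, to partner with or to create sales opportunities.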

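To do so, I will walk you through a command-line application called Instagram Scraper and some Python scripts that manipulate the JSON data obtained with Instagram Scraper or simply change the way the output is delivered.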

1.- Installing Instagram Scraper and running the first commands

Installing Instagram Scraper is very simple; you only need to run the following command in your terminal:

`pip install instagram-scraper`

Easy, right? Once it is installed, you can already scrape your first account with the following commands:

instagram-scraper <username> #Scraping is done anonymously.
instagram-scraper <username> -u <your username> -p <your password> #You log in to your account and scrape from there. Can be useful for private accounts.

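With the commands above you can scrape the posts from an Instagram account's feed. For example, I ran the command instagram-scraper psg and it downloaded all the images and videos from the PSG Instagram account.

If you want to scrape the posts that were uploaded with a specific hashtag, you can also use this command:

`instagram-scraper <hashtag> --tag #Important! Don't include # when you enter the hashtag.`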

2.- Taking Instagram Scraper to the next level with arguments

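If you already think Instagram Scraper is cool, you have not seen anything yet! The commands above are mainly the base that we need to combine with arguments to take this tool to the next level. In this post I will go through the arguments I find most useful; you can check the full list on the instagram-scraper GitHub page.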

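Batch scraping with a txt file: with the -f argument you can pass the location of a txt file that contains a list of Instagram accounts to scrape. For the file to be readable, the accounts can be separated by new lines, commas, semicolons or spaces. Example command: instagram-scraper -f instagram_users.txt
Downloading stories: if you want to download the stories of an account (highlights and daily stories), you need to use the command instagram-scraper <username> -u <your username> -p <your password> -t story. As you can see, anonymous scraping is not allowed in this case; stories can only be scraped when you are logged in.
Downloading the number of likes and comments: with the basic instagram-scraper <user_name> query you only download the images and videos, without any information about them. If you add the --media-metadata argument, then in addition to the media you also get a JSON file with information such as the number of likes, the number of comments, the owner ID... The command is: instagram-scraper <user_name> --media-metadata, or, if you are only interested in the JSON file, you can run: instagram-scraper <user_name> --media-metadata --media-types none.
Downloading comments from posts: with the --comments argument you can also download the comments on the posts. For example: instagram-scraper <user_name> --comments.
Others: Instagram Scraper has other auxiliary arguments for using proxies, limiting the number of posts to scrape, filtering by location or choosing the destination folder for the final output.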
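
If you end up combining these arguments from a script, a small helper that assembles the command can save some typing. The following is only a sketch: build_command and its parameter names are made up for this example, and it uses only the flags mentioned in this post (-f, -u, -p, -t story, --tag, --media-metadata, --media-types none and --comments).

```python
import shlex


def build_command(target=None, users_file=None, login_user=None, login_pass=None,
                  tag=False, stories=False, metadata=False, no_media=False,
                  comments=False):
    """Return an instagram-scraper command as a list of arguments.

    Hypothetical helper: it only assembles the flags described in this post.
    """
    if (target is None) == (users_file is None):
        raise ValueError("pass exactly one of target or users_file")
    if stories and not (login_user and login_pass):
        raise ValueError("stories can only be scraped when logged in")

    cmd = ["instagram-scraper"]
    cmd += [target] if target else ["-f", users_file]
    if login_user and login_pass:
        cmd += ["-u", login_user, "-p", login_pass]
    if tag:
        cmd.append("--tag")
    if stories:
        cmd += ["-t", "story"]
    if metadata:
        cmd.append("--media-metadata")
    if no_media:
        cmd += ["--media-types", "none"]
    if comments:
        cmd.append("--comments")
    return cmd


if __name__ == "__main__":
    cmd = build_command("psg", metadata=True, no_media=True)
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Returning a list instead of a single string means the command can be handed to subprocess.run without going through the shell, so a password with special characters does not need to be escaped by hand.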

3.- Some applications and Python scripts

3.1.- Tracking your competitors

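You might be interested in tracking what your competitors publish on Instagram every day, to make sure you do not miss any opportunity to connect with followers, to respond to the promotions or posts they upload, or simply to keep a close eye on the strategy they follow to engage their followers.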

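To do this, you can schedule an Instagram Scraper command to run every day and store the new posts or stories they upload in a folder. If you want to keep all the images and/or videos in the same folder, a simple command is actually enough:

instagram-scraper <username> -u <your username> -p <your password> -t story --latest #For stories
instagram-scraper <username> --latest #Publication feeds

The --latest argument looks for the most recent media in the destination folder and only stores the resources uploaded after that date. However, if you would rather have a dedicated folder for each day, you may need a Python script that you run after the standard Instagram Scraper command (for stories, instagram-scraper <username> -u <your username> -p <your password> -t story). The script checks which media files were uploaded today and keeps only those, in a folder named after the day. You can find an example below: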

import os
import time

def day_stamp(timestamp=None):
    #Returns the date part of time.ctime(), e.g. "Mon Jan  1 2024". With no argument it uses the current time.
    text = time.ctime(timestamp)
    return text[0:10] + " " + text[-4:]

directory = "psg" #Folder that Instagram Scraper created for the account.
today = day_stamp()

for name in os.listdir(directory): #We list the files in the folder and read their modification date.
    path = os.path.join(directory, name)
    if day_stamp(os.path.getmtime(path)) != today: #If the file was not modified today, we delete it.
        os.remove(path)
        print("removed")

os.rename(directory, directory + " " + today) #We rename the folder so that it includes the date.
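
If deleting files feels too drastic, a non-destructive variant is to move every file into a sub-folder named after its day instead. This is a sketch of that alternative, not the script above: sort_into_daily_folders is a hypothetical helper, and it assumes, like the script above, that the modification time of each file reflects the day you care about.

```python
import os
import shutil
import time


def sort_into_daily_folders(directory):
    """Move each file in `directory` into a sub-folder named after the day
    of its modification time (YYYY-MM-DD). Returns the (file, day) moves made.
    """
    moves = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip the day folders created on earlier runs
        modified = time.localtime(os.path.getmtime(path))
        day = time.strftime("%Y-%m-%d", modified)
        day_dir = os.path.join(directory, day)
        os.makedirs(day_dir, exist_ok=True)
        shutil.move(path, os.path.join(day_dir, name))
        moves.append((name, day))
    return moves


if __name__ == "__main__":
    # "psg" is the folder used in the example above.
    for name, day in sort_into_daily_folders("psg"):
        print(name, "->", day)
```

Because nothing is removed, the function can be run again after every scrape; files that were already sorted stay where they are.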

3.2.- Finding partners, influencers or sales opportunities

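Something very useful for finding partners, influencers or sales opportunities is to look for posts that use a hashtag relevant to your industry, scrape the post information, and then scrape the profiles behind those posts: number of posts, followers, following, the biography and whether it contains an email address.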

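First we need to go through the JSON file obtained from scraping the hashtag, which returns all the posts that use it. As a reminder, the command is: instagram-scraper <tag> --media-metadata --tag --media-types none. One of the variables returned for each post is the owner ID, which we can use to link the post to an Instagram account. The code below gets the owner IDs from the posts (if you also want the number of likes and comments, you need to find the specific keys while iterating over the elements of the JSON file and store the variables you are interested in).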

import json

json_path = "<your_JSON_file>" #Path to the JSON file produced by the command above.
with open(json_path) as json_file:
    data = json.load(json_file) #JSON file is loaded.

listowners = []
for post in data["GraphImages"]:
    listowners.append(post["owner"]["id"]) #We iterate through the posts and store the owner IDs.

listowners = list(dict.fromkeys(listowners)) #We remove duplicate owner IDs while keeping their order.
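
If you also want the likes and comments mentioned above, one option is to add them up per owner while iterating. The sketch below assumes the hashtag JSON uses the same keys as the account JSON in section 3.3 (edge_media_preview_like and edge_media_to_comment); engagement_by_owner and the sample data are hypothetical.

```python
from collections import defaultdict


def engagement_by_owner(data):
    """Sum posts, likes and comments per owner ID.

    `data` is the dictionary loaded from the instagram-scraper JSON file.
    Posts with a missing key simply count as 0 for that value.
    """
    totals = defaultdict(lambda: {"posts": 0, "likes": 0, "comments": 0})
    for post in data.get("GraphImages", []):
        owner_id = post.get("owner", {}).get("id")
        if owner_id is None:
            continue  # a post without owner cannot be linked to an account
        entry = totals[owner_id]
        entry["posts"] += 1
        entry["likes"] += post.get("edge_media_preview_like", {}).get("count", 0)
        entry["comments"] += post.get("edge_media_to_comment", {}).get("count", 0)
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample in the shape described above.
    sample = {"GraphImages": [
        {"owner": {"id": "1"}, "edge_media_preview_like": {"count": 10},
         "edge_media_to_comment": {"count": 2}},
        {"owner": {"id": "1"}, "edge_media_preview_like": {"count": 5},
         "edge_media_to_comment": {"count": 1}},
        {"owner": {"id": "2"}, "edge_media_preview_like": {"count": 7}},
    ]}
    print(engagement_by_owner(sample))
```

Sorting the result by likes or comments gives a quick shortlist of the accounts that get the most engagement under the hashtag.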

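Now that we have a list with the owner IDs of the posts under a specific hashtag, we need to link each owner ID to an Instagram account. To do so, we can make use of the endpoint https://i.instagram.com/api/v1/users/<Owner_ID>/info/. Translated into Python code, it looks like this: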

import json
from bs4 import BeautifulSoup
import cloudscraper

#We need to use an Instagram user agent to be able to access this URL.
headers = {'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.3.8 (KHTML, like Gecko) Mobile/14G60 Instagram 12.0.0.16.90 (iPhone9,4; iOS 10_3_3; en_US; en-US; scale=2.61; gamut=wide; 1080x1920)'}
parser = 'html.parser'
scraper = cloudscraper.create_scraper() #One scraper session is reused for every request.

listaccounts = []
for x in listowners:
    html = scraper.get("https://i.instagram.com/api/v1/users/" + str(x) + "/info/", headers=headers) #We request the URL.
    soup = BeautifulSoup(html.text, parser).get_text() #We extract the text of the response.
    jsondata = json.loads(soup) #We parse that text as JSON.
    listaccounts.append(jsondata["user"]["username"]) #We append the usernames.

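Finally, now that we have the usernames, we can get some information about each Instagram user by adding the query parameter ?__a=1 to the profile URL. In the code below we scrape the biography, the external URL, the number of followers and of accounts followed, and the category (if it is a business account) from an Instagram account, and we use a regex to extract an email address if the profile shows one, for example in the biography.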

import re

#This block reuses json, BeautifulSoup, scraper, headers and parser from the previous block.
html = scraper.get("https://www.instagram.com/<username>/?__a=1", headers=headers) #We request this URL, which returns the Instagram profile as structured data.
soup = BeautifulSoup(html.text, parser).get_text()
jsondata = json.loads(soup)

#Keys for each of the variables mentioned above
user = jsondata["graphql"]["user"]
biography = user["biography"]
externalurl = user["external_url"]
followers = user["edge_followed_by"]["count"]
following = user["edge_follow"]["count"]
businessaccount = user["is_business_account"]
category = user["overall_category_name"]
emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.com", soup, re.I) #We look for .com email addresses in the returned profile text.
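
The regex above only matches addresses that end in .com. If you also care about other domains, a slightly broader pattern can be used. This is a rough sketch rather than a full email validator; EMAIL_RE and find_emails are hypothetical names introduced for this example.

```python
import re

# Local part, "@", one or more domain labels, then a final label made of at
# least two letters. Deliberately simple: it will not cover every valid address.
EMAIL_RE = re.compile(r"[a-z0-9._+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}", re.I)


def find_emails(text):
    """Return the distinct email addresses found in `text`, lowercased,
    in order of first appearance."""
    seen = {}
    for match in EMAIL_RE.findall(text or ""):
        seen.setdefault(match.lower(), None)
    return list(seen)


if __name__ == "__main__":
    bio = "Collabs: Hello@Example.co.uk or sales@shop.com"
    print(find_emails(bio))
```

Passing only the biography to find_emails, instead of the whole response text, keeps the matches limited to what the account owner wrote there.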

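Disclaimer: if you intend to use these email addresses to contact users or as audiences for Google Ads or paid social campaigns, you need to be careful: the European Union has specific legislation, the GDPR, that regulates how personal information is collected and processed.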

3.3.- Analyzing your competitors' best-performing posts

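Learning from your competitors and/or from success stories can be a very good exercise, because you will find out which creatives or formats are the most engaging and you can adapt your strategy accordingly. As a reminder, the Instagram Scraper command that you need to run first, which returns the JSON data of the posts published by the Instagram account you are interested in, is: instagram-scraper <user_name> --media-metadata --media-types none.

In the example below we iterate over the JSON file, get the media URL (including the file name), the number of likes and the number of comments, and store everything in a CSV file that can be opened in Excel, where it is easier to analyze.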

import json
import pandas as pd

with open('psg.json') as json_file:
    data = json.load(json_file) #We load the JSON file.

listposts = []
for post in data["GraphImages"]: #We iterate through the posts and get the variables through their keys.
    try:
        media = post["display_url"]
        likes = post["edge_media_preview_like"]["count"]
        comments = post["edge_media_to_comment"]["count"]
        listposts.append([media, likes, comments])
    except KeyError: #We skip the posts where one of the keys is missing.
        continue

df = pd.DataFrame(listposts, columns=["Media", "Likes", "Comments"])
df.to_csv('TestJason.csv', index=False) #We store the list in a CSV file (readable by Excel) by using Pandas.
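
To go straight to the best-performing posts without opening the file, the rows can be ranked before they are saved. The sketch below uses plain Python on rows shaped like listposts ([media, likes, comments]); top_posts is a hypothetical helper, and adding likes and comments into one score is just one possible way to measure engagement.

```python
def top_posts(rows, n=5):
    """Return the n rows with the highest likes + comments.

    Each row is [media_url, likes, comments], as in listposts above.
    """
    return sorted(rows, key=lambda row: row[1] + row[2], reverse=True)[:n]


if __name__ == "__main__":
    # Made-up rows for illustration.
    rows = [
        ["https://example.com/a.jpg", 120, 4],
        ["https://example.com/b.jpg", 300, 25],
        ["https://example.com/c.jpg", 90, 60],
    ]
    for media, likes, comments in top_posts(rows, n=2):
        print(media, likes, comments)
```

With the DataFrame from the script above, the same ranking can be obtained with df.assign(Engagement=df["Likes"] + df["Comments"]).sort_values("Engagement", ascending=False).head(5).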

Translated from an online article. Author:
Original: https://www.danielherediamejias.com/scraping-on-instagram-with-instagram-scraper-and-python/