Skip to content

使用方式

  • 参考配置文件编写好配置文件
  • 使用python main.py启动主程序

命令

query

list 列举spider清单

python main.py query list [OPTIONS]

run

spider 运行爬虫

python main.py run spider [OPTIONS] SPIDER_NAME

该命令将运行对应的spider
spider_name基于正则全文匹配,例如stock/basic/*将运行所有stock/basic下的spider

job 运行任务

python main.py run job [OPTIONS] JOB_NAME

该命令将从jobs.yaml文件中找到对应的job,然后运行该job下的所有spider

使用K8S

项目提供了基于Helm的部署模板,编写好values.yaml后直接部署到集群即可,会自动创建CronJob用于定时更新

使用Docker

安装Docker

yum install -y yum-utils
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
systemctl start docker

部署Clickhouse

拉取镜像

docker pull clickhouse/clickhouse-server:23.6.2.18-alpine

创建数据目录

mkdir -p /data0/clickhouse/data

创建配置文件目录(可选)

mkdir -p /data0/clickhouse/config

启动数据库

如果不需要挂载配置文件删除下列命令中的-v /data0/clickhouse/config:/etc/clickhouse-server/config.d

docker run -d --net=host -v /data0/clickhouse/data:/var/lib/clickhouse -v /data0/clickhouse/config:/etc/clickhouse-server/config.d --name clickhouse-server --ulimit nofile=262144:262144 clickhouse/clickhouse-server:23.6.2.18-alpine

tushare-integration

创建配置文件目录

mkdir -p /data0/tushare-integration/config

编写jobs.yaml和config.yaml

# jobs.yaml
cronjob:
  - cron_expr: Unsupported
    name: stock/basic
    spiders:
      - name: stock/basic/stock_basic
      - name: stock/basic/namechange
      - name: stock/basic/hs_const
      - name: stock/basic/trade_cal
      - name: stock/basic/stock_company
      - name: stock/basic/stk_managers
      - name: stock/basic/stk_rewards
      - name: stock/basic/new_share
      - name: stock/basic/bak_basic
  - cron_expr: Unsupported
    name: stock/financial
    spiders:
      - name: stock/financial/balancesheet
      - name: stock/financial/cashflow
      - name: stock/financial/income
      - name: stock/financial/express
      - name: stock/financial/forecast
      - name: stock/financial/dividend
      - name: stock/financial/fina_indicator
      - name: stock/financial/fina_audit
      - name: stock/financial/fina_mainbz
      - name: stock/financial/disclosure_date
  - cron_expr: Unsupported
    name: stock/market
    spiders:
      - name: stock/market/margin
      - name: stock/market/margin_detail
      - name: stock/market/margin_target
      - name: stock/market/top10_holders
      - name: stock/market/top10_floatholders
      - name: stock/market/top_list
      - name: stock/market/top_inst
      - name: stock/market/pledge_stat
      - name: stock/market/pledge_detail
      - name: stock/market/repurchase
      - name: stock/market/concept
      - name: stock/market/concept_detail
      - name: stock/market/block_trade
      - name: stock/market/stk_holdernumber
      - name: stock/market/stk_holdertrade
  - cron_expr: Unsupported
    name: stock/quotes
    spiders:
      - name: stock/quotes/daily
      - name: stock/quotes/weekly
      - name: stock/quotes/monthly
      - name: stock/quotes/adj_factor
      - name: stock/quotes/suspend_d
      - name: stock/quotes/hsgt_top10
      - name: stock/quotes/moneyflow
      - name: stock/quotes/moneyflow_hsgt
      - name: stock/quotes/stk_limit
      - name: stock/quotes/daily_basic
      - name: stock/quotes/ggt_top10
      - name: stock/quotes/ggt_daily
      - name: stock/quotes/bak_daily
  - cron_expr: Unsupported
    name: stock/special
    spiders:
      - name: stock/special/report_rc
      - name: stock/special/cyq_perf
      - name: stock/special/stk_factor
      - name: stock/special/ccass_hold
      - name: stock/special/ccass_hold_detail
      - name: stock/special/hk_hold
      - name: stock/special/limit_list_d
      - name: stock/special/stk_surv
      - name: stock/special/broker_recommend

# config.yaml
# TUSHARE相关配置
tushare_url: https://api.tushare.pro
tushare_point: 2000
tushare_token: ''

database:
  # 数据库配置
  db_type: 'clickhouse'
  host: '127.0.0.1'
  port: '8123'
  user: 'default'
  password: ''
  db_name: 'default'
  template_params: { }

reporters: [ ]
#  - "tushare_integration.reporters.FeishuWebHookReporter"

# 飞书Webhook配置
feishu_webhook: ""

# Scrapy配置
bot_name: tushare_integration
concurrent_requests: 1
concurrent_items: 100

downloader_middlewares:
  scrapy.downloadermiddlewares.retry.RetryMiddleware: null
  tushare_integration.middlewares.TushareRetryDownloaderMiddleware: 543

item_pipelines:
  tushare_integration.pipelines.TushareIntegrationFillNAPipeline: 298
  tushare_integration.pipelines.TransformDTypePipeline: 299
  tushare_integration.pipelines.TushareIntegrationDataPipeline: 300
  tushare_integration.pipelines.RecordLogPipeline: 301


# 重试配置
# Tushare的频次限制是一分钟为周期,重试只要大于这个周期理论上不会有任何问题
retry_enabled: true
retry_delay: 10
retry_times: 6

spider_modules:
  - "tushare_integration.spiders"

closespider_errorcount: 1

手动运行任务

docker run -d --net=host -v /data0/tushare-integration/config/jobs.yaml:/code/app/jobs.yaml -v /data0/tushare-integration/config/config.yaml:/code/app/config.yaml zhangbc/tushare-integration:0.0.3 python main.py run job stock/basic 

配置定时任务

配置定时任务时请避免定时任务同时启动,采集服务并发情况下可能会出现采集的数据异常

crontab -e

使用Crontab定时运行Job

将jobs.yaml和config.yaml文件放置在同一目录下,然后在crontab中添加上述命令