
The crawler grabs a dynamically loaded long page and generates an HTML report directly

2021-08-27 03:00:37 Lei Xuewei

This is day 23 of my participation in the August More-Text Challenge. Check out the event details: August More-Text Challenge

Lei Xuewei previously wrote the articles "Hot list long-page crawler screenshot" and "Fast and elegant HTML report development".

This time we go a bit bigger: crawl the hot list and save it as a report for review.

First, a look at the result:


Let's get started!

Step 1: Generate a report

That's right: set the crawler aside for now, mock up some data out of thin air, and generate the report first.

Save the following code as report.py; this module name will be imported later.
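The code below depends on the third-party dominate package, which is not in the standard library. If it is not already installed, it can be installed from PyPI:

```shell
pip install dominate
```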

from dominate.tags import *

"""Lei Xuewei's helper that generates the HTML report"""
def generate_html(tuples):
    _html = html()
    _head = head()
    _head.add(title("CSDN hot list report compiled by Lei Xuewei"))
    _head.add(meta(charset="utf-8"))
    _html.add(_head)
    _body = _html.add(body())
    _table = table(border=1)
    with _table.add(tbody()):
        index = 0
        for tp in tuples:
            index += 1
            leiXW = tr()
            leiXW += td(str(index))
            leiXW += td(a(tp[1], href=tp[0]))  # tp is a (link, title) tuple
    with _body.add(div(cls="leixuewei")):
        h3("CSDN hot list compiled by Lei Xuewei")
    _body.add(_table)
    return _html.render()

"""Lei Xuewei's helper that generates and saves the report in one call"""
def lei_report(leixuewei_tuples, path):
    data = generate_html(leixuewei_tuples)
    with open(path, "w", encoding="utf-8") as f:
        f.write(data)


if __name__ == "__main__":
    lxw_tuples = []
    lxw_tuples.append(("https://blog.csdn.net/geeklevin/article/details/119594295", "Lei Xuewei: generating HTML reports with Python"))
    lxw_tuples.append(("https://blog.csdn.net/geeklevin/article/details/116771659", "Tired of Docker? Try Vagrant"))
    path = "./csdn_rank.html"
    lei_report(lxw_tuples, path)

Code parsing

The code above generates an HTML page and saves it to the path given by the path variable.

  1. Prepare an array of two-element tuples
  2. Pass it to the generate_html function, which builds a document with a head and a body; the body iterates over the input array to generate a table
  3. Write the rendered output to a file
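The same numbered-link table can be sketched with plain stdlib string formatting, which also makes the structure generate_html produces explicit. Note this is an illustrative alternative, not the article's code: generate_html_plain is a made-up name, and html.escape is used to stay safe with arbitrary titles.

```python
from html import escape

def generate_html_plain(tuples):
    """Build the same numbered link table as generate_html,
    but with stdlib string formatting instead of dominate."""
    rows = []
    for index, (link, title) in enumerate(tuples, start=1):
        # Each row: running index, then the title linking to the article
        rows.append(
            '<tr><td>{}</td><td><a href="{}">{}</a></td></tr>'.format(
                index, escape(link, quote=True), escape(title)))
    return ('<table border="1"><tbody>\n'
            + "\n".join(rows)
            + "\n</tbody></table>")

print(generate_html_plain([("https://example.com", "Example post")]))
```

The tuple order matches the article's convention: tp[0] is the link, tp[1] the title.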

The result looks like this:


Step 2: Adapt the previous crawler code

This is the core code from the "Hot list long-page crawler screenshot" article; let's adapt it directly.

'''
Core screenshot code for crawling streaming (infinite-scroll) pages.
driver and img_path are assumed to be set up as in the previous article.
'''
from time import sleep

def resolve_height(driver, pageh_factor=5):
    js = "return action=document.body.scrollHeight"
    height = 0
    page_height = driver.execute_script(js)
    ref_pageh = int(page_height * pageh_factor)
    step = 150
    max_count = 15
    count = 0
    while count < max_count and height < page_height:
        # scroll down towards the page bottom, step by step
        for i in range(height, ref_pageh, step):
            count += 1
            vh = i
            slowjs = 'window.scrollTo(0, {})'.format(vh)
            print('[Lei Xuewei Demo]exec js: %s' % slowjs)
            driver.execute_script(slowjs)
            sleep(0.3)
        if i >= ref_pageh - step:
            print('[Lei Xuewei Demo]not fully read')
            break
        height = page_height
        sleep(2)
        page_height = driver.execute_script(js)
    print("finish scroll")
    return page_height

# Get the actual height of the page
page_height = resolve_height(driver)
print("[Lei Xuewei Demo]page height : %s" % page_height)
sleep(5)
driver.execute_script('document.documentElement.scrollTop=0')
sleep(1)
driver.save_screenshot(img_path)
page_height = driver.execute_script('return document.documentElement.scrollHeight')  # page height
print("get accurate height : %s" % page_height)

# The code above is from the previous article

# Import the report function from report.py
from report import lei_report

# Scroll to the bottom of the page
driver.execute_script(f'document.documentElement.scrollTop={page_height};')
sleep(1)
driver.save_screenshot('./leixuewei_rank_end.png')
# Note: in Selenium 4 this is driver.find_elements(By.XPATH, ...)
blogs = driver.find_elements_by_xpath("//div[@class='hosetitem-title']/a")

# Build the array of (link, title) tuples
articles = []
for blog in blogs:
    link = blog.get_attribute("href")
    title = blog.text
    articles.append((link, title))

print('get %s articles' % len(articles))
print('articles : %s ' % str(articles))

# Generate the HTML report at the given path
path = "./leixuewei_csdn_rank.html"
lei_report(articles, path)
print("Saved hot list report to: %s" % path)

"""Lei Xuewei demo code. If this saved you some work, please like, bookmark, and follow!"""

Code parsing

Compared with the previous article's streaming-page crawler, the screenshot-merging code segment has been removed.

Now for the key part. The steps:

  1. The crawler scrolls straight to the bottom, collects each link, and builds an array
  2. It then takes a screenshot at the end of the page, which you can keep as a souvenir
  3. It imports and calls the lei_report function to generate the page

This part is fairly simple, so I won't walk through it line by line.
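Selenium is needed here because the hot list loads dynamically, but the link-extraction step itself can be illustrated on a static HTML snapshot with only the standard library. The sketch below mirrors the XPath //div[@class='hosetitem-title']/a used above; HotListParser is a hypothetical name introduced for this example:

```python
from html.parser import HTMLParser

class HotListParser(HTMLParser):
    """Collect (href, text) pairs from <a> tags inside
    <div class="hosetitem-title"> blocks in static HTML."""
    def __init__(self):
        super().__init__()
        self.in_item = False      # inside a hot-list title div
        self.in_link = False      # inside an <a> within such a div
        self.current_href = None
        self.current_text = []
        self.articles = []        # collected (link, title) tuples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "hosetitem-title":
            self.in_item = True
        elif tag == "a" and self.in_item:
            self.in_link = True
            self.current_href = attrs.get("href")
            self.current_text = []

    def handle_data(self, data):
        if self.in_link:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link = False
            self.articles.append(
                (self.current_href, "".join(self.current_text).strip()))
        elif tag == "div" and self.in_item:
            self.in_item = False

snippet = ('<div class="hosetitem-title">'
           '<a href="https://example.com/post">Sample post</a></div>')
parser = HotListParser()
parser.feed(snippet)
print(parser.articles)
```

The resulting articles list has the same (link, title) shape that lei_report expects.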

The result looks like this:

The report is long, so the screenshots show only the beginning and the end.

Summary: this article is worth a second look

This article is for demonstration purposes only. If the demo site has any objections, please let me know and I will make changes.

Finally, be cautious with crawlers: don't treat crawling official or institutional websites as child's play. When studying, don't hammer real sites with heavy traffic; that kind of behavior can land you in jail!

By the way, for long-term reading you can also follow => Lei Xuewei's collection of fun programming stories
or => Lei Xuewei's NodeJS series

Keep learning, keep growing. I'm Lei Xuewei!
Programming is fun; the key is to understand the technology thoroughly.
Creating content isn't easy, so please like, bookmark, and support Lei Xuewei!

Copyright notice
Author: Lei Xuewei. Please include the original link when reprinting, thank you.
https://en.qdmana.com/2021/08/20210827030035077s.html
