
Python handles serious anti-crawling (T-Sec firewall + CSS background-image offset positioning)

2021-08-25 13:10:20 Xiao Ming (code entity)

Hello everyone, I'm Xiaoming. I came across a website today:

[Screenshot: the hotel page, where each price digit is rendered from a CSS background image]

It's quite something: each digit is displayed by using CSS to cut a small tile out of a shared background image. What is certain is that each digit tile is 8×17 pixels.
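To make the trick concrete, here is a minimal sketch of how a digit could be recovered once you know its background offset. The 8×17 tile size comes from the measurement above; the example offset and file name are placeholders, not values from the site:

from PIL import Image

DIGIT_W, DIGIT_H = 8, 17  # digit tile size measured above

def digit_from_offset(sprite, w, h):
    # The browser shows the sprite region starting at (w, h) as the digit,
    # so cropping that same region recovers it.
    return sprite.crop((w, h, w + DIGIT_W, h + DIGIT_H))

# sprite = Image.open("digits_sprite.png")      # hypothetical local copy of the background image
# digit = digit_from_offset(sprite, 170, 2)     # offsets like these appear later in this post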

Let's have some fun with it today.

Start testing

I first tried reading the data with requests, but got back a pile of heavily obfuscated JS code. Then I tried visiting with Selenium. The result:

[Screenshot: the Selenium visit is blocked by the firewall]

This firewall is a real nuisance.

Fine, time to bring out the big gun and hide the browser's automation fingerprints:

from selenium import webdriver
from selenium.webdriver import ChromeOptions

option = ChromeOptions()
# Hide the "Chrome is being controlled by automated test software" banner
option.add_experimental_option('excludeSwitches', ['enable-automation'])
option.add_experimental_option('useAutomationExtension', False)
option.add_argument(
    'user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36')
# Disable the navigator.webdriver flag exposed by Blink
option.add_argument("--disable-blink-features=AutomationControlled")
browser = webdriver.Chrome(options=option)

# Inject stealth.min.js into every new document to mask the remaining automation traits
with open('stealth.min.js') as f:
    js = f.read()
browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    'source': js
})
url = 'http://hotels.huazhu.com/inthotel/detail/9005308'
browser.get(url)

This time the page finally loaded:

[Screenshot: the page loads normally]

However, the prices sometimes fail to show, and you have to refresh the page a few times:

[Screenshot: the prices fail to render]

After several visits, the data finally appears.
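If you don't want to refresh by hand, something like the following sketch could automate the "reload until prices show" step. This is not the author's code, and the combined selector is my assumption based on the selectors used later in this post:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def wait_for_prices(browser, retries=5, timeout=10):
    """Reload the page until a price element shows up, or give up."""
    for _ in range(retries):
        try:
            return WebDriverWait(browser, timeout).until(
                EC.presence_of_element_located(
                    (By.CSS_SELECTOR, "#Pdetail_part2 a[class^='totalprice']")))
        except TimeoutException:
            browser.refresh()  # prices missing this time, reload and try again
    return None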

Now let's have Selenium click "view all prices":

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 10)

table = wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, '#Pdetail_part2 table')))
# Accessing this property scrolls the table into view (it returns the coordinates, e.g. {'x': 0, 'y': 0})
table.location_once_scrolled_into_view
more_click = wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, '#Pdetail_part2 a[class="viewallprice"]')))
more_click.click()

This way, the price data for all 7 room types is visible.

Now let's capture the data we need:

Capture the required data

from io import BytesIO
import base64
from PIL import Image

for tr in table.find_elements_by_css_selector("table tr[class^='room first']"):
    # Room type name
    print(tr.find_element_by_tag_name("h3").text)
    price = tr.find_element_by_css_selector(
        "div>a[class^='totalprice']")
    # Screenshot just the price element and show it (display() is Jupyter's)
    img_data = base64.b64decode(price.screenshot_as_base64)
    img = Image.open(BytesIO(img_data))
    display(img)

[Image: the per-element price screenshots]

After adjusting the scroll position with the mouse, I tried again:

[Image: the price screenshots after adjusting the scroll position]

This shows that the screenshots are sometimes off, and getting accurate ones is hard, because you can't reliably scroll to the right position from code.

Besides, if a screenshot were enough, this would be too easy.

The main point of this article is to demonstrate parsing the CSS, so let's continue with that approach to get the data:

Parse the CSS to get the image data

First, let's parse out the data we need:

img_url = None
for tr in table.find_elements_by_css_selector("table tr[class^='room first']"):
    name = tr.find_element_by_tag_name("h3").text
    print(name)
    price = tr.find_element_by_css_selector("div>a[class^='totalprice']")
    for var in price.find_elements_by_tag_name("var"):
        if img_url is None:
            # background-image looks like url("http://..."); strip the url(" and ") wrapper
            img_url = var.value_of_css_property("background-image")[5:-2]
            print(img_url)
        # background-position looks like "-170px -2px"; strip the sign and "px"
        position = var.value_of_css_property("background-position")
        w, h = map(lambda x: int(x[1:-2]), position.split())
        print(w, h)

Output:

Superior King Room
http://hotels.huazhu.com/Blur/Pic?b=81efc0b8e3094942a81d01e311864270
170 2
188 2
126 2
Deluxe King Room
170 2
33 2
56 2
Deluxe Twin Room
145 2
2 2
188 2
View Deluxe King Room
145 2
170 2
111 2
Executive King Room
33 2
56 2
56 2
Executive Twin Room
201 2
2 2
145 2
Executive Suite
2 2
2 2
33 2
188 2

Let's try downloading the CSS background image:

browser.get(img_url)

The result is a page blocked by the Tencent T-Sec Web Application Firewall (WAF), which means downloading the image directly with Selenium doesn't work.

Download it with requests? I tried; it gets intercepted every time.

In the end, I wrote the following code (which fetches the image data fairly reliably):

import requests
from io import BytesIO
import base64
from PIL import Image

def download_img(img_url):
    # Reuse the cookies from the Selenium session so the WAF accepts the request
    cookies = {o['name']: o['value'] for o in browser.get_cookies()}
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Host": "hotels.huazhu.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
    }

    # Retry up to 10 times; the WAF still rejects some requests
    for _ in range(10):
        r = requests.get(img_url, headers=headers, cookies=cookies)
        if r.status_code == 200:
            break
    else:
        return None
    img = Image.open(BytesIO(r.content))
    return img


img = download_img(img_url)
img

[Image: the downloaded digit sprite]

With the image in hand, we can crop out the corresponding digit tiles and stitch them together.

Let's test it on the last row of data:
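The test snippet itself isn't shown here, but it would look roughly like this minimal sketch, assuming `img` is the sprite downloaded above and reusing the "Executive Suite" offsets printed earlier:

# Offsets for the last row ("Executive Suite"), taken from the output above
offsets = [(2, 2), (2, 2), (33, 2), (188, 2)]

target = Image.new('RGB', (10 * len(offsets), 17), color=(255, 255, 255))
for i, (w, h) in enumerate(offsets):
    digit = img.crop((w, h, w + 8, h + 17))   # 8x17 digit tile
    target.paste(digit, (10 * i, 0), digit)   # same mask trick as the batch code below
display(target)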

[Image: the stitched price for the last row]

You can see that the cropping and stitching work very well.

Next, let's test batch extraction:

img_url = None
for tr in table.find_elements_by_css_selector("table tr[class^='room first']"):
    name = tr.find_element_by_tag_name("h3").text
    print(name)
    price = tr.find_element_by_css_selector("div>a[class^='totalprice']")
    var_el_s = price.find_elements_by_tag_name("var")
    n = len(var_el_s)
    # White canvas: 10 px per digit (8 px tile + 2 px spacing), 17 px tall
    target = Image.new('RGB', (10 * n, 17), color=(255, 255, 255))
    for i, var in enumerate(var_el_s):
        if img_url is None:
            # All digits share the same sprite, so download it only once
            img_url = var.value_of_css_property("background-image")[5:-2]
            img = download_img(img_url)
        position = var.value_of_css_property("background-position")
        w, h = map(lambda x: int(x[1:-2]), position.split())
        # Crop the 8x17 digit tile and paste it onto the canvas
        r = img.crop((w, h, w+8, h+17))
        target.paste(r, (10*i, 0), r)
    display(target)

[Image: stitched price images for each room type]

You can see we've successfully obtained the desired results, consistent with the data shown on the website:

[Screenshot: the prices as displayed on the website]

All that's left is to recognize the stitched images with OCR, or simply save them as images.
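If you just want to archive the prices as images rather than OCR them, a small sketch like this would do (the file-naming scheme is my own choice, not from the original):

import re

# Inside the batch loop above, after `target` is assembled for a room:
safe_name = re.sub(r'[\\/:*?"<>|]', '_', name).strip()  # strip characters Windows dislikes in file names
target.save(f"{safe_name}_price.png")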

Image recognition

For image recognition, there are online and offline options. For online recognition you can consider Baidu Cloud, Tencent Cloud, etc., following the APIs documented on their official sites.

Next, we'll try offline character recognition; its accuracy is often lower than that of online services.

For better recognition, let's binarize the image first:

def image_binarization(im, threshold=250):
    # Convert to grayscale, then map pixels below the threshold to black (0)
    # and everything else to white (1)
    Lim = im.convert("L")
    table = [0 if i < threshold else 1 for i in range(256)]
    return Lim.point(table, "1")

image_binarization(target)

[Image: the binarized price image]

Now we need to install pytesseract and Tesseract-OCR.

pytesseract is a Python library:

pip install pytesseract

For Tesseract-OCR, you need to download an installer from https://digi.bib.uni-mannheim.de/tesseract/.

Because of network issues, I downloaded one from https://www.liangchan.net/liangchan/11545.html instead.

Project page: https://github.com/tesseract-ocr/tesseract

After installing, add the installation directory to the PATH environment variable. If running the command below prints something like the following, the installation succeeded:

C:\Users\ASUS>tesseract -v
tesseract v5.0.0.20190623
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE

C:\Users\ASUS>
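If you'd rather not touch the PATH variable, pytesseract can also be pointed at the executable directly; the path below is just the default Windows install location and may differ on your machine:

import pytesseract

# Adjust to wherever Tesseract-OCR was actually installed
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"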

Then we start recognizing:

import pytesseract

text = pytesseract.image_to_string(image_binarization(target)).strip()
print(text)  # -> 1183

Now we can run batch recognition:

import pytesseract

for tr in table.find_elements_by_css_selector("table tr[class^='room first']"):
    name = tr.find_element_by_tag_name("h3").text
    price = tr.find_element_by_css_selector("div>a[class^='totalprice']")
    var_el_s = price.find_elements_by_tag_name("var")
    n = len(var_el_s)
    target = Image.new('RGB', (10 * n, 17), color=(255, 255, 255))
    for i, var in enumerate(var_el_s):
        if img_url is None:
            img_url = var.value_of_css_property("background-image")[5:-2]
            img = download_img(img_url)
        position = var.value_of_css_property("background-position")
        w, h = map(lambda x: int(x[1:-2]), position.split())
        r = img.crop((w, h, w+8, h+17))
        target.paste(r, (10*i, 0), r)
    display(target)
    text = pytesseract.image_to_string(image_binarization(target)).strip()
    print(name, text)

[Screenshot: room names with their recognized prices]

As you can see from the results, the recognition accuracy is quite high; at least everything checked so far is correct.

Copyright notice
Author: Xiao Ming (code entity). Please include a link to the original when reprinting, thank you.
https://en.qdmana.com/2021/08/20210825131016102i.html
