Handling heavy anti-crawling with Python (TSEC firewall + CSS background-image offset positioning)
2021-08-25 13:10:20 【Xiao Ming (code entity)】
Hello everyone, I'm Xiaoming. Today I came across this website:
It's fascinating: every digit is displayed as a small image cut out of a single background picture with CSS. What is certain is that each digit occupies an 8*17 region of that picture.
Let's play with it today.
Start testing
First I tried reading the data with requests, and got back a pile of heavily obfuscated JS code. Then I tried visiting the page with selenium, with this result:
This firewall clearly has a few tricks up its sleeve.
Never mind, let's bring out the big gun and hide the telltale features of the automated browser:
from selenium.webdriver import ChromeOptions
from selenium import webdriver
option = ChromeOptions()
# Drop the switches and extension that mark Chrome as automation-controlled
option.add_experimental_option('excludeSwitches', ['enable-automation'])
option.add_experimental_option('useAutomationExtension', False)
option.add_argument(
    'user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36')
option.add_argument("--disable-blink-features=AutomationControlled")
# Create the browser once, with the options applied
browser = webdriver.Chrome(options=option)
# stealth.min.js must be in the working directory; it is injected before any page script runs
with open('stealth.min.js') as f:
    js = f.read()
browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    'source': js
})
url = 'http://hotels.huazhu.com/inthotel/detail/9005308'
browser.get(url)
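To see whether the disguise took effect, one option is to ask the page what it can observe. The sketch below assumes a Selenium-style driver exposing `execute_script`; `automation_flags` and `_StubDriver` are hypothetical names, and the two properties probed are only common examples, not a list of what this particular firewall inspects.

```python
def automation_flags(driver):
    """Return a few browser properties that anti-bot scripts commonly inspect.

    `driver` is assumed to be a Selenium WebDriver; anything with an
    `execute_script` method works, which is what the stub below relies on.
    """
    return {
        # True in an unmodified automated Chrome; hidden browsers usually
        # report None or False here
        "webdriver": driver.execute_script("return navigator.webdriver"),
        # The user agent string the page actually sees
        "user_agent": driver.execute_script("return navigator.userAgent"),
    }


class _StubDriver:
    """Tiny stand-in so the helper can be exercised without a browser."""

    def __init__(self, values):
        self._values = values

    def execute_script(self, script):
        return self._values.get(script)


stub = _StubDriver({
    "return navigator.webdriver": None,
    "return navigator.userAgent": "Mozilla/5.0 (example)",
})
flags = automation_flags(stub)
print(flags)
```

With the real session, the call would simply be `automation_flags(browser)`.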
This time the page finally loaded:
However, the prices sometimes fail to show, and the only remedy is to refresh the page a few more times:
After several visits, the data finally appeared.
Now let the automated browser click the link that shows all prices:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(browser, 10)
table = wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, '#Pdetail_part2 table')))
# Reading this property scrolls the table into view
table.location_once_scrolled_into_view
{'x': 0, 'y': 0}
more_click = wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, '#Pdetail_part2 a[class="viewallprice"]')))
more_click.click()
With that, all 7 prices are visible.
Next, let's capture the data we need:
Capture the required data
from io import BytesIO
import base64
from PIL import Image
# find_element(By...) is used because newer Selenium releases removed the find_element_by_* helpers
for tr in table.find_elements(By.CSS_SELECTOR, "table tr[class^='room first']"):
    print(tr.find_element(By.TAG_NAME, "h3").text)
    price = tr.find_element(
        By.CSS_SELECTOR, "div>a[class^='totalprice']")
    # Screenshot only the price element and show it in the notebook
    img_data = base64.b64decode(price.screenshot_as_base64)
    img = Image.open(BytesIO(img_data))
    display(img)
Scroll the page to a different position and try again:
This shows that the screenshots are sometimes inaccurate. Accurate ones are hard to get, because the program cannot reliably scroll to the right position.
Besides, if a screenshot alone could do the job, that would be too simple.
The main purpose of this article is to demonstrate parsing the CSS, so let's continue and obtain the data by parsing instead:
Parse the CSS to get the image data
First, we parse out the data we need:
img_url = None
for tr in table.find_elements(By.CSS_SELECTOR, "table tr[class^='room first']"):
    name = tr.find_element(By.TAG_NAME, "h3").text
    print(name)
    price = tr.find_element(By.CSS_SELECTOR, "div>a[class^='totalprice']")
    for var in price.find_elements(By.TAG_NAME, "var"):
        if img_url is None:
            # Strip the url("...") wrapper from the computed value
            img_url = var.value_of_css_property("background-image")[5:-2]
            print(img_url)
        position = var.value_of_css_property("background-position")
        # "-170px -2px" -> (170, 2): drop the leading "-" and the trailing "px"
        w, h = map(lambda x: int(x[1:-2]), position.split())
        print(w, h)
Superior King Room
http://hotels.huazhu.com/Blur/Pic?b=81efc0b8e3094942a81d01e311864270
170 2
188 2
126 2
Deluxe big bed room
170 2
33 2
56 2
Deluxe Twin Room
145 2
2 2
188 2
View Deluxe King Room
145 2
170 2
111 2
Executive King Room
33 2
56 2
56 2
Executive Twin Room
201 2
2 2
145 2
Executive Suite
2 2
2 2
33 2
188 2
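The slicing `x[1:-2]` used above assumes every offset has a leading minus sign, as in `-170px`; a value such as `0px` would make `int("")` fail. A slightly more defensive parser might look like the sketch below. `parse_position` is a hypothetical helper, and it assumes the browser reports both offsets in pixels.

```python
import re

# One offset such as "-170px", "0px" or "-2.5px"
_PX = re.compile(r"^(-?\d+(?:\.\d+)?)px$")


def parse_position(position):
    """Turn a computed background-position such as '-170px -2px'
    into the (x, y) offsets of the visible region inside the sprite."""
    parts = position.split()
    if len(parts) != 2:
        raise ValueError(f"unsupported background-position: {position!r}")
    offsets = []
    for part in parts:
        match = _PX.match(part)
        if match is None:
            raise ValueError(f"unsupported background-position: {position!r}")
        # A negative CSS offset shifts the picture left/up, so the visible
        # region starts at the negated value inside the sprite.
        offsets.append(-round(float(match.group(1))))
    return tuple(offsets)


print(parse_position("-170px -2px"))
print(parse_position("0px -2px"))
```

In the loop above, `w, h = parse_position(position)` would then replace the `map(lambda ...)` line.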
Try downloading the CSS background image:
browser.get(img_url)
The result is a page blocked by the Tencent T-Sec Web Application Firewall (WAF), which shows that downloading the picture directly with selenium doesn't work.
What about downloading it with requests? My attempts kept getting intercepted.
In the end, I wrote the following code, which sends the browser's cookies along with browser-like headers and fetches the picture fairly reliably:
import requests
from io import BytesIO
import base64
from PIL import Image
def download_img(img_url):
    # Reuse the cookies of the selenium browser session
    cookies = {
        o['name']: o['value'] for o in browser.get_cookies()}
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Host": "hotels.huazhu.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
    }
    # Retry up to 10 times; the else branch runs only if no attempt returned 200
    for _ in range(10):
        r = requests.get(img_url, headers=headers, cookies=cookies)
        if r.status_code == 200:
            break
    else:
        return None
    img = Image.open(BytesIO(r.content))
    return img
img = download_img(img_url)
img
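A status code of 200 does not by itself prove that the body is a picture; a firewall may also answer with an HTML page. Whether this WAF ever does so with status 200 is an assumption, not something the experiment above shows. As a cheap guard, the first bytes of the response can be compared with well-known image signatures before handing them to PIL. `sniff_image_type` is a hypothetical helper:

```python
def sniff_image_type(data):
    """Guess the image format from the leading bytes, or return None."""
    signatures = {
        b"\x89PNG\r\n\x1a\n": "png",
        b"\xff\xd8\xff": "jpeg",
        b"GIF87a": "gif",
        b"GIF89a": "gif",
    }
    for magic, name in signatures.items():
        if data.startswith(magic):
            return name
    return None


print(sniff_image_type(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
print(sniff_image_type(b"<html>blocked</html>"))
```

Inside `download_img`, the retry loop could treat `sniff_image_type(r.content) is None` the same way as a non-200 status.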
With the picture in hand, we can cut out the corresponding digit images and splice them together.
Test with the last row of data:
As you can see, the parsed and spliced result looks very good.
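The geometry of the cut-and-splice step can be written down without touching any image. The sketch below uses the 8*17 digit size stated earlier, a 10-wide cell per digit (an illustrative choice that leaves a small gap), and the four offsets printed for the Executive Suite row; `plan_splice` is a hypothetical helper.

```python
def plan_splice(offsets, digit_w=8, digit_h=17, cell_w=10):
    """For each (x, y) sprite offset, compute the crop box in the sprite
    and the paste position on the output canvas."""
    canvas_size = (cell_w * len(offsets), digit_h)
    steps = []
    for i, (x, y) in enumerate(offsets):
        crop_box = (x, y, x + digit_w, y + digit_h)  # left, top, right, bottom
        paste_at = (cell_w * i, 0)                   # digits laid out left to right
        steps.append((crop_box, paste_at))
    return canvas_size, steps


canvas_size, steps = plan_splice([(2, 2), (2, 2), (33, 2), (188, 2)])
print(canvas_size)
for crop_box, paste_at in steps:
    print(crop_box, "->", paste_at)
```

Each `crop_box` is in the form expected by PIL's `Image.crop`, and each `paste_at` in the form expected by `Image.paste`.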
Next, test extraction in batch:
img_url = None
for tr in table.find_elements(By.CSS_SELECTOR, "table tr[class^='room first']"):
    name = tr.find_element(By.TAG_NAME, "h3").text
    print(name)
    price = tr.find_element(By.CSS_SELECTOR, "div>a[class^='totalprice']")
    var_el_s = price.find_elements(By.TAG_NAME, "var")
    n = len(var_el_s)
    # Each digit is 8 wide; a 10-wide cell leaves a small gap between digits
    target = Image.new('RGB', (10 * n, 17), color=(255, 255, 255))
    for i, var in enumerate(var_el_s):
        if img_url is None:
            # Download the sprite only once
            img_url = var.value_of_css_property("background-image")[5:-2]
            img = download_img(img_url)
        position = var.value_of_css_property("background-position")
        w, h = map(lambda x: int(x[1:-2]), position.split())
        r = img.crop((w, h, w+8, h+17))
        # r is also passed as the mask, which relies on the sprite having transparency
        target.paste(r, (10*i, 0), r)
    display(target)
You can see that the desired results were obtained, consistent with the data shown on the website:
All that remains is to recognize the spliced images, or simply to save them as pictures.
Image recognition
Image recognition can be done online or offline. For online recognition, consider services such as Baidu Cloud or Tencent Cloud, following the interface documentation on their official websites.
Next, let's try offline character recognition, whose accuracy is often lower than that of online recognition.
For better recognition, let's binarize the picture first:
def image_binarization(im, threshold=250):
    # Convert to grayscale, then map every level below the threshold to black
    Lim = im.convert("L")
    table = [0 if i < threshold else 1 for i in range(256)]
    return Lim.point(table, "1")
image_binarization(target)
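To make the effect of `threshold=250` concrete, the lookup table can be inspected on its own, without any image. This is only an illustration of the same list comprehension; `threshold_table` and `lut` are hypothetical names.

```python
def threshold_table(threshold=250):
    """Same lookup table as in image_binarization: one entry per gray level."""
    return [0 if level < threshold else 1 for level in range(256)]


lut = threshold_table()
# A few gray levels and the value they are mapped to (0 = black, 1 = white)
samples = {level: lut[level] for level in (0, 128, 249, 250, 255)}
print(samples)
print(sum(lut), "of 256 gray levels stay white")
```

With the default threshold, only the six brightest levels (250 to 255) remain white, so anything that is not almost pure white background is treated as part of a digit.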
Now we need to install pytesseract and Tesseract-OCR.
pytesseract is a Python library:
pip install pytesseract
For Tesseract-OCR, download an installation package from https://digi.bib.uni-mannheim.de/tesseract/ .
Because of network issues, I downloaded a copy from https://www.liangchan.net/liangchan/11545.html instead.
Project address: https://github.com/tesseract-ocr/tesseract
After installation, add the installation path to the PATH environment variable. If running the following on the command line prints output like this, the installation succeeded:
C:\Users\ASUS>tesseract -v
tesseract v5.0.0.20190623
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE
C:\Users\ASUS>
Then we can start recognizing:
import pytesseract
text = pytesseract.image_to_string(image_binarization(target)).strip()
print(text)
1183
Now we can run recognition in batch:
import pytesseract
# img_url and img carry over from the previous step, so the sprite is not downloaded again
for tr in table.find_elements(By.CSS_SELECTOR, "table tr[class^='room first']"):
    name = tr.find_element(By.TAG_NAME, "h3").text
    price = tr.find_element(By.CSS_SELECTOR, "div>a[class^='totalprice']")
    var_el_s = price.find_elements(By.TAG_NAME, "var")
    n = len(var_el_s)
    target = Image.new('RGB', (10 * n, 17), color=(255, 255, 255))
    for i, var in enumerate(var_el_s):
        if img_url is None:
            img_url = var.value_of_css_property("background-image")[5:-2]
            img = download_img(img_url)
        position = var.value_of_css_property("background-position")
        w, h = map(lambda x: int(x[1:-2]), position.split())
        r = img.crop((w, h, w+8, h+17))
        target.paste(r, (10*i, 0), r)
    display(target)
    # Binarize the spliced picture, then run OCR on it
    text = pytesseract.image_to_string(image_binarization(target)).strip()
    print(name, text)
As the results show, the recognition accuracy is quite high; at least everything we've seen so far is correct.
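Since the prices consist only of digits, it may help to tell Tesseract so and to validate what comes back. The sketch below is an assumption-laden suggestion rather than something tested in this article: `DIGITS_ONLY` combines Tesseract's single-line page segmentation mode with a digit whitelist, whose effect can vary between Tesseract versions, and `to_price` is a hypothetical helper that cleans the OCR output.

```python
# Assumed Tesseract options: page segmentation mode 7 (treat the image as a
# single text line) and a whitelist restricting the output to digits.
DIGITS_ONLY = "--psm 7 -c tessedit_char_whitelist=0123456789"


def to_price(ocr_text):
    """Keep only ASCII digits from the OCR output and convert them to int.

    Returns None when no digit was recognized, so failures are easy to spot.
    """
    digits = "".join(ch for ch in ocr_text if ch in "0123456789")
    return int(digits) if digits else None


print(to_price(" 1183\n"))
print(to_price(""))
```

The option string would be passed through pytesseract's `config` parameter, for example `pytesseract.image_to_string(image_binarization(target), config=DIGITS_ONLY)`, and the returned text fed to `to_price`.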
Copyright notice
Author: Xiao Ming (code entity). Please include the original link when reprinting, thank you.
https://en.qdmana.com/2021/08/20210825131016102i.html