Workshop: Master Anti-Ban & Web Scraping Techniques (2h)

Fabien Vauchelles

from Scrapoxy (France)

About speaker

Fabien Vauchelles is an Anti-Ban Expert. With over a decade of experience in Web Scraping, Fabien's passion for code and technology helps him to bypass protections. He is the creator of Scrapoxy, a mature free and open-source proxy waterfall tailored for the Web Scraping industry.

About speaker's company

I work independently and focus solely on open-source projects. Scrapoxy is freely accessible and open to the entire community.

Abstracts

specific

This session is a workshop with progressively challenging exercises, lasting 90 to 180 minutes to fit your schedule.

You can preview the workshop here: https://github.com/fabienvauchelles/scraping-workshop

We’ll tackle protection measures step by step with proxies, headless browsers and deobfuscation. I developed the website https://trekky-reviews.com specifically for this workshop, featuring the latest techniques used by anti-bot systems.

The ideal attendance size is 30, but I can easily accommodate between 15 and 60 participants.

The best part? Everyone will walk away with actionable skills to legally gather data using these cutting-edge methods.

Alternatively, I offer a 45-minute live-coding session if that's preferred.

Here’s a sneak peek of the 2-hour workshop:

1. Introduction (4 mins)
To kick off the workshop, I engage the participants by asking about their experiences with bypassing website protection. This sets the stage for introducing myself and expressing my passion for web scraping and reverse-engineering anti-bot measures.

2. Legal (4 mins)
Let's take a proactive approach. Here's a straightforward decision pathway: if the data is public and non-personal, you don't have to accept any terms and conditions (T&C) to reach it, and you're not harming the site (for example, by flooding it with requests as a DDoS would), then you're good to go!

3. Website Target Structure (4 mins)
I created a dedicated website for this workshop: https://trekky-reviews.com/. The site comes in several iterations, each fortified with progressively more challenging protections. Throughout the workshop, we'll manoeuvre through these defences.

4. Framework Installation and 1st challenge (15 mins)
I will guide participants through the installation of the Scrapy framework and kickstart the first project.

5. Basic Challenge-Solving (15 mins)
Participants will engage in solving 2 challenges:
- Bypass User-Agent filtering
- Add consistent HTTP headers
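
As a rough illustration of what "consistent" headers can mean, here is a minimal Python sketch. The header values, the `build_request` helper and the `is_consistent` check are assumptions made for this example, not the workshop's actual solution: the idea is only that the User-Agent and the client-hint headers should describe the same browser and platform.

```python
from urllib.request import Request

# Hypothetical header set imitating desktop Chrome on Windows.
# The exact values are illustrative; what matters is that they agree with each other.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-CH-UA-Mobile": "?0",
    "Sec-CH-UA-Platform": '"Windows"',
}


def is_consistent(headers: dict) -> bool:
    """Naive check: the platform hint must match the OS token in the User-Agent."""
    user_agent = headers.get("User-Agent", "")
    platform = headers.get("Sec-CH-UA-Platform", "").strip('"')
    expected = {"Windows NT": "Windows", "Macintosh": "macOS", "Linux": "Linux"}
    for token, name in expected.items():
        if token in user_agent:
            return platform == name
    return False  # unknown User-Agent: treat as inconsistent


def build_request(url: str) -> Request:
    """Build (but do not send) a request carrying the browser-like headers."""
    return Request(url, headers=BROWSER_HEADERS)


if __name__ == "__main__":
    request = build_request("https://trekky-reviews.com/")
    print(request.get_header("User-agent"))
    print(is_consistent(BROWSER_HEADERS))
```

In a Scrapy project, the same dictionary could be supplied through the `DEFAULT_REQUEST_HEADERS` setting instead of being attached request by request.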

6. Proxies Overview (5 mins)
I explain the different types of proxy: Datacenter, ISP, Residential, and Mobile, outlining their respective advantages and drawbacks.

7. Proxies Challenges (20 mins)
We'll set up Scrapoxy and configure the first connector. Participants will tackle 2 challenges:
- Bypass Rate Limit with Datacenter proxies
- Avoid detection with ISP proxies
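
To make the rate-limit idea concrete, the sketch below spreads requests over several proxies in round-robin order, so that no single IP address carries all the traffic. It is not how Scrapoxy is implemented or configured: in the workshop, Scrapoxy takes care of rotation. The proxy URLs and the class name are hypothetical; the class only follows the shape of a Scrapy downloader middleware, which exposes `process_request(request, spider)` and sets `request.meta["proxy"]`.

```python
from itertools import cycle
from types import SimpleNamespace

# Hypothetical datacenter proxy endpoints (placeholders, not real servers).
PROXIES = [
    "http://proxy-1.example:8080",
    "http://proxy-2.example:8080",
    "http://proxy-3.example:8080",
]


class RoundRobinProxyMiddleware:
    """Assign the next proxy in the pool to every outgoing request."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads this meta key.
        request.meta["proxy"] = next(self._pool)
        return None  # let Scrapy continue processing the request


if __name__ == "__main__":
    middleware = RoundRobinProxyMiddleware(PROXIES)
    # Stand-in objects with a .meta dict, so the sketch runs without Scrapy installed.
    requests = [SimpleNamespace(meta={}) for _ in range(4)]
    for request in requests:
        middleware.process_request(request, spider=None)
        print(request.meta["proxy"])
```

With three proxies and a per-IP limit, each address sees only about a third of the requests, which is the basic reason rotation helps against rate limiting.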

8. Headless Browser Challenge (20 mins)
Participants will install Playwright and tackle a series of challenges, including:
- Executing JavaScript with a headless browser
- Tuning headless browser parameters (like timezone)
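
The two points above can be sketched with Playwright's Python API. This is a minimal example under stated assumptions, not the workshop's solution: the helper names, the default timezone and the default locale are chosen for illustration. One plausible reason to tune the timezone is to keep it in line with the region the traffic appears to come from.

```python
def context_options(timezone_id: str, locale: str) -> dict:
    """Options passed to Playwright's browser.new_context()."""
    return {"timezone_id": timezone_id, "locale": locale}


def fetch_rendered_html(url: str, timezone_id: str = "Europe/Paris",
                        locale: str = "fr-FR") -> str:
    """Load a page in headless Chromium, let its JavaScript run, and return the HTML."""
    # Imported lazily so the helper above stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=True)
        context = browser.new_context(**context_options(timezone_id, locale))
        page = context.new_page()
        page.goto(url)
        html = page.content()  # DOM serialized after scripts have executed
        browser.close()
        return html


if __name__ == "__main__":
    print(context_options("Europe/Paris", "fr-FR"))
    # Requires `pip install playwright` and `playwright install chromium`:
    # print(fetch_rendered_html("https://trekky-reviews.com/")[:200])
```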

9. Code Deobfuscation (10 mins)
I'll introduce techniques for deobfuscating both strings and control flow.
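
As a small taste of string deobfuscation, the sketch below decodes hex-escaped characters that obfuscators often use to hide string literals. It is a deliberately simple, regex-based illustration in Python, assumed for this example only: the workshop itself relies on Babel.js, and a real deobfuscator would work on the syntax tree rather than on raw text.

```python
import re

# Matches JavaScript hex escapes such as \x74 inside the source text.
HEX_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})")


def _decode(match: re.Match) -> str:
    char = chr(int(match.group(1), 16))
    # Keep quotes, backslashes and non-printable characters escaped,
    # so that string literals in the output remain valid.
    if char.isprintable() and char not in "\"'\\":
        return char
    return match.group(0)


def decode_hex_escapes(source: str) -> str:
    """Replace readable \\xNN escapes with the characters they encode."""
    return HEX_ESCAPE.sub(_decode, source)


if __name__ == "__main__":
    obfuscated = r'var k = "\x74\x6f\x6b\x65\x6e";'
    print(decode_hex_escapes(obfuscated))  # prints: var k = "token";
```

Control-flow deobfuscation is harder to show in a few lines, since it needs the program's structure and not just its text, which is where an AST-based tool comes in.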

10. Deobfuscation Challenge (20 mins)
With the installation of Babel.js, participants will start reverse engineering a protection through deobfuscation. They will replicate the anti-bot behaviour, including payload encryption.

11. Conclusion (3 mins)
As a wrap-up, I will present upcoming challenges and potential solutions, leaving us with food for thought about the future of protections.

The talk was declined