Workshop: Master Anti-Ban & Web Scraping Techniques (2h)

Fabien Vauchelles

from Scrapoxy (France)

About speaker

Fabien Vauchelles is an Anti-Ban Expert. With over a decade of experience in Web Scraping, Fabien's passion for code and technology helps him to bypass protections. He is the creator of Scrapoxy, a mature free and open-source proxy waterfall tailored for the Web Scraping industry.

About speaker's company

I work independently and focus solely on open-source projects. Scrapoxy is freely accessible and open to the entire community.

Abstracts

specific

This session is a workshop with progressively challenging exercises, lasting 90 to 180 minutes to fit your schedule.

You can preview the workshop here: https://github.com/fabienvauchelles/scraping-workshop

We’ll tackle protection measures step by step with proxies, headless browsers and deobfuscation. I developed the website https://trekky-reviews.com specifically for this workshop, featuring the latest techniques used by anti-bot systems.

The ideal attendance size is 30, but I can easily accommodate between 15 and 60 participants.

The best part? Everyone will walk away with actionable skills to legally gather data using these cutting-edge methods.

Alternatively, I offer a 45-minute live-coding session if that's preferred.

Here’s a sneak peek of the 2-hour workshop:

1. Introduction (4 mins)
To kick off the workshop, I engage the participants by asking about their experiences with bypassing website protection. This sets the stage for introducing myself and expressing my passion for web scraping and reverse-engineering anti-bot measures.

2. Legal (4 mins)
Let's take a proactive approach. Here's a straightforward decision pathway: if the data is public and non-personal, you don't have to accept any terms and conditions (T&C) to access it, and you're not causing harm (for example, a DDoS), then you're good to go!

3. Website Target Structure (4 mins)
I created a dedicated website for this workshop: https://trekky-reviews.com/. The site comes in several iterations, each fortified with progressively more challenging protections. Throughout the workshop, we'll manoeuvre through these defences.

4. Framework Installation and 1st challenge (15 mins)
I will guide participants through the installation of the Scrapy framework and kickstart the first project.

5. Basic Challenge-Solving (15 mins)
Participants will engage in solving 2 challenges:
- Bypass User-Agent filtering
- Add consistent HTTP headers
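
As a rough illustration of these two challenges, a Scrapy project's `settings.py` could declare a browser-like User-Agent together with headers that agree with it. `USER_AGENT` and `DEFAULT_REQUEST_HEADERS` are standard Scrapy settings; the specific header values below are assumptions chosen for illustration, not the workshop's solution.

```python
# settings.py (sketch): all header values are illustrative assumptions.

# A browser-like User-Agent instead of Scrapy's default "Scrapy/x.y" string.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

# Headers that stay consistent with the User-Agent above:
# same browser family, same major version, same platform.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-CH-UA": '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
    "Sec-CH-UA-Mobile": "?0",
    "Sec-CH-UA-Platform": '"Windows"',
}
```

The point of the sketch is the consistency: a Chrome User-Agent paired with headers that claim another browser or platform is an easy signal for an anti-bot system to pick up.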

6. Proxies Overview (5 mins)
I explain the different types of proxy: Datacenter, ISP, Residential, and Mobile, outlining their respective advantages and drawbacks.

7. Proxies Challenges (20 mins)
We'll set up Scrapoxy and configure the first connector. Participants will tackle 2 challenges:
- Bypass Rate Limit with Datacenter proxies
- Avoid detection with ISP proxies
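
The idea behind bypassing a per-IP rate limit can be sketched with the standard library alone: spread consecutive requests over several proxies so that no single address exceeds the limit. This is not how Scrapoxy is configured; the proxy URLs and helper names below are hypothetical and only illustrate the principle.

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints; replace with real ones.
PROXIES = [
    "http://proxy-1.example.com:3128",
    "http://proxy-2.example.com:3128",
    "http://proxy-3.example.com:3128",
]


def proxy_rotation(proxies):
    """Return an endless round-robin iterator over the given proxies."""
    return itertools.cycle(proxies)


def opener_for(proxy_url):
    """Build a urllib opener that routes HTTP and HTTPS through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


rotation = proxy_rotation(PROXIES)
# Six consecutive requests are assigned evenly: two per proxy.
assignments = [next(rotation) for _ in range(6)]
```

With three proxies, each address sees only one request in three, so a limit of N requests per minute per IP effectively becomes 3N for the scraper as a whole.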

8. Headless Browser Challenge (20 mins)
Participants will install Playwright and tackle a series of challenges, including:
- Executing JavaScript with a headless browser
- Tuning headless browser parameters (like timezone)
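
A minimal sketch of both steps with Playwright's Python API might look like the following. The timezone and locale values are assumptions (they would normally match the location of the proxy in use), and `context_options` and `fetch_rendered_html` are hypothetical helper names.

```python
def context_options(timezone_id, locale):
    """Return keyword arguments for Playwright's browser.new_context()."""
    return {"timezone_id": timezone_id, "locale": locale}


def fetch_rendered_html(url, options):
    """Load a page in headless Chromium and return the HTML after JavaScript ran."""
    # Imported here so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        # The context carries the tuned parameters (timezone, locale).
        context = browser.new_context(**options)
        page = context.new_page()
        page.goto(url)
        html = page.content()
        browser.close()
    return html


# Assumed values: they should match the location of the proxy in use.
options = context_options("Europe/Paris", "fr-FR")
```

Calling `fetch_rendered_html("https://trekky-reviews.com/", options)` requires Playwright and a Chromium build to be installed first.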

9. Code Deobfuscation (10 mins)
I'll introduce techniques for deobfuscating both strings and code-flow.
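
To give a flavour of string deobfuscation, here is a toy sketch that decodes hex-escaped strings and inlines lookups into a string array. The obfuscated snippet is invented for this example, and the regex-based approach is a simplification: a real deobfuscator would work on the syntax tree rather than on raw text, and this sketch does not address code flow at all.

```python
import re

# Toy obfuscated JavaScript, invented for this sketch.
OBFUSCATED = (
    "var _0xabc = ['\\x6c\\x6f\\x67', '\\x68\\x65\\x6c\\x6c\\x6f'];"
    "console[_0xabc[0x0]](_0xabc[0x1]);"
)


def decode_hex_escapes(source):
    """Turn \\xNN escape sequences into the characters they encode."""
    return re.sub(
        r"\\x([0-9a-fA-F]{2})",
        lambda m: chr(int(m.group(1), 16)),
        source,
    )


def inline_string_array(source):
    """Replace lookups such as _0xabc[0x1] with the string they refer to."""
    decl = re.search(r"var (_0x[0-9a-f]+) = \[(.*?)\];", source)
    if decl is None:
        return source
    name = decl.group(1)
    items = re.findall(r"'([^']*)'", decl.group(2))
    # Drop the array declaration, then resolve every indexed access.
    body = source[:decl.start()] + source[decl.end():]
    lookup = re.compile(re.escape(name) + r"\[(0x[0-9a-f]+)\]")
    return lookup.sub(lambda m: "'" + items[int(m.group(1), 16)] + "'", body)


readable = inline_string_array(decode_hex_escapes(OBFUSCATED))
# readable == "console['log']('hello');"
```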

10. Deobfuscation Challenge (20 mins)
With the installation of Babel.js, participants will start reverse engineering a protection through deobfuscation. They will replicate the anti-bot behaviour, including payload encryption.
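
Once the protection's logic is understood, replicating it usually means rebuilding the payload the script would have sent. The scheme below (JSON, then XOR with a repeating key, then Base64) is purely hypothetical, as are the key and the signal names; the abstract does not describe the actual scheme used in the workshop.

```python
import base64
import json
from itertools import cycle


def xor_bytes(data, key):
    """XOR every byte of data with a repeating key."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def encrypt_payload(signals, key):
    """Serialize the signals as compact JSON, XOR them, and Base64-encode the result."""
    raw = json.dumps(signals, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(xor_bytes(raw, key)).decode("ascii")


def decrypt_payload(token, key):
    """Reverse encrypt_payload; handy for checking the replica against captured traffic."""
    raw = xor_bytes(base64.b64decode(token), key)
    return json.loads(raw.decode("utf-8"))


# Hypothetical key and browser signals.
KEY = b"hypothetical-key"
signals = {"timezone": "Europe/Paris", "webdriver": False}
token = encrypt_payload(signals, KEY)
```

Being able to decrypt a payload captured from a real browser session is a practical way to confirm that the reimplementation matches the original script.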

11. Conclusion (3 mins)
As a wrap-up, I will present upcoming challenges and potential solutions, leaving us with food for thought about the future of protections.

The Program Committee has not yet taken a decision on this talk
