
Live-Coding: Master Anti-Ban & Web Scraping Techniques with Scrapoxy

Fabien Vauchelles

from Scrapoxy (France)

About speaker

Fabien Vauchelles is an Anti-Ban Expert. With over a decade of experience in Web Scraping, Fabien's passion for code and technology helps him to bypass protections. He is the creator of Scrapoxy, a mature free and open-source proxy waterfall tailored for the Web Scraping industry.

About speaker's company

I work independently and focus solely on open-source projects. Scrapoxy is freely accessible and open to the entire community.

Abstracts

specific

The session will be a 35-minute live-coding demonstration, followed by a 10-minute Q&A.

You can preview the slides here: https://bit.ly/masteringcfp45

In this presentation, I'll take attendees through an intriguing story:

Meet Isabella, a visionary AI engineer with a head full of dreams. She wants to revolutionise the tourism industry. But there is a catch - she's missing the crucial ingredient for her AI model: data.

We’ll join Isabella on her data quest, tackling protection measures step by step with proxies, headless browsers and deobfuscation. I developed the website https://trekky-reviews.com specifically for this talk, featuring the latest techniques used by anti-bot systems.

The best part? Everyone will walk away with actionable skills to legally gather data using these cutting-edge methods.

Alternatively, I offer a more in-depth 3-hour workshop if that's preferred.

Here’s a sneak peek of the live-coding:

1. Introduction (3 mins)
To kick off the presentation, I engage the audience by asking about their experiences with coding a web scraper. This sets the stage for introducing myself and expressing my enthusiasm for web scraping.

2. Narrative (2 mins)
I share a compelling narrative with the audience: Meet Isabella, a visionary AI engineer with a head full of dreams. To build her product, she needs to collect vital data and bypass protections.

3. Legal (2 mins)
Let's take a proactive approach. Here's a straightforward decision pathway: if the data is public and non-personal, you don't need to agree to any terms and conditions (T&C) to access it, and you're not causing harm (such as a DDoS), then you're good to go!
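The decision pathway above can be written down as a simple checklist. The sketch below is only an illustration of that logic, not legal advice; the `ScrapingPlan` class and its field names are hypothetical and chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class ScrapingPlan:
    """Hypothetical description of a planned scraping job."""
    data_is_public: bool
    data_is_personal: bool
    requires_accepting_terms: bool  # would you have to agree to T&C?
    risks_harming_site: bool        # e.g. request rates that act like a DDoS


def is_good_to_go(plan: ScrapingPlan) -> bool:
    """Return True only if every condition of the decision pathway holds."""
    return (
        plan.data_is_public
        and not plan.data_is_personal
        and not plan.requires_accepting_terms
        and not plan.risks_harming_site
    )


# Example: public, non-personal reviews, no T&C to accept, gentle crawl rate.
isabella_plan = ScrapingPlan(
    data_is_public=True,
    data_is_personal=False,
    requires_accepting_terms=False,
    risks_harming_site=False,
)
print(is_good_to_go(isabella_plan))
```

If any single condition fails (for instance, the data turns out to be personal), the function returns `False`, which mirrors the all-or-nothing nature of the pathway.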

4. Website Target Structure (3 mins)
I created a dedicated website for this presentation: https://trekky-reviews.com/. The site comes in several iterations, each fortified with progressively more challenging protections. Throughout the presentation, we'll help Isabella manoeuvre through these defences.

5. Framework Presentation (2 mins)
I give a brief overview of the Scrapy framework and show how to write a spider.

6. Live Challenge-Solving (21 mins)
Now, let’s dive into live-coding.

We will help Isabella to tackle a series of challenges:
- altering HTTP headers (3 mins)
- using Datacenter and Residential Proxies with Scrapoxy (7 mins)
- leveraging and tuning a headless browser to overcome fingerprinting (4 mins)
- code deobfuscation and crafting of anti-bot payload (7 mins)
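As a rough illustration of the first two challenges in the list above, the sketch below builds a request with browser-like HTTP headers and routes it through a proxy endpoint. It uses only Python's standard library rather than Scrapy, so it is not the code shown in the talk. The header values, the placeholder credentials and the `localhost:8888` proxy address are assumptions for this example; adjust them to your own Scrapoxy setup.

```python
import urllib.request

# Assumed values for illustration only.
TARGET_URL = "https://trekky-reviews.com/"
PROXY_URL = "http://USERNAME:PASSWORD@localhost:8888"  # placeholder endpoint

# Headers resembling what a desktop browser sends; the exact values are
# examples and not a guaranteed way past any particular protection.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url: str) -> urllib.request.Request:
    """Create a GET request carrying the browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def build_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Create an opener that sends HTTP and HTTPS traffic through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url: str, proxy_url: str) -> bytes:
    """Fetch a page through the proxy (performs a real network call)."""
    opener = build_proxy_opener(proxy_url)
    with opener.open(build_request(url), timeout=10) as response:
        return response.read()
```

Calling `fetch(TARGET_URL, PROXY_URL)` only works once a proxy is actually listening at the configured address; the two builder functions can be inspected without any network access.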

7. Conclusion (2 mins)
As a wrap-up, I will present upcoming challenges and potential solutions, leaving us with food for thought about the future of web scraping.

The Program Committee has not yet taken a decision on this talk
