https://substack.thewebscraping.club/

The Web Scraping Club, Milan (2026)

25/10/2024

Hey everyone!

I'm super excited to share something I've been working on - a tool called Camoufox. For those of you diving into the world of web scraping, you know how tricky it can be, especially with all the anti-bot solutions out there. So, I developed Camoufox to tackle exactly that. It's packed with features to make your scraping jobs a breeze, and I'm thrilled to tell you more about it.

First off, Camoufox isn't just any scraping tool. It's designed to be a ninja in a world where websites are fortresses of anti-bot defenses. We're talking about dealing with heavyweights like Datadome and coming out on top. How, you ask? Well, for starters, it offers fingerprint spoofing and has some really neat anti-detection tricks up its sleeve.

But what I'm most proud of is the human-like mouse movements and headless browsing capabilities. These features are particularly close to my heart because they mimic human interaction so closely, it's like having an invisible partner in crime on your scraping missions.

And for my fellow coders out there, yes, you can fully customize and build scrapers using Python. I've made sure that you have access to stuff like proxies, GeoIP matching, and of course, headless browsing to make your life easier.

One of my favorite aspects is utilizing a modified version of Juggler to automate Firefox in such a stealthy way, it's virtually undetectable. This is key in navigating through sites like Hermes, which we've successfully managed to scrape data from, proving Camoufox's effectiveness.

I developed Camoufox with the community in mind, knowing the challenges we face with web scraping. It's here to make your projects more feasible, bypassing those pesky anti-bot solutions with ease. Let's open up the web's treasure trove together, without letting bots and restrictions hold us back.

Would love to hear your thoughts or experiences with web scraping challenges. Let's geek out over solutions and keep pushing the boundaries!

If you want to read the full article, you can find it at this link.

Discovering the features of Camoufox, a custom and stealthy version of Firefox

20/10/2024

Hey everyone!

Just had an incredible time at the Zyte in-person conference right here in Austin, and I'm buzzing with all the insights and discussions that went down. We delved deep into the world of Large Language Models (LLMs) and their growing role in data extraction and engineering, which, let me tell you, is a fascinating arena that's rapidly evolving.

The conversations were rich and varied, covering the hurdles we face when using LLMs for web scraping, not to mention the cool techniques and applications being developed. It's inspiring to see how much potential there is and the smart solutions coming up to navigate these challenges.

We also got into the nitty-gritty of the legal side of web scraping. It’s a topic that can’t be overlooked, emphasizing how crucial it is to keep our practices ethical and polite. It’s all about respecting boundaries while innovating, and that’s a balance I believe we can strike.

And can we talk about Charity Engine for a moment? Their approach to using web scraping for charity is nothing short of remarkable. It’s a powerful reminder of how technology can be a force for good, making a real difference in the world.

Wrapping up, this event really underscored the dynamic nature of web scraping and LLMs, painting a picture of a future brimming with potential. Can't wait to see where we're headed!

If you want to read the full article, you can find it at this link.

From advanced scraping techniques to AI, here are the latest trends shown in the summit

18/10/2024

Ever dived into the world of web scraping? It’s fascinating, and for those of us looking to extract reliable data, stumbling upon web APIs hidden within websites or apps can feel like hitting the jackpot. Unlike the ever-changing landscape of HTML, APIs offer a more stable and information-rich avenue for our data extraction endeavours.

Now, it's pretty common to find unauthenticated APIs lying around on websites. Apps, though, they tend to play hard to get, safeguarding their data behind layers of security, including JWT tokens. For the uninitiated, JWT tokens are like the secret handshakes of the internet, facilitating secure info swapping between parties. These tokens, made up of a header, payload, and a signature, come with an expiry date – something absolutely critical for us in the scraping world to keep an eye on.
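To make the expiry point concrete, here is a minimal sketch of how a scraper might read a token's expiry before reusing it. It uses only the Python standard library and does not verify the signature, so it is only suitable for deciding when to refresh, not for trusting the claims. The function names, the 30-second leeway, and the demo token are assumptions made up for illustration; they do not come from the original article or from any specific app.

```python
import base64
import json
import time
from typing import Optional


def decode_jwt_payload(token: str) -> dict:
    """Return the payload claims of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload_b64 = parts[1]
    # JWT segments are URL-safe base64 with the padding stripped; restore it.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def is_expired(token: str, now: Optional[float] = None, leeway: int = 30) -> bool:
    """True if the `exp` claim is less than `leeway` seconds away (or already past)."""
    exp = decode_jwt_payload(token).get("exp")
    if exp is None:
        return False  # assumption: a token without `exp` is treated as non-expiring
    current = time.time() if now is None else now
    return current >= exp - leeway


def _b64url(data: dict) -> str:
    """Helper for the demo only: encode a dict as an unpadded base64url segment."""
    raw = json.dumps(data).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Hypothetical token built locally for the demo; the signature is a placeholder.
demo_token = ".".join([
    _b64url({"alg": "HS256", "typ": "JWT"}),
    _b64url({"sub": "scraper", "exp": 1_700_000_000}),
    "fake-signature",
])

if __name__ == "__main__":
    print(decode_jwt_payload(demo_token))
    print("needs refresh:", is_expired(demo_token))
```

In a scraper, a check like `is_expired` would typically run before each API call, so the authentication request is repeated only when the token is about to lapse.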

Let’s get a bit hands-on for a moment. Take the Tractor Supply Co.’s app, for instance. With some ingenuity, using a virtual Android device coupled with a Frida server, it’s possible to peel back the layers and see the app's inner workings. By intercepting the app traffic, we can get a glimpse of those coveted API calls, especially the ones dealing with authentication.

And here’s a little golden nugget – there’s code out there, sitting in a GitHub repository, ready to make these scraping tasks a breeze. It's all about knowing where to look and having the right tools at your disposal. Happy scraping!

If you want to read the full article, you can find it at this link.

How to create scrapers that use token authentication for API data retrieval

14/10/2024

Hey everyone, just wanted to share some reflections on how web scraping has evolved since 2014, throw in a bit of a spotlight on the hurdles we've faced, and the immense potential we're seeing unfold right in front of us. It's been quite the journey from the early days, watching the industry shift towards a marketplace model for web data, something we've embraced with our Data Boutique concept.

Digging into the various business models in web data collection has been fascinating. It's become clear that simply harvesting data isn't enough anymore. We've really got to focus on what sets us apart and how we can deliver added value to our customers.

And here's a thought to chew on - how about a shared dataset marketplace? Imagine the efficiencies we could drive in the industry with such an approach. It's not just about making life easier; it's about setting new standards and pushing boundaries. Let's chat about what this future looks like!

If you want to read the full article, you can find it at this link.

And also: is there a way to make it more efficient?

07/10/2024

Hey everyone,

I've been diving deep into customizing a GPT model specifically for web scraping tasks and thought it'd be interesting to share my journey and findings with you. Utilizing ChatGPT's web interface, I embarked on a mission to see how far I could push the boundaries by importing knowledge from both PDF and Markdown files directly into the model. The idea was to enhance its grasp on web scraping concepts and see if it could handle content extracted from these formats effectively.

During this experiment, I put the model through several tests, challenging it with content scraped from various sources to evaluate its capability in answering questions and providing summaries on web scraping topics. It wasn't all smooth sailing; I bumped into a few limitations along the way that made me pause and think about the complexities of training such a model.

Despite the hurdles encountered, I'm pretty stoked about the outcomes. The customized GPT model proved to be quite a useful tool in dealing with questions and creating summaries related to web scraping. This whole experiment has been quite an insightful adventure into the potential and versatility of GPT models when tailor-fitted for specific tasks.

Would love to hear if anyone else has been tinkering with similar projects or has insights to share on enhancing GPT models for specialized applications!

Catch you later!

If you want to read the full article, you can find it at this link.

Create The Web Scraping Club GPT by scraping my own newsletter

05/10/2024

Hey folks! Today, I'm diving into the fascinating world of web scraping and how we can smartly navigate through the increasingly sophisticated detection mechanisms websites have in place. Have you ever thought about how sites are getting so good at telling bots from humans? A big part of it has to do with tracking our mouse movements. Yes, that's right, those subtle movements you make with your mouse are being analyzed to figure out if you're a human or some automated script cruising through the site.

That's where a cool tool I've been working with comes into play – Oxymouse. It's this nifty open-source package developed by the folks at Oxylabs, and it's a game-changer for anyone in the scraping game. What it does is pretty slick. It takes advantage of browser automation giants like Playwright and Selenium and amps up their capabilities by simulating human-like mouse movements. We're not just talking any random movements here. Oxymouse uses sophisticated algorithms, including Gaussian and Perlin, to mimic the way a real person would move their mouse around a webpage.
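To give a feel for the idea, here is a rough sketch of a Gaussian-jittered mouse path written with the standard library only. It is not Oxymouse's code or API: the function name, the smoothstep easing, and the default values are assumptions chosen for illustration, and a real package would likely be more sophisticated.

```python
import random


def gaussian_mouse_path(start, end, steps=25, jitter=3.0, seed=None):
    """Return a list of (x, y) points from start to end with Gaussian jitter."""
    rng = random.Random(seed)
    (x0, y0), (x1, y1) = start, end
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Smoothstep easing: the pointer speeds up, then slows near the target.
        eased = t * t * (3 - 2 * t)
        x = x0 + (x1 - x0) * eased
        y = y0 + (y1 - y0) * eased
        # Jitter only the interior points so the path starts and ends on target.
        if 0 < i < steps:
            x += rng.gauss(0, jitter)
            y += rng.gauss(0, jitter)
        points.append((round(x, 1), round(y, 1)))
    return points


if __name__ == "__main__":
    for point in gaussian_mouse_path((10, 20), (400, 300), steps=10, seed=42):
        print(point)
```

With a browser automation tool, each point could then be fed to the pointer API in turn (for example Playwright's `page.mouse.move(x, y)`), ideally with small random pauses between moves.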

Why does this matter? Well, it's all about staying under the radar and getting the data you need without tripping any anti-bot alarms. By integrating Oxymouse into your scraping projects, you can drastically improve your chances of success. It's like giving your bot a cloak of invisibility — or at least making it blend in with the crowd.

So, if you're knee-deep in web scraping or just starting out, considering how to make your bots mimic human behavior is crucial. Oxymouse has been a vital tool in my arsenal for just that reason. It's opened up a whole new level of possibilities and has made scraping projects that much more efficient and stealthy.

Curious to give it a whirl? Dive into the tech, explore those algorithms, and let's conquer those anti-bot measures with some smart, human-like ingenuity!

If you want to read the full article, you can find it at this link.

Testing the new Oxylabs open source package for human-like mouse movements

29/09/2024

Hey everyone!

Just wanted to share some exciting moments from Oxycon, the virtual event we just hosted all about web scraping. It was an incredible day filled with insights and I'm still buzzing from the energy and conversations.

Three talks really stood out for me. First, Žydrūnas Tamašauskas deep-dived into scaling data collection processes - something we're all wrestling with as our projects grow bigger and more complex. Then, Tadas Gedgaudas opened our eyes to some really innovative ways of using mouse movements to outsmart anti-bot measures. It's fascinating to see how creativity is leading the charge against these hurdles.

But the highlight for me was presenting our latest innovation - OxyCopilot. It's an AI-powered assistant designed to make web scraping a breeze. With a custom parser builder and a request builder, it's shaping up to redefine how we approach web scraping projects. It was great to see so much enthusiasm about how these tools can streamline our work.

The event was a fantastic showcase of the strides we're making in web scraping technology. It's clear that staying at the forefront of innovation is key in this ever-evolving field. Can't wait to see where we'll go from here!

If you want to read the full article, you can find it at this link.

Three key insights from the Oxycon 2024 conference

27/09/2024

Hey everyone!

I'm thrilled to share something I've been working on - Nodriver. It's my latest creation in the world of web scraping, designed specifically for those pesky JavaScript-heavy websites. What's cool about Nodriver is that it doesn't rely on a browser driver to do its job, making it not only easier to use but also super light on its feet. Plus, it runs headless, so it's all smooth sailing without any cumbersome GUI slowing you down.

Now, I won't shy away from the fact that it's not all roses. As of now, Nodriver doesn't have the capabilities for fingerprint forging or using authenticated proxies. I know, those are pretty nifty features to have, but hear me out on what it can do.

One of the shining points of Nodriver is its knack for sneaking past those anti-bot tests, like CDP (Chrome DevTools Protocol) detection, which can be a real headache. This is where Nodriver really stands out, especially when you stack it up against something like Playwright. It's got this stealth mode vibe that makes web scraping a smooth operation, keeping you under the radar.

I'm pretty proud of what Nodriver can do and its potential to shake things up for all of us in the web scraping scene. Whether you're looking to collect data without the hassle or just tired of getting blocked, I believe Nodriver could be your new go-to.

Would love to hear your thoughts or if you're keen on giving it a whirl. Let's push the boundaries of what's possible together!

If you want to read the full article, you can find it at this link.

Testing the undetected-chromedriver successor for scraping Cloudflare protected websites

15/09/2024

Hey everyone 👋!

Just dropped the latest edition of our Proxy Pricing Playbook over at The Web Scraping Club! 🚀 Every quarter, we dive deep to bring you the latest on proxy pricing trends. Our methodology? A neat comparison of pricing plans and pay-as-you-go options (leaving out APIs for purity), all based on monthly rates to keep things consistent.

This time around, we covered the whole spectrum - data center proxies, residential, ISP, mobile, and even unblocker proxies. Noticed some interesting price shifts that you definitely don't want to miss. 📉📈

Also, for those of you into web scraping, mark your calendars 📅 for Oxycon 2024 happening on September 25th. It's shaping up to be a can't-miss event.

Would love it if you could check out the article, and hey, if you find it helpful, why not share it with friends or colleagues in the field?

Catch you next quarter for another proxy pricing update!

If you want to read the full article, you can find it at this link.

Let's discover what's happening in the proxy market

15/09/2024

Hey folks! Have you checked out our latest Proxy Pricing Playbook yet? Every three months, we dive deep into the proxy market to see what's shaking. Our goal? To make sense of the proxy pricing jungle for you. We compare prices from a variety of providers, ensuring you get the clarity you need to make the best choices.

In our latest edition, we cover everything from data center and residential proxies to ISP, mobile, and even unblocker proxies. It's all about spotting the trends and price changes that could affect your decisions.

And hey, have you heard about Oxycon 2024? It's a must-attend event for folks in the web scraping scene. Trust me, you'll want to be there.

I'd love to hear your thoughts or even assist you further. Feel free to reach out for a chat or consultation. And if you enjoy staying up-to-date with the latest trends, subscribing to our newsletter might just be your next best move. Catch you in the next update!

If you want to read the full article, you can find it at this link.

Let's discover what's happening in the proxy market

13/09/2024

Hey folks 🚀!

Choosing the right proxy provider for your web scraping projects isn't just about snagging the best price. It's way more nuanced than that. You've got to think about the type of data you're after, how to ace IP rotation, sneak past bot protections, and even consider the geographic locations of those IPs.

I'm here to spill the tea 🍵 on not just snagging a good deal but finding the perfect fit for your data scraping needs. Because let's face it, not all proxies are created equal, and the wrong choice could mean hitting roadblocks instead of data goldmines.

If the thought of sifting through proxy providers has you breaking out in a cold sweat, don't worry! I've been down in the trenches and come back with some killer strategies. I even developed a nifty tool to compare pricing plans across providers so you can make informed decisions without the headache.

But wait, there's more! I've put together a rock-solid methodology for testing proxy providers, focusing on how well they handle IP rotation and geographical targeting. Because, in the end, it's all about getting those high-quality, relevant data extracts without drawing unnecessary attention.
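As a rough illustration of what such a test could measure, the sketch below summarizes a batch of requests sent through a proxy. It assumes you have already recorded, for each request, the exit IP and the country it geolocates to; how you collect those pairs is up to you. The function name, the metric names, and the sample data (documentation-range IPs) are hypothetical and are not taken from the original article.

```python
from collections import Counter


def summarize_proxy_run(observations, target_country):
    """Summarize a list of (exit_ip, country_code) pairs, one per request."""
    if not observations:
        raise ValueError("no observations to summarize")
    total = len(observations)
    ip_counts = Counter(ip for ip, _ in observations)
    geo_hits = sum(1 for _, country in observations if country == target_country)
    return {
        "requests": total,
        "unique_ips": len(ip_counts),
        # 1.0 means every request left from a different IP.
        "rotation_ratio": len(ip_counts) / total,
        # Share of requests whose exit IP was in the requested country.
        "geo_accuracy": geo_hits / total,
        "most_reused_ip": ip_counts.most_common(1)[0],
    }


# Hypothetical sample: four requests that asked for Italian exit IPs.
sample = [
    ("203.0.113.10", "IT"),
    ("203.0.113.11", "IT"),
    ("203.0.113.10", "IT"),
    ("198.51.100.7", "FR"),
]
report = summarize_proxy_run(sample, target_country="IT")

if __name__ == "__main__":
    for key, value in report.items():
        print(f"{key}: {value}")
```

Running the same summary against several providers with the same target and request count gives numbers that can be compared side by side with their pricing.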

Fancy a chat on optimizing your web data collection setup? Slide into my DMs. Whether you're just starting out or looking to fine-tune your operations, I'm all about helping companies navigate these choppy waters. Let's make your data collection as smooth as silk! 🚀

Catch you later!

If you want to read the full article, you can find it at this link.

Measuring programmatically the quality of the IPs offered by proxy providers

12/09/2024

Hey folks!

Diving deep into the world of web scraping, I've realized there's a ton to consider when hunting for the perfect proxy provider. While it's tempting to just look at the price tag and make a call, there’s a whole lot more under the hood that needs our attention.

First off, what are you trying to scrape? And, oh, let’s not forget about the ever-present bot protections that are getting trickier by the day. These factors are critical and vary greatly depending on the project at hand, so they need to be front and center in your decision-making process.

It's fascinating to see the variety of pricing models out there. However, beyond the dollars and cents, we've got to peer into the specifics – like the size of the IP pool and whether the locations of these IPs make sense for what we're trying to accomplish. Trust me, these details can make or break your data collection.

And here’s a pro tip: don’t skimp on the testing phase. There are some neat tools and methodologies to really push these proxy providers to their limits before you commit. Evaluating their performance can save you a bunch of headaches down the road.

Ultimately, it's all about doing your homework and looking beyond the surface to ensure you're picking a proxy provider that aligns with your project goals. A little effort upfront can save a ton of time and resources later on.

Cheers to smarter scraping! 🚀📊

If you want to read the full article, you can find it at this link.

Measuring programmatically the quality of the IPs offered by proxy providers
