Introduction

This post shows how to run Python Playwright scripts inside Docker, a popular choice among developers for its simplicity and consistency. Docker lets you bundle your script and everything it needs into a single container, so it runs the same way across different environments.

Playwright, an automation library developed by Microsoft, lets developers automate and test web applications across multiple browsers (Chromium, Firefox, and WebKit). Combined with Docker, it gives you a reproducible setup for web scraping and automation that avoids the usual problems caused by differences between environments.

We will cover the basics of using Docker to run your Playwright scripts: setting up Playwright, writing a simple scraping script, Dockerizing the application, and running it against a target URL that you specify at run time.

Let’s get started and see how this powerful combination can streamline your web scraping tasks!

Prerequisites for Docker and Playwright

Before setting up and running Playwright with Docker, make sure the following tools are installed.

1. Docker Installation

Ensure Docker is installed on your machine. Docker allows you to create and run containers, providing a consistent environment for your applications.

How to Install and Use Docker on Ubuntu 20.04:

https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-20-04

2. Playwright Installation

Install the Playwright package with pip, the Python package installer, then run ‘playwright install’ to download the browser binaries that Playwright controls.

pip install playwright
playwright install

With these prerequisites in place, you are ready to follow the steps below and run your Playwright scripts inside Docker containers.
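As a quick sanity check, the short sketch below reports whether the Docker CLI is on your PATH and whether the Playwright package can be imported. The function names ‘check_prerequisites’ and ‘format_report’ are made up for this example, and the check does not confirm that the browser binaries themselves have been downloaded.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Report whether the Docker CLI and the Playwright package can be found."""
    return {
        # shutil.which returns the executable's path, or None if it is not on PATH
        "docker": shutil.which("docker") is not None,
        # find_spec returns None when the package cannot be imported
        "playwright": importlib.util.find_spec("playwright") is not None,
    }


def format_report(status):
    """Turn the status mapping into one readable line per tool."""
    return [f"{name}: {'found' if ok else 'missing'}" for name, ok in status.items()]
```

Calling print("\n".join(format_report(check_prerequisites()))) prints one line per tool, so a missing install is easy to spot before you continue.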

Writing the Scraper

Now we will create a Python file named scraper.py and add the following code to it.

import os

from playwright.sync_api import sync_playwright


def run(playwright):
    # Launch Chromium without a visible window
    browser = playwright.chromium.launch(headless=True)
    page = browser.new_page()
    # Read the target URL from the environment, falling back to a default
    url = os.getenv('TARGET_URL', 'https://example.com')
    page.goto(url)
    # Print the page's HTML to stdout
    content = page.content()
    print(content)
    browser.close()


with sync_playwright() as playwright:
    run(playwright)

The script imports ‘os’ to read environment variables and ‘sync_playwright’ to use Playwright's synchronous API. The core function, ‘run’, launches a headless Chromium browser, opens a new page, and navigates to the URL given by the ‘TARGET_URL’ environment variable (defaulting to ‘https://example.com’ if it is not set). It then retrieves and prints the page's HTML before closing the browser. The ‘with’ statement starts Playwright and shuts it down cleanly when the block exits.
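If you later want to reuse the scraping logic, one possible variant returns the HTML instead of printing it, sets an explicit navigation timeout, and closes the browser in a ‘finally’ block. This is only a sketch: the name ‘fetch_html’ and the ‘timeout_ms’ parameter are assumptions made for this example and are not part of the script above.

```python
def fetch_html(playwright, url, timeout_ms=30_000):
    """Return the HTML of `url`, closing the browser even if navigation fails.

    `playwright` is the object yielded by `sync_playwright()`.
    `fetch_html` and `timeout_ms` are hypothetical names used for illustration.
    """
    browser = playwright.chromium.launch(headless=True)
    try:
        page = browser.new_page()
        # goto raises on navigation errors or when the timeout (in ms) is exceeded
        page.goto(url, timeout=timeout_ms)
        return page.content()
    finally:
        # Runs on success and on failure, so the browser is closed on both paths
        browser.close()


# Usage (requires Playwright and its browsers to be installed):
#   from playwright.sync_api import sync_playwright
#   with sync_playwright() as p:
#       print(fetch_html(p, "https://example.com"))
```

Because the function takes the Playwright object as an argument, it can also be exercised with a stand-in object in tests, without launching a real browser.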

Create a Dockerfile

FROM python:3.8


WORKDIR /app


# Copy the Python script into the container
COPY . .


# Install Python dependencies
RUN pip install playwright


# Run playwright install to ensure all browsers are downloaded
RUN playwright install --with-deps


# Command to run the scraper script
CMD ["python", "scraper.py"]

1. FROM python:3.8

  • This line specifies the base image for our Docker image. We are using the official Python 3.8 image from Docker Hub, a Debian-based image that ships with Python 3.8 and pip preinstalled.

2. WORKDIR /app

  • This command sets the working directory inside the container to ‘/app’. All subsequent commands (e.g., COPY, RUN) will be executed relative to this directory. If the directory does not exist, it will be created.

3. COPY . .

  • The ‘COPY’ command copies all files from the current directory on the host machine to the ‘/app’ directory in the container. This includes your scraper.py script and any other necessary files.

4. RUN pip install playwright

  • This command installs the Playwright package using ‘pip’, the Python package manager. It ensures that Playwright is available in the container for running our script.

5. RUN playwright install --with-deps

  • This command downloads the browser binaries (Chromium, Firefox, WebKit) that Playwright needs to operate. The ‘--with-deps’ flag also installs the operating system packages that those browsers depend on. This step is crucial for making sure that Playwright can run the browsers in the container.

6. CMD ["python", "scraper.py"]

  • The ‘CMD’ instruction specifies the command that will be run when the container starts. In this case, it tells the container to execute ‘python scraper.py’, which runs our scraping script. This script will use the Playwright library to perform web scraping.

Building the Docker Image

To build a Docker image, you need to use the `docker build` command, which reads the instructions in your Dockerfile and creates an image based on those instructions. Here’s how to do it:

1. Open a Terminal:

   Navigate to the directory containing your Dockerfile and your Python script (scraper.py).

2. Run the Docker Build Command:  

docker build -t playwright-scraping .

  • docker build: This command tells Docker to create a new image from the Dockerfile.
  • -t playwright-scraping: The ‘-t’ flag allows you to tag the image with a name. Here, we’re tagging it as ‘playwright-scraping’. This name can be anything you choose.
  • . : The dot at the end specifies the build context, which is the current directory. Docker will look for the Dockerfile in this directory and use it to build the image.

During this process, Docker will execute the instructions in your Dockerfile step by step, creating layers for each instruction. Once the build is complete, you will have a Docker image named ‘playwright-scraping’.
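If you would rather trigger the build from Python, for example as part of a larger automation script, a minimal sketch could wrap the same command with ‘subprocess’. The helper names ‘build_command’ and ‘build_image’ are assumptions for this example; the Docker arguments are the ones shown above.

```python
import subprocess


def build_command(tag="playwright-scraping", context="."):
    """Return the argument list for `docker build -t <tag> <context>`."""
    return ["docker", "build", "-t", tag, context]


def build_image(tag="playwright-scraping", context="."):
    """Run the build and raise CalledProcessError if Docker exits non-zero."""
    subprocess.run(build_command(tag, context), check=True)
```

Keeping the command construction in its own function makes it easy to inspect or log the exact command before running it.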

Running the Container

Once you have built your Docker image, you can run a container based on that image. Using an environment variable allows you to pass dynamic data (like a target URL) into the container. Here’s how to do it:

1. Execute the Docker Run Command as given below: 

docker run -e TARGET_URL='https://example.com' playwright-scraping

  • docker run: This command creates and starts a new container from the specified image.
  • -e TARGET_URL='https://example.com': The ‘-e’ flag sets an environment variable inside the container. Here, we are setting ‘TARGET_URL’ to ‘https://example.com’. This allows our Python script to dynamically use this URL.
  • playwright-scraping: This is the name of the Docker image we built earlier.

When you run this command, Docker will:

  • Start a new container from the ‘playwright-scraping’ image.
  • Set the environment variable ‘TARGET_URL’ to ‘https://example.com’ inside the container.
  • Execute the ‘CMD’ instruction specified in the Dockerfile, which runs ‘python scraper.py’.

Your Python script (scraper.py) then reads the ‘TARGET_URL’ environment variable and scrapes the specified URL. Because the script prints the page's HTML to standard output, the content appears in your terminal.
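To scrape several pages, you could start one container per URL from a small Python driver and capture what each container prints. The sketch below is one way to do that, under a few assumptions: the helper names ‘run_command’ and ‘scrape’ are invented for this example, and the ‘--rm’ flag, which removes each container after it exits, is an addition that the command above does not use.

```python
import subprocess


def run_command(url, image="playwright-scraping"):
    """Return the argument list for one `docker run` with TARGET_URL set."""
    # Passing a list (no shell) means the URL needs no extra quoting
    return ["docker", "run", "--rm", "-e", f"TARGET_URL={url}", image]


def scrape(url, image="playwright-scraping"):
    """Run one container and return the HTML it printed to stdout."""
    result = subprocess.run(
        run_command(url, image),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the container exits non-zero
    )
    return result.stdout
```

With the image built, html = scrape("https://example.com") would return the same HTML that the plain ‘docker run’ command prints to the terminal.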

Summary

By building the Docker image and running the container with an environment variable, you achieve a flexible and reproducible setup for web scraping with Playwright. This approach ensures that your script runs consistently across different environments, leveraging Docker’s powerful containerization capabilities.

