Introduction

Extracting video URLs, image URLs, and text from a web page is easy with Selenium and Beautiful Soup in Python. If a video's src attribute holds a direct URL such as “https://…video.mp4”, we can access the video directly.

However, many websites use blob URLs, such as src="blob:https://video_url". We can extract these values with Selenium and bs4, but we cannot open them directly because the browser generates them internally.
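As a rough illustration of the difference, the sketch below separates direct media URLs from blob URLs found in <video> and <source> tags. It uses Python's built-in html.parser instead of Selenium and Beautiful Soup so that it runs without extra packages. The HTML fragment, the class name VideoSrcCollector, and the function split_sources are made up for this example.

```python
from html.parser import HTMLParser


class VideoSrcCollector(HTMLParser):
    """Collect the src attributes of <video> and <source> tags."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("video", "source"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def split_sources(html):
    """Return (direct_urls, blob_urls) for the given HTML string."""
    collector = VideoSrcCollector()
    collector.feed(html)
    direct = [s for s in collector.sources if not s.startswith("blob:")]
    blobs = [s for s in collector.sources if s.startswith("blob:")]
    return direct, blobs


# Hypothetical page fragment, for illustration only
SAMPLE_HTML = """
<video src="https://example.com/media/clip.mp4"></video>
<video src="blob:https://example.com/1b2c3d4e"></video>
"""

direct_urls, blob_urls = split_sources(SAMPLE_HTML)
print("direct:", direct_urls)
print("blob:", blob_urls)
```

Only the URLs in the first list can be fetched from outside the browser; the blob entries need one of the approaches described in the rest of this post.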

What are BLOB URLs?

Blob URLs are generated internally by the browser. URL.createObjectURL() creates a special reference to a Blob or File object, which can later be released with URL.revokeObjectURL(). These URLs can only be used locally, within a single browser instance and the same session.

BLOB URLs are typically used to display or play multimedia content, such as videos, directly in a web browser or media player, without the need to download the content to the user’s local device. They are often used in conjunction with HTML5 video elements, which allow web developers to embed video content directly into a web page, using a simple <video> tag.

To work around this, we found two methods that extract the actual video URL:

  1. yt-dlp
  2. Selenium + Network logs

yt-dlp

yt-dlp is a handy command-line tool for downloading YouTube videos. It can also extract other attributes of a video, such as its title, description, and tags. With a few additional options, it can extract video URLs from regular (non-YouTube) web pages as well. The steps and sample code are below.

Install yt-dlp on Ubuntu:
sudo snap install yt-dlp

Below is simple code that extracts video URLs by calling yt-dlp through Python's subprocess module. It uses options such as -f, -g, and -q; they are described in the yt-dlp GitHub repository.

import subprocess
from urllib.parse import urlparse

DIRECT_EXTENSIONS = (".mp4", ".mp3", ".mov", ".webm")


def get_video_urls(url):
    # -f all: every available format; -g: print media URLs instead of downloading; -q: quiet
    command = ["yt-dlp", "-f", "all", "-g", "-q", "--ignore-errors",
               "--no-warnings", url]
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=15)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child process when the timeout expires
        return []

    candidates = [line.strip() for line in result.stdout.splitlines() if line.strip()]

    # Compare the URL path only, so a query string does not hide the extension
    videos_url = [u for u in candidates if urlparse(u).path.endswith(DIRECT_EXTENSIONS)]

    # Fall back to streaming playlists when no direct media file was found
    if not videos_url:
        videos_url = [u for u in candidates if urlparse(u).path.endswith(".m3u8")]

    return videos_url


print(get_video_urls(url="https://edition.cnn.com/videos/world/2022/12/06/china-beijing-covid-restrictions-wang-dnt-ebof-vpx.cnn"))
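The short flags are compact but hard to read later. As a sketch, the same command can be assembled with yt-dlp's long option names (--format, --get-url, --quiet, --ignore-errors, --no-warnings). The helper name build_ytdlp_command and the placeholder URL are only illustrative.

```python
def build_ytdlp_command(url):
    """Build the yt-dlp argument list using long option names."""
    return [
        "yt-dlp",
        "--format", "all",   # same as -f all: consider every available format
        "--get-url",         # same as -g: print the media URL instead of downloading
        "--quiet",           # same as -q: suppress progress output
        "--ignore-errors",   # keep going when a single item fails
        "--no-warnings",     # hide warning messages
        url,
    ]


command = build_ytdlp_command("https://example.com/watch/123")  # placeholder URL
print(" ".join(command))
```

The returned list can be passed to subprocess in place of the hand-written argument list.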

Selenium + Network logs

Whenever a website uses blob URLs and the video is playing, the streaming URL (.m3u8) for that video shows up in the Network tab of the browser's developer tools. We can read the same information from the browser's network and performance logs to find the streaming URLs.
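Each entry that Selenium returns from Chrome's performance log is a dictionary whose "message" field is a JSON string wrapping one DevTools event. The sketch below decodes such an entry and pulls out the request URL. The sample entry is hand-written and trimmed to the fields used here, so real entries will carry many more keys; the function name request_url_from_entry is hypothetical.

```python
import json

# Hand-written stand-in for one item of driver.get_log("performance")
SAMPLE_ENTRY = {
    "level": "INFO",
    "timestamp": 1670000000000,
    "message": json.dumps({
        "webview": "HYPOTHETICAL-TARGET-ID",
        "message": {
            "method": "Network.requestWillBeSent",
            "params": {
                "requestId": "1000.1",
                "request": {"url": "https://example.com/hls/master.m3u8", "method": "GET"},
            },
        },
    }),
}


def request_url_from_entry(entry):
    """Return the request URL carried by a performance-log entry, or None."""
    event = json.loads(entry["message"])["message"]
    if not event.get("method", "").startswith("Network."):
        return None
    return event.get("params", {}).get("request", {}).get("url")


found = request_url_from_entry(SAMPLE_ENTRY)
print(found)
```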

What is M3U8?

M3U8 is a plain-text playlist format, encoded in UTF-8, that specifies the locations of one or more media files. It is commonly used to describe a playlist of audio or video files for streaming over the internet, most notably with HTTP Live Streaming (HLS), and can be played by media players that support the format, such as VLC, Apple’s iTunes, and QuickTime. The file typically has the “.m3u8” extension and begins with the #EXTM3U header line. In a media playlist, each media segment URI is preceded by an #EXTINF tag that gives its duration and an optional title. A master playlist instead lists references to other M3U8 files, one per variant stream (for example, one per quality level).
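To make the format concrete, here is a minimal sketch that reads an HLS media playlist and pairs each segment URI with the duration from the #EXTINF tag that precedes it. The playlist text and segment names are invented for the example, and the parser deliberately ignores every other tag.

```python
SAMPLE_PLAYLIST = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:9.009,
segment0.ts
#EXTINF:9.009,
segment1.ts
#EXTINF:3.003,
segment2.ts
#EXT-X-ENDLIST
"""


def parse_media_playlist(text):
    """Return a list of (duration_seconds, uri) tuples from a media playlist."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or lines[0] != "#EXTM3U":
        raise ValueError("not an M3U8 playlist: missing #EXTM3U header")

    segments = []
    pending_duration = None
    for line in lines[1:]:
        if line.startswith("#EXTINF:"):
            # Tag layout: "#EXTINF:<duration>,<optional title>"
            pending_duration = float(line[len("#EXTINF:"):].split(",")[0])
        elif not line.startswith("#") and pending_duration is not None:
            # A non-tag line after #EXTINF is the segment URI
            segments.append((pending_duration, line))
            pending_duration = None
    return segments


segments = parse_media_playlist(SAMPLE_PLAYLIST)
for duration, uri in segments:
    print(f"{uri}: {duration}s")
```

Segment URIs are often relative, as in this sample, so a real downloader would resolve them against the playlist's own URL.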

We can extract the network and performance logs using Selenium with some advanced options. First, install the required packages:

pip install selenium
pip install webdriver_manager

Below is the sample code for getting streaming URL (.m3u8) using selenium and network logs:

import json
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
# Record DevTools events, including network traffic, in the "performance" log.
# The capability is set on the options object because newer Selenium 4
# releases no longer accept the desired_capabilities argument.
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})

options.add_argument("--no-sandbox")
options.add_argument("--headless")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--start-maximized")
options.add_argument("--autoplay-policy=no-user-gesture-required")
options.add_argument("--disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument("--ignore-certificate-errors")
options.add_argument("--mute-audio")
options.add_argument("--disable-notifications")
options.add_argument("--disable-popup-blocking")

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                          options=options)


def get_m3u8_urls(url):
    try:
        driver.get(url)
        # Scroll down so the player loads, then give the video time to start streaming
        driver.execute_script("window.scrollTo(0, 10000)")
        time.sleep(20)
        logs = driver.get_log("performance")
    finally:
        # quit() ends the whole browser session, not just the current window
        driver.quit()

    url_list = []
    for log in logs:
        # Each entry wraps one DevTools event as a JSON string
        network_log = json.loads(log["message"])["message"]
        if not network_log["method"].startswith(
                ("Network.response", "Network.request", "Network.webSocket")):
            continue
        request_url = network_log["params"].get("request", {}).get("url", "")
        # Keep playlist URLs and skip browser-internal blob URLs
        if ".m3u8" in request_url and "blob" not in request_url:
            url_list.append(request_url)

    return url_list


if __name__ == "__main__":
    url = "https://fruitlab.com/video/aTUqTrJrMtj6FgO5?ntp=ggm"
    url_list = get_m3u8_urls(url)
    print(url_list)

Once you have the streaming URL, you can play it in the VLC media player using its Open Network Stream option.

The m3u8 stream can also be downloaded as an .mp4 file using FFmpeg. It can be installed on Ubuntu using:

sudo apt install ffmpeg

After installing FFmpeg, we can download the video with the following command:

ffmpeg -i http://..m3u8 -c copy -bsf:a aac_adtstoasc output.mp4
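If the download needs to happen from the same Python script, the command can be wrapped with subprocess. This is a sketch that assumes ffmpeg is on the PATH; the function names and the placeholder URL are illustrative, and the download call is left commented out so that nothing is fetched by accident.

```python
import subprocess


def build_ffmpeg_command(m3u8_url, output_path):
    """Build the ffmpeg argument list that copies an HLS stream into an MP4 file."""
    return [
        "ffmpeg",
        "-i", m3u8_url,              # input playlist
        "-c", "copy",                # copy the streams without re-encoding
        "-bsf:a", "aac_adtstoasc",   # convert ADTS AAC audio for the MP4 container
        output_path,
    ]


def download_stream(m3u8_url, output_path):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ffmpeg_command(m3u8_url, output_path), check=True)


command = build_ffmpeg_command("https://example.com/hls/master.m3u8", "output.mp4")
print(" ".join(command))
# download_stream("https://example.com/hls/master.m3u8", "output.mp4")
```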

We hope you find these two approaches to advanced video scraping useful. Do let us know if you have any questions.


Looking to simplify your video scraping tasks or need expert assistance with video analytics? Look no further! Contact us today or share your requirements at letstalk@pragnakalp.com to schedule a consultation with our computer vision expert. Let’s work together to find hassle-free solutions tailored to your needs.
