Web Scraping (Almost Every WordPress Website) with Python – Code Included!

Scraping a WordPress website and publishing its content to another WordPress website can save significant time and effort in content creation. The process involves extracting data from one site and importing it into another, letting you reuse the content in a new way. There are several methods for scraping WordPress websites, depending on your needs and on the size and complexity of the site you are scraping: using a plugin, using a scraping tool, or writing a custom code solution.

In this tutorial, we will be exploring the option of using a custom code solution to scrape a WordPress website and import its content into another WordPress website. This method allows for greater flexibility and control over the scraping process and is ideal for more complex websites or those with specific requirements. We will be using the Python programming language to achieve this task, along with the popular libraries BeautifulSoup and Requests. These libraries will help us parse the HTML content of the website and make HTTP requests, respectively. By the end of this tutorial, you will have a better understanding of how to scrape a WordPress website using Python and import its content into another WordPress website.
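Before collecting anything, it is worth checking that the target site actually exposes the standard WordPress REST API, since everything below depends on it. Here is a minimal sketch (the domain is a placeholder, exactly like the ones used later in this tutorial):

```python
import requests

def rest_api_root(site: str) -> str:
    """Build the standard WordPress REST API root URL for a site."""
    return "https://" + site.strip().rstrip("/") + "/wp-json/wp/v2/"

def has_rest_api(site: str) -> bool:
    """Return True if the site answers 200 on the standard posts endpoint."""
    try:
        r = requests.get(rest_api_root(site) + "posts", timeout=10)
        return r.status_code == 200
    except requests.RequestException:
        return False

# Example (placeholder domain):
# has_rest_api("URL_OF_THE_WEBSITE_YOU_WANT_TO_SCRAPE")
```

If this returns False, the site has disabled or restricted the REST API and the scripts below will not work against it.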

Here is the code for collecting post IDs via the WordPress REST API:

import requests

headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "accept-language": "ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7,de;q=0.6",
    "cache-control": "max-age=0",
    "sec-ch-ua": "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"101\", \"Google Chrome\";v=\"101\"",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "\"Windows\"",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1"
  }

with open("filename.txt", "w") as f:  # Creates filename.txt next to this script; post IDs are saved to it automatically.
    for i in range(1, 10000):
        try:
            g = requests.get("https://URL_OF_THE_WEBSITE_YOU_WANT_TO_SCRAPE/wp-json/wp/v2/posts/?page=" + str(i), headers=headers)
            js = g.json()
            if not isinstance(js, list) or not js:
                break  # past the last page: the API returns an error object instead of a list
            for j in js:
                post_id = str(j['id'])  # post_id avoids shadowing the built-in id()
                f.write(post_id + "\n")
                print(post_id)
        except (requests.RequestException, ValueError):
            break  # network error or non-JSON response: stop instead of silently looping
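The loop above simply guesses at an upper bound of 10,000 pages. The WordPress REST API actually reports the real page count in the `X-WP-TotalPages` response header, which you could use to loop exactly as many times as needed; a small sketch (the URL is a placeholder):

```python
import requests

def total_pages(resp: requests.Response) -> int:
    """WordPress reports the page count in the X-WP-TotalPages header."""
    return int(resp.headers.get("X-WP-TotalPages", "1"))

# Sketch of using it with the loop above (placeholder URL):
# first = requests.get("https://URL_OF_THE_WEBSITE_YOU_WANT_TO_SCRAPE/wp-json/wp/v2/posts/?page=1", headers=headers)
# for page in range(1, total_pages(first) + 1):
#     ...fetch and record the IDs on each page...
```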

This is the code where you add your WordPress details and publish the scraped content:

import html
import json
import requests
import re
from slugify import slugify
from bs4 import BeautifulSoup
import time




def pop(file):
    """Return the first line of the file and remove it from the file."""
    with open(file, 'r+') as f:  # open the file in read/write mode
        firstLine = f.readline()  # read the first line
        data = f.read()           # read the rest
        f.seek(0)                 # move back to the top of the file
        f.write(data)             # write the remaining lines back
        f.truncate()              # cut off the leftover tail
        return firstLine
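To see exactly what `pop` does, here is a self-contained sketch against a throwaway file (the function is repeated so the snippet runs on its own; the file name is arbitrary):

```python
import os
import tempfile

def pop(file):
    """Return the first line of the file and remove it from the file."""
    with open(file, 'r+') as f:
        firstLine = f.readline()
        data = f.read()
        f.seek(0)
        f.write(data)
        f.truncate()
        return firstLine

# Build a throwaway file with three IDs, one per line.
path = os.path.join(tempfile.mkdtemp(), "ids.txt")
with open(path, "w") as f:
    f.write("101\n102\n103\n")

print(pop(path).strip())          # prints 101
with open(path) as f:
    print(f.read().splitlines())  # prints ['102', '103']
```

Each call hands you the next queued ID and shrinks the file, which is what lets the publishing loop below work through the list one post at a time.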



def loadJson(url, id):
    url = str(url).strip()
    post_url = "https://" + url + "/wp-json/wp/v2/posts/" + str(id).strip()
    g1_post = requests.get(post_url, verify=False).json()  # verify=False skips TLS certificate checks
    title = re.sub('&(.*?);', '', str(g1_post['title']['rendered']))

    # Strip the table-of-contents block, if the post has one
    try:
        soup = BeautifulSoup(str(g1_post['content']['rendered']), "html.parser")
        toc = soup.find('div', attrs={"id": "toc_container"})
        toc.decompose()
        content = str(soup)
    except AttributeError:
        content = g1_post['content']['rendered']

    cat_id = g1_post['categories'][0]
    g1_cat = requests.get("https://" + url + "/wp-json/wp/v2/categories/" + str(cat_id), verify=False).json()
    cat_title = g1_cat['name']

    # Remove brackets, "SOLVED" tags, smart quotes and other punctuation from the title
    clean_title = html.unescape(title)
    for token in ('[', ']', '/', 'SOLVED', 'Solved', '”', '’', '-', ':', '“', '(', ')'):
        clean_title = clean_title.replace(token, '')

    return {
        "title": clean_title,
        "slug": slugify(clean_title),
        "content": content + '<br><br>Source: ' + url,
        "cat_title": cat_title,
        "cat_slug": slugify(cat_title),
    }





from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.methods import posts
import xmlrpc.client


while True:
    first = pop('filename.txt').strip()
    data = loadJson("HERE_GOES_WEBSITE_YOU_TOOK_IDS_FROM", first)

    client = Client('https://YOUR_URL.com/xmlrpc.php', 'WordPress_Username', 'WordPress_Password')

    post = WordPressPost()
    post.title = data['title']
    post.content = data['content']
    post.terms_names = {
        'category': [data['cat_title']]
    }
    
    success = False
    retries = 0
    while not success and retries < 3:
        try:
            post.id = client.call(posts.NewPost(post))
            post.post_status = 'publish'
            client.call(posts.EditPost(post.id, post))
            success = True
        except xmlrpc.client.ProtocolError as e:
            print(f"Caught ProtocolError: {e}")
            print(f"Retrying in 5 seconds... ({retries+1}/3)")
            time.sleep(5)
            retries += 1

    if success:
        print("Post created successfully")
    else:
        print("Failed to create post after 3 retries")

    time.sleep(19)  # Delay between posts, in seconds. Change to your preference.

This is not perfect code, but it does the job. Don’t forget to add your WordPress details and the URL of the website you are scraping to the script.


The placeholders written in ALL CAPS mark what needs to be changed in order for the script to work.

In conclusion, web scraping with Python is a powerful tool for extracting data from websites and importing it into another WordPress website. Using the Requests and BeautifulSoup libraries, we can make HTTP requests, parse HTML content, and scrape most WordPress sites that expose the standard REST API. With the help of this tutorial, you should now have a better understanding of how to scrape a WordPress website with Python and import its content into another WordPress installation.

Thank you for reading this tutorial and we hope it was helpful. If you have any questions or comments, feel free to ask. We would love to hear from you. Don’t forget to subscribe to our channel for more tutorials on Python and other programming languages.

By Philip Anderson