Mitigating Web Scrapers using Markup Randomization

Title: Mitigating Web Scrapers using Markup Randomization
Publication Type: Conference Paper
Year of Publication: 2021
Authors: Bolbol, Noor; Barhoom, Tawfiq
Conference Name: 2021 Palestinian International Conference on Information and Communication Technology (PICICT)
Date Published: September
Keywords: Blogs, Collaboration, composability, content security, Crawlers, data mining, information and communication technology, Information Reuse, machine learning algorithms, markup HTML, middleware, middleware security, policy-based governance, pubcrawl, Randomization, security, Web crawler, Web scraping

Web scraping is the technique of extracting desired data in an automated way by scanning the internal links and content of a website; it is usually performed by systematically programmed bots. This paper presents our proposed solution for protecting blog content from theft and from being copied to other destinations by mitigating scraping bots. The solution works in two steps at two levels. The first step, at the main blog page level, hinders crawler bots by adding extra empty article anchors among the real article anchors. The second step, at the article page level, adds a random number of empty and hidden spans with randomly generated text throughout the article's body. To assess the solution, we applied it to a local project developed in PHP with the Laravel framework and defined four criteria to measure its effectiveness. The results show that the change in file size before and after applying the solution is not significant, and that processing time increases by a few milliseconds, which remains within the acceptable range. Using the HTML-similarity tool, we obtained very good results, showing high similarity in style with only slight changes in structure. Finally, to assess the effect on bots, we reran a scraper bot and obtained the expected results from the programmed middleware. These results show that the solution is feasible to adopt for protecting blog content.
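
The two steps described in the abstract can be illustrated with a minimal sketch. It is written in Python for brevity and is not the authors' PHP/Laravel middleware. Every name here (`randomize_article`, `add_decoy_anchors`, `strip_hidden`), the `display:none` styling, the `href="#"` placeholder, and the choice to insert spans after each closing `</p>` tag are assumptions made for illustration. The abstract does not specify how the spans or anchors are hidden or where exactly they are placed.

```python
import random
import re
import string


def random_text(rng, length=8):
    """Return a random lowercase string used as decoy span content."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def hidden_span(rng):
    """Build one hidden span that is either empty or holds random text.

    Hiding via inline display:none is an assumption of this sketch.
    """
    text = random_text(rng) if rng.random() < 0.5 else ""
    return f'<span style="display:none">{text}</span>'


def randomize_article(body_html, rng, max_spans=3):
    """Article level: add a random number of hidden spans after each </p>."""
    def add_spans(match):
        count = rng.randint(0, max_spans)
        return match.group(0) + "".join(hidden_span(rng) for _ in range(count))

    return re.sub(r"</p>", add_spans, body_html)


def add_decoy_anchors(anchors, rng, max_decoys=2):
    """Blog index level: put empty decoy anchors in front of real anchors."""
    mixed = []
    for anchor in anchors:
        for _ in range(rng.randint(0, max_decoys)):
            # Hypothetical decoy: an empty, hidden anchor with a dummy target.
            mixed.append('<a href="#" style="display:none"></a>')
        mixed.append(anchor)
    return mixed


def strip_hidden(html):
    """Remove the decoy spans again, to confirm visible content is unchanged."""
    return re.sub(r'<span style="display:none">[a-z]*</span>', "", html)
```

Because the number and content of the inserted elements change on every response, a scraper that relies on fixed element positions or a stable markup structure sees a different document each time, while the rendered page stays the same for human readers. Calling `strip_hidden(randomize_article(body, random.Random()))` should return the original `body`, which is a quick way to check that the visible content is preserved.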

Citation Key: bolbol_mitigating_2021