# SPDX-FileCopyrightText: © 2024 Seirdy
# SPDX-FileCopyrightText: © 2025 Bart Groeneveld
# SPDX-License-Identifier: CC-BY-SA-4.0

# My own additions:

User-agent: ia_archiver
Disallow: /

# The motivation
# Imagine someone walking down the street. That person is certainly in public
# space and can be seen by anyone. Now change the situation by adding a camera
# that films this public space and records every moment to another publicly
# accessible space. We all understand there is a difference between the
# situation before and after the camera was added, right? It is the same for
# websites: everyone can view this website, but I ask people not to put a
# 'camera' on it that records every moment to another publicly accessible space.

# Q: How about The Internet Archive?
# A: I consider archiving necessary only for actual public information. Think
# governments, news, court decisions, etc. Not personal websites.

# Q: Do you have something to hide?
# A: Yes. Not necessarily because my actions are questionable,
# but because your judgement may be.
# But also because I make errors. I should be able to fix them.
# I should not be haunted by my wrongdoings for the rest of my life.

# Q: How about Bitcoin and other blockchain projects?
# A: That's why I don't use them.

# Q: Then how do you use Git?
# A: git reset --soft HEAD~1; git push --force-with-lease

# Q: Are you aware that bots can ignore a robots.txt file?
# A: Yes. Just as a DNT header can be ignored, or a trespasser can ignore a
# "Do not pass" sign. It is mainly to signal intent; no one has an excuse now.

# The rest of this file is sourced from https://seirdy.one/robots.txt.
# I could not find the license of that file,
# but that site includes 'CC-BY-SA-4.0' in the footer of all pages.
# It is copied almost verbatim (except, of course, the sitemap)
# to make diffing and future updates easier.
# How to find updates:
# curl https://seirdy.one/robots.txt | diff - layouts/robots.txt
# Please see that file for an explanation of almost every entry,
# including intentionally-excluded entries.

User-agent: *
Disallow: /noindex/
Disallow: /misc/

User-Agent: peer39_crawler/1.0
Disallow: /

## IP-violation scanners ##

User-Agent: TurnitinBot
Disallow: /

User-Agent: AcademicBotRTU
Disallow: /

User-Agent: SlySearch
Disallow: /

User-Agent: BLEXBot
Disallow: /

User-agent: CheckMarkNetwork/1.0 (+https://www.checkmarknetwork.com/spider.html)
Disallow: /

User-agent: BrandVerity/1.0
Disallow: /

## Misc. icky stuff ##

User-agent: PiplBot
Disallow: /

# Well-known overly-aggressive bot that claims to respect robots.txt: http://mj12bot.com/
User-agent: MJ12bot
Crawl-Delay: 10

## Gen-AI data scrapers ##

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-Agent: FacebookBot
User-Agent: meta-externalagent
Disallow: /

User-agent: Cotoyogi
Disallow: /

User-agent: Webzio-extended
Disallow: /

User-agent: Kangaroo Bot
Disallow: /

User-Agent: GenAI
Disallow: /

User-Agent: SemrushBot-OCOB
User-Agent: SemrushBot-FT
Disallow: /

User-Agent: VelenPublicWebCrawler
Disallow: /

Sitemap: https://bartavi.nl/sitemap.xml