# SPDX-FileCopyrightText: © 2024 Seirdy
# SPDX-FileCopyrightText: © 2025 Bart Groeneveld
# SPDX-License-Identifier: CC-BY-SA-4.0

# My own additions:

User-agent: ia_archiver
Disallow: /

# The motivation
# Imagine someone walking down the street. That person is certainly in public
# space and can be seen by anyone. Now change the situation by adding a camera
# that films this public space and records every moment to another publicly
# accessible space. We all understand there is a difference between the
# situation before and after the camera was added, right? It is the same for
# websites: everyone can view this website, but I ask people not to put a
# 'camera' on it that records every moment to another publicly accessible space.

# Q: How about The Internet Archive?
# A: I consider archiving necessary only for actual public information. Think
# governments, news, court decisions, etc. Not personal websites.

# Q: Do you have something to hide?
# A: Yes. Not necessarily because my actions are questionable,
# but because your judgement may be.
# But also because I make errors. I should be able to fix them.
# I should not be haunted by my wrongdoings for the rest of my life.

# Q: How about Bitcoin and other blockchain projects?
# A: That's why I don't use them.

# Q: Then how do you use Git?
# A: git reset --soft HEAD~1; git push --force-with-lease

# Q: Are you aware that bots can ignore a robots.txt file?
# A: Yes. Just as a DNT header can be ignored, or a trespasser can ignore a
# "Do not pass" sign. It is mainly to signal intent; no one has an excuse now.

# The rest of this file is sourced from https://seirdy.one/robots.txt.
# I could not find the license of that file,
# but that site includes 'CC-BY-SA-4.0' in the footer of all pages.
# It is copied almost verbatim (except, of course, the sitemap)
# to make diffing and future updates easier.
# How to find updates:
# curl https://seirdy.one/robots.txt | diff - layouts/robots.txt
# Please see that file for an explanation of almost every entry,
# including intentionally-excluded entries.

User-agent: *
Disallow: /noindex/
Disallow: /misc/

User-Agent: peer39_crawler/1.0
Disallow: /

## IP-violation scanners ##

User-Agent: TurnitinBot
Disallow: /

User-Agent: AcademicBotRTU
Disallow: /

User-Agent: SlySearch
Disallow: /

User-Agent: BLEXBot
Disallow: /

User-agent: CheckMarkNetwork/1.0 (+https://www.checkmarknetwork.com/spider.html)
Disallow: /

User-agent: BrandVerity/1.0
Disallow: /

## Misc. icky stuff ##

User-agent: PiplBot
Disallow: /

# Well-known overly-aggressive bot that claims to respect robots.txt: http://mj12bot.com/
User-agent: MJ12bot
Crawl-Delay: 10

## Gen-AI data scrapers ##

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-Agent: FacebookBot
User-Agent: meta-externalagent
Disallow: /

User-agent: Cotoyogi
Disallow: /

User-agent: Webzio-extended
Disallow: /

User-agent: Kangaroo Bot
Disallow: /

User-Agent: GenAI
Disallow: /

User-Agent: SemrushBot-OCOB
User-Agent: SemrushBot-FT
Disallow: /

User-Agent: VelenPublicWebCrawler
Disallow: /

Sitemap: https://bartavi.nl/sitemap.xml