If a crawler from MyJay just visited your site, this page explains what it is and how to control it.
It identifies itself as MyJaySearch/1.0 in its User-Agent
string, with a link back to this exact page, and sends a custom
X-Crawler-Info header pointing here too, so you don't
have to guess what hit your server.
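To pick these visits out of your own server logs or request handlers, a check on the User-Agent token is usually enough. The sketch below is only an illustration: the function name and the sample header values (including the example.com URL) are hypothetical, and the only detail taken from this page is the MyJaySearch/1.0 token.

```python
from typing import Mapping


def is_myjay_crawler(headers: Mapping[str, str]) -> bool:
    """Return True if the request headers carry the MyJaySearch User-Agent token."""
    # HTTP header names are case-insensitive, so normalize them before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "")
    # Match on the product token only, so a later version (e.g. /1.1) still matches.
    return "myjaysearch/" in user_agent.lower()


# Hypothetical sample requests; the exact header layout is an assumption.
crawler_request = {
    "User-Agent": "MyJaySearch/1.0 (+https://example.com/crawler-info)",
    "X-Crawler-Info": "https://example.com/crawler-info",
}
browser_request = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}

print(is_myjay_crawler(crawler_request))
print(is_myjay_crawler(browser_request))
```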
Stop it from visiting your site
Three ways, in order of how fast they take effect:
- Add a Disallow rule to your own robots.txt (for everyone, or specifically User-agent: MyJaySearch). The rules are checked and respected before every single page fetch, and the file is cached for at most 24 hours, so a change takes effect within a day at most.
- Use the removal request link on any of your pages in search results, or submit one directly with just your URL and an optional reason, no account needed. Requests are typically reviewed within 24 hours; an approved request blocks the domain and deletes it from the index.
- If your site is on MyJay itself: turn off "Allow this site to be indexed by MyJay Search" in your dashboard's Settings tab. Takes effect on the next crawl.
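If you choose the robots.txt route, you can check a rule locally before publishing it. The following is a minimal sketch using Python's standard urllib.robotparser; it shows how a standards-following parser reads the rule, not how MyJay's crawler is implemented. The helper name and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks only MyJaySearch and leaves other crawlers alone.
ROBOTS_TXT = """\
User-agent: MyJaySearch
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "MyJaySearch/1.0", "https://example.com/about.html"))
print(is_allowed(ROBOTS_TXT, "SomeOtherBot/2.0", "https://example.com/about.html"))
```

To block every crawler rather than just this one, the same check works with User-agent: * in place of User-agent: MyJaySearch.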
What gets indexed, and how
Three platforms, three different discovery methods:
- MyJay: every published site is indexed by default (see the opt-out above), discovered directly from MyJay's own records, no crawling needed for discovery.
- Neocities: discovered from Neocities' own public /browse listing (sorted by newest and most recently updated), the same page any search engine would use, not from their API (their API's terms explicitly disallow using it to enumerate all sites, so this crawler doesn't).
- Nekoweb: Nekoweb has no bulk site listing at all. Sites arrive either through a manual submission (anyone can submit a URL) or by being linked to from a page on another site that's already indexed.
For a discovered site, the crawl follows links up to two clicks deep
from its homepage, extracting the page title, meta description, and
visible body text (scripts, navs, and footers are stripped first).
A handful of simple heuristics (presence of <canvas>,
image-heavy pages, frequent <article> tags, a few
keyword checks) infer rough tags like blog, art, or portfolio, shown
as filters in search.
Nothing more than that is stored: no personal data is pulled out of a page, and raw page content isn't kept beyond what's needed to show a title, excerpt, and tags in results.
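The "two clicks deep" limit can be pictured as a breadth-first walk that stops expanding pages once they are two links away from the homepage. The sketch below runs on a small in-memory link map instead of real HTTP fetches; the function name, the link map, and the paths are made up for illustration and are not taken from the crawler's code.

```python
from collections import deque
from typing import Dict, List


def pages_within_depth(links: Dict[str, List[str]], homepage: str,
                       max_depth: int = 2) -> Dict[str, int]:
    """Return each reachable page mapped to its click distance from homepage."""
    depth_of = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        depth = depth_of[page]
        if depth == max_depth:
            # Pages at the depth limit are kept, but their links are not followed.
            continue
        for target in links.get(page, []):
            if target not in depth_of:
                depth_of[target] = depth + 1
                queue.append(target)
    return depth_of


# Hypothetical site: "/" links to two pages, and one chain runs three clicks deep.
site_links = {
    "/": ["/blog", "/art"],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": ["/blog/post-1/comments"],
}
print(pages_within_depth(site_links, "/"))
```

In this example /blog/post-1/comments sits three clicks from the homepage, so it is never reached.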
robots.txt and rate limits
robots.txt is fetched and checked before a single page
on a domain is touched, including its Crawl-delay
directive if it sets one. The crawler never goes faster than one
request per second to any single domain, and slows down further if
a site's robots.txt asks for it. A page-level
<meta name="robots" content="noindex"> tag or
X-Robots-Tag header is honored too, even when
robots.txt itself allows the path.
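The pacing rule amounts to taking the larger of the one-second floor and whatever Crawl-delay the site asks for. Here is a rough sketch of that calculation using Python's standard urllib.robotparser; the constant and function names are hypothetical, and the real crawler may parse the directive differently (this parser, for instance, only accepts whole-number delays).

```python
from urllib.robotparser import RobotFileParser

# One request per second is the floor stated above; the name is illustrative.
MIN_DELAY_SECONDS = 1.0


def effective_delay(robots_txt: str, user_agent: str = "MyJaySearch") -> float:
    """Return the number of seconds to wait between requests to one domain."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    requested = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    if requested is None:
        return MIN_DELAY_SECONDS
    # A site can ask the crawler to slow down, but never to go faster than the floor.
    return max(MIN_DELAY_SECONDS, float(requested))


print(effective_delay("User-agent: *\nCrawl-delay: 5\n"))
print(effective_delay("User-agent: *\nDisallow: /tmp/\n"))
```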
If robots.txt can't be fetched at all (a timeout, a
server error, anything other than a clean 404), the crawler skips
that domain for the run entirely rather than guessing it's safe to
proceed. A 404 is treated as "no restrictions," which is the
conventional behavior for crawlers.
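Put as a decision table, the outcome of the robots.txt fetch leads to one of three actions. The sketch below is an interpretation of the paragraph above, not the crawler's code: the function name and the returned labels are invented, and treating only a 200 response as a usable robots.txt is an assumption.

```python
from typing import Optional


def robots_fetch_decision(status: Optional[int]) -> str:
    """Map the robots.txt fetch result to a crawl decision.

    status is the HTTP status code, or None for a timeout or network failure.
    """
    if status == 200:
        # Assumption: only a plain 200 yields rules to parse and obey.
        return "crawl_with_rules"
    if status == 404:
        # A clean 404 means the site publishes no restrictions.
        return "crawl_unrestricted"
    # Timeouts, server errors, and anything else: skip the domain for this run.
    return "skip_domain"


for outcome in (200, 404, 500, None):
    print(outcome, robots_fetch_decision(outcome))
```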
Re-crawl schedule
An incremental pass runs daily (recently changed MyJay sites, recently updated Neocities sites); a full pass runs weekly across everything currently indexed. Re-crawling refreshes a page's title, excerpt, and tags, and can pick up new pages linked from existing ones.