conesus 4 days ago

I run NewsBlur[0] and I’ve been battling this issue of NewsBlur fetching 403s across the web for months now. My users are revolting and asking for refunds. I’ve tried emailing dozens of site owners and publishers and only two of them have done the work of whitelisting their RSS feed. It’s maddening and is having a real negative effect on NewsBlur.

NewsBlur is an open-source RSS news reader (full source available at [1]), something we should all agree is necessary to support the open web! But Cloudflare blocking all of my feed fetchers is bizarre behavior. And we've been on the verified bots list for years, but it hasn't made a difference.

Let me know what I can do. NewsBlur publishes a list of IPs that it uses for feed fetching that I've shared with Cloudflare but it hasn't made a difference.

I'm hoping Cloudflare uses the IP address list that I publish and adds them to their allowlist so NewsBlur can keep fetching (and archiving) millions of feeds.

[0]: https://newsblur.com

[1]: https://github.com/samuelclay/NewsBlur

  • srik 4 days ago

    RSS is an essential component of modern web publishing and it feels scary to see how one company’s inconsideration might harm its already fragile future. One day Cloudflare will get big enough to be subject to antitrust regulation and this instance will be a strong data point working against them.

    • immibis 4 days ago

      It's not one company - it's an individual decision of every blog operator to block their own readers by signing up for cloudflare.

    • 01HNNWZ0MV43FF 4 days ago

      It's not essential, I don't know anyone in real life who uses it.

      I run an RSS feed on my blog out of principle and I don't bother reading other feeds I'm subscribed to

      When I'm bored I come here, I go on Mastodon, and gods save me, I go on Reddit

      • djhn 4 days ago

        Podcasts are based on RSS and a lot of people listen to podcasts.

  • AyyEye 4 days ago

    Three consenting parties trying to use their internet blocked by a single intermediary that's too big to care is just gross. It's the web we deserve.

    • eddythompson80 4 days ago

      > Three consenting parties

      Clearly they are not 100% consenting, or at best one of them (the content publisher) is misconfiguring/misunderstanding their setup. They enabled RSS on their service, then set up a rule to require human verification for accessing that RSS feed.

      It's like a business advertising a singles only area, then hiring a security company and telling them to only allow couples in the building.

      • AyyEye 4 days ago

        If Cloudflare were honest and upfront about the tradeoffs being made and the fact that it's still going to require configuration and maintenance work, they'd have significantly fewer customers.

  • p4bl0 4 days ago

    I've been a paying NewsBlur user since the downfall of Google Reader and I'm very happy with it. Thank you for NewsBlur!

  • renaissancec a day ago

    Can't recommend Newsblur enough. I have been a customer since Fastladder was shut down. I love their integration of being able to use pinboard.in within the web interface to bookmark articles. An essential part of my web productivity flow.

  • miohtama 4 days ago

    Thank you for the hard work.

    Newsblur was the first SaaS I could afford as a student. I have been a subscriber for something like 20 years now. And I will keep doing it to the grave. Best money ever spent.

    • p4bl0 4 days ago

      > I have been a subscriber for something like 20 years now.

      NewsBlur is "only" 15 years old (and GReader was there up until 11 years ago).

  • hedora 3 days ago

    Maybe pay for residential proxy network access?

    I used to get my internet from a small local ISP, and ip blacklisting basically meant no one in our zipcode could have reliable internet.

    These days, the 10-20% of us with an unobstructed sky view switched to starlink and didn’t look back.

    The thing is, both ISPs use CGNAT, but there’s no way cloudflare is going to block Musk like they do the mom and pop shop.

    Anyway, apparently residential proxy networks work pretty well if you hit a spurious ip block. I’ve had good luck with apple private relay too.

    I’m hoping service providers realize how useless and damaging ip blocking is to their reputations, but I’m not holding my breath. Sometimes I think the endgame is just routing 100% of residential traffic through 8.8.8.8.

  • wooque 4 days ago

    You can just bypass it with a library like cloudscraper/hrequests.

kevincox 5 days ago

I dislike the advice of whitelisting specific readers by user-agent. Not only is this endless manual work that will only solve the problem for a subset of users, but it is also easy for malicious actors to bypass. My recommendation would be to create a page rule that disables bot blocking for your feeds. This will fix the problem for all readers with no ongoing maintenance.

If you are worried about DoS attacks that may hammer on your feeds then you can use the same configuration rule to ignore the query string for cache keys (if your feed doesn't use query strings) and to override the caching settings if your server doesn't set the proper headers. This way Cloudflare will cache your feed and you can serve any number of visitors without putting load onto your origin.
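
For what it's worth, the origin side of that is tiny if you control it. A rough sketch of a feed endpoint that sends its own cache headers (assuming Flask and a /feed.xml route, both placeholders):

  # Sketch only: a feed endpoint with explicit caching headers so an edge
  # cache (Cloudflare or otherwise) can serve repeat fetches itself.
  from flask import Flask, Response

  app = Flask(__name__)

  FEED = '<rss version="2.0"><channel><title>Example</title></channel></rss>'

  @app.route("/feed.xml")
  def feed():
      resp = Response(FEED, mimetype="application/rss+xml")
      # Allow shared caches to keep the feed for 15 minutes.
      resp.headers["Cache-Control"] = "public, max-age=900"
      return resp

  if __name__ == "__main__":
      app.run()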

As for Cloudflare fixing the defaults, it seems unlikely to happen. It has been broken for years, Cloudflare's own blog is affected. They have been "actively working" on fixing it for at least 2 years according to their VP of product: https://news.ycombinator.com/item?id=33675847

  • benregenspan 4 days ago

    AI crawlers have changed the picture significantly and in my opinion are a much bigger threat to the open web than Cloudflare. The training arms race has drastically increased bot traffic, and the value proposition behind that bot traffic has inverted. Previously many site operators could rely on the average automated request being net-beneficial to the site and its users (outside of scattered, time-limited DDoS attacks) but now most of these requests represent value extraction. Combine this with a seemingly related increase in high-volume bots that don't respect robots.txt and don't set a useful User-Agent, and using a heavy-handed firewall becomes a much easier business decision, even if it may target some desirable traffic (like valid RSS requests).

  • vaylian 5 days ago

    I don't know if cloudflare offers it, but whitelisting the URL of the RSS feed would be much more effective than filtering user agents.

    • derkades 5 days ago

      Yes it supports it, and I think that's what the parent comment was all about

      • BiteCode_dev 5 days ago

        Specifically, whitelisting the URL for the bot protection, but not the cache, so that you are still somewhat protected against adversarial use.

        • londons_explore 4 days ago

          An adversary can easily send no-cache headers to bust the cache.

          • acdha 4 days ago

            The CDN can choose whether to honor those. That hasn’t been an effective adversarial technique since the turn of the century.

            • londons_explore 4 days ago

              does cloudflare give such an option? Even for non-paid accounts?

              • acdha 3 days ago

                They ignore request cache control headers, I believe unconditionally so you’d have to disable caching for the endpoints which clients are allowed to request uncached.

    • jks 4 days ago

      Yes, you can do it with a "page rule", which the parent comment mentioned. The Cloudflare free tier has a budget of three page rules, which might mean that you have to bundle all your RSS feeds in one folder so they share a path prefix.

  • a-french-anon 4 days ago

    And for those of us using sfeed, the default UA is Curl's.

wenbin 4 days ago

At Listen Notes, we rely heavily on Cloudflare to manage and protect our services, which cater to both human users and scripts/bots.

One particularly effective strategy we've implemented is using separate subdomains for services designed for different types of traffic, allowing us to apply customized firewall and page rules to each subdomain.

For example:

- www. listennotes.com is dedicated to human users. E.g., https://www.listennotes.com/podcast-realtime/

- feeds. listennotes.com is tailored for bots, providing access to RSS feeds. E.g., https://feeds.listennotes.com/listen/wenbin-fangs-podcast-pl...

- audio. listennotes.com serves both humans and bots, handling audio URL proxies. E.g., https://audio.listennotes.com/e/p/1a0b2d081cae4d6d9889c49651...

This subdomain-based approach enables us to fine-tune security and performance settings for each type of traffic, ensuring optimal service delivery.

  • kevindamm 4 days ago

    Where do you put your sitemap (or its equivalent)? Looking at the site, I don't notice one in the metadata but I do see a "site index" on the www subdomain, though possibly that's intended for humans not bots? I think the usual recommendation is to have a sitemap per subdomain and not mix them, but clearly they're meant for bots not humans...

    • wenbin 4 days ago

      Great question.

      We only need to provide the sitemap (with custom paths, not publicly available) in a few specific places, like Google Search Console. This means the rules for managing sitemaps are quite manageable. It’s not a perfect setup, but once we configure it, we can usually leave it untouched for a long time.

amatecha 5 days ago

I get blocked from websites with some regularity, running Firefox with strict privacy settings, "resist fingerprinting" etc. on OpenBSD. They just give a 403 Forbidden with no explanation, but it's only ever on sites fronted by CloudFlare. Good times. Seems legit.

  • wakeupcall 4 days ago

    Also running FF with strict privacy settings and several blockers. The annoyances are constantly increasing. Cloudflare, captchas, "we think you're a bot", constantly recurring cookie popups and absurd requirements are making me hate most of the websites and services I hit nowadays.

    I tried for a long time to get around it, but now when I hit a website like this I just close the tab and don't bother anymore.

    • afh1 4 days ago

      Same, but for VPN (either corporate or personal). Reddit blocks it completely, requires you to sign in but even the sign-in page is "network restricted"; LinkedIn shows you a captcha but gives an error when submitting the result (several reports online); and overall a lot of 403's. All go magically away when turning off the VPN. Companies, especially adtechs like Reddit and LinkedIn, do NOT want you to browse privately, to the point they'd rather you not use their website at all unless it's without a condom.

      • acdha 4 days ago

        > Companies, especially adtechs like Reddit and LinkedIn, do NOT want you to browse privately, to the point they'd rather you not use their website at all unless it's without a condom.

        That’s true in some cases, I’m sure, but also remember that most site owners deal with lots of tedious abuse. For example, some people get really annoyed about Tor being blocked, but for most sites Tor is a tiny fraction of total traffic and a fairly large percentage of the abuse probing for vulnerabilities, guessing passwords, spamming contact forms, etc. So while I sympathize with the legitimate users, I also completely understand why a busy site operator is going to flip a switch making their log noise go down by a double-digit percentage.

        • rolph 4 days ago

          Funny thing: when FF is blocked I can get through with Tor.

          • mmooss 4 days ago

            With what browser? The same one that's blocked?

      • Adachi91 4 days ago

        > Reddit blocks it completely, requires you to sign-in but even the sign-in page is "network restricted";

        I've been creating accounts every time I need to visit Reddit now to read a thread about [insert subject]. They do not validate E-Mail, so I just use `example@example.com`, whatever random username it suggests, and `example` as a password. I've created at least a thousand accounts at this point.

        Malicious Compliance, until they disable this last effort at accessing their content.

        • zargon 4 days ago

          They verify signup emails now. At least for me.

        • hombre_fatal 4 days ago

          Most subreddits worth posting on usually have a minimum account age + minimum account karma. I've found it annoying to register new accounts too often.

        • immibis 4 days ago

          I've created a few thousand accounts through a VPN (random node per account). After doing that, I found out Reddit accounts created through VPNs are automatically shadow banned the second time they comment (I think the first is also shadow deleted in some way). But they allow you to browse from a shadow banned account just fine.

      • anthk 4 days ago

        For Reddit I just use it r/o under gopher://gopherddit.com

        A good client is either Lagrange (multiplatform), the old Lynx, or Dillo with the Gopher plugin.

      • appendix-rock 4 days ago

        I don’t follow the logic here. There seems to be an implication of ulterior motive but I’m not seeing what it is. What aspect of ‘privacy’ offered by a VPN do you think that Reddit / LinkedIn are incentivised to bypass? From a privacy POV, your VPN is doing nothing to them, because your IP address means very little to them from a tracking POV. This is just FUD perpetuated by VPN advertising.

        However, the undeniable reality is that accessing the website with a non-residential IP is a very, very strong indicator of sinister behaviour. Anyone that’s been in a position to operate one of these services will tell you that. For every…let’s call them ‘privacy-conscious’ user, there are 10 (or more) nefarious actors that present largely the same way. It’s easy to forget this as a user.

        I’m all but certain that if Reddit or LinkedIn could differentiate, they would. But they can’t. That’s kinda the whole point.

        • bo1024 4 days ago

          Not following what could be sinister about a GET request to a public website.

          > From a privacy POV, your VPN is doing nothing to them, because your IP address means very little to them from a tracking POV.

          I disagree. (1) Since I have javascript disabled, IP address is generally their next best thing to go on. (2) I don't want to give them IP address to correlate with the other data they have on me, because if they sell that data, now someone else who only has my IP address suddenly can get a bunch of other stuff with it too.

          • hombre_fatal 4 days ago

            At the very least, they're wasting bandwidth to a (likely) low quality connection.

            But anyone making malicious POST requests, like spamming chatGPT comments, first makes GET requests to load the submission and find comments to reply to. If they think you're a low quality user, I don't see why they'd bother just locking down POSTs.

          • zahllos 4 days ago

            SQL injection?

            GET parameters can be abused like any parameter. This could be SQL, could be directory traversal attempts, brute force username attempts, you name it.

            • kam 4 days ago

              If your site is vulnerable to SQL injection, you need to fix that, not pretend Cloudflare will save you.

              • zahllos 4 days ago

                Obviously. But I was responding to "what is sinister about a GET request". To put it a slightly different way, it does not matter so much whether the request is a read or a write. For example DNS amplification attacks work by asking a DNS server (read) for a much larger record than the request packet requires, and faking the request IP to match the victim. That's not even a connection the victim initiated, but that packet still travels along the network path. In fact, if it crashes a switch or something along the way, that's just as good from the point of view of the attacker, maybe even better as it will have more impact.

                I am absolutely not a fan of all these "are you human?" checks at all, doubly so when ad-blockers trigger them. I think there are very legitimate reasons for wanting to access certain sites without being tracked - anything related to health is an example.

                Maybe I should have made a more substantive comment, but I don't believe this is as simple a problem as reducing it to request types.

        • homebrewer 4 days ago

          It's equally easy to forget about users from countries with way less freedom of speech and information sharing than in Western rich societies. These anti-abuse measures have made it much more difficult to access information blocked by my internet provider during the last few years. I'm relatively competent and can find ways around it, but my friends and relatives who pursue other career choices simply don't bother anymore.

          Telegram channels have been a good alternative, but even that is going downhill thanks to French authorities.

          Cloudflare and Google also often treat us like bots (endless captchas, etc) which makes it even more difficult.

        • afh1 4 days ago

          IP address is a fingerprint to be shared with third parties, of course it's relevant. It's not an ulterior motive, it's explicit: it's not caring about your traffic because you're not a good product. They can and do differentiate by requiring a sign-in. They just don't care enough to make it actually work. Because they are adtechs and not interested in you as a user.

        • miki123211 4 days ago

          > For every…let’s call them ‘privacy-conscious’ user, there are 10 (or more) nefarious actors that present largely the same way.

          And each one of these could potentially create thousands of accounts, and do 100x as many requests as a normal user would.

          Even if only 1% of the people using your service are fraudsters, a normal user has at most a few accounts, while fraudsters may try to create thousands per day. This means that e.g. 90% of your signups are fraudulent, despite the population of fraudsters being extremely small.

        • ruszki 4 days ago

          Has anybody actually been stopped from doing nefarious things by these annoyances?

          It's like at my current and previous companies. They make a lot of security restrictions. The problem is, if somebody wants to get data out (or in), they can do it anytime. The security department says it's against "accidental" leaks. I'm still waiting for a single instance where they caught an "accidental" leak; they just introduce extra steps, and in the end I achieve the exact same thing. Even when I caused a real potential leak, nobody stopped me from doing it. The only reason they have these security services/apps is to push responsibility to other companies.

    • anilakar 4 days ago

      Heck, I cannot even pass ReCAPTCHA nowadays. No amount of clicking buses, bicycles, motorcycles, traffic lights, stairs, crosswalks, bridges and fire hydrants will suffice. The audio transcript feature is the only way to get past a prompt.

      • josteink 4 days ago

        Just a heads up that this is how Google treats connections it suspects of originating from bots: silently keeping you in an endless loop, promising a reward if you complete it correctly.

        I discovered this when I set up IPv6 connectivity using Hurricane Electric as a tunnel broker.

        Seemingly Google has all HE.net IPv6 tunnel subnets listed for such behaviour without it being documented anywhere. It was extremely annoying until I figured out what was going on.

        • n4r9 4 days ago

          > Silently keeping you in an endless loop promising reward if you can complete it correctly.

          Sounds suspiciously like how product managers talk to developers as well.

        • anilakar 4 days ago

          Sadly my biggest crime is running Firefox with default privacy settings and uBlock Origin installed. No VPNs or IPv6 tunnels, no Tor traffic whatsoever, no Google search history poisoning plugins.

          If only there was a law that allowed one to be excluded from automatic behavior profiling...

      • marssaxman 4 days ago

        There's a pho restaurant near where I work which wants you to scan a QR code at the table, then order and pay through their website instead of talking to a person. In three visits, I have not once managed to get past their captcha!

        (The actual process at this restaurant is to sit down, fuss with your phone a bit, then get up like you're about to leave; someone will arrive promptly to take your order.)

        • eddythompson80 4 days ago

          I’ve only seen that at Asian restaurants near a university in my city. When I asked I was told that this is a common way in China and they get a lot of international students who prefer/expect it that way.

    • Terr_ 3 days ago

      The worst part is that a lot of it is mysteriously capricious with no recourse.

      Like, you visit Site A too often while blocking some javascript, and now Site B doesn't work for no apparent reason, and there's no resolution path. Worse, the bad information may become permanent if an owner uses it to taint your account, again with no clear reason or appeal.

      I suspect Reddit effectively killed my 10+ year account (appeal granted, but somehow still shadowbanned) because I once used the "wrong" public wifi to access it.

    • lioeters 4 days ago

      Same here. I occasionally encounter websites that won't work with ad blockers, sometimes with Cloudflare involved, and I don't even bother with those sites anymore. Same with sites that display a cookie "consent" form without an option to not accept. I reject the entire site.

      Site owners probably don't even see these bounced visits, and it's such a tiny percentage of visitors who do this that it won't make a difference. Meh, it's just another annoyance to be able to use the web on our own terms.

      • capitainenemo 4 days ago

        It's a tiny percentage of visitors, but a tech savvy one, and depending on your website, they could be a higher than average percentage of useful users or product purchasers. The impact could be disproportionate. What's frustrating is many websites don't even realise it is happening because the reporting from the intermediary (Cloudflare, say) is inaccurate or incorrectly represents how it works. Fingerprinting has become integral to bot "protection". It's also frustrating when people think this can be a drop-in, and put it in front of APIs that are completely incapable of handling the challenge with no special casing (encountered on FedEx, GoFundMe), much like the RSS reader problem.

    • orbisvicis 4 days ago

      I have to solve captchas for Amazon while logged into my Amazon account.

      • m463 4 days ago

        at one point I couldn't access amazon at night.

        I would get a different captcha, a convoluted one that wouldn't even load the required images.

        And I would get the oops sorry dog page for everything.

        I finally contacted amazon, gave them my (static) ip address and it was good.

        In other locations, I have to solve a 6-distorted-letter captcha to log in, but that's the extent of it.

      • tenken 4 days ago

        Why?! ... I've had 404 pages on Amazon, but never a captcha...

    • doctor_radium 4 days ago

      Hey, same here! For better or worse, I use Opera Mini for much of my mobile browsing, and it fares far worse than Firefox with uBlock Origin and ResistFingerprinting. I complained about this roughly a year ago on a similar HN thread, on which a Cloudflare rep also participated. Since then something changed, but both sides being black boxes, I can't tell if Cloudflare is wising up or Mini has stepped up. I still get the same challenge pages, but Mini gets through them automatically now, more often than not.

      But not always. My most recent stumbling block is https://www.napaonline.com. Guess I'm buying oxygen sensors somewhere else.

    • SoftTalker 4 days ago

      Same. If a site doesn't want me there, fine. There's no website that's so crucial to my life that I will go through those kinds of contortions to access it.

    • JohnFen 4 days ago

      > when I hit a website like this just close the tab and don't bother anymore.

      Yeah, that's my solution as well. I take those annoyances as the website telling me that they don't want me there, so I grant them their wish.

      • immibis 4 days ago

        That's fine. You were an obstacle to their revenue gathering anyway.

    • amanda99 4 days ago

      Yes and the most infuriating thing is the "we need to verify the security of your connection" text.

  • BiteCode_dev 5 days ago

    Cloudflare is a fantastic service with an unmatched value proposition, but it's unfortunately slowly killing web privacy, with 1000s of paper cuts.

    Another problem is that "resist fingerprinting" prevents some canvas processing, and many websites like Bluesky, LinkedIn or Substack use canvas to handle image upload, so your images appear as stripes of pixels.

    Then you have mobile apps that just don't run if you don't have a google account, like chatgpt's native app.

    I understand why people give up, trying to fight for your privacy is an uphill battle with no end in sight.

    • madeofpalk 4 days ago

      > Then you have mobile apps that just don't run if you don't have a google account, like chatgpt's native app.

      Is that true? At least on iOS you can log into the ChatGPT app with the same email/password as the website.

      I never use Google login for stuff and ChatGPT works fine for me.

    • pjc50 4 days ago

      The privacy battle has to be at the legal layer. GDPR is far from perfect (bureaucratic and unclear with weak enforcement), but it's a step in the right direction.

      In an adversarial environment, especially with both AI scrapers and AI posters, websites have to be able to identify and ban persistent abusers. Which unfortunately implies having some kind of identification of everybody.

      • nonameiguess 4 days ago

        No, it's more than that. Cloudflare's bot protection has blocked me from sites where I have a paid account, paid for by my real checking account with my real name attached. Even when I am perfectly willing to give out my identity and be tracked, I still can't because I can't even get to the login page.

        • HappMacDonald 4 days ago

          They block such visits because their heuristics suspect that your visit is from a real human's account that was hacked by a bot.

      • wbl 4 days ago

        You notice that Analog Devices puts their (incredibly useful) information up for free. That's because they make money other ways. The ad-supported content farm Internet had a nice run but we will get on without it.

      • BiteCode_dev 4 days ago

        That's another problem: we want cheap easy solutions like tracking people, instead of more targeted or systemic ones.

      • Gormo 4 days ago

        > The privacy battle has to be at the legal layer.

        I couldn't disagree more. The way to protect privacy is to make privacy the standard at the implementation layer, and to make it costly and difficult to breach it.

        Trying to rely on political institutions without the practical and technical incentives favoring privacy will inevitably result in the political institutions themselves becoming the main instrument that erodes privacy.

        • HappMacDonald 4 days ago

          Yet without regulation nothing stops large companies from simply changing the implementation layer for one that pads their bottom line better, or just rebuild it from scratch.

          If people who valued privacy really controlled the implementation layer we wouldn't have gotten to this point in the first place.

          • Gormo 4 days ago

            The point we're at is one in which privacy is still attainable via implementation-layer measures, even if it requires investing some effort and making some trade-offs to sustain. The alternative -- placing trust in regulation, which never works in the long run -- will inevitably result in regulatory capture that eliminates those remaining practical measures and replaces them with, at best, a performative illusion.

    • KomoD 4 days ago

      > Then you have mobile apps that just don't run if you don't have a google account, like chatgpt's native app.

      That's not true, I use ChatGPT's app on my phone without logging into a Google account.

      You don't even need any kind of account at all to use it.

      • BiteCode_dev 4 days ago

        On Android at least, even if you don't need to log in to your Google account when connecting to ChatGPT, the app won't work if your phone isn't signed in to Google Play, which doesn't work if your phone isn't linked to a Google account.

        An Android phone asks you to link a Google account when you use it for the first time. It takes a very dedicated user to refuse that, then to avoid logging in to the Gmail, YouTube or app store apps, which will all also link your phone to your Google account when you sign in.

        But I do actively avoid this, I use Aurora, F-droid, K9 and NewPipeX, so no link to google.

        But then no ChatGPT app. When I start it, I get hit with a login page for the app store and it's game over.

        • __MatrixMan__ 4 days ago

          I have a similar experience with the pager duty app. It loads up and then exits with "security problem detected by app" because I've made it more secure by isolating it from Google (a competitor). Workaround is to just control it via slack instead.

          • BiteCode_dev 4 days ago

            Well, you can use the web-based ChatGPT, so there is a workaround. Except it's a worse experience.

        • ForHackernews 4 days ago
          • BiteCode_dev 4 days ago

            That won't make ChatGPT's app work though.

            • ForHackernews 4 days ago

              It might well do, depending on what ChatGPT's app is asking the OS for. /e/OS is an Android fork that removes Google services and replaces them with open source stubs/re-implementations from https://microg.org/

              I haven't tried the ChatGPT app, but I know that, for example my bank and other financial services apps work with on-device fingerprint authentication and no Google account on /e/OS.

        • acdha 4 days ago

          So the requirement is to pass the phone’s system validation process rather than having a Google account. I don’t love that but I can understand why they don’t want to pay the bill for the otherwise ubiquitous bots, and it’s why it’s an Android-specific issue.

          • BiteCode_dev 4 days ago

            You can make a very rational case for each privacy invasive technical decision ever made.

            In the end, the fact remains: no ChatGPT app without giving up your privacy, to Google no less.

            • acdha 4 days ago

              “Giving up your privacy” is a pretty sweeping claim – it sounds like you’re saying that Android inherently leaks private data to Google, which is broader than even Apple fans tend to say.

              • michaelt 4 days ago

                A person who was maximally distrustful of Google would assume they link your phone and your IP through the connection used to receive push notifications, and the wifi-network-visibility-to-location API, and the software update checker, and the DNS over HTTPS, and suchlike. As a US company, they could even be forced to do this in secret against their will, and lie about it.

                Of course as Google doesn't claim they do this, many people would consider it unreasonably fearful/cynical.

                • acdha 4 days ago

                  Sure, but that says you shouldn’t have a phone, not that ChatGPT is forcing you to give up your privacy.

              • ForHackernews 4 days ago

                > it sounds like you’re saying that Android inherently leaks private data to Google, which is broader than even Apple fans tend to say.

                Yes? I mean, not "leaks" - it's designed to upload your private data to Google and others.

                https://www.tcd.ie/news_events/articles/study-reveals-scale-...

                > Even when minimally configured and the handset is idle, with the notable exception of e/OS, these vendor-customised Android variants transmit substantial amounts of information to the OS developer and to third parties such as Google, Microsoft, LinkedIn, and Facebook that have pre-installed system apps. There is no opt-out from this data collection.

              • BiteCode_dev 4 days ago

                Google and Apple were both part of the PRISM program, of course I'm making this claim.

                That's the opposite stance that would be bonkers.

                • acdha 4 days ago

                  PRISM covered communications through U.S. company’s servers. It was not a magic back door giving them access to your device’s local data, and even if you did believe that it was the answer would be not using a phone. A major intelligence agency does not need you to have a Google account so they can spy on you.

                  • BiteCode_dev 4 days ago

                    Forest for the trees.

                    Google and Apple are both heavily invested in ads (Apple made 4.7 billion from ads in 2022), they have a track record of exfiltrating your data (remember contractors listening to your Siri recordings?), of lying to customers (remember the home button scandal on iPhone?), and they have control over a device that holds your whole life yet runs partially on code you can't evaluate.

                    Trusting those people makes no sense at all. You have a business relationship with them, that's it.

                    • acdha 3 days ago

                      It’s interesting how each time you say something which isn’t accurate you try to distract by changing the topic.

  • neilv 4 days ago

    Similar here. It's not unusual to be blocked from a site by CloudFlare when I'm running Firefox (either ESR or current release) on Linux.

    I suspect that people operating Web sites have no idea how many legitimate users are blocked by CloudFlare.

    And, based on the responses I got when I contacted two of the companies whose sites were chronically blocked by CloudFlare for months, it seemed like it wasn't worth any employee's time to try to diagnose.

    Also, I'm frequently blocked by CloudFlare when running Tor Browser. Blocking by Tor exit node IP address (if that's what's happening) is much more understandable than blocking Firefox from a residential IP address, but still makes CloudFlare not a friend of people who want or need to use Tor.

    • jorams 4 days ago

      > I suspect that people operating Web sites have no idea how many legitimate users are blocked by CloudFlare.

      I sometimes wonder if all Cloudflare employees are on some kind of whitelist that makes them not realize the ridiculous false positive rate of their bot detection.

    • pjc50 4 days ago

      > CloudFlare not a friend of people who want or need to use Tor

      The adversarial aspect of all this is a problem: P(malicious|Tor) is much higher than P(malicious|!Tor)

    • johnklos 4 days ago

      I've had several discussions that were literally along the lines of, "we don't see what you're talking about in our logs". Yes, you don't - traffic is blocked before it gets to your servers!

    • lovethevoid 4 days ago

      What are some examples? I've been running ff on linux for quite some time now and am rarely blocked. I just run it with ublock origin.

      • capitainenemo 4 days ago

        Odds are they have Resist Fingerprinting turned on. When I use it in a Firefox profile I encounter this all over the place. Drupal, FedEx.. some sites handle it better than others. Some it's a hard block with a single terse error. Some it is a challenge which gets blocked due to using remote javascript. Some it's a local challenge you can get past. But it has definitely been getting worse. Fingerprinting is being normalised, and the excuse of "bot protection" (bots can make unique fingerprints too, though) means that it can now be used maliciously (or by ad networks like google, same diff) as a standard feature.

        • lovethevoid 4 days ago

          I also use Mullvad Browser (a browser based on Firefox), and it supports resisting fingerprinting without any of those blocks. Tried it on Drupal and Fedex. Loads Cloudflare sites normally.

          I'm guessing if it's really Resist Fingerprinting on Firefox (something Mullvad also has on by default), then there are other settings that aren't being enabled causing the issue. Mullvad actually lists the settings related to resisting fingerprinting here - https://mullvad.net/en/browser/hard-facts

          • capitainenemo 4 days ago

            Or it could simply be that since it is on by default for Mullvad, Cloudflare and others have an explicit exception built in for it. It might also depend on where traffic is coming from; I have had different behaviour with different ISPs. Perhaps your entire VPN network gets a pass, depending on how they manage abuse, or on how much unique information they can get just from the few bits of info the browser leaks combined with the uniqueness of the browser and VPN connection IPs.

    • amatecha 4 days ago

      Yeah, I've contacted numerous owners of personal/small sites and they are usually surprised, and never have any idea why I was blocked (not sure if it's an aspect of CF not revealing the reason, or the owner not knowing how to find that information). One or two allowlisted my IP but that doesn't strike me as a solution.

      I've contacted companies about this and they usually just tell me to use a different browser or computer, which is like "duh, really?" , but also doesn't solve the problem for me or anyone else.

  • mzajc 4 days ago

    I randomize my User-Agent header and many websites outright block me, most often with no captcha and no useful error message.

    The most egregious is Microsoft (just about every Microsoft service/page, really), where all you get is a "The request is blocked." and a few pointless identifiers listed at the bottom, purely because it thinks your browser is too old.

    CF's captcha page isn't any better either, usually putting me in an endless loop if it doesn't like my User-Agent.

    • pushcx 4 days ago

      Rails is going to make this much worse for you. All new apps include naive agent sniffing and block anything “old” https://github.com/rails/rails/pull/50505

      • mzajc 4 days ago

        This is horrifying. What happened to simply displaying a "Your browser is outdated, consider upgrading" banner on the website?

        • shbooms 4 days ago

          idk, even that seems too much to me, but maybe I'm just being too sensitive.

          but like, why is it a website's job to tell me what browser version to use? unless my outdated browser is lacking legitimate functionality which is required by your website, just serve the page and be done with it.

          • michaelt 4 days ago

            Back when the sun was setting on IE6, sites deployed banners that basically meant "We don't test on this, there's a good chance it's broken, but we don't know the specifics because we don't test with it"

        • freedomben 4 days ago

          Wow. And this is now happening right as I've blacklisted google-chrome due to manifest v3 removal :facepalm:

      • GoblinSlayer 4 days ago

          def blocked?
            user_agent_version_reported? && unsupported_browser?
          end
        
        well, you know what to do here :)
    • charrondev 4 days ago

      Are you sending an actual random string as your UA or sending one of a set of actual user agents?

      You’re best off just picking real ones. We got hit by a botnet sending 10k+ requests from 40 different ASNs with 1000s of different IPs. The only way we were able to identify/block the traffic was by excluding user agents matching some regex (for whatever reason they weren’t spoofing real user agents but weren’t sending actual ones either).
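
      Roughly the shape of it, with a made-up pattern and made-up example UAs (the real regex was specific to that botnet):

        # Sketch: flag UAs that are non-empty but contain no recognizable
        # product/version token. Pattern and example strings are made up.
        import re

        LOOKS_REAL = re.compile(r"(Mozilla|Chrome|Safari|Firefox|Edg|curl)/\d")

        def suspicious(user_agent: str) -> bool:
            ua = (user_agent or "").strip()
            return bool(ua) and not LOOKS_REAL.search(ua)

        assert suspicious("randomclient-x9f3")
        assert not suspicious("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0")
        assert not suspicious("")  # empty UAs were handled separately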

      • RALaBarge 4 days ago

        I worked at an anti-spam email security company in the aughts, and we had a perl engine that would rip apart the MIME boundaries and measure everything - UA, SMTP client fingerprint headers, even the number of anchor or paragraph tags. A large combination of IF/OR evaluations with a regex engine did a pretty good job, since the botnets usually don't bother to fully randomize or really opsec the payloads they are sending - it's a cannon instead of a flyswatter.

        • kccqzy 4 days ago

          Similar techniques are known in the HTTP world too. There were things like detecting the order of HTTP request headers and matching them to known software, or even just comparing the actual content of the Accept header.
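
          A toy version of the header-order idea (the reference profile below is just an assumption, not real fingerprint data):

            # Toy check: do the headers the client sent appear in the same
            # relative order as a profile recorded from a real browser?
            FIREFOX_PROFILE = ["host", "user-agent", "accept",
                               "accept-language", "accept-encoding"]

            def matches_profile(header_names, profile=FIREFOX_PROFILE):
                sent = [h.lower() for h in header_names if h.lower() in profile]
                expected = [h for h in profile if h in sent]
                return sent == expected

            print(matches_profile(["Host", "User-Agent", "Accept"]))  # True
            print(matches_profile(["Accept", "Host", "User-Agent"]))  # False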

          • miki123211 4 days ago

            And then there's also TLS fingerprinting.

            Different browsers use TLS in slightly different ways, send data in a slightly different order, have a different set of supported extensions / algorithms etc.

            If your user agent says Safari 18, but your TLS fingerprint looks like Curl and not Safari, sophisticated services will immediately detect that something isn't right.

    • lovethevoid 4 days ago

      Not sure a random UA extension is giving you much privacy. Try your results on coveryourtracks.eff.org and see. A random UA would provide a lot of identifying information despite being randomized.

      From experience, a lot of the things people do in hopes of protecting their privacy only make them far easier to profile.

      • mzajc 4 days ago

        coveryourtracks.eff.org is a great service, but it has a few limitations that apply here:

        - The website judges your fingerprint based on how unique it is, but assumes that it's otherwise persistent. Randomizing my User-Agent serves the exact opposite - a given User-Agent might be more unique than using the default, but I randomize it to throw trackers off.

        - To my knowledge, its "One in x browsers" metric (and by extension the "Bits of identifying information" and the final result) are based off of visitor statistics, which would likely be skewed as most of its visitors are privacy-conscious. They only say they have a "database of many other Internet users' configurations," so I can't verify this.

        - Most of the measurements it makes rely on javascript support. For what it's worth, it claims my fingerprint is not unique when javascript is disabled, which is how I browse the web by default.

        The other extreme would be fixing my User-Agent to the most common value, but I don't think that'd offer me much privacy unless I also used a proxy/NAT shared by many users.

        • lovethevoid 4 days ago

          Randomizing to throw trackers off only works if you only ever visit sites once.

          But yes, without javascript a lot of tracking functions fail to operate. That is good for privacy, and EFF notes that on the site.

          You can fix your UA to a common value, it's about providing the least amount of identifying bits, and randomizing it just provides another bit to identify you by. Always remember: an absence of information is also valuable information!

        • HappMacDonald 4 days ago

          I would just fingerprint you as "the only person on the internet who is scrambling their UA string" :)

  • pessimizer 4 days ago

    Also, Cloudflare won't let you in if you forge your referer (it's nobody's business what site I'm coming from.) For years, you could just send the root of the site you were visiting, then last year somebody at Cloudflare flipped a switch and took a bite out of everyone's privacy. Now it's just endless reloading captchas.

    • zamadatix 4 days ago

      Why go through that hassle instead of just removing the referer?

      • bityard 4 days ago

        Lots of sites see an empty referrer and send you to their main page or marketing page. Which means you can't get anywhere else on their site without a valid referrer. They consider it a form of "hotlink" protection.

        (I'm not saying I agree with it, just that it exists.)

        • zamadatix 4 days ago

          Fair and valid answer to my wording. Rewritten for what I meant to ask: "Why set referrer to the base of the destination origin instead of something like Referrer-Policy: strict-origin?". I.e. remove it completely for cross-origin instead of always making up that you came from the destination.

          Though what you mention does beg the question "is there really much privacy gain in that over using Referrer-Policy: same-origin and having referrer based pages work right?" I suppose so if you're randomizing your identity in an untrackable way for each connection it could be attractive... though I think that'd trigger being suspected as a bot far before the lack of proper same origin info :p.

    • philsnow 4 days ago

      Ah, maybe this is what’s happening to me.. I use Firefox with uBlock origin, privacy badger, multi-account containers, and temporary containers.

      Whenever I click a link to another site, i get a new tab in either a pre-assigned container or else in a “tmpNNNN” container, and i think either by default or I have it configured to omit Referer headers on those new tab navigations.

  • DrillShopper 4 days ago

    Maybe after the courts break up Amazon the FTC can turn its eye to Cloudflare.

    • gjsman-1000 4 days ago

      A. Do you think courts give a darn about the 0.1% of users that are still using RSS? We might as well care about the 0.1% of users who want the ability to set every website's background color to purple with neon green anchor tags. RSS never caught on as a standard to begin with, peaking at 6% adoption by 2005.

      B. Cloudflare has healthy competition with AWS, Akamai, Fastly, Bunny.net, Mux, Google Cloud, Azure, you name it, there's a competitor. This isn't even an Apple vs Google situation.

      • HappMacDonald 4 days ago

        Cloudflare doesn't offer the same product suite as the other companies you mention, though. Cloudflare is primarily DDoS prevention while the others are primarily cloud hosting.

        And it is the DDoS prevention measures at issue here.

        • gjsman-1000 4 days ago

          Five years ago, you would’ve been right, but Cloudflare is very different now.

          Nowadays, Cloudflare has image compression and CDN services, video storage and delivery services, serverless compute with Workers, domain registration, (soon) container support with optional GPUs, durable objects (basically serverless storage), serverless SQL databases (D1), even an AWS S3 competitor with R2. They even have bespoke services like Cloudflare Tunnels - what’s AWS got that’s anything like it?

          Cloudflare is getting close to full-on AWS. At least, the parts most customers use. If they just added boring old VPSs, people would realize very quickly how full featured they are.

          As for DDoS mitigation - you’ve still got AWS Shield, Akamai, Azure, Radware, F5, even Oracle (Dyn) competing in that market. Unless you could show Cloudflare did illegal tying as a monopolist specifically to sell DDoS prevention, there’s no case.

  • anthk 4 days ago

    Or any Dillo user, with a PSP User Agent which is legit for small displays.

  • anal_reactor 4 days ago

    On my phone Opera Mobile won't be allowed into some websites behind CloudFlare, most importantly 4chan

    • dialup_sounds 4 days ago

      4chan's CF config is so janky at this point it's the only site I have to use a VPN for.

  • Jazgot 4 days ago

    My rss reader was blocked on kvraudio.com by cloudflare. This issue wasn't solved for months. I simply stopped reading anything on kvraudio. Thank you cloudflare!

  • KPGv2 4 days ago

    Reddit seems to do this to me (sometimes) when I use Zen browser. If I switch over to Safari or Chrome, the site always works great.

  • kjkjadksj 4 days ago

    Reddit has been bad about it as of late too

  • viraptor 5 days ago

    I know it's not a solution for you specifically here, but if anyone has access to the CF enterprise plan, they can report specific traffic as non-bot and hopefully improve the situation. They need to have access to the "Bot Management" feature though. It's a shitty situation, but some of us here can push back a little bit - so do it if you can.

    And yes, it's sad that the "make internet work again" is behind an expensive paywall..

    • meeby 4 days ago

      The issue here is that RSS readers are bots. Obviously perfectly sensible and useful bots, but they’re not “real people using a browser”. I doubt you could get RSS readers listed on Cloudflare’s “good bots” list either, which would get them past the default bot protection feature, given they’ll all run off random residential IPs.

      • j16sdiz 4 days ago

        They can't whitelist by user agent, otherwise bots will pass just by spoofing the agent.

        If you have an enterprise plan, you can have custom rules, including allowing by URL.

      • sam345 4 days ago

        Not sure if I get this. It seems to me an RSS reader is as much of a bot as a browser is for HTML. It just reads RSS rather than HTML.

        • kccqzy 4 days ago

          The difference is that RSS readers usually do background fetches on their own rather than waiting for a human to navigate to a page. So in theory, you could just set up a crontab (or systemd timer) that simply xdg-opens various pages on a schedule and not be treated as a bot.
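
          Something like this, if you actually wanted to go that route (URLs and interval are placeholders):

            # Sketch of that idea without cron: periodically hand a few pages
            # to the default browser via xdg-open (Linux desktops).
            import subprocess
            import time

            PAGES = ["https://example.com/blog", "https://example.org/news"]

            while True:
                for url in PAGES:
                    subprocess.run(["xdg-open", url], check=False)
                time.sleep(60 * 60)  # once an hour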

      • viraptor 4 days ago

        I was responding to a person with Firefox issues, not RSS.

        I'm not sure either if RSS bots could be added to good bots, but if anyone has traffic from them, we can definitely try. (No high hopes though, given the responses I got from support so far)

  • jasonlotito 4 days ago

    Cloudflare has always been a dumpster fire in usability. The number of times it would block me in that way was enough to make me seriously question the technical knowledge of anyone who used it. It's a dumpster fire. Friends don't let friends use Cloudflare. To me, it's like the Spirit Airlines of CDNs.

    Sure, tech wise it might work great, but from your users perspective: it's trash.

    • immibis 4 days ago

      It's got the best vendor lock-in enshittification story - it's free - and that's all that matters.

jgrahamc 5 days ago

My email is jgc@cloudflare.com. I'd like to hear from the owners of RSS readers directly on what they are experiencing. Going to ask team to take a closer look.

  • kalib_tweli 4 days ago

    There are email obfuscation and managed challenge script tags being injected into the RSS feed.

    You simply shouldn't have any challenges whatsoever on an RSS feed. They're literally meant to be read by a machine.

    • kalib_tweli 4 days ago

      I confirmed that if you explicitly set the Content-Type response header to application/rss+xml it seems to work with Cloudflare Proxy enabled.

      The issue here is that Cloudflare's content type check is naive. And the fact that CF is checking the content-type header directly needs to be made more explicit OR they need to do a file type check.
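
      It's easy to check what your feed actually returns through the proxy. A quick sketch with a placeholder URL:

        # Print the status and Content-Type a feed serves through the proxy.
        import requests

        resp = requests.get("https://example.com/feed.xml", timeout=10)
        print(resp.status_code, resp.headers.get("Content-Type"))
        # "application/rss+xml" is what seemed to work with the proxy enabled;
        # an HTML challenge page here suggests the bot check kicked in.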

      • londons_explore 4 days ago

        I wonder if popular software for generating RSS feeds might not be setting the correct content-type header? Maybe this whole issue could be mostly-fixed by a few github PR's...

        • onli 4 days ago

          Correct might be debatable here as well. My blog for example sets Content-Type to text/xml, which is not exactly wrong for an RSS feed (after all, it is text and XML) and IIRC was the default back then.

          There were compatibility issues with other type headers, at least in the past.

          • johneth 4 days ago

            I think the current correct content types are:

            'application/rss+xml' (for RSS)

            'application/atom+xml' (for Atom)

            • londons_explore 4 days ago

              Sounds like a kind Samaritan could write a scanner to find as many RSS feeds as possible which look like RSS/Atom and don't have these content types, then go and patch the hosting software those feeds use to have the correct content types, or ask the webmasters to fix it if they're home-made sites.

              As soon as a majority of sites use the correct types, clients can start requiring it for newly added feeds, which in turn will make webmasters make it right if they want their feed to work.
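
              The detection half of that is only a few lines. A sketch with a placeholder feed list (real crawling would need politeness, retries, etc.):

                # Report feeds that don't declare an RSS/Atom media type.
                import requests

                FEEDS = ["https://example.com/feed.xml",
                         "https://example.org/atom.xml"]
                GOOD = ("application/rss+xml", "application/atom+xml")

                for url in FEEDS:
                    try:
                        ctype = requests.get(url, timeout=10).headers.get("Content-Type", "")
                    except requests.RequestException as exc:
                        print(f"{url}: fetch failed ({exc})")
                        continue
                    if not ctype.startswith(GOOD):
                        print(f"{url}: served as {ctype!r}")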

            • onli 4 days ago

              Not even Cloudflare's own blog uses those, https://blog.cloudflare.com/rss/, or am I getting a wrong content-type shown in my dev tools? For me it is `application/xml`. So even if `application/rss+xml` were the correct type by an official spec, it's not something to rely on if it's not used commonly.

              • johneth 4 days ago

                I just checked Wikipedia and it says Atom's is 'application/atom+xml' (also confirmed in the IANA registry), and RSS's is 'application/rss+xml' (but it's not registered yet, and 'text/xml' is also used widely).

                'application/rss+xml' seems to be the best option though in my opinion. The '+xml' in the media type tells (good) parsers to fall back to using an XML parser if they don't understand the 'rss' part, but the 'rss' part provides more accurate information on the content's type for parsers that do understand RSS.

                All that said, it's a mess.

        • kalib_tweli 4 days ago

          It wouldn't. It's the role of the HTTP server to set the correct content type header.

        • djbusby 4 days ago

          The number of feeds with crap headers and other non-spec stuff going on; and loads of clients missing useful headers. Ugh. It seems like it should be simple; maybe that's why there are loads of naive implementations.

        • Klonoar 4 days ago

          Quite a few feeds out there use the incorrect type of text/xml, since it works slightly better in browsers by not prompting a download.

          Would not surprise me if Cloudflare lumps this in with text/html protections.

    • o11c 4 days ago

      Even outside of RSS, the injected scripts often make internet security significantly worse.

      Since the user-agent has no way to distinguish scripts injected by cloudflare from scripts originating from the actual website, in order to pass the challenge they are forced to execute arbitrary code from an untrusted party. And malicious Javascript is practically ubiquitous on the general internet.

  • badlibrarian 4 days ago

    Thank you for showing up here and being open to feedback. But I have to ask: shouldn't Cloudflare be running and reviewing reports to catch this before it became such a problem? It's three clicks in Tableau for anyone who cares, and clearly nobody does. And this isn't the first time something like this has slipped through the cracks.

    I tried reaching out to Cloudflare with issues like this in the past. The response is dozens of employees hitting my LinkedIn page yet no responses to basic, reproducible technical issues.

    You need to fix this internally as it's a reputational problem now. Less screwing around using Salesforce as your private Twitter, more leadership in triage. Your devs obviously aren't motivated to fix this stuff independently and for whatever reason they keep breaking the web.

    • 015a 4 days ago

      The reality that HackerNews denizens need to accept, in this case and in a more general form, is: RSS feeds are not popular. They aren't just unpopular in the way that, say, Peacock is unpopular relative to Netflix; they're truly unpopular, used regularly by a number of people that could fit in an american football stadium. There are younger software engineers at Cloudflare that have never heard the term "RSS" before, and have no notion of what it is. It will probably be dead technology in ten years.

      I'm not saying this to say its a good thing; it isn't.

      Here's something to consider though: Why are we going after Cloudflare for this? Isn't the website operator far, far more at-fault? They chose Cloudflare. They configure Cloudflare. They, in theory, publish an RSS feed, which is broken because of infrastructure decisions they made. You're going after Ryobi because you've got a leaky pipe. But beyond that: isn't this tool Cloudflare publishes doing exactly what the website operators intended it to do? It blocks non-human traffic. RSS clients are non-human traffic. Maybe the reason you don't want to go after the website operators is because you know you're in the wrong? Why can't these RSS clients detect when they encounter this situation, and prompt the user with a captive portal to get past it?

      • badlibrarian 4 days ago

        I'm old enough to remember Dave Winer taking Feedburner to task for inserting crap into RSS feeds that broke his code.

        There will always be niche technologies and nascent standards, and we're taking Cloudflare to task today because if they continue to stomp on them, we get nowhere.

        "Don't use Cloudflare" is an option, but we can demand both.

        • gjsman-1000 4 days ago

          "Old man yells at cloud about how the young'ns don't appreciate RSS."

          I mean that somewhat sarcastically; but there does come a point where the demands are unreasonable and the technology is dead. There are probably more people browsing with JavaScript disabled than using RSS feeds. There are probably more people browsing on Windows XP than using RSS feeds. Do I yell at you because your personal blog doesn't support IE6 anymore?

          • badlibrarian 4 days ago

            Spotify and Apple Podcasts use RSS feeds to update what they show in their apps. And even if millions of people weren't dependent on it, suggesting that an infrastructure provider not fix a bug only makes the web worse.

        • 015a 4 days ago

          I'm not backing down on this one: This is straight up an "old man yelling at the kids to get off his lawn" situation, and the fact that JGC from Cloudflare is in here saying "we'll take a look at this" is so far above and beyond what any reasonable person would expect of them that they deserve praise and nothing else.

          This is a matter between You and the Website Operators, period. Cloudflare has nothing to do with this. This article puts "Cloudflare" in the title because it's fun to hate on Cloudflare and it gets upvotes. Cloudflare is a tool. These website operators are using Cloudflare The Tool to block inhuman access to their websites. RSS CLIENTS ARE NOT HUMAN. Let me repeat that: Cloudflare's bot detection is working fully appropriately here, because RSS Clients are Bots. Everything here is working as expected. The part where change should be demanded is: Website operators should allow inhuman actors past the Cloudflare bot detection firewall specifically for RSS feeds. They can FULLY DO THIS. Cloudflare has many, many knobs and buttons that Website Operators can tweak; one of those is e.g. a page rule to turn off bot detection for specific routes, such as `/feed.xml`.

          If your favorite website is not doing this, it's NOT CLOUDFLARE'S FAULT.

          Take it up with the Website Operators, Not Cloudflare. Or, build an RSS Client which supports a captive portal to do human authorization. God this is so boring, y'all just love shaking your fist and yelling at big tech for LITERALLY no reason. I suspect it's actually because half of y'all are concerningly uneducated on what we're talking about.

          • badlibrarian 4 days ago

            As part of proxying what may be as much as 20% of the web, Cloudflare injects code and modifies content that passes between clients and servers. It is in their core business interests to receive and act upon feedback regarding this functionality.

            • 015a 4 days ago

              Sure: Let's begin by not starting the conversation with "Don't use Cloudflare", as you did. That's obviously not only unhelpful, but it clearly points the finger at the wrong party.

          • doctor_radium 4 days ago

            I get what you're saying, and on a philosophical level you're probably right. If a website owner misconfigures their CDN to the point of impeding legitimate traffic, then they can fail like businesses do every day. Survival of the fittest. But with the majority of web users apparently running stock Chrome, on a practical level the web still has to work. I went looking for car parts a number of months ago and was blocked/accosted by firewalls over 50% of the time. Not all of them Cloudflare-powered sites. There isn't enough time in the day to take every misconfigured site to task (unless you're Bowerick Wowbagger [1]), so I believe the solution will eventually have to be either an altruistic effort from Cloudflare or government regulation.

            [1] https://www.wowbagger.com/chapter1.htm

          • 627467 4 days ago

            What does Cloudflare do to search crawlers by default? Does it block them too?

  • viraptor 5 days ago

    It's cool and all that you're making an exception here, but how about including a "no, really, I'm actually a human" link on the block page rather than leaving the visitor with a puzzle: how do you report the issue to the page owner (hard enough on its own for normies) if you can't even load the page? This is just externalising issues that belong to the Cloudflare service.

    • jgrahamc 5 days ago

      I am not trying to "make an exception", I'm asking for information external to Cloudflare so I can look at what people are experiencing and compare with what our systems are doing and figure out what needs to improve.

      • PaulRobinson 4 days ago

        Some "bots" are legitimate. RSS is intended for machine consumption. You should not be blocking content intended for machine consumption because a machine is attempting to consume it. You should not expect a machine, consuming content intended for a machine, to do some sort of step to show they aren't a machine, because they are in fact a machine. There is a lot of content on the internet that is not used by humans, and so checking that humans are using it is an aggressive anti-pattern that ruins experiences for millions of people.

        It's not that hard. If the content being requested is RSS (or Atom, or some other syndication format intended for consumption by software), just don't do bot checks; use other mechanisms like rate limiting if you must stop abuse (rough sketch of the idea below).

        As an example: would you put a captcha on robots.txt as well?

        As other stories here can attest to, Cloudflare is slowly killing off independent publishing on the web through poor product management decisions and technology implementations, and the fix seems pretty simple.
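
        Roughly the kind of exemption I mean, as a sketch only (a hypothetical check with made-up path patterns and thresholds - not Cloudflare's actual logic):

            import re

            # Hypothetical sketch: exempt machine-readable endpoints from human challenges.
            FEED_PATH = re.compile(r"(\.(rss|atom|xml)$)|(/(feed|rss|atom)/?$)")
            FEED_TYPES = {"application/rss+xml", "application/atom+xml", "application/xml", "text/xml"}

            def needs_human_check(path: str, content_type: str = "") -> bool:
                """Return False for content meant for software: feeds, robots.txt, sitemaps."""
                if path in ("/robots.txt", "/sitemap.xml") or FEED_PATH.search(path):
                    return False
                if content_type.split(";")[0].strip().lower() in FEED_TYPES:
                    return False
                return True  # everything else can go through the usual challenge or rate limiting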

        • jamespo 4 days ago

          According to another post, if the content-type is correct it gets through. If that's the case, I don't see the problem.

          • Scramblejams 4 days ago

            It's a very common misconfiguration, though, because it happens by default when setting up CF. If your customers are, by default, configuring things incorrectly, then it's reasonable to ask if the service should surface the issue more proactively in an attempt to help customers get it right.

            As another commenter noted, not even CF's own RSS feed seems to get the content type right. This issue could clearly use some work.

    • doctor_radium 4 days ago

      I had a conversation with a web site owner about this once. There apparently is such a feature: a way for sites to configure a "Please contact us here if you're having trouble reaching our site" page, usage of which I assume Cloudflare could track and then gain better insight into these issues. The problem? It requires a Premium Plan.

    • methou 5 days ago

      Some clients are more like a bot/service; imagine Google Reader, which fetches and caches content for you. The client I'm currently using, Miniflux, also works this way.

      I understand that there are some more interactive RSS readers, but from personal experience it's more like "hey, I'm a good bot, let me in".

      • _Algernon_ 4 days ago

        An RSS reader is a user agent (i.e. software acting on behalf of its users). If you define RSS readers as bots (even good bots), you may as well call Firefox a bot (it also sends off web requests without the user explicitly approving each one).

        • sofixa 4 days ago

          Their point was that the RSS reader does the scraping on its own in the background, without user input. If it can't read the page, it can't; it's not initiated by the user where the user can click on a "I'm not a bot, I promise" button.

      • viraptor 4 days ago

        It was a mental skip, but the same idea. It would be awesome if CF just allowed reporting issues at the point something gets blocked - regardless of whether it's a human or a bot. They're missing an "I'm misclassified" button for the people actually affected, without the third-party runaround.

        • fluidcruft 4 days ago

          Unfortunately, I would expect that queue of reports to get flooded by bad faith actors.

          • viraptor 4 days ago

            Sure, but now they say that queue should go to the website owner instead, who has less global visibility on the traffic. So that's just ignoring something they don't want to deal with.

  • is_true 4 days ago

    Maybe when you detect URLs that return the RSS mimetype, notify the owner of the site/CF account that it might be a good idea to allow bots on those URLs.

    Ideally you could make it a simple switch in the config, something like: "Allow automated access on RSS endpoints".

  • prmoustache 4 days ago

    It is not only RSS reader users who are affected. Any user with an extension that blocks trackers regularly gets forbidden access to websites or has to deal with tons of captchas.

  • kevincox 4 days ago

    I'll mail you as well but I think public discussion is helpful, especially since I have seen similar responses to this over the years and it feels very disingenuous. The problem is very clear (Cloudflare serves 403 blocks to feed readers for no reason) and you have all of the logs. The solution is maybe not trivial, but I fail to see how the perspective of someone seeing a 403 block is going to help much. This just starts to sound like a way to seem responsive without actually doing anything.

    From the feed reader perspective it is a 403 response. For example my reader has been trying to read https://blog.cloudflare.com/rss/ and the last successful response it got was on 2021-11-17. It has been backing off due to "errors" but it still is checking every 1-2 weeks and gets a 403 every time.

    This obviously isn't limited to the Cloudflare blog; I see it on many sites "protected by" (or in this case broken by) Cloudflare. I could tell you what public cloud IPs my reader comes from or which user-agent it uses, but that is beside the point. This is a URL which is clearly intended for bots, so it shouldn't be bot-blocked by default.

    When people reach out to customer support we tell them that this is a bug on the site's side and there isn't much we can do. They can try contacting the site owner, but this is most likely the default configuration of Cloudflare causing problems that the owner isn't aware of. I often recommend using a service like FeedBurner to proxy the request, as these services seem to be on the whitelist of Cloudflare and other scraping-prevention firewalls.

    I think the main solution would be to detect intended-for-robots content and exclude it from scraping prevention by default (at least to a huge degree).

    Another useful mechanism would be to allow these to be accessed when the target page is cacheable, as the cache will protect the origin from overload-type DoS attacks anyway. Some care needs to be taken to ensure that adding a ?bust={random} query parameter can't break through to the origin, but this would be a powerful tool for endpoints that need protection from overload but not against scraping (like RSS feeds). Unfortunately cache headers for feeds are far from universal, so this wouldn't fix all feeds on its own. (For example the Cloudflare blog's feed doesn't set any caching headers and is labeled as `cf-cache-status: DYNAMIC`.)
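
    As a rough illustration of that query-parameter concern (a generic sketch of cache-key normalization, not how Cloudflare actually implements caching): for feed-like paths the cache key can simply ignore the query string, so ?bust=123 still maps to the same cached object.

        from urllib.parse import urlsplit

        # Generic sketch: normalize cache keys for feed endpoints so random query
        # parameters can't bust through to the origin. The path patterns are assumptions.
        def cache_key(url: str) -> str:
            parts = urlsplit(url)
            is_feed = parts.path.endswith((".xml", ".rss", ".atom")) or parts.path.rstrip("/").endswith("/feed")
            if is_feed:
                return f"{parts.scheme}://{parts.netloc}{parts.path}"  # drop the query string
            return url

        assert cache_key("https://example.com/feed.xml?bust=42") == cache_key("https://example.com/feed.xml")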

  • quinncom 4 days ago

    Cloudflare-enabled websites have had this issue for years.[1] The problem is that website owners are not educated enough to understand that URLs meant for bots should not have Cloudflare's bot blocker enabled.

    Perhaps a solution would be for Cloudflare to have default page rules that disable bot-blocking features for common RSS feed URLs? Or to pop up a notice with instructions on how to create these page rules for users who appear to have RSS feeds on their website?

    [1] Here is Overcast’s owner raising the issue in 2022: https://x.com/OvercastFM/status/1578755654587940865

erikrothoff 5 days ago

As the owner of an RSS reader I love that they are making this more public. 30% of our support requests are "my feed doesn't work". It sucks that the only thing we can say is "contact the site owner, it's their firewall". And to be fair it's not only Cloudflare; many different firewall setups cause issues. It's ironic that a public API endpoint meant for bots is blocked for being a bot.

belkinpower 5 days ago

I maintain an RSS reader for work and Cloudflare is the bane of my existence. Tons of feeds will stop working at random and there’s nothing we can do about it except for individually contacting website owners and asking them to add an exception for their feed URL.

  • stanislavb 5 days ago

    I was recently contacted by one of my website users as their RSS reader was blocked by Cloudflare.

  • sammy2255 5 days ago

    Unfortunately it's not really Cloudflare but the webadmins who have configured it to block everything that's not a browser, whether unknowingly or not.

    • afandian 5 days ago

      If Cloudflare offers a product, for a particular purpose, that breaks existing conventions of that purpose, then it's Cloudflare.

      • sammy2255 5 days ago

        Not really. You wouldn’t complain to a fence company for blocking a path if there were hired to do exactly that

        • shakna 5 days ago

          Yes, I would. Experts are expected to relay back to their client with their thoughts on a matter, not just blindly do as they're told. Your builder is meant to do their due diligence, which includes making recommendations.

        • gsich 5 days ago

          They are enablers. They get part of the blame.

      • echoangle 5 days ago

        Well it doesn’t break the conventions of the purpose they offer it for. Cloudflare attempts to block non-human users, and this is supposed to be used for human-readable websites. If someone puts cloudflare in front of a RSS feed, that’s user error. It’s like someone putting a captcha in front of an API and then complaining that the Captcha provider is breaking conventions.

    • nirvdrum 4 days ago

      I contend this wasn’t an issue prior to Cloudflare making that an option. Sure, some IDS would block some users and geo blocks have been around forever. But, Cloudflare is so prolific and makes it so easy to block things inadvertently, that I don’t think they get a pass and blame the downstream user.

      It’s particularly frustrating that they give their own WARP service a pass. I’ve run into many sites that will block VPN traffic, including iCloud Privacy Relay, but WARP traffic goes through just fine.

  • foul 4 days ago

    [flagged]

    • account42 4 days ago

      Ah yes, just wrap every protocol in HTTP to get through middle boxes. Just use Chrome for all requests because fuck having a standard with different implementations. Next you're going to recommend just automating a Windows PC through simulated mouse and keyboard input to deal with the hardware attestation that these fuckers want to bring to the web.

      • foul 4 days ago

        Not my fault if the whole world bought the "openness" bullshit and then built cable-TV-with-mouse.

        If that guy makes money with that and has an issue with the Great Firewall Of America, there's a (bad) solution.

elwebmaster 4 days ago

Using Cloudflare on your website could be blocking Safari users, Chrome users, or just any users. It's totally broken. They have no way of measuring the false positives. Website owners are paying for it in lost revenue, and poor users lose access through no fault of their own. That is, until some C-level exec at a BigTech randomly gets blocked and makes noise. But even then, Cloudflare will probably just whitelist that specific domain/IP. It is very interesting how I have never been blocked when trying to access Cloudflare itself, only when accessing their customers' sites.

wraptile 4 days ago

Cloudflare has been the bane of my web existence on a Thai IP and a Linux Firefox fingerprint. I wonder how much traffic is lost because of Cloudflare, and of course none of that is reported to the web admins, so everyone continues in their jolly ignorance.

I wrote my own RSS bridge that scrapes websites using the Scrapfly web scraping API, which bypasses all of that, because it's so annoying that I can't even scrape some company's /blog that they are literally buying ads for, yet somehow they have an anti-bot enabled that blocks all RSS readers.

Modern web is so anti social that the web 2.0 guys should be rolling in their "everything will be connected with APIs" graves by now.

  • vundercind 4 days ago

    The late '90s-'00s solution was to blackhole address blocks associated with entire countries or continents. It was easily worth it for many US sites that weren't super-huge to lose the 0.1% of legitimate requests they'd get from, say, China or Thailand or Russia, to cut the speed their logs scrolled at by 99%.

    The state of the art isn't much better today, it seems. Similar outcome with more steps.

whs 5 days ago

My company runs a tech news website. We offer an RSS feed as any Drupal website would, and content farms just scrape our RSS feed to rehost our content in full. This is usually fine for us - the content is CC-licensed and they do post the correct source. But they run thousands of different WordPress instances on the same IP and each one fetches the feed individually.

In the end we had to use Cloudflare to rate limit the RSS endpoint.

  • kevincox 4 days ago

    > In the end we had to use Cloudflare to rate limit the RSS endpoint.

    I think this is fine. You are solving a specific problem and still allowing some traffic. The problem with the Cloudflare default settings is that they block all requests leading to users failing to get any updates even when fetching the feed at a reasonable rate.

    BTW in this case another solution may just be to configure proper caching headers. Even if you only cache for 5 minutes at a time, that will be at most 1 request every 5 minutes per Cloudflare caching location. (I don't know the exact configuration, but they typically use ~5 locations per origin, so that would be only about 1 req/min, which is trivial load and will handle both these inconsiderate scrapers and regular users. You can also configure all fetches to come from a single location, and then you would only need to actually serve the feed once per 5 minutes.)
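
    For the origin side, the header itself is the whole trick. A minimal sketch (Flask is just an example framework here, and the feed body is a placeholder):

        from flask import Flask, Response

        app = Flask(__name__)

        # Placeholder feed body - a real site would render its actual items here.
        FEED_XML = (
            '<?xml version="1.0"?><rss version="2.0"><channel>'
            "<title>Example</title><link>https://example.com/</link>"
            "<description>Placeholder feed</description></channel></rss>"
        )

        @app.route("/feed.xml")
        def feed():
            return Response(
                FEED_XML,
                mimetype="application/rss+xml",
                headers={"Cache-Control": "public, max-age=300"},  # let a shared cache serve it for 5 minutes
            )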

  • yjftsjthsd-h 4 days ago

    > In the end we had to use Cloudflare to rate limit the RSS endpoint.

    Isn't the correct solution to use CF to cache RSS endpoints aggressively?

    • whs 4 days ago

      We do both, but the enterprise plan isn't as unlimited as the self-service plans so we need to limit them as well. (It's not a large site, but the Cloudflare contract covers all affiliated companies - it is funny when we serve news of ongoing Cloudflare outages on Cloudflare Enterprise.)

butz 4 days ago

Not "could" but it is actually blocking. Very annoying when government website does that, as usually it is next to impossible to explain the issue and ask for a fix. And even if the fix is made, it is reverted several weeks later. Other websites does that too, it was funny when one website was asking RSS reader to resolve captcha and prove they are human.

MarvinYork 5 days ago

In any case, it blocks German Telekom users. There is an ongoing dispute between Cloudflare and Telekom as to who pays for the traffic costs. Telekom is therefore throttling connections to Cloudflare. This is the reason why we can no longer use Cloudflare.

  • SSLy 4 days ago

    as much as I am not a fan of cloudflare's practices, in this particular case DTAG seems to be the party at fault.

  • nisa 3 days ago

    There is no dispute. Telekom is not peering on public exchanges and wants ransom in the form of expensive private IP-transit contracts from everyone. Their customers are used as a bargaining chip for this. Recently Meta stopped playing that game, and Cloudflare never did afaik. Telekom could solve part of this problem with a few hundred thousand euros and a few weeks' time if they peered at the bigger German exchanges. If every big ISP acted like them, the Internet would be dead.

davidfischer 4 days ago

My employer, Read the Docs, is a heavy user of Cloudflare. It's actually hard to imagine serving as much traffic as we do, as cheaply as we do, without them.

That said, for publicly hosted open source documentation, we turn down the security settings almost all the way. Security level is set to "essentially off" (that's the actual setting name), no browser integrity check, TOR friendly (onion routing on), etc. We still have rate limits in place but they're pretty generous (~4 req/s sustained). For sites that don't require a login and don't accept inbound leads or something like that, that's probably around the right level. Our domains where doc authors manage their docs have higher security settings.

That said, being too generous can get you into trouble so I understand why people crank up the settings and just block some legitimate traffic. See our past post where AI scrapers scraped almost 100TB (https://news.ycombinator.com/item?id=41072549).

imartin2k 4 days ago

I’m happy to see that a post regarding the use of RSS gets so much attention on HN. It’s a good sign. As I have basically lived in my feed reader since 2007 or so, one of my greatest fears is the slow demise of RSS by way of reduced support for RSS feeds by website owners.

hugoromano 4 days ago

"could be blocking RSS users" it says it all "could". I use RSS on my websites, which are serviced by Cloudflare, and my users are not blocked. For that, fine-tuning and setting Configuration Rules at Cloudflare Dashboard are required. Anyone on a free has access to 10 Configuration Rules. I prefer using Cloudflare Workers to tune better, but there is a cost. My suggestion for RSS these days is to reduce the info on RSS feed to teasers, AI bots are using RSS to circumvent bans, and continue to scrape.

veeti 5 days ago

I believe that disabling "Bot Fight Mode" is not enough; you may also need to create a rule to disable "Browser Integrity Check".

pentagrama 4 days ago

Can you whitelist URLs to be read by bots on Cloudflare? Maybe this is a good solution, where you as a site maintainer can include your RSS feeds, sitemaps, and other content for bots.

Also, Cloudflare could ship a dedicated section in the admin panel to let users add and whitelist RSS feeds and sitemaps, making it easier (and educating users) to avoid blocking bots that aren't a threat to the site - while of course still applying rules to prevent DDoS on those URLs, such as massive request volumes or other behavior that common RSS reader bots don't exhibit.

samplifier 2 days ago

I've noticed that Old Reddit still supports RSS feeds without returning a 403 error. This is in contrast to the main site, which often blocks RSS requests.

Here are some DNS details:

The main Reddit site (www.reddit.com) uses Fastly. Old Reddit (old.reddit.com) also uses Fastly. However, the "vomit" address (which often returns 403s for RSS requests) uses AWS DNS. Is Old Reddit not behind Cloudflare, or is there another reason why it handles RSS requests differently?

  • samplifier a day ago

    Ignore above. Brainfarted. It doesn't work.

tandav 4 days ago

As an admin of my personal website, I completely disable all Cloudflare features and use it only for DNS and domain registration. I also stop following websites that use Cloudflare checks or cookie popups (cookies are fine, but the popups are annoying).

pointlessone 4 days ago

I see this on a regular basis. My self-hosted RSS reader is blocked by Cloudflare even after my IP address was explicitly allowlisted by a few feed owners.

artooro 4 days ago

This is a truly problematic issue that I've experienced as well. The best solution is probably for Cloudflare to figure out what normal RSS usage looks like and have a provision for that in their bot detection.

PeterStuer 3 days ago

I was bitten by this as well. My product retrieves RSS feeds from public government sites, and suddenly I'm blocked by Cloudflare's anti-botting for trying to access a page that was specifically created for machine consumption. It is not that the website owner or publisher intends to block this. They are unaware that turning on Cloudflare will block everything, even stuff allowed to be consumed according to robots.txt.

P.S. when I mentioned this here on HN a few weeks back, it was implied that I probably did not respect robots.txt (I do, Cloudflare does not) or that I should get in touch with the site administrators (impossible to do in any reasonably effective way at scale).

ricardo81 5 days ago

IIRC, even if you're listed as a "good bot" with Cloudflare, high security settings by the CF user can still result in 403s.

No idea if CF already does this, but allowing users to generate access tokens for 3rd party services would be another way of easing access alongside their apparent URL and IP whitelisting.

account42 4 days ago

Or just normal human users with a niche browser like Firefox.

ectospheno 4 days ago

I love that I get a Cloudflare human check on almost every page they serve for customers, except when I log in to my Cloudflare account. Good times.

prmoustache 5 days ago

I believe this also poses issues for people running adblockers. I get tons of repetitive captchas on some websites.

Also, other companies offering similar services, like Imperva, seem to be straight-up banning my IP after one visit to a website with uBlock Origin: I first get a captcha, then a page saying I am not allowed, and whatever I do, even using an extensionless Chrome browser with a new profile, I can't visit the site anymore because my IP is banned.

  • acdha 4 days ago

    One thing to keep in mind is that the modern web sees a lot of spam and scraping, and ad revenue has been sliding for years. If you make your activity look like a bot, most operators will assume you're not generating revenue and block you. It sucks, but thank a spammer for the situation.

    • immibis 4 days ago

      They should provide an API if they don't like scraping, but also, any sane scraper isn't really a problem, unless you are trying to enshittify your site by forcing people to use your app. I heard some AI scrapers are insane, and should be individually blocked.

      • acdha 3 days ago

        “Sane scraper” doesn’t have a definition or anyone to enforce it. Similarly, APIs aren’t magic - if you make things publicly available, people will harvest it whether that’s with a 90s-style bot making individual requests or a headless browser which runs the JavaScript you use to make API calls.

        The other thing to think about is the lack of enforcement: you can’t complain to the bot police when some dude in China decides to harvest your data, and if you try blocking by user-agent or IP you’ll play whack-a-mole trying to stay ahead of the bot operators who will spoof the former and churn the latter. After developing an appreciation for why security people talk about validating correctness rather than trying to enumerate badness, you’ll end up with a combination of rate-limiting and broader blocking for the same reasons. Yes, it’s no fun but the problem isn’t the sites but the people abusing the free services we’ve been given.

      • Klonoar 4 days ago

        Some AI scrapers have been proven not to report themselves as AI scrapers and to mimic real users.

        This is part of what’s leading to the bludgeoning approach you see with blocking. They are not an individual thing that can be blocked.

srmarm 4 days ago

I'd have thought the website owner whitelisting their RSS feed URI (or pattern matching *.xml/*.rss) might be better than doing it based on the user agent string. For one, you'd expect bot traffic on these endpoints, and you're also not leaving a door open to anyone who fakes their user agent.

Looks like it should be possible under the WAF.

rcarmo 5 days ago

Ironically, the site seems to currently be hugged to death, so maybe they should consider using Cloudflare to deal with HN traffic?

  • sofixa 4 days ago

    Doesn't have to be using CloudFlare, just a static web host that will be able to scale to infinity (of which CloudFlare is one with Pages, but there's also Google with Firebase Hosting, AWS with Amplify, Microsoft with something in Azure with a verbose name, Netlify, Vercel, GitHub Pages, etc etc etc).

    • kawsper 4 days ago

      Or just add Varnish or Nginx configured with a cache in front.

      • sofixa 4 days ago

        That can still exhaust system resources on the box it's running on (file descriptors, inodes, ports, CPU/memory/bandwidth, etc.) if the traffic spike is big enough.

        For something like entirely static content, it's so much easier (and cheaper, all of the static hosting providers have an extremely generous free tier) to use static hosting.

        And I say this as an SRE by heart who runs Kubernetes and Nomad for fun across a number of nodes at home and in various providers - my blog is on a static host. Use the appropriate solution for each task.

      • vundercind 4 days ago

        I used to serve low-tens-of-MB .zip files—worse than a web page and a few images or what have you—statically from Apache2 on a boring Linux server that'd qualify as potato-tier today, with traffic spikes into the hundreds of thousands per minute. Tens of thousands per minute against other endpoints gated by PHP setting a header to tell Apache2 to serve the file directly if the client authenticated correctly, and I think that one could have gone a lot higher, never really gave it a workout. Wasn't even really taxing the hardware that much for either workload.

        Before that, it was on a mediocre-even-at-the-time dedicated-cores VM. That caused performance problems... because its Internet "pipe" was straw-sized, it turned out. The server itself was fine.

        Web server performance has regressed amazingly badly in the world of the Cloud. Even "serious" sites have decided the performance equivalent of shitty shared-host Web hosting is a great idea and that introducing all the problems of distributed computing at the architecture level will help their moderate-traffic site work better (LOL; LMFAO), so now they need Cloudflare and such just so their "scalable" solution doesn't fall over in a light breeze.

  • timeon 5 days ago

    If it is unintentional DDoS, we can wait. Not everything needs to be on demand.

    • dewey 5 days ago

      The website is built to get attention, the attention is here right now. Nobody will remember to go back tomorrow and read the site again when it’s available.

      • BlueTemplar 4 days ago

        I'm not sure an open web can exist under this kind of assumption...

        Once you start chasing views, it's going to come at the detriment of everything else.

        • dewey 4 days ago

          This happened at least 15 years ago and we are doing okay.

drudru 4 days ago

I noticed this a while back when I was trying to read Cloudflare's own blog. Periodically they would block my newsreader. I ended up just dropping their feed.

I am glad to see other people calling out the problem. Hopefully, a solution will emerge.

est 4 days ago

Hmmm, that's why "feedburner" is^H^Hwas a thing, right?

We have come full circle.

  • kevincox 4 days ago

    Yeah, this is the recommendation that I usually give people who reach out to support. FeedBurner tends to be on the whitelists, which avoids this problem.

renewiltord 4 days ago

Ah, the Cloudflare free plan does not automatically turn these on. I know since I use it for some small things and don't have these on. I wouldn't use User-Agent filtering because those are spoofable. But putting feeds on a separate URL is probably a good idea. Right now the feed is actually generated on request for these sites, so caching it is probably a good idea anyway. I can just rudimentarily do that by periodically generating and copying it over.

015a 4 days ago

Suggesting that website operators should allowlist RSS clients through the Cloudflare bot detection system via their user-agent is a rather concerning recommendation.

soraminazuki 5 days ago

This is an issue with techdirt.com. I contacted them about this through their feedback form a long time ago, but the issue still remains unfortunately.

nfriedly 4 days ago

Liliputing.com had this problem a couple of years ago. I emailed the author and he got it sorted out after a bit of back and forth.

hwj 5 days ago

I had problems accessing Cloudflare-hosted websites via the Tor browser too. Don't know if that is still true.

timnetworks 4 days ago

RSS is the future that has been kept from us for twenty years already; fusion can kick bricks.

3np 4 days ago

Also: Sign in on gitlab.com is broken for me on Tor Browser because of an infinite "Verify you are human" refresh/redirect loop...

hkt 4 days ago

It also manages to break IRC bots that do things like show the contents of the title tag when someone posts a link. Another cloudy annoyance, albeit a minor one.

qwertyuiop_ a day ago

I have always suspected Cloudflare of being a classic intelligence community op. Just like Google was funded by qinetq.

dewey 5 days ago

I’m using Miniflux and I always run into that on a few blogs, which I’ve now just stopped reading.

shaunpud 4 days ago

Namesilo is the same: their CSV/RSS is behind Cloudflare, so I don't even bother with their auctions anymore, and their own interface is meh.

anilakar 4 days ago

...and there is a good number of people who see this as a feature, not a bug.

idunnoman1222 4 days ago

Yes, the way to retain your privacy is to not use the Internet

if you don’t like it, make your own Internet: assumedly one not funded by ads