Using Cloudflare on your website could be blocking RSS users

559 points by campuscodi a year ago

conesus a year ago

I run NewsBlur[0] and I’ve been battling this issue of NewsBlur fetching 403s across the web for months now. My users are revolting and asking for refunds. I’ve tried emailing dozens of site owners and publishers and only two of them have done the work of whitelisting their RSS feed. It’s maddening and is having a real negative effect on NewsBlur.

NewsBlur is an open-source RSS news reader (full source available at [1]), something we should all agree is necessary to support the open web! But Cloudflare blocking all of my feed fetchers is bizarre behavior. And we’re on the verified bots list for years, but it hasn’t made a difference.

Let me know what I can do. NewsBlur publishes a list of IPs that it uses for feed fetching that I've shared with Cloudflare but it hasn't made a difference.

I'm hoping Cloudflare uses the IP address list that I publish and adds them to their allowlist so NewsBlur can keep fetching (and archiving) millions of feeds.

[0]: https://newsblur.com

[1]: https://github.com/samuelclay/NewsBlur

srik a year ago

RSS is an essential component to modern web publishing and it feels scary to see how one company’s inconsideration might harm its already fragile future. One day cloudflare will get big enough to be subject to antitrust regulation and this instance will be a strong data point working against them.
- immibis a year ago
  
  It's not one company - it's an individual decision of every blog operator to block their own readers by signing up for cloudflare.
- 01HNNWZ0MV43FF a year ago
  
  It's not essential, I don't know anyone in real life who uses it.
  I run an RSS feed on my blog out of principle and I don't bother reading other feeds I'm subscribed to
  When I'm bored I come here, I go on Mastodon, and gods save me, I go on Reddit
  
  djhn a year ago
  
  Podcasts are based on RSS and a lot of people listen to podcasts.
  
  01HNNWZ0MV43FF a year ago
  
  Outside of Spotify and YouTube?
AyyEye a year ago

Three consenting parties trying to use their internet blocked by a single intermediary that's too big to care is just gross. It's the web we deserve.
- eddythompson80 a year ago
  
  > Three consenting parties
  Clearly they are not 100% consenting, or at best one of them (the content publisher) is misconfiguring/misunderstanding their setup. They enabled RSS on their service, then setup a rule to require human verification for accessing that RSS feed.
  It's like a business advertising a singles only area, then hiring a security company and telling them to only allow couples in the building.
  
  AyyEye a year ago
  
  If Cloudflare was honest and upfront about the tradeoffs being made and the fact that it's still going to require configuration and maintenance work they'd have significantly less customers.
p4bl0 a year ago

I've been a paying NewsBlur user since the downfall of Google Reader and I'm very happy with it. Thank you for NewsBlur!
brightball a year ago

I use Cloudflare and have home built RSS feeds on my site. If you've run into any issues on mine, I'll be happy to look into them.
https://www.brightball.com/
miohtama a year ago

Thank you for the hard work.
Newsblur was the first SaaS I could afford as a student. I have been subscriber for something like 20 years now. And I will keep doing it to the grave. Best money ever spent.
- p4bl0 a year ago
  
  > I have been subscriber for something like 20 years now.
  NewsBlur is "only" 15 years old (and GReader was there up until 11 years ago).
renaissancec a year ago

Can't recommend Newsblur enough. I have been a customer since Fastladder was shut down. I love their integration of being able to use pinboard.in within the web interface to bookmark articles. An essential part of my web productivity flow.
hedora a year ago

Maybe pay for residential proxy network access?
I used to get my internet from a small local ISP, and ip blacklisting basically means no one in our zipcode could have reliable internet.
These days, the 10-20% of us with an unobstructed sky view switched to starlink and didn’t look back.
The thing is, both ISPs use CGNAT, but there’s no way cloudflare is going to block Musk like they do the mom and pop shop.
Anyway, apparently residential proxy networks work pretty well if you hit a spurious ip block. I’ve had good luck with apple private relay too.
I’m hoping service providers realize how useless and damaging ip blocking is to their reputations, but I’m not holding my breath. Sometimes I think the endgame is just routing 100% of residential traffic through 8.8.8.8.
wooque a year ago

You just bypass it with library like cloudscraper/hrequests.

kevincox a year ago

I dislike advice of whitelisting specific readers by user-agent. Not only is this endless manual work that will only solve the problem for a subset of users but it also is easy to bypass by malicious actors. My recommendation would be to create a page rule that disables bot blocking for your feeds. This will fix the problem for all readers with no ongoing maintenance.

If you are worried about DoS attacks that may hammer on your feeds then you can use the same configuration rule to ignore the query string for cache keys (if your feed doesn't use query strings) and overriding the caching settings if your server doesn't set the proper headers. This way Cloudflare will cache your feed and you can serve any number of visitors without putting load onto your origin.

As for Cloudflare fixing the defaults, it seems unlikely to happen. It has been broken for years, Cloudflare's own blog is affected. They have been "actively working" on fixing it for at least 2 years according to their VP of product: https://news.ycombinator.com/item?id=33675847

benregenspan a year ago

AI crawlers have changed the picture significantly and in my opinion are a much bigger threat to the open web than Cloudflare. The training arms race has drastically increased bot traffic, and the value proposition behind that bot traffic has inverted. Previously many site operators could rely on the average automated request being net-beneficial to the site and its users (outside of scattered, time-limited DDoS attacks) but now most of these requests represent value extraction. Combine this with a seemingly related increase in high-volume bots that don't respect robots.txt and don't set a useful User-Agent, and using a heavy-handed firewall becomes a much easier business decision, even if it may target some desirable traffic (like valid RSS requests).
vaylian a year ago

I don't know if cloudflare offers it, but whitelisting the URL of the RSS feed would be much more effective than filtering user agents.
- derkades a year ago
  
  Yes it supports it, and I think that's what the parent comment was all about
  
  BiteCode_dev a year ago
  
  Specifically, whitelisting the URL for the bot protection, but not the cache, so that you are still somewhat protected against adversarial use.
  
  londons_explore a year ago
  
  An adversary can easily send no-cache headers to bust the cache.
  
  acdha a year ago
  
  The CDN can choose whether to honor those. That hasn’t been an effective adversarial technique since the turn of the century.
  
  londons_explore a year ago
  
  does cloudflare give such an option? Even for non-paid accounts?
  
  acdha a year ago
  
  They ignore request cache control headers, I believe unconditionally so you’d have to disable caching for the endpoints which clients are allowed to request uncached.
- jks a year ago
  
  Yes, you can do it with a "page rule", which the parent comment mentioned. The CloudFlare free tier has a budget of three page rules, which might mean that you have to bundle all your rss feeds in one folder so they share a path prefix.
a-french-anon a year ago

And for those of us using sfeed, the default UA is Curl's.

wenbin a year ago

At Listen Notes, we rely heavily on Cloudflare to manage and protect our services, which cater to both human users and scripts/bots.

One particularly effective strategy we've implemented is using separate subdomains for services designed for different types of traffic, allowing us to apply customized firewall and page rules to each subdomain.

For example:

- www. listennotes.com is dedicated to human users. E.g., https://www.listennotes.com/podcast-realtime/

- feeds. listennotes.com is tailored for bots, providing access to RSS feeds. Eg., https://feeds.listennotes.com/listen/wenbin-fangs-podcast-pl...

- audio. listennotes.com serves both humans and bots, handling audio URL proxies. E.g., https://audio.listennotes.com/e/p/1a0b2d081cae4d6d9889c49651...

This subdomain-based approach enables us to fine-tune security and performance settings for each type of traffic, ensuring optimal service delivery.

kevindamm a year ago

Where do you put your sitemap (or its equivalent)? Looking at the site, I don't notice one in the metadata but I do see a "site index" on the www subdomain, though possibly that's intended for humans not bots? I think the usual recommendation is to have a sitemap per subdomain and not mix them, but clearly they're meant for bots not humans...
- wenbin a year ago
  
  Great question.
  We only need to provide the sitemap (with custom paths, not publicly available) in a few specific places, like Google Search Console. This means the rules for managing sitemaps are quite manageable. It’s not a perfect setup, but once we configure it, we can usually leave it untouched for a long time.

amatecha a year ago

I get blocked from websites with some regularity, running Firefox with strict privacy settings, "resist fingerprinting" etc. on OpenBSD. They just give a 403 Forbidden with no explanation, but it's only ever on sites fronted by CloudFlare. Good times. Seems legit.

wakeupcall a year ago

Also running FF with strict privacy settings and several blockers. The annoyances are constantly increasing. Cloudflare, captchas, "we think you're a bot", constantly recurring cookie popups and absurd requirements are making me hate most of the websites and services I hit nowdays.
I tried for a long time to get around it, but now when I hit a website like this just close the tab and don't bother anymore.
- afh1 a year ago
  
  Same, but for VPN (either corporate or personal). Reddit blocks it completely, requires you to sign-in but even the sign-in page is "network restricted"; LinkedIn shows you a captcha but gives an error when submitting the result (several reports online); and overall a lot of 403's. All go magically away when turning off the VPN. Companies, specially adtechs like Reddit and LinkedIn, do NOT want you to browse privately, to the point they rather you don't use their website at all unless without a condom.
  
  acdha a year ago
  
  > Companies, specially adtechs like Reddit and LinkedIn, do NOT want you to browse privately, to the point they rather you don't use their website at all unless without a condom.
  That’s true in some cases, I’m sure, but also remember that most site owners deal with lots of tedious abuse. For example, some people get really annoyed about Tor being blocked but for most sites Tor is a tiny fraction of total traffic but a fairly large percentage of the abuse probing for vulnerabilities, guessing passwords, spamming contact forms, etc. so while I sympathize for the legitimate users I also completely understand why a busy site operator is going to flip a switch making their log noise go down by a double-digit percentage.
  
  rolph a year ago
  
  funny thing, when FF is blocked i can get through with TOR.
  
  mmooss a year ago
  
  With what browser? The same one that's blocked?
  
  Adachi91 a year ago
  
  > Reddit blocks it completely, requires you to sign-in but even the sign-in page is "network restricted";
  I've been creating accounts every time I need to visit Reddit now to read a thread about [insert subject]. They do not validate E-Mail, so I just use `example@example.com`, whatever random username it suggests, and `example` as a password. I've created at least a thousand accounts at this point.
  Malicious Compliance, until they disable this last effort at accessing their content.
  
  zargon a year ago
  
  They verify signup emails now. At least for me.
  
  hombre_fatal a year ago
  
  Most subreddits worth posting on usually have a minimum account age + minimum account karma. I've found it annoying to register new accounts too often.
  
  immibis a year ago
  
  I've created a few thousand accounts through a VPN (random node per account). After doing that, I found out Reddit accounts created through VPNs are automatically shadow banned the second time they comment (I think the first is also shadow deleted in some way). But they allow you to browse from a shadow banned account just fine.
  
  anthk a year ago
  
  For Reddit I just use it r/o under gopher://gopherddit.com
  A good client it's either Lagrange (multiplatform), the old Lynx or Dillo with the Gopher plugin.
  
  appendix-rock a year ago
  
  I don’t follow the logic here. There seems to be an implication of ulterior motive but I’m not seeing what it is. What aspect of ‘privacy’ offered by a VPN do you think that Reddit / LinkedIn are incentivised to bypass? From a privacy POV, your VPN is doing nothing to them, because your IP address means very little to them from a tracking POV. This is just FUD perpetuated by VPN advertising.
  However, the undeniable reality is that accessing the website with a non-residential IP is a very, very strong indicator of sinister behaviour. Anyone that’s been in a position to operate one of these services will tell you that. For every…let’s call them ‘privacy-conscious’ user, there are 10 (or more) nefarious actors that present largely the same way. It’s easy to forget this as a user.
  I’m all but certain that if Reddit or LinkedIn could differentiate, they would. But they can’t. That’s kinda the whole point.
  
  bo1024 a year ago
  
  Not following what could be sinister about a GET request to a public website.
  > From a privacy POV, your VPN is doing nothing to them, because your IP address means very little to them from a tracking POV.
  I disagree. (1) Since I have javascript disabled, IP address is generally their next best thing to go on. (2) I don't want to give them IP address to correlate with the other data they have on me, because if they sell that data, now someone else who only has my IP address suddenly can get a bunch of other stuff with it too.
  
  hombre_fatal a year ago
  
  At the very least, they're wasting bandwidth to a (likely) low quality connection.
  But anyone making malicious POST requests, like spamming chatGPT comments, first makes GET requests to load the submission and find comments to reply to. If they think you're a low quality user, I don't see why they'd bother just locking down POSTs.
  
  zahllos a year ago
  
  SQL injection?
  Get parameters can be abused like any parameter. This could be sql, could be directory traversal attempts, brute force username attempts, you name it.
  
  kam a year ago
  
  If your site is vulnerable to SQL injection, you need to fix that, not pretend Cloudflare will save you.
  
  zahllos a year ago
  
  Obviously. But I was responding to "what is sinister about a GET request". To put it a slightly different way, it does not matter so much whether the request is a read or a write. For example DNS amplfication attacks work by asking a DNS server (read) for a much larger record than the request packet requires, and faking the request IP to match the victim. That's not even a connection the victim initiated, but that packet still travels along the network path. In fact, if it crashes a switch or something along the way, that's just as good from the point of view of the attacker, maybe even better as it will have more impact.
  I am absolutely not a fan of all these "are you human?" checks at all, doubly so when ad-blockers trigger them. I think there are very legitimate reasons for wanting to access certain sites without being tracked - anything related to health is an example.
  Maybe I should have made a more substantive comment, but I don't believe this is as simple a problem as reducing it to request types.
  
  homebrewer a year ago
  
  It's equally easy to forget about users from countries with way less freedom of speech and information sharing than in Western rich societies. These anti-abuse measures have made it much more difficult to access information blocked by my internet provider during the last few years. I'm relatively competent and can find ways around it, but my friends and relatives who pursue other career choices simply don't bother anymore.
  Telegram channels have been a good alternative, but even that is going downhill thanks to French authorities.
  Cloudflare and Google also often treat us like bots (endless captchas, etc) which makes it even more difficult.
  
  afh1 a year ago
  
  IP address is a fingerprint to be shared with third parties, of course it's relevant. It's not ulterior motive, it's explicit, it's not caring about your traffic because you're not good product. They can and do differentiate by requiring a sign-in. They just don't care enough to make it actually work. Because they are adtechs and not interested in you as a user.
  
  miki123211 a year ago
  
  > For every…let’s call them ‘privacy-conscious’ user, there are 10 (or more) nefarious actors that present largely the same way.
  And each one of these could potentially create thousands of accounts, and do 100x as many requests as a normal user would.
  Even if only 1% of the people using your service are fraudsters, a normal user has at most a few accounts, while fraudsters may try to create thousands per day. This means that e.g. 90% of your signups are fraudulent, despite the population of fraudsters being extremely small.
  
  ruszki a year ago
  
  Was anybody stopped to do nefarious actions by these annoyances?
  It's like at my current and previous companies. They make a lot of security restrictions. The problem is, if somebody wants to get data out, they can get out anytime (or in). Security department says that it's against "accidental" leaks. I'm still waiting a single instance when they caught an "accidental" leak, and they are just not introducing extra steps, when at the end I achieve the exact same thing. Even when I caused a real potential leak, nobody stopped me to do it. The only reason why they have these security services/apps is to push responsibility to other companies.
- anilakar a year ago
  
  Heck, I cannot even pass ReCAPTCHA nowadays. No amount of clicking buses, bicycles, motorcycles, traffic lights, stairs, crosswalks, bridges and fire hydrants will suffice. The audio transcript feature is the only way to get past a prompt.
  
  josteink a year ago
  
  Just a heads up that this is how Google treat connections it suspects to originate from bots. Silently keeping you in an endless loop promising reward if you can complete it correctly.
  I discovered this when I set up IPv6 using hurricane electric as a tunnel broker for IPv6 connectivity.
  Seemingly Google has all HEnet IPv6tunnel subnets listed for such behaviour without it being documented anywhere. It was extremely annoying until I figured out what was going on.
  
  n4r9 a year ago
  
  > Silently keeping you in an endless loop promising reward if you can complete it correctly.
  Sounds suspiciously like how product managers talk to developers as well.
  
  anilakar a year ago
  
  Sadly my biggest crime is running Firefox with default privacy settings and uBlock Origin installed. No VPNs or IPv6 tunnels, no Tor traffic whatsoever, no Google search history poisoning plugins.
  If only there was a law that allowed one to be excluded from automatic behavior profiling...
  
  marssaxman a year ago
  
  There's a pho restaurant near where I work which wants you to scan a QR code at the table, then order and pay through their website instead of talking to a person. In three visits, I have not once managed to get past their captcha!
  (The actual process at this restaurant is to sit down, fuss with your phone a bit, then get up like you're about to leave; someone will arrive promptly to take your order.)
  
  eddythompson80 a year ago
  
  I’ve only seen that at Asian restaurants near a university in my city. When I asked I was told that this is a common way in China and they get a lot of international students who prefer/expect it that way.
- Terr_ a year ago
  
  The worst part is that a lot of it is mysteriously capricious with no recourse.
  Like, you visit Site A too often while blocking some javascript, and now Site B doesn't work for no apparent reason, and there's no resolution path. Worse, the bad information may become permanent if an owner uses it to taint your account, again with no clear reason or appeal.
  I suspect Reddit effectively killed my 10+ year account (appeal granted, but somehow still shadowbanned) because I once used the "wrong" public wifi to access it.
- lioeters a year ago
  
  Same here. I occasionally encounter websites that won't work with ad blockers, sometimes with Cloudflare involved, and I don't even bother with those sites anymore. Same with sites that display a cookie "consent" form without an option to not accept. I reject the entire site.
  Site owners probably don't even see these bounced visits, and it's such a tiny percentage of visitors who do this that it won't make a difference. Meh, it's just another annoyance to be able to use the web on our own terms.
  
  capitainenemo a year ago
  
  It's a tiny percentage of visitors, but a tech savvy one, and depending on your website, they could be a higher than average percentage of useful users or product purchasers. The impact could be disproportionate. What's frustrating is many websites don't even realise it is happening because the reporting from the intermediate (Cloudflare say) is inaccurate or incorrectly represents how it works. Fingerprinting has become integral to bot "protection". It's also frustrating when people think this can be drop in, and put it in front of APIs that are completely incapable of handling the challenge with no special casing (encountered on FedEx, GoFundMe), much like the RSS reader problem.
- orbisvicis a year ago
  
  I have to solve captchas for Amazon while logged into my Amazon account.
  
  m463 a year ago
  
  at one point I couldn't access amazon at night.
  I would get different captcha, one convoluted that wouldn't even load the required images.
  And I would get the oops sorry dog page for everything.
  I finally contacted amazon, gave them my (static) ip address and it was good.
  In other locations, I have to solve a 6-distorted-letter captcha to log in, but that's the extent of it.
  
  tenken a year ago
  
  Why?! ... I've had 404 pages on Amazon, but never a captcha...
- doctor_radium a year ago
  
  Hey, same here! For better or worse, I use Opera Mini for much of my mobile browsing, and it fares far worse than Firefox with uBlock Origin and ResistFingerprinting. I complained about this roughly a year ago on a similar HN thread, on which a Cloudflare rep also participated. Since then something changed, but both sides being black boxes, I can't tell if Cloudflare is wising up or Mini has stepped up. I still get the same challenge pages, but Mini gets through them automatically now, more often than not.
  But not always. My most recent stumbling block is https://www.napaonline.com. Guess I'm buying oxygen sensors somewhere else.
- SoftTalker a year ago
  
  Same. If a site doesn't want me there, fine. There's no website that's so crucial to my life that I will go through those kinds of contortions to access it.
- JohnFen a year ago
  
  > when I hit a website like this just close the tab and don't bother anymore.
  Yeah, that's my solution as well. I take those annoyances as the website telling me that they don't want me there, so I grant them their wish.
  
  immibis a year ago
  
  That's fine. You were an obstacle to their revenue gathering anyway.
- amanda99 a year ago
  
  Yes and the most infuriating thing is the "we need to verify the security of your connection" text.
BiteCode_dev a year ago

Cloudflare is a fantastic service with an unmatched value proposition, but it's unfortunately slowly killing web privacy, with 1000s paper cuts.
Another problem is "resist fingerprinting" prevents some canvas processing, and many websites like bluesky, linked in or substack uses canvas to handle image upload, so your images appear to be stripes of pixel.
Then you have mobile apps that just don't run if you don't have a google account, like chatgpt's native app.
I understand why people give up, trying to fight for your privacy is an uphill battle with no end in sight.
- madeofpalk a year ago
  
  > Then you have mobile apps that just don't run if you don't have a google account, like chatgpt's native app.
  Is that true? At least on iOS you can log into the ChatGPT with same email/password as the website.
  I never use Google login for stuff and ChatGPT works fine for me.
  
  BiteCode_dev a year ago
  
  See other comment.
- pjc50 a year ago
  
  The privacy battle has to be at the legal layer. GDPR is far from perfect (bureaucratic and unclear with weak enforcement), but it's a step in the right direction.
  In an adversarial environment, especially with both AI scrapers and AI posters, websites have to be able to identify and ban persistent abusers. Which unfortunately implies having some kind of identification of everybody.
  
  nonameiguess a year ago
  
  No, it's more than that. Cloudflare's bot protection has blocked me from sites where I have a paid account, paid for by my real checking account with my real name attached. Even when I am perfectly willing to give out my identity and be tracked, I still can't because I can't even get to the login page.
  
  HappMacDonald a year ago
  
  They block such visits because their pragma suspects that your visit is the account of a real human that was hacked by a bot.
  
  wbl a year ago
  
  You notice that Analogue Devices puts their (incredibly useful) information up for free. That's because they make money other ways. Ad supported content farm Internet had a nice run but we will get on without it.
  
  BiteCode_dev a year ago
  
  That's another problem, we want cheap easy solutions like tracking people, instead of more targetteed or systemic ones.
  
  Gormo a year ago
  
  > The privacy battle has to be at the legal layer.
  I couldn't disagree more. The way to protect privacy is to make privacy the standard at the implementation layer, and to make it costly and difficult to breach it.
  Trying to rely on political institutions without the practical and technical incentives favoring privacy will inevitably result in the political institutions themselves becoming the main instrument that erodes privacy.
  
  HappMacDonald a year ago
  
  Yet without regulation nothing stops large companies from simply changing the implementation layer for one that pads their bottom line better, or just rebuild it from scratch.
  If people who valued privacy really controlled the implementation layer we wouldn't have gotten to this point in the first place.
  
  Gormo a year ago
  
  The point we're at is one in which privacy is still attainable via implementation-layer measures, even if it requires investing some effort and making some trade-offs to sustain. The alternative -- placing trust in regulation, which never works in the long run -- will inevitably result in regulatory capture that eliminates those remaining practical measures and replaces them with, at best, a performative illusion.
- KomoD a year ago
  
  > Then you have mobile apps that just don't run if you don't have a google account, like chatgpt's native app.
  That's not true, I use ChatGPT's app on my phone without logging into a Google account.
  You don't even need any kind of account at all to use it.
  
  BiteCode_dev a year ago
  
  On Android at least, even if you don't need to log in to your google account when connecting to chatgpt, the app won't work if your phone isn't signed in into google play, which doesn't work if your phone isn't linked to a google account.
  An android phone asks you to link a google account when you use it for the first time. It takes a very dedicated user to refuse that, then to avoid logging in into the gmail, youtube or app store apps which will all also link your phone to your google account when you sign in.
  But I do actively avoid this, I use Aurora, F-droid, K9 and NewPipeX, so no link to google.
  But then no ChatGPT app. When I start it, I get hit with a logging page to the app store and it's game over.
  
  __MatrixMan__ a year ago
  
  I have a similar experience with the pager duty app. It loads up and then exits with "security problem detected by app" because I've made it more secure by isolating it from Google (a competitor). Workaround is to just control it via slack instead.
  
  BiteCode_dev a year ago
  
  Well you can use the web base chagpt so there is a workaround. Except it's worse a worse experience.
  
  ForHackernews a year ago
  
  You might like: https://e.foundation/e-os/
  
  BiteCode_dev a year ago
  
  That won't make chatgpt's app work thought.
  
  ForHackernews a year ago
  
  It might well do, depending on what ChatGPT's app is asking the OS for. /e/OS is an Android fork that removes Google services and replaces them with open source stubs/re-implementations from https://microg.org/
  I haven't tried the ChatGPT app, but I know that, for example my bank and other financial services apps work with on-device fingerprint authentication and no Google account on /e/OS.
  
  BiteCode_dev a year ago
  
  I already have microg installed.
  
  acdha a year ago
  
  So the requirement is to pass the phone’s system validation process rather than having a Google account. I don’t love that but I can understand why they don’t want to pay the bill for the otherwise ubiquitous bots, and it’s why it’s an Android-specific issue.
  
  BiteCode_dev a year ago
  
  You can make a very rational case for each privacy invasive technical decision ever made.
  In the end, the fact remain: no chatgpt app without giving up your privacy, to google none the less.
  
  acdha a year ago
  
  “Giving up your privacy” is a pretty sweeping claim – it sounds like you’re saying that Android inherently leaks private data to Google, which is broader than even Apple fans tend to say.
  
  michaelt a year ago
  
  A person who was maximally distrustful of Google would assume they link your phone and your IP through the connection used to receive push notifications, and the wifi-network-visibility-to-location API, and the software update checker, and the DNS over HTTPS, and suchlike. As a US company, they could even be forced to do this in secret against their will, and lie about it.
  Of course as Google doesn't claim they do this, many people would consider it unreasonably fearful/cynical.
  
  acdha a year ago
  
  Sure, but that says you shouldn’t have a phone, not that ChatGPT is forcing you to give up your privacy.
  
  ForHackernews a year ago
  
  > it sounds like you’re saying that Android inherently leaks private data to Google, which is broader than even Apple fans tend to say.
  Yes? I mean, not "leaks" - it's designed to upload your private data to Google and others.
  https://www.tcd.ie/news_events/articles/study-reveals-scale-...
  > Even when minimally configured and the handset is idle, with the notable exception of e/OS, these vendor-customised Android variants transmit substantial amounts of information to the OS developer and to third parties such as Google, Microsoft, LinkedIn, and Facebook that have pre-installed system apps. There is no opt-out from this data collection.
  
  BiteCode_dev a year ago
  
  Google and Apple were both part of the PRISM program, of course I'm making this claim.
  That's the opposite stance that would be bonkers.
  
  acdha a year ago
  
  PRISM covered communications through U.S. company’s servers. It was not a magic back door giving them access to your device’s local data, and even if you did believe that it was the answer would be not using a phone. A major intelligence agency does not need you to have a Google account so they can spy on you.
  
  BiteCode_dev a year ago
  
  Forest for the tree.
  Google and Apple are both heavily invested in ads (apple made 4.7 billion from ads in 2022), they have a track record of exfiltrating your data (remember contractors listening to your siri recordings?), of lying to the customers (remember the home button scandal on iPhone?), have control over a device that have your whole life yet runs partially on code you can't evaluate.
  Trusting those people makes no sense at all. You have a business relationship with them, that's it.
  
  acdha a year ago
  
  It’s interesting how each time you say something which isn’t accurate you try to distract by changing the topic.
  
  appendix-rock a year ago
  
  [dead]
neilv a year ago

Similar here. It's not unusual to be blocked from a site by CloudFlare when I'm running Firefox (either ESR or current release) on Linux.
I suspect that people operating Web sites have no idea how many legitimate users are blocked by CloudFlare.
And. based on the responses I got when I contacted two of the companies whose sites were chronically blocked by CloudFlare for months, it seemed like it wasn't worth any employee's time to try to diagnose.
Also, I'm frequently blocked by CloudFlare when running Tor Browser. Blocking by Tor exit node IP address (if that's what's happening) is much more understandable than blocking Firefox from a residential IP address, but still makes CloudFlare not a friend of people who want or need to use Tor.
- jorams a year ago
  
  > I suspect that people operating Web sites have no idea how many legitimate users are blocked by CloudFlare.
  I sometimes wonder if all Cloudflare employees are on some kind of whitelist that makes them not realize the ridiculous false positive rate of their bot detection.
- pjc50 a year ago
  
  > CloudFlare not a friend of people who want or need to use Tor
  The adversarial aspect of all this is a problem: P(malicious|Tor) is much higher than P(malicious|!Tor)
- johnklos a year ago
  
  I've had several discussions that were literally along the lines of, "we don't see what you're talking about in our logs". Yes, you don't - traffic is blocked before it gets to your servers!
- lovethevoid a year ago
  
  What are some examples? I've been running ff on linux for quite some time now and am rarely blocked. I just run it with ublock origin.
  
  capitainenemo a year ago
  
  Odds are they have Resist Fingerprinting turned on. When I use it in a Firefox profile I encounter this all over the place. Drupal, FedEx.. some sites handle it better than others. Some it's a hard block with a single terse error. Some it is a challenge which gets blocked due to using remote javascript. Some it's a local challenge you can get past. But it has definitely been getting worse. Fingerprinting is being normalised, and the excuse of "bot protection" (bots can make unique fingerprints too, though) means that it can now be used maliciously (or by ad networks like google, same diff) as a standard feature.
  
  lovethevoid a year ago
  
  I also use Mullvad Browser (a browser based on Firefox), and it supports resisting fingerprinting without any of those blocks. Tried it on Drupal and Fedex. Loads Cloudflare sites normally.
  I'm guessing if it's really Resist Fingerprinting on Firefox (something Mullvad also has on by default), then there are other settings that aren't being enabled causing the issue. Mullvad actually lists the settings related to resisting fingerprinting here - https://mullvad.net/en/browser/hard-facts
  
  capitainenemo a year ago
  
  Or it could simply be that since it is on by default for Mullvad, that Cloudflare and others have an explicit exception built in for it. It might also be dependent on where traffic is coming from. I have had different behaviour with different ISPs. Perhaps your entire VPN network gets a pass due to, perhaps depending on how they manage abuse, or how much unique information they can get just based on the few bits of info the browser leaks combined with the uniqueness of the browser and VPN connection IPs.
- amatecha a year ago
  
  Yeah, I've contacted numerous owners of personal/small sites and they are usually surprised, and never have any idea why I was blocked (not sure if it's an aspect of CF not revealing the reason, or the owner not knowing how to find that information). One or two allowlisted my IP but that doesn't strike me as a solution.
  I've contacted companies about this and they usually just tell me to use a different browser or computer, which is like "duh, really?" , but also doesn't solve the problem for me or anyone else.
amatecha a year ago

Nice, today I found I'm blocked from subway.com, that's cool. Good bot detection, my brand new Debian Linux install with Firefox must be really suspicious.
mzajc a year ago

I randomize my User-Agent header and many websites outright block me, most often with no captcha and no useless error message.
The most egregious is Microsoft (just about every Microsoft service/page, really), where all you get is a "The request is blocked." and a few pointless identifiers listed at the bottom, purely because it thinks your browser is too old.
CF's captcha page isn't any better either, usually putting me in an endless loop if it doesn't like my User-Agent.
- pushcx a year ago
  
  Rails is going to make this much worse for you. All new apps include naive agent sniffing and block anything “old” https://github.com/rails/rails/pull/50505
  
  mzajc a year ago
  
  This is horrifying. What happened to simply displaying a "Your browser is outdated, consider upgrading" banner on the website?
  
  whoopdedo a year ago
  
  The irony being you can get around the block by pretending to be a bot.
  https://github.com/rails/rails/pull/52531
  
  hombre_fatal a year ago
  
  It does do that, though.
  https://github.com/rails/rails/pull/50505/files#diff-dce8d06...
  
  shbooms a year ago
  
  idk, even that seems too much to me, but maybe I'm just being too senstive.
  but like, why is it a website's job to tell me what browser version to use? unless my outdated browser is lacking legitmate functionality which is required by your website, just serve the page and be done with it.
  
  michaelt a year ago
  
  Back when the sun was setting on IE6, sites deployed banners that basically meant "We don't test on this, there's a good chance it's broken, but we don't know the specifics because we don't test with it"
  
  freedomben a year ago
  
  Wow. And this is now happening right as I've blacklisted google-chrome due to manifest v3 removal :facepalm:
  
  GoblinSlayer a year ago
  
  def blocked? user_agent_version_reported? && unsupported_browser? end
  well, you know what to do here :)
- charrondev a year ago
  
  Are you sending an actual random string as your UA or sending one of a set of actual user agents?
  You’re best off just picking real ones. We’ve got hit by a botnet sending 10k+ requests from 40 different ASNs with 1000s of different IPs. The only way we’re able to identify/block the traffic was excluding user agents matching some regex (for whatever reason they weren’t spoofing real user agents but weren’t sending actual ones either).
  
  RALaBarge a year ago
  
  I worked at an anti-spam email security company in the aughts, and we had a perl engine that would rip apart the MIME boundaries and measure everything - UA, SMTP client fingerprint headers, even the number of anchor or paragraph tags. A large combination of IF/OR evaluations with a regex engine did a pretty good job since the botnets usually don't bother to fully randomize or really opsec the payloads they are sending since it is a cannon instead of a flyswatter.
  
  kccqzy a year ago
  
  Similar techniques are known in the HTTP world too. There were things like detecting the order of HTTP request headers and matching them to known software, or even just comparing the actual content of the Accept header.
  
  miki123211 a year ago
  
  And then there's also TLS fingerprinting.
  Different browsers use TLS in slightly different ways, send data in a slightly different order, have a different set of supported extensions / algorithms etc.
  If your user agent says Safari 18, but your TLS fingerprint looks like Curl and not Safari, sophisticated services will immediately detect that something isn't right.
  
  mzajc a year ago
  
  I use the Random User-Agent Switcher[1] extension on Firefox. It does pick real agents, but some of them might show a really outdated browser (eg. Firefox 5X), which I assume is the reason I'm getting blocked.
  [1]: https://addons.mozilla.org/en-US/firefox/addon/random_user_a...
- lovethevoid a year ago
  
  Not sure a random UA extension is giving you much privacy. Try your results on coveryourtracks eff, and see. A random UA would provide a lot of identifying information despite being randomized.
  From experience, a lot of the things people do in hopes of protecting their privacy only makes them far easier to profile.
  
  mzajc a year ago
  
  coveryourtracks.eff.org is a great service, but it has a few limitations that apply here:
  - The website judges your fingerprint based on how unique it is, but assumes that it's otherwise persistent. Randomizing my User-Agent serves the exact opposite - a given User-Agent might be more unique than using the default, but I randomize it to throw trackers off.
  - To my knowledge, its "One in x browsers" metric (and by extension the "Bits of identifying information" and the final result) are based off of visitor statistics, which would likely be skewed as most of its visitors are privacy-conscious. They only say they have a "database of many other Internet users' configurations," so I can't verify this.
  - Most of the measurements it makes rely on javascript support. For what it's worth, it claims my fingerprint is not unique when javascript is disabled, which is how I browse the web by default.
  The other extreme would be fixing my User-Agent to the most common value, but I don't think that'd offer me much privacy unless I also used a proxy/NAT shared by many users.
  
  lovethevoid a year ago
  
  Randomizing to throw trackers off only works if you only ever visit sites once.
  But yes, without javascript a lot of tracking functions fail to operate. That is good for privacy, and EFF notes that on the site.
  You can fix your UA to a common value, it's about providing the least amount of identifying bits, and randomizing it just provides another bit to identify you by. Always remember: an absence of information is also valuable information!
  
  HappMacDonald a year ago
  
  I would just fingerprint you as "the only person on the internet who is scrambling their UA string" :)
pessimizer a year ago

Also, Cloudflare won't let you in if you forge your referer (it's nobody's business what site I'm coming from.) For years, you could just send the root of the site you were visiting, then last year somebody at Cloudflare flipped a switch and took a bite out of everyone's privacy. Now it's just endless reloading captchas.
- zamadatix a year ago
  
  Why go through that hassle instead of just removing the referer?
  
  bityard a year ago
  
  Lots of sites see an empty referrer and send you to their main page or marketing page. Which means you can't get anywhere else on their site without a valid referrer. They consider it a form of "hotlink" protection.
  (I'm not saying I agree with it, just that it exists.)
  
  zamadatix a year ago
  
  Fair and valid answer to my wording. Rewritten for what I meant to ask: "Why set referrer to the base of the destination origin instead of something like Referrer-Policy: strict-origin?". I.e. remove it completely for cross-origin instead of always making up that you came from the destination.
  Though what you mention does beg the question "is there really much privacy gain in that over using Referrer-Policy: same-origin and having referrer based pages work right?" I suppose so if you're randomizing your identity in an untrackable way for each connection it could be attractive... though I think that'd trigger being suspected as a bot far before the lack of proper same origin info :p.
- philsnow a year ago
  
  Ah, maybe this is what’s happening to me.. I use Firefox with uBlock origin, privacy badger, multi-account containers, and temporary containers.
  Whenever I click a link to another site, i get a new tab in either a pre-assigned container or else in a “tmpNNNN” container, and i think either by default or I have it configured to omit Referer headers on those new tab navigations.
DrillShopper a year ago

Maybe after the courts break up Amazon the FTC can turn its eye to Cloudflare.
- gjsman-1000 a year ago
  
  A. Do you think courts give a darn about the 0.1% of users that are still using RSS? We might as well care about the 0.1% of users who want the ability to set every website's background color to purple with neon green anchor tags. RSS never caught on as a standard to begin with, peaking at 6% adoption by 2005.
  B. Cloudflare has healthy competition with AWS, Akamai, Fastly, Bunny.net, Mux, Google Cloud, Azure, you name it, there's a competitor. This isn't even an Apple vs Google situation.
  
  HappMacDonald a year ago
  
  Cloudflare doesn't offer the same product suite as the other companies you mention, though. Cloudflare is primarily DDoS prevention while the others are primarily cloud hosting.
  And it is the DDoS prevention measures at issue here.
  
  gjsman-1000 a year ago
  
  Five years ago, you would’ve been right, but Cloudflare is very different now.
  Nowadays, Cloudflare has image compression and CDN services, video storage and delivery services, serverless compute with Workers, domain registration, (soon) container support with optional GPUs, durable objects (basically serverless storage), serverless SQL databases (D1), even an AWS S3 competitor with B2. They even have bespoke services like CloudFlare Tunnels - what’s AWS got that’s anything like it?
  Cloudflare is getting close to full-on AWS. At least, the parts most customers use. If they just added boring old VPSs, people would realize very quickly how full featured they are.
  As for DDoS mitigation - you’ve still got AWS Shield, Akamai, Azure, Radware, F5, even Oracle (Dyn) competing in that market. Unless you could show Cloudflare did illegal tying as a monopolist specifically to sell DDoS prevention, there’s no case.
anthk a year ago

Or any Dillo user, with a PSP User Agent which is legit for small displays.
anal_reactor a year ago

On my phone Opera Mobile won't be allowed into some websites behind CloudFlare, most importantly 4chan
- dialup_sounds a year ago
  
  4chan's CF config is so janky at this point it's the only site I have to use a VPN for.
Jazgot a year ago

My rss reader was blocked on kvraudio.com by cloudflare. This issue wasn't solved for months. I simply stopped reading anything on kvraudio. Thank you cloudflare!
KPGv2 a year ago

Reddit seems to do this to me (sometimes) when I use Zen browser. Switching over to Safari or Chrome and the site always works great.
kjkjadksj a year ago

Reddit has been bad about it as of late too
viraptor a year ago

I know it's not a solution for you specifically here, but if anyone has access to the CF enterprise plan, they can report specific traffic as non-bot and hopefully improve the situation. They need to have access to the "Bot Management" feature though. It's a shitty situation, but some of us here can push back a little bit - so do it if you can.
And yes, it's sad that the "make internet work again" is behind an expensive paywall..
- meeb a year ago
  
  The issue here is that RSS readers are bots. Obviously perfectly sensible and useful bots, but they’re not “real people using a browser”. I doubt you could get RSS readers listed on Cloudflare’s “good bots” list either which would allow them the default bot protection feature given they’ll all run off random residential IPs.
  
  j16sdiz a year ago
  
  They can't whitelist useragent, otherwise bot will pass just using agent spoofing.
  If you have enterprise plan, you can have custom rules including allowing by url
  
  sam345 a year ago
  
  Not sure if I get this.It seems to me an RSS reader is as much of a bot as a browser is for HTML. It just reads RSS rather than HTML.
  
  kccqzy a year ago
  
  The difference is that RSS readers usually do background fetches on their own rather than waiting for a human to navigate to a page. So in theory, you could just set up a crontab (or systemd timer) that simply xdg-open various pages on a schedule and not be treated as bots.
  
  viraptor a year ago
  
  I was responding to a person with Firefox issues, not RSS.
  I'm not sure either if RSS bots could be added to good bots, but if anyone has traffic from them, we can definitely try. (No high hopes though, given the responses I got from support so far)
jasonlotito a year ago

Cloudflare has always been a dumpster fire in usability. The number of times it would block me in that way was enough to make me seriously question anyones technical knowledge that used it. It's a dumpster fire. Friends don't let friend use Cloudflare. To me, it's like the Spirit airlines of CDNs.
Sure, tech wise it might work great, but from your users perspective: it's trash.
- immibis a year ago
  
  It's got the best vendor lock-in enshittification story - it's free - and that's all that matters.

jgrahamc a year ago

My email is jgc@cloudflare.com. I'd like to hear from the owners of RSS readers directly on what they are experiencing. Going to ask team to take a closer look.

kalib_tweli a year ago

There are email obfuscation and managed challenge script tags being injected into the RSS feed.
You simply shouldn't have any challenges whatsoever on an RSS feed. They're literally meant to be read by a machine.
- kalib_tweli a year ago
  
  I confirmed that if you explicitly set the Content-Type response header to application/rss+xml it seems to work with Cloudflare Proxy enabled.
  The issue here is that Cloudflare's content type check is naive. And the fact that CF is checking the content-type header directly needs to be made more explicit OR they need to do a file type check.
  
  londons_explore a year ago
  
  I wonder if popular software for generating RSS feeds might not be setting the correct content-type header? Maybe this whole issue could be mostly-fixed by a few github PR's...
  
  onli a year ago
  
  Correct might be debatable here as well. My blog for example sets Content-Type to text/xml, which is not exactly wrong for an RSS feed (after all, it is text and XML) and IIRC was the default back then.
  There were compatibility issues with other type headers, at least in the past.
  
  johneth a year ago
  
  I think the current correct content types are:
  'application/rss+xml' (for RSS)
  'application/atom+xml' (for Atom)
  
  londons_explore a year ago
  
  Sounds like a kind samaritan could write a scanner to find as many RSS feeds as possible which look like RSS/Atom and don't have these content types, then go and patch the hosting software those feeds use to have the correct content types, or ask the webmasters to fix it if they're home-made sites.
  As soon as a majority of sites use the correct types, clients can start requiring it for newly added feeds, which in turn will make webmasters make it right if they want their feed to work.
  
  onli a year ago
  
  Not even Cloudflares own blog uses those, https://blog.cloudflare.com/rss/, or am I getting a wrong content-type shown in my dev tools? For me it is `application/xml`. So even if `application/rss+xml` were the correct type by an official spec, it's not something to rely on if it's not used commonly.
  
  johneth a year ago
  
  I just checked Wikipedia and it says Atom's is 'application/atom+xml' (also confirmed in the IANA registry), and RSS's is 'application/rss+xml' (but it's not registered yet, and 'text/xml' is also used widely).
  'application/rss+xml' seems to be the best option though in my opinion. The '+xml' in the media type tells (good) parsers to fall back to using an XML parser if they don't understand the 'rss' part, but the 'rss' part provides more accurate information on the content's type for parsers that do understand RSS.
  All that said, it's a mess.
  
  kalib_tweli a year ago
  
  It wouldn't. It's the role of the HTTP server to set the correct content type header.
  
  djbusby a year ago
  
  The number of feeds with crap headers and other non-spec stuff going on; and loads of clients missing useful headers. Ugh. It seems like it should be simple; maybe that's why there are loads of naive implementations.
  
  Klonoar a year ago
  
  Quite a few feeds out there use the incorrect type of text/xml, since it works slightly better in browsers by not prompting a download.
  Would not surprise me if Cloudflare lumps this in with text/html protections.
- o11c a year ago
  
  Even outside of RSS, the injected scripts often make internet security significantly worse.
  Since the user-agent has no way to distinguish scripts injected by cloudflare from scripts originating from the actual website, in order to pass the challenge they are forced to execute arbitrary code from an untrusted party. And malicious Javascript is practically ubiquitous on the general internet.
badlibrarian a year ago

Thank you for showing up here and being open to feedback. But I have to ask: shouldn't Cloudflare be running and reviewing reports to catch this before it became such a problem? It's three clicks in Tableau for anyone who cares, and clearly nobody does. And this isn't the first time something like this has slipped through the cracks.
I tried reaching out to Cloudflare with issues like this in the past. The response is dozens of employees hitting my LinkedIn page yet no responses to basic, reproduceable technical issues.
You need to fix this internally as it's a reputational problem now. Less screwing around using Salesforce as your private Twitter, more leadership in triage. Your devs obviously aren't motivated to fix this stuff independently and for whatever reason they keep breaking the web.
- 015a a year ago
  
  The reality that HackerNews denizens need to accept, in this case and in a more general form, is: RSS feeds are not popular. They aren't just unpopular in the way that, say, Peacock is unpopular relative to Netflix; they're truly unpopular, used regularly by a number of people that could fit in an american football stadium. There are younger software engineers at Cloudflare that have never heard the term "RSS" before, and have no notion of what it is. It will probably be dead technology in ten years.
  I'm not saying this to say its a good thing; it isn't.
  Here's something to consider though: Why are we going after Cloudflare for this? Isn't the website operator far, far more at-fault? They chose Cloudflare. They configure Cloudflare. They, in theory, publish an RSS feed, which is broken because of infrastructure decisions they made. You're going after Ryobi because you've got a leaky pipe. But beyond that: isn't this tool Cloudflare publishes doing exactly what the website operators intended it to do? It blocks non-human traffic. RSS clients are non-human traffic. Maybe the reason you don't want to go after the website operators is because you know you're in the wrong? Why can't these RSS clients detect when they encounter this situation, and prompt the user with a captive portal to get past it?
  
  badlibrarian a year ago
  
  I'm old enough to remember Dave Winer taking Feedburner to task for inserting crap into RSS feeds that broke his code.
  There will always be niche technologies and nascent standards and we're taking Cloudflare to task today because if they continue to stomp on them, we get nowhere.
  "Don't use Cloudflare" is an option, but we can demand both.
  
  gjsman-1000 a year ago
  
  "Old man yells at cloud about how the young'ns don't appreciate RSS."
  I mean that somewhat sarcastically; but there does come a point where the demands are unreasonable, the technology is dead. There are probably more people browsing with JavaScript disabled than using RSS feeds. There are probably more people browsing on Windows XP than using RSS feeds. Do I yell at you because your personal blog doesn't support IE6 anymore?
  
  badlibrarian a year ago
  
  Spotify and Apple Podcasts use RSS feeds to update what they show in their apps. And even if millions of people weren't dependent on it, suggesting that an infrastructure provider not fix a bug only makes the web worse.
  
  015a a year ago
  
  I'm not backing down on this one: This is straight up an "old man yelling at the kids to get off his lawn" situation, and the fact that JGC from Cloudflare is in here saying "we'll take a look at this" is so far and beyond what anyone reasonable would expect of them that they deserve praise and nothing else.
  This is a matter between You and the Website Operators, period. Cloudflare has nothing to do with this. This article puts "Cloudflare" in the title because its fun to hate on Cloudflare and it gets upvotes. Cloudflare is a tool. These website operators are using Cloudflare The Tool to block inhuman access to their websites. RSS CLIENTS ARE NOT HUMAN. Let me repeat that: Cloudflare's bot detection is working fully appropriately here, because RSS Clients are Bots. Everything here is working as expected. The part where change should be asked is: Website operators should allow inhuman actors past the Cloudflare bot detection firewall specifically for RSS feeds. They can FULLY DO THIS. Cloudflare has many, many knobs and buttons that Website Operators can tweak; one of those is e.g. a page rule to turn off bot detection for specific routes, such as `/feed.xml`.
  If your favorite website is not doing this, its NOT CLOUDFLARE'S FAULT.
  Take it up with the Website Operators, Not Cloudflare. Or, build an RSS Client which supports a captive portal to do human authorization. God this is so boring, y'all just love shaking your first and yelling at big tech for LITERALLY no reason. I suspect its actually because half of y'all are concerningly uneducated on what we're talking about.
  
  badlibrarian a year ago
  
  As part of proxying what may be as much as 20% of the web, Cloudflare injects code and modifies content that passes between clients and servers. It is in their core business interests to receive and act upon feedback regarding this functionality.
  
  015a a year ago
  
  Sure: Let's begin by not starting the conversation with "Don't use Cloudflare", as you did. That's obviously not only unhelpful, but it clearly points the finger at the wrong party.
  
  doctor_radium a year ago
  
  I get what you're saying, and on a philosophical level you're probably right. If a website owner misconfigures their CDN to the point of impeding legitimate traffic then they can fail like businesses do everyday. Survival of the fittest. But with the majority of web users apparently running stock Chrome, on a practical level the web still has to work. I went looking for car parts a number of months ago and was blocked/accosted by firewalls over 50% of the time. Not all Cloudflare-powered sites. There isn't enough time in the day to take every misconfigured site to task (unless you're Bowerick Wowbagger [1]), so I believe the solution will eventually have to be either an altruistic effort from Cloudflare or from government regulation.
  [1] https://www.wowbagger.com/chapter1.htm
  
  627467 a year ago
  
  What's does cloudflare do to search crawlers by default? Does it block them too?
viraptor a year ago

It's cool and all that you're making an exception here, but how about including a "no, really, I'm actually a human" link on the block page rather than giving the visitor a puzzle: how to report the issue to the page owner (hard on its own for normies) if you can't even load the page. This is just externalising issues that belong to the Cloudflare service.
- jgrahamc a year ago
  
  I am not trying to "make an exception", I'm asking for information external to Cloudflare so I can look at what people are experiencing and compare with what our systems are doing and figure out what needs to improve.
  
  PaulRobinson a year ago
  
  Some "bots" are legitimate. RSS is intended for machine consumption. You should not be blocking content intended for machine consumption because a machine is attempting to consume it. You should not expect a machine, consuming content intended for a machine, to do some sort of step to show they aren't a machine, because they are in fact a machine. There is a lot of content on the internet that is not used by humans, and so checking that humans are using it is an aggressive anti-pattern that ruins experiences for millions of people.
  It's not that hard. If the content being requested is RSS (or Atom, or some other syndication format intended for consumption by software), just don't do bot checks, use other mechanisms like rate limiting if you must stop abuse.
  As an example: would you put a captcha on robots.txt as well?
  As other stories here can attest to, Cloudflare is slowly killing off independent publishing on the web through poor product management decisions and technology implementations, and the fix seems pretty simple.
  
  jamespo a year ago
  
  From another post, if the content-type is correct it gets through. If this is the case I don't see the problem.
  
  Scramblejams a year ago
  
  It's a very common misconfiguration, though, because it happens by default when setting up CF. If your customers are, by default, configuring things incorrectly, then it's reasonable to ask if the service should surface the issue more proactively in an attempt to help customers get it right.
  As another commenter noted, not even CF's own RSS feed seems to get the content type right. This issue could clearly use some work.
  
  robertlagrant a year ago
  
  This is useful info: https://news.ycombinator.com/item?id=33675847
- doctor_radium a year ago
  
  I had a conversation with a web site owner about this once. There apparently is such a feature, a way for sites to configure a "Please contact us here if you're having trouble reaching our site" page...usage of which I assume Cloudflare could track and then gain better insight into these issues. The problem? It requires a Premium Plan.
- methou a year ago
  
  Some clients are more like a bot/service, imagine google reader that fetches and caches content for you. The client I’m currently using is miniflux, it also works in this way.
  I understand that there are some more interactive rss readers, but from personal experience it’s more like “hey I’m a good bot, let me in”
  
  _Algernon_ a year ago
  
  An rss reader is a user agent (ie. a software acting on behalf of its users). If you define rss readers as a bot (even if it is a good bot), you may as well call Firefox a bot (it also sends off web requests without explicit approval of each request by the browser).
  
  sofixa a year ago
  
  Their point was that the RSS reader does the scraping on its own in the background, without user input. If it can't read the page, it can't; it's not initiated by the user where the user can click on a "I'm not a bot, I promise" button.
  
  viraptor a year ago
  
  It was a mental skip, but the same idea. It would awesome if CF just allowed reporting issues at the point something gets blocked - regardless if it's a human or a bot. They're missing an "I'm misclassified" button for people actually affected without the third-party runaround.
  
  fluidcruft a year ago
  
  Unfortunately, I would expect that queue of reports to get flooded by bad faith actors.
  
  viraptor a year ago
  
  Sure, but now they say that queue should go to the website owner instead, who has less global visibility on the traffic. So that's just ignoring something they don't want to deal with.
is_true a year ago

Maybe when you detect urls that return the rss mimetype notify the owner of the site/CF account that it might be a good idea to allow bots on that urls.
Ideally you could make it a simple switch in the config, somethin like: "Allow automated access on RSS endpoints".
prmoustache a year ago

It is not only rss reader users that are affected. Any user with some extension to block trackers get regularly forbidden access to websites or have to deal with tons of captcha.
kevincox a year ago

I'll mail you as well but I think public discussion is helpful. Especially since I have seem similar responses to this over the years and it feels very disingenuous. The problem is very clear (Cloudflare serves 403 blocks to feed readers for no reason) you have all of the logs. The solution is maybe not trivial but I fail to see how the perspective of someone seeing a 403 block is going to help much. This just starts to sound like a way to seem responsive without actually doing anything.
From the feed reader perspective it is a 403 response. For example my reader has been trying to read https://blog.cloudflare.com/rss/ and the last successful response it got was on 2021-11-17. It has been backing off due to "errors" but it still is checking every 1-2 weeks and gets a 403 every time.
This obviously isn't limited to the Cloudflare blog, I see it on many site "protected by" (or in this case broken by) Cloudflare. I could tell you what public cloud IPs my reader comes from or which user-agent it uses but that is besides the point. This is a URL which is clearly intended for bots so it shouldn't be bot-blocked by default.
When people reach out to customer support we tell them that this is a bug for the site and there isn't much we can do. They can try contacting the site owner but this is most likely the default configuration of Cloudflare causing problems that the owner isn't aware of. I often recommend using a service like FeedBurner to proxy the request as these services seem to be on the whitelist of Cloudflare and other scraping prevention firewalls.
I think the main solution would be to detect intended-for-robots content and exclude it from scraping prevention by default (at least to a huge degree).
Another useful mechanism would be to allow these to be accessed when the target page is cachable, as the cache will protect the origin from overload-type DoS attacks anyways. Some care needs to be taken to ensure that adding a ?bust={random} query parameter can't break through to the origin but this would be a powerful tool for endpoints that need protection from overload but not against scraping (like RSS feeds). Unfortunately cache headers for feeds are far from universal, so this wouldn't fix all feeds on its own. (For example the Cloudflare blog's feed doesn't set any caching headers and is labeled as `cf-cache-status: DYNAMIC`.)
quinncom a year ago

Cloudflare-enabled websites have had this issue for years.[1] The problem is that website owners are not educated enough to understand that URLs meant for bots should not enable Cloudflare’s bot blocker.
Perhaps a solution would be for Cloudflare to have default page rules that disable bot-blocking features for common RSS feed URLs? Or pop-up a notice with instructions on how to create these page rules to users that appear to have RSS feeds on their website?
[1] Here is Overcast’s owner raising the issue in 2022: https://x.com/OvercastFM/status/1578755654587940865

erikrothoff a year ago

As the owner of an RSS reader I love that they are making this more public. 30% of our support requests are ”my feed doesn’t” work. It sucks that the only thing we can say is ”contact the site owner, it’s their firewall”. And to be fair it’s not only Cloudflare, so many different firewall setups cause issues. It’s ironic that a public API endpoint meant for bots is blocked for being a bot.

belkinpower a year ago

I maintain an RSS reader for work and Cloudflare is the bane of my existence. Tons of feeds will stop working at random and there’s nothing we can do about it except for individually contacting website owners and asking them to add an exception for their feed URL.

stanislavb a year ago

I was recently contacted by one of my website users as their RSS reader was blocked by Cloudflare.
sammy2255 a year ago

Unfortunately its not really Cloudflare but webadmins who have configured it to block everything thats not a browser, whether unknowingly or not
- afandian a year ago
  
  If Cloudflare offer a product, for a particular purpose, that breaks existing conventions of that purpose, then it’s Cloudflare.
  
  sammy2255 a year ago
  
  Not really. You wouldn’t complain to a fence company for blocking a path if there were hired to do exactly that
  
  shakna a year ago
  
  Yes, I would. Experts are expected to relay back to their client with their thoughts on a matter, not just blindly do as they're told. Your builder is meant to do their due diligence, which includes making recommendations.
  
  gsich a year ago
  
  They are enablers. They get part of the blame.
  
  echoangle a year ago
  
  Well it doesn’t break the conventions of the purpose they offer it for. Cloudflare attempts to block non-human users, and this is supposed to be used for human-readable websites. If someone puts cloudflare in front of a RSS feed, that’s user error. It’s like someone putting a captcha in front of an API and then complaining that the Captcha provider is breaking conventions.
- nirvdrum a year ago
  
  I contend this wasn’t an issue prior to Cloudflare making that an option. Sure, some IDS would block some users and geo blocks have been around forever. But, Cloudflare is so prolific and makes it so easy to block things inadvertently, that I don’t think they get a pass and blame the downstream user.
  It’s particularly frustrating that they give their own WARP service a pass. I’ve run into many sites that will block VPN traffic, including iCloud Privacy Relay, but WARP traffic goes through just fine.
foul a year ago

[flagged]
- account42 a year ago
  
  Ah yes, just wrap every protocol in HTTP to get through middle boxes. Just use chrome for all requests becaus fuck having a standard with different implementations. Next you're going to recommend to just automate a Windows PC through simulated mouse and keyboard input to deal with hardware attestation that these fuckers want to bring to the web.
  
  foul a year ago
  
  Not my fault if the whole world bought the "openness" bullshit and then built cable-TV-with-mouse.
  If that guy makes money with that and has an issue with the Great Firewall Of America, there's a (bad) solution.

elwebmaster a year ago

Using Cloudflare on your website could be blocking Safari users, Chrome users, or just any users. It’s totally broken. They have no way of measuring the false positives. Website owners are paying for it in lost revenue. And poor users who lose access for no fault of their own. Until some C-level exec at a BigTech randomly gets blocked and makes noise. But even then, Cloudflare will probably just whitelist that specific domain/IP. It is very interesting how I have never been blocked when trying to access Cloudflare itself, only blocked on their customer’s sites.

wraptile a year ago

Cloudflare has been the bane of my web existance on Thai IP and a Linux Firefox fingerprint. I wonder how much traffic is lost because of Cloudflare and of course none of that is reported to the web admins so everyone continues with their jolly ignorance.

I wrote my own RSS bridge that scrapes websites using Scrapfly web scraping API that bypasses all that because it's so annoying that I can't even scrape some company's /blog that they are literally buying ads for but somehow have an anti-bot enabled that blocks all RSS readers.

Modern web is so anti social that the web 2.0 guys should be rolling in their "everything will be connected with APIs" graves by now.

vundercind a year ago

The late '90s-'00s solution was to blackhole address blocks associated with entire countries or continents. It was easily worth it for many US sites that weren't super-huge to lose the the 0.1% of legitimate requests they'd get from, say, China or Thailand or Russia, to cut the speed their logs scrolled at by 99%.
The state of the art isn't much better today, it seems. Similar outcome with more steps.

whs a year ago

My company runs a tech news website. We offer RSS feed as any Drupal website would, which content farm just scrape our RSS feed to rehost our content in full. This is usually fine for us - the content is CC-licensed and they do post the correct source. But they run thousands of different WordPress instances on the same IP and they individually fetch the feed.

In the end we had to use Cloudflare to rate limit the RSS endpoint.

kevincox a year ago

> In the end we had to use Cloudflare to rate limit the RSS endpoint.
I think this is fine. You are solving a specific problem and still allowing some traffic. The problem with the Cloudflare default settings is that they block all requests leading to users failing to get any updates even when fetching the feed at a reasonable rate.
BTW in this case another solution may just be to configure proper caching headers. Even if you only cache for 5min at a time that will be at most 1 request every 5min per Cloudflare caching location (I don't know the exact configuration but typically use ~5 locations per origin, so that would be only 1req/min which is trivial load and will handle both these inconsiderate scrapers and regular users. You can also configure all fetches to come from a single location and then you would only need to actually serve the feed once per 5min)
yjftsjthsd-h a year ago

> In the end we had to use Cloudflare to rate limit the RSS endpoint.
Isn't the correct solution to use CF to cache RSS endpoints aggressively?
- whs a year ago
  
  We do both, but the enterprise plan isn't as unlimited as the self service plans so we need to limit them as well. (It's not a large site but the Cloudflare contract is for every affiliated companies - it is funny when we serve news of ongoing Cloudflare outages on Cloudflare Enterprise)

butz a year ago

Not "could" but it is actually blocking. Very annoying when government website does that, as usually it is next to impossible to explain the issue and ask for a fix. And even if the fix is made, it is reverted several weeks later. Other websites does that too, it was funny when one website was asking RSS reader to resolve captcha and prove they are human.

MarvinYork a year ago

In any case, it blocks German Telekom users. There is an ongoing dispute between Cloudflare and Telekom as to who pays for the traffic costs. Telekom is therefore throttling connections to Cloudflare. This is the reason why we can no longer use Cloudflare.

SSLy a year ago

as much as I am not a fan of cloudflare's practices, in this particular case DTAG seems to be the party at fault.
nisa a year ago

There is no dispute. Telekom is not peering on public exchanges and wants ransom as in expensive private ip-transit contracts from everyone. Their customers are used as a bargain for this. Recently Meta stopped playing that game and Cloudflare never did afaik. Telekom could solve a part of this problem with a few 100k€ and a few weeks time if they would peer at the bigger German exchanges. If every big ISP would act like them the Internet would be dead.

davidfischer a year ago

My employer, Read the Docs is a heavy user of Cloudflare. It's actually hard to imagine serving as much traffic as we do as cheaply as we can without them.

That said, for publicly hosted open source documentation, we turn down the security settings almost all the way. Security level is set to "essentially off" (that's the actual setting name), no browser integrity check, TOR friendly (onion routing on), etc. We still have rate limits in place but they're pretty generous (~4 req/s sustained). For sites that don't require a login and don't accept inbound leads or something like that, that's probably around the right level. Our domains where doc authors manage their docs have higher security settings.

That said, being too generous can get you into trouble so I understand why people crank up the settings and just block some legitimate traffic. See our past post where AI scrapers scraped almost 100TB (https://news.ycombinator.com/item?id=41072549).

mbo a year ago

This is an active issue with Rate Your Music right now: https://rateyourmusic.com/rymzilla/view?id=6108

Unfixed for 4 months.

veeti a year ago

I believe that disabling "Bot Fight Mode" is not enough, you may also need to create a rule to disable "Browser Integrity Check".

hugoromano a year ago

"could be blocking RSS users" it says it all "could". I use RSS on my websites, which are serviced by Cloudflare, and my users are not blocked. For that, fine-tuning and setting Configuration Rules at Cloudflare Dashboard are required. Anyone on a free has access to 10 Configuration Rules. I prefer using Cloudflare Workers to tune better, but there is a cost. My suggestion for RSS these days is to reduce the info on RSS feed to teasers, AI bots are using RSS to circumvent bans, and continue to scrape.

imartin2k a year ago

I’m happy to see that a post regarding the use of RSS gets so much attention on HN. It’s a good sign. As I basically live in my feed reader since 2007 or so, one of my greatest fears is the slow demise of RSS by way of reduced support of RSS feeds by websites owners.

pentagrama a year ago

Can you whitelists urls to be readead by bot on Cloudflare? Maybe this is a good solution, and there you can put your RSS feeds, sitemaps, and other content for bots. Also Cloudflare can make a dedicated fields to whitelists RSS and Sitemaps on the admin panel so users can discover more easily that they may don't want block those bots.

Can you whitelist URLs to be read by bots on Cloudflare? Maybe this is a good solution, where you as a site mantainer can include your RSS feeds, sitemaps, and other content for bots.

Also, Cloudflare could ship a feature by creating a dedicated section in the admin panel to let the user add and whitelist RSS feeds and sitemaps, making it easier (and educate) users to avoid blocking those bots who aren't a threat to your site, of course sill considering rules to avoid DDOS on this urls, like massive requests or stuff that common bots from RSS readers don't do.

pointlessone a year ago

I see this on a regular basis. My self-hosted RSS reader is blocked by Cloudflare even after my IP address was explicitly allowlisted by a few feed owners.

tandav a year ago

As an admin of my personal website, I completely disable all Cloudflare features and use it only for DNS and domain registration. I also stop following websites that use Cloudflare checks or cookie popups (cookies are fine, but the popups are annoying).

artooro a year ago

This is a truly problematic issue that I've experienced as well. The best solution is probably for Cloudflare to figure out what normal RSS usage looks like and have a provision for that in their bot detection.

ricardo81 a year ago

iirc even if you're listed as a "good bot" with Cloudflare, high security settings by the CF user can still result in 403s.

No idea if CF already does this, but allowing users to generate access tokens for 3rd party services would be another way of easing access alongside their apparent URL and IP whitelisting.

account42 a year ago

Or just normal human users with a niche browser like Firefox.

prmoustache a year ago

I believe this also pose issues to people running adblockers. I get tons of repetitive captchas on some websites.

Also other companies offering similar services like imperva seems to be straight banning my ip after one visit to a website with uBlock Origin I first get a captcha, then a page saying I am not allowed, and whatever I do, even using an extensionless chrome browser with a new profile I can't visit it anymore because my ip is banned.

acdha a year ago

One thing to keep in mind is that the modern web sees a lot of spam and scraping, and ad revenue has been sliding for years. If you make your activity look like a not, most operators will assume you’re not generating revenue and block you. It sucks but thank a spammer for the situation.
- immibis a year ago
  
  They should provide an API if they don't like scraping, but also, any sane scraper isn't really a problem, unless you are trying to enshittify your site by forcing people to use your app. I heard some AI scrapers are insane, and should be individually blocked.
  
  acdha a year ago
  
  “Sane scraper” doesn’t have a definition or anyone to enforce it. Similarly, APIs aren’t magic - if you make things publicly available, people will harvest it whether that’s with a 90s-style bot making individual requests or a headless browser which runs the JavaScript you use to make API calls.
  The other thing to think about is the lack of enforcement: you can’t complain to the bot police when some dude in China decides to harvest your data, and if you try blocking by user-agent or IP you’ll play whack-a-mole trying to stay ahead of the bot operators who will spoof the former and churn the latter. After developing an appreciation for why security people talk about validating correctness rather than trying to enumerate badness, you’ll end up with a combination of rate-limiting and broader blocking for the same reasons. Yes, it’s no fun but the problem isn’t the sites but the people abusing the free services we’ve been given.
  
  Klonoar a year ago
  
  Some AI scrapers have been proven to not report themselves as AI scrapers and mimic true users.
  This is part of what’s leading to the bludgeoning approach you see with blocking. They are not an individual thjng that can be blocked.

ectospheno a year ago

I love that I get a cloudflare human check on almost every page they serve for customers except for when I login to my cloudflare account. Good times.

srmarm a year ago

I'd have thought the website owner whitelisting their RSS feed URI (or pattern matching *.xml/*.rss) might be better than doing it based on the users agent string. For one you'd expect bot traffic on these end points and you're also not leaving a door open to anyone who fakes their user agent.

Looks like it should be possible under the WAF

rcarmo a year ago

Ironically, the site seems to currently be hugged to death, so maybe they should consider using Cloudflare to deal with HN traffic?

sofixa a year ago

Doesn't have to be using CloudFlare, just a static web host that will be able to scale to infinity (of which CloudFlare is one with Pages, but there's also Google with Firebase Hosting, AWS with Amplify, Microsoft with something in Azure with a verbose name, Netlify, Vercel, GitHub Pages, etc etc etc).
- kawsper a year ago
  
  Or just add Varnish or Nginx configured with a cache in front.
  
  sofixa a year ago
  
  That can still exhaust system resources on the box it's running on (file descriptors, inodes, ports, CPU/memory/bandwidth, etc) if you hit it too big.
  For something like entirely static content, it's so much easier (and cheaper, all of the static hosting providers have an extremely generous free tier) to use static hosting.
  And I say this as an SRE by heart who runs Kubernetes and Nomad for fun across a number of nodes at home and in various providers - my blog is on a static host. Use the appropriate solution for each task.
  
  vundercind a year ago
  
  I used to serve low-tens-of-MB .zip files—worse than a web page and a few images or what have you—statically from Apache2 on a boring Linux server that'd qualify as potato-tier today, with traffic spikes into the hundreds of thousands per minute. Tens of thousands per minute against other endpoints gated by PHP setting a header to tell Apache2 to serve the file directly if the client authenticated correctly, and I think that one could have gone a lot higher, never really gave it a workout. Wasn't even really taxing the hardware that much for either workload.
  Before that, it was on a mediocre-even-at-the-time dedicated-cores VM. That caused performance problems... because its Internet "pipe" was straw-sized, it turned out. The server itself was fine.
  Web server performance has regressed amazingly badly in the world of the Cloud. Even "serious" sites have decided the performance equivalent of shitty shared-host Web hosting is a great idea and that introducing all the problems of distributed computing at the architecture level will help their moderate-traffic site work better (LOL; LMFAO), so now they need Cloudflare and such just so their "scalable" solution doesn't fall over in a light breeze.
timeon a year ago

If it is unintentional DDoS, we can wait. Not everything needs to be on demand.
- dewey a year ago
  
  The website is built to get attention, the attention is here right now. Nobody will remember to go back tomorrow and read the site again when it’s available.
  
  BlueTemplar a year ago
  
  I'm not sure an open web can exist under this kind of assumption...
  Once you start chasing views, it's going to come at the detriment of everything else.
  
  dewey a year ago
  
  This happened at least 15 years ago and we are doing okay.

PeterStuer a year ago

I was bitten by this as well. My product retrieves RSS feeds from public government sites, and suddely I'm blocked by cloudflair's antibotting for tryng to access a page that was specifically created for machine consumption. It is not that the website owner or publisher intend to block this. They are unaware that turng on Cloudflare will block everything, even stuff allowed to be consumed according to robots.txt .

P.S. when I mentioned this here on HN a few weeks back, it was implied that I probably did not respect robots.txt ( I do, Cloudflair does not) or that I should get in touch with the site administrators (impossible to do in any reasonably effective way at scale).

drudru a year ago

I noticed this a while back when I was trying to read cloudflare's own blog. Periodically they would block my newsreader. I ended up just dropping their feed.

I am glad to see other people calling out the problem. Hopefully, a solution will emerge.

samplifier a year ago

I've noticed that Old Reddit still supports RSS feeds without returning a 403 error. This is in contrast to the main site, which often blocks RSS requests.

Here are some DNS details:

The main Reddit site (www.reddit.com) uses Fastly. Old Reddit (old.reddit.com) also uses Fastly. However, the "vomit" address (which often returns 403s for RSS requests) uses AWS DNS. Is Old Reddit not behind Cloudflare, or is there another reason why it handles RSS requests differently?

samplifier a year ago

Ignore above. Brainfarted. It doesn't work.

est a year ago

Hmmm, that's why "feedburner" is^H^Hwas a thing, right?

We have come to full circle.

kevincox a year ago

Yeah, this is the recommendation that I usually give people who reach out to support. Feedburner tends to be on the whitelists to avoids this problem.

renewiltord a year ago

Ah, the Cloudflare free plan does not automatically turn these on. I know since I use it for some small things and don't have these on. I wouldn't use User-Agent filtering because those are spoofable. But putting feeds on a separate URL is probably a good idea. Right now the feed is actually generated on request for these sites, so caching it is probably a good idea anyway. I can just rudimentarily do that by periodically generating and copying it over.

015a a year ago

Suggesting that website operators should allowlist RSS clients through the Cloudflare bot detection system via their user-agent is a rather concerning recommendation.

soraminazuki a year ago

This is an issue with techdirt.com. I contacted them about this through their feedback form a long time ago, but the issue still remains unfortunately.

nfriedly a year ago

Liliputing.com had this problem a couple of years ago. I emailed the author and he got it sorted out after a bit of back and forth.

hwj a year ago

I had problems accessing Cloudflare-hosted websites via the Tor browser also. Don't know it that is still true.

timnetworks a year ago

RSS is the future that is being kept from us for twenty years already, fusion can kick bricks.

qwertyuiop_ a year ago

I have always suspected cloudflare being a classic intelligence community op. Just like Google was funded by qinetq

hkt a year ago

It also manages to break IRC bots that do things like show the contents of the title tag when someone posts a link. Another cloudy annoyance, albeit a minor one.

3np a year ago

Also: Sign in on gitlab.com is broken for me on Tor Browser because of an infinite "Verify you are human" refresh/redirect loop...

dewey a year ago

I’m using Miniflix and I always run into that on a few blogs which now I just stopped reading.

shaunpud a year ago

Namesilo are the same, their csv/rss behind Cloudflare so don't even bother anymore with their auctions and their own interface is meh

anilakar a year ago

...and there is a good number of people who see this as a feature, not a bug.

glub103011 a year ago

[dead]

idunnoman1222 a year ago

Yes, the way to retain your privacy is to not use the Internet

if you don’t like it, make your own Internet: assumedly one not funded by ads