Blog status and where to go

Trac:
Child Ticket(s): #33105 (moved), #32239 (moved)


Trac:
Summary: changed from "Caching for the blog" to "Blog status and where to go"
Description: changed to:

We have a few issues with our blog.

Our template is broken. Comments are displayed out of the intended layout.
Our blog is generating a lot of page views and has become quite expensive (more below).

Our blog has become quite popular and has received around 300k monthly "visitors" and above 1.5M "page loads". This is bumping our expenses significantly and we are evaluating various options regarding caching.

Using a CDN like Fastly, Netlify, or Cloudflare
Using Varnish

Caching via Varnish could create a bottleneck for our blog and a single point of failure.

In the medium term we could also evaluate what we want to do with our current blog.

There is a Drupal static caching project called Tome that we could use together with drupal comments from pantheon.
We could migrate the blog to a static content generator and use a separate system for comments.

ah. i saw this ticket after replying in private by email, but i'll share that analysis here. ;)

TL;DR: I'd still go with Varnish, and ask about the next steps on that.

The single bottleneck issue for Varnish could be a problem, but we do have multiple locations for our servers and would be able to provide multiple redundant servers without too many problems if that becomes an issue. I would certainly advocate for creating at least two frontends to start with.

As we discussed last week, we already have a (~free) contract with Fastly, so if we want to go the "CDN" way, it would be a good option. They say they don't log/track their users, but I'm not sure it would be a great move in terms of "publicity". I'm also not quite sure I trust Fastly with doing the right thing here, ultimately, nor do I feel that putting all our eggs in the same basket is safe. We also run the chance of blowing our quota there eventually if we throw everything in Fastly.

I would assume CF is out of the question, and I don't know enough about Netlify to speak about it...

It would be useful to know a little more about what "page loads" means. The 300k "visitors" and 1.5M "pages" figures are similar to what we see in the dashboard, but in terms of server resources, actual raw numbers (megabits per second or total gigabytes, and "hits" per second, as opposed to pages) would be more useful to evaluate our capacity. What's a "page" for example? Is that one page load, with all extra resources like CSS and images? While that's useful for them because it's their primary driver (because it's Drupal fighting with PHP and the database to create the page on the fly), for us at the caching layer, we don't care about the type of content as much. :)

Finally, I looked at Tome briefly. There were various modules like this in Drupal's history; the one I knew about before today is called "boost" but hasn't been ported to D8, it seems. Tome is interesting, as it does allow the creation of a static site in front of Drupal, and we could then share it on the mirror system, but then it still means we need to deal with and pay Pantheon for the hosting, which still seems like an expensive proposition for basically a glorified text editor. I'm not sure how "just sending the comment links" would work in practice, but maybe it can be done too.

Anyways, Tome would take time and effort to set up, and since we are still considering our long-term options here, I wouldn't advise that solution just yet and would rather start working more concretely on how to set up the Varnish frontends, provided we have confirmation on the numbers. With a rough guesstimate, 1.5M "pages" is about 23Mbit/s on average during the month, something we could probably absorb in the existing infrastructure without too much trouble. But that's assuming just the 5MB frontpage; having better numbers would help here tremendously.
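The 23Mbit/s guesstimate above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes every one of the 1.5M "pages" is a full 5MB frontpage load, a 30-day month, and decimal units (1MB = 10^6 bytes), none of which are confirmed by Pantheon's numbers.

```python
# Assumptions (not confirmed figures): every "page" is a 5 MB frontpage load,
# the month has 30 days, and units are decimal (1 MB = 10**6 bytes).
PAGES_PER_MONTH = 1_500_000
PAGE_SIZE_BYTES = 5 * 10**6
SECONDS_PER_MONTH = 30 * 24 * 60 * 60


def average_mbit_per_second(pages, page_size_bytes, seconds):
    """Average bandwidth in Mbit/s needed to serve `pages` over `seconds`."""
    total_bits = pages * page_size_bytes * 8
    return total_bits / seconds / 10**6


avg_mbps = average_mbit_per_second(PAGES_PER_MONTH, PAGE_SIZE_BYTES, SECONDS_PER_MONTH)
print(f"{avg_mbps:.1f} Mbit/s on average")
```

This is an average over the whole month; peak traffic would be higher, which is one more reason raw hits-per-second numbers would help.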

Trac:

problem 2 should be solved:

we're taking about 88% of the traffic out of the blog, which should drastically reduce the costs. an 88% reduction should bump us from the peak 435000 visits (the 300k visits per month package, 1000$/mth or more), down to around 52k visits per month, which is about the metric for the "medium" package (300$) or, worst case, the "large" package (150k, 600$).
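As a rough check of the 52k figure, here is a minimal sketch of the projection. It assumes the cache hit ratio translates one-to-one into a reduction of the "visits" Pantheon counts, which is a simplification since the cache counts hits and not visits. The function name is only illustrative.

```python
def projected_backend_visits(total_visits, hit_ratio):
    """Visits still reaching the backend if `hit_ratio` of them are absorbed.

    Assumption: the hit ratio measured on cache hits applies uniformly to
    visits, which is only an approximation.
    """
    return round(total_visits * (1 - hit_ratio))


projected = projected_backend_visits(435_000, 0.88)
print(projected)
```

With the peak of 435000 visits and an 88% hit ratio, this lands just above the 50k visits per month mark, hence "medium" or, worst case, "large".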

we'll see the actual result at the end of november, i guess!

next up is the design issues and deeper underlying issue with the blog maintenance.

We had a meeting in Stockholm about the blog, you may find the following notes interesting also: https://trac.torproject.org/projects/tor/wiki/org/meetings/2019Stockholm/Notes/Blog

It's not so much about the hosting but mainly about the broken template issues and whether we want to continue using Drupal

Trac:

Another update, two weeks later:

We're still steady at around 87-89% hit ratio. We had a small outage on one of the servers on Friday (#32603 (moved)) but thanks to their redundant nature, that probably went unnoticed. We are down to 450$/mth in the billing, and the caches haven't been online for a full month yet, so that's likely to go down a little further.

We're currently at 1M pages served for november, according to Pantheon, (october = 3M) and 147k visits (oct = 435k).

Trac:

Here's a summary of our status with the blog, a month after the cache went online. Two main problems were identified with the blog:

Broken templates and long-term web development goals
Cost overrun issues

TL;DR:

fix templates in-house or Giant Rabbit, switch to static site generator (Lektor?) and external commenting system (Discourse?) in the mid-long term
cost overruns back under control (~500$?), but incomprehensible billing makes this possibly uncertain, need to double-check

= Broken templates and web development

Ever since some change happened on the blog (an upgrade?), HTML templates were broken, which is particularly visible in the comments section. Those are not formatted properly and we want those fixed. We considered various providers to outsource this work to and, at the same time, considered moving our hosting elsewhere. We had a quote from Koumbit.org which was privately discussed.

For now, we will try to fix the blog where it is, maybe with the help of an existing Drupal provider (Giant Rabbit) instead of starting a new business relationship. Something that we should consider is that fixing the template might be expensive. Hiro is willing to make another attempt at adapting our styleguide to an updated bootstrap template.

In the long term, we want to move away from Drupal, towards a static site generator for the content and something like Discourse for the comments in the backend. The latter could be reused for other projects inside Tor, particularly the support and community teams, among others. It was also considered as an option for easier user onboarding for bug reporting when compared to GitLab. The static site generator could be one we are already using, like Lektor. This still has to be discussed further. We might achieve the same level of WYSIWYG with a static site generator, without the time and financial investment of running a giant framework like Drupal.

= Cost issues

The other problem that was identified in October was the cost overrun issues. Around August or September, we passed the 300k visits per month mark, which bumped us into another price range with Pantheon (~1k$/mth). Their pricing plan seems to go as follows, in terms of visits/month vs cost/month:

small, 25k: 175$
medium, 50k: 300$
large, 150k: 600$
extra-large, 300k: 1000$

(I'm ignoring the "basic" 50$/mth package because I'm working on the assumption that it's not accessible to us, because it's a high traffic site.)
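To make the price grid above easier to reason about, here is a small lookup sketch. The thresholds and prices are the ones listed above; treating each limit as inclusive and treating anything above 300k as "extra-large" at its listed price are assumptions, and the real billing rules may well differ from this strict lookup.

```python
# (name, visits per month limit, cost in $ per month), as listed above.
PLANS = [
    ("small", 25_000, 175),
    ("medium", 50_000, 300),
    ("large", 150_000, 600),
    ("extra-large", 300_000, 1000),
]


def plan_for(visits):
    """Return (plan name, monthly cost) for a number of monthly visits.

    Assumptions: limits are inclusive, and traffic above the highest limit
    is billed at least at the "extra-large" rate.
    """
    for name, limit, cost in PLANS:
        if visits <= limit:
            return name, cost
    name, _, cost = PLANS[-1]
    return name, cost


for visits in (435_000, 52_200):
    print(visits, plan_for(visits))
```

The two sample values are the October peak and the projection for an 88% reduction: the first lands in "extra-large" and the second in "large", just over the "medium" limit.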

Before the traffic bumps happened, we were billed 500$/mth for the site, presumably a preferred rate over the official 600$/mth rate. We were bumped from the "large" to the "extra-large" package first on September 27th, then again on October 29th. Their billing system is ... a bit opaque to me, but it seems we are now billed 500$/mth again. I honestly can't figure out what is going on with the billing at this point. I would love it if Jon or someone else could go over those invoices and figure it out.

But my theory right now is the caching system did its job and brought us back to a "pre-crisis" level of billing, that is, the "large" billing package. Indeed, that is what the "billing" section of the Pantheon dashboard says. There's also this message in the "Workflow" section:

Changed site plan to "plan-performance_large-preferred-monthly-1" [matt's email address at panthon] Finished 40 minutes ago

So maybe we got someone at Pantheon to intervene for us?

We can clearly see a drop in the traffic on the backend in the Pantheon stats:

October: 435k visits, 3.1M pages served
November: 165k visits, 1M pages served

That's roughly a 62% drop in visits and a 68% drop in pages served. It could still get slightly better in December, as our hit ratio is actually better than this, at 88%.

The reason those ratios don't correspond exactly to each other is we have different ways to count those statistics. Pantheon uses "visits" and "pages", we use "hits". The distinction is that a "visitor" can hit multiple "pages" in one "visit", and a page is made of multiple "hits". So while we may keep many hits from going to the backend, we may not keep as many "pages" as we want from going there. I suspect it would be very hard to remove the other 115k visits per month to get down to the medium package, and I have not made more efforts to do so.
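For reference, the month-over-month drop can be recomputed from the rounded Pantheon figures quoted above. This is just a sketch of the arithmetic; since the inputs are rounded dashboard numbers, the results are approximate.

```python
def percent_drop(before, after):
    """Relative decrease from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# Rounded figures from the Pantheon stats (October vs November).
visits_drop = percent_drop(435_000, 165_000)
pages_drop = percent_drop(3_100_000, 1_000_000)

print(f"visits: {visits_drop:.1f}% drop, pages: {pages_drop:.1f}% drop")
```

Both numbers sit well below the 88% hit ratio measured on the caches, which is the gap the paragraph above explains.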

Also, as far as I can tell, this traffic hitting our own TPA infrastructure is not affecting us in any significant way, neither in terms of cost (traffic is not large enough to change billing significantly) nor performance (load is not big enough to affect the server's overall performance).

So I consider the "cost" crisis to be over, but there might be more tricks we could use to bring the hit ratio up. At this point, I consider the cost tradeoff to not be worth it, however, as long as Pantheon doesn't bump us back to the "extra large" cost grid.

Today I found how to fix the old template while creating a new template. So the template issues are now partially resolved also in the old template.

mentioned in issue #32239 (moved)

mentioned in issue #33105 (moved)
