Opened 12 months ago

Last modified 10 months ago

#32090 new defect

Blog status and where to go

Reported by: hiro Owned by: tpa
Priority: Medium Milestone:
Component: Internal Services/Tor Sysadmin Team Version:
Severity: Normal Keywords:
Cc: isabela, anarcat, hiro Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description (last modified by hiro)

We have a few issues with our blog.

  1. Our template is broken. Comments are displayed out of the intended layout.
  2. Our blog is generating a lot of page views and has become quite expensive (more below).

Our blog has become quite popular and now receives around 300k monthly
"visitors" and more than 1.5M "page loads".
This is increasing our expenses significantly, so we are evaluating various caching options.

  • Using a CDN like Fastly, Netlify, or Cloudflare
  • Using Varnish

Caching via Varnish could create a bottleneck for our blog and a single point of failure.

In the medium term we could also evaluate what we want to do with our current blog.

  • There is a Drupal static caching project called Tome that we could use together with Drupal comments from Pantheon.
  • We could migrate the blog to a static content generator and use a separate system for comments.

Child Tickets

Ticket | Status | Owner | Summary | Component
#32239 | closed | anarcat | setup a cache frontend for the blog | Internal Services/Service - cache
#33105 | closed | hiro | evaluate if discourse can be used as comments platform for the blog | Webpages/Blog

Attachments (5)

snap-2019.11.11-10.45.54.png (64.4 KB) - added by anarcat 11 months ago.
snap-2019.11.25-09.39.54.png (62.9 KB) - added by anarcat 11 months ago.
snap-2019.11.25-09.45.06.png (269.0 KB) - added by anarcat 11 months ago.
snap-2019.12.09-11.28.37.png (58.5 KB) - added by anarcat 11 months ago.
snap-2019.12.09-11.30.12.png (226.7 KB) - added by anarcat 11 months ago.

Change History (12)

comment:1 Changed 12 months ago by hiro

Description: modified (diff)
Summary: Caching for the blog → Blog status and where to go

comment:2 Changed 12 months ago by anarcat

ah. i saw this ticket after replying in private by email, but i'll share that analysis here. ;)

TL;DR: I'd still go with Varnish, and ask about the next steps on that.

The single bottleneck issue for Varnish could be a problem, but we do
have multiple locations for our servers and would be able to provide
multiple redundant servers without too much trouble if that becomes an
issue. I would certainly advocate creating at least two
frontends to start with.

As we discussed last week, we already have a (~free) contract with
Fastly, so if we want to go the "CDN" way, it would be a good
option. They say they don't log/track their users, but I'm not sure it
would be a great move in terms of "publicity". I'm also not quite sure I
trust Fastly with doing the right thing here, ultimately, nor do I feel
that putting all our eggs in the same basket is safe. We
also run the risk of blowing our quota there eventually if we throw
everything in Fastly.

I would assume CF is out of the question, and I don't know enough about
Netlify to speak about it...

It would be useful to know a little more about what "page loads" mean. The
300k "visitors" and 1.5M "pages" figures are similar to what we see in
the dashboard, but in terms of server resources, actual raw numbers
(megabits per second or total gigabytes, and "hits" per second, as
opposed to pages) would be more useful to evaluate our capacity. What's a
"page" for example? Is that one page load, with all extra resources like
CSS and images? While that's useful for them because it's their primary
driver (because it's drupal fighting with PHP and the database to create
the page on the fly), for us at the caching layer, we don't care about
the type of content as much. :)

Finally, I looked at Tome briefly. There were various modules like this
in Drupal's history, the one I knew about before today is called "boost"
but hasn't been ported to D8 it seems. Tome is interesting, as it does
allow the creation of a static site in front of Drupal, and we could
then share it on the mirror system, but then it still means we need to
deal with and pay Pantheon for the hosting, which still seems like an
expensive proposition for basically a glorified text editor. I'm not
sure how "just sending the comment links" would work in practice, but
maybe it can be done too.

Anyways, Tome would take time and effort to set up, and since we are
still considering our long-term options here, I wouldn't advise that
solution just yet. I would rather start working more concretely on how
to set up the Varnish frontends, provided we have confirmation on the
numbers. As a rough guesstimate, 1.5M "pages" is about 23Mbit/s on
average during the month, something we could probably absorb in the
existing infrastructure without too much trouble. But that's assuming
every page is just the 5MB frontpage; having better numbers would help here.
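
The guesstimate above can be reproduced with a short calculation. This is only a sketch: it assumes every "page" costs a full 5MB frontpage load, a 30-day month, and decimal units (1MB = 10^6 bytes). The function name is made up for illustration.

```python
def average_mbit_per_second(pages_per_month, bytes_per_page, days=30):
    """Average bandwidth in Mbit/s if traffic were spread evenly over the month."""
    total_bits = pages_per_month * bytes_per_page * 8
    seconds = days * 24 * 60 * 60
    return total_bits / seconds / 1_000_000


# Assumption: each of the 1.5M "pages" is a full 5MB frontpage load.
estimate = average_mbit_per_second(1_500_000, 5_000_000)
print(f"{estimate:.1f} Mbit/s")
```

Real traffic is not spread evenly, so peaks would be higher than this monthly average; raw hit and byte counts would give a better picture.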

comment:3 Changed 11 months ago by anarcat

problem 2 should be solved:

we're taking about 88% of the traffic out of the blog, which should drastically reduce the costs. an 88% reduction should bring us from the peak 435000 visits (the 300k visits per month package, 1000$/mth or more), down to around 52k visits per month, which is about the limit for the "medium" package (50k, 300$) or, worst case, within the "large" package (150k, 600$).

we'll see the actual result at the end of november, i guess!

next up is the design issues and deeper underlying issue with the blog maintenance.

comment:4 Changed 11 months ago by pili

We had a meeting in Stockholm about the blog, you may find the following notes interesting also:

It's not so much about the hosting but mainly about the broken template issues and whether we want to continue using Drupal

comment:5 Changed 11 months ago by anarcat

Another update, two weeks later:

We're still steady at around an 87-89% hit ratio. We had a small outage on one of the servers on Friday (#32603) but thanks to their redundant nature, it probably went unnoticed. We are down to 450$/mth in the billing, and the caches haven't been online for a full month yet, so that's likely to go down a little further.

We're currently at 1M pages served for November (October = 3M) and 147k visits (October = 435k), according to Pantheon.

Last edited 11 months ago by anarcat (previous) (diff)

comment:6 Changed 10 months ago by anarcat

Here's a summary of our status with the blog, a month after the cache went online. Two main problems were identified with the blog:

  1. Broken templates and long-term web development goals
  2. Cost overrun issues


  1. fix templates in-house or with Giant Rabbit, then switch to a static site generator (Lektor?) and an external commenting system (Discourse?) in the mid-to-long term
  2. cost overruns are back under control (~500$?), but incomprehensible billing makes this uncertain; need to double-check

Broken templates and web development

Ever since some change happened on the blog (an upgrade?), the HTML templates have been broken, which is particularly visible in the comments section. Comments are not formatted properly and we want that fixed. We considered various providers to outsource this work to and, at the same time, considered moving our hosting elsewhere. We received a quote, which was discussed privately.

For now, we will try to fix the blog where it is, maybe with the help of an existing Drupal provider (Giant Rabbit) instead of starting a new business relationship. Something we should keep in mind is that fixing the template might be expensive. Hiro is willing to make another attempt at adapting our styleguide to an updated Bootstrap template.

In the long term, we want to move away from Drupal, towards a static site generator for the content and something like Discourse for the comments in the backend. The latter could be reused for other projects inside Tor, particularly the support and community teams, among others. It was also considered as an option for easier user onboarding for bug reporting when compared to GitLab. The static site generator could be one we are already using, like Lektor. This still has to be discussed further. We might achieve the same level of WYSIWYG with a static site generator, without the time and financial investment of running a giant framework like Drupal.

Cost issues

The other problem identified in October was the cost overrun. Around August or September, we passed the 300k visits per month mark, which bumped us into another price range with Pantheon (~1k$/mth). Their pricing plans seem to go as follows, in terms of visits/month vs cost/month:

  • small, 25k: 175$
  • medium, 50k: 300$
  • large, 150k: 600$
  • extra-large, 300k: 1000$

(I'm ignoring the "basic" 50$/mth package because I'm working under the assumption that it's not available to us, since this is a high traffic site.)
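
To make the tiers above easier to compare against traffic figures, here is a small lookup sketch. The limits and prices are the ones listed above as they seem to be, not an official Pantheon price list. The names `PLANS` and `plan_for` are hypothetical, and treating each limit as inclusive is an assumption.

```python
# Tiers as listed above: (name, max visits per month, cost in $ per month).
PLANS = [
    ("small", 25_000, 175),
    ("medium", 50_000, 300),
    ("large", 150_000, 600),
    ("extra-large", 300_000, 1000),
]


def plan_for(visits_per_month):
    """Return (name, cost) of the cheapest listed tier covering the visits.

    Returns None when the visits exceed every listed tier; how Pantheon
    bills beyond 300k visits is not known from this ticket.
    """
    for name, max_visits, cost in PLANS:
        # Assumption: a tier covers traffic up to and including its limit.
        if visits_per_month <= max_visits:
            return name, cost
    return None


print(plan_for(52_000))   # projected visits after an 88% reduction
print(plan_for(435_000))  # October peak, above all listed tiers
```

With these assumed limits, 52k visits lands just above the "medium" tier, which is why the "large" package is the worst case mentioned earlier.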

Before the traffic bumps happened, we were billed 500$/mth for the site, presumably a preferred rate over the official 600$/mth rate. We were bumped from the "large" to the "extra-large" package first on September 27th, then again on October 29th. Their billing system is ... a bit opaque to me, but it seems we are now billed 500$/mth again. I honestly can't figure out *what* is going on with the billing at this point. I would love it if Jon or someone else could go over those invoices and figure it out.

But my theory right now is that the caching system did its job and brought us back to a "pre-crisis" level of billing, that is, the "large" billing package. Indeed, that is what the "billing" section of the Pantheon dashboard says. There's also this message in the "Workflow" section:

Changed site plan to "plan-performance_large-preferred-monthly-1"
[matt's email address at Pantheon]
Finished 40 minutes ago

So maybe we got someone at Pantheon to intervene for us?

We can clearly see a drop in the traffic on the backend in the Pantheon stats:

  • October: 435k visits, 3.1M pages served
  • November: 165k visits, 1M pages served

That's about a 62% drop in visits and a 68% drop in pages served. It could still get slightly better in December, as our hit ratio is actually better than this, at 88%.

The reason those ratios don't correspond exactly to each other is that we count those statistics in different ways. Pantheon uses "visits" and "pages", we use "hits". The distinction is that a "visitor" can hit multiple "pages" in one "visit", and a page is made of multiple "hits". So while we may keep many hits from going to the backend, we may not keep as many "pages" from going there as we would like. I suspect it would be very hard to remove the other 115k visits per month to get down to the medium package, and I have not made more efforts to do so.
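
The drop figures can be recomputed from the rounded monthly numbers quoted above. Since those inputs are rounded, the result may differ by a point from percentages computed on the exact dashboard figures. `percent_drop` is a hypothetical helper, not something from Pantheon or Varnish.

```python
def percent_drop(before, after):
    """Relative decrease from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# Rounded figures from the Pantheon stats for October and November.
visits_drop = percent_drop(435_000, 165_000)
pages_drop = percent_drop(3_100_000, 1_000_000)

# Visits that would still have to go to fit the 50k "medium" tier.
excess_visits = 165_000 - 50_000

print(f"visits: {visits_drop:.0f}% drop, pages: {pages_drop:.0f}% drop")
print(f"visits above the medium tier: {excess_visits}")
```

Both drops come out well below the 88% hit ratio, consistent with the point that one visit or page is made of many hits and only some of them need to reach the backend for it to be counted.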

Also, as far as I can tell, this traffic hitting our own TPA infrastructure is not affecting us in any significant way, neither in terms of cost (traffic is not large enough to change billing significantly) nor performance (load is not big enough to affect the server's overall performance).

So I consider the "cost" crisis to be over, but there might be more tricks we could use to bring the hit ratio up. At this point, however, I don't consider the extra effort worth it, as long as Pantheon doesn't bump us back to the "extra large" cost grid.

comment:7 Changed 10 months ago by hiro

Today I found out how to fix the old template while creating a new one. So the template issues are now partially resolved in the old template as well.