Opened 6 months ago

Last modified 4 days ago

#25985 needs_revision project

Snowflake rendezvous using AMP cache

Reported by: twim Owned by:
Priority: Medium Milestone:
Component: Obfuscation/Snowflake Version:
Severity: Normal Keywords:
Cc: dcf, arlolra Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description (last modified by dcf)

This ticket is meant to track progress on adding AMP cache fronting into Snowflake broker and client.

This is a followup of https://trac.torproject.org/projects/tor/ticket/25804#comment:25:

It turns out that AppEngine is not the only option for domain fronting with Google.
Google also provides a service called ​AMP cache for AMP pages. What it basically does is proxying random pages on the Internet and making them load faster (e.g. on Google search results). It requires pages to comply with some format though and also strips invisible content, resizes images, etc.
Despite it is being served via different domain names (one per real domain) it is still hosted at Google infrastructure which can be fronted.

Child Tickets

Change History (29)

comment:1 Changed 6 months ago by twim

Type: defectproject

comment:2 Changed 6 months ago by dcf

Description: modified (diff)

twim's library for tunneling through AMP: https://github.com/nogoegst/amper

comment:3 Changed 6 months ago by dcf

The way I picture this working for snowflake is, we add a new route to the broker, like /amp/client. It will work exactly like the existing /client route (which is where clients POST their ICE offer)--with the only difference being that /amp/client will wrap the response in the additional AMP markup. The client side will have to know how to strip off the extra markup. It's probably not very much work.
https://gitweb.torproject.org/pluggable-transports/snowflake.git/tree/broker/broker.go?id=10ad59fc9d26900ded6456f50ea6adf4cb58be9d#n146

Here's a diagram of how it works now: https://www.bamsoftware.com/papers/thesis/#fig:snowflake-rendezvous
Imagine an AMP node between the client and broker. In Step 1, instead of domain fronting, the client will just do a normal request for https://www.google.com/amp/snowflake.example.com or whatever. In Step 3, the broker will append the extra AMP markup. Everything else is the same.

twim, can you give a brief guide on what is needed to set up AMP? I presume you at least need a Google account; is it something you set up in the Google Cloud Platform? Is there a fee? I've seen different kinds of AMP URLs, like

https://www.google.com/amp/s/amp.reddit.com/r/OutOfTheLoop/comments/56euau/whats_with_google_amp_quite_annoyingly_being_used/
https://amp-reddit-com.cdn.ampproject.org/
https://amp.reddit.com/

Do you know what the difference between all these URL styles is? Are they basically interchangeable? The first one looks like the best, if we can use it.

comment:4 Changed 6 months ago by dcf

Status: newneeds_information

comment:5 Changed 6 months ago by twim

twim, can you give a brief guide on what is needed to set up AMP?

Sure. One needs to
On the server side:

Setup a web server accessible via domain name and preferably using TLS. This server has to wrap some *text* data inside the AMP markup. The text should be "visible" as AMP cache strips unnecessary content.

On the client side:

Assemble a special URL which contains:

  • correct AMP-styled host
  • random part (to go around caching)
  • the outgoing payload (as only GET requests can be made)

Fetch the URL using some front and extract (and decode) data from the AMP page back.

In a way it is pretty similar to meek except GET instead of POST and extra encoding/decoding thing.

I presume you at least need a Google account; is it something you set up in the Google Cloud Platform? Is there a fee?

Curiously enough you don't need a Google account for that because the AMP project itself isn't solely a Google thing. It is just a special HTML markup that can be accelerated by any party incl. Google. You just set up an AMP version of your pages at your host and it just works. No GCP involved. There is no fee at the moment for page loading, there will only be on API calls (not our case). As IANAL, I am not aware whether this usage violates ToS. I couldn't find any.

I've seen different kinds of AMP URLs...
Do you know what the difference between all these URL styles is? Are they basically interchangeable? The first one looks like the best, if we can use it.

I haven't managed to make URLs like https://www.google.com/amp/s/amp.reddit.com/blablabla to not redirect to the full article. I am probably just do not understand how this kind of links differs from others.

https://amp-reddit-com.cdn.ampproject.org/

This is the kind of links I am using in amper. I guess that in theory *.cdn.ampproject.org can resolve to non-Google IPs as well. These hosts can be fronted by typical Google server names.

https://amp.reddit.com/

This is the host from which one is serving their AMP pages.

comment:6 Changed 6 months ago by cypherpunks

Status: needs_informationnew

comment:7 in reply to:  5 ; Changed 6 months ago by dcf

Replying to twim:

I presume you at least need a Google account; is it something you set up in the Google Cloud Platform? Is there a fee?

Curiously enough you don't need a Google account for that because the AMP project itself isn't solely a Google thing. It is just a special HTML markup that can be accelerated by any party incl. Google. You just set up an AMP version of your pages at your host and it just works. No GCP involved. There is no fee at the moment for page loading, there will only be on API calls (not our case). As IANAL, I am not aware whether this usage violates ToS. I couldn't find any.

Thanks for this great info. It's a lot easier than I imagined. The Google AMP cache will issue GET requests to arbitrary URLs on your behalf (going back to the OSS idea). I tried it with my web server, which I haven't done anything to set up for AMP:

https://www-bamsoftware-com.cdn.ampproject.org/c/s/www.bamsoftware.com/amptest

This resulted in an HTTP request to my server:

64.233.172.149 - - [03/May/2018:10:59:20 -0600] "GET /amptest HTTP/1.1" 404 3726 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Google-AMPHTML)"

Probably because the page doesn't pass AMP validation (i.e., doesn't exist), the AMP cache's response was a status-200 meta/JavaScript redirect to the original URL:

HTTP/1.1 200 OK
Location: https://www.bamsoftware.com/amptest
Cache-Control: private
X-Content-Type-Options: nosniff
Date: Thu, 03 May 2018 17:05:10 GMT
Content-Type: text/html; charset=UTF-8
Server: sffe
Content-Length: 361
X-XSS-Protection: 1; mode=block
Alt-Svc: hq=":443"; ma=2592000; quic=51303433; quic=51303432; quic=51303431; quic=51303339; quic=51303335,quic=":443"; ma=2592000; v="43,42,41,39,35"

<HTML><HEAD>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>Redirecting</TITLE>
<META HTTP-EQUIV="refresh" content="1; url=https://www.bamsoftware.com/amptest">
</HEAD>
<BODY onLoad="location.replace('https://www.bamsoftware.com/amptest'+document.location.hash)">
Redirecting you to https://www.bamsoftware.com/amptest</BODY></HTML>

I've seen different kinds of AMP URLs...
Do you know what the difference between all these URL styles is? Are they basically interchangeable? The first one looks like the best, if we can use it.

I haven't managed to make URLs like https://www.google.com/amp/s/amp.reddit.com/blablabla to not redirect to the full article. I am probably just do not understand how this kind of links differs from others.

The trick with these is you have to use a mobile User-Agent. Press Ctrl+Shift+I to open the browser console, click the "Responsive Design Mode", and choose a phone from the menu.

https://amp-reddit-com.cdn.ampproject.org/

This is the kind of links I am using in amper. I guess that in theory *.cdn.ampproject.org can resolve to non-Google IPs as well. These hosts can be fronted by typical Google server names.

Okay yeah, I found these guides to the URL format. The c means content (can also be r for resource or i for image) and the s means use TLS.

https://developers.google.com/amp/cache/overview#amp-cache-url-format
https://www.ampbyexample.com/advanced/using_the_google_amp_cache/#amp-cache-url-format

https://amp.reddit.com/

This is the host from which one is serving their AMP pages.

I see; so "amp" in the name here is just a convention, not a requirement.

I found this description of the three kinds of URLs:

https://www.ampproject.org/latest/blog/whats-in-an-amp-url/

The "*.cdn.ampproject.org" ones they call "AMP Cache" URLs and the "google.com/amp" ones they call "AMP Viewer" URLs. It seems like the "AMP Viewer" URLs are only produced automatically by Google in search result pages. But yeah, in any case, you can domain-front the "AMP Cache" URLs.

How these all link together is you have the original non-AMP page:

https://www.reddit.com/r/OutOfTheLoop/comments/56euau/whats_with_google_amp_quite_annoyingly_being_used/

In the source code, there is a <link rel="amphtml" /> that points to the AMP version:

https://amp.reddit.com/r/OutOfTheLoop/comments/56euau/whats_with_google_amp_quite_annoyingly_being_used/

The AMP URL (or any URL) can be mechanically converted to an AMP Cache URL:

https://amp-reddit-com.cdn.ampproject.org/c/s/amp.reddit.com/r/OutOfTheLoop/comments/56euau/whats_with_google_amp_quite_annoyingly_being_used/

Which in some contexts may appear as an AMP Viewer URL (adds a header to the page):

https://www.google.com/amp/s/amp.reddit.com/r/OutOfTheLoop/comments/56euau/whats_with_google_amp_quite_annoyingly_being_used/

Last edited 6 months ago by dcf (previous) (diff)

comment:8 in reply to:  7 ; Changed 6 months ago by twim

Replying to dcf:
Thanks for breaking it down!

Okay yeah, I found these guides to the URL format. The c means content (can also be r for resource or i for image) and the s means use TLS.

https://developers.google.com/amp/cache/overview#amp-cache-url-format
https://www.ampbyexample.com/advanced/using_the_google_amp_cache/#amp-cache-url-format

Strange, I've seen this docs to find out what /s means, though I took the URL itself from desktop Google search results. This URL (in amper) uses /v/s/ location instead of /c/s/. I don't see any docs on that. Maybe it is a deprecated format, however it doesn't explain why Google still use it then.

https://www.ampproject.org/latest/blog/whats-in-an-amp-url/

The "*.cdn.ampproject.org" ones they call "AMP Cache" URLs and the "google.com/amp" ones they call "AMP Viewer" URLs. It seems like the "AMP Viewer" URLs are only produced automatically by Google in search result pages. But yeah, in any case, you can domain-front the "AMP Cache" URLs.

Also this google.com/amp viewer seems to use iFrames to download content from the CDN anyway.

comment:9 in reply to:  8 Changed 6 months ago by dcf

Replying to twim:

https://www.ampproject.org/latest/blog/whats-in-an-amp-url/

The "*.cdn.ampproject.org" ones they call "AMP Cache" URLs and the "google.com/amp" ones they call "AMP Viewer" URLs. It seems like the "AMP Viewer" URLs are only produced automatically by Google in search result pages. But yeah, in any case, you can domain-front the "AMP Cache" URLs.

Also this google.com/amp viewer seems to use iFrames to download content from the CDN anyway.

Aha, you're right. So the Amp Viewer URLs don't really buy us anything in terms of blocking resistance.

comment:10 Changed 6 months ago by dcf

I think this AMP cache idea is really viable and can be implemented quickly. twim, do you have the time and interest to work on an implementation in Snowflake?

This is the design I am picturing:

In the broker

Add a new route /amp/client. This is similar to the existing /client (clientOffers function), except that it conforms to AMP requirements. Specifically, when a request arrives:

  1. Take the final path component in request URL, decode it using the base64 URL-safe alphabet, and store that as offer.
  2. If ctx.snowflakes.Len() <= 0, return an AMP-appropriate "not found" response (in case AMP does not pass our current status 503 StatusServiceUnavailable unmodified).
  3. Keep the same logic up to answer := <-snowflake.answerChannel.
  4. When writing the answer, instead of writing it directly to the body, encode it using base64 and wrap it in the necessary AMP markup. (Side note: if AMP provides a way to pass JSON through unmodified (maybe the r does this?), that would be ideal.)
  5. In the <-time.After(time.Second * ClientTimeout) case, we may have to think of some other way to signal a timeout, in case the AMP cache doesn't pass the status 502 (StatusGatewayTimeout) unmodified.

We can discuss alternatives paths to /amp/client. We could even perhaps overload /client if something else in the request marks it as being AMP (but there's no need to be cute if we don't have to).

In the client

Add a new command-line flag -amp that sets a global AMP flag. My thinking here is that the AMP use case is similar enough to existing domain fronting use case that we can use the same conventions. These are the use cases I think should be supported:

access broker directly
client -url https://snowflake-broker.bamsoftware.com/
access broker through a domain front
client -url https://snowflake-broker.azureedge.net/ -front ajax.aspnetcdn.com
access broker via AMP cache
client -amp -url https://snowflake--broker-bamsoftware-com.cdn.ampproject.org/
access broker via domain-fronted AMP cache
client -amp -url https://snowflake--broker-bamsoftware-com.cdn.ampproject.org/ -front www.google.com

In BrokerChannel.Negotiate (rendezvous.go), when the global AMP flag is set,

  1. Instead of appending "client" to bc.url, append "c/s/<bc.Host>/amp/client/<random>/<base64(data)>".
  2. Use GET with an empty body, instead of POST.
  3. In the http.StatusOK case, read the body and strip out the AMP markup before deserializing the answer.
  4. In the http.StatusServiceUnavailable and http.StatusBadRequest cases, do whatever is needed to match how those errors are represented in the broker.

As a bit of perhaps premature optimization, we can probably remove the cache-inhibiting <random> from the URLs, because the client offers already contain ice-pwd and fingerprint fields that (I think) don't repeat.

comment:11 in reply to:  10 ; Changed 6 months ago by twim

Replying to dcf:

I think this AMP cache idea is really viable and can be implemented quickly. twim, do you have the time and interest to work on an implementation in Snowflake?

Yes, I think I can do that.

This is the design I am picturing:

In the broker

Add a new route /amp/client. This is similar to the existing /client (clientOffers function), except that it conforms to AMP requirements. Specifically, when a request arrives:
...

  1. In the <-time.After(time.Second * ClientTimeout) case, we may have to think of some other way to signal a timeout, in case the AMP cache doesn't pass the status 502 (StatusGatewayTimeout) unmodified.

We can discuss alternatives paths to /amp/client. We could even perhaps overload /client if something else in the request marks it as being AMP (but there's no need to be cute if we don't have to).

Another idea I got here about it is to always reply with 200, so any changes to AMP will less likely break things. Thus we can abstract RPC away from HTTP in a way and use different pluggable roundtrippers (like POST, AMP). It would make it easier to plug other OSSes.
Also we can't pass headers as we do in meek.

(Side note: if AMP provides a way to pass JSON through unmodified (maybe the r does this?), that would be ideal.)

JSON is probably fine if wrapped into <pre> but I haven't test it yet. I don't feel like using JSON will be as safe as Base64 in terms of survivability against modifications.

As a bit of perhaps premature optimization, we can probably remove the cache-inhibiting <random> from the URLs, because the client offers already contain ice-pwd and fingerprint fields that (I think) don't repeat.

I thought about it too, but it might be a slippery slope as you have to ensure that paths never repeat. Random prefix doesn't introduce such a bit overhead to worry about.

comment:12 in reply to:  11 ; Changed 6 months ago by dcf

Replying to twim:

Replying to dcf:

I think this AMP cache idea is really viable and can be implemented quickly. twim, do you have the time and interest to work on an implementation in Snowflake?

Yes, I think I can do that.

That's great. I can devote some time to assistance.

Do you have the infrastructure you need for testing? It looks like you have a way to make your own subdomains. The broker has automatic Let's Encrypt certificate support, but it requires you to have ports 443 and 80 free. If you don't have a spare server set up, I may be able to get you one.

  1. In the <-time.After(time.Second * ClientTimeout) case, we may have to think of some other way to signal a timeout, in case the AMP cache doesn't pass the status 502 (StatusGatewayTimeout) unmodified.

Another idea I got here about it is to always reply with 200, so any changes to AMP will less likely break things. Thus we can abstract RPC away from HTTP in a way and use different pluggable roundtrippers (like POST, AMP). It would make it easier to plug other OSSes.

Always using 200 for AMP is fine with me. If you need to add another layer around the existing JSON, that's fine. I don't want to overhaul the existing /client behavior though.

(Side note: if AMP provides a way to pass JSON through unmodified (maybe the r does this?), that would be ideal.)

JSON is probably fine if wrapped into <pre> but I haven't test it yet. I don't feel like using JSON will be as safe as Base64 in terms of survivability against modifications.

I'm thinking about the <script type="application/ld+json"> block. Perhaps we can stuff arbitrary data in there and the AMP cache won't touch it. But realistically speaking, base64 in the page body is likely good enough.

    <script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "NewsArticle",
      "headline": "Article headline",
      "image": [
        "thumbnail1.jpg"
      ],
      "datePublished": "2015-02-05T08:00:00+08:00"
    }
    </script>

comment:13 in reply to:  12 ; Changed 6 months ago by twim

Replying to dcf:

Do you have the infrastructure you need for testing? It looks like you have a way to make your own subdomains. The broker has automatic Let's Encrypt certificate support, but it requires you to have ports 443 and 80 free. If you don't have a spare server set up, I may be able to get you one.

Yes, I do. Thanks a lot, I guess I am fine with that.

I'm thinking about the <script type="application/ld+json"> block. Perhaps we can stuff arbitrary data in there and the AMP cache won't touch it. But realistically speaking, base64 in the page body is likely good enough.

Probably it will work out. Though it looks like a premature optimization for me because AMP cache may strip "unsupported data" from this JSON/whatever. It should never strip words and links and Base64 is more like them.

comment:14 in reply to:  13 Changed 6 months ago by dcf

Replying to twim:

I'm thinking about the <script type="application/ld+json"> block. Perhaps we can stuff arbitrary data in there and the AMP cache won't touch it. But realistically speaking, base64 in the page body is likely good enough.

Probably it will work out. Though it looks like a premature optimization for me because AMP cache may strip "unsupported data" from this JSON/whatever. It should never strip words and links and Base64 is more like them.

I think you're right. Do what you think is best.

comment:15 Changed 5 months ago by twim

I've (finally) plugged AMP support into both snowflake client and broker. You can find it at https://github.com/nogoegst/snowflake , "amp" branch.
It's worth to mention that the AMP cache turned out to impose some kind of "word length limit". Before I started inserting spaces to break down large base64 data I got magical things. AMP cache just trimmed randomly these lines, most of the time leaving around 4 chars. *shrug*

Sidenote: I made some code reorg and open for a discussion on that.

comment:16 Changed 5 months ago by dcf

Status: newneeds_revision

Thanks for this.

I've only taken a quick look. Here is some preliminary feedback.

  • Changing POST to use only status code 200 breaks backward compatibility. I'm opposed to this change. At any rate, it would require matching changes in /proxy and /proxy-go. But it's better to remain compatible.
  • Try to minimize the extent of changes. This ticket shouldn't require any changes to clientOffers, so I'm skeptical if I see any changes there. Don't worry about code duplication. Go ahead and write duplicative code. If needed, we can deduplicate in a refactoring. I expect that the new feature in the broker will require 1 or 2 new functions and one new line in main. It's fine if AMP uses JSON and POST does not. If there are code architecture changes they need to be in a separate commit from the functional changes.
  • I'd prefer not to depend on an external amper, if we can reasonably include the necessary code here.
  • -codec=post and -codec=amp is a nice idea. The -help hint should list all the possible values.
  • BrokerChannel.RoundTrip: it seems like this switch would be better done in main by setting a function pointer on bc. (Should fail immediately in main if the command line is bad.)
  • In the "/client/amp/" path, I'm not sure if a trailing slash is significant, but all the entries should either have a trailing slash or not have one.
  • Is it possible to avoid the vendoring changes, or at least get them in separate commits so they're separable from the functional changes?

The biggest thing for me is to avoid making unnecessary code changes and to keep backward compatibility. If you absolutely need a refactoring change, you can do it as a separate commit preliminary to making your functional changes. It should be of a nature that we would want the refactoring even if we weren't going to add the AMP feature. But I feel it's less risky (= more likely to get the patch accepted) to add the new feature first, in a minimal patch with clearly defined boundaries, and then refactor later.

comment:17 in reply to:  16 ; Changed 5 months ago by twim

Replying to dcf:

I've only taken a quick look. Here is some preliminary feedback.

Thanks for the feedback!

  • Changing POST to use only status code 200 breaks backward compatibility. I'm opposed to this change. At any rate, it would require matching changes in /proxy and /proxy-go. But it's better to remain compatible.
  • Try to minimize the extent of changes. This ticket shouldn't require any changes to clientOffers, so I'm skeptical if I see any changes there. Don't worry about code duplication. Go ahead and write duplicative code. If needed, we can deduplicate in a refactoring. I expect that the new feature in the broker will require 1 or 2 new functions and one new line in main. It's fine if AMP uses JSON and POST does not. If there are code architecture changes they need to be in a separate commit from the functional changes.

The point was to retain some of backward compatibility so old clients are able to connect to a broker.
I've added status codes back.

  • I'd prefer not to depend on an external amper, if we can reasonably include the necessary code here.

I think it's better to move codec logic out out the scope of snowflake itself. In case of amper the library handles domain fronting (which is not typical), adding padding and other. Anyway I just don't see the point of copy-pasting code if we can just vendor it.

  • -codec=post and -codec=amp is a nice idea. The -help hint should list all the possible values.

Fixed.

  • BrokerChannel.RoundTrip: it seems like this switch would be better done in main by setting a function pointer on bc. (Should fail immediately in main if the command line is bad.)

Command line should check these values if it needs to. The thing here is that this function pointers are messy and are being created in init function while the they do depend on BrokerContext state.

  • In the "/client/amp/" path, I'm not sure if a trailing slash is significant, but all the entries should either have a trailing slash or not have one.

This is just how it works. '/client' handles only this exact path, while '/client/amp/' handles everything with this prefix and sets up a redirect from '/client/amp/' to '/client/amp/'.

  • Is it possible to avoid the vendoring changes, or at least get them in separate commits so they're separable from the functional changes?

Right, I decided to leave vendoring for another ticket.

The biggest thing for me is to avoid making unnecessary code changes and to keep backward compatibility. If you absolutely need a refactoring change, you can do it as a separate commit preliminary to making your functional changes. It should be of a nature that we would want the refactoring even if we weren't going to add the AMP feature. But I feel it's less risky (= more likely to get the patch accepted) to add the new feature first, in a minimal patch with clearly defined boundaries, and then refactor later.

It's not that much of refactoring though. I'm not liking idea of creating duplicate code and adding completely different code - it becomes unmaintainable pretty quickly and 'refactor later' never comes.

comment:18 Changed 5 months ago by cypherpunks

Resolution: fixed
Status: needs_revisionclosed

comment:19 Changed 5 months ago by dcf

Resolution: fixed
Status: closedreopened

comment:20 Changed 5 months ago by dcf

Status: reopenedneeds_review

comment:21 Changed 5 months ago by dcf

I'm sorry for being harsh in my previous response. I don't mean to give the impression that I don't appreciate the work you're doing. I'm overly sensitive on this point because it was my failure to be firm with the flash proxy source code that led it to becoming an overly abstracted, tightly coupled mess.

I'm not against refactoring, even as part of this ticket. But I have to insist on functional changes and refactoring changes being in separate commits. For one thing, it enables discussing the functional and refactoring changes separately. But for another, it helps me reconstruct your thinking to justify the changes. Look at it from my point of view. When I'm reviewing code I would rather see steps A→B→C, where B is simple but introduces some obvious code duplication, and C refactors the duplication; than see A→C where I have to mentally reconstruct B in order to fully understand the change. It's better for git blame purposes too. I know you know these things too; I'm not trying to explain it to you, just letting you know what I value as a code reviewer.

WRT external interfaces, I know the programmer's instinct is always to generalize, generalize as soon as possible and promote interfaces to public APIs. I prefer to keep interfaces private whenever possible, not expose unneeded functionality, and only make an interface generic and external after it has endured some internal battle testing. You don't really know what you want from an external API until you have used it privately for a while anyway. And if it turns out that you never really need to make it public, that's even better, because every public interface is an ongoing maintenance burden. This is just my point of view and I respect that people can think differently. I'm not too worried about the external amper dep, because we can always remove it if we find it necessarily, when we eventually add TLS fingerprint camouflage or something like that. It just means an additional new sub-project in the Tor Browser build--it all adds up.

Anyway, I already like your branch in b75253ba55 better. I'll find some time to look at it. Maybe arlolra or someone else can look too. My time for this is very very limited but I am excited about the AMP Cache development.

comment:22 in reply to:  17 Changed 5 months ago by dcf

Replying to twim:

Replying to dcf:

  • In the "/client/amp/" path, I'm not sure if a trailing slash is significant, but all the entries should either have a trailing slash or not have one.

This is just how it works. '/client' handles only this exact path, while '/client/amp/' handles everything with this prefix and sets up a redirect from '/client/amp/' to '/client/amp/'.

Great, thanks. I was wondering whether we should add a trailing slash to the other entries but it sounds like it should be exactly the way you wrote it. "/client/amp/" gets a trailing slash because it takes additional information in the rest of the URL path; all the others do not get a trailing slash because we only want to match those exact paths.

comment:23 Changed 5 months ago by dcf

Summary: Add AMP cache as another domain fronting option with GoogleSnowflake rendezvous using AMP cache

comment:24 Changed 4 months ago by dcf

Status: needs_reviewneeds_revision

This is some review of commit b75253ba55bde140be733c4cf5425fc1aab161c4.

I had trouble testing this change locally because amper.Client was changing my http into https. This is what I did:

broker$ ./broker -disable-tls -addr 127.0.0.1:8080
proxy-go$ ./proxy-go -broker http://127.0.0.1:8080/
client$ TOR_PT_MANAGED_TRANSPORT_VER=1 TOR_PT_CLIENT_TRANSPORTS=snowflake ./client -codec amp -url http://127.0.0.1:8080/

The error I get from the client is a result of attempting to parse HTTP as TLS:

BrokerChannel Error: tls: oversized record received with length 20527

(On further reflection, what I was trying to do wouldn't have worked anyway, because the AMP client code would have sent a URL path of /c/s/.../amp/client/..., not /amp/client/..., but I could have worked around that for testing.) It looks like the cause is here in amper/client.go, with a hardcoded "https". Looking at that, I also noticed a CDNDomain = "cdn.ampproject.org". I have to say, it skeeves me a bit to see hardcoded constants like that in a library; it seems like the kind of thing that belongs in a configuration file.


I like the idea behind the commit that adds the -codec flag: generalize a currently hardcoded interface, then provide alternate implementations in a later commit. I must say, though, that I don't like the layering of clientOffersPOSTctx.handleClientOffersDecodeError. Instead of returning an error directly, ctx.handleClientOffers writes a JSON representation of the error into a buffer provided by the caller, which the caller must then convert back into an error by calling DecodeError. It seems unnecessarily convoluted just to enable the side effect of writing directly to the response body in the AMP case. This is a case where I would prefer to see duplicated code first, because I suspect we can find a better factoring.

In my testing, the DecodeError error recovery didn't work. It decodes to an Error that has the same E string value but is unequal to ErrNoSnowflakes, for example. When I run a master branch client (with POST) against an amp branch broker, it fails because the broker returns status 200 when it should return status 503. The culprit seems to be this error-matching switch. If I add a default case with a panic, I see that the ErrNoSnowflakes case isn't being matched, even though that is the intention:

http: panic serving 127.0.0.1:56174: {"error":"No snowflakes proxies available"}

Here's an example of what the AMP responses look like:

$ wget -q -O- http://127.0.0.1:8080/client/amp
<!doctype html>
<html amp>
  <head>
    <meta charset="utf-8">
    <title>amp</title>
    <link rel="canonical" href="https://ampproject.org" />
    <meta name="viewport" content="width=device-width,minimum-scale=1,initial-scale=1">
    <style>body {opacity: 0}</style><noscript><style>body {opacity: 1}</style></noscript>
    <script async src="https://cdn.ampproject.org/v0.js"></script>
  </head>
  <body>
    <p>In varietate concordia</p>
    <pre id="data">eyJlcnJvciI6Ik5vIHNub3dmbGFrZXMg cHJveGllcyBhdmFpbGFibGUifQo</pre>
  </body>
</html>
[The base64 decodes to '{"error":"No snowflakes proxies available"}'.]

The overall AMP container encoding looks pretty reasonable. A few tiny questions:

  • basic_markup says the <script async> has to be the second child of <head>, right after <meta charset="utf-8">.
  • The <style>body {opacity: 0}</style><noscript><style>body {opacity: 1}</style></noscript> doesn't match the documented AMP boilerplate code, which is much longer. Is there a reason?
  • <link rel="canonical" href="https://ampproject.org" />: it would be better not to refer to any third parties. The AMP guide says this can be a relative URL, so I would prefer one of these if they work:
    • <link rel="canonical" href="">
    • <link rel="canonical" href="#">
    • <link rel="canonical" href="?">

About the encoded data inside the AMP HTML container: I'm wondering if we should always put the data payload inside the same JSON object that can signal an error, rather than making the payload be sent instead of the error object when there is no error. This is how it currently works:

  1. Try to parse the response body as JSON with an "error" field. If the JSON parsing is successful, handle the error and stop.
  2. Otherwise (JSON parsing failed), return the entire response body as a blob of bytes.

That is, under the base64 encoding, a message can look either like this:

{"error": "No snowflake proxies available"}

or like this:

{"type": "answer", "sdp": "..."}

One problem with this design, as far as amper is concerned, is that it's impossible for the server to send a payload that happens to parse as JSON with an "error" field—it will be interpreted as an error, not a payload. This consideration doesn't affect Snowflake, because Snowflake's payloads don't have an "error" field even though they are JSON. But another consideration is that you can't tell the difference between a non-error response and an improperly encoded error response (if the body is truncated, for example).

I propose to remove this ambiguity by including the data payload inside the JSON object that represents an error. Just like in an HTTP response, we have two things—a status code and a message—inside one package. A message would either look like this:

{"error": "No snowflake proxies available", "data": null}

or like this:

{"error": null, "data": "eyJ0eXBlIjogImFuc3dlciIsICJzZ..."}

(The null fields are included for clarity and could be omitted in practice.) To handle this, a client needs to json.Unmarshal on a struct { Error, Data: string }. Any JSON decoding error signifies an error in the AMP transport, not a Snowflake-level error. Assuming JSON decoding is successful, the "error" field indicates whether there was a Snowflake-level error.

About signaling errors, I would prefer to use machine-readable strings (e.g. "ErrNoSnowflakes") rather than human-readable strings (e.g. "No snowflake proxies available"). The latter seems more likely to get broken accidentally by someone adding punctuation, for example. The variable names you've chosen are good, but I would just them as the actual values, not just the identifiers. We can set up a canonical mapping, e.g. 200⇒"", 503⇒"ErrNoSnowflakes", 504⇒"ErrAnswerTimeout", to map the status codes from POST into the error strings for common handling.


Does NewBrokerChannel need to instantiate an amper.Client when it's not using -codec amp?

comment:25 in reply to:  24 ; Changed 4 months ago by twim

Replying to dcf:
Thanks for the review!

For now I've fixed amper-related issues you brought up.

(On further reflection, what I was trying to do wouldn't have worked anyway, because the AMP client code would have sent a URL path of /c/s/.../amp/client/..., not /amp/client/..., but I could have worked around that for testing.) It looks like the cause is here in amper/client.go, with a hardcoded "https". Looking at that, I also noticed a CDNDomain = "cdn.ampproject.org". I have to say, it skeeves me a bit to see hardcoded constants like that in a library; it seems like the kind of thing that belongs in a configuration file.

Thanks to your suggestion I made some constants configurable. One can now specify Client.Scheme and Client.CDNDomain.
As you may guess, it was intended to work with the only AMP cache around thus the hardcoded values.

Here's an example of what the AMP responses look like:
...
The overall AMP container encoding looks pretty reasonable. A few tiny questions:

  • basic_markup says the <script async> has to be the second child of <head>, right after <meta charset="utf-8">.

You're right. Despite it is being accepted in validator, I moved it to the right place to avoid a breakage.

  • The <style>body {opacity: 0}</style><noscript><style>body {opacity: 1}</style></noscript> doesn't match the documented AMP boilerplate code, which is much longer. Is there a reason?

This is knowm as "old boilerlate" (https://github.com/ampproject/amphtml/commit/0a056ca50ac8cb9ba8e5a6489baeecb5ed958556#diff-17672f7dfc8b0583c360b45234dbf59a). The reason is that it's much shorter than the new one. I've picked this as a tradeoff to save bandwidth as the old boilerplate is still accepted by AMP CDN.
Now this is also configurable: set Server.UseOldAMPBoilterplate to true to use shorter boilerplate, or get completely valid 'modern' AMP page by default.

  • <link rel="canonical" href="https://ampproject.org" />: it would be better not to refer to any third parties. The AMP guide says this can be a relative URL, so I would prefer one of these if they work:
    • <link rel="canonical" href="">
    • <link rel="canonical" href="#">
    • <link rel="canonical" href="?">

Yep, changed to href="#" and it works.

Last edited 4 months ago by twim (previous) (diff)

comment:26 in reply to:  25 Changed 4 months ago by dcf

Replying to twim:

Replying to dcf:
Thanks to your suggestion I made some constants configurable. One can now specify Client.Scheme and Client.CDNDomain.
As you may guess, it was intended to work with the only AMP cache around thus the hardcoded values.

Great. I'm fine with there being defaults, as long as it's possible to override them with -front and -url like the examples in comment:10.

  • The <style>body {opacity: 0}</style><noscript><style>body {opacity: 1}</style></noscript> doesn't match the documented AMP boilerplate code, which is much longer. Is there a reason?

This is knowm as "old boilerlate" (https://github.com/ampproject/amphtml/commit/0a056ca50ac8cb9ba8e5a6489baeecb5ed958556#diff-17672f7dfc8b0583c360b45234dbf59a). The reason is that it's much shorter than the new one. I've picked this as a tradeoff to save bandwidth as the old boilerplate is still accepted by AMP CDN.
Now this is also configurable: set Server.UseOldAMPBoilterplate to true to use shorter boilerplate, or get completely valid 'modern' AMP page by default.

Thanks for the explanation. I didn't know about that. The old boilerplate is fine if it's shorter and it works, as long as there's a short comment explaining what's going on. I don't think it even needs to be configurable, necessarily.

comment:27 Changed 4 months ago by dcf

Today I learned via this OONI/AFTE report that the censors in Egypt block AMP since February 2018. I think what this means is that they block the AMP Cache.

After the spread of the use of the AMP service by some blocked websites in Egypt, including the Mada website, the Egyptian government resorted to blocking the service on February 3, 2018, which affected smart phone users using Google’s search engine for any website that uses AMP. Users were thus unable to access these websites, including websites that the Egyptian government has not blocked. Consequently, Google has announced the suspension of the service in Egypt.

Apparently it's also been blocked in Saudi Arabia:

But I'm guessing, since we'll front the requests to the AMP Cache, that these blocks don't matter to us.

comment:28 Changed 4 weeks ago by dcf

Now Microsoft is providing an AMP cache and AMP viewer via Bing.

https://amphtml.wordpress.com/2018/09/19/introducing-bing-amp-viewer-and-bing-amp-cache/

Here's an AMP cache link and the corresponding AMP viewer link (unlike Google, this one doesn't redirect you when viewed with a non-mobile device):

https://amp-space-com.bing-amp.com/c/s/amp.space.com/41904-london-band-rocks-with-nasa-shuttle-enterprise.html
https://www.bing.com/amp/s/amp.space.com/41904-london-band-rocks-with-nasa-shuttle-enterprise.html

For comparison, these are the corresponding Google links. Just change bing-amp.com to cdn.ampproject.org and www.bing.com to www.google.com.

https://amp-space-com.cdn.ampproject.org/c/s/amp.space.com/41904-london-band-rocks-with-nasa-shuttle-enterprise.html
https://www.google.com/amp/s/amp.space.com/41904-london-band-rocks-with-nasa-shuttle-enterprise.html

Domain fronting works with www.bing.com as a front, but not with www.microsoft.com or ajax.aspnetcdn.com.

$ wget https://www.bing.com/c/s/amp.space.com/41904-london-band-rocks-with-nasa-shuttle-enterprise.html --header 'Host: amp-space-com.bing-amp.com'

comment:29 Changed 4 days ago by dcf

Cloudflare also has an AMP cache, at amp.cloudflare.com. I noticed this while looking at the Citizen Lab test lists.

https://amp-space-com.amp.cloudflare.com/c/s/amp.space.com/41904-london-band-rocks-with-nasa-shuttle-enterprise.html

Apparently they also have an AMP viewer, but I haven't been able to find an example URL for it, except that you can opt out of the viewer at https://amp.cloudflare.com/viewer/optout.html.

Note: See TracTickets for help on using tickets.