Opened 7 weeks ago

Closed 2 weeks ago

#30731 closed task (fixed)

Publish post-sanitization broker logs

Reported by: dcf Owned by:
Priority: Medium Milestone:
Component: Circumvention/Snowflake Version:
Severity: Normal Keywords:
Cc: arlolra, cohosh, phw, dcf Actual Points:
Parent ID: Points:
Reviewer: phw Sponsor:

Description

#30693 deleted old unsanitized server logs. Before deletion, I extracted a sanitized CSV file that is enough to make graphs like this.

This ticket is to discuss whether the sanitized CSV is safe to publish and, if so, to publish it. The sanitized log is currently on the snowflake-broker host under the filename:

/var/log/snowflake-broker/broker.csv.xz

The scripts used to create it are

attachment:broker-logs.zip:ticket:30693

Child Tickets

Change History (7)

comment:1 Changed 7 weeks ago by dcf

Status: new → needs_review

comment:2 Changed 6 weeks ago by phw

Reviewer: phw
Status: needs_review → needs_information

The clientid column appears to be empty:

$ cut -d , -f 4 broker.csv | sort -n | uniq

clientid

Do we need it if it doesn't contain any information?

The only problem I can think of is somebody using the data set to confirm whether a given computer was a client or a proxy. Given the low resolution of the timestamps, the absence of client IDs, and the pseudonymous proxy IDs, I consider the risk low.

comment:3 Changed 6 weeks ago by cohosh

These graphs are really nice. Thanks for doing this! I'm wondering if we want something similar to the information in these graphs for the ongoing metrics collected from the broker (#21315).

I agree that this information should be fine to publish. We don't have any geoip information from proxies or clients either which keeps the risk low.

comment:4 in reply to: 2 Changed 6 weeks ago by dcf

Replying to phw:

The clientid column appears to be empty:

$ cut -d , -f 4 broker.csv | sort -n | uniq

clientid

Do we need it if it doesn't contain any information?

We can remove that column. I included the column to show that I would have used that information if it were present, but none of the broker's log messages has it.

The only problem I can think of is somebody using the data set to confirm whether a given computer was a client or a proxy. Given the low resolution of the timestamps, the absence of client IDs, and the pseudonymous proxy IDs, I consider the risk low.

I was thinking, maybe the truncated timestamps don't help so much. Proxies poll every 10 seconds, so if you see a certain proxy ID polled 10 times in the 00:00:00–00:10:00 interval, and 30 times in the 00:10:00–00:20:00 interval, then it's a good guess that the proxy was active starting around 00:08:40 and ending around 00:15:00. In the graphs, I further truncated the timestamps to 24-hour resolution, just to reduce the size of the data (see reduce.R). Maybe we should make the timestamps even coarser than 10 minutes?

comment:5 in reply to: 3 Changed 6 weeks ago by dcf

Replying to cohosh:

These graphs are really nice. Thanks for doing this! I'm wondering if we want something similar to the information in these graphs for the ongoing metrics collected from the broker (#21315).

Yes, I think it would be great to have similar graphs using up-to-date logs. The scripts in attachment:broker-logs.zip:ticket:30693 should still work with the current sanitized logs. We should also think about making the logs better reflect what we want to measure, because currently they require a fair bit of inference. Here's the main classification function from the process script that tries to regularize the log messages:

var regexpRecvSnowflake = regexp.MustCompile(`^Received snowflake:\s*([\w/+]*)`)
var regexpProxyNoClient = regexp.MustCompile(`^Proxy ([\w/+]*) did not receive a Client offer.`)

        if strings.HasPrefix(msg, "ACME hostnames: ") {
                err = w.Write([]string{timestamp, "start", "", "", ""})
        } else if strings.HasPrefix(msg, "http: TLS handshake error ") {
                err = w.Write([]string{timestamp, "error", "", "", "tls"})
        } else if strings.HasPrefix(msg, "http2: server: error ") {
                err = w.Write([]string{timestamp, "error", "", "", "http2"})
        } else if strings.HasPrefix(msg, "http2: received GOAWAY ") {
                err = w.Write([]string{timestamp, "error", "", "", "http2"})
        } else if msg == "Starting HTTP-01 listener" {
                // benign startup message; intentionally produces no CSV row
        } else if msg == "Invalid data." {
                err = w.Write([]string{timestamp, "error", "", "", "invalid-data"})
        } else if msg == "Mismatched IDs!" {
                err = w.Write([]string{timestamp, "proxy-gets-none", "", "", "mismatched-ids"})
        } else if msg == "Passing client offer to snowflake proxy." {
                err = w.Write([]string{timestamp, "client-offers", "", "", ""})
        } else if msg == "Client: No snowflake proxies available." {
                err = w.Write([]string{timestamp, "client-gets-none", "", "", "no-proxies"})
        } else if msg == "Client: Timed out." {
                err = w.Write([]string{timestamp, "client-gets-none", "", "", "timeout"})
        } else if matches := regexpRecvSnowflake.FindStringSubmatch(msg); matches != nil {
                err = w.Write([]string{timestamp, "proxy-polls", idFor(matches[1]), "", ""})
        } else if matches := regexpProxyNoClient.FindStringSubmatch(msg); matches != nil {
                err = w.Write([]string{timestamp, "proxy-gets-none", idFor(matches[1]), "", "no-clients"})
        } else if msg == "Passing client offer to snowflake." {
                err = w.Write([]string{timestamp, "proxy-gets-offer", "", "", ""})
        } else if strings.HasPrefix(msg, "Received answer: ") {
                err = w.Write([]string{timestamp, "proxy-answers", "", "", ""})
        } else if msg == "Client: Retrieving answer" {
                err = w.Write([]string{timestamp, "client-gets-answer", "", "", ""})
        } else {
                return fmt.Errorf("cannot parse line")
        }
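The idFor function used above maps raw proxy identifiers from the log to the pseudonymous proxy IDs in the CSV. Its real implementation is in the attached scripts; purely as a sketch of the general technique, a keyed hash with a throwaway key gives pseudonyms that are stable within one run but irreversible afterwards (function name aside, everything here is an assumption, not the scripts' actual code):

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// key is generated fresh per run and never stored, so pseudonyms
// cannot be linked back to inputs once sanitization finishes.
var key = func() []byte {
	k := make([]byte, 32)
	if _, err := rand.Read(k); err != nil {
		panic(err)
	}
	return k
}()

// idFor maps a raw proxy identifier to a stable pseudonym: the same
// input always yields the same output within a run, but the output
// reveals nothing about the input.
func idFor(raw string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(raw))
	return base64.RawStdEncoding.EncodeToString(mac.Sum(nil))[:12]
}

func main() {
	fmt.Println(idFor("proxyA") == idFor("proxyA")) // stable within a run
	fmt.Println(idFor("proxyA") == idFor("proxyB")) // distinct inputs differ
}
```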

comment:6 in reply to: 5 Changed 5 weeks ago by cohosh

Replying to dcf:

Replying to cohosh:

These graphs are really nice. Thanks for doing this! I'm wondering if we want something similar to the information in these graphs for the ongoing metrics collected from the broker (#21315).

Yes, I think it would be great to have similar graphs using up-to-date logs. The scripts in attachment:broker-logs.zip:ticket:30693 should still work with the current sanitized logs. We should also think about making the logs better reflect what we want to measure, because currently they require a fair bit of inference. Here's the main classification function from the process script that tries to regularize the log messages:

I made a ticket for this: #30830.

I'm trying to reason out the difference between analyzing the log output like we've done here and exporting formatted stats for the metrics like we're planning to do for the geoip metrics. Are there scenarios where it's better for us to analyze data in the way we've done here, or should we figure out how to make the graphs in #30693 with exported data?

comment:7 Changed 2 weeks ago by dcf

Resolution: fixed
Status: needs_information → closed

I've published the sanitized CSV file as well as the sanitization scripts at https://archive.org/details/snowflake-broker-logs-20190416.

I made a repo for the sanitization scripts at https://dip.torproject.org/dcf/snowflake-broker-log-sanitization, in case they are ever useful again.
