Opened 3 years ago

Last modified 3 years ago

#17588 new defect

GetTor Logging

Reported by: sukhbir
Owned by: ilv
Priority: Medium
Milestone:
Component: Applications/GetTor
Version:
Severity: Normal
Keywords:
Cc: mrphs, ilv
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

This is the main ticket for GetTor logging. Let's try to discuss everything here or you can open new tickets and reference this as the parent ticket.

GetTor's logging is important so that we can estimate how many users use it, what kind of bundles are important, etc. Note that we will not be storing any information that can identify users of the service; our intent is to store counters so that we can know how many requests we had.

Here is what we will be storing (counters):

  • Number of requests for the email bot. (A "request" is considered if we reply to an email with the links to the bundles.)
  • Number of requests for the other distribution channels: Twitter and XMPP. (A "request" is considered if we reply to a query with the links to the bundles.)
  • OS: Windows, Linux or OS X.
  • Locale: Language of the request (en, es, etc.)
  • Requests per day: this is useful in the event of censorship, to see whether the number of requests increased on a given day.

Talking with ilv, he described how we are storing user data.

  • In the SQLite database, we have a table that stores the SHA-256 hash of the address so that we can prevent GetTor from being spammed. Let's clear this after a day so that we don't keep the hashed email address for long. Also, since we are not actually sending out the bundles, we shouldn't enforce harsh limits on blacklisting addresses.
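A minimal sketch of the hashed-address table and the daily purge described above. The table name `users`, its columns, the lowercase normalization, and the function names are assumptions made for illustration, not GetTor's actual schema or code.

```python
import hashlib
import sqlite3
import time

DAY_SECONDS = 24 * 60 * 60


def create_table(conn):
    """Create the (hypothetical) table holding hashed addresses."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(id TEXT PRIMARY KEY, requests INTEGER, first_seen REAL)"
    )


def hash_address(address):
    """Return the hex SHA-256 digest of a normalized address (normalization is an assumption)."""
    return hashlib.sha256(address.strip().lower().encode("utf-8")).hexdigest()


def record_request(conn, address, now=None):
    """Count one request for the hashed address; the plain address is never stored."""
    now = time.time() if now is None else now
    digest = hash_address(address)
    updated = conn.execute(
        "UPDATE users SET requests = requests + 1 WHERE id = ?", (digest,)
    ).rowcount
    if updated == 0:
        conn.execute(
            "INSERT INTO users (id, requests, first_seen) VALUES (?, 1, ?)",
            (digest, now),
        )
    conn.commit()


def purge_old_hashes(conn, now=None):
    """Delete hashes first seen more than a day ago; return how many rows were removed."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "DELETE FROM users WHERE first_seen < ?", (now - DAY_SECONDS,)
    )
    conn.commit()
    return cur.rowcount
```

Running `purge_old_hashes` from a daily cron job would keep hashes for at most about two days in the worst case; a shorter interval tightens that bound.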

On a related note, after how many requests does an email address get blacklisted?

  • In the request table we only store the counter for the requests. This is fine. From the log files, we should extract the other information, update this request table and then use that to generate the automatic reports.

ilv, seems fine? Let's finalize this before the implementation.

Change History (13)

comment:1 in reply to:  description ; Changed 3 years ago by ilv

Replying to sukhbir:

This is the main ticket for GetTor logging. Let's try to discuss everything here or you can open new tickets and reference this as the parent ticket.


Sounds great, thank you.

Here is what we will be storing (counters):

  • Number of requests for the email bot. (A "request" is considered if we reply to an email with the links to the bundles.)
  • Number of requests for the other distribution channels: Twitter and XMPP. (A "request" is considered if we reply to a query with the links to the bundles.)


What about if a "request" is considered whenever we reply to an email (or a message, or a DM), and we make the distinction about the type of request i.e. between sending links, help message, mirrors?

  • OS: Windows, Linux or OS X.
  • Locale: Language of the request (en, es, etc.)


Sounds good, I agree.

  • Requests per day: this is useful in the event of censorship, to see whether the number of requests increased on a given day.


Is this related to the number of requests mentioned above (number of requests)?

Talking with ilv, he described how we are storing user data.

  • In the SQLite database, we have a table that stores the SHA-256 hash of the address so that we can prevent GetTor from being spammed. Let's clear this after a day so that we don't keep the hashed email address for long. Also, since we are not actually sending out the bundles, we shouldn't enforce harsh limits on blacklisting addresses.


Yes, clearing this after a day sounds good. I established a limit to avoid having to process zillions of automated requests. Main reason for this was to prevent an overload of the service, although we process the request anyways, so having a blacklist might be an overkill. However, a good reason to have a short limit for this is that in case someone is making automated requests we don't get our stats ruined.

On a related note, after how many requests does an email address get blacklisted?


Right now after 100 requests. I changed this a while ago in order to make some tests, and I didn't change it back. I think 10 or less should be a good number.

  • In the request table we only store the counter for the requests. This is fine. From the log files, we should extract the other information, update this request table and then use that to generate the automatic reports.


I think we have two options here:

1) extracting this from the log files, and then insert it into the database.
2) storing this on the database directly when we process a request.

I would say 2), because I think we should use the log files to help debugging if/when something goes wrong, not for extracting data.


What do you think?

comment:2 in reply to:  1 ; Changed 3 years ago by sukhbir

Replying to ilv:

What about if a "request" is considered whenever we reply to an email (or a message, or a DM), and we make the distinction about the type of request i.e. between sending links, help message, mirrors?

Do you mean individual counters for link, help, mirror etc?

Is this related to the number of requests mentioned above (number of requests)?

Just storing the date along with each request should work. dd-mm-yyyy. Sounds OK?

Yes, clearing this after a day sounds good. I established a limit to avoid having to process zillions of automated requests. Main reason for this was to prevent an overload of the service, although we process the request anyways, so having a blacklist might be an overkill. However, a good reason to have a short limit for this is that in case someone is making automated requests we don't get our stats ruined.

That's true.

Right now after 100 requests. I changed this a while ago in order to make some tests, and I didn't change it back. I think 10 or less should be a good number.

Where is this constant defined? Since this number is arbitrary in any case, 10 sounds OK.

I think we have two options here:

1) extracting this from the log files, and then insert it into the database.
2) storing this on the database directly when we process a request.

I would say 2), because I think we should use the log files to help debugging if/when something goes wrong, not for extracting data.

Do you think it matters if we are only reading from the log file? I am personally OK with both. The deciding factor here is which will take more time to implement? (Unless I am missing your concern with 2).

comment:3 in reply to:  2 ; Changed 3 years ago by ilv

Replying to sukhbir:

Replying to ilv:

What about if a "request" is considered whenever we reply to an email (or a message, or a DM), and we make the distinction about the type of request i.e. between sending links, help message, mirrors?

Do you mean individual counters for link, help, mirror etc?


Yes, that way we can know how many requests we receive for links vs mirrors, for instance.

Is this related to the number of requests mentioned above (number of requests)?

Just storing the date along with each request should work. dd-mm-yyyy. Sounds OK?


I'm a little confused here. This sounds like we are going to store each request, or maybe you are talking about extracting that info from the log files? If so, the counter for number of requests for each channel is going to be daily/weekly/monthly?

Right now after 100 requests. I changed this a while ago in order to make some tests, and I didn't change it back. I think 10 or less should be a good number.

Where is this constant defined? Since this number is arbitrary in any case, 10 sounds OK.


This is defined in smtp.cfg (and xmpp.cfg, twitter.cfg).
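For illustration, reading such a limit from a .cfg file could look like the sketch below. The section name `blacklist` and the option name `max_requests` are hypothetical; the real option names in smtp.cfg are not given in this ticket.

```python
import configparser

# Hypothetical config contents; the real smtp.cfg layout may differ.
SAMPLE_CFG = """
[blacklist]
max_requests = 10
"""


def load_max_requests(text, default=10):
    """Read the blacklist limit from config text, falling back to a default."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser.getint("blacklist", "max_requests", fallback=default)


def is_blacklisted(request_count, max_requests):
    """An address is blacklisted once it reaches the configured limit."""
    return request_count >= max_requests
```

Keeping the limit in the config file means it can be lowered from 100 to 10 without touching the code.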

I think we have two options here:

1) extracting this from the log files, and then insert it into the database.
2) storing this on the database directly when we process a request.

I would say 2), because I think we should use the log files to help debugging if/when something goes wrong, not for extracting data.

Do you think it matters if we are only reading from the log file? I am personally OK with both. The deciding factor here is which will take more time to implement? (Unless I am missing your concern with 1).


Hmm, on a second thought, maybe we should go for 2), this way we avoid interacting with the database on each request, and we only do it once a day or so.

comment:4 in reply to:  3 ; Changed 3 years ago by sukhbir

Replying to ilv:

Yes, that way we can know how many requests we receive for links vs mirrors, for instance.

OK.

I'm a little confused here. This sounds like we are going to store each request, or maybe you are talking about extracting that info from the log files? If so, the counter for number of requests for each channel is going to be daily/weekly/monthly?

From the log files. I am not sure if we are putting the date which we can use as a delimiter. Or is there another way you have in mind?

Hmm, on a second thought, maybe we should go for 2), this way we avoid interacting with the database on each request, and we only do it once a day or so.

Another valid point. Let's go with 2).

comment:5 in reply to:  4 ; Changed 3 years ago by ilv

I'm a little confused here. This sounds like we are going to store each request, or maybe you are talking about extracting that info from the log files? If so, the counter for number of requests for each channel is going to be daily/weekly/monthly?

From the log files. I am not sure if we are putting the date which we can use as a delimiter. Or is there another way you have in mind?


Ok. Yes, we are currently putting the date, so we can use that.

Hmm, on a second thought, maybe we should go for 2), this way we avoid interacting with the database on each request, and we only do it once a day or so.

Another valid point. Let's go with 2).


Ok, good!

comment:6 in reply to:  5 Changed 3 years ago by sukhbir

Replying to ilv:

I'm a little confused here. This sounds like we are going to store each request, or maybe you are talking about extracting that info from the log files? If so, the counter for number of requests for each channel is going to be daily/weekly/monthly?

From the log files. I am not sure if we are putting the date which we can use as a delimiter. Or is there another way you have in mind?


Ok. Yes, we are currently putting the date, so we can use that.

Hmm, on a second thought, maybe we should go for 2), this way we avoid interacting with the database on each request, and we only do it once a day or so.

Another valid point. Let's go with 2).


Ok, good!

Summarizing:

  • Number of requests for the channel (email, Twitter, XMPP)
  • Type of request (link, mirror)
  • OS
  • Locale
  • Request per day
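A rough sketch of how these counters could be built from the log files. The log line format used here (`date time; channel; type; os; locale`) is an assumption for illustration only; the actual GetTor log format is not shown in this ticket.

```python
from collections import Counter

# Counter names follow the summary above; "day" backs the requests-per-day counter.
FIELDS = ("day", "channel", "type", "os", "locale")


def parse_line(line):
    """Parse one hypothetical log line; return None if it does not match."""
    parts = [part.strip() for part in line.split(";")]
    if len(parts) != 5:
        return None
    timestamp, channel, request_type, os_name, locale = parts
    return {
        "day": timestamp.split(" ")[0],
        "channel": channel,
        "type": request_type,
        "os": os_name,
        "locale": locale,
    }


def aggregate(lines):
    """Return one Counter per field, skipping unparsable lines and empty values."""
    counters = {name: Counter() for name in FIELDS}
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        for name in FIELDS:
            if record[name]:
                counters[name][record[name]] += 1
    return counters
```

Only counts are kept; nothing in the result identifies an individual request or user.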

comment:7 Changed 3 years ago by sukhbir

OK, now we come to the scripts for generating the report since we finished parsing the logs.

What kind of information goes in the report? We need to decide on a format. Or, we can just use the above information and present it, one per line (number of requests, OS, etc.)

comment:8 Changed 3 years ago by ilv

I think we should include all the info that we discussed. Given that the automated report will be sent once a month, we should add up all the "Request per day" for that. We can even generate a simple graph that shows number of requests vs day for that month, and this could be attached to the report email. Makes sense?

Also, we could use something similar to this as a template for the report:

Hi, this is the GetTor robot.

Below you will find the usage statistics corresponding to $MONTH, $YEAR:

+++++++++++++++++++++++++++++ BEGIN REPORT +++++++++++++++++++++++++++++++++++++

 [*] Requests received (total): X

 [*] Requests per channel
    * Email: X1.1
    * XMPP: X1.2
    * Twitter: X1.3

 [*] Requests per type
    * help: X2.1
    * links: X2.2
    * mirrors: X2.3

 [*] Requests per OS
    * Windows: X3.1
    * Linux: X3.2
    * OSX: X3.3

 [*] Requests per locale
    * English (en-US): X4.1
    * Farsi (fa): X4.2
    * Chinese (zh-CN): X4.3
    * Turkish (tr): X4.4

+++++++++++++++++++++++++++++ END REPORT +++++++++++++++++++++++++++++++++++++++

That is all for now. Have a nice day!
--
GetTor robot
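One way to fill in a template like this is Python's `string.Template`, which uses the same `$MONTH` / `$YEAR` placeholder style. This is only a sketch: the `$BODY` placeholder and the function names are assumptions, and the BEGIN/END REPORT markers are omitted for brevity.

```python
from string import Template

# Shortened version of the template above; $BODY is a hypothetical placeholder.
REPORT = Template(
    "Hi, this is the GetTor robot.\n\n"
    "Below you will find the usage statistics corresponding to $MONTH, $YEAR:\n\n"
    "$BODY\n\n"
    "That is all for now. Have a nice day!\n"
    "--\n"
    "GetTor robot\n"
)


def render_section(title, counts):
    """Render one ' [*] title' block with one '* name: count' line per entry."""
    lines = [" [*] %s" % title]
    for name, count in counts.items():
        lines.append("    * %s: %d" % (name, count))
    return "\n".join(lines)


def render_report(month, year, sections):
    """Fill the template; sections is a list of (title, counts) pairs."""
    body = "\n\n".join(render_section(title, counts) for title, counts in sections)
    return REPORT.substitute(MONTH=month, YEAR=year, BODY=body)
```

`substitute` raises `KeyError` if a placeholder is left unfilled, which helps catch an incomplete report before it is emailed.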

comment:9 Changed 3 years ago by sukhbir

OK so I finalized the automatic report generation. Here is what the output looks like:

@ GetTor Report for December 2015

We received a total of 920 requests in December, with a peak of 406 requests on December 4.

[*] Request
            help: 783
           links: 131
       blacklist: 3
         mirrors: 3

[*] OS
         windows: 127
           linux: 2
             osx: 2

[*] Language
              en: 859
              zh: 42
              fa: 16

[*] Channel
            smtp: 920

Let me know if you want any changes (including language)? We can do fancy stuff later but for now I wanted to focus on getting a basic report out!
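For reference, the layout above (labels right-aligned to a fixed width, entries sorted by count) and the total/peak summary line can be produced with a few lines of Python. This is a sketch under assumptions, not the contents of report.py; the function names and the width of 16 are inferred from the sample output.

```python
def summarize(per_day):
    """Return (total, peak_day, peak_count) from a {day: count} mapping."""
    total = sum(per_day.values())
    peak_day = max(per_day, key=per_day.get)
    return total, peak_day, per_day[peak_day]


def format_section(title, counts, width=16):
    """Format one '[*] title' block, highest count first, labels right-aligned."""
    lines = ["[*] %s" % title]
    # Sort by descending count, then alphabetically to keep ties stable.
    for name, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        lines.append("%s: %d" % (name.rjust(width), count))
    return "\n".join(lines)
```

For example, `format_section("OS", {"linux": 2, "windows": 127, "osx": 2})` reproduces the OS block shown in the report.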

comment:10 Changed 3 years ago by ilv

Thanks Sukhbir, it looks great! We can add improvements to the format later (we should open new tickets for that). Can you push it to the develop branch (or attach a patch if you prefer)? The only thing remaining would be to have a cronjob to clean the log files on a daily basis.

comment:11 in reply to:  10 ; Changed 3 years ago by sukhbir

Replying to ilv:

Thanks Sukhbir, it looks great! We can add improvements to the format later (we should open new tickets for that). Can you push it to the develop branch (or attach a patch if you prefer)? The only thing remaining would be to have a cronjob to clean the log files on a daily basis.

Let's parse the logs daily, populate the db, clear the logs and then generate the stats once a month. Sounds OK?
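A sketch of that workflow: a daily job adds the parsed counts to the database and truncates the log, and a monthly job sums the stored counts. The `requests` table layout and all function names here are assumptions for illustration, not the actual GetTor schema.

```python
import sqlite3


def ensure_schema(conn):
    """Create the (hypothetical) per-day counter table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requests ("
        "day TEXT, field TEXT, value TEXT, total INTEGER, "
        "PRIMARY KEY (day, field, value))"
    )


def store_counts(conn, day, field, counts):
    """Add the counts for one day and one field (e.g. 'os') to the table."""
    for value, n in counts.items():
        updated = conn.execute(
            "UPDATE requests SET total = total + ? "
            "WHERE day = ? AND field = ? AND value = ?",
            (n, day, field, value),
        ).rowcount
        if updated == 0:
            conn.execute(
                "INSERT INTO requests (day, field, value, total) VALUES (?, ?, ?, ?)",
                (day, field, value, n),
            )
    conn.commit()


def clear_log(path):
    """Truncate the log file once its contents are safely in the database."""
    with open(path, "w"):
        pass


def monthly_totals(conn, month_prefix, field):
    """Sum the stored counts for a month, given a prefix such as '2015-12'."""
    rows = conn.execute(
        "SELECT value, SUM(total) FROM requests "
        "WHERE field = ? AND day LIKE ? GROUP BY value",
        (field, month_prefix + "%"),
    )
    return dict(rows.fetchall())
```

Clearing the log only after `store_counts` has committed avoids losing a day's counts if the database write fails.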

comment:12 Changed 3 years ago by sukhbir

I have pushed the file as report.py (develop branch)

comment:13 in reply to:  11 Changed 3 years ago by ilv

Replying to sukhbir:

Let's parse the logs daily, populate the db, clear the logs and then generate the stats once a month. Sounds OK?


Sounds OK!

Replying to sukhbir:

I have pushed the file as report.py (develop branch)


Great, thank you! I will take a look at it and let you know my comments asap.