Opened 5 months ago

Closed 4 months ago

#34063 closed task (fixed)

[RT-admin] Check if spam filter script is running

Reported by: ggus
Owned by: anarcat
Priority: Medium
Milestone:
Component: Internal Services/Services Admin Team
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

According to the RT service documentation[1], RT runs some maintenance actions, such as spam training. Since we're receiving a lot of spam, we should verify that the spam filter is actually running.

Spam training

Every mail sent to RT is also sent to the rtmailarchive account. This is required to be able to train SpamAssassin as it can only learn from unaltered email messages.

A three-step cron job runs daily.

Step 1: Every mail in Maildir/.help* is checked against RT. For each message, we look up a matching ticket using the Message-Id header. If the ticket is in a help* queue and has status resolved, we move it to the ham training folder. If the ticket is in the spam queue and has status resolved, we move it to the spam training folder. If the file is more than 100 days old, we delete it.

Step 2: SpamAssassin is fed the contents of the ham and spam training folders. After processing, each message is moved to the corresponding learned folder.

Step 3: Messages in the learned folders are deleted.
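
The Step 1 decision described above can be sketched roughly as follows. This is only an illustration, not the actual train_spam_filters code: the function name `classify`, the return labels, and the order of the checks (training takes precedence over age-based deletion) are all assumptions.

```python
from datetime import timedelta
from typing import Optional

# Age limit taken from the documentation quoted above.
MAX_AGE = timedelta(days=100)


def classify(queue: Optional[str], status: Optional[str], age: timedelta) -> str:
    """Decide what to do with one archived message.

    queue and status describe the ticket matched via the Message-Id
    header (None when no ticket was found); age is the age of the file.
    Returns "ham", "spam", "delete" or "keep" (hypothetical labels).
    """
    if queue is not None and status == "resolved":
        if queue.startswith("help"):
            return "ham"    # resolved ticket in a help* queue
        if queue == "spam":
            return "spam"   # resolved ticket in the spam queue
    if age > MAX_AGE:
        return "delete"     # too old to be useful for training
    return "keep"           # not decided yet, look again tomorrow
```

A message left as "keep" is simply examined again on the next daily run, until its ticket is resolved or the file ages out.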

[1]
https://trac.torproject.org/projects/tor/wiki/org/operations/services/rt.torproject.org#Spamtraining

Change History (4)

comment:1 Changed 4 months ago by anarcat

Owner: set to anarcat
Status: new → accepted

there's this cronjob in the rtmailarchive user:

@daily /srv/rtstuff/support-tools/train-spam-filters/train_spam_filters && bin/spam-learn && find Maildir/.spam.learned Maildir/.xham.learned -type f -delete

so it *should* be running... it last ran yesterday:

May 20 00:00:01 rude/rude CRON[7781]: (rtmailarchive) CMD (/srv/rtstuff/support-tools/train-spam-filters/train_spam_filters && bin/spam-learn && find Maildir/.spam.learned Maildir/.xham.learned -type f -delete)

and of course, if i run it by hand, it crashes in a flaming heap of this backtrace:

rtmailarchive@rude:~$ /srv/rtstuff/support-tools/train-spam-filters/train_spam_filters && bin/spam-learn && find Maildir/.spam.learned Maildir/.xham.learned -type f -delete
Traceback (most recent call last):
  File "/srv/rtstuff/support-tools/train-spam-filters/train_spam_filters", line 114, in <module>
    con = psycopg2.connect(RT_CONNINFO)
  File "/usr/lib/python2.7/dist-packages/psycopg2/__init__.py", line 164, in connect
    conn = _connect(dsn, connection_factory=connection_factory, async=async)
psycopg2.OperationalError: could not translate host name "drobovi.torproject.org" to address: Name or service not known

so no, it's not working, thanks for the report.
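
The traceback above shows a host name lookup failure at connect time. A small pre-flight check along these lines could surface that failure mode explicitly before the script tries to connect. This is a sketch under assumptions: `host_resolves` is a hypothetical helper and is not part of train_spam_filters.

```python
import socket


def host_resolves(hostname: str) -> bool:
    """Return True if the system resolver can translate hostname to an address."""
    try:
        socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # Same class of failure as "Name or service not known" above.
        return False
    return True


# Example use before connecting to the database (host name from the traceback):
# if not host_resolves("drobovi.torproject.org"):
#     raise SystemExit("database host does not resolve, not training today")
```

Exiting with a non-zero status and a clear message would also make the failure easier to spot in cron mail than a Python traceback.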

comment:2 Changed 4 months ago by anarcat

Status: accepted → needs_review

the authentication was broken, and i fixed it. now the cron job is running, but it might take some time to complete.

comment:3 Changed 4 months ago by anarcat

the job is still running. it had about 210,000 messages earlier on, and it's down to about 200,000 messages now. there's a screen showing disk usage, top and ncdu, along with the actual job. disk space usage is also going down: it was at 80% and is already down to 76%. no idea when this will complete, however.

and because i don't trust this will do the right thing (e.g. locking), i've disabled the cron job until this batch completes.
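
for reference, one common way to keep two runs from overlapping is an advisory lock taken at startup. the following is a minimal sketch, assuming a POSIX system; `single_instance` and the lock file path are hypothetical, and whether the existing scripts already do any locking is not known here.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path):
    """Yield True if we hold an exclusive lock on lock_path, False otherwise.

    Uses a non-blocking flock(), so a second run gives up immediately
    instead of queueing behind a long-running batch.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False  # another run already holds the lock
            return
        yield True
    finally:
        os.close(fd)  # closing the descriptor also releases the lock


# Example use (hypothetical path):
# with single_instance("/home/rtmailarchive/train.lock") as locked:
#     if locked:
#         ...  # run the three training steps
```

with something like this in place, a daily cron run that starts while a large backlog is still being processed would exit quietly rather than work on the same Maildir folders.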

comment:4 Changed 4 months ago by anarcat

Resolution: fixed
Status: needs_review → closed

i fixed the authentication problem and the script finally completed some time last night. it processed about 50,000 messages and cleaned up 20% of the disk space.

i have re-enabled the cron job.

it definitely needs some work. i've documented the design and especially current limitations here:

https://help.torproject.org/tsa/howto/rt/#Possible_improvements

feel free to reopen (or open a different one) if things do not work as well as you would like.
