Opened 6 years ago

Closed 17 months ago

Last modified 12 months ago

#10479 closed task (wontfix)

Export blog posts from current blog

Reported by: phobos
Owned by: ultrasandwich
Priority: Medium
Milestone: 2014 Tor Blog Replacement
Component: Webpages/Blog
Version:
Severity: Normal
Keywords:
Cc:
Actual Points:
Parent ID: #10022
Points:
Reviewer:
Sponsor:

Description

Export everything in some sane format for future imports (HTML vs. Markdown?)

Attachments (4)

entries.json (2.6 MB) - added by ultrasandwich 6 years ago.
JSON blob of blog entries
entries3.json.zip (2.8 MB) - added by ultrasandwich 5 years ago.
Updated, includes more recent posts + comments
events.json (99.1 KB) - added by ultrasandwich 5 years ago.
Updated events JSON file.
markdown.zip (86.6 KB) - added by ultrasandwich 5 years ago.
Attached a zip of 25 example markdown-formatted files, generated from the JSON. Should be good to go for all posts now.

Change History (19)

comment:1 Changed 6 years ago by rndm

Is there a publicly available archive or database dump with all current blog posts or should one scrape blog.torproject.org?

comment:2 Changed 6 years ago by ultrasandwich

I'm also curious about this. Guessing a database dump is an ideal way to accomplish this since the blog is on Drupal. Posted a question on the www-team mailing list earlier: https://lists.torproject.org/pipermail/www-team/2014-January/000000.html

Last edited 6 years ago by ultrasandwich

comment:3 Changed 6 years ago by phobos

There is RSS. And web scraping. That's it.
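
For the RSS route, a minimal sketch of pulling post metadata out of a feed might look like the following. It assumes a standard RSS 2.0 layout (`item` elements with `title`, `link`, and `pubDate` children); the function name `parse_rss_items` and the sample feed are hypothetical and only there to make the sketch runnable offline. Feeds often list only the most recent posts, so this alone may not reach a full archive.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text):
    """Return a list of dicts (title, url, date) for each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns None when the child element is missing
            "title": (item.findtext("title") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
            "date": (item.findtext("pubDate") or "").strip(),
        })
    return items


# Hypothetical sample feed, inlined so the sketch runs without network access.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>https://blog.example.org/first-post</link>
      <pubDate>Mon, 06 Jan 2014 12:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""

if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE_FEED):
        print(entry["date"], entry["title"], entry["url"])
```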

comment:4 Changed 6 years ago by ultrasandwich

Owner: set to ultrasandwich
Status: new → accepted

comment:5 Changed 6 years ago by ultrasandwich

I managed to scrape all 589 entries.

All saved into a JSON file with the following properties:

  1. id
  2. url
  3. title
  4. date
  5. author name
  6. tags
  7. content
  8. comments

Each entry can be spit back out into separate markdown files or whatever.

I don't have the comments. Should be trivial to add these in though.

[UPDATE: added comments]

Last edited 6 years ago by ultrasandwich

Changed 6 years ago by ultrasandwich

Attachment: entries.json added

JSON blob of blog entries

comment:6 Changed 6 years ago by ultrasandwich

I wrote a Ruby script that will output these into frontmatter markdown files, when the time comes.
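
The Ruby script itself is not attached here. As a rough illustration of the same idea, here is a Python sketch that turns one scraped entry into a markdown file with front matter. The key names (`id`, `url`, `title`, `date`, `author`, `tags`, `content`) are assumptions based on the property list in comment:5, and the helpers `slugify`, `render_entry`, and `write_entries` are hypothetical. The content is passed through unchanged, so any HTML in it stays HTML.

```python
import json
import re
from pathlib import Path


def slugify(title):
    """Lowercase the title and collapse runs of non-alphanumerics into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def render_entry(entry):
    """Render one entry dict as a document with a front matter header."""
    # json.dumps emits double-quoted strings and bracketed lists, which
    # YAML parsers generally accept as flow-style values.
    lines = [
        "---",
        f"title: {json.dumps(entry['title'])}",
        f"date: {json.dumps(entry['date'])}",
        f"author: {json.dumps(entry['author'])}",
        f"tags: {json.dumps(entry['tags'])}",
        f"original_url: {json.dumps(entry['url'])}",
        "---",
        "",
        entry["content"],
    ]
    return "\n".join(lines) + "\n"


def write_entries(entries, out_dir):
    """Write each entry to <out_dir>/<id>-<slug>.md and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for entry in entries:
        path = out / f"{entry['id']}-{slugify(entry['title'])}.md"
        path.write_text(render_entry(entry), encoding="utf-8")
        paths.append(path)
    return paths
```

If the attached JSON turns out to be a top-level list of such objects (also an assumption), something like `write_entries(json.loads(Path("entries.json").read_text(encoding="utf-8")), "posts")` would produce one file per post.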

comment:7 in reply to:  5 ; Changed 6 years ago by lunar

Replying to ultrasandwich:

I managed to scrape all 589 entries.

Great! Have you also scraped the “events” part? Would it be too much to ask to do so in case you haven't?

comment:8 in reply to:  7 Changed 6 years ago by ultrasandwich

Replying to lunar:

Replying to ultrasandwich:

I managed to scrape all 589 entries.

Great! Have you also scraped the “events” part? Would it be too much to ask to do so in case you haven't?

Haven't scraped the events yet, but will give it a shot this weekend.

[UPDATE:] all posts as of 5/7/2014 are scraped and rendered into separate markdown files: https://github.com/eschaefer/tor-blog.

Last edited 5 years ago by ultrasandwich

Changed 5 years ago by ultrasandwich

Attachment: entries3.json.zip added

Updated, includes more recent posts + comments

Changed 5 years ago by ultrasandwich

Attachment: events.json added

Updated events JSON file.

Changed 5 years ago by ultrasandwich

Attachment: markdown.zip added

Attached a zip of 25 example markdown-formatted files, generated from the JSON. Should be good to go for all posts now.

comment:9 Changed 5 years ago by phobos

This is great. I think this needs to be slightly more thought out and planned. Perhaps we should create a project for it in Trac so we can work towards the final goal: blog.torproject.org hosted as a static site in Jekyll, with a git backend and a comment server somewhere else.

comment:10 Changed 5 years ago by phobos

How long does it take to scrape the content?

comment:11 in reply to:  10 Changed 5 years ago by ultrasandwich

Replying to phobos:

How long does it take to scrape the content?

About 10 minutes or so.

comment:12 Changed 5 years ago by Sebastian

Component: Website → Blog

Moving to new Blog component

comment:13 Changed 22 months ago by teor

Severity: Normal

Set all open tickets without a severity to "Normal"

comment:14 Changed 17 months ago by arma

Resolution: wontfix
Status: accepted → closed

Ok, 4 years have passed, and in the meantime we moved our blog to a new version of Drupal.

We don't have any current plans to do another blog migration yet, so I'm going to close these tickets in the meantime.

Thanks everybody, especially ultrasandwich (and sorry for dropping this until now).

comment:15 Changed 12 months ago by teor

Component: Internal Services/Blog → Webpages/Blog

Moved closed tickets from Internal Services/Blog to Webpages/Blog
