BlogSet-BR English

Versão em Português

Description

BlogSet-BR is a collection of posts written by Brazilian users and gathered from the Blogspot platform. This resource has three files:

  • a compressed CSV containing only Brazilian posts,
  • an XLS file with survey answers, and
  • a tar.gz archive with the original JSON.

Download

Compressed CSV with 7.4 million Brazilian posts.

XLS with 4 thousand answers from Brazilian bloggers.

Compressed TAR with 3 million blogs gathered from Blogspot.

Instructions

The main file, blogset-br, can be opened in pandas with the commands below:

import pandas as pd
posts = pd.read_csv('blogset-br.csv.gz', compression='gzip', header=None)
# columns: post.id, .blog.id, .published, .title, .content, .author.id, .author.displayName, .replies.totalItems, tags
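Because the file has no header row, pandas numbers the columns 0 to 8 by default. The sketch below is one possible way to attach the column names listed in the comment above and to process the file in chunks so that all 7.4 million posts need not fit in memory at once. The names are copied verbatim from that comment, leading dots included. The helpers load_posts and count_posts_per_blog are hypothetical names introduced for illustration and are not part of the dataset distribution.

```python
from collections import Counter

import pandas as pd

# Column names copied verbatim from the comment in the loading snippet.
# Assumption: the order in that comment matches the column order in the file.
COLUMNS = [
    "post.id", ".blog.id", ".published", ".title", ".content",
    ".author.id", ".author.displayName", ".replies.totalItems", "tags",
]


def load_posts(path, nrows=None):
    """Read the gzip-compressed CSV into a DataFrame with named columns.

    Pass nrows to load only the first rows, e.g. for a quick look at the data.
    """
    return pd.read_csv(path, compression="gzip", header=None,
                       names=COLUMNS, nrows=nrows)


def count_posts_per_blog(path, chunksize=100_000):
    """Count posts per blog id, holding one chunk in memory at a time."""
    counts = Counter()
    reader = pd.read_csv(path, compression="gzip", header=None,
                         names=COLUMNS, chunksize=chunksize)
    for chunk in reader:
        # value_counts() gives the per-blog totals for this chunk only;
        # Counter.update() adds them to the running totals.
        counts.update(chunk[".blog.id"].value_counts().to_dict())
    return counts
```

For example, load_posts('blogset-br.csv.gz', nrows=1000) would return only the first thousand posts, which is usually enough to inspect the columns before committing to a full load.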

Citation

BibTeX and full text

Henrique D. P. dos Santos, Vinicius Woloszyn, and Renata Vieira. 2018. BlogSet-BR: A Brazilian Portuguese Blog Corpus. In Proceedings of the 11th edition of the Language Resources and Evaluation Conference, 7-12 May 2018, Miyazaki (Japan).

License

Apache License 2.0