BlogSet-BR English

Portuguese version


BlogSet-BR is a collection of posts written by Brazilian users and gathered from the Blogspot platform. This resource has three files:

  • a compressed CSV containing only the Brazilian posts,
  • an XLS file with survey answers, and
  • a tar.gz with the original JSON.


Compressed CSV with 7.4 million Brazilian posts.

XLS with 4 thousand answers from Brazilian bloggers.

Compressed TAR with 3 million blogs gathered from Blogspot.


The main file, blogset-br, can be opened in pandas with the code below:

import pandas as pd
posts = pd.read_csv('blogset-br.csv.gz', compression='gzip', header=None)
# columns:,, .published, .title, .content,, .author.displayName, .replies.totalItems, tags
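
With several million posts, loading the whole CSV at once may use a lot of memory. As an alternative, here is a minimal sketch that streams the file row by row using only the Python standard library. The helper names iter_posts and count_posts are hypothetical, and UTF-8 encoding is an assumption that has not been checked against the actual file.

```python
import csv
import gzip

# Blog posts can be long, so raise the csv module's per-field size limit.
csv.field_size_limit(2**31 - 1)


def iter_posts(path, encoding="utf-8"):
    """Yield each CSV row as a list of strings, one row at a time.

    The encoding default is an assumption, not a documented property of the file.
    """
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with gzip.open(path, mode="rt", encoding=encoding, newline="") as handle:
        yield from csv.reader(handle)


def count_posts(path):
    """Count rows without holding the whole file in memory."""
    return sum(1 for _ in iter_posts(path))
```

Calling count_posts('blogset-br.csv.gz') would then return the number of rows, and iter_posts can be used in a for loop to filter or sample posts before building a smaller DataFrame.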


BibTeX and full text

Henrique D. P. dos Santos, Vinicius Woloszyn, and Renata Vieira. 2018. BlogSet-BR: A Brazilian Portuguese Blog Corpus. In Proceedings of the 11th edition of the Language Resources and Evaluation Conference, 7-12 May 2018, Miyazaki (Japan).


Apache License 2.0