Datasets Documentation

Update Date: 2024-08-01

Datasets Providers

Datasets in Graphs

The entire archive is represented by the graphs below.

../_images/datasets_in_graphs.png

The raw data used to produce these graphs is available in corpus_report_2023_03_30.zip.

Corpus Report

The corpus report TSV file (corpus_report_2023_03_30.zip) lists all the documents available in the “Research Data” dataset. The fields of this TSV file are explained below.

Note

Note that this TSV file contains too many rows to be opened with Microsoft Excel or LibreOffice Calc. Its data must be parsed with a programming language such as Python.

With this TSV file, researchers can identify the files they intend to work with by filtering on some basic metadata, and estimate the local storage required to hold the selected dataset.
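The filtering and size estimation described above can be sketched in Python as follows. This is a minimal sketch, not an official tool: it assumes the TSV has a header row using the field names documented on this page, that it is tab-delimited, and that file_size holds integer byte counts. The function names (select_files, total_size_bytes) and the sample rows, including their paths and sizes, are made up for illustration.

```python
import csv
import io


def select_files(tsv_lines, corpus=None, file_types=None, year_range=None):
    """Yield rows (as dicts) that match every filter that is not None.

    tsv_lines is any iterable of text lines, e.g. an open file object.
    Assumes a header row with the field names documented on this page.
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    for row in reader:
        if corpus is not None and row["corpus"] != corpus:
            continue
        if file_types is not None and row["file_type"] not in file_types:
            continue
        if year_range is not None:
            try:
                year = int(row["publication_year"])
            except ValueError:
                # Skip rows whose year is not a plain integer.
                continue
            if not (year_range[0] <= year <= year_range[1]):
                continue
        yield row


def total_size_bytes(rows):
    """Sum the file_size field (bytes) over the given rows."""
    return sum(int(row["file_size"]) for row in rows)


# Made-up sample standing in for the real corpus report.
SAMPLE = (
    "file_identifier\tcorpus\tcollection\tfile_type\tfile_size\tpublication_year\n"
    "erudit/a/1.xml\terudit\terudit_journals\txml\t2048\t1999\n"
    "erudit/a/2.pdf\terudit\terudit_journals\tpdf\t4096\t2005\n"
    "banq/b/3.pdf\tbanq\tbanq\tpdf\t1024\t1950\n"
)

selected = list(
    select_files(io.StringIO(SAMPLE), corpus="erudit", year_range=(2000, 2010))
)
print(len(selected), total_size_bytes(selected))  # prints: 1 4096
```

Because the rows are read one at a time, the same functions should work on the full report by passing an open file object instead of the in-memory sample. The encoding of the real file is not stated here, so check it before opening.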

The fields available for each data provider are described in the table below.

Field name            Is mandatory  BAnQ  LAC  CRKN  Erudit
file_identifier       yes           yes   yes  yes   yes
container_identifier  no            no    no   no    yes
container_title       no            yes   no   no    yes
container_type        no            yes   yes  yes   yes
content_type          no            yes   yes  yes   yes
corpus                yes           yes   yes  yes   yes
collection            yes           yes   yes  yes   yes
file_type             yes           yes   yes  yes   yes
file_size             yes           yes   yes  yes   yes
publication_year      yes           yes   yes  yes   yes

Fields documentation

file_identifier

The unique file identifier. It represents the file path in the file system.

container_identifier

The unique identifier of the container of a document.

This field is not available for all datasets because some data providers do not supply this information.

It is an optional field.

container_title

The container title.

This field is not available for all datasets because some data providers do not supply this information.

It is an optional field.

Example:

The container title could be a journal title, a newspaper title, or a book title.

container_type

The container type.

We make our best effort to identify this information for all datasets, even when it was not provided to us.

It is an optional field.

Example:

  • journal

  • newspaper

  • meetings

  • book

  • other

  • etc

content_type

The content type.

We make our best effort to identify this information for all datasets, even when it is not provided to us.

It is an optional field.

Example:

  • document : The full text is contained in the file

  • asset : It represents the assets (images, tables, video, audio, etc.) of a document

  • page : Not the full text, but a single page of a document. Usually found in datasets of digitized images. To access the full text, you must identify the group of images that belong to the same document, as for example in the content of Cabinet Conclusions, 1944 to 1979. Unfortunately, there is sometimes no way to aggregate the pages of a document because no container data (container_identifier, container_title) is available.
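Reassembling a document from its pages, as described in the page entry above, could be sketched as follows. This is only an illustration under stated assumptions: rows are dicts keyed by the field names on this page, pages of the same document share a container_identifier, and sorting by file_identifier yields page order (this last point is an assumption that may not hold for every collection). The function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def group_pages_by_container(rows):
    """Group 'page' rows by container_identifier.

    Returns (groups, orphans):
      groups  maps container_identifier -> sorted list of file_identifier
      orphans lists pages with no container_identifier, which therefore
              cannot be aggregated into a document.
    """
    groups = defaultdict(list)
    orphans = []
    for row in rows:
        if row.get("content_type") != "page":
            continue
        container = (row.get("container_identifier") or "").strip()
        if container:
            groups[container].append(row["file_identifier"])
        else:
            orphans.append(row["file_identifier"])
    for pages in groups.values():
        # Assumption: lexical order of the path matches page order.
        pages.sort()
    return dict(groups), orphans


# Made-up rows for illustration only.
rows = [
    {"file_identifier": "x/doc1/p002.tif", "content_type": "page",
     "container_identifier": "doc1"},
    {"file_identifier": "x/doc1/p001.tif", "content_type": "page",
     "container_identifier": "doc1"},
    {"file_identifier": "x/loose/p009.tif", "content_type": "page",
     "container_identifier": ""},
    {"file_identifier": "x/a.xml", "content_type": "document",
     "container_identifier": "j1"},
]

groups, orphans = group_pages_by_container(rows)
print(groups)
print(orphans)
```

Pages returned in orphans correspond to the case where no container data is available. They can still be downloaded individually but cannot be tied back to a document from the report alone.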

corpus

The identifier of a data provider.

It is a mandatory field.

Example:

  • banq

  • baclac

  • canadiana

  • erudit

collection

The identifier of a collection provided by a data provider.

It is a mandatory field.

In some cases, for example in the Bibliothèque et Archives nationales du Québec dataset, we cannot clearly identify the collection a file belongs to. In those cases, the collection field is filled with the identifier of the data provider.

Example (data provider in parentheses):

  • canadiana_serial (canadiana)

  • conclusions_cabinet (baclac)

  • gazette_canada (baclac)

  • journaux_canadiens_francais (baclac)

  • erudit_journals (erudit)

  • banq (banq)

file_type

The extension of each file.

It is a mandatory field.

Example:

  • pdf

  • xml

  • gif

  • jpg

  • tif

  • txt

  • png

file_size

The size of the file, in bytes.

This field can be used to estimate the size of the dataset a researcher aims to download.

It is a mandatory field.
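As a rough illustration of using file_size to plan a download, the sketch below totals the bytes per (corpus, collection) pair and prints them in human-readable units. It assumes rows are dicts keyed by the field names on this page and that file_size parses as an integer. The helper names and the sample values are hypothetical.

```python
from collections import Counter


def size_by_collection(rows):
    """Return total bytes per (corpus, collection) pair."""
    totals = Counter()
    for row in rows:
        totals[(row["corpus"], row["collection"])] += int(row["file_size"])
    return totals


def format_bytes(n):
    """Format a byte count using binary units (1 KiB = 1024 B)."""
    n = float(n)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024 or unit == "TiB":
            return f"{n:.1f} {unit}"
        n /= 1024


# Made-up rows for illustration only.
rows = [
    {"corpus": "baclac", "collection": "gazette_canada", "file_size": "1048576"},
    {"corpus": "baclac", "collection": "gazette_canada", "file_size": "524288"},
    {"corpus": "banq", "collection": "banq", "file_size": "512"},
]

totals = size_by_collection(rows)
for (corpus, collection), size in sorted(totals.items()):
    print(corpus, collection, format_bytes(size))
```

The figures obtained this way only cover the files themselves. Actual disk usage can be somewhat higher depending on the file system.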

publication_year

The publication year of the content available in the file.

We make our best effort to identify this information for all datasets, even when it was not provided to us.

It is a mandatory field.