Datasets Documentation
Update Date: 2024-08-01
Datasets Providers
Datasets in Graphs
The entire archive is represented by the graphs below.
The raw data used to produce these graphs is available in corpus_report_2023_03_30.zip.
Corpus Report
The corpus report TSV file (corpus_report_2023_03_30.zip) lists all the documents available in the “Research Data” dataset. The fields of this TSV file are explained below.
Note
This TSV file contains too many rows to be opened with Microsoft Excel or LibreOffice Calc. Its data must be parsed with a programming language such as Python.
With this TSV file, researchers can identify the files they intend to work with by filtering on basic metadata, and estimate the local storage required for the selected files.
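The filtering and storage estimate described above can be sketched in Python as follows. This is a minimal sketch, not an official tool: it assumes the TSV starts with a header row whose column names match the field names documented on this page, that the file is UTF-8 encoded, and that the TSV is the first member of the zip archive. The function names (`iter_rows`, `iter_rows_from_zip`, `estimate_size`) are hypothetical.

```python
import csv
import io
import zipfile


def iter_rows(tsv_file):
    """Yield one dict per TSV row without loading the whole file in memory."""
    # Assumption: the first row is a header holding the field names.
    # QUOTE_NONE keeps stray quote characters in titles from merging rows.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    yield from reader


def iter_rows_from_zip(zip_path):
    """Stream rows from the first archive member (assumed to be the TSV)."""
    with zipfile.ZipFile(zip_path) as archive:
        member = archive.namelist()[0]
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from iter_rows(text)


def estimate_size(rows, **wanted):
    """Return (file_count, total_bytes) for rows matching every wanted field."""
    count = total = 0
    for row in rows:
        if all(row.get(field) == value for field, value in wanted.items()):
            count += 1
            total += int(row["file_size"])
    return count, total


if __name__ == "__main__":
    # Tiny in-memory sample standing in for the real report.
    sample = io.StringIO(
        "file_identifier\tcorpus\tfile_type\tfile_size\n"
        "a/1.pdf\terudit\tpdf\t1000\n"
        "a/2.xml\terudit\txml\t250\n"
        "b/1.pdf\tbanq\tpdf\t4000\n"
    )
    files, size = estimate_size(iter_rows(sample), corpus="erudit", file_type="pdf")
    print(files, "files,", size, "bytes")
```

With the real report, `estimate_size(iter_rows_from_zip("corpus_report_2023_03_30.zip"), corpus="erudit")` would stream the archive row by row, so memory use stays flat regardless of the number of rows.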
The fields available for each data provider are described in the table below.
Field name | Is mandatory | BAnQ | LAC | CRKN | Erudit |
---|---|---|---|---|---|
file_identifier | yes | yes | yes | yes | yes |
container_identifier | no | no | no | no | yes |
container_title | no | yes | no | no | yes |
container_type | no | yes | yes | yes | yes |
content_type | no | yes | yes | yes | yes |
corpus | yes | yes | yes | yes | yes |
collection | yes | yes | yes | yes | yes |
file_type | yes | yes | yes | yes | yes |
file_size | yes | yes | yes | yes | yes |
publication_year | yes | yes | yes | yes | yes |
Fields documentation
file_identifier
The unique file identifier. It represents the file path in the file system.
container_identifier
The unique identifier of the container of a document.
This field is not available for all datasets because some data providers do not supply this information.
It is an optional field.
container_title
The container title.
This field is not available for all datasets because some data providers do not supply this information.
It is an optional field.
Example:
The container title could be a journal title, a newspaper title, or a book title.
container_type
The container type.
We make our best effort to identify this information for all datasets, even when it was not provided to us.
It is an optional field.
Examples:
journal
newspaper
meetings
book
other
etc
content_type
The content type.
We make our best effort to identify this information for all datasets, even when it was not provided to us.
It is an optional field.
Examples:
document : The full text is contained in the file.
asset : Represents the assets (images, tables, video, audio, etc.) of a document.
page : Not the full text, but a single page of a document. Usually found in datasets of digitized images. To access the full text, the group of images that belong to the same document must be identified, as for example in the content of Cabinet Conclusions, 1944 to 1979. Unfortunately, there is sometimes no way to aggregate the pages of a document because no data is available for the container (container_identifier, container_title).
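The page aggregation described above can be sketched as follows. This is an illustration under stated assumptions: rows are dictionaries keyed by the field names on this page (for instance as produced by `csv.DictReader`), and sorting file identifiers is assumed to approximate page order, which should be verified for each collection. The function name `group_pages` and its `key_field` parameter are hypothetical.

```python
from collections import defaultdict


def group_pages(rows, key_field="container_identifier"):
    """Group 'page' rows by a container field.

    Returns (documents, orphans): documents maps each container value to the
    sorted list of its file identifiers; orphans lists the pages that have no
    container value and therefore cannot be aggregated.
    """
    documents = defaultdict(list)
    orphans = []
    for row in rows:
        if row.get("content_type") != "page":
            continue  # documents and assets need no aggregation
        container = (row.get(key_field) or "").strip()
        if container:
            documents[container].append(row["file_identifier"])
        else:
            orphans.append(row["file_identifier"])
    # Assumption: lexicographic order of file paths approximates page order.
    return {key: sorted(files) for key, files in documents.items()}, orphans


if __name__ == "__main__":
    rows = [
        {"file_identifier": "c1/p2.tif", "content_type": "page", "container_identifier": "c1"},
        {"file_identifier": "c1/p1.tif", "content_type": "page", "container_identifier": "c1"},
        {"file_identifier": "x/p9.tif", "content_type": "page", "container_identifier": ""},
        {"file_identifier": "d/full.pdf", "content_type": "document", "container_identifier": "d"},
    ]
    documents, orphans = group_pages(rows)
    print(documents)
    print(orphans)
```

Passing `key_field="container_title"` groups by title instead, which may help for providers that supply a title but no container identifier.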
corpus
The identifier of a data provider.
It is a mandatory field.
Examples:
banq
baclac
canadiana
erudit
collection
The identifier of a collection provided by a data provider.
It is a mandatory field.
In some cases, for example in the Bibliothèque et Archives nationales du Québec dataset, we are not able to clearly identify the collection a file belongs to. In such cases the collection field is filled with the identifier of the data provider.
Examples (data provider in parentheses):
canadiana_serial (canadiana)
conclusions_cabinet (baclac)
gazette_canada (baclac)
journaux_canadiens_francais (baclac)
erudit_journals (erudit)
banq (banq)
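To see which collections a report actually contains, and which of them fall back to the data provider identifier, a short tally can help. The sketch below assumes rows are dictionaries keyed by the field names on this page; the function names `count_by_collection` and `fallback_collections` are hypothetical.

```python
from collections import Counter


def count_by_collection(rows):
    """Count files per (corpus, collection) pair."""
    counts = Counter()
    for row in rows:
        counts[(row["corpus"], row["collection"])] += 1
    return counts


def fallback_collections(counts):
    """Return the pairs where the collection repeats the provider identifier."""
    return sorted(pair for pair in counts if pair[0] == pair[1])


if __name__ == "__main__":
    rows = [
        {"corpus": "banq", "collection": "banq"},
        {"corpus": "baclac", "collection": "gazette_canada"},
        {"corpus": "banq", "collection": "banq"},
    ]
    counts = count_by_collection(rows)
    for (corpus, collection), files in sorted(counts.items()):
        print(corpus, collection, files)
    print("fallback:", fallback_collections(counts))
```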
file_type
The extension of each file.
It is a mandatory field.
Examples:
pdf
xml
gif
jpg
tif
txt
png
file_size
The size of one file in bytes.
This field can be used to estimate the size of the dataset a researcher aims to download.
It is a mandatory field.
publication_year
The publication year of the content available in the file.
We make our best effort to identify this information for all datasets, even when it was not provided to us.
It is a mandatory field.
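Because the publication year is identified on a best-effort basis, it may be worth checking how the values are distributed before selecting a period. The sketch below makes an assumption this page does not confirm: that a valid value is a four-digit year string. Anything else is counted separately rather than guessed at. The function name `files_per_decade` is hypothetical.

```python
from collections import Counter


def files_per_decade(rows):
    """Return ({decade: file_count}, unparsed_count) for the given rows."""
    decades = Counter()
    unparsed = 0
    for row in rows:
        value = (row.get("publication_year") or "").strip()
        # Assumption: a valid year is exactly four ASCII digits.
        if len(value) == 4 and value.isascii() and value.isdigit():
            decades[int(value) // 10 * 10] += 1
        else:
            unparsed += 1
    return dict(sorted(decades.items())), unparsed


if __name__ == "__main__":
    rows = [
        {"publication_year": "1944"},
        {"publication_year": "1949"},
        {"publication_year": "1979"},
        {"publication_year": ""},
    ]
    print(files_per_decade(rows))
```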