This is the npm module of "Mediawiki history dumps scraper"; refer to the main branch for the project's overall purpose.
This npm module lets you retrieve, through a scraper and optionally selectively, the content available in the Mediawiki history dumps. You can check which versions are available, which languages, which datasets, the download links, the file sizes and so on.
This module is written in TypeScript and uses axios and regexps to scrape the content from the download site. The code is linted with eslint and prettier, and bundled with webpack. Code documentation generated with TypeDoc and hosted on Vercel is available at https://mhdscraper.euber.dev.
npm install mhdscraper
An example (add a console.log of any variable to see the response):
import * as mhdscraper from 'mhdscraper';
async function main() {
// Returns the root url of the datasets site
const root_url = mhdscraper.WIKI_URL;
// Returns an array of versions, returning the version name and its url
const versions = await mhdscraper.fetchVersions();
// Returns an array of versions, each with its name, its url and
// all the available wikies (name and url)
const versionsWithLangs = await mhdscraper.fetchVersions({
wikies: true
});
// Returns an array containing all the wikies of the latest version,
// returning name and url
const wikies = await mhdscraper.fetchWikies('latest');
// Returns an array containing the wikies ending with 'wiki' of the
// latest version, returning name and url
const wikiesEndingWithWiki = await mhdscraper.fetchWikies('latest', {
wikitype: 'wiki'
});
// Returns an array containing the wikies starting with 'it' of the latest version,
// returning name, url and the array of available dumps
const wikiesWithDumps = await mhdscraper.fetchWikies('latest', {
lang: 'it',
dumps: true
});
// Returns an array containing all the dumps of 'itwiki' of the latest version,
// returning details such as filename, start and end date
// of the content, size in bytes, url to download it...
const dumps = await mhdscraper.fetchDumps('latest', 'itwiki');
// Returns an array containing all the dumps of 'itwiki' of the latest version
// whose content is between 2004-01-01 and 2005-02-01
const dumpsSelected = await mhdscraper.fetchDumps('latest', 'itwiki', {
start: new Date('2004-01-01'),
end: new Date('2005-02-01')
});
}
main();
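The lang and wikitype options used in the example select wikies by name: a wiki name starts with the language and ends with the wiki type. The following is a minimal sketch of that naming rule only, not the module's actual implementation; matchesWiki and WikiFilter are hypothetical names introduced for illustration, and the list of wiki names is illustrative.

```typescript
// Hypothetical helper, not part of mhdscraper: it mirrors the naming rule
// described in this README (language prefix, wiki-type suffix).
interface WikiFilter {
    lang?: string;
    wikitype?: string;
}

function matchesWiki(name: string, filter: WikiFilter = {}): boolean {
    // An omitted criterion matches every name
    const langOk = filter.lang === undefined || name.startsWith(filter.lang);
    const typeOk = filter.wikitype === undefined || name.endsWith(filter.wikitype);
    return langOk && typeOk;
}

// Illustrative wiki names
const names = ['itwiki', 'itwikibooks', 'enwiki', 'itwiktionary'];
const italianWikis = names.filter(name => matchesWiki(name, { lang: 'it', wikitype: 'wiki' }));
console.log(italianWikis); // only 'itwiki' satisfies both criteria
```

A plain prefix and suffix check is deliberately loose; the module itself may apply stricter matching.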
The result of:
import * as mhdscraper from 'mhdscraper';
async function main() {
const result = await mhdscraper.fetchWikies('latest', {
lang: 'it',
wikitype: 'wiki',
dumps: true,
start: new Date('2010-01-01'),
end: new Date('2012-12-31'),
});
}
main();
Would be (as of July 2021):
[
{
"dumps": [
{
"bytes": "691419132",
"filename": "2021-06.itwiki.2010.tsv.bz2",
"from": "2010-01-01",
"lastUpdate": "2021-07-03T10:38:00",
"time": "2010",
"to": "2010-12-31",
"url": "https://dumps.wikimedia.org/other/mediawiki_history/2021-06/itwiki/2021-06.itwiki.2010.tsv.bz2"
},
{
"bytes": "706208269",
"filename": "2021-06.itwiki.2011.tsv.bz2",
"from": "2011-01-01",
"lastUpdate": "2021-07-03T10:57:00",
"time": "2011",
"to": "2011-12-31",
"url": "https://dumps.wikimedia.org/other/mediawiki_history/2021-06/itwiki/2021-06.itwiki.2011.tsv.bz2"
},
{
"bytes": "747376403",
"filename": "2021-06.itwiki.2012.tsv.bz2",
"from": "2012-01-01",
"lastUpdate": "2021-07-03T10:11:00",
"time": "2012",
"to": "2012-12-31",
"url": "https://dumps.wikimedia.org/other/mediawiki_history/2021-06/itwiki/2021-06.itwiki.2012.tsv.bz2"
}
],
"url": "https://dumps.wikimedia.org/other/mediawiki_history/2021-06/itwiki",
"wiki": "itwiki"
}
]
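Since the sample output reports bytes as a string, a total download size has to be computed after converting each value to a number. The sketch below assumes the shape shown in the sample; WikiInfo, DumpInfo and totalBytes are hypothetical names, and the module's own types may differ.

```typescript
// Shapes assumed from the sample output above; the module's own types may differ.
interface DumpInfo {
    bytes: string;
    filename: string;
}
interface WikiInfo {
    wiki: string;
    dumps: DumpInfo[];
}

// Sums the size of every dump of every wiki, converting the string sizes to numbers
function totalBytes(wikies: WikiInfo[]): number {
    let total = 0;
    for (const wiki of wikies) {
        for (const dump of wiki.dumps) {
            total += Number(dump.bytes);
        }
    }
    return total;
}

// Values copied from the sample output above
const sample: WikiInfo[] = [
    {
        wiki: 'itwiki',
        dumps: [
            { bytes: '691419132', filename: '2021-06.itwiki.2010.tsv.bz2' },
            { bytes: '706208269', filename: '2021-06.itwiki.2011.tsv.bz2' },
            { bytes: '747376403', filename: '2021-06.itwiki.2012.tsv.bz2' }
        ]
    }
];

const total = totalBytes(sample);
console.log(`${(total / 1e9).toFixed(2)} GB`); // prints "2.15 GB"
```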
WIKI_URL
It is a constant containing the url of the root of the datasets site
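The urls in the sample output suggest that a wiki url is the root followed by the version and the wiki name. The sketch below illustrates that layout under two assumptions: the root value is inferred from the sample urls (in real code, read it from mhdscraper.WIKI_URL), and joinUrl is a hypothetical helper, not part of the module.

```typescript
// Assumed value, inferred from the sample urls in this README; in real code
// read it from mhdscraper.WIKI_URL instead of hard-coding it.
const ASSUMED_WIKI_URL = 'https://dumps.wikimedia.org/other/mediawiki_history';

// Hypothetical helper: joins path segments onto the root, tolerating trailing slashes
function joinUrl(root: string, ...segments: string[]): string {
    return [root.replace(/\/+$/, ''), ...segments].join('/');
}

const wikiUrl = joinUrl(ASSUMED_WIKI_URL, '2021-06', 'itwiki');
console.log(wikiUrl);
// https://dumps.wikimedia.org/other/mediawiki_history/2021-06/itwiki
```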
async fetchLatestVersion(options)
Fetches the latest version of the mediawiki history dumps.
The version is the year-month of the release of the dumps.
Options' fields:
- wikies: if true, the wikies of the returned version are fetched as well.
- lang: if wikies is true, the language of the wikies to return (a wiki name starts with the language).
- wikitype: if wikies is true, the wiki type of the wikies to return (a wiki name ends with the wiki type).
- dumps: if true, the dumps of each returned wiki are fetched as well.
- start: if dumps is true, retrieve only the dumps newer than this date.
- end: if dumps is true, retrieve only the dumps older than this date.
Returns an object with:
- version (string) for the version year-month
- url (string) for the url of that version
- wikies will contain the fetched wikies if the wikies argument was set to true.
If no version is found, None is returned.
fetchVersions(options)
Fetches the versions of the mediawiki history dumps.
The versions are the year-month of the release of the dumps.
Options' fields:
- wikies: if true, the wikies of each returned version are fetched as well.
- lang: if wikies is true, the language of the wikies to return (a wiki name starts with the language).
- wikitype: if wikies is true, the wiki type of the wikies to return (a wiki name ends with the wiki type).
- dumps: if true, the dumps of each returned wiki are fetched as well.
- start: if dumps is true, retrieve only the dumps newer than this date.
- end: if dumps is true, retrieve only the dumps older than this date.
Returns an array of objects with:
- version (string) for the version year-month
- url (string) for the url of that version
- wikies will contain the fetched wikies if the wikies argument was set to true (see fetchWikies for the shape of the result).
fetchWikies(version, options)
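Fetches the wikies of a version of the mediawiki history dumps.
Parameters:
- version: the version whose wikies are fetched ('latest' is used in the examples above).
Options' fields:
- lang: the language of the wikies to return (a wiki name starts with the language).
- wikitype: the wiki type of the wikies to return (a wiki name ends with the wiki type).
- dumps: if true, the dumps of each returned wiki are fetched as well.
- start: if dumps is true, retrieve only the dumps newer than this date.
- end: if dumps is true, retrieve only the dumps older than this date.
Returns an array of objects with:
- wiki (string) for the wiki name
- url (string) for the url of that wiki
In addition, if the dumps argument is true, a dumps (array) field contains the fetched dumps (see fetchDumps for the shape of the result).
fetchDumps(version, wiki, options)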
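Fetches the dumps of a wiki of the mediawiki history dumps.
Parameters:
- version: the version of the dumps ('latest' is used in the examples above).
- wiki: the name of the wiki, for example 'itwiki'.
Options' fields:
- start: retrieve only the dumps newer than this date.
- end: retrieve only the dumps older than this date.
Returns an array of objects with:
- filename (string) for the dump file name
- time (string) for the time of the data ('all-time', year or year-month)
- lastUpdate (Datetime) for the last update date
- bytes (int) for the size in bytes of the file
- from (Date) for the start date of the data
- to (Date) for the end date of the data
- url (string) for the url of the file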
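The start and end options narrow the returned dumps by date. As a rough illustration of that kind of selection, the sketch below keeps the dumps whose from and to dates fall inside an inclusive range. The inclusive boundaries are an assumption consistent with the sample output earlier in this README, not a statement of how the module decides; DumpRange and filterByRange are hypothetical names.

```typescript
// Shape assumed from the sample output in this README
interface DumpRange {
    filename: string;
    from: string; // ISO date, e.g. '2010-01-01'
    to: string;   // ISO date, e.g. '2010-12-31'
}

// Hypothetical helper: keeps dumps fully contained in [start, end] (inclusive, assumed)
function filterByRange(dumps: DumpRange[], start?: Date, end?: Date): DumpRange[] {
    return dumps.filter(dump => {
        const from = new Date(dump.from).getTime();
        const to = new Date(dump.to).getTime();
        const afterStart = start === undefined || from >= start.getTime();
        const beforeEnd = end === undefined || to <= end.getTime();
        return afterStart && beforeEnd;
    });
}

// Values copied from the sample output in this README
const sampleDumps: DumpRange[] = [
    { filename: '2021-06.itwiki.2010.tsv.bz2', from: '2010-01-01', to: '2010-12-31' },
    { filename: '2021-06.itwiki.2011.tsv.bz2', from: '2011-01-01', to: '2011-12-31' },
    { filename: '2021-06.itwiki.2012.tsv.bz2', from: '2012-01-01', to: '2012-12-31' }
];

const selected = filterByRange(sampleDumps, new Date('2011-01-01'), new Date('2012-12-31'));
console.log(selected.map(dump => dump.filename)); // the 2011 and 2012 files
```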