The files in this directory provide hourly aggregated pagecounts and
projectcounts of all sites (i.e.: desktop, mobile, and zero) across
public wikis.


  1. On-wiki documentation
  2. Used pageview definition
  3. File structure
  3.1. Disambiguating abbreviations ending in “.m”
  4. Contact / Bugs



1. On-wiki documentation
==========================

While this README.txt is currently (2014-10-01) up-to-date, this file
cannot easily be updated by the community. Hence, we consider the
on-wiki documentation at

  https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites

the authoritative documentation. This README.txt is just a convenience
for people who have trouble accessing the on-wiki documentation.



2. Used pageview definition
=============================

The files use the pageview definition of webstatscollector. In
contrast to the files at

  http://dumps.wikimedia.org/other/pagecounts-raw/

they apply it not only to the desktop site, but to all sites:
desktop, mobile, and zero.

(Note that this does not yet cover everything that we want to consider
a pageview. It comes with the same definition issues as the
webstatscollector data from pagecounts-raw. However, it provides data
for mobile as a stop-gap measure.)



3. File structure
===================

The file structure should be compatible with the files from

  http://dumps.wikimedia.org/other/pagecounts-raw/

Filenames are of the form

  http://dumps.wikimedia.org/other/pagecounts-all-sites/${YEAR}/${YEAR}-${MONTH}/pagecounts-${YEAR}${MONTH}${DAY}-${HOUR}0000.gz
  http://dumps.wikimedia.org/other/pagecounts-all-sites/${YEAR}/${YEAR}-${MONTH}/projectcounts-${YEAR}${MONTH}${DAY}-${HOUR}0000

The pagecounts are gzipped text files holding hourly per-page
aggregates of pageviews and total response bytes. The projectcounts
are plain text files holding hourly per-domain-name aggregates of
pageviews and total response bytes.

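As a rough illustration of the filename pattern, the following Python
sketch builds both URLs for a given timestamp. The function name
hourly_file_urls is made up for this example, and the sketch assumes
that ${MONTH}, ${DAY} and ${HOUR} are zero-padded to two digits, which
this README does not state explicitly.

```python
from datetime import datetime

BASE_URL = "http://dumps.wikimedia.org/other/pagecounts-all-sites"


def hourly_file_urls(stamp):
    """Return (pagecounts_url, projectcounts_url) for the hour in `stamp`.

    `stamp` is the timestamp as it appears in the filename.
    Assumption: month, day and hour are zero-padded to two digits.
    """
    directory = f"{BASE_URL}/{stamp:%Y}/{stamp:%Y-%m}"
    suffix = f"{stamp:%Y%m%d-%H}0000"
    return (
        f"{directory}/pagecounts-{suffix}.gz",   # gzipped text file
        f"{directory}/projectcounts-{suffix}",   # plain text file
    )
```

For example, hourly_file_urls(datetime(2014, 10, 1, 5)) gives URLs
ending in pagecounts-20141001-050000.gz and
projectcounts-20141001-050000.
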
Note that (to maintain compatibility with pagecounts-raw) the time
used in the filename refers to the end of the aggregation period, not
the beginning.

Both pagecounts and projectcounts are made up of lines having 4
space-separated fields:

  domain page_title count_views total_response_size

* domain

  Domain name of the request. Common trailing parts have been
  abbreviated just as they are for the above pagecounts-raw files:

    ".wikipedia.org"           -> ""
    ".wikibooks.org"           -> ".b"
    ".wiktionary.org"          -> ".d"
    ".wikimediafoundation.org" -> ".f"
    ".wikimedia.org"           -> ".m" (only for some projects. See below)
    ".wikinews.org"            -> ".n"
    ".wikiquote.org"           -> ".q"
    ".wikisource.org"          -> ".s"
    ".wikiversity.org"         -> ".v"
    ".wikivoyage.org"          -> ".voy"
    ".mediawiki.org"           -> ".w"
    ".wikidata.org"            -> ".wd"

  For ".wikimedia.org", only the following domains are considered:

    * commons.wikimedia.org
    * meta.wikimedia.org
    * incubator.wikimedia.org
    * species.wikimedia.org
    * strategy.wikimedia.org
    * outreach.wikimedia.org
    * usability.wikimedia.org
    * quality.wikimedia.org

  (There is also ".mw", but ".mw" is only there for legacy reasons. It
  is a legacy attempt to count mobile sites per language across
  projects. Please do not use ".mw"; use the counts for the mobile
  sites (like "en.m.voy") instead.)

* page_title

  For pagecounts files, it holds the title of the page. E.g.:

    Main_Page
    Berlin

  For projectcounts files, it is "-".

* count_views

  The number of times this page has been viewed in the respective
  hour.

* total_response_size

  The total response size caused by the requests for this page in the
  respective hour.

So for example a line of

  en Main_Page 42 50043

means 42 requests to "en.wikipedia.org/wiki/Main_Page", which
accounted in total for 50043 response bytes. And

  de.m.voy Berlin 176 314159

would stand for 176 requests to "de.m.wikivoyage.org/wiki/Berlin",
which accounted in total for 314159 response bytes.

Each domain+page_title combination occurs at most once.

The file is sorted by domain and page_title.



3.1. Disambiguating abbreviations ending in “.m”
--------------------------------------------------

There are two ways for an abbreviation to end in “.m”: either the
domain is a whitelisted project on wikimedia.org (like
“commons.wikimedia.org” being abbreviated to “commons.m”), or the
domain is the mobile site of wikipedia (like “en.m.wikipedia.org”
being abbreviated to “en.m”).

Since the whitelisted wikimedia.org projects (see abbreviation table
above) never match a language code on wikipedia, the mapping between
domain name and abbreviation is bijective.

While this solution requires an “if” for the edge case of “Summing up
pageviews across all mobile sites”, it stays compatible with
pagecounts-raw's abbreviations while at the same time keeping the
concept and semantics of abbreviating domain names. It also makes it
easier to automate comparisons between this dataset and TSVs (like
sampled-1000) or Hive data.



4. Contact / Bugs
===================

You can reach the analytics team via email at

  analytics@lists.wikimedia.org

or via IRC on freenode in

  #wikimedia-analytics
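
To illustrate the line format from section 3 and the “.m”
disambiguation from section 3.1, here is a minimal Python sketch. It
is not the code used to produce the files. The names Record,
parse_line, expand_domain and is_mobile are made up for this example.
The sketch assumes that a domain is abbreviated purely by replacing
its trailing part, and that the first label of a “.m” abbreviation is
enough to tell a whitelisted wikimedia.org project from a wikipedia
language code.

```python
from typing import NamedTuple

# Trailing parts, copied from the abbreviation table in section 3.
SUFFIXES = {
    "b": "wikibooks.org",
    "d": "wiktionary.org",
    "f": "wikimediafoundation.org",
    "n": "wikinews.org",
    "q": "wikiquote.org",
    "s": "wikisource.org",
    "v": "wikiversity.org",
    "voy": "wikivoyage.org",
    "w": "mediawiki.org",
    "wd": "wikidata.org",
}

# Whitelisted wikimedia.org projects, copied from section 3.
WIKIMEDIA_PROJECTS = {
    "commons", "meta", "incubator", "species",
    "strategy", "outreach", "usability", "quality",
}


class Record(NamedTuple):
    domain: str
    page_title: str
    count_views: int
    total_response_size: int


def parse_line(line):
    """Split one line into its 4 space-separated fields."""
    domain, page_title, views, size = line.rstrip("\n").split(" ")
    return Record(domain, page_title, int(views), int(size))


def expand_domain(abbrev):
    """Map an abbreviation such as "de.m.voy" back to a full domain name."""
    parts = abbrev.split(".")
    if len(parts) >= 2:
        head, last = parts[:-1], parts[-1]
        if last == "mw":
            # Legacy per-language mobile aggregate, not a real domain.
            raise ValueError('".mw" entries are legacy and should not be used')
        if last in SUFFIXES:
            return ".".join(head + [SUFFIXES[last]])
        # The "if" from section 3.1: whitelisted project vs. mobile wikipedia.
        # Assumption: checking the first label is sufficient.
        if last == "m" and parts[0] in WIKIMEDIA_PROJECTS:
            return ".".join(head + ["wikimedia.org"])
    # Everything else had ".wikipedia.org" abbreviated to "".
    return abbrev + ".wikipedia.org"


def is_mobile(abbrev):
    """True if the expanded domain has an "m" label before the project name."""
    return "m" in expand_domain(abbrev).split(".")[:-2]
```

With these helpers, parse_line("de.m.voy Berlin 176 314159").domain
expands to "de.m.wikivoyage.org", "commons.m" expands to
"commons.wikimedia.org", and is_mobile() is true for "en.m" but false
for "commons.m".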