
The Pile

An 800GB Dataset of Diverse Text for Language Modeling

What is the Pile?

The Pile is an 825 GiB diverse, open-source language modeling dataset that consists of 22 smaller, high-quality datasets combined together.

Download

The Pile is hosted by the Eye.

The format of the Pile is jsonlines data compressed using zstandard.
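As a rough sketch of how this format can be read in Python, the snippet below streams records from a zstandard-compressed jsonlines file. It assumes the third-party `zstandard` package is installed. The helper names `iter_jsonl` and `open_zst_text` and the file name in the usage comment are hypothetical, and the record key `text` is an assumption about the schema rather than something stated on this page.

```python
import io
import json


def iter_jsonl(lines):
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def open_zst_text(path):
    """Open a zstandard-compressed file as a UTF-8 text stream.

    Assumes the third-party `zstandard` package is available.
    """
    import zstandard  # imported lazily so iter_jsonl works without it

    raw = open(path, "rb")
    reader = zstandard.ZstdDecompressor().stream_reader(raw)
    return io.TextIOWrapper(reader, encoding="utf-8")


# Usage sketch (hypothetical file name; "text" key is an assumption):
# with open_zst_text("00.jsonl.zst") as stream:
#     for record in iter_jsonl(stream):
#         print(record["text"][:80])
```

Streaming the decompressed data line by line avoids loading a whole shard into memory, which matters for a dataset of this size.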

Have a model that uses or evaluates on the Pile? Let us know!

Why is the Pile a good training set?

Recent work has shown that, especially for large models, diversity in data sources improves the model's general cross-domain knowledge as well as its downstream generalization capability. In our evaluations, models trained on the Pile not only show moderate improvements on traditional language modeling benchmarks but also show significant improvements on Pile BPB.

Why is the Pile a good benchmark?

To score well on Pile BPB (bits per byte), a model must be able to understand many disparate domains, including books, GitHub repositories, webpages, chat logs, and medical, physics, math, computer science, and philosophy papers. Pile BPB is a measure of world knowledge and reasoning ability in these domains, making it a robust benchmark of general, cross-domain text modeling ability for large language models.
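For readers unfamiliar with the metric, bits per byte is generally computed by converting a model's total negative log-likelihood from nats to bits and dividing by the number of UTF-8 bytes in the text. The function below is a minimal sketch of that general definition, not necessarily the evaluation code behind the leaderboard. The name `bits_per_byte` and its parameters are hypothetical.

```python
import math


def bits_per_byte(total_nll_nats, text):
    """Convert a summed negative log-likelihood (in nats) to bits per UTF-8 byte.

    total_nll_nats: sum of -ln p(token) over every token in `text`.
    """
    num_bytes = len(text.encode("utf-8"))
    if num_bytes == 0:
        raise ValueError("text must be non-empty")
    # Dividing by ln(2) converts nats to bits.
    return total_nll_nats / math.log(2) / num_bytes


# A model that spends exactly one bit on each of the 5 bytes of "hello"
# has a total loss of 5 * ln(2) nats.
example_bpb = bits_per_byte(5 * math.log(2), "hello")
```

Because the denominator counts bytes rather than tokens, the metric can be compared across models that use different tokenizers.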

Citing

If you use the Pile or any of the components, please cite us!

@article{pile,
  title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}
                

Leaderboard

* indicates potential test-set overlap. Zero-shot indicates that not all of the components of the Pile were present in the training data.

Rank  Date         Model               Organization  Test BPB
1.    Jan 1, 2021  GPT-3 (Zero-Shot)*  OpenAI        0.7177
2.    Jan 1, 2021  GPT-2 (Zero-Shot)*  OpenAI        1.2253
