使用Django从数据库中随机取N条记录的不同方法及其性能实测

使用Django从数据库中随机取N条记录的不同方法及其性能实测

【声明】:本文中的实验仅限于特定数据库和特定框架。不同数据库,数据库服务器的性能,甚至同一个数据库的不同配置都会影响到同一段代码的性能。具体情况请在自己的生产环境进行测试。

这里(stackoverflow)有一篇关于使用Django随机获取记录的讨论。主要意思是说

1
Record.objects.order_by('?')[:2]

这样获取2个记录会导致性能问题,原因如下:

“ 对于有着相当多数量记录的表来说,这种方法异常糟糕。这会导致一个 ORDER BY RAND() 的SQL查询。举个栗子,这里是MYSQL是如何处理这个查询的(其他数据库的情况也差不多),想象一下当一个表有十亿行的时候会怎样:

  1. 为了完成ORDER BY RAND() ,需要一个RAND()列来排序
  2. 为了有RAND()列,需要一个新表,因为现有的表没有这个列。
  3. 为了这个新表,mysql建立了一个带有新列的,新的临时表,并且将已有的一百万行数据复制进去。
  4. 当其新建完了,他如你所要求的,为每一行运行RAND()函数来填上这个值。是的,你派mysql创建一百万个随机数,这要点时间:)
  5. 几个小时或几天后,当他干完这活,他要排序。是的,你排mysql去排序一个一百万行的,最糟糕的表(说他最糟糕是因为排序的键是随机的)。
  6. 几天或者几星期后,当排序完了,他忠诚地将你实际需要的可怜的两行抓出来返回给你。做的好。;)

注意:只是稍微说一句,得注意到mysql一开始会试着在内存中创建临时表。当内存不够了,他将会把所有东西放在硬盘上,所以你会因为近乎于整个过程中的I/O瓶颈而雪上加霜。

怀疑者可以去看看python代码引起的查询语句,确认是ORDER BY RAND(), 然后去Google下”order by rand()”(带上引号)。

一个更好的方式是将这个耗费严重的查询换成3个耗费更轻的:

1
2
3
4
5
6
7
8
last = MyModel.objects.count() - 1 
# 这是一个获取两个不重复随机数的简单方法 
index1 = randint(0, last) 
index2 = randint(0, last - 1
if index2 == index1:  
    index2 = last 
MyObj1 = MyModel.objects.all()[index1] 
MyObj2 = MyModel.objects.all()[index2]

如上Manganeez所说的方法,相应的获取n条记录的代码应该如下:

1
2
sample = random.sample(xrange(Record.objects.count()),n) 
result = [Record.objects.all()[i] for i in sample]

基于Python代码应该简洁优雅的想法,如上的代码似乎可以写成:

1
result = random.sample(Record.objects.all(),n)

就性能问题,请教了stackoverflow上的大神 (虽然被踩和被教育了=。=)

Record.objects.count() 将被转换成一个相当轻量级的SQL请求:

1
SELECT COUNT(*) FROM TABLE

Record.objects.all()[0]也会被转换成一个十分轻量级的SQL请求

1
SELECT * FROM TABLE LIMIT 1

Querying all 是一个耗费十分严重的请求

1
SELECT * FROM TABLE

通常情况下Django会不显示其他的结果,这样你不会真正的获取到所有的记录。

1
SELECT * FROM table LIMIT 20// or something similar

任何时候你将一个Queryset转换成list的时候,将是资源消耗严重的时候。

如果我没错的话,在这个例子里,sample方法将把Queryset转换成list。

这样如果你result = random.sample(Record.objects.all(),n) 这样做的话,全部的Queryset将会转换成list,然后从中随机选择。

想象一下如果你有十亿行的数据。你是打算把它存储在一个有百万元素的list中,还是愿意一个一个的query?

在上边Yeo的回答中,freakish回复道:.count的性能是基于数据库的。而Postgres的.count为人所熟知的相当之慢。

某人说过,要知道梨子的滋味,就得变革梨子,亲口尝一尝。

测试环境:

  • Win8.1 pro x64
  • Wampserver2.4-x64 (apache2.4.4 mysql5.6.12 php5.4.12)
  • Python2.7.5
  • Django1.4.6

在一个已有的测试project中新建一个app,数据库是MYSQL:

1
D:\PyWorkspace\DjangoTest>python manage.py startapp randomrecords

在models.py中添加模型:

1
2
3
4
5
6
7
8
9
10
class Record(models.Model): 
    """docstring for Record""" 
 
    id = models.AutoField(primary_key = True
    content = models.CharField(max_length = 16
 
    def __str__(self): 
        return "id:%s content:%s" % (self.id, self.content) 
    def __unicode__(self): 
        return u"id:%s content:%s" % (self.id, self.content)

添加一万行数据:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
D:\PyWorkspace\DjangoTest>python manage.py syncdb 
Creating tables ... 
Creating table randomrecords_record 
Installing custom SQL ... 
Installing indexes ... 
Installed 0 object(s) from 0 fixture(s) 
 
D:\PyWorkspace\DjangoTest>python manage.py shell 
Python 2.7.5 (default, May 15 2013, 22:44:16) [MSC v.1500 64 bit (AMD64)] on win 
32 
Type "help", "copyright", "credits" or "license" for more information. 
(InteractiveConsole) 
>>> from randomrecords.models import Record 
>>> for i in xrange(10000): 
...   Record.objects.create(content = 'c of %s' % i).save()

15分钟以后我得到了这个MYSQL表。真的,不骗你,真的是15分钟。看了记录才知道 每次save都要调用一次insert和一次update。。。。下次一定用SQL语句初始化。。。。

先写了个脚本 在manage.py shell中调用了下 结果让我震惊了。我表示不敢相信 又写了view 并在settings.py中添加了显示SQL Query语句的log

这里是写的view:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
def test1(request): 
    start = datetime.datetime.now() 
    result = Record.objects.order_by('?')[:20
    l = list(result) # Queryset是惰性的,强制将Queryset转为list 
    end = datetime.datetime.now() 
    return HttpResponse("time:  %s" % (end-start).microseconds/1000)) 
 
def test2(request): 
    start = datetime.datetime.now() 
    sample = random.sample(xrange(Record.objects.count()),20
    result = [Record.objects.all()[i] for i in sample] 
    l = list(result) 
    end = datetime.datetime.now() 
    return HttpResponse("time:  %s" % (end-start) 
 
def test3(request): 
    start = datetime.datetime.now() 
    result = random.sample(Record.objects.all(),20
    l = list(result) 
    end = datetime.datetime.now() 
    return HttpResponse("time:  %s" % (end-start)

运行结果如下,第一行是页面显示的时间,后边是Queryset实际调用的SQL语句

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
test1: 
 
time: 0:00:00.012000 
 
(0.009) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` ORDER BY RAND() LIMIT 20; args=() 
[05/Dec/2013 17:48:19] "GET /dbtest/test1 HTTP/1.1" 200 775 
 
test2: 
 
time: 0:00:00.055000 
 
(0.002) SELECT COUNT(*) FROM `randomrecords_record`; args=() 
(0.002) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 6593; args=() 
(0.001) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 2570; args=() 
(0.001) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 620; args=() 
(0.001) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 5814; args=() 
(0.003) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 6510; args=() 
(0.002) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 3536; args=() 
(0.001) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 3362; args=() 
(0.003) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 8948; args=() 
(0.002) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 7723; args=() 
(0.001) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 2374; args=() 
(0.002) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 8269; args=() 
(0.002) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 4370; args=() 
(0.002) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 6953; args=() 
(0.001) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 1441; args=() 
(0.000) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 772; args=() 
(0.002) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 4323; args=() 
(0.002) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 8139; args=() 
(0.002) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 7441; args=() 
(0.001) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 1306; args=() 
(0.001) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 5462; args=() 
[05/Dec/2013 17:50:34] "GET /dbtest/test2 HTTP/1.1" 200 777 
 
test3: 
 
time: 0:00:00.156000 
 
(0.032) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record`; args=() 
[05/Dec/2013 17:51:29] "GET /dbtest/test3 HTTP/1.1" 200 774

令人难以置信的,在10000行的MYSQL表中 方法1的效率是最高的。无论是结果上看(12ms)还是SQL语句的运行时间上看(9ms)方法1甩了其他方法一大截

即便数据量增加到21万,方法1也会比其他两种方法快:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
time: 98 
 
(0.094) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` ORDER BY RAND() LIMIT 20; args=() 
[05/Dec/2013 19:18:59] "GET /dbtest/test1 HTTP/1.1" 200 14 
 
time: 0:00:00.668000 
//这里没有注意到 掉了一行count语句 
(0.045) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 176449; args=() 
(0.016) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 68082; args=() 
(0.036) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 145571; args=() 
(0.033) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 111029; args=() 
(0.043) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 169675; args=() 
(0.046) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 186234; args=() 
(0.043) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 167233; args=() 
(0.015) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 54404; args=() 
(0.036) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 140395; args=() 
(0.004) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 13128; args=() 
(0.039) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 153695; args=() 
(0.034) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 131863; args=() 
(0.021) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 82785; args=() 
(0.015) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 57253; args=() 
(0.021) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 77836; args=() 
(0.049) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 199567; args=() 
(0.002) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 3867; args=() 
(0.027) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 104470; args=() 
(0.026) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 107058; args=() 
(0.043) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record` LIMIT 1 OFFSET 150979; args=() 
[05/Dec/2013 19:21:33] "GET /dbtest/test2 HTTP/1.1" 200 15 
 
time 0:00:00.781000 
 
[05/Dec/2013 19:23:01] "GET /dbtest/test3 HTTP/1.1" 200 15 
(0.703) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content` FRO 
M `randomrecords_record`; args=() 
[05/Dec/2013 19:23:06] "GET /dbtest/test3 HTTP/1.1" 200 15

数据量再次提升至百万级别 1066768条数据