Discussion:
Django + Django REST Framework SQL query response handling and serialization is very, very slow - yet not an "N+1" problem
Miika Huusko
2017-06-11 10:06:23 UTC
I have a performance problem with Django + Django REST Framework. The
problem is that it takes Django a very, very long time to handle the SQL
query response, and DRF serialization takes quite some time as well. It
happens when there are ManyToMany or OneToMany relations and nested
objects. It sounds like an "N + 1" problem, but that's not the case. I have
10k to 50k items with related items, and I try to fetch and serialize them
for a REST API. A request takes from 20 to 60 seconds, and currently I have
no idea what is causing the slowness.

Note: I asked the same question on Stack Overflow, but haven't found an
answer there -
https://stackoverflow.com/questions/44461638/django-django-rest-framework-postgresql-queries-and-serialization-is-very-sl


* N+1 Problem *

There are not too many queries, and the queries themselves are fast
(queries at the end of this post). I'm using `prefetch_related` to limit
the number of queries, and from the DB queries I'm seeing, everything looks
okay-ish (?). I get one query for each `prefetch_related` property plus the
original query for the serialized objects. There are lots and lots of IDs
included in the `prefetch_related` queries, but I guess that is
inevitable: as many IDs as there are original items.

To test SQL queries + data transfer, I ran the same queries with psql from
one of my EC2 instances against the RDS DB with the same data, and wrote
the results to a file. Data transfer plus the file write on top of that
comes to between 100 and 500 ms in total for the bigger SQL queries across
different data sets. The file write is extra, but I wanted to ensure that
I get all the data I expect.

I tested the timing from an EC2 instance with a command like:

time psql -f fat-query.psql --host=rds_host --port=5432 --username=user --dbname=dbname > output.txt


* Profiling *

When profiling as shown here
https://www.dabapps.com/blog/api-performance-profiling-django-rest-framework/
I get results showing that DB lookups take most of the time, while
serialization is not too fast either. As an example, I have about 12k
"Items" in my local PostgreSQL database for one "Project". All "Items"
have 1-5 "Events", and the majority of "Items" also have 1-2 "Photos".
Fetching and serializing that data takes around 22 seconds on my laptop.
I'm using AWS EC2 + RDS for deployment, and the timing is about the same
there. On larger "Item" sets, serialization time grows faster than DB
lookup time, but DB lookups always take most of the time. With 40k items
you start to hit 1 min execution times and assorted timeouts from Nginx
and other parts of the stack.

Example with 12k items (models, serializers and queries below)

Database lookup | 14.0292s
Serialization | 6.0076s
Django request/response | 0.3488s
API view | 0.2170s
Response rendering | 1.1092s

If I leave Photos and Events out, the result is

Database lookup | 1.2447s
Serialization | 3.9668s
Django request/response | 0.2435s
API view | 0.1320s
Response rendering | 0.8495s
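
The numbers above come from timing each phase separately, in the spirit of
the DabApps post. A minimal sketch of that kind of instrumentation (the
ProfiledItemListAPIView subclass is hypothetical; the view and serializer
are the ones shown below):

import time

from rest_framework.renderers import JSONRenderer
from rest_framework.response import Response

class ProfiledItemListAPIView(ItemListAPIView):
    def list(self, request, *args, **kwargs):
        t0 = time.time()
        items = list(self.get_queryset())      # force the DB lookup here
        t1 = time.time()
        serializer = self.get_serializer(items, many=True)
        data = serializer.data                 # serialization happens here
        t2 = time.time()
        JSONRenderer().render(data)            # response rendering
        t3 = time.time()
        print("Database lookup    | %.4fs" % (t1 - t0))
        print("Serialization      | %.4fs" % (t2 - t1))
        print("Response rendering | %.4fs" % (t3 - t2))
        return Response(data)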


* What might cause the slowness? *

So, the related fields are taking most of the time (many=True). The
profiling I used for testing makes a `list` out of the queryset before
serializing; therefore the lazy queries are executed before serialization.
If I don't do that, it doesn't change the overall results, but the DB
lookups are evaluated during serialization, taking about the same amount
of time.

Now the problem for me is that all the queries being made are fast if
executed manually. So I believe the SQL queries are fast, but DB lookups
from Django's point of view are very slow. What am I missing here? Or how
should I continue the investigation? It feels like it requires serious
effort from Django to convert the SQL query results into Django model
instances. That would imply that there's something wrong with my models,
right?


In the end, I could turn to caching, but I would assume that handling <
100k objects should not be an issue for Django if done correctly.

----------

Setup: Python 2.7.13, Django 1.10.7, DRF 3.6.3

Simplified versions of models, views and serializers:

class List(models.Model):
    ... CharFields, DateTimeFields, ForeignKeys etc. ...

class Item(models.Model):
    list = models.ForeignKey(List, on_delete=models.CASCADE,
                             db_index=True, null=True, related_name='items')
    deleted_at = models.DateTimeField(db_index=True, blank=True,
                                      null=True, default=None)
    created_by = models.ForeignKey(User, blank=False)
    project = models.ForeignKey('projects.Project', on_delete=models.CASCADE)
    ... other CharFields, DateTimeFields, ForeignKeys etc. ...

class Event(models.Model):
    item = models.ForeignKey(Item, on_delete=models.CASCADE,
                             db_index=True, null=True, related_name='events')
    created_by = models.ForeignKey(User, blank=False)
    deleted_at = models.DateTimeField(db_index=True, blank=True,
                                      null=True, default=None)
    ... other CharFields, DateTimeFields, ForeignKeys etc. ...

class Photo(models.Model):
    item = models.ForeignKey(Item, on_delete=models.CASCADE,
                             db_index=True, null=True, related_name='photos')
    created_by = models.ForeignKey(User, blank=False)
    deleted_at = models.DateTimeField(db_index=True, blank=True,
                                      null=True, default=None)
    ... other CharFields, DateTimeFields, ForeignKeys etc. ...


class PhotoSerializer(serializers.ModelSerializer):
    ... other CharFields, DateTimeFields, ForeignKeys etc. ...

class EventSerializer(serializers.ModelSerializer):
    createdBy = PrimaryKeyRelatedStringField(source='created_by',
                                             read_only=True)
    createdByFullName = serializers.CharField(source='created_by.get_full_name',
                                              read_only=True)
    ... other CharFields, DateTimeFields, ForeignKeys etc. ...

class ItemSerializer(serializers.ModelSerializer):
    listName = serializers.CharField(source='list.name', read_only=True)
    createdBy = PrimaryKeyRelatedStringField(source='created_by',
                                             read_only=True)
    createdByFullName = serializers.CharField(source='created_by.get_full_name',
                                              read_only=True)
    photos = PhotoSerializer(many=True, read_only=True)
    events = EventSerializer(many=True, required=False, allow_null=True)
    ... other fields ...


class ItemListAPIView(ListAPIView):
    model = Item
    serializer_class = ItemSerializer

    def get_queryset(self):
        return Item.objects.filter(
            project_id=...
        ).filter(
            deleted_at__isnull=True
        ).prefetch_related(
            'created_by',          # ID of user who created the item
            'photos',              # Photo properties
            'events__created_by',  # Full name of the person who created the event
            'list',                # Name of the related list
        )

Example queries from tests with 14s DB lookup result:

django.db.backends: (0.196) SELECT "todo_item"."version", ... everything ... FROM "todo_item" WHERE ("todo_item"."project_id" = 1 AND "todo_item"."deleted_at" IS NULL) ORDER BY "todo_item"."created_at" DESC;

django.db.backends: (0.001) SELECT "auth_user"."id", ... everything ... FROM "auth_user" WHERE "auth_user"."id" IN (1, 2, ... some IDs ...);

django.db.backends: (0.148) SELECT "photos_photo"."version", ... everything ... FROM "photos_photo" WHERE ("photos_photo"."deleted_at" IS NULL AND "photos_photo"."item_id" IN (1, 2, ... lots of IDs ... N)) ORDER BY "photos_photo"."created_at" DESC;

django.db.backends: (0.078) SELECT "events_event"."created_at", ... everything ... FROM "events_event" WHERE ("events_event"."deleted_at" IS NULL AND "events_event"."item_id" IN (1, 2, ... lots of IDs ... N)) ORDER BY "events_event"."created_at" DESC, "events_event"."created_at" DESC;

django.db.backends: (0.157) SELECT "todo_list"."created_at", ... FROM "todo_list" WHERE "todo_list"."id" IN (1, 2, ... lots of IDs ... N)
Alceu Rodrigues de Freitas Junior
2017-06-11 15:14:08 UTC
There is some overhead in using an ORM: that is inevitable.

On the other hand, you didn't post your ORM calls, and the queries are
incomplete.

If you think your queries are optimized (as I understand from your
description they are), a workaround is to avoid using the ORM for those
specific queries. Or review your models; they might not be optimized for
what you want. Can you try to reduce the number of queries by writing
fewer, more complex queries?

Another possibility is a new model that recovers information from a
view/stored procedure on the database... you see, you want to get 10k to
50k items... that's a lot (to serialize, too).
Alceu Rodrigues de Freitas Junior
2017-06-11 15:17:55 UTC
Besides, you "time" from a shell is not the appropriate way to measure
DB response. Try to use whatever is available for your database to
measure that.

Also, see
http://blogs.perl.org/users/steffen_mueller/2010/09/your-benchmarks-suck.html.
Don't be fooled by the article being on a Perl related blog, the
concepts on the post are applicable everywhere.
Miika Huusko
2017-06-11 16:15:09 UTC
Thanks for the response!

Yep, timing with "time" is not the best way to go to compare SQL query
times. The reason I added that "time" test is that in Stack Overflow I was
asked to confirm that data transfer is not factor here. I timed SQL queries
with "EXPLAIN ANALYZE" and "\timing" first. That don't take data transfer
into account, so, "time" as quick test that data transfer is not a problem.

About timing: as an example, I can reduce the problem to serializing only
Items and related Photos. That results in only two queries. For example, a
dataset of about 40k items gives:

django.db.backends: (1.252) SELECT "todo_item"."version", ... all item properties ... FROM "todo_item" WHERE ("todo_item"."project_id" = '...' AND "todo_item"."deleted_at" IS NULL);

django.db.backends: (0.883) SELECT "photos_photo"."version", ... all photo properties ... FROM "photos_photo" WHERE "photos_photo"."note_id" IN (349527, 349528, 349529, ... and the rest of the 40k IDs ...);


Quite simple queries. The timing shown in the django.db.backends logs
matches what those queries take when executed manually with psql (1252 ms
+ 883 ms). That yields this simple profiling info:

Database lookup | 20.4447s
Serialization | 3.3821s
Django request/response | 0.3419s
API view | 0.1988s
Response rendering | 0.4591s

That's only from a single run, and query times vary, of course. Still, the
difference between how long it takes to query the data from the DB and how
long Django takes to process it is just that huge. The part I don't
understand is that it takes about 20 seconds to run
list(self.get_queryset()) while those two queries take about 2 seconds in
SQL. Some serious effort and time is being spent there by Django.
According to the django.db.backends logs, those two queries are the only
queries run during list(self.get_queryset()). The "list" is there to force
query execution, to separate DB lookup time from serialization time.
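
One quick experiment that might narrow this down (a sketch, not something
I've run yet): compare evaluating the queryset into model instances
against .values(), which returns plain dicts and skips model construction
entirely.

import time

# Same filter as in the view; the project_id placeholder is as in the
# original post.
qs = Item.objects.filter(project_id=..., deleted_at__isnull=True)

t0 = time.time()
items = list(qs)          # full Django model instances
t1 = time.time()
rows = list(qs.values())  # plain dicts, no model instance construction
t2 = time.time()

print("model instances: %.2fs" % (t1 - t0))
print("plain dicts:     %.2fs" % (t2 - t1))

If the .values() version is dramatically faster, the time is going into
building model instances rather than into the SQL itself.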

Creating a new model that recovers information from a view/stored
procedure on the database is a good idea. Of course, I would first like to
understand what might be wrong with the current models, so that I don't
make the same mistakes again. Is there something one should consider when
making related items like Photos and Events related to Items? That use
case is quite simple and still results in Django using 18 seconds to
process the SQL query response. There is a lot of data, of course, but I
had thought that returning 50k objects should not be a problem for Python
/ Django, even though an ORM always adds some overhead.
Alceu Rodrigues de Freitas Junior
2017-06-11 16:40:50 UTC
This may be a very basic thing to ask, but there is no harm in
double-checking: are you using select_related instead of all() for those
two specific models?

Check out
https://django-debug-toolbar.readthedocs.io/en/stable/index.html if you
aren't using it yet; it will help you see that very easily (I'm assuming
you have a view that lists all those instances for those models).
Melvyn Sopacua
2017-06-11 16:54:20 UTC
Post by Miika Huusko
About timing: as an example, I can reduce the problem to serializing only
Items and related Photos. That results in only two queries.
Why did you opt to use prefetch_related and not select_related?
Miika Huusko
2017-06-11 20:41:55 UTC
@Alceu Rodrigues de Freitas Junior, there are no questions too basic to
ask :) I'm sure I'm missing something, so please keep asking. But yes, I
have considered that.

@Melvyn Sopacua, I opted to use prefetch_related because I'm fetching
"Items" that are referenced as a ForeignKey by "Photos" and "Events".
select_related is meant for cases where you have a OneToOne or ForeignKey
relation (a single item) that you need to fetch. If I were fetching
"Photos", I would use select_related to JOIN "Items" at the SQL level. But
since I'm fetching "Items", the relations to "Photos" and "Events" are
"reverse ForeignKey" many-to-one sets, and I need to use prefetch_related.
Of course, that means the JOIN is done in Python (which is probably what
takes the time now), but I cannot see how I could do it with
select_related. Please let me know if there's a way to utilize
select_related with related sets.

I'm kind of hoping to find a problem in my implementation or my thinking
that would somehow make the Python-level JOIN behave badly. It's a big
JOIN, but still, Python is fast. There must be some kind of extra looping
or cloning going on when performing the Python-level JOIN for the
prefetched Photos or other related sets.
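
For completeness, the inverted query I describe above (fetching Photos and
joining Items at the SQL level) would look something like this sketch; the
filter names follow the models earlier in the thread:

Photo.objects.filter(
    item__project_id=...,            # placeholder as in the view above
    item__deleted_at__isnull=True,
    deleted_at__isnull=True,
).select_related('item')             # single SQL JOIN, no Python-level join
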
Melvyn Sopacua
2017-06-11 21:01:15 UTC
Post by Miika Huusko
I opted to use prefetch_related because I'm fetching "Items" that are
referenced as a ForeignKey by "Photos" and "Events". select_related is
meant for cases where you have a OneToOne or ForeignKey relation (a single
item) that you need to fetch.
You are right, in part. But created_by is a foreign key on Item, not the
other way around, which means this query:

Post by Miika Huusko
django.db.backends: (0.001) SELECT "auth_user"."id", ... everything ... FROM "auth_user" WHERE "auth_user"."id" IN (1, 2, ... some IDs ...);

can be avoided.
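
Something like this split would do it; a sketch against the queryset
posted earlier (only the forward/reverse split and comments are mine):

Item.objects.filter(
    project_id=...,            # placeholder as in the original post
    deleted_at__isnull=True,
).select_related(
    'created_by',              # forward FK: joined into the main SELECT
    'list',                    # forward FK: joined into the main SELECT
).prefetch_related(
    'photos',                  # reverse FK set: still one extra query
    'events__created_by',      # reverse FK set, plus its forward FK
)

select_related folds the forward ForeignKeys into the main query with a
JOIN, so the auth_user and todo_list queries disappear; only the reverse
sets keep their prefetch queries.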

Last but not least, this is why APIs use pagination. If for whatever
reason you must provide such a complex model (instead of having the
consumer make more API calls for photos they don't have yet), you
paginate; but it's not very flexible and it scales badly.

The proper design pattern here would be to hash your photos, use that as
the identifier in the API, and put the burden on the consumer to only
request photos they don't have the ID for yet.

This is, for example, how Zendesk handles email attachments.
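
If pagination were acceptable, it would be a small change with DRF's
built-in PageNumberPagination; a sketch (ItemPagination is a hypothetical
name, and the page sizes are arbitrary examples):

from rest_framework.generics import ListAPIView
from rest_framework.pagination import PageNumberPagination

class ItemPagination(PageNumberPagination):
    page_size = 500                      # arbitrary example value
    page_size_query_param = 'page_size'  # let the consumer tune it
    max_page_size = 2000

class ItemListAPIView(ListAPIView):
    serializer_class = ItemSerializer
    pagination_class = ItemPagination
    # get_queryset() as before
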
Miika Huusko
2017-06-11 21:41:22 UTC
@Melvyn Sopacua, yes, it is very true that the auth_user SELECT can be
avoided. Thanks for pointing that out! I don't know why I had put it like
that. Unfortunately, it's the related sets and the big JOINs that are the
problem and are generating the "slowness".

The reason I'm providing all "Items" and other related stuff in one
request is that I want to make the customer's life easier (the customer is
also me and my good friends). I want to keep all the difficult and heavy
work on the server side. The customer needs all the data because of
offline requirements, and would therefore need to loop through pagination
or perform the same big JOIN if "Photos" and other related objects were
requested separately. Doing that on mobile devices is going to take longer
than on an Ubuntu server, even if the server takes longer than I would
expect with Python + Django. That said, it's good that you point out
opinions on design patterns; I haven't listed all the background
information behind these decisions.

I'm not quite sure what you mean by "hash photos and use that as
identifier". What I could do - and I believe this is what you mean(?) - is
to provide a list of "Photo" IDs (or hashes) for each "Item" and ask the
customer to fetch the "Photo" details separately with another request (or
requests). Or - easier for the API - leave the "Photo" IDs out of the
"Item" information completely and ask the customer to make the JOIN with
separately requested "Photos". The JOIN could be done with hashes or IDs.
Either way, it would be the customer making the JOIN.
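
The first option would be a small serializer change; a sketch using DRF's
PrimaryKeyRelatedField (only the field swap is mine):

class ItemSerializer(serializers.ModelSerializer):
    # IDs only; the customer fetches and joins Photo details separately
    photos = serializers.PrimaryKeyRelatedField(many=True, read_only=True)
    ... other fields as before ...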

I might go there, but I still don't believe that Django cannot handle my
meager data faster than 20 seconds :) If that were the case, it really
would be sad.
Post by Miika Huusko
I have a performance problem with Django + Django REST Framework. The
problem is that it takes Django very, very long time to handle SQL query
response + DRF serialization take quite some time also. It happens when
there are ManyToMany or OneToMany relations and nested objects. Sounds like
"N + 1" problem, but that's not the case. I have 10k to 50k items with
related items and I try to fetch and serialize them for REST API. Request
takes from 20 to 60 seconds and currently I have no idea what is causing
the slowness.
Note: I asked the same question in Stack Overflow, but haven't found the
answer from there -
https://stackoverflow.com/questions/44461638/django-django-rest-framework-postgresql-queries-and-serialization-is-very-sl
* N+1 Problem *
There are not too many queries and queries themselves are fast (queries at
the end of this post). I'm using `prefetch_related` to limit number of
queries and what I'm seeing from DB queries everything is looking Okay..ish
(?). I get one query for each `prefetch_related` property + the original
query for serialized objects. There are lots and lots of IDs included in
`prefetch_related` queries, but I guess that is inevitable - as many IDs as
there are original items.
To test SQL queries + data transfer, I run same queries with psql from one
of my EC2 instances to RDS DB with same DB data and wrote the data to file.
Data transfer plus file write on top of that and it is totally between 100
to 500 ms for bigger SQL queries for different data sets. File write is
extra, but I wanted to ensure that I get all the data I expect.
time psql -f fat-query.psql --host=rds_host --port=5432
--username=user --dbname=dbname > output.txt
* Profiling *
When profiling as shown here
https://www.dabapps.com/blog/api-performance-profiling-django-rest-framework/
I get results that DB lookups take most of the time while serializing is
not too fast either. As an example I have about 12k "Items" in the my local
PostgreSQL database for one "Project". All "Items" have 1-5 "Events" and
majority of "Items" have also 1-2 "Photos". Fetching and serializing that
data takes around 22 seconds on my laptop. I'm using AWS EC2 + RDS for
deployment and the timing is about the same there. On larger "Item" sets
serialization time is increasing more than DB lookup time, but DB lookups
always take most of the time. With 40k items you'll start to reach 1 min
execution time and different timeouts from Nginx and other parts of the
stack.
Example with 12k items (models, serializers and queries below)
Database lookup | 14.0292s
Serialization | 6.0076s
Django request/response | 0.3488s
API view | 0.2170s
Response rendering | 1.1092s
If I leave Photos and Events out the result is
Database lookup | 1.2447s
Serialization | 3.9668s
Django request/response | 0.2435s
API view | 0.1320s
Response rendering | 0.8495s
* What might cause the slowness? *
So, the related fields are taking most of the time (many=True). The
profiling I used for testing is making `list` out of queryset before
serializing. Therefore lazy queries are executed before serialization. If I
don't do that, it doesn't change the overall results, but DB lookups are
evaluated when serializing with about the same amount of time.
Now the problem for me is that all queries that are done are fast if
executed manually. So, I believe SQL queries are fast, but DB lookups from
Django's point of view are very slow. What am I missing here? Or how should
I continue investigations? It feels like now it requires serious effort
from Django to convert SQL query results to Django model instances. That
would imply that there's something wrong with my models, right?
At the end, I could turn to caching, but I would assume that handling <
100k objects should not be an issue for Django if done correctly.
----------
Setup: Python 2.7.13, Django 1.10.7, DRF 3.6.3
# Models (class headers reconstructed from related_name and table names):

class Item(models.Model):
    # ... CharFields, DateTimeFields, ForeignKeys etc. ...
    list = models.ForeignKey(List, on_delete=models.CASCADE,
                             db_index=True, null=True, related_name='items')
    deleted_at = models.DateTimeField(db_index=True, blank=True,
                                      null=True, default=None)
    created_by = models.ForeignKey(User, blank=False)
    project = models.ForeignKey('projects.Project',
                                on_delete=models.CASCADE)

class Event(models.Model):
    # ... other CharFields, DateTimeFields, ForeignKeys etc. ...
    item = models.ForeignKey(Item, on_delete=models.CASCADE,
                             db_index=True, null=True, related_name='events')
    created_by = models.ForeignKey(User, blank=False)
    deleted_at = models.DateTimeField(db_index=True, blank=True,
                                      null=True, default=None)

class Photo(models.Model):
    # ... other CharFields, DateTimeFields, ForeignKeys etc. ...
    item = models.ForeignKey(Item, on_delete=models.CASCADE,
                             db_index=True, null=True, related_name='photos')
    created_by = models.ForeignKey(User, blank=False)
    deleted_at = models.DateTimeField(db_index=True, blank=True,
                                      null=True, default=None)
    # ... other CharFields, DateTimeFields, ForeignKeys etc. ...
# Serializers (class headers and Meta reconstructed; EventSerializer has
# the same shape as PhotoSerializer):

class PhotoSerializer(serializers.ModelSerializer):
    # ... other CharFields, DateTimeFields, ForeignKeys etc. ...
    createdBy = PrimaryKeyRelatedStringField(source='created_by',
                                             read_only=True)
    createdByFullName = serializers.CharField(
        source='created_by.get_full_name', read_only=True)

class ItemSerializer(serializers.ModelSerializer):
    # ... other CharFields, DateTimeFields, ForeignKeys etc. ...
    listName = serializers.CharField(source='list.name', read_only=True)
    createdBy = PrimaryKeyRelatedStringField(source='created_by',
                                             read_only=True)
    createdByFullName = serializers.CharField(
        source='created_by.get_full_name', read_only=True)
    photos = PhotoSerializer(many=True, read_only=True)
    events = EventSerializer(many=True, required=False,
                             allow_null=True, queryset=Event.objects.all())
    # ... other fields ...

    class Meta:
        model = Item
# View (class header reconstructed; note the prefetch lookup must be
# 'events__created_by' to match related_name='events'):

class ItemViewSet(viewsets.ModelViewSet):
    serializer_class = ItemSerializer

    def get_queryset(self):
        return (Item.objects
                .filter(project_id=..., deleted_at__isnull=True)
                .prefetch_related(
                    'created_by',          # ID of the user who created the item
                    'photos',              # photo properties
                    'events__created_by',  # full name of the event's creator
                    'list',                # name of the related list
                ))
django.db.backends: (0.196) SELECT "todo_item"."version", ...
everything ... FROM "todo_item" WHERE ("todo_item"."project_id" = 1 AND
"todo_item"."deleted_at" IS NULL) ORDER BY "todo_item"."created_at" DESC;
django.db.backends: (0.001) SELECT "auth_user"."id", ... everything
... FROM "auth_user" WHERE "auth_user"."id" IN (1, 2, ... some IDs ...);
django.db.backends: (0.148) SELECT "photos_photo"."version", ...
everything ... FROM "photos_photo" WHERE ("photos_photo"."deleted_at" IS
NULL AND "photos_photo"."item_id" IN (1, 2, ... lots of IDs... N)) ORDER BY
"photos_photo"."created_at" DESC;
django.db.backends: (0.078) SELECT "events_event"."created_at", ...
everything ... FROM "events_event" WHERE ("events_event"."deleted_at" IS
NULL AND "events_event"."item_id" IN (1, 2, ... lots of IDs... N)) ORDER BY
"events_event"."created_at" DESC, "events_event"."created_at" DESC;
django.db.backends: (0.157) SELECT "todo_list"."created_at", ... FROM
"todo_list" WHERE "todo_list"."id" IN (1, 2, ... lots of IDs... N)
Alceu Rodrigues de Freitas Junior
2017-06-11 22:02:38 UTC
Permalink
I'm guessing here that REST is not appropriate for handling "large"
amounts of data, and that's probably why Melvyn suggested using pagination
(see the settings sketch below).

If your customer needs the data for offline use, I bet it would be better
to generate a file in an appropriate format, compress it, and redirect the
request so the file gets downloaded (rough sketch below).

For that you could skip the ORM entirely, or build an appropriate Model for
it. I would stick with the former (and export the data straight from the DB
with a stored procedure); I don't see you reusing such a Model anywhere
else in your Django app.
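For reference, switching on DRF's built-in pagination is just
configuration (these are real DRF settings; the page size is arbitrary):

REST_FRAMEWORK = {
    'DEFAULT_PAGINATION_CLASS':
        'rest_framework.pagination.PageNumberPagination',
    'PAGE_SIZE': 500,
}

And a rough sketch of the file-export idea (assumes PostgreSQL + psycopg2;
the view name and SQL are illustrative, compression omitted):

from io import BytesIO

from django.db import connection
from django.http import HttpResponse

def export_items_csv(request, project_id):
    buf = BytesIO()
    with connection.cursor() as cursor:
        # Django's cursor delegates copy_expert to the psycopg2 cursor.
        # COPY statements cannot be parameterized, hence the int() cast.
        cursor.copy_expert(
            "COPY (SELECT * FROM todo_item WHERE project_id = %d"
            " AND deleted_at IS NULL) TO STDOUT WITH CSV HEADER"
            % int(project_id), buf)
    response = HttpResponse(buf.getvalue(), content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename="items.csv"'
    return response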
Post by Miika Huusko
@Melvyn Sopacua, yes, it's very true that the auth_user SELECT can be
avoided. Thanks for pointing that out! I don't know why I had put it like
that. Unfortunately, it's the related sets and the big JOINs that are the
problem and generating the "slowness".
The reason I'm providing all "Items" and the other related stuff in one
request is that I want to make the customer's life easier (the customer is
also me and my good friends). I want to keep all the difficult and heavy
work on the server side. The customer needs all the data because of offline
requirements, and would therefore need to loop through pagination or
perform the same big JOIN if "Photos" and other related objects were
requested separately. Doing that on mobile devices is going to take longer
than on an Ubuntu server, even if the server takes longer than I would
expect with Python + Django. That said, it's good to point out opinions on
design patterns; I haven't listed all the background for my decisions.
I'm not quite sure what you mean by "hash photos and use that as
identifier". What I could do - and what I believe you mean(?) - is to
provide a list of "Photo" IDs (or hashes) for each "Item" and ask the
customer to fetch the "Photo" details separately with further requests. Or
- easier for the API - leave the "Photo" IDs out of the "Item" information
completely and ask the customer to make the JOIN with separately requested
"Photos". The JOIN could be done with hashes or IDs; either way, it would
be the customer making the JOIN.
I might go there, but I still don't believe that Django can't handle my
meager data faster than 20 seconds :) If that were the case, it really
would be sad.
Melvyn Sopacua
2017-06-11 22:28:27 UTC
Permalink
Post by Miika Huusko
The reason I'm providing all "Items" and the other related stuff in one
request is that I want to make the customer's life easier (the customer is
also me and my good friends). I want to keep all the difficult and heavy
work on the server side. The customer needs all the data because of offline
requirements, and would therefore need to loop through pagination or
perform the same big JOIN if "Photos" and other related objects were
requested separately. Doing that on mobile devices is going to take longer
than on an Ubuntu server, even if the server takes longer than I would
expect with Python + Django. That said, it's good to point out opinions on
design patterns; I haven't listed all the background for my decisions.
Think about how Google Maps handles offline datasets for areas you've
marked: it caches them for x days, then prompts you to sync when you have a
free moment. It's not a burden in practice and it makes life a lot easier.
Sometimes we set requirements too harshly for ourselves.
Post by Miika Huusko
I'm not quite sure what you mean by "hash photos and use that as
identifier".
Sorry, I wasn't complete, as it's been a while. I've built an application
that fetches Zendesk data into Django for trend analysis. Getting an
attachment for a given message went like this:

- get the message id: /messages/id
- get the attachment ids: /messages/id/attachments (this is your reverse
  relation)
- for each one you don't have yet, fetch it: /files/id

The hashing came into play because corporate idiots - I mean people -
attach their logo to each and every email they send. So by hashing the
attachments, you eliminate duplicates and scale down your storage
requirements and (query) data size.
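A tiny sketch of that dedup-by-hash idea (my illustration, not the actual
application code):

import hashlib

store = {}  # stand-in for whatever backs the attachment storage

def save_attachment(data):
    # identical attachments (e.g. the same logo on every mail) hash to
    # the same key, so each blob is stored exactly once
    key = hashlib.sha256(data).hexdigest()
    if key not in store:
        store[key] = data
    return key  # use the hash as the attachment identifier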
Post by Miika Huusko
What I could do - and what I believe you mean(?) - is to provide a list of
"Photo" IDs (or hashes) for each "Item" and ask the customer to fetch the
"Photo" details separately with further requests. Or - easier for the API -
leave the "Photo" IDs out of the "Item" information completely and ask the
customer to make the JOIN with separately requested "Photos". The JOIN
could be done with hashes or IDs; either way, it would be the customer
making the JOIN.
But with the mobile and offline requirements it makes a lot more sense now
why you have set up the API the way you did. Two more thoughts for you:

- Use a cacheable dataset and incremental updates to be applied "when wifi
  or better is available". Complex, but worth it in the long run.
- Ditch DRF serialization and get the queryset as values(), dumping
  straight to JSON, thus eliminating both serializer objects and model
  objects.

It's easy to test the performance gain of the second one, especially if you
can stream the JSON (as opposed to in-memory-then-flush); see the sketch
below.
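A minimal sketch of that values()-plus-streaming idea (my illustration;
the view name and the selected fields are made up):

import json

from django.http import StreamingHttpResponse

def item_dump(request, project_id):
    # values() yields plain dicts - no model or serializer instances
    rows = (Item.objects
            .filter(project_id=project_id, deleted_at__isnull=True)
            .values('id', 'version', 'created_at', 'created_by_id'))

    def stream():
        yield '['
        first = True
        for row in rows.iterator():
            if not first:
                yield ','
            first = False
            yield json.dumps(row, default=str)  # default=str handles datetimes
        yield ']'

    return StreamingHttpResponse(stream(), content_type='application/json')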
--
Melvyn Sopacua
Miika Huusko
2017-06-14 13:14:31 UTC
Permalink
Thanks for the input @Melvyn Sopacua and @Alceu Rodrigues de Freitas
Junior!

I know it would be easy to just change the architecture, but I actually
wanted two things here. One was to "get stuff done" and the other was to
"understand the problem". The first is easier: if nothing else had worked,
I could have paginated or done something similar. The second is more
difficult.

The main problem here is the join Django performs after the
`prefetch_related` queries. As the profiling shows, something strange
happens before serialization: Django uses a ridiculous amount of time on
that join, while the SQL queries themselves are fast. Serialization is a
bit slow, but not that bad. I still don't know why that join is so slow by
default in this use case, so the "understand the problem" part remains
open :)

For the "get stuff done" part, I ended up doing the join + serialization
myself, also adopting some of your ideas, Melvyn, e.g. using 'values'. The
main change is that I didn't use 'prefetch_related' but built the same
QuerySets separately and then did a simple join in Python. Dropping the
automatic join that follows 'prefetch_related' made the biggest difference.
While the "default" GET for e.g. 50k "Items" took about 1 minute, with
these updates it's now about 10 seconds. That's good enough for now,
although it's sad I had to write code that one shouldn't need to write when
using a good framework.
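In rough shape, the manual join looks like this (a sketch along those
lines, not my exact code; the selected fields are illustrative):

from collections import defaultdict

items = list(Item.objects
             .filter(project_id=project_id, deleted_at__isnull=True)
             .values('id', 'version', 'list_id', 'created_by_id'))
item_ids = [item['id'] for item in items]

photos_by_item = defaultdict(list)
for photo in (Photo.objects
              .filter(item_id__in=item_ids, deleted_at__isnull=True)
              .values('id', 'item_id', 'version')):
    photos_by_item[photo['item_id']].append(photo)

# attach the related dicts to each item; this replaces both
# prefetch_related's in-Python matching and DRF's nested serializers
for item in items:
    item['photos'] = photos_by_item[item['id']]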

I also added an update to Stack Overflow:
https://stackoverflow.com/questions/44461638/django-django-rest-framework-postgresql-queries-and-serialization-is-very-sl

Thanks for the help.