Slow-running queries

  • I have two queries that take about 90 seconds to complete (a few others take 8-10 seconds, but I tracked them back to locations that cache the retrieved data, so I'm not too worried about those).


    The first slow query appears to originate from PostSearch::getOuterSQLQuery. Possibly of note, I use Elasticsearch for my forum; I've attached the relevant settings from the ACP as an image. As such, it seems odd that this method gets hit at all (I only see it getting called from MysqlSearchEngine::search).


    I believe the second query originates from UserPostListPage when it attempts to get the number of items in the list. Trying to navigate to {forum_location}/user-post-list/{user_id}-{username}/ seems to confirm this. As it appears to only be capable of querying the database for that information, it doesn't surprise me much that this query fails to return results quickly for the 90 million posts that exist. Of note, this query gets substantially more prevalent when the member has more posts (as you might imagine).


    Any suggestions/thoughts on how to fix the above issues?

  • The first slow query appears to originate from PostSearch::getOuterSQLQuery. Possibly of note, I use Elasticsearch for my forum; I've attached the relevant settings from the ACP as an image. As such, it seems odd that this method gets hit at all (I only see it getting called from MysqlSearchEngine::search).

    This method really shouldn't be called for you at all. There is no part in the code that ever accesses the specific engine itself; calls are always handled through \wcf\system\search\SearchEngine, which uses the ACP option to decide which backend to invoke. Do you see that query often, or did it just pop up once or twice? If you see it more often, please double-check that your PHP instances are working properly and using the correct options.inc.php.


    I believe the second query originates from UserPostListPage when it attempts to get the number of items in the list.

    That is correct too, but unlike the query above, this one is actually meant to work that way. In previous versions, the list of posts by a user offloaded the work to the search engine, including its limitations and poor scaling when using MySQL.


    The Elasticsearch integration is actually cheating a bit: it fetches only up to 2.5k result ids, which are then embedded into a query that validates the results against the thread table. This way we can filter by accessible forums and still usually hit the 1k result cap that the search uses; since these 1k ids are cached, navigating from page 1 to X will only query a portion of the cached ids.
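
    Roughly sketched (the table and column names follow the EXPLAIN output later in this thread; the actual generated query may differ):

    SQL
    -- Hypothetical sketch: the ids returned by Elasticsearch are embedded
    -- into an IN (...) list and checked against the thread table, so only
    -- posts in boards the current user may access survive.
    SELECT post.postID AS objectID
    FROM wbb1_post post
    JOIN wbb1_thread thread
    WHERE post.postID IN ({up to 2.5k ids from Elasticsearch})
        AND thread.threadID = post.threadID
        AND thread.boardID IN ({boards accessible to the current user})
    LIMIT 1000;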


    This usually works quite well when searching for posts/threads, and we can easily get away with limiting the results to 1k only. There are two cases where this approach doesn't really work out well:

    1. There are a lot of hidden areas for the current user, making it likely to receive far fewer than 1k results despite there being far more actual matches. If your forum is large enough and has quite a few hidden areas, you'll hit this sooner or later - which is also the reason why the Elasticsearch integration fetches more than 1k results to compensate.
    2. A list of posts by a certain user: once there are enough posts (look at me, sitting at 14k posts), it becomes impossible to find them all. Also, the actual order in which the posts are returned is more or less arbitrary, because the only real condition is the user id - a criterion that matches all my posts equally well. This is quite bad with MySQL, because searching by relevance for a user id yields something that can be described as "random" at best, and ordering by time instead causes MySQL to put everything into temporary tables on disk (aka "full disk I/O rampage").


    … aaand none of that is going to solve your problem. Unfortunately, I have no easy solution at hand, especially because there isn't really a sane way of working around this issue with a few "easy" changes.


    Unless you have no (or almost no) hidden forums - in that case you could get away with trashing the thread join and working with the post table only. This does require quite some tinkering, especially because you'll need to read the actual ids without any board/thread filter. Once you have the ids, you can run them through a secondary query that fetches the posts and validates them against the thread table, using only a tiny portion of thread ids (aka lightning-fast primary index lookups). Thanks to the implementation of the page class, this can be achieved using nothing but an event listener, allowing for custom queries without risking losing anything upon an update.
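
    Roughly sketched, and only as an illustration of the idea (the exact over-fetch factor and the column names are just for illustration):

    SQL
    -- Step 1: read candidate ids from the post table only, no thread join.
    -- Over-fetch a bit (40 instead of 20) to compensate for posts that will
    -- be filtered out in step 2.
    SELECT post.postID
    FROM wbb1_post post
    WHERE post.userID = {user id}
        AND post.isDisabled = 0
        AND post.isDeleted = 0
    ORDER BY post.time DESC, post.postID DESC
    LIMIT 40 OFFSET {offset};

    -- Step 2: validate that small id set against the thread table using
    -- primary-key lookups only, and keep the first 20 accessible posts.
    SELECT post.postID, post.threadID
    FROM wbb1_post post
    JOIN wbb1_thread thread
    WHERE post.postID IN ({ids from step 1})
        AND thread.threadID = post.threadID
        AND thread.boardID IN ({accessible board ids})
    ORDER BY post.time DESC, post.postID DESC
    LIMIT 20;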


    The downside of the above "solution" is that it's merely cheating and hides the fact that posts will be missing from a page due to the filtering. For instance, you may actually see only 17 or 18 posts per page instead of the 20 you would expect. This ratio gets worse the more posts are contained in inaccessible forums, but the reverse is also true.

    Alexander Ebert
    Senior Developer WoltLab® GmbH

  • I was afraid of that. :P


    I've upped my max_execution_time for queries to 10 seconds to account for the queries that are grabbing data and caching it, but allowing a 90+ second query just isn't going to work out. I'm not quite sure what I'll end up doing about this, but my members aren't beating my door down with complaints about it, so I'll likely just let it be until it becomes an actual problem.


    I have ~40 slow queries in my slow query log from the last ~4 hours, though my threshold for a "slow query" is a rather substantial 9.9 seconds. I've attached some aggregate information in the hope it'll be of use to you in some way in the future.
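
    For reference, thresholds like these can be set roughly as follows (assuming MySQL 5.7 or newer; the same values can also live in my.cnf instead):

    SQL
    -- max_execution_time applies to SELECT statements and is given in
    -- milliseconds; 10000 ms matches the 10 second cap mentioned above.
    SET GLOBAL max_execution_time = 10000;
    -- long_query_time is given in seconds and controls what ends up in
    -- the slow query log.
    SET GLOBAL slow_query_log = 1;
    SET GLOBAL long_query_time = 9.9;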

  • It would be quite helpful to receive the full slow query log, especially since it includes certain metrics that give some insight into the actual work that has been done. Feel free to send me the data via email to ebert@woltlab.com to keep the data private.

    Alexander Ebert
    Senior Developer WoltLab® GmbH

  • Not really, mostly because this isn't an issue with a faulty query, but simply with the amount of data that needs to be handled.


    You could run an EXPLAIN SELECT … to figure out what the query planner is doing, in particular whether it appears to examine a really large number of rows. It would be helpful to know which indices it is using; maybe there is something we can do to speed things up.

    Alexander Ebert
    Senior Developer WoltLab® GmbH

  • For a user with ~76,000 posts:

    Code
    EXPLAIN SELECT post.postID AS objectID
    FROM wbb1_post post
    JOIN wbb1_thread thread
    WHERE post.userID = {a user id}
        AND post.isDisabled = 0
        AND post.isDeleted = 0
        AND thread.threadID = post.threadID
        AND thread.boardID IN ({a bunch of board IDs})
    ORDER BY post.time DESC, post.postID DESC LIMIT 20 OFFSET 71780;
    Code
    +----+-------------+--------+------------+--------+-----------------------------------------------------+------------+---------+-------------------+--------+----------+---------------------------------------+
    | id | select_type | table  | partitions | type   | possible_keys                                       | key        | key_len | ref               | rows   | filtered | Extra                                 |
    +----+-------------+--------+------------+--------+-----------------------------------------------------+------------+---------+-------------------+--------+----------+---------------------------------------+
    |  1 | SIMPLE      | post   | NULL       | ref    | threadID,threadID_2,isDeleted,isDisabled,userToPost | userToPost |       7 | const,const,const | 249604 |   100.00 | Using index condition; Using filesort |
    |  1 | SIMPLE      | thread | NULL       | eq_ref | PRIMARY,boardID,boardID_2                           | PRIMARY    |       4 | wcf.post.threadID |      1 |   100.00 | Using where                           |
    +----+-------------+--------+------------+--------+-----------------------------------------------------+------------+---------+-------------------+--------+----------+---------------------------------------+
  • That looks rather good: hitting ~250k rows with ~76k posts is pretty accurate, and it exclusively uses the key userToPost, which is designed to handle this kind of query. Using filesort is most likely the bottleneck, but at least it doesn't use a devastating temporary table on disk.


    I would like to suggest running the profiler for this query to get some exact numbers. I'm somewhat certain that it falls within the above assumptions, but it doesn't hurt to get this one confirmed before diving deeper into the miraculous world of MySQL query optimization.
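
    For example, this is one way to grab per-stage numbers (SHOW PROFILE is deprecated but still available; the Performance Schema is the modern alternative):

    SQL
    -- Enable per-query profiling for this session, then run the slow query.
    SET profiling = 1;
    SELECT SQL_NO_CACHE post.postID AS objectID
    FROM wbb1_post post
    JOIN wbb1_thread thread
    WHERE post.userID = {a user id}
        AND post.isDisabled = 0
        AND post.isDeleted = 0
        AND thread.threadID = post.threadID
        AND thread.boardID IN ({a bunch of board IDs})
    ORDER BY post.time DESC, post.postID DESC LIMIT 20 OFFSET 71780;
    -- List the profiled statements with their total durations, then show the
    -- per-stage breakdown (e.g. "Creating sort index") for the first one.
    SHOW PROFILES;
    SHOW PROFILE FOR QUERY 1;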

    Alexander Ebert
    Senior Developer WoltLab® GmbH

  • Anything you need, my good man! Looks like it's all in sort index creation.

    Profile:


  • The effect of the above is dramatic. Additionally, the pages are left in the cache, and the follow-up queries complete in a few tenths of a second.


  • That's still a huge amount of time spent sorting the results; I would still consider these numbers to be bad, even if they're not as devastating as before.


    I see little room for improvement here, unless we start abusing the internal record order to skip sorting the results at all. I call it abusing because retrieving records from MySQL without an explicit ORDER BY will yield the rows in arbitrary order, although in reality they mostly follow the order of appearance in the table. This would also require the offset logic to be inverted by selecting the latest posts and then reversing them in PHP again, effectively relying on the internal order and just pretending it was really ordered in the query. Selecting the last 20 items without ordering should be super fast, but it is also on the verge of being pure botchery.
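
    Just to illustrate the idea (not something I recommend as-is; the inverted offset arithmetic and the reliance on the internal row order are exactly the botchery part):

    SQL
    -- No ORDER BY at all: rely on InnoDB handing the rows back roughly in
    -- primary-key (= insertion) order, which is NOT guaranteed.
    -- The offset is counted from the end, and the application would then
    -- reverse the 20 rows to get the newest-first display order.
    SELECT SQL_NO_CACHE post.postID
    FROM wbb1_post post
    WHERE post.userID = {a user id}
        AND post.isDisabled = 0
        AND post.isDeleted = 0
    LIMIT 20 OFFSET {total matching posts - requested offset - 20};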


    One last thing you could try before crossing that last line is to play around with the ORDER BY statement. Please remember to use SELECT SQL_NO_CACHE … so the query result doesn't end up in the query cache at all (both variants are also sketched as full queries below the list):

    • Ignore the postID entirely and just order by time.
    • Use just the postID and hope that little to no sorting will occur.
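
    For example (the same query as your EXPLAIN above, only the ORDER BY changes):

    SQL
    -- Variant 1: ignore the postID entirely and just order by time.
    SELECT SQL_NO_CACHE post.postID AS objectID
    FROM wbb1_post post
    JOIN wbb1_thread thread
    WHERE post.userID = {a user id}
        AND post.isDisabled = 0
        AND post.isDeleted = 0
        AND thread.threadID = post.threadID
        AND thread.boardID IN ({a bunch of board IDs})
    ORDER BY post.time DESC LIMIT 20 OFFSET {offset};

    -- Variant 2: use just the postID and hope that little to no sorting occurs.
    SELECT SQL_NO_CACHE post.postID AS objectID
    FROM wbb1_post post
    JOIN wbb1_thread thread
    WHERE post.userID = {a user id}
        AND post.isDisabled = 0
        AND post.isDeleted = 0
        AND thread.threadID = post.threadID
        AND thread.boardID IN ({a bunch of board IDs})
    ORDER BY post.postID DESC LIMIT 20 OFFSET {offset};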

    None of the changes above will have a significant impact on the result quality; especially with posts created by the same user, there is little to no loss of consistency in the displayed records. Inconsistencies may still happen, but the chance is so low that I doubt it will ever be noticed by anybody.

    Alexander Ebert
    Senior Developer WoltLab® GmbH

  • I definitely agree that it's still unusable; just as a matter of improvement, it was dramatic. :D


    I also definitely agree that depending on the internal MySQL/InnoDB ordering would be pretty much pure botchery.


    I'll take a poke at the ORDER BY stuff tonight (8 hours from now or so) and let you know if I find anything dramatic.


    As always, thanks for your quick and effective assistance. :)

  • I'm not quite sure if it makes much of a difference, but you should try to use much lower OFFSET values. Given a typical pagination value, the query above represents page 3589, something that I assume no sane user will ever visit. MySQL can make use of an index, so it is much easier for it to return early once it reaches the requested number of items. The index data itself is ordered, but each value has to be checked against the conditions, eventually forcing MySQL to examine every item until it reaches the requested offset.


    On second thought, I assume that it is much faster to have only the last 1k posts available for browsing, so you could just cap it at 1k in order to achieve low OFFSET values.


    It's like the members list page: it may have its use with a few hundred users, but when facing tens of thousands of users it just gets plain useless, and nobody except for bots will ever go that deep into the rabbit hole ;)

    Alexander Ebert
    Senior Developer WoltLab® GmbH

  • But one would be able to search for them, correct? So only the "list all" is capped at 1000?

    This whole thread isn't about the members list; I just brought it up as an example of pages that are useful to some extent at small to medium scale, but become simply useless at a large one. In particular, this issue is about the user post list.

    Alexander Ebert
    Senior Developer WoltLab® GmbH

  • On second thought, I assume that it is much faster to have only the last 1k posts available for browsing, so you could just cap it at 1k in order to achieve low OFFSET values.

    This surely helps. Though, given the test user I mentioned, the query begins to fail to return in the 30s (pages 30-39). Occasionally (for other users) the query will fail on the first page, given the need to grab too much new data off the disk.

    However, removing the order by post.time (as you mentioned) from the query also helps substantially. I don't really like modifying the core, but given the options, I guess updating UserPostListPage::$sqlOrderBy and overriding UserPostListPage::countItems isn't the worst. Even if it gets overwritten by a future update, it isn't like it'll wipe out a bunch of functionality or make my forum unusable. :D The countItems override calls the parent to make sure 1) the events are still fired correctly and 2) we know how many pages we actually have (maybe it's less than the maximum we're willing to display :P )

  • I don't really like modifying the core, but given the options, I guess updating UserPostListPage::$sqlOrderBy and overriding UserPostListPage::countItems isn't the worst.

    Actually, you could write a plugin that listens for the calculateNumberOfPages event, which is the last event before counting the items and takes place before reading any actual data. You can then override the public properties $sqlOrderBy and $sqlLimit (defined in \wcf\page\MultipleLinkPage) with your own values and you're done. Boom, update-proof.


    Regardless, I've tried it with your top posters and the pages load quite fast, with the exception of that 200k-post blob, but even that is still fast enough. Out of curiosity, what does the profiler result look like when applying the aforementioned optimizations for an offset of about 200? I expect it to look like a faster version of the previous results, but you never know, maybe there is something else that could use some adjustments to push it even further.

    Alexander Ebert
    Senior Developer WoltLab® GmbH

  • Actually, you could write a plugin that listens for the calculateNumberOfPages event, which is the last event before counting the items and takes place before reading any actual data. You can then override the public properties $sqlOrderBy and $sqlLimit (defined in \wcf\page\MultipleLinkPage) with your own values and you're done. Boom, update-proof.

    Indeed. That was pretty obvious to me right after I posted saying that I still wanted to trigger the event. :D Sometimes one can't see the forest for the trees.


    This query

    SQL
    SELECT post.postID AS objectID
    FROM wcf.wbb1_post post
    JOIN wcf.wbb1_thread thread
    WHERE post.userID = {user_id} AND post.isDisabled = 0 AND post.isDeleted = 0 AND thread.threadID = post.threadID AND thread.boardID IN ({board ids})
    ORDER BY post.postID DESC LIMIT 20 OFFSET 200;

    results in this profile

    As you supposed, it looks quite like the previous one, but without the 1-5 minutes of Creating sort index. :D