Does not work for large clusters #22

ghost · 2013-08-01T21:12:14Z

I built Hannibal and changed the storage layer from H2 to MySQL. We have around 1000 regionservers with around 300k regions in total. Because so much of the logic (sorting, combining) is done at the view layer, pages take minutes to load, and the response size for some of the requests is > 200 MB. This makes Hannibal, which is an awesome tool, unusable for us.

meniku · 2013-08-06T12:18:39Z

Whoa, that's quite a lot of regionservers. I am aware that Hannibal is currently not usable for such large installations. However, you got pretty far; I would have guessed Hannibal would crash a lot earlier ;-)

A colleague and I just looked at the request and response, and we think they can be improved quite a lot:

- Enable GZIP. Unfortunately it doesn't seem to come out of the box with the Play Framework.
- Add a specific API for each graph to reduce the payload. As you said: too much logic is done at the view layer. This is a major refactoring though.
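To make the two ideas above concrete, here is a rough sketch in Python (chosen for brevity; it is not Hannibal's Scala code). The record layout, the field names, and the `graph_payload` helper are all made up for illustration. It compares the size of a full dump of every region record with a per-graph payload that carries only the fields one graph needs, with and without gzip.

```python
import gzip
import json

# Hypothetical region records; the field names are illustrative only
# and are not taken from Hannibal's model.
regions = [
    {
        "regionName": f"table,{i:06d},1375391534000.abcdef",
        "serverName": f"rs{i % 1000}.example.org",
        "storefiles": i % 7,
        "storefileSizeMB": i % 512,
        "memstoreSizeMB": i % 64,
    }
    for i in range(10_000)
]


def full_payload(records):
    """Serialize every field of every record (the 'one big response' approach)."""
    return json.dumps(records).encode("utf-8")


def graph_payload(records, fields):
    """Serialize only the fields a single graph needs, as compact rows."""
    rows = [[record[name] for name in fields] for record in records]
    body = {"fields": fields, "rows": rows}
    return json.dumps(body, separators=(",", ":")).encode("utf-8")


full = full_payload(regions)
slim = graph_payload(regions, ["serverName", "storefileSizeMB"])

# Print raw and gzip-compressed sizes in bytes for both variants.
for label, payload in (("full", full), ("per-graph", slim)):
    print(label, len(payload), len(gzip.compress(payload)))
```

The exact numbers depend entirely on the data, so treat the output as an indication only. The sorting and combining would still have to move to the server for the per-graph variant to pay off in the browser.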

However, I fear that this won't be enough. The UI also has to be adjusted to be usable with such huge amounts of data. Another bottleneck could be the communication between Hannibal and the regionservers. And we should consider making the recording of metrics/compactions optional altogether, or configurable on a per-region basis.

If someone wants to start working on some of those issues, let me know if I can assist you.

ghost · 2013-10-08T20:23:20Z

Sorry for the late response, I started working on it a bit. The main thing I noticed when looking at the server-side code was that for each graph we make calls out to every regionserver for data points. I created a cache that uses a background thread to refresh itself at an interval (every 5 to 10 minutes, for example), and this helped a great deal with response times when loading the graphs. The view layer is still somewhat of an issue, but the region cache improved performance quite a bit.
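For readers who don't want to dig through the patch, the general idea can be sketched like this. This is a simplified Python illustration of a background-refreshed cache, not the code from the commit; `RefreshingCache` and `fetch_all_regions` are hypothetical names, and the tiny interval is only there so the demo finishes quickly.

```python
import threading
import time


class RefreshingCache:
    """Serve a cached snapshot and refresh it on a background thread."""

    def __init__(self, loader, interval_seconds):
        self._loader = loader
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._value = loader()  # initial load so readers always get a snapshot
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def get(self):
        # Readers never trigger the expensive load themselves.
        with self._lock:
            return self._value

    def _run(self):
        # Event.wait returns False on timeout (time to refresh)
        # and True once stop() has been called.
        while not self._stop.wait(self._interval):
            try:
                fresh = self._loader()
            except Exception:
                continue  # keep serving the previous snapshot on failure
            with self._lock:
                self._value = fresh


def fetch_all_regions():
    """Stand-in for the expensive fan-out to every regionserver (hypothetical)."""
    fetch_all_regions.calls += 1
    return {"snapshot": fetch_all_regions.calls}


fetch_all_regions.calls = 0

cache = RefreshingCache(fetch_all_regions, interval_seconds=0.05)
cache.start()
for _ in range(1000):  # many graph requests...
    cache.get()        # ...none of which contacts the regionservers
time.sleep(0.2)
cache.stop()
print("loader calls:", fetch_all_regions.calls)
```

The trade-off is staleness: graphs show data that is at most one interval old, which is why the interval should probably be configurable.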

A crude patch is located here:
https://github.com/churrodog/hannibal/commit/13231aa4eb7d14016ea854b74d90cca329a66546

Please forgive my Scala code, I know it's pretty horrible; it's just a POC.

The view layer refactoring seems like a daunting task, and I would probably screw things up, but I would be up for helping to create APIs for each graph so that the rendering layer refactoring could be done incrementally.

meniku · 2013-10-09T14:53:59Z

Thanks very much for the commit, this looks like a great improvement. I also like that you cleaned up the model a bit :-)

However, I think we'll have to reduce the default interval of 30 minutes and introduce a configuration value for it. I also have to think about how we'll sync this up with the regioninfo metrics, as it doesn't make any sense to record the same cached values over and over again. Maybe we should change the update of the regioninfo metrics so that they are recorded just after the cache gets updated.

I think we should introduce the following configuration values (I don't know whether the defaults are good values though):

- regions.update-interval: default 120s: determines how often the cache gets updated and also how often metrics get recorded.
- compactions.update-interval: default 300s: determines how often the logfiles get fetched and compactions are recorded. 0 means recording the compactions is disabled altogether.
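As a rough illustration of the intended semantics (Python for brevity, not Hannibal's actual configuration code), the two values could be read like this. The key names and defaults are the ones proposed above; the helper names and the accepted value formats are assumptions made for the sketch.

```python
import re

# Proposed keys and defaults from this comment, in seconds.
DEFAULTS = {
    "regions.update-interval": 120,
    "compactions.update-interval": 300,
}


def interval_seconds(config, key):
    """Return the interval for `key` in seconds, falling back to the default.

    Accepts 120, "120" or "120s" (the accepted formats are an assumption).
    """
    raw = config.get(key, DEFAULTS[key])
    match = re.fullmatch(r"\s*(\d+)\s*s?\s*", str(raw))
    if match is None:
        raise ValueError(f"invalid interval for {key}: {raw!r}")
    return int(match.group(1))


def compactions_enabled(config):
    """An interval of 0 disables the recording of compactions altogether."""
    return interval_seconds(config, "compactions.update-interval") > 0


settings = {"regions.update-interval": "60s", "compactions.update-interval": 0}
print(interval_seconds(settings, "regions.update-interval"))
print(compactions_enabled(settings))
```

Recording the regioninfo metrics right after each cache refresh would then tie both to the same regions interval, so the same cached values are never written twice.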

I hope I can implement it soon.

meniku · 2013-10-21T14:50:50Z

I added most of your code and added the new configuration values (the names differ a bit from the previously proposed ones).
It's available in the next branch and I will merge it back to master soon.

meniku pushed a commit that referenced this issue Oct 21, 2013

Created a cache for the regions (for large clusters), add configuration parameters for controlling how often metrics are fetched, refactor old configuration names and other refactorings. #22 Closes #25

f7a1761

meniku mentioned this issue Nov 11, 2013

Refactoring #26

Closed
