August 2, 2019, Florian

Practical AI resources

Random collection of helpful resources, maybe only helpful for myself.
Some may be specific to the environment: Ubuntu 16.04, Python 3.5.2, Tensorflow/keras, PyCharm, CPU i7.

Theory & Understanding
Best, detailed LSTM description incl. intuitions and reasons: https://r2rt.com/written-memories-understanding-deriving-and-extending-the-lstm.html

Keras setup in Tensorflow in Python
Custom Tensorflow builds for your environment: https://github.com/lakshayg/tensorflow-build (if you really need to build it yourself, here is a tutorial you may more or less readily be able to adapt; mind, the build itself easily takes some 5+ hours of compute time).

PyCharm IDE editor keras resources recognition issue
Use the from tensorflow.python.keras import blabla syntax (instead of from tensorflow.keras import blabla) to enable the PyCharm editor to properly recognize the resources and thus provide its usual editing aids (link).
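A minimal sketch of the two import variants (Dense just as an example symbol); both resolve to the same Keras code at runtime:
from tensorflow.python.keras.layers import Dense  # PyCharm-friendly path
# from tensorflow.keras.layers import Dense  # equivalent at runtime, but PyCharm may not resolve it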

Merging etc.
Merging: concatenate vs. Concatenate etc. (mind that in outdated code it was merge vs. Merge; cf. here for the new names) in the Keras Functional API: https://github.com/keras-team/keras/issues/3921#issuecomment-250643688
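A minimal sketch in the tf.keras functional API, showing that the lowercase helper and the capitalized layer class perform the same merge:
from tensorflow.keras.layers import Input, Dense, Concatenate, concatenate
a = Input(shape=(4,))
b = Input(shape=(3,))
merged_fn = concatenate([a, b])  # functional helper, returns a tensor
merged_layer = Concatenate()([a, b])  # layer instance, called on the tensor list
out = Dense(1)(merged_fn)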

Plot keras model structure
keras.utils.plot_model(model, 'mymodelplot.png', show_shapes=True) plots figures of your model, indicating layer sizes & flow; great to debug your data dimensions.
To make plot_model work and avoid complaints about missing graphviz and pydot(plus) resources etc., I tried a few things; what probably made it work for me:
Command line:
pip install pydot
pip install pydot_ng # Probably not necessary
pip install pydotplus
sudo apt-get install graphviz
Code:
import pydot  # suggests that pydot_ng above was indeed not necessary; otherwise I'd instead be using 'import pydot_ng as pydot' here
import pydotplus
from tensorflow.keras.utils import plot_model
Cf. also here and here
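As a quick check that the setup works, a minimal sketch with an illustrative toy model:
from tensorflow.keras.layers import Input, Dense, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model
inp_a = Input(shape=(4,), name='inp_a')
inp_b = Input(shape=(3,), name='inp_b')
x = concatenate([Dense(8, activation='relu')(inp_a), inp_b])
model = Model(inputs=[inp_a, inp_b], outputs=Dense(1)(x))
plot_model(model, 'mymodelplot.png', show_shapes=True)  # layer sizes & data flow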

Slice keras layer using simple indexing
To get the 3rd & 4th (feature) elements of the 2D inputs tensor (number of samples × number of features), exceptionally include the first 'sample' dimension that is otherwise typically left out (at least in the functional API):
x_aux_wk = inputs_aux[:,2:4]
Mind that if you leave the first ':,' out, you'll instead slice the set of observations, which is often not what you wanted.
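A minimal sketch (illustrative shapes); on older TF/Keras versions you may need to wrap the slice in a Lambda layer instead of indexing the tensor directly:
from tensorflow.keras.layers import Input, Lambda
inputs_aux = Input(shape=(6,))  # (batch, 6 features)
x_aux_wk = inputs_aux[:, 2:4]  # keeps the batch dimension, takes features 3 & 4
# x_aux_wk = Lambda(lambda t: t[:, 2:4])(inputs_aux)  # Lambda variant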

Concatenate layers – odd ‘wrong dimension’ complaints
Mind that the first dimension is the samples dimension! So, e.g., you may also have wrongly sliced along the observations/sample dimension instead of along, e.g., the feature dimension.

- Train/test split directly within keras model.fit
- Plot model training progress history

Copy the corresponding code lines from JB; a minimal sketch of both points is below.
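A minimal sketch of both points (model, x_train, y_train assumed to exist): let model.fit() split off a validation set itself, then plot the training progress from the returned History object.
import matplotlib.pyplot as plt
history = model.fit(x_train, y_train, epochs=50, validation_split=0.2)  # last 20% held out for validation
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.legend()
plt.savefig('training_history.png')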

Time-Series
LSTM extension for including time-invariant data: ConditionalRNN in cond-rnn

Btw, great ARIMA (or SARIMA or SARIMAX) overview here.

Cyclical data
For your hour data or seasonal data etc., encode it with separate sin and cos transforms to make the cyclicality easy for the algorithm to exploit. This avoids the encoding yielding the same output for different inputs (a circle passes twice through each height) and avoids getting stuck with (near-)zero derivatives in some places (sin and cos each have regions where the function barely changes with the input). I am not the first to have had this idea; it seems to be a common solution and to work well: e.g. a detailed explanation and test.
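A short sketch for hour-of-day (0..23): hour 23 and hour 0 end up close together, and the two columns jointly identify each hour uniquely.
import numpy as np
hours = np.arange(24)
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
cyclical_features = np.stack([hour_sin, hour_cos], axis=1)  # feed both columns as features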

LeakyReLU: with a lambda directly within Dense et al. instead of a separate layer
from tensorflow.keras.activations import relu  # relu(x, alpha=...) acts as a leaky ReLU
lrelu = lambda x: relu(x, alpha=0.01)
x_day = Dense(15, activation=lrelu)(x_day)
Thanks to https://stackoverflow.com/a/56869141/367332

RL

Ideal explanation of Actor-Critic, incl. A2C & A3C here.
Great graphical overview ('map') of RL algos, with Nervana Systems Coach.
Dueling DQN (DDQN, not to be mixed up with Double DQN): neat explanation here. DDQN is interesting for me, as it goes towards optimizing the value V(s) instead of (only directly) Q(s,a): after all, I think that for my power stuff most of what I need is the terminal value function V(s) rather than specifically Q(s,a). In the most advanced model, my centralized dispatch shall dispatch plants for given terminal value functions (or for correspondingly derived offer price & quantity pairs); alternatively I can let the actions be offer pairs, which may actually be simpler to implement and efficient enough, maybe even quite ideal, especially when I get to more complex stuff (with balancing and/or ancillary services).

Double DQN: simple; neat explanation here.

Mixed Monte Carlo (MMC): Great explanation here; simply: weighted avg. between Double DQN and Monte Carlo, where Monte Carlo = simulate an ‘entire’ path of choices & rewards, to get the Q-value update estimate.

Normalized Advantage Functions (NAF): Nervana Systems Coach explains it. I just list it, as it seems to be the only algo I've seen so far that may essentially focus on estimating simply V(s) rather than Q(s,a) (?).

Value-only learning instead of policy & value learning: (i) no ready-made solution may be available, though one could of course implement it from scratch or manually change open-source implementations such as those from RLlib. (ii) One solution I have now found that may bring me close to the aim without having to implement everything from scratch: two-headed networks, one head for the policy choice, one for the value estimation. Ray RLlib apparently allows it with the parameter vf_share_layers, and PyTorch may readily allow it too. Cf. https://www.datahubbs.com/two-headed-a2c-network-in-pytorch/ and https://towardsdatascience.com/ray-and-rllib-for-fast-and-parallel-reinforcement-learning-6d31ee21c96c, cf. my latest comment added to the post. A hedged config sketch follows below.
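A hedged config sketch (the flag's exact location has moved between RLlib releases, so check your version): a PPO trainer whose policy and value heads share hidden layers.
import ray
from ray.rllib.agents.ppo import PPOTrainer
ray.init()
config = {
    "env": "CartPole-v0",
    "model": {"vf_share_layers": True},  # in some versions a top-level PPO config key instead
}
trainer = PPOTrainer(config=config)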

Error AttributeError: module 'gym.envs.box2d' has no attribute 'LunarLander': Solved simply with pip install Box2D! Thanks to this.

PPO, great line-by-line tutorial video: here

Just an example quote by auterliantactics challenging the standard view of policy-learning interpretation, essentially emphasizing that what is learned under the hood may not always be so trivial, may not conform to what was intended or what seems most logical at first sight, and that things are not always black and white in terms of how they should be categorized and what they do:

There’s an interesting paper ‘Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms?’ that questions our understanding of why PPO works. In particular, they find that learning rate annealing is necessary to achieve good performance with PPO despite the method not being a core part of PPO. Another finding regards the value network and function, calling into question our understanding of what the value network and function do, thus possibly making our attempts to understand plots like value loss and value estimate an unrewarding task:

Our results (Figure 4a) show that the value network does succeed at both fitting the given loss function and generalizing to unseen data, showing low and stable mean relative error (MRE). However, the significant drop in performance as shown in Figure 4b indicates that the supervised learning problem induced by (17) does not lead to V^π_θ learning the underlying true value function.



New Actor-Critic style RL algo for Market Learning
This could help make sense of Actor-Critic for market players:
Value function (critic) provides terminal value.
Policy function (actor) tells where exactly to evaluate the value function: either discrete, for say discrete CCGT decisions, or continuous, say for pump storage, with infinite range but mapped to the range of possible storage-level results. The actor's loss itself is then defined by how close its prediction was to the state actually ended up in! 🙂 Only the value function's (critic's) loss remains directly linked to the reward, using the typical temporal-difference type update of Actor-Critic models.
Could tweak e.g. PPO to work like this, no?
Might work?

RL Training

Good advice, shared experiences, pitfalls:

Debugging: https://old.reddit.com/r/reinforcementlearning/comments/9sh77q/what_are_your_best_tips_for_debugging_rl_problems/

https://old.reddit.com/r/reinforcementlearning/comments/7s8px9/deep_reinforcement_learning_practical_tips/

Hyperparameter tuning tips video course from Kagglers: https://www.coursera.org/lecture/competitive-data-science/hyperparameter-tuning-i-giBKx

Reward normalization/scaling: Particularly important, as the value loss function is based on squared residuals and on accumulations of rewards from different periods, and can thus easily completely overwhelm the policy loss component in the total loss unless proper scaling is done. Individual period reward magnitudes are ideally <1 (or maybe around 1 when rewards are more sparse), maybe 0.1. The main problem of the vf_loss can probably also be reduced by choosing a small value loss scaling factor (<<1 instead of ca. 1) in the total loss function (vf_loss_coeff), but I reckon directly scaling the reward in the first place is best. Reducing the scale of rewards to values typically <=0.1 may have been a main reason my simple PPO agent eventually started learning reasonably, while before it used to be stubbornly incapable of reasonable learning. Before scaling the reward, entropy mostly increased gradually instead of decreasing, and no convergence at all seemed to happen.
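A sketch of scaling at the source, with an illustrative scale constant, using a plain gym.RewardWrapper:
import gym
REWARD_SCALE = 1000.0  # assumption: chosen so scaled per-step rewards are mostly <= 0.1
class ScaledReward(gym.RewardWrapper):
    def reward(self, reward):
        return float(reward) / REWARD_SCALE
env = ScaledReward(gym.make('MountainCarContinuous-v0'))  # any env works here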

Curriculum Learning: Train simple things first and then expose the learner to more and more difficult steps. More here.

Short best practice: PPO & general.

RLlib

Exploit/deterministic action choice, avoiding crazy wiggling in evaluation actions:
compute_action(..., explore=False). This ensures we get the deterministic best-action choice instead of a stochastic exploration choice. Now, why the exploration distribution (e.g. with PPO) does not become narrower anyway at some points of my learning, I do not know yet; in fact, in some cases it seems to first become narrower but later wider again.

Note, another argument, to retrieve more info, is full_fetch.

A full statement can be, e.g., actX, _, info = agents.compute_action(obs[a], policy_id=agents.get_config()['multiagent']['policy_mapping_fn'](a), full_fetch=True, explore=True)

Multi-agent, multi-policy:
When mapping multiple agents to a single policy with policy_mapping_fn, the mapped agents do not (seem to) get trained independently; my results in https://bitbucket.org/oxeeai/rlexercise/src/master/py/rllib04_multiagentPowerPPO.py suggest each policy trains only a single parametrization (a single set of weights). A hedged config sketch is below.
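A hedged config sketch (RLlib signatures vary across versions; obs_space/act_space assumed defined): both agents map to one shared policy entry, i.e. one shared set of weights; give each agent its own entry under "policies" for independently trained parametrizations.
config = {
    "multiagent": {
        # (policy class or None for the default, obs_space, act_space, extra config)
        "policies": {"shared_policy": (None, obs_space, act_space, {})},
        "policy_mapping_fn": lambda agent_id: "shared_policy",
    },
}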

Error AttributeError: 'list' object has no attribute 'keys' in set_weights (self.assignment_nodes[name] for name in new_weights.keys()) in ray/experimental/tf_utils.py:
Can be caused by using tensorflow's tf.compat.v1.enable_v2_behavior(). Just remove that statement and it should work.

Error ValueError: Reward should be finite scalar, got [0.09153526], from result = agent.train():
=> OK, adding float() around the reward return in def step() in the Env class definition can cure that.
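A short sketch (the helper names are illustrative placeholders):
def step(self, action):
    obs = self._next_observation(action)  # illustrative
    reward = float(self._compute_reward(action))  # plain Python float, not a numpy scalar/array
    done = self._is_done()  # illustrative
    return obs, reward, done, {}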

Error ValueError: Cannot feed value of shape (2, 1) for Tensor 'default_policy/prev_reward:0', which has shape '(?,)', or a longer ValueError: Error fetching: [...] whose feed_dict dump shows the observations as dtype=object arrays (nested float32 arrays mixed with plain Python numbers), when running a multi-agent env in RLlib (or ray tune):
Again, this may be due to a wrong type of the agents' rewards. I had to make them float() instead of np.float32(). E.g. rewards = {agent: float(blabla[agent]) for agent in [self.agent_1, self.agent_2]} in the environment's step() method's return solved the issue for me.

Warning WARNING:root:NaN or Inf found in input tensor., when running RLlib and/or tune on your model (in my case a custom multi-agent environment):
This may come together with training/tuning messages indicating nan reward values (etc.).
Possible cause, interpretation and fix:
First, it can be due to the training/tuning iteration not doing enough sample time steps (e.g. 4400) to finish any entire environment episode (no 'done' yet…). Reducing my environment's episode length from 365*24 to 20*24 = 480 was enough to avoid the warning and the reported nan results in my case, given the (preset) tune iteration time-step number was 4400.

Mind, though, that https://github.com/ray-project/ray/issues/7606 details that in some cases the warning is really a red herring and can simply be ignored; training would still go on anyway.

Issue solved: Training returning ValueError: ('Observation outside expected value range', Box(12,) [...]. Can happen if observation values (e.g. returned by the environment's .step() or .reset() methods) sit exactly at the specified finite border values of the observation_space Box. Solution: Simply widen the observation_space range by some small numerical noise magnitude, e.g. 0.001, to avoid the crash; see the sketch below.
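A sketch with illustrative nominal bounds:
import numpy as np
from gym.spaces import Box
eps = 0.001
nominal_low, nominal_high = np.zeros(12), np.ones(12)  # illustrative bounds
observation_space = Box(low=nominal_low - eps, high=nominal_high + eps, dtype=np.float32)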

Issue solved: Training returning non-finite action values (nan or inf). Happened to me when using infinite action_space bounds with the TD3 learning algorithm (with PPO it did not happen to me). Solution: Choose large enough but finite upper and lower bounds in your continuous action space Box.

Error OSError: [Errno 99] cannot assign requested address in asyncio base_events and/or error while attempting to bind on address ('::1', 8265, 0, 0): cannot assign requested address. Solved via an additional argument to the ray.init() command; see https://github.com/ray-project/ray/issues/7084

Install python torch (pytorch) on Ubuntu, without cuda

pip install torch==1.4.0+cpu torchvision==0.5.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

Or find the command for your system readily at https://pytorch.org/

Watch GPU utilization: Use nvidia-smi this way to get an automatically updated view: watch -d -n 0.5 nvidia-smi.

PPO in RLLib

Extreme values/bad learning with finite bounds of a continuous action space (I don't know whether non-PPO continuous algos also have that issue in RLlib): Using infinite bounds in the action space Box( ... -np.inf, ... np.inf ... ) instead of the finite ones you really want, and externally adding a mapping from the infinite range to your desired finite bounded range, can help overcome this problem, making the learning more stable and preventing it from going nuts and ending up stubbornly converging to bad boundary values instead of reasonable interior ones.
The mapping, by the way, can be done as follows:
action_bounded = low + 1 / (1 + math.exp(-action_raw)) * (high-low)
See also actionmap() in my flib fai.
A PPO instability w.r.t. finite-bounded action spaces is discussed by bbalaji here.
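The mapping above as a small reusable function (numpy version, so it also works element-wise on action arrays):
import numpy as np
def map_to_bounds(action_raw, low, high):
    # squash an unbounded raw action into [low, high] via a sigmoid
    return low + 1.0 / (1.0 + np.exp(-action_raw)) * (high - low)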

Tensorflow/Keras

Error tensorflow.python.framework.errors_impl.NotFoundError: Could not find valid device for node. Node:{{node Mul}}: In my case this was due to multiplying some tf-relevant variable (here, the reward in step()) with a bool (in this case reward = reward_if_run * run). Solved by instead using an if-condition (in this case reward = reward_if_run if run else 0.).

Bad training results with a custom loss and vectorized .fit(): the tensor axis/dimensionality may be different from what you think! In my case, I had to use tf.math.reduce_sum with axis -1 (or theoretically 2) instead of 1, even though I have not yet seen why the third dimension comes in at all.
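A hedged sketch of such a custom loss, summing squared errors over the last axis:
import tensorflow as tf
def custom_loss(y_true, y_pred):
    return tf.math.reduce_sum(tf.square(y_true - y_pred), axis=-1)  # axis=-1 is robust to an extra dimension
# model.compile(optimizer='adam', loss=custom_loss)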

Remote

Quick setup: git clone https://tempoxee:<heremypassword>@bitbucket.org/oxeeai/rlexercise && cd rlexercise/py && git checkout cloud && ./setup_remote.sh.

PyCharm @ vast.ai, error solved: When the SSH connection suddenly fails with External interpreter not working. Error: "Can't run remote python interpreter: Can't get remote credentials for deployment server": (restart PyCharm and/or the PC), completely remove the SSH connection (in Deployment) and the SSH interpreter, and re-add them. There may also be some auxiliary PyCharm files to remove on the server, though in my case this was not needed. See: https://intellij-support.jetbrains.com/hc/en-us/community/posts/360003445859-External-interpreter-not-working-Error-Can-t-run-remote-python-interpreter-Can-t-get-remote-credentials-for-deployment-server-?page=1#community_comment_360001547800

Various errors about tensorflow, cuda, nvidia driver: 1. Make sure you have a compatible combination. E.g. tensorflow (aka tf) 2.0, cuda 10.0, nvidia 410 works for me; and e.g. tf 2.1, cuda 10.1, nvidia 430 I think MIGHT also work (here for more info). Importantly, do not change cuda & nvidia versions on vast.ai machines; hell may easily break loose otherwise. For example, I had gotten the error `Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvrtc.so.10.2`, and downgrading from tf 2.1 to tf 2.0 I think solved that one: python3.6 $(which pip3) install tensorflow==2.0. Cf. also /home/florian/Dropbox/AI/RL/doc/issues_setting_up_vastai_etc.txt for a few more things.

Eventually, btw, I managed to easily set things up for the GPU to work on a machine with cuda 10.0, nvidia 410 (vast.ai, machine 1350), installing tf 2.0 myself.
But I eventually failed to (re-)set up things properly at all (even independently of the GPU!) on a machine with cuda 10.1, nvidia 430 (vast.ai machine 1649), even though at some point before I think I had managed somehow.
