API documentation#

Random Policy and Algorithm#

class ray_prover.random_algorithm.PatchedRandomPolicy(*args, **kwargs)#

RandomPolicy from Ray examples misses a couple of methods.

compute_actions(obs_batch: List[array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple] | array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple, state_batches: List[array | jnp.ndarray | tf.Tensor | torch.Tensor] | None = None, prev_action_batch: List[array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple] | array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple = None, prev_reward_batch: List[array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple] | array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple = None, **kwargs) Tuple[list, list, dict]#

Compute actions for the current policy.

Parameters:
  • obs_batch – Batch of observations.

  • state_batches – List of RNN state input batches, if any.

  • prev_action_batch – Batch of previous action values.

  • prev_reward_batch – Batch of previous rewards.

  • kwargs – Forward compatibility placeholder

Returns:

actions: Batch of output actions, with shape like

[BATCH_SIZE, ACTION_SHAPE].

state_outs (List[TensorType]): List of RNN state output

batches, if any, each with shape [BATCH_SIZE, STATE_SIZE].

info (List[dict]): Dictionary of extra feature batches, if any,

with shape like {“f1”: [BATCH_SIZE, …], “f2”: [BATCH_SIZE, …]}.

learn_on_loaded_batch(offset: int = 0, buffer_index: int = 0) dict#

Don’t learn anything and return empty results.

Returns:

empty dictionary (no metrics computed)

load_batch_into_buffer(batch: SampleBatch, buffer_index: int = 0) int#

Don’t load anything anywhere.

Returns:

always zero (no samples loaded)

class ray_prover.random_algorithm.RandomAlgorithm(config: AlgorithmConfig | None = None, env=None, logger_creator: Callable[[], Logger] | None = None, **kwargs)#

Algorithm taking random actions and not learning anything.

classmethod get_default_policy_class(config: AlgorithmConfig) RandomPolicy#

We created PatchedRandomPolicy exactly for this algorithm.

Returns:

patched random policy

Training Helper#

class ray_prover.training_helper.TrainingHelper(test_run: bool = False, storage_path: str | None = None)#

Training helper abstract base class.

abstract env_creator(env_config: Dict[str, Any]) Env#

Return a configured environment.

Parameters:

env_config – an environment config

property env_id: str#

Environment label.

Returns:

environment name set previously

Raises:

ValueError – if not set previously

abstract get_algorithm_config() AlgorithmConfig#

Return an algorithm config.

parse_args(arguments_to_parse: List[str] | None = None) None#

Parse command line arguments.

Parameters:

arguments_to_parse – command line arguments (or explicitly set ones)

train_algorithm(arguments_to_parse: List[str] | None = None) None#

Train a reinforcement learning algorithm.

Parameters:

arguments_to_parse – command line arguments (or explicitly set ones)

PPO example#

class ray_prover.ppo_prover.PPOProver(test_run: bool = False, storage_path: str | None = None)#

PPO-based prover experiments helper.

>>> storage_path = getfixture("tmp_path")
>>> from gym_saturation.constants import MOCK_TPTP_PROBLEM
>>> test_arguments = ["--prover", "Vampire", "--max_clauses", "1",
...     "--problem_filename", MOCK_TPTP_PROBLEM]
>>> PPOProver(True, storage_path).train_algorithm(
...     test_arguments + ["--random_baseline"])
╭─...
...
╭──────────────────────────────────────────╮
│ Training result                          │
├──────────────────────────────────────────┤
│ episodes_total                         2 │
│ num_env_steps_sampled                  2 │
│ num_env_steps_trained                  2 │
│ sampler_results/episode_len_mean       1 │
│ sampler_results/episode_reward_mean    0 │
╰──────────────────────────────────────────╯
...
>>> PPOProver(True, storage_path).train_algorithm(test_arguments)
╭─...
...
╭──────────────────────────────────────────╮
│ Training result                          │
├──────────────────────────────────────────┤
│ episodes_total                         2 │
│ num_env_steps_sampled                  2 │
│ num_env_steps_trained                  0 │
│ sampler_results/episode_len_mean       1 │
│ sampler_results/episode_reward_mean    0 │
╰──────────────────────────────────────────╯
...
Parameters:
  • arguments_to_parse – command line arguments (or explicitly set ones)

  • test_run – we use light parameters for testing

env_creator(env_config: Dict[str, Any]) Env#

Return a prover with AST2Vec state representation.

Parameters:

env_config – an environment config

Returns:

an environment

get_algorithm_config() AlgorithmConfig#

Return an algorithm config.

Returns:

algorithm config

Curriculum learning function#

ray_prover.curriculum.curriculum_fn(next_task: str, train_results: Dict[str, Any], task_settable_env: SaturationEnv, env_ctx: EnvContext) str#

Organise two-task curriculum.

Parameters:
  • next_task – next task to take if the current one is solved

  • train_results – training metrics

  • task_settable_env – the environment

  • env_ctx – left for compatibility

Returns:

a current task of next task

Custom stopper#

class ray_prover.custom_stopper.CustomStopper(last_task: str)#

Custom stopper for curriculum learning.

stop_all() bool#

Define for compatibility.

Returns:

always False for compatibility