API documentation#
Random Policy and Algorithm#
- class ray_prover.random_algorithm.PatchedRandomPolicy(*args, **kwargs)#
RandomPolicy from Ray examples misses a couple of methods.
- compute_actions(obs_batch: List[array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple] | array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple, state_batches: List[array | jnp.ndarray | tf.Tensor | torch.Tensor] | None = None, prev_action_batch: List[array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple] | array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple = None, prev_reward_batch: List[array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple] | array | jnp.ndarray | tf.Tensor | torch.Tensor | dict | tuple = None, **kwargs) Tuple[list, list, dict]#
Compute actions for the current policy.
- Parameters:
obs_batch – Batch of observations.
state_batches – List of RNN state input batches, if any.
prev_action_batch – Batch of previous action values.
prev_reward_batch – Batch of previous rewards.
kwargs – Forward compatibility placeholder
- Returns:
- actions: Batch of output actions, with shape like
[BATCH_SIZE, ACTION_SHAPE].
- state_outs (List[TensorType]): List of RNN state output
batches, if any, each with shape [BATCH_SIZE, STATE_SIZE].
- info (List[dict]): Dictionary of extra feature batches, if any,
with shape like {“f1”: [BATCH_SIZE, …], “f2”: [BATCH_SIZE, …]}.
- learn_on_loaded_batch(offset: int = 0, buffer_index: int = 0) dict#
Don’t learn anything and return empty results.
- Returns:
empty dictionary (no metrics computed)
- load_batch_into_buffer(batch: SampleBatch, buffer_index: int = 0) int#
Don’t load anything anywhere.
- Returns:
always zero (no samples loaded)
- class ray_prover.random_algorithm.RandomAlgorithm(config: AlgorithmConfig | None = None, env=None, logger_creator: Callable[[], Logger] | None = None, **kwargs)#
Algorithm taking random actions and not learning anything.
- classmethod get_default_policy_class(config: AlgorithmConfig) RandomPolicy#
We created PatchedRandomPolicy exactly for this algorithm.
- Returns:
patched random policy
Training Helper#
- class ray_prover.training_helper.TrainingHelper(test_run: bool = False, storage_path: str | None = None)#
Training helper abstract base class.
- abstract env_creator(env_config: Dict[str, Any]) Env#
Return a configured environment.
- Parameters:
env_config – an environment config
- property env_id: str#
Environment label.
- Returns:
environment name set previously
- Raises:
ValueError – if not set previously
- abstract get_algorithm_config() AlgorithmConfig#
Return an algorithm config.
- parse_args(arguments_to_parse: List[str] | None = None) None#
Parse command line arguments.
- Parameters:
arguments_to_parse – command line arguments (or explicitly set ones)
- train_algorithm(arguments_to_parse: List[str] | None = None) None#
Train a reinforcement learning algorithm.
- Parameters:
arguments_to_parse – command line arguments (or explicitly set ones)
PPO example#
- class ray_prover.ppo_prover.PPOProver(test_run: bool = False, storage_path: str | None = None)#
PPO-based prover experiments helper.
>>> storage_path = getfixture("tmp_path") >>> from gym_saturation.constants import MOCK_TPTP_PROBLEM >>> test_arguments = ["--prover", "Vampire", "--max_clauses", "1", ... "--problem_filename", MOCK_TPTP_PROBLEM] >>> PPOProver(True, storage_path).train_algorithm( ... test_arguments + ["--random_baseline"]) ╭─... ... ╭──────────────────────────────────────────╮ │ Training result │ ├──────────────────────────────────────────┤ │ episodes_total 2 │ │ num_env_steps_sampled 2 │ │ num_env_steps_trained 2 │ │ sampler_results/episode_len_mean 1 │ │ sampler_results/episode_reward_mean 0 │ ╰──────────────────────────────────────────╯ ... >>> PPOProver(True, storage_path).train_algorithm(test_arguments) ╭─... ... ╭──────────────────────────────────────────╮ │ Training result │ ├──────────────────────────────────────────┤ │ episodes_total 2 │ │ num_env_steps_sampled 2 │ │ num_env_steps_trained 0 │ │ sampler_results/episode_len_mean 1 │ │ sampler_results/episode_reward_mean 0 │ ╰──────────────────────────────────────────╯ ...
- Parameters:
arguments_to_parse – command line arguments (or explicitly set ones)
test_run – we use light parameters for testing
- env_creator(env_config: Dict[str, Any]) Env#
Return a prover with AST2Vec state representation.
- Parameters:
env_config – an environment config
- Returns:
an environment
- get_algorithm_config() AlgorithmConfig#
Return an algorithm config.
- Returns:
algorithm config
Curriculum learning function#
- ray_prover.curriculum.curriculum_fn(next_task: str, train_results: Dict[str, Any], task_settable_env: SaturationEnv, env_ctx: EnvContext) str#
Organise two-task curriculum.
- Parameters:
next_task – next task to take if the current one is solved
train_results – training metrics
task_settable_env – the environment
env_ctx – left for compatibility
- Returns:
a current task of next task