Check out the ansible configs repo (https://stash.broadinstitute.org/projects/CPDS/repos/ansible-configs/browse) at the same directory level as where the depmap repo is checked out. It contains a download key used for building the dev database and accessing downloads.
Also set up your taiga token (https://cds.team/taiga/token/)
Create a virtual environment, install Python and Javascript dependencies:
./install_prereqs.sh
Create and populate the database:
# first, in a different window, start breadbox, redis and the worker process
cd ../breadbox
./bb run &
redis-server &
./bb run_worker &
# then back in portal-backend create empty DB and sync the data into breadbox
./flask recreate_dev_db
In one window, compile the Javascript:
./flask webpack
In another window, start breadbox (if it is not already running from the step above). For this, follow the instructions in breadbox/README.md.
Finally, in another window, run the app:
./flask run
At this point, the portal should be running locally.
If you use a part of the website which requires a background worker (e.g. custom associations), you also need to start Redis (used to hold results and as the message broker) and start the celery worker process:
redis-server
./flask run_dev_worker
To open the interactive shell, run:
./flask shell
You have access to the flask app variable, and the variables listed in app.py -> register_shellcontext -> shell_context.
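As an illustrative sketch (not exhaustive; the available names depend on what register_shellcontext exposes), inside the shell you can poke at the app object directly:
# inside ./flask shell
app.config["DEBUG"]   # inspect the app configuration
app.url_map           # list the registered routes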
To run all backend tests, run:
pytest
Examples of running specific tests:
pytest <path/to/test/file.py>::<name_of_test_method>
pytest tests/depmap/context/test_models.py::test_get_entities_enriched_in_context_query
pytest -k <name of test>
pytest -k test_get_entities_enriched_in_context_query
To open a debugger shell on test failure, run pytest with the --pdb option.
To run all frontend tests:
cd frontend
yarn test
To run tests in a specific file:
yarn test controlledPlot --passWithNoTests
(The --passWithNoTests flag is being used here because we're running tests on all packages and most of them will not produce any matches for "controlledPlot." This is not needed if you cd into that specific package first.)
To further run only one test in the file, provide part of the test name to match on to the -t option:
cd frontend/packages/@depmap/interactive
yarn test controlledPlot -t "partial name of test"
it('partial name of test that does stuff', () => {
});
All non-trivial backend code should have a corresponding test, except for code in the (which is not tested). Frontend code testing is nice but not required.
We use storybook to help develop react components in isolation. To run storybook, run:
cd frontend/packages/portal-frontend
yarn storybook
Global css and js can be added to portal_frontend/.storybook/preview-head.html
To instantiate a component with a set of props on storybook, create a/edit the <ComponentName>.stories.tsx file in the same directory location as <ComponentName>.tsx
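For example, a minimal stories file might look like this (the component name and props here are hypothetical):
// MyComponent.stories.tsx -- lives next to MyComponent.tsx
import * as React from "react";
import MyComponent from "./MyComponent";

export default {
  title: "Components/MyComponent",
  component: MyComponent,
};

// One "story" instantiating the component with a fixed set of props
export const WithSampleProps = () => (
  <MyComponent title="Example title" isLoading={false} />
);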
The directory structure of the tests directory should mirror that of the depmap directory. E.g., functions in depmap/cell_line/views.py should have their test in tests/depmap/cell_line/test_views.py. If there are many tests for functions within one file, the name of the file being tested may be a directory instead, e.g. tests/depmap/interactive/views has files named after the tested functions.
We have factories defined in tests/factories.py to help set up the database objects necessary for a test; see examples in tests/depmap/dataset/test_models.py. Some older tests use fixtures like populated_db; this method is deprecated in favor of starting from an empty database and using factories. See tests/depmap/vector_catalog/nodes/continuous_tree/test_dataset_sort_key.py for an example.
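As a rough sketch of the pattern (the factory and fixture names below are hypothetical; check tests/factories.py and the test files above for the real ones):
from tests.factories import GeneFactory  # hypothetical factory name

def test_gene_label(empty_db_session):  # hypothetical fixture providing an empty db
    # Build only the objects this test needs, instead of relying on a populated db
    gene = GeneFactory(label="SOX10")
    assert gene.label == "SOX10"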
./flask recreate_dev_db by default skips loading the entirety of the non-core portions of the portal (nonstandard, celligner, tda, and constellation). For constellation, sample data is loaded, while the others are completely skipped:
./flask recreate_dev_db
To load those, add their respective flags (-n, -c, -t, and -d):
./flask recreate_dev_db -nctd
There are bash scripts under the /portal-backend directory which copy data from specific environments and update values in portal-backend/flask to simulate the non-dev environment locally. For example, copy-data-from-iqa.sh rebuilds the database with iqa data (and takes 5-10 minutes to run).
Sample data has many holes, but we want at least one gene, one compound, and one cell line that has all data for all features, so that we can get an idea of what the page looks like with all features/colors etc. If adding a new feature to the gene/compound/cell line page, ensure that sample data for the new feature exists for the representative gene/compound/cell line:
Gene: SOX10
Compound: afatinib
Cell line: UACC62 (ACH-000425)
These are also marked in the sample_data/subsets.py file, which enumerates the genes, compounds, and cell lines in the dev database.
Likewise, we have a single private dataset named "Canary dataset" which only users "canary@sample.com", "<anything>@canary.com" or "<anything>@canary.org" are allowed to see. See "Testing with private data" to see details for how to test viewing private data.
All url routes should separate words with an underscore (_) and be lower case. For example, the URL for Cell STRAINER would be {basename}/cell_strainer
DB table names are singular and lower case, using underscores between words, consistent with the class name.
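For illustration only (this model is hypothetical, not one from the repo), the convention maps a class name like CellLine to the table name cell_line:
from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()

class CellLine(db.Model):  # hypothetical model, for naming illustration only
    __tablename__ = "cell_line"  # singular, lower case, underscores between words
    cell_line_id = db.Column(db.Integer, primary_key=True)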
Genes, compounds, and cell lines used are in sample_data/subset_files/subsets.py
We use black (python) and prettier (javascript) code formatters to generate smaller git diffs. Our install script installs a package that automatically runs these formatters when you attempt to push. If there are any changes to format, it puts those changes in git's unstaged changes. If this happens, commit the format changes and push again.
We use mypy to check python typings. Run:
mypy .
We also use import-linter to enforce package dependency rules. The rules are defined in a config file here. To run the linter:
lint-imports
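The contracts use import-linter's standard INI-style config; a hypothetical contract (not the repo's actual rules) looks roughly like:
[importlinter]
root_package = depmap

[importlinter:contract:no-view-imports-in-models]
name = Models must not import from views
type = forbidden
source_modules =
    depmap.cell_line.models
forbidden_modules =
    depmap.cell_line.views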
We have a built-in tool to help us: we can collect an execution profile and visualize it as a flamegraph. This requires a few more steps but is much more comprehensive. See below for the detailed steps.
We have a hook for generating flame graphs for any request as a quick and dirty way to get profiling information from either prod or a development environment.
This is now controlled by a cookie which you can configure by going to /profiler/config (e.g. http://127.0.0.1:5000/profiler/config or https://cds.team/depmap/profiler/config)
The more profiling is enabled, the slower requests will be, because profiling adds overhead; for best results, target only what you want profiled. A good starting point is listing the following in "Modules to trace":
werkzeug.*
flask.*
depmap.*
(Timings are collected for all calls that are made from a function in modules that match the regular expressions in "Modules to trace". Once a call is made outside one of those modules, tracing is disabled until the function returns.)
To see collected profiles go to /profiler/profiles
and the newest profiles will be at the top. Click "View" to open
the profile.
We use stackdriver for error reporting, and have it automatically create a pivotal task when it encounters an error.
Back end python errors are automatically reported to stackdriver.
Front end errors must be manually reported (there is a pivotal task to
auto report). The global javascript variable errorHandler can be used to manually report errors to stackdriver. Use errorHandler.report('<information about error>') to report errors.
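For example (the function being wrapped is hypothetical, and this assumes errorHandler is in scope as a global):
try {
  doSomethingRisky();  // hypothetical operation that might throw
} catch (e) {
  // Report a description of the failure; in dev this is mocked to show an alert
  errorHandler.report(`doSomethingRisky failed: ${e}`);
}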
This errorHandler is mocked out in dev, to cause an alert instead of sending to stackdriver. If stackdriver integration needs testing in dev, temporarily modify the test for the dev environment in layout.html for loading and initializing the errorHandler.
The database build step takes the outputs of the preprocessing pipeline in the s3 bucket and loads them into the database. To build the database locally, run ./flask recreate_full_db. To build the database remotely and store it for download during deployment, see the db build steps in Jenkins http://datasci-dev.broadinstitute.org:8080/view/depmap/.
To build the db locally, run:
./flask recreate_dev_db
To deploy the latest successful build from travis, run the appropriate jenkins job here http://datasci-dev.broadinstitute.org:8080/view/depmap/job/DepMap%20Staging%20Environments/
Scripts for jenkins jobs are found in the depmap-deploy repo at https://github.com/broadinstitute/depmap-deploy
The final docker run command with -v directory mounts is found in the depmap-flask script in the ansible-configs repo.
We specify nonstandard, interactive-only datasets in internal.py (e.g. for the internal environment), in the internal_nonstandard_datasets dictionary. This dictionary is handwritten and hand-curated, according to various people's requests to add datasets.
Sometimes, we make a mistake with this curation, e.g. by setting the wrong transpose or failing to set use_arxspan_id. If these mistakes are used to load the database, the database gets loaded with nonstandard matrix indices of the wrong configuration.
These mistakes happen often, so we want a way to invalidate and re-load these datasets without having to load the entire database.
Thus, the load steps in db_load_commands have two parts in separate transactions: 1) loop through all nonstandard datasets and run delete_cache_if_invalid_exists, then 2) loop through all nonstandard datasets and add those that do not exist. delete_cache_if_invalid_exists checks the validity of a cache by checking whether the transpose value of what is currently loaded matches the transpose value in the current config. This decision to only store and check the transpose is somewhat arbitrary, and could definitely be extended to store and check other options that affect db load, if the use case arises.
If a dataset was loaded with the wrong transpose option, simply change it to the correct one and rebuild the db (redeploying staging might also just work).
If a dataset was loaded incorrectly in another way (e.g. not using arxspan ids), one can invalidate the cache by changing the value of transpose. Push the branch with this changed transpose, and let the travis job build. Then, run the build db jenkins job with the 'USE_PREVIOUS' option. This should run delete_cache_if_invalid_exists, then complete the transaction. The next loop/transaction (loading the dataset) will typically fail due to the incorrect transpose; this is fine. After the job completes, the first loop/transaction that deleted the cache should be persisted and stored. So now change transpose to the correct value, push the branch, then let the db build again with the 'USE_PREVIOUS' option. The second loop/transaction should now run and load the dataset correctly.
The db load does not delete datasets that may have been previously loaded and are in the db, but are no longer in the config.
The access control layer (primarily implemented in depmap.access_control) hides rows depending on the current user as reported in a header field on the request. (This header is added to the request when it passes through oauth2_proxy.)
For deployments with HAS_ACCESS_CONTROL=True, we expect requests to be authenticated and signed by oauth. This is enforced by doing a signature check registered in flask's before_request.
For token authentication, oauth supports basic authentication derived from the usernames and passwords in the oauth2-htpassd file for the respective server. To generate a token for a user:
import base64
token = base64.b64encode("<insert username>:<insert password>".encode('ascii'))
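The token can then be sent as an HTTP Basic Authorization header; a rough sketch (the endpoint path here is hypothetical):
import base64
import requests

token = base64.b64encode("<insert username>:<insert password>".encode("ascii")).decode("ascii")
response = requests.get(
    "https://cds.team/depmap/<some_endpoint>",  # hypothetical endpoint
    headers={"Authorization": f"Basic {token}"},
)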
In order to see what a user can see, all "admin" users (specified in the access control config) have the ability to switch what username is used for purposes of filtering the database. Going to http://localhost:5000/access_control/override will allow admins to pretend to be a different user and will reflect what that user can see.
A request context is needed to provide an owner_id, which is needed for queries on tables with access controls. Attempts to query these tables without a request context will generate the following error:
OperationalError: (sqlite3.OperationalError) user-defined function raised exception
We have set up the flask shell command to automatically create a request context. However, this is not present if running something outside of that context, e.g. when using the python library timeit. To create a request context, include from flask import current_app; current_app.test_request_context().push() in the timeit setup.
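For example, a rough sketch (the function being timed and its import are hypothetical):
import timeit

timeit.timeit(
    "my_query_function()",  # hypothetical: whatever access-controlled query you want to time
    setup=(
        "from flask import current_app; "
        "current_app.test_request_context().push(); "
        "from my_module import my_query_function"  # hypothetical import
    ),
    number=10,
)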
Given a python sqlalchemy query, the raw sql statement can be obtained with str(query). However, if this query involves tables with access controls, directly executing it in the sqlite3 command line tool will give the following error:
Error: no such function: owned_by_is_visible
To get around this error, names of tables must be changed to their write_only versions. E.g.:
select * from dataset; # instead of this
select * from dataset_write_only; # should be this
There are some steps that might be expected for adding a new dataset, e.g.
- adding a DependencyDataset/BiomarkerDataset enum or a new model (e.g. Mutation), with display name and units specified in shared.py
- adding any necessary pipeline steps
- adding a db load step; it should use log_data_issue for data issues (e.g. gene not found)
- generating sample data with appropriate holes
Other requirements that might not be obvious are:
- Every dataset in the portal should have a corresponding download entry. Add a download entry for this new dataset in _all_internal.py or _all_public.py depending on whether it is public. The downloads list appears in the order it is specified, so put the dataset in an appropriate position (i.e. probably not before the quarterly depmap releases). If the dataset is public and not already hosted externally, upload the dataset to taiga (if it is not already there), download it, then re-upload it to the public download bucket (see section "Uploading to public/dmc download bucket").
- Add to settings in internal.py/external.py. Examples of variable names are given for internal.
- Add the dataset enum to the list of datasets (internal_datasets). If associations are computed, also add it to association_deps or association_bioms.
- If the dataset has a version that should be appended to its display name, add it to the dataset versions (internal_versions). All datasets that are regenerated multiple times (e.g., quarterly) need a version.
- Check if the dataset applies to any of the is_ selectors on the DependencyDataset model, e.g. is_rnai, is_compound_related.
- Does this dataset involve cell lines? Most likely yes. If so, add it to the enums list/table of cell line memberships in cell_line/views.py.
- Add to announcements/changelog
- Should the dataset be added to headliners? If so, add to headliners in internal_download_settings and generate the SummaryStats.
- Is this dataset an "important" one that should be prioritized in dropdown lists? If so, add a global_priority to its dataset instance.
Pipeline and db load steps for this new dataset can be tried out on real data through a custom branch deployment: https://docs.google.com/document/d/1SsEUGJzROxw37_-NdU3jLl_v7czwCqiry1cgaVLdD9c/edit#heading=h.v6k5gz8t7nsm
Please know that these instructions are stale; the CLI command has changed. Please see the code for download_taiga_and_upload.
For routine quarterly releases, see the release task in https://app.asana.com/0/1156387482589290/1156388333407152/f
Sometimes, we use data from third parties, and provide them for direct download without any transformation. For these, we just put them on taiga and re-upload them to a bucket. The public bucket (accessible from all environments) is depmap-external-downloads, under the Achilles project https://console.cloud.google.com/storage/browser/depmap-external-downloads. The DMC-only bucket is at https://console.cloud.google.com/storage/browser/depmap-dmc-downloads. The bucket structure should contain the taiga id.
Sometimes, we transform data in the portal (e.g. unify cell line/gene/compound names), and make these modified versions available for download. For these, we first put the data in taiga under processed portal downloads https://cds.team/taiga/folder/ec5dfb868a46467daa17f03ee61b3afa. The file name uploaded to taiga should be the file name we want when people download the file. To describe how the data was transformed, please include the requested provenance in the taiga dataset description. The portal changes over time, so this provenance is important. Now that this dataset has a taiga id, the bucket path should include this taiga id (see code below).
After the data is on Taiga, we have a flask cli command download_taiga_and_upload that facilitates easy upload.
Fill in the desired BucketUrl for the DownloadFile(s) you would like to upload. This BucketUrl should be structured <subfolder>/<taiga name>.<version>/<taiga file name>, e.g. processed_portal_downloads/depmap-public-cell-line-metadata-183e.1/DepMap-2018q4-celllines.csv for a processed portal download (export from db), or drug/primary-screen-e5c7.1/primary_merged_row_meta.csv for a direct taiga re-upload.
Then, in the upload_download_commands.py file, fill in the files_to_upload dictionary with the release name, file names, and taiga formats for the files to be uploaded. Modify the flask script to use the appropriate environment for public/dmc (xstaging or dstaging). Then run:
./flask download_taiga_and_upload <env>
# where <env> is xstaging or dstaging
The env parameter helps us make sure we don't accidentally upload e.g. an internal file into the public bucket, since the env limits the downloads available.
See the DepMap Portal Operations google doc https://docs.google.com/document/d/1M9K6WkJQo5_9DDXnJWTUZQhE37wxZDCpIIfVZmM_Blg/edit#heading=h.fr8mxpr1nvjc
This will step you through the process of creating a new Flask route and mapping it to an interactive page that can be developed as a React component. Refer to this guide when you're adding new functionality to the Portal that makes sense as its own standalone page.
The Portal began life as a Flask app that leveraged Jinja templates to render all of its HTML. jQuery was used to sprinkle in some limited interactivity. A few such “old school” pages (notably the landing page) still exist.
After some time, we transitioned to using React as a frontend framework. This would lend itself to more interactive experiences and make it easier to develop and maintain the frontend code.
Initially this took the form of a global DepMap
JavaScript object that had
methods like DepMap.initDownloadsPage()
and DepMap.initCellLinePage()
that
could be called to initialize React on the page. This was not very efficient
because it meant all the code for every page was being delivered to the browser
as one massive file.
By that time we were already using Webpack as a means to deliver 3rd party
libraries alongside our application code. Webpack puts them together into one
JS file known as a “bundle.” We then modified the Webpack configuration to
output separate bundles for each page. Those DepMap.initXYZPage()
methods
went away. Instead each HTML page would have a <script>
tag that loads a
dedicated JS file built specifically to work with it and that automatically
bootstraps itself as a React app.
Note that the global DepMap object still exists today. Its usage is now much more limited. It acts as a convenient place for methods like DepMap.launchContextManagerModal() that launches a modal window from the navbar. It's an example of some complicated React code that doesn't map directly to a specific page.
Ultimately we settled on a solution that still uses Flask for routing. Each page still has its own Jinja template. The difference is that the template is very basic. It does just enough to load our navbar and load the correct React bundle for that page.
Hopefully you now have a feel for how things are structured and why. Creating a new page will consist of the following steps:
- Create a Jinja template
- Set up the routing in Flask
- Create a new React app
- Configure Webpack
In the examples below, the name MY_APP
will act as a placeholder for the name
of your page.
In depmap/portal-backend/depmap/templates/MY_APP/
create a new index.html
file. Its contents should look like this.
{% extends "full_page.html" %}
{% block page_title %}
{# The title that will appear in the browser's title/tab bar #}
{% endblock %}
{% block meta_description %}
{# include a short description that search engines can display #}
{% endblock %}
{% block content %}
<div class='full-screen-div'>
{% include "nav_footer/nav.html" %}
{# Replace MY_APP with the name of your page #}
<div id="MY_APP"></div>
</div>
{% endblock %}
{% block js %}
{# Replace MY_APP with the name of your page. Note the -data suffix. #}
<script id="MY_APP-data" type="application/json">
{
{# Any data that needs to be present
when your app starts up can go here. #}
"secretOfTheUniverse": 42
}
</script>
{# Replace MY_APP with the name of your page #}
<script src="{{ webpack_url('MY_APP.js') }}"></script>
{% endblock %}
In depmap/portal-backend/depmap/MY_APP/
create a new views.py
file. Its
contents should look like this.
from flask import Blueprint, render_template
blueprint = Blueprint(
"MY_APP",
__name__,
url_prefix="/MY_APP",
static_folder="../static",
)
@blueprint.route("/")
def view_MY_APP():
"""
Entry point
"""
return render_template("MY_APP/index.html")
👉 Make sure to import it here and register it here.
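As a rough sketch of the registration step (the function name and import path here are illustrative; mirror how the existing blueprints are registered in app.py):
from depmap.MY_APP import views as MY_APP_views

def register_blueprints(app):  # hypothetical name; follow the existing registration code
    app.register_blueprint(MY_APP_views.blueprint)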
In depmap/frontend/packages/portal-frontend/src/apps/
create a new
MY_APP.tsx
file. Its contents should look like this.
import "src/public-path";
import React from "react";
import ReactDOM from "react-dom";
import ErrorBoundary from "src/common/components/ErrorBoundary";
const container = document.getElementById("MY_APP");
const dataElement = document.getElementById("MY_APP-data");
if (!dataElement || !dataElement.textContent) {
throw new Error(
`Expected a DOM element like <script type="application/json">{ ... }</script>`
);
}
const data = JSON.parse(dataElement.textContent);
const { secretOfTheUniverse } = data;
const App = () => {
return (
<ErrorBoundary>
<div>
The answer to life, the universe, and everything is:
{secretOfTheUniverse}
</div>
</ErrorBoundary>
);
};
ReactDOM.render(<App />, container);
Now Webpack needs to know that your app should be considered its own bundle. To the webpack config, add this line:
"MY_APP": "./src/apps/MY_APP.tsx",
Note: You'll have to restart Webpack Dev Server for this change to be recognized.
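In context, the entry section of the webpack config would look roughly like this (other entries and options omitted):
module.exports = {
  entry: {
    // ...existing pages...
    "MY_APP": "./src/apps/MY_APP.tsx",
  },
  // ...rest of the config...
};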
If everything worked correctly you should be able to navigate to http://127.0.0.1:5000/MY_APP/ and see your new page!